fifo or sdram bug?

In our system a signal is passed through a couple of fifos inside FPGA and then onto external sdram to be read by application software. All looks ok except that some units in the field show occasional errors in that signal read from sdram. The error is as follows: odd samples are offset by 8 samples from the even. So if we remove this offset then signal looks ok.

I can't reproduce the error in the lab, so I depend on some speculation. It could be the fifos or the sdram. Has anyone come across such an issue? My suspicion is on the sdram, as it is configured as 8 pages. Also the sdram itself has an internal fifo that muxes 128 bits onto 16 (again a factor of 8). Any input appreciated.

Kaz


Reply to
kaz

Is the problem that the data is off by 8, or that the data gets stored in a location that is off by 8? If it's that the data is off by 8, then the number of sdram pages or the sdram muxing is not relevant. What exactly do you mean by 'off by 8'? Is data bit 3 in the wrong state? Is it that data bit 3 is always 1 when wrong or is it that data bit 3 is wrong, and when it is wrong that bit might be 1 or it might be 0.

I would also highly doubt that the problem is in the commercial sdram, almost without doubt it is in something that you have designed, not elsewhere.

- Has your design passed static timing analysis?

- Are all of the I/O and clock frequencies correctly specified to the timing analysis tool?

- Try warming up the part with a heat gun or cooling it off with cool spray in the lab. Does the design still work in the lab? If not, you have a timing problem.

In fact, based on your description so far, it is almost certainly a timing issue, so that would be the best place to start looking.

Kevin

Reply to
KJ


data is 16 bits wide, nothing wrong with bits. all sample values are correct.

odd samples do not follow their even members e.g. if a correct stream is indexed as 0,1,2,3,4,5,6,7,8,9,10...etc then what we get is:

0,9,2,11,4,13,6,15,8,17,10

Thus all samples are correct individually. even stream is correct as 0,2,4,6,8... and odd stream is also correct as sequence 9,11,13,15,...etc

but there is this offset where instead of 0,1,2,3.. I get 0,9,2,11...
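The ordering described above can be modelled in a few lines. This is only a sketch of the symptom as stated in the post (odd positions showing the sample 8 places later); the function names `faulty_order` and `realign` are made up for illustration and say nothing about where in the hardware the shift happens.

```python
def faulty_order(stream, offset=8):
    """Model the reported symptom: even positions are correct,
    odd positions show the sample `offset` places later."""
    n = len(stream) - offset  # the tail has no shifted partner, so drop it
    return [stream[i + offset] if i % 2 else stream[i] for i in range(n)]


def realign(observed, offset=8):
    """Undo the shift: take odd samples from `offset` places earlier.
    The first `offset` positions cannot be fully recovered and are skipped."""
    return [observed[k - offset] if k % 2 else observed[k]
            for k in range(offset, len(observed))]


stream = list(range(32))          # indices stand in for sample values
observed = faulty_order(stream)
print(observed[:11])              # [0, 9, 2, 11, 4, 13, 6, 15, 8, 17, 10]
print(realign(observed))          # contiguous run 8..23
```

Running `realign` on a captured file with the suspected offset and checking the spectrum is the same test as "if we remove this offset then signal looks ok".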

yes certainly, the design has passed static timing analysis.

yes, the I/O and clock frequencies are correctly specified.

can't do that in the field, units are concealed mobile radio heads. we have deployed many thousands of them. only a tiny percentage shows the issue.

Kaz


Reply to
kaz


We have done that in the lab, warming/freezing across the full range, but could not reproduce the issue. The test was iterated over a thousand times to catch any intermittent behaviour but all passed.

Kaz


Reply to
kaz


Sounds like you have a conflict between your SDRAM column addressing and the setting of the SDRAM's burst mode. If your column address does not step in lumps of SDRAM burst addressing then your output data sequencing can (and will) get screwed. Make sure you have correctly specified the burst length for your SDRAM driver and make sure your column address stepping agrees.

Andy

Reply to
Andy Bennett

You haven't said anything about your FPGA design in terms of where you could lose 8 samples of data or how your data is split between the odd and even samples. If your FPGA design does not split the data between odd and even, I'm not sure how you could have this problem.

You also don't mention if the "lost" 8 samples of odd data ever show up somewhere or if it is just a synchronization problem from the first sample.

Is there a place in your design where the odd and even samples are handled separately? Are you using two ADC converters to sample the same analog at a higher rate, for example?

Do you write the odd/even samples to the SDRAM separately?

--

Rick
Reply to
rickman

The 8 samples are not lost but odd substream is offset from the even substream regularly.

Inside fpga the data is never split up into odd/even streams. data 16 bit wide enters a fifo (dc fifo with 16 bits output width). Then into another fifo(sc with output width 32 bits) then back to 16 bits at i/o to sdram.

The two fifos are few words deep but could be a cause in theory i.e. if fifo ptr toggles between two separate counting sequences. Though Altera experts looked at them and were happy about the design and we added extra pipe just in case.

no.

The problem is also sometimes self rectifying after some time.

I assume that a glitch in control signals to sdram may change the column addressing mechanism as suggested by Andy but the sdram is 8k x 128 x 128 x 8 banks thus 16 bits of data is muxed as 128 bits into each cell!

Kaz


Reply to
kaz


Some next steps to consider...

- You did the thermal testing with field return boards that exhibited the problem, correct?

- Are there other differences between lab use and field use that could contribute such as with the power supply? The power supply is probably a stretch given the symptoms you describe, but just wondering what environmental differences might be going on.

- What kind of DRAM are you using?

- Commercial IP for the controller or home grown?

- You said in another post that the data starts at 16 bits, widens to 32 (I presume at input to the DRAM Controller) and then narrows back down to 16 at the I/O pins. Are clock domains being crossed along the way? Is your timing analysis set to ignore crossings? If so, shut that off and re-analyze each crossing.

- How does the DDR controller receive input commands? By that I mean is it given addresses for each read/write to be performed or is it given a start address and a burst size? If given a start address and burst size, then that would likely exonerate everything that is upstream of the DRAM controller (except for possible clock domain crossing issues)

- Review the PCB routing and look at signal integrity on the PCB?

Focus on getting the field returns to fail in the lab. Without that you'll have no way to verify any potential fix candidate.

Kevin Jennings

Reply to
KJ

Then are you certain all your timing constraints are correct? I'm with KJ, this problem description makes me immediately leap to a timing problem.

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com 
Email address domain is currently out of order.  See above to fix.
Reply to
Rob Gaddi

How large is a "sample"?

???

The two FIFOs are sequential, not parallel, right? So how would they cause a shift in the odd/even data? Do the FIFOs use block RAM? I don't recall Altera having distributed memory so I guess block RAM is the only thing available. That means the FIFO memory is one block of memory unless you have fairly large FIFOs. Is any of this right?

Not following this well. I think you are simply saying that the internal writes in the SDRAM are 128 bits so your 16 bit samples(?) are written 8 at a time. Unless you have some separation of odd/even samples I don't see how that would matter.

How do you have your burst addressing set? There are different modes with different addressing. Only one is sequential. It has been too long since I've worked with SDRAM and I don't recall what that is all about. If this is the issue, it won't reproduce the symptoms as you have described. I believe you say that at the beginning of the fault 8 odd samples are dropped leaving the rest of the sequence out of alignment with the even samples. If they aren't dropped, where do they show up? With a burst addressing error the samples would be moved about, scrambled in some way, but not lost at all.
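For reference, the two burst orderings can be sketched as below. This assumes the conventional SDR SDRAM scheme (sequential wraps within the burst-aligned block, interleaved XORs the low column bits with the beat count); the part used in this thread is not identified, and DDR families differ in detail, so the datasheet is the authority. The function name `burst_order` is made up for the example.

```python
def burst_order(start_col, burst_len, interleaved=False):
    """Column addresses visited by one burst (burst_len must be a power of 2).

    Assumed convention: sequential mode wraps inside the aligned block,
    interleaved mode XORs the low address bits with the beat number.
    """
    mask = burst_len - 1
    base = start_col & ~mask      # aligned block the burst stays inside
    low = start_col & mask        # starting position within the block
    if interleaved:
        return [base | (low ^ beat) for beat in range(burst_len)]
    return [base | ((low + beat) & mask) for beat in range(burst_len)]


print(burst_order(0, 8))                    # [0, 1, 2, 3, 4, 5, 6, 7]
print(burst_order(5, 8))                    # [5, 6, 7, 0, 1, 2, 3, 4]
print(burst_order(5, 8, interleaved=True))  # [5, 4, 7, 6, 1, 0, 3, 2]
```

Either way the burst is a permutation of one aligned block: samples get moved about inside the block but none cross into another block, which is why a burst-mode mismatch alone would not be expected to give a steady odd/even offset over the whole capture.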

When the unit "recovers", where does the extra data come from? Are 8 odd samples repeated?

If you can figure out more details of the glitch at the beginning and end of the error sequence it might help explain where the problem is.

--

Rick
Reply to
rickman

(snip)

Somehow this reminds me of something from years ago, which was using real IC FIFOs instead of FPGA ones.

Somehow the system wasn't following FULL and ALMOST FULL, and would wrap the FIFO. But that usually results in data loss, which you seem to indicate doesn't happen.

I don't know SDRAM timing enough to say. If all the data paths are 16 bits, it is funny to have an offset on eight bit boundaries!

-- glen

Reply to
glen herrmannsfeldt

I will try to give more details instead of my "reduced simplified version" and hopefully answer some of your questions.

I am talking about a DPD functionality where software reads from sdram 2,457,600 samples of each of TxI,TxQ,sRxI,sRxQ.

all these four slots are 16 bits signed and interleaved in above order giving a total stream size of 2,457,600 x 4 samples.

inside FPGA: TxI and TxQ are first concatenated as(16 x 2 bits), then passed through a small dc fifo for clock crossing.

sRxI and sRxQ are data received from Tx after going through DAC & PA then sampled back by an ADC for DPD algorithm. sRxI and sRxQ are also concatenated as 16 x 2 bits. They also go through their dc fifo for clock crossing.

Then all four data are concatenated as 16 x 4 = 64 bits.

The stream is then passed as 128 bits using sc fifo for sdram controller IF (Altera sdram controller). At the i/o data is passed as two streams each 16 bits and each has its own sdram. Thus we have two sdrams (one for Tx data and one for sRx data)

Almost all field units work without any problem. Occasionally, it is reported that DPD algorithm fails and when I looked at captured files I noticed that sRx data was ok but TxI and TxQ each shows same problem I described where their odd samples had shifted location relative to even ones. So instead of the normal order of 0,1,2,3,4,...etc. I noticed it was

0,9,2,11,4,13,6,15,8,... from the beginning to the end of the 2,457,600 samples.

Apart from that there is no other error and all values are correct judging by spectrum and time domain.

What happens at the moment of the glitch we don't know, I haven't tested any failed units in the lab though I requested that. We have inserted some extra logic to capture data directly from fifos in case of the event but we failed to reproduce the error. Units are in different countries and it is hard to keep track of debugging.

My first conclusion is that there must be memory involved and it must be a case of read/write toggling. The basic fpga concatenation logic does not involve storage and so is ruled out. FPGA fifos are block ram based and we have hundreds of them all across the design for various parts without issues.

sdram controller and i/o timing have been done by Altera experts.

Design is timing clean, lab tested across full range of temperature.

Kaz


Reply to
kaz

(snip)

How small is this FIFO? (depth x width)

By the way, it is usual to use Gray code when passing the FIFO address across the clock domain. I think they convert back, but maybe just address the BRAM with Gray code.
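A minimal sketch of the Gray code conversion, to show why it suits pointer crossing: successive values differ in exactly one bit, so a pointer sampled mid-transition in the other clock domain is either the old or the new value, never something unrelated. This is the generic textbook conversion, not a description of what the Altera core does internally.

```python
def bin_to_gray(n):
    """Binary to reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_bin(g):
    """Gray code back to binary by folding in successive right shifts."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


DEPTH = 16  # pointer range of a 16-word FIFO
for ptr in range(DEPTH):
    nxt = (ptr + 1) % DEPTH
    changed = bin_to_gray(ptr) ^ bin_to_gray(nxt)
    # exactly one bit flips per increment, including the 15 -> 0 wrap
    print(f"{ptr:2d} -> {nxt:2d}: gray {bin_to_gray(ptr):04b} -> "
          f"{bin_to_gray(nxt):04b}, bits changed = {bin(changed).count('1')}")
```

A plain binary pointer going from 7 to 8 changes four bits at once, and a badly timed sample could then read any mix of old and new bits.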

-- glen

Reply to
glen herrmannsfeldt

Should I assume DC means "dual clock"? So this FIFO is 32 bits wide?

I don't find this part clear at all. Above you say the data stream is 64 bits, then 128, then two streams of 16 bits. So the data is packed with one sample of each of the four data streams (TxI,TxQ,sRxI,sRxQ) to make 64 bits, then two words of this are grouped to make 128 bits. But then it is all broken back down into 16 bit individual samples?

So you can't say what happened to samples 1, 3, 5, etc? The data is being handed to the SDRAM as 16 bit samples, TxI0, TxQ0, TxI1, TxQ1,...?

So when you have the glitch the alignment is shifted for both TxI and TxQ or just one? If both, that would be 16 samples of 16 bit data, right?

--

Rick
Reply to
rickman

it is a dual clock fifo (368.64 MHz => 245.76 MHz), 32 bits wide, 16 words deep. it is an altera core; we just write/read under our rate control logic, avoiding empty/full situations.

The sRx fifo is 245.76 => 245.76 with same above width/depth

Kaz


Reply to
kaz

correct, that is how it is designed (I assume it is to do with SOPC interface)

correct

I should correct myself about the offset value, it is 16 samples(not 8) in the sense of stream index i.e. I get samples in the order

0,17,2,19,4,21,...etc


both I and Q symmetrically, if I reverse the offset of both I get proper signal. I don't have two captures and I assume the error wraps.


Reply to
kaz

Focus on reproducing it in the lab - or in simulation.

Xilinx FPGAs have multiple clock modules (DCMs) - you're using Altera so you'll have to translate terms.

These have ways of generating a derivative clock with adjustable timing for clock phase adjustment : I have attacked similar problems by setting up the timing in a software-writable register, running memtests with every possible phase adjustment and mapping out the valid range of timings.

If you have one or more of these spare, attach it to the SDRAM clock, and if you have another, attach it to your incoming data register, or your SDRAM address bus output, etc...

Now you can run memory tests, stretching the timings until it fails. Hopefully one of the failure modes (but not more than one) will reproduce the error you are seeing.
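Mapping out the valid range from such a sweep might look like the sketch below. The pass/fail list is invented example data and `widest_passing_window` is a made-up helper; real phase steps are circular, which this simple version ignores.

```python
def widest_passing_window(results):
    """results[i] is True if the memory test passed at phase step i.

    Returns (start, length) of the longest run of passes, or None if
    every step failed. Wrap-around between the last and first step is
    not handled in this sketch.
    """
    best = None
    start = None
    # A trailing False closes any run that reaches the end of the list.
    for i, ok in enumerate(list(results) + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if best is None or i - start > best[1]:
                best = (start, i - start)
            start = None
    return best


# Invented sweep: steps 2..5 pass, step 7 passes in isolation.
sweep = [False, False, True, True, True, True, False, True, False]
print(widest_passing_window(sweep))   # (2, 4)
```

The operating point would normally be placed near the middle of the widest window, and the failure signatures just outside it are the ones to compare against the field symptom.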

In my case, having found the likely failure mode this way, I was able to reproduce the effect in simulation, the rest was plain sailing.

Incidentally I also recall a correlation between memory manufacturer and one failure mechanism : I concluded there was nothing specific wrong with the memory itself, but some disagreement between it and my RAM interface; I could make that one go away by specifying another memory. Is there any such variability in your case?

-- Brian

Reply to
Brian Drummond

I am still trying to figure out this issue of odd/even offset. My suspicion has fallen on a dual clock fifo (32 bit wide) because when at some stage its depth was 8 then odd/even offset was 8 samples. Now its depth is 16 and odd/even offset is 16.

The next question is why would a fifo behave like that even if clocks change phase or fifo gets empty/full.

The fifo is protected against empty/full preventing read/write. The clocks are asynchronous. It is a straight forward dc fifo with several other like it in the design but only this one shows the problem occasionally.
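One purely hypothetical way an offset equal to the fifo depth could arise is sketched below: if reads of one parity reach a RAM slot before the corresponding write has landed, they return what the slot held one full lap earlier. Nothing in this thread establishes that this is what the core does; the model and the name `read_with_stale_parity` are assumptions made only to show that the numbers fit.

```python
def read_with_stale_parity(num_samples, depth, stale_parity=0):
    """Ring-buffer model. Samples are numbered 0, 1, 2, ... and sample n
    uses slot n % depth. For indices of `stale_parity` the read happens
    BEFORE the write (stale data, one lap old); for the other parity the
    write lands first, as in a healthy fifo. Assumes an even depth."""
    ram = [None] * depth
    out = []
    for n in range(num_samples):
        slot = n % depth
        if n % 2 == stale_parity:
            out.append(ram[slot])   # slot still holds sample n - depth
            ram[slot] = n
        else:
            ram[slot] = n
            out.append(ram[slot])   # normal read of sample n
    return out


print(read_with_stale_parity(32, 16)[16:22])  # [0, 17, 2, 19, 4, 21]
print(read_with_stale_parity(16, 8)[8:14])    # [0, 9, 2, 11, 4, 13]
```

In this model the offset between the two substreams is always exactly one fifo depth, which matches the observation that it was 8 with a depth of 8 and 16 with a depth of 16.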

I am planning to use dual port ram instead but wanted to know what has gone wrong.

Kaz


Reply to
kaz

If your suspicion is correct, then it is because there is a bug in the dual clock design. I know that sounds trite, but you can't discount the obvious.


Home grown fifo design?

Even if the fifo is not home grown, I would suggest switching to another dual clock fifo design first for the following reasons:

- It's going to be quicker to check this out since in some sense all you have to change is the entity that is being instantiated and possibly renaming parameters and ports

- You get another tidbit of information. If the design still fails in the same way, then it 'could' be that the fifo is OK after all. If the design works, then it 'could' be that you're right about there being a problem in the fifo.

If you go down the 'use dual port ram instead' path instead, this is simply re-inventing the dual clock fifo. Dual clock fifo designs are already based on dual port ram anyway. Going down that road is probably not the best way to go about getting to a solution.

If the problem really is in the fifo, then it would be better to mentally trace it back in order to figure out what you could then instrument. In this particular case what I mean is that from what you describe, it looks like a bit in the read address is perhaps in the wrong state. So accept that as a given and work out what the implications of that condition are. One implication if the address pointer is suddenly wrong might be an unusual change in the number of entries in the fifo (like increasing after a read, or decreasing after a write or just changing by a 'large' amount over a short period). Now add some logic to monitor that condition and bring the results of that monitor out to a pin that you can trigger on. For example, maybe the number of words in the fifo should never change by more than 4 between any two read side clock cycles. So add some code that will detect that condition. Repeat for other conditions that you can come up with.
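A behavioural sketch of that kind of monitor, written as a software model rather than HDL. The threshold of 4 is the example figure from the paragraph above; the fill-level trace and the name `fill_level_jumps` are invented for illustration. In the FPGA this would be a registered copy of the previous count, a subtractor, a comparator and a sticky flag driving the trigger pin.

```python
def fill_level_jumps(levels, max_step=4):
    """Indices at which the fifo word count changed by more than
    `max_step` relative to the previous read-side clock cycle."""
    return [i for i in range(1, len(levels))
            if abs(levels[i] - levels[i - 1]) > max_step]


# Invented trace of the read-side word count, one entry per clock.
trace = [4, 5, 4, 5, 4, 12, 11, 12]
print(fill_level_jumps(trace))   # [5]
```

The same pattern extends to the other conditions mentioned, such as the count rising on a cycle where only a read occurred.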

Kevin Jennings

Reply to
KJ

The fifo is not home grown. It is altera fifo core. We never discard well tried cores for home made work.

DC fifo is built by Altera around dual ram but if (as in my case) the clock rates are predictable then one can control wr/rd pointers each in their clock domain without having to cross clock domains thus reducing risk and resource. That is my point and is well known design recommendation.

The fifo in question is just 32 bit wide dc fifo from altera core with internal pipe set to 3, rd/wr protected, connected to clk 368.64 at write side enabled 2/3 and connected to 245.76 on the read side always enabled. Initially the read enable is delayed to wait for few words (even though it is protected).
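The nominal rate balance can be checked with exact arithmetic. This sketch only confirms that the quoted figures agree on paper; it says nothing about drift between the two physical clocks. The helper name `average_write_rate` is made up for the example.

```python
from fractions import Fraction


def average_write_rate(clk_mhz, enabled, out_of):
    """Average words per microsecond when the write enable is active
    on `enabled` out of every `out_of` write clock cycles."""
    return Fraction(clk_mhz) * Fraction(enabled, out_of)


write_rate = average_write_rate("368.64", 2, 3)
read_rate = Fraction("245.76")       # read side enabled on every cycle

print(write_rate == read_rate)       # True
print(float(write_rate))             # 245.76
```

So with the write side enabled on 2 cycles out of 3, the average write rate equals the read rate, and the nominal fill level stays constant once the initial read delay has elapsed.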

Timing is clean. I imagine the write pointer is working but the read pointer is toggling between 0 and 15 with two clock delays leading to samples 0,17,2,19 ...etc. Just a guess.

I have put a ram to capture few data from this fifo in the field when problem occurs and I am awaiting results.


Reply to
kaz
