Chasing Bugs in the Fog

I have a bug in a test fixture that is FPGA based. I had thought it was in the software which controls it, but after many hours of chasing it around I've concluded it must be in the FPGA code.

I didn't think it was in the VHDL because it had been simulated well and the nature of the bug is an occasional dropped character on the receive side. Who can't design a UART? Well, it could be in the handshake with the state machine, but still...

So I finally got around to adding some debug signals which I would monitor on an analyzer and guess what, the bug is gone! I *hate* when that happens. I can change the code so the debug signals only appear when a control register is set to enable them, but still, I don't like this. I want to know what is causing this DURN THING!

Anyone see this happen to them before?

Oh yeah, someone in another thread (that I can't find, likely because I don't recall the group I posted it in) suggested I add synchronizing FFs to the serial data in. Sure enough I had forgotten to do that. Maybe that was the fix... of course! It wasn't metastability, I bet it was feeding multiple bits of the state machine! Durn, I never make that sort of error. Thanks to whoever it was that suggested the obvious that I had forgotten.

--

Rick
rickman

Not metastability, a race condition. Asynchronous external input headed to multiple clocked elements, each of which it reaches via a different path with a different delay.

When you added debugging signals you changed the netlist, which changed the place and route, making unpredictable changes to those delays. In this case, it happened to push it into a place where _as far as you tested_, it seems happy. But it's still unsafe, because as you change other parts of the design, the P&R of that section will still change anyhow, and you start getting my favorite situation, the problem that comes and goes based on entirely unrelated factors.

The fix you found fixes it. When you resynchronized the input on the same clock that runs the rest of the logic, every path from the synchronizing flop to the downstream logic became an ordinary clocked path and therefore timing constrained. As such, the P&R takes it upon itself to make sure that those routes settle within the clock period however the placement changes, and your problem goes away for good.
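The race described above can be sketched with a small behavioral model in Python. This is only an illustration, not the VHDL under discussion: the clock period, the two routing delays, and all function names are hypothetical assumptions. An asynchronous rising edge arrives at a random clock phase and reaches two flops through paths with different delays; the model counts how often the two flops capture different values at the same clock edge, with and without a synchronizing flop in front.

```python
import math
import random

CLOCK_PERIOD = 10.0  # ns, hypothetical value


def captured(clock_edge, arrival_time):
    """Value a flop captures at clock_edge: 1 once the rising input has arrived."""
    return 1 if arrival_time <= clock_edge else 0


def count_disagreements(delay_a, delay_b, synchronized, trials=10_000, seed=1):
    """Count trials in which flops A and B capture different values at some edge.

    delay_a and delay_b are hypothetical routing delays (ns) to each flop; both
    are assumed to be greater than 0 and less than CLOCK_PERIOD.
    """
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(trials):
        t_rise = rng.uniform(0.0, CLOCK_PERIOD)  # async edge, arbitrary phase
        if synchronized:
            # The synchronizing flop's output changes only at a clock edge.
            t_source = math.ceil(t_rise / CLOCK_PERIOD) * CLOCK_PERIOD
        else:
            t_source = t_rise
        for n in range(1, 4):
            clk = n * CLOCK_PERIOD
            a = captured(clk, t_source + delay_a)
            b = captured(clk, t_source + delay_b)
            if a != b:
                disagreements += 1
                break
    return disagreements


unsync = count_disagreements(1.0, 3.0, synchronized=False)
sync = count_disagreements(1.0, 3.0, synchronized=True)
print(f"unsynchronized: {unsync} disagreements, synchronized: {sync}")
```

In this model the unsynchronized flops disagree whenever a clock edge lands in the gap between the two arrival times, so changing either delay (as a new place and route would) changes how often the failure shows up. With the synchronizing flop, both paths start at a clock edge and have a full period to settle, so the two flops always agree as long as the delays stay below the period.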

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com 
Email address domain is currently out of order.  See above to fix.
Rob Gaddi

Yes, this is called a "Heisenbug". It usually involves a clock domain crossing mistake.

John Eaton

jt_eaton

One mistake that is not too hard to make is forgetting to put a synchronizer flop on the input of an edge detector, like you might have on a UART input (so that the edge detector has two flops, total). Depending on the routing delays, this can cause you to miss a sizable percentage of edges. (Not just delayed, but missed completely.) Using only a single flop is sometimes known as using the "greedy path".

(Actually, to mitigate metastability as well, an edge detector ought to have three flops and an AND gate. Using two is sometimes known as using the "sneaky path".)
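One plausible way for edges to be missed completely can be sketched in Python. The model assumes a single-flop detector of the form pulse = input AND NOT q, where the raw input reaches the flop and the AND gate through different routing delays. The post does not spell out the mechanism, so this structure, the delay values, and all names are hypothetical assumptions for illustration only.

```python
import math

CLOCK_PERIOD = 10.0  # ns, hypothetical value


def edge_detected(t_rise, delay_to_flop, delay_to_gate, synchronized=False):
    """Return True if downstream logic ever samples the detector's pulse.

    Models pulse = input AND NOT q, where q is the detector's flop. Delays are
    hypothetical routing delays (ns), each greater than 0 and less than
    CLOCK_PERIOD.
    """
    if synchronized:
        # Extra flop in front: the detector sees a change only at a clock edge.
        t_rise = math.ceil(t_rise / CLOCK_PERIOD) * CLOCK_PERIOD
    q = 0  # input idles low, so the detector flop starts low
    for n in range(1, 5):
        clk = n * CLOCK_PERIOD
        at_gate = t_rise + delay_to_gate <= clk  # has the edge reached the gate?
        if at_gate and q == 0:
            return True  # pulse is high when downstream logic samples it
        if t_rise + delay_to_flop <= clk:
            q = 1  # flop captures the new level at this same clock edge
    return False


def missed_fraction(delay_to_flop, delay_to_gate, synchronized=False, steps=1000):
    """Fraction of input phases (swept across one period) whose edge is never seen."""
    missed = 0
    for i in range(steps):
        t_rise = CLOCK_PERIOD * i / steps
        if not edge_detected(t_rise, delay_to_flop, delay_to_gate, synchronized):
            missed += 1
    return missed / steps


print("single flop:", missed_fraction(1.0, 4.0))
print("with synchronizer:", missed_fraction(1.0, 4.0, synchronized=True))
```

In this sketch an edge is lost when the flop captures the new level at a clock edge before the same change has reached the gate: q is already high by the time the gate sees the input, so the pulse never appears. The missed fraction scales with the difference between the two delays relative to the clock period, which is why it moves around with every place and route.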

Kevin Neilson

No, when changing the debug output I added the synchronization FFs which fixed the problem.

My point was that when the other poster suggested that I need to sync to the clock I mistook that for metastability forgetting that the input went to multiple sections of logic. So actually I made the same mistake twice... lol

Just to make sure of what was what (it has been two years since I last worked with this design) I pulled the FFs out and added back just one. Sure enough the bug reappears with no FFs, but goes away with just one. The added debug info available allowed me to see exactly the error and sure enough, when a start bit comes in there is a chance that the two counters are not properly set and the error shows up in the center of the bit where the current contents of the shift register are moved into the holding register as a new char.

I guess what most likely happened is that when I wrote the UART code I assumed the sync FFs would be external and when I wrote the wrapper code I assumed the FFs were inside the UART. In other words, I didn't have a proper spec and never gave this problem proper consideration.

I will revisit this design and look at the other inputs. No reason to assume I didn't make the same mistake elsewhere.

--

Rick
rickman

Everyone is saying the same thing, so I guess I didn't explain clearly. Someone had already pointed out to me that I needed a synchronizer on the received data signal in another thread that I can't find now. I took them at their word, but was thinking they meant it was about metastability which I figured was not a problem at these speeds (yes, the speeds do make a difference for metastability since you never chase it away, you just minimize it). I wasn't thinking about the serial in signal feeding the state machine, just the shift register.
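The point that speed matters for metastability can be illustrated with the commonly used synchronizer estimate MTBF = exp(t_r / tau) / (T0 * f_clk * f_data), where t_r is the time the flop has to resolve. The sketch below works in log10 to avoid overflow. The device constants tau and T0, the clock and data rates, and the settling margins are all hypothetical placeholders, not figures for any real FPGA.

```python
import math


def log10_mtbf_seconds(f_clk_hz, f_data_hz, margin_s, tau_s=0.5e-9, t0_s=0.1e-9):
    """log10 of the estimated mean time between synchronizer failures, in seconds.

    tau_s and t0_s are hypothetical device constants; margin_s is the part of
    the clock period assumed to be used up by routing and setup time.
    """
    t_resolve = 1.0 / f_clk_hz - margin_s  # time left for the flop to settle
    if t_resolve <= 0:
        raise ValueError("no resolution time left in the clock period")
    return (t_resolve / tau_s) / math.log(10) - math.log10(t0_s * f_clk_hz * f_data_hz)


# A slow UART-style case versus a fast case, both with assumed numbers.
slow = log10_mtbf_seconds(f_clk_hz=10e6, f_data_hz=115_200, margin_s=2e-9)
fast = log10_mtbf_seconds(f_clk_hz=400e6, f_data_hz=100e6, margin_s=1e-9)
print(f"slow: about 10^{slow:.0f} s, fast: about 10^{fast:.0f} s")
```

The exponential term dominates: a long clock period leaves many time constants for the flop to resolve, so the estimate becomes astronomically large at low speeds but never infinite, which matches the remark that metastability is minimized rather than chased away.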

So when I made the changes, which included the synchronizer, it worked. Because I didn't expect the synchronizer to do anything, I had forgotten about it until I was typing the post here. I remembered at the end of the message and realized that was what fixed the problem...

Sorry for the confusion. Still, thanks to all who replied and especially the mystery person who suggested it in the other thread wherever that was.

--

Rick
rickman

On 18/06/2013 23:45, rickman wrote:

Several years ago a young engineer reused my long proven UART code and modified it, carelessly removing the synchronizing FF. He came to see me and complained that my UART didn't work, it hung after some unpredictable time. I thought for a few minutes, guessed he probably had removed the FF and fixed his problem right away.

Nicolas

Nicolas Matringe

That's the same thing that happened to me when I had the problem last. I had an edge detector connected to a big synchronizer module that was in turn connected to all the input pins. When I had problems I looked inside the synchronizer module and found that it didn't have a flop on that line; it was just wired straight through.

Kevin Neilson

There are 3 things that could have gone wrong (and might still be going wrong):

1. You failed to synchronise between the clock domain of the input serial link and the clock of your system (sounds like you fixed this one).

2. You failed to constrain the clocks and other inputs, so the synthesis tool doesn't know what timing budget it has to meet.

3. You failed timing analysis and didn't notice. In other words, the synthesis tool says the design it produced doesn't meet your supplied timing constraints, despite its best efforts. If the failure is small it may still work in some voltage/temperature/silicon situations, but it isn't guaranteed in all cases.

Normally the last one will raise big red flags in the tool, assuming the timing analyser does get run as part of the build. However the first two are easy to overlook and you get no warning from the tools.
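The idea of a timing budget can be made concrete with the usual setup-slack arithmetic, sketched here in Python. All delay figures and names are hypothetical; a real timing analyser derives them from the placed and routed design and the supplied constraints.

```python
def setup_slack_ns(clock_period_ns, clk_to_q_ns, logic_ns, routing_ns, setup_ns):
    """Setup slack for one register-to-register path; negative means a failure."""
    data_path_ns = clk_to_q_ns + logic_ns + routing_ns + setup_ns
    return clock_period_ns - data_path_ns


# The same hypothetical path before and after a re-route lengthens its wiring.
before = setup_slack_ns(10.0, clk_to_q_ns=0.5, logic_ns=4.0, routing_ns=3.0, setup_ns=0.5)
after = setup_slack_ns(10.0, clk_to_q_ns=0.5, logic_ns=4.0, routing_ns=5.5, setup_ns=0.5)
print(f"slack before: {before} ns, after: {after} ns")
```

A path with no constraint has no clock period to be checked against, so the tools never compute a slack for it and cannot report the failure. That is how the first two mistakes slip through without a warning.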

Theo

Theo Markettos

Any time you recompile an FPGA and the problem disappears or changes, it is a STRONG indication it is a timing problem. Regenerating the place & route changes timings subtly between sections, and may eliminate a marginal setup or hold time problem. You should make sure all signals that cross clock boundaries are properly synchronized, and that you are giving the right clock specification to your clocks in the ucf file. If there are tricky timings on parts connected to the FPGA, then you need to define the timings in the ucf file.

Jon

Jon Elson
