advice on occasional memory error

Hi, I am confronted with the following problem: we are building devices with 8-bit CPUs, about 128 K of (banked) EPROM and 16 K of RAM. They have been working flawlessly for many years. Now we sometimes get reports from customers about mysterious failures. These failures seem to be caused by wrong memory contents, i.e. single bits are flipped, sometimes from 0 to 1, sometimes from 1 to 0.

Most of the memory contents are being changed all the time, except for a quite small number of bytes which go through a sequence of values during startup and are not changed or refreshed later. If one of those "static" bytes changes its contents, the operation of the device is usually disturbed. This effect even occurs in a test environment, unfortunately very seldom (at intervals of about 6-8 weeks, not reproducible). I don't have an ICE for this processor (none is available since the processor is far too old) and don't see how to find the cause of the problem.

I have already built some test versions which check some of those static variables, but inevitably the next error occurred at another, unwatched address. For example, I observed the following changes: AA -> 2A, 59 -> 58, 05 -> 07, 05 -> 00. All those bytes had different addresses, all in the range from 8800 to 8d00, however. Btw, the one double-bit error in this list occurred in the field and I have no idea how long it took until the error was detected, so I don't know if both bits changed at the same time.

Now my question: do you think I am correct if I conclude from the pattern of error occurrences that there is a hardware fault in the memory system? I think a software error would be quite likely to change more than a single bit, for example if a bad pointer is used for a write operation. Especially the fact that different but isolated addresses are affected seems to point in this direction. Or is this conclusion unfounded? Do you have any idea what could be done to find the cause of the failures?

My only idea is to connect a logic analyzer to the test device, but (1) this would block the only one we have for many weeks, (2) I doubt I could define a trigger condition complex enough to catch the problem if it occurs and (3) it has 16 channels and I think I would need more to trace address lines plus bank, data and control lines.

Sincerely -- Dirk
Dirk Zabel

I agree that the single-bit errors tend to point to flaky hardware. Flaky software tends to hose an entire byte. The exception is if the processor has bit manipulation instructions or if the software has bit manipulation routines.
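To make the single-bit observation concrete, here is a minimal sketch that XORs an expected byte with the observed one and counts the differing bits. The function name `flipped_bits` is made up for illustration; the byte pairs are the ones reported in the original post.

```c
#include <stdint.h>

/* Count how many bits differ between the expected and the observed byte. */
unsigned flipped_bits(uint8_t expected, uint8_t observed)
{
    uint8_t diff = (uint8_t)(expected ^ observed); /* 1 where the bits differ */
    unsigned count = 0;

    while (diff != 0) {
        diff &= (uint8_t)(diff - 1); /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

Applied to the four reported changes (AA -> 2A, 59 -> 58, 05 -> 07, 05 -> 00) this gives 1, 1, 1 and 2 differing bits.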

The first step is to try to figure out what changed. If the unit has the same software as the ones that have been working flawlessly for many years, then I doubt that it is a software bug. It can happen that a hidden bug pops up when something else changes, but not often. Just to be sure, get an old unit that has been working for years and a new unit off the line. Read the contents of the old unit's EPROM and program them into the new unit. Then load the program you are shipping now into the old unit. Set up five of each kind of board in a test rig and let them run 24/7.

Once you have that running, get a can of freeze spray and a hot air gun and see whether heat or cold turns that 6-8 weeks between failures to a few seconds.

Check your power supply with a good DMM (for proper voltage) and with a good scope to see if there is too much noise or spikes. Put the old and new boards next to each other and start comparing digital signals, looking for excess ringing, slow rise/fall times, etc.

Compare the old and new. Are any of the ICs by a new vendor? Are all the resistors and capacitors the same values?

You could also try a Huntron Tracker - it often identifies which part of a design has suddenly started acting differently.

The next step is to stop trying to troubleshoot the boards with your application software running. Instead, write a test program that looks for memory problems. Here is a test sequence that I often use:

TEST PROGRAM ONE:

Do a checkerboard / reverse checkerboard memory test in a loop, stopping on error. Start five boards running the test. This finds flaky writing.
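A minimal C sketch of such a checkerboard / reverse checkerboard pass might look like the following. The function names and the 0x55/0xAA patterns are assumptions for illustration. On the real target the test code must keep its own stack and variables outside the RAM region under test, and the outer loop would run until an error is found.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill the region with alternating pattern / ~pattern bytes, then verify.
 * Returns the offset of the first bad byte, or len if the pass succeeded. */
size_t checkerboard_pass(volatile uint8_t *base, size_t len, uint8_t pattern)
{
    size_t i;

    for (i = 0; i < len; i++)
        base[i] = (i & 1) ? (uint8_t)~pattern : pattern;

    for (i = 0; i < len; i++) {
        uint8_t expect = (i & 1) ? (uint8_t)~pattern : pattern;
        if (base[i] != expect)
            return i;
    }
    return len;
}

/* Checkerboard (0x55 at even offsets) followed by the reverse (0xAA at even offsets). */
size_t checkerboard_test(volatile uint8_t *base, size_t len)
{
    size_t bad = checkerboard_pass(base, len, 0x55);
    if (bad != len)
        return bad;
    return checkerboard_pass(base, len, 0xAA);
}
```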

TEST PROGRAM TWO:

Do walking-ones and walking-zeros memory tests in a loop, stopping on error. Start five boards running the test. This finds any bits that change when another bit changes, and is the framework for the following tests.
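One possible shape for a walking-ones / walking-zeros pass is sketched below. The name `walking_bit_test` is an assumption. This simplified version only reads back the byte it has just written, so it catches coupling between bits within one byte; re-checking the neighbouring bytes (or the whole region) after each write would be a stronger but much slower variant.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk a single 1 and then a single 0 through every bit of every byte.
 * Returns the offset of the first bad byte, or len if the pass succeeded. */
size_t walking_bit_test(volatile uint8_t *base, size_t len)
{
    size_t i;
    unsigned bit;

    for (i = 0; i < len; i++) {
        for (bit = 0; bit < 8; bit++) {
            uint8_t one = (uint8_t)(1u << bit);
            uint8_t zero = (uint8_t)~one;

            base[i] = one;          /* walking one: only this bit set */
            if (base[i] != one)
                return i;

            base[i] = zero;         /* walking zero: only this bit clear */
            if (base[i] != zero)
                return i;
        }
    }
    return len;
}
```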

TEST PROGRAM THREE:

Same test as before, but with a delay before each read, and have the delay double in length each time through the loop. This tests for errors that happen when a bit is left alone for a long time.
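As a rough sketch, the delayed variant could be structured like this. `delay_units` is a hypothetical busy-wait standing in for a real timer, and the `passes` limit exists only so the example terminates on a host; on the target the loop would simply run until an error occurs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical busy-wait; replace with a hardware timer on the real target. */
static void delay_units(unsigned long units)
{
    volatile unsigned long n = units;
    while (n != 0)
        n--;
}

/* Walking-bit pass with a delay between each write and its read-back.
 * Returns the offset of the first bad byte, or len on success. */
size_t walking_bit_test_delayed(volatile uint8_t *base, size_t len,
                                unsigned long delay)
{
    size_t i;
    unsigned bit;

    for (i = 0; i < len; i++) {
        for (bit = 0; bit < 8; bit++) {
            uint8_t one = (uint8_t)(1u << bit);
            uint8_t zero = (uint8_t)~one;

            base[i] = one;
            delay_units(delay);     /* leave the cell alone before reading */
            if (base[i] != one)
                return i;

            base[i] = zero;
            delay_units(delay);
            if (base[i] != zero)
                return i;
        }
    }
    return len;
}

/* Double the delay on every pass. Returns 0 if all passes succeeded,
 * otherwise the delay value at which the first failure was seen. */
unsigned long retention_loop(volatile uint8_t *base, size_t len, unsigned passes)
{
    unsigned long delay = 1;
    unsigned p;

    for (p = 0; p < passes; p++) {
        if (walking_bit_test_delayed(base, len, delay) != len)
            return delay;
        delay *= 2;
    }
    return 0;
}
```

Returning the delay at which the failure occurred gives a first hint of how long a cell has to sit idle before it loses its contents.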

TEST PROGRAM FOUR:

Same test as before, but instead of an increasing delay before each read, make it an increasing number of reads of the same location. This tests for errors that happen when a bit is read many times without being written to.
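The repeated-read variant could be sketched along the same lines. Doubling the read count on each pass is an assumption (the description only says "increasing"), and the names and the `passes` limit are again illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Walking-bit pass that reads each written value `reads` times.
 * Returns the offset of the first bad byte, or len on success. */
size_t walking_bit_test_reread(volatile uint8_t *base, size_t len,
                               unsigned long reads)
{
    size_t i;
    unsigned bit;
    unsigned long r;

    for (i = 0; i < len; i++) {
        for (bit = 0; bit < 8; bit++) {
            uint8_t one = (uint8_t)(1u << bit);
            uint8_t zero = (uint8_t)~one;

            base[i] = one;
            for (r = 0; r < reads; r++)     /* many reads, no write in between */
                if (base[i] != one)
                    return i;

            base[i] = zero;
            for (r = 0; r < reads; r++)
                if (base[i] != zero)
                    return i;
        }
    }
    return len;
}

/* Double the read count on every pass. Returns 0 if all passes succeeded,
 * otherwise the read count at which the first failure was seen. */
unsigned long read_disturb_loop(volatile uint8_t *base, size_t len, unsigned passes)
{
    unsigned long reads = 1;
    unsigned p;

    for (p = 0; p < passes; p++) {
        if (walking_bit_test_reread(base, len, reads) != len)
            return reads;
        reads *= 2;
    }
    return 0;
}
```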

So you will end up with 30 boards running tests. Which ones (if any) fail will tell you a lot about the cause of the failures.

Please post your results even if you solve the issue. Thanks!

--
Guy Macon

You have not said if this is code or data RAM corruption, but these low duty-cycle failures can be hard to nail :)

If these are DATA errors, they can also be caused by SW errors, where some operation is not as atomic as you think, and IF an interrupt hits at the critical timeslot in the operation, it mangles the data.
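To illustrate the kind of race meant here, the following host-side sketch models a bit-set subroutine that does a read-modify-write on a byte which an interrupt handler also modifies. Everything in it (`flags`, `isr`, `raise_irq`, `irq_disable`, `irq_enable`) is a made-up model, not the API of any real processor; on real hardware the interrupt arrives asynchronously rather than through a function call.

```c
#include <stdint.h>

static volatile uint8_t flags;  /* byte shared by main code and the ISR */
static int irq_enabled = 1;     /* host-side stand-in for the interrupt mask */
static int irq_pending = 0;

/* Model ISR: sets bit 0 of the shared byte. */
static void isr(void) { flags |= 0x01; }

/* Model an interrupt request arriving right now. */
static void raise_irq(void)
{
    if (irq_enabled)
        isr();
    else
        irq_pending = 1;        /* deferred until interrupts are re-enabled */
}

static void irq_disable(void) { irq_enabled = 0; }

static void irq_enable(void)
{
    irq_enabled = 1;
    if (irq_pending) {
        irq_pending = 0;
        isr();
    }
}

/* Unsafe: the interrupt hits between the read and the write-back. */
void set_bit_unsafe(uint8_t mask)
{
    uint8_t tmp = flags;            /* read */
    raise_irq();                    /* ISR changes flags here */
    flags = (uint8_t)(tmp | mask);  /* stale copy written back: ISR's bit is lost */
}

/* Safe: the same sequence with interrupts masked around it. */
void set_bit_safe(uint8_t mask)
{
    irq_disable();
    uint8_t tmp = flags;
    raise_irq();                    /* request is held pending */
    flags = (uint8_t)(tmp | mask);
    irq_enable();                   /* pending ISR runs now, nothing is lost */
}
```

Starting from `flags == 0`, `set_bit_unsafe(0x80)` leaves 0x80 (the bit set by the ISR has vanished), while `set_bit_safe(0x80)` leaves 0x81. Note that the symptom of the unsafe version is a single wrong bit, so bit manipulation subroutines are worth reviewing for this pattern.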

The next step would be code that verifies writes by readback, checksums blocks (if you have room, of course), and even uses redundant storage. Even idle-loop code that runs a rolling signature over scattered areas of RAM, looking for faults, can be useful.
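Two of these ideas can be sketched briefly: storing a "static" byte together with its complement (redundant storage), and a simple additive checksum over a block. The type and function names are illustrative assumptions; a CRC would detect more error patterns than a plain sum, at the cost of more code.

```c
#include <stddef.h>
#include <stdint.h>

/* A byte stored together with its bitwise complement. */
typedef struct {
    uint8_t value;
    uint8_t inverse;
} guarded_u8;

/* Write both copies. */
void guarded_write(volatile guarded_u8 *g, uint8_t v)
{
    g->value = v;
    g->inverse = (uint8_t)~v;
}

/* Returns nonzero while value and inverse still agree. */
int guarded_ok(const volatile guarded_u8 *g)
{
    return (uint8_t)(g->value ^ g->inverse) == 0xFF;
}

/* 16-bit additive checksum over a block; any single-bit flip changes it. */
uint16_t block_sum(const volatile uint8_t *p, size_t len)
{
    uint16_t sum = 0;
    size_t i;

    for (i = 0; i < len; i++)
        sum = (uint16_t)(sum + p[i]);
    return sum;
}
```

An idle-loop check could call `guarded_ok` on each guarded variable, or compare `block_sum` of the static area against the value computed right after startup, and log the address and time of the first mismatch.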

-jg

Jim Granville

These four test programs are well-known, effective memory test techniques; they should give you clear information and bring you close to the solution.

Karthik Balaguru


The processor does not have bit manipulation instructions which could reach the affected memory addresses, but bit manipulation subroutines DO exist.

Will do that (should have done it before, in fact).

Did it already, but just to be sure I will try it again.

Lot of work. Will need to get some help from the hardware guys.

Before going in this direction I wanted some opinion on whether it is reasonable to look for a hardware failure. It seems you say 'yes' :-)

I don't think I can get that many boards. They are only manufactured on demand. But of course, if I have fewer boards, I will need more time.

Will do this if I get results (or cannot get results), but it will take some time.

Thank you very much for your comments.

-- Dirk

Dirk Zabel
