Hi, I am confronted with the following problem: we build devices with 8-bit CPUs, about 128 K of banked EPROM and 16 K of RAM. They have been working flawlessly for many years. Now we sometimes get reports from customers about mysterious failures.

These failures seem to be caused by wrong memory contents, i.e. single bits are flipped, sometimes from 0 to 1, sometimes from 1 to 0. Most of the memory contents are being changed all the time, except for a fairly small number of bytes which go through a sequence of values during startup and are not changed or refreshed afterwards. If one of those "static" bytes changes its contents, the operation of the device is usually disturbed.

This effect even occurs in a test environment, unfortunately very seldom (at intervals of about 6-8 weeks, not reproducible). I don't have an ICE for this processor (none is available since the processor is far too old) and don't see how to find the cause of the problem. I already built some test versions which check some of those static variables, but inevitably the next time the error occurred at another address that was not being watched.

For example, I observed the following changes: AA -> 2A, 59 -> 58, 05 -> 07, 05 -> 00. All those bytes had different addresses, but all in the range from 8800 to 8D00. Btw, the one double-bit error in this list occurred in the field, and I have no idea how long it took until the error was detected, so I don't know whether both bits changed at the same time.

Now my question: do you think I am correct if I conclude from the pattern of error occurrences that there is a hardware fault in the memory system? I think a software error is quite likely to change more than a single bit, for example if a bad pointer is used for a write operation. Especially the fact that different but isolated addresses are affected seems to point towards hardware. Or is this conclusion unfounded? Do you have any idea what could be done to find the cause of the failures?
My only idea is to connect a logic analyzer to the test device, but (1) this would tie up the only one we have for many weeks, (2) I doubt I could define a trigger condition complex enough to catch the problem when it occurs, and (3) it has only 16 channels, and I think I would need more to trace the address lines plus the bank, data and control lines.

Sincerely
-- Dirk

posted 16 years ago