Invalid instruction pointer

I have been working on a problem with an installed product for approximately a year now. Despite much investigation, including careful review of the code and repeating the ANSI hardware testing, we have been unable to recreate the problem in house. However, through analysis of the symptoms, we have come to believe that something is causing the instruction pointer in this embedded application to point to the wrong code address.

My question is: what external events can affect a microprocessor in such a way that it essentially gets "lost" in execution? We are reasonably certain that an external event is the cause, rather than a stack problem, as the majority of the installed units are working fine and have been for over a year.

The microprocessor we are using does not have an illegal instruction trap or watchdog timer, so in order to fix the problem, we would likely need hardware modifications. I would like any information that any of you might have gleaned in past experience with similar issues so that we can pursue testing based on most likely causes.

ginger.zinkowski

I have a project that is using salvaged components (desoldered ICs with short leads) installed in tin-plate sockets; with temp. and humidity changes, the parts move and integrity of connections suffers so that the mcu does bad external memory fetches. All unused areas of the firmware store contain a 'jump relative to self' instruction so that when the execution goes south I can reset and debug with less chance of losing state. Is your product using socketed ICs?

I also have some projects working in very harsh RFI environments; these necessitated complete enclosure in tin can faraday shields, together with ample power rail isolation and ferrite bead installations in order to stop the glitches. What is the operating environment of your product?

Regards,

Michael

msg

Intel issued an app-note many years ago (for the 8048, no less!), "Designing High-Reliability Software for Automotive Applications", or something like that. It assumed that someone would be careless with a hot sparkplug lead some day, & the CPU would make a random jump to *any* accessible location. The idea was to be able to recover from that. They programmed in assembler, not C, which allowed some cunning tricks. For instance, share out the unused ROM space, so that there is a dead zone after every unconditional jump/return. Fill such space with jumps to recovery code. There was much more in that vein. (Sorry, I don't have a copy to hand.)
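A minimal sketch of that fill step, written as a host-side build tool in C rather than with the assembler tricks the app-note used. The encoding is an assumption chosen purely for illustration: it uses the 8051 `LJMP` (0x02, address high byte first) and `NOP` (0x00) opcodes. `fill_dead_zone` and the recovery address are hypothetical names, not taken from the app-note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding (8051, for illustration only): substitute the real
 * opcodes and jump length for the processor actually in use. */
#define NOP_OPCODE   0x00u
#define JUMP_OPCODE  0x02u   /* LJMP addr16, high byte first */
#define JUMP_LEN     3u

/* Fill image[start..end) so that the region ends with back-to-back jumps
 * to recovery_addr. Leading bytes that cannot hold a whole jump are set
 * to NOP, so execution landing there slides into the first jump.
 * Returns the number of jump instructions written. */
size_t fill_dead_zone(uint8_t *image, size_t start, size_t end,
                      uint16_t recovery_addr)
{
    assert(image != NULL && start <= end);

    size_t pos = start;
    size_t lead = (end - start) % JUMP_LEN;
    size_t jumps = 0;

    for (size_t i = 0; i < lead; i++) {
        image[pos++] = NOP_OPCODE;
    }
    while (pos + JUMP_LEN <= end) {
        image[pos++] = JUMP_OPCODE;
        image[pos++] = (uint8_t)(recovery_addr >> 8);
        image[pos++] = (uint8_t)(recovery_addr & 0xFFu);
        jumps++;
    }
    return jumps;
}
```

A stray jump that lands on an address byte rather than on the opcode will still execute that byte as an instruction, so it may be worth picking a recovery address whose bytes decode harmlessly on the target.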

David R Brooks

Stack overruns are only one source of this kind of problem. (And you are absolutely 100% positively certain with no doubt that it is not a stack overrun?)

The overrun does not have to be corrupting the stack. Does the code run from RAM? You may have had a pointer overwrite code with some data.

Do you have any state machine tables that use function pointers? You may have a bad state value so that you jump to a non-existent function.
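One way to guard against that, sketched in C under the assumption that states are small integers indexing a handler table. The state names, handlers, and `bad_state_count` counter are all hypothetical; the point is only that the index is range-checked before the indirect call.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*state_handler_t)(void);

/* Hypothetical states; STATE_COUNT must stay last. */
enum { STATE_IDLE, STATE_RUN, STATE_DONE, STATE_COUNT };

static int last_handled = -1;          /* records which handler ran */
static unsigned bad_state_count = 0;   /* rejected dispatches, for later analysis */

static void on_idle(void) { last_handled = STATE_IDLE; }
static void on_run(void)  { last_handled = STATE_RUN; }
static void on_done(void) { last_handled = STATE_DONE; }

static const state_handler_t handlers[STATE_COUNT] = {
    on_idle, on_run, on_done
};

/* Call the handler for `state` only if the index is in range and the
 * table entry is populated. Returns 1 if a handler ran, 0 if the state
 * value was rejected. */
int dispatch_state(unsigned state)
{
    if (state >= STATE_COUNT || handlers[state] == NULL) {
        bad_state_count++;
        return 0;
    }
    handlers[state]();
    return 1;
}
```

Counting the rejections, rather than silently ignoring them, gives a field unit something to report if a corrupted state value ever does show up.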

Do you have fully debugged interrupt routines? You might be forgetting to push/pop a value on the stack in some certain special case.

Are you 100% sure the hardware is working? No overclocking the CPU or memory? Do you run diagnostics on power-up? ESD protection?

What is the difference between your lab setup and the field? There could be some clues there. (In this regard, I spent nearly a year trying to debug an intermittent freeze-up on a machine. In the lab I ran the system using an ICE (In-Circuit Emulator). I finally concluded there was a hardware issue, and that using the ICE changed the impedance of the circuit enough to avoid the problem. Of course, we couldn't ship an ICE with each unit sold!)

Without knowing your code, only vague generalities like above come to mind.

It is very hard to track down intermittent problems. Patience and lots of information are required. You need a lot of evidence to push the problem to the hardware side. That's why I'll end the way I started: Are you really sure this is not a buffer overrun (stack) issue?

HTH,
Ed Prochak
Let me know if you want more detailed help off-line.

Ed Prochak

That's a fairly risky basis for certainty. The one problem unit may be experiencing very different task loading, external timing intervals, etc., than the others. Just because most of them work doesn't mean they don't all have a deadly bug.

If you cannot recreate the failure, you will probably have to go to the problem location and try modifications to both software and hardware. Add tracing output. Change the power supply to an external one. Add shielding. Etc... figure out what it is that makes the difference.

Also, you may not have dedicated trap capabilities, but with some care you may be able to insert jumps to a trap routine between your operational code and data.

cs_posting

Not long ago, I was working on a PowerPC 860 operating at 3.3V that had a 74HCxxx OR gate operating at 5V driving an interrupt input. Occasionally, an undershoot of about 2.5V on the interrupt would cause the system to go out into the weeds. The moral of the story is... beware of mixed-voltage systems and fast edges ;-)

--
Michael N. Moran           (h) 770 516 7918
5009 Old Field Ct.         (c) 678 521 5460
Michael N. Moran

Electrical noise. Insufficient filtering or decoupling. Noisy peripherals (solenoids or motors).

Neil

  • Brown-outs. Is the power supply dipping below the recommended minimum voltage?
  • Interrupt frequency. Is some external process causing an interrupt to be hammered at a much higher frequency than you anticipated? This can cause problems by disturbing the timing of other code, or by using a higher than anticipated amount of memory on the stack (or stacks, if you use a kernel that puts interrupt responses on the task stacks).
  • Noisy communication. Is the equipment that your thing is connected to sending invalid comms data, or is your comms data getting otherwise corrupted? Bad comms data in conjunction with fragile parsing could lead to stack overflows, memory leaks, or other primary faults that then result in branches to East Fishkill.
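As an illustration of the "fragile parsing" point, here is a sketch of a parser that checks every length before copying anything. The frame layout (`[length][payload...][checksum]`, with the checksum being the 8-bit sum of the payload) is invented for this example, as are `parse_frame`, `frame_t`, and `MAX_PAYLOAD`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PAYLOAD 16u   /* hypothetical limit for this sketch */

typedef struct {
    uint8_t payload[MAX_PAYLOAD];
    size_t  length;
} frame_t;

/* Parse one frame of the assumed form [len][payload...][checksum].
 * Returns 0 on success, or a negative code naming the check that failed.
 * Nothing is written to `out` unless every check has passed. */
int parse_frame(const uint8_t *raw, size_t raw_len, frame_t *out)
{
    assert(raw != NULL && out != NULL);

    if (raw_len < 2u) {
        return -1;                       /* no room for len + checksum */
    }
    size_t len = raw[0];
    if (len > MAX_PAYLOAD) {
        return -2;                       /* claimed length exceeds buffer */
    }
    if (raw_len != len + 2u) {
        return -3;                       /* truncated or over-long frame */
    }
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + raw[1u + i]);
    }
    if (sum != raw[1u + len]) {
        return -4;                       /* corrupted payload */
    }
    memcpy(out->payload, &raw[1], len);
    out->length = len;
    return 0;
}
```

Distinct error codes make it possible to log which kind of bad frame a field unit is actually seeing, which may help separate a noisy link from a misbehaving peer.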

In summary: Look either for strange electrical events that are taking the pins of the processor out of their safe operating range, or for environmental effects that are unusual and may be lighting up software bugs that you never tested for.

If you can, you should make some software that's instrumented for things like heap usage (if you use a heap), buffer usage for all your comms, stack usage, etc., and that either logs events (carefully -- event logging can cause problems on its own) or that saves the state of the machine for later analysis. Then try to use these results to further your investigations.
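For the stack-usage part, one common approach is to "paint" the stack region with a known byte at start-up and later count how much of the paint survives. The sketch below assumes a stack that grows downward toward `region[0]` and operates on an ordinary array, since how to obtain the real stack bounds is toolchain-specific. `STACK_FILL`, `stack_paint`, and `stack_unused` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5u   /* arbitrary paint pattern (assumption) */

/* Fill the whole region with the paint pattern. On a real target this
 * would run once at start-up, before the stack is in heavy use. */
void stack_paint(uint8_t *region, size_t size)
{
    assert(region != NULL);
    for (size_t i = 0; i < size; i++) {
        region[i] = STACK_FILL;
    }
}

/* Count the bytes at the low end of the region that still hold the
 * paint pattern. Assumes the stack grows downward toward region[0],
 * so this is the amount of stack that has never been touched. */
size_t stack_unused(const uint8_t *region, size_t size)
{
    assert(region != NULL);
    size_t n = 0;
    while (n < size && region[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

The result is only an estimate: a stacked byte that happens to equal the paint pattern at the deepest point makes usage look slightly lower than it was.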

--
Tim Wescott
Control systems and communications consulting
Tim Wescott

Are you perhaps referring to the following document? "Designing Microcontroller Systems for Electrically Noisy Environments"

formatting link

Which itself refers to the following document: Yarkoni, B. and Wharton, J., "Designing Reliable Software for Automotive Applications," SAE Transactions, 790237, July 1979.

cf. also

formatting link

Spoon

see also

formatting link

w..

Walter Banks

Yes indeed :-) Yarkoni & Wharton is the one I was thinking of.

David R Brooks
