Debugging crashes which appear after a long time

Gurus,

I am encountering a strange situation in one of our consumer-electronics embedded products. The product runs a proprietary RTOS on a custom processor supplied by STMicroelectronics. The box runs for around 8 hours without any problem; after 8 hours it crashes. I am clueless about how to debug it. The debugger window throws the error "task out of scope, stack frame cannot be set!". I suspected a stack overflow in one of the running tasks and increased the stack sizes, but that did not help.

I tried adding trace messages printed to the console, but because of the time the console printing takes, my application no longer even comes up properly. I cannot use the debugger either, because of this out-of-scope error. The irony is that the problem appears only after 8 hours, so each attempt to reproduce it costs me another 8 hours of waiting.

Are there any good approaches you experts have used to solve such problems? It would be helpful if someone could point me in the right direction; I am looking for debugging tips that can help me sort out this issue.

Looking forward to your replies, and thanks in advance.

Regards,
s.subbarayan

Reply to
ssubbarayan

If it happens consistently after a fixed period of time, that is a good sign :) Things to check:

- memory leak?

- a hardware counter overflows?

- a counter variable overflows? (see the sketch below)

- some free-running hardware timer generates an interrupt that is not handled?

- the building's air-conditioning system starts/stops, generating a surge on the mains line?
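On the counter point: the classic failure mode is comparing a wrapping tick counter directly. A minimal sketch of the bug and the fix, with get_tick_count() standing in for whatever free-running counter your system provides:

#include <stdint.h>

/* Hypothetical tick source; substitute your own free-running counter. */
extern uint32_t get_tick_count(void);

/* WRONG: stops working the first time the counter wraps. */
int deadline_passed_buggy(uint32_t deadline)
{
    return get_tick_count() >= deadline;
}

/* Rollover-safe: unsigned subtraction first, then a signed test.
 * Correct as long as the two values are within half the counter
 * range of each other. */
int deadline_passed(uint32_t deadline)
{
    return (int32_t)(get_tick_count() - deadline) >= 0;
}

A fixed time-to-failure is very often exactly this: some counter reaching its wrap point.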

Would it crash if you ran another (idle) task instead of yours? If yes - check the RTOS.

HTH,

Vadim

Reply to
Vadim Borshchev

Try stuffing a known pattern into the RAM used for the stacks. After it crashes, check those areas to see if that pattern gets overwritten.
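Something along these lines, assuming the RTOS lets you at each task's stack region before the task starts (names are placeholders, not a real API):

#include <stdint.h>
#include <stddef.h>

#define STACK_FILL 0xDEADBEEFu

/* Paint the whole stack with a known pattern before the task runs. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++)
        stack[i] = STACK_FILL;
}

/* Count how many words at the far end still hold the pattern.
 * Assumes a descending stack with stack[] based at the low address;
 * reverse the scan for an ascending stack. Zero means the task ran
 * off the end at least once. */
size_t stack_headroom(const uint32_t *stack, size_t words)
{
    size_t free_words = 0;
    while (free_words < words && stack[free_words] == STACK_FILL)
        free_words++;
    return free_words;
}

You can call stack_headroom() periodically from a low-priority task, or simply inspect the painted regions from the debugger (or a raw memory dump) after the crash.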

Reply to
GaryKato

Do you have any theories at all? If so, you need to devise a configuration that will make it crash more often.

I had a problem years ago that occurred once every few *weeks*. Fortunately a code review threw up one theory reasonably early - and I spent the next few weeks proving it.

  1. Can you accelerate the tasks that the system is doing in order to decrease the time between crashes? For example, if there is a regular task scheduled for, say, every minute, drop that down to every 30 s and see if it crashes after 4 hours. Every 5 s? And so on (one way to do this is sketched after these suggestions).

This could be done on a system-wide level (just to decrease the turn-around) or on a task level (to identify the task at fault).

Is it a consistent 8 hours? Or is that an average based on some probability of two inter-related events happening? Can you increase the number of tasks running in parallel to accelerate the crash?

  2. Divide and conquer. Is it possible to disable certain tasks? Does it still crash when these tasks are not running? Does the system crash even with only a single idle task (and nothing else) running?

Your ideal situation is having (a) a configuration that doesn't crash (however cut-down that is) as well as (b) a configuration that crashes after a few minutes. Then you can narrow in on the problem from there.
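One cheap way to do the system-wide version, assuming your periods are compile-time constants. PERIOD_MS() and task_sleep_ms() are illustrative, not a real API:

/* Debug-only time compression: divide every scheduled period by one
 * scale factor, so an "8 hour" bug hopefully shows up in minutes. */
#ifdef DEBUG_TIME_SCALE
#define TIME_SCALE 16             /* run 16x faster than real time */
#else
#define TIME_SCALE 1
#endif

#define PERIOD_MS(ms) ((ms) / TIME_SCALE)

extern void task_sleep_ms(unsigned ms);  /* stand-in for the RTOS delay call */

void housekeeping_task(void)
{
    for (;;) {
        /* ... periodic work ... */
        task_sleep_ms(PERIOD_MS(60000)); /* 1 min normally, 3.75 s scaled */
    }
}

Applying the macro per-task rather than globally gives you the task-level variant for isolating the task at fault.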

Regards,

--
Mark McDougall, Engineer
Virtual Logic Pty Ltd, 
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255 Fax: +612-9599-3266
Reply to
Mark McDougall

Even if the stack has overflowed, you should be able to look at the end of the stack. You'll be able to find the last few values that were placed there. The return addresses will tell you where those calls came from. This might give you a clue as to what was happening just before the crash.

Peter

Reply to
Peter

Hello,

A really challenging and interesting problem!

As people have already mentioned, a proper code review and some brainstorming with all the people on the team will help a lot in identifying the problem area and in reproducing the problem early (say, after 10 minutes).

Some more points to add.

Instead of printf's, use while(1); or assert() in all the conditions that you have assumed are impossible to occur (the doubtful conditions).
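For example, a minimal halt-in-place assert; disable_interrupts() stands in for whatever your processor or RTOS provides:

/* Freeze the system where the "impossible" condition fired, instead
 * of printing. The halted loop is where you attach the debugger or
 * take a memory dump. */
extern void disable_interrupts(void);

#define ASSERT(cond)                                         \
    do {                                                     \
        if (!(cond)) {                                       \
            disable_interrupts();                            \
            for (;;) { /* halted: inspect from debugger */ } \
        }                                                    \
    } while (0)

Usage: ASSERT(queue_count <= QUEUE_DEPTH); sprinkled over every branch you believe cannot be reached.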

Use your own debug versions of malloc and free: write a signature into each malloc'ed block and check for that signature while freeing (basically to catch heap corruption, double frees, and memory leaks).
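A rough sketch of such wrappers. Illustrative only, and not thread-safe as written, so protect the counter with a critical section if several tasks share the heap:

#include <stdlib.h>
#include <stdint.h>

#define ALLOC_MAGIC 0xA110C8EDu
#define FREED_MAGIC 0xDEADDEADu

typedef struct {
    uint32_t magic;   /* signature checked on free */
    size_t   size;    /* payload size, for leak accounting */
} alloc_hdr_t;

static size_t bytes_in_use;   /* watch this climb to spot a leak */

void *dbg_malloc(size_t size)
{
    alloc_hdr_t *h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->magic = ALLOC_MAGIC;
    h->size  = size;
    bytes_in_use += size;
    return h + 1;              /* caller sees memory after the header */
}

void dbg_free(void *p)
{
    if (p == NULL)
        return;
    alloc_hdr_t *h = (alloc_hdr_t *)p - 1;
    if (h->magic != ALLOC_MAGIC)
        for (;;) ;             /* corrupt block or double free: halt here */
    h->magic = FREED_MAGIC;    /* makes a second free obvious */
    bytes_in_use -= h->size;
    free(h);
}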

Above all, I suggest you read the article "Proactive Debugging" by Jack Ganssle. It is a really good article and you will definitely pick up some useful points.

Please keep us updated about the status of the problem, as it will be very good learning for all of us.

Best Regards, Venkatesh Manja.


Reply to
manja


If it were stack leakage, then increasing the stack size would likely have increased the time to failure.

You might look at separating code space from data space by a greater distance to see if that changes the timing. Buffer overruns and the like...

Also, changing the order of global storage might give you some insight.

Regards, Ken Asbury

Reply to
Ken Asbury

Well, here's a scattering of suggestions. Take the ones that fit and trash the rest:

1 - Above all, keep trying to divide the problem in two with experiments that have two nearly equally likely outcomes. If you can do this, you will get through the problem fairly quickly no matter where the issue lies.
2 - If it's a proprietary RTOS, you have the source, right? Track down that error message and become an expert on what causes it. Oh, the message is from the debugger, isn't it? Still, find out what you can from the vendor docs / Google.
3 - The message sounds like a bad task pointer is being dereferenced. Since you have control of the OS source code, you can add a signature value to the task structure that you set when the task is created. During debug, have the OS check for the correct signature value each time it gets ready to dereference a pointer to a task structure, and then pop up the debugger if the check fails.
4 - Check the state of your heap. Is it close to full? If so, track it down as a heap issue. Start looking for leaks, determine how much memory you think should be used, and determine why there is a difference (or if there isn't one, change the design).
5 - Can you get 3 or 4 systems up so you can get several experiments in per day?
6 - Can you get access to the failing system at night so you can get close to 3 experiments in per day: morning, evening, and night?
7 - If you have values that roll over, such as indices into ring buffers, initialize them at startup to be close to rollover so you find problems early on.
8 - You can define a structure which describes events in your code, and make a ring buffer full of them somewhere you can find it after the crash. Then log each interesting section of the code to see what is happening shortly before the crash (sketched just below this list).
9 - How reliable is that 8 hours? Can you find a place to stick a breakpoint before the crash?
10 - If the hardware is not yet reliable and you just can't explain the problem in terms of software processes, check that the power, clock, and reset lines are all clean and stable before you spend too much time pulling your hair out. (I don't suspect this is the issue, though, given the repeated 8 hour time span.)
11 - Is something unusual happening around the time of the failure? If so, it's probably not a coincidence, and you can look for how that occurrence is handled in the code (and/or hardware).
12 - Put a logic analyzer on the processor address bus, record the addresses, and find a way to trigger on the fault. This should give you some idea of where you are in the code at the time of failure. Of course, the instruction stream is almost certainly cached, so you may need to be more clever about what you watch, but the hardware events should be closely related to where you are in the code, right?
13 - As previous writers have said, identify the critical parameters of your system and double some of them to see if the failure frequency rises. Packet size, number of sessions open, etc.
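To make #8 concrete, a rough sketch; the fields and get_tick_count() are illustrative, and the index update is not interrupt-safe, so guard it if you log from ISRs:

#include <stdint.h>

#define TRACE_ENTRIES 256              /* power of two for cheap wrap */

typedef struct {
    uint32_t timestamp;                /* free-running tick */
    uint16_t event_id;                 /* which point in the code */
    uint16_t arg;                      /* small payload, e.g. task id */
} trace_entry_t;

/* Keep this at a known address (check the map file) so you can dump
 * it from the debugger, or a raw memory read, after the crash. */
static volatile trace_entry_t trace_log[TRACE_ENTRIES];
static volatile uint32_t trace_idx;

extern uint32_t get_tick_count(void);  /* placeholder tick source */

void trace(uint16_t event_id, uint16_t arg)
{
    uint32_t i = trace_idx++ & (TRACE_ENTRIES - 1);
    trace_log[i].timestamp = get_tick_count();
    trace_log[i].event_id  = event_id;
    trace_log[i].arg       = arg;
}

The newest entry sits just behind trace_idx, so after a crash you read the buffer backwards from there.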

Hope something here helps. Personally, I'd go for #3 first.
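To make #3 concrete: stamp each task structure with a signature at creation and verify it before every dereference. Names are illustrative, not your RTOS's real API:

#include <stdint.h>
#include <stddef.h>

#define TASK_MAGIC 0x7A5CB10Cu         /* arbitrary "valid task block" stamp */

typedef struct task {
    uint32_t magic;                    /* must always equal TASK_MAGIC */
    /* ... the RTOS's real task fields ... */
} task_t;

/* Wrap every dereference of a task pointer inside the RTOS, e.g.
 *     task_check(next)->state = READY;   (hypothetical field) */
task_t *task_check(task_t *t)
{
    if (t == NULL || t->magic != TASK_MAGIC)
        for (;;) ;                     /* corrupt task pointer: halt here */
    return t;
}

Set t->magic = TASK_MAGIC in the task-creation path, and clear it when a task is destroyed so stale pointers get caught too.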

- Tim.

Reply to
tbroberg
