Unexplained Hang During Boot

eon_blue_80 20 years ago

I am experiencing a very bizarre problem with vxWorks and I am hoping that someone might be able to offer some suggestions on where to start looking to determine the root of the problem.

VxWorks is being used on a PPC-based Synergy Microsystems VME SBC. The problem seems to arise at random after rebuilding the OS image. For instance, commenting out a single 'printf' statement such as printf("Message Received\n"); in application-level code that is never even invoked, and then rebuilding, can produce an image that hangs early in the boot procedure. Uncomment the 'printf' statement, rebuild the image, and the OS boots without error. Note that this routine is not called at any time during the boot procedure, so the code containing that printf is never executed.

This problem has been experienced by multiple developers on different modules. I am not sure if this is a hardware, or a software type of problem. Can anyone think of any reason why something as non-intrusive as commenting out a printf statement, in a function that is never even invoked, would cause the OS to hang during boot?

The printf statement is only adding a handful of bytes to the resultant image and larger images than the ones that fail have been booted successfully.

Similar hangs have been produced by changing array sizes in uncalled routines, etc. (i.e., add a few more bytes to an array in an uncalled function and the image hangs during boot; add a few more bytes and the image loads fine).

Bill Pringlemeir 20 years ago

[snip]

This sounds like a cache problem. The "printf" itself is unrelated to the problem. It just changes the image size at the "right" place. You could add a ".bytes 7" or something in the code section and the same thing would result.

At some point in the boot sequence, there may be an alias between the data and code caches. It could be when the MMU is turned on. The address space will change, and code must often jump in a very specific sequence. It may be a conflict with a device. For instance, an "eieio" instruction may be necessary in some cases, but due to code section alignment the code executes at different times, and the "eieio" becomes necessary or unnecessary depending on the build.

It is very good that you try to hunt this down. I've known several "senior" people who have let this type of problem go on for ever.

You can toggle an LED or a general-purpose I/O pin watched with a scope, or you can use some polled console output, to provide checkpoints in the boot sequence and see where the hang occurs.
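A minimal sketch of the checkpoint idea, assuming a hypothetical memory-mapped LED/GPIO latch. The register is simulated here by an ordinary variable so the code compiles and runs on a host; `checkpoint_reg`, `boot_checkpoint`, and the stage codes are illustrative names, not part of VxWorks or any BSP.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped LED/GPIO latch. On real hardware this would
 * point at the board's register address (board-specific, assumed here). */
static volatile uint32_t fake_led_latch;
static volatile uint32_t *const checkpoint_reg = &fake_led_latch;

/* Hypothetical checkpoint codes, one per boot stage of interest. */
enum boot_stage {
    STAGE_ENTRY      = 0x01,
    STAGE_CACHE_ON   = 0x02,
    STAGE_MMU_ON     = 0x03,
    STAGE_DRIVERS_UP = 0x04
};

/* Record the stage with a single store. No buffered I/O and no interrupts
 * are involved, so this stays usable very early in the boot sequence.
 * (On PowerPC a device store may also need an ordering instruction such
 * as eieio; that is omitted in this host-side sketch.) */
void boot_checkpoint(enum boot_stage stage)
{
    *checkpoint_reg = (uint32_t)stage;
}

/* Last value written: after a hang, this is what the LEDs or scope show. */
uint32_t last_checkpoint(void)
{
    return *checkpoint_reg;
}
```

Calling `boot_checkpoint()` between stages narrows the hang to the code after the last visible stage code. Since the calls themselves shift the image layout, it probably makes sense to keep them in both the working and the failing builds.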

The important point is that the "printf" has nothing to do with the problem besides making the code move around. You can verify this by inserting different dummy routines with different lengths (a cache line is typically 32/64 bytes). Observing a map file of the full image and knowing the location of these bytes can be helpful. For instance if code following this is an ethernet driver, then that may be helpful to know.
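To relate map-file addresses to cache lines, a small helper along these lines may be useful. It is only a sketch: the 32-byte line size is an assumption to be checked against the CPU manual, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed cache line size in bytes; check the CPU manual for the real value. */
#define CACHE_LINE_SIZE 32u

/* Offset of an address within its cache line. */
uint32_t cache_line_offset(uint32_t addr)
{
    return addr % CACHE_LINE_SIZE;
}

/* Number of cache lines touched by the region [addr, addr + size).
 * 64-bit arithmetic avoids wrap-around near the top of the address space. */
uint32_t cache_lines_spanned(uint32_t addr, uint32_t size)
{
    if (size == 0u) {
        return 0u;
    }
    uint64_t first = (uint64_t)addr / CACHE_LINE_SIZE;
    uint64_t last  = ((uint64_t)addr + size - 1u) / CACHE_LINE_SIZE;
    return (uint32_t)(last - first + 1u);
}
```

Feeding in a routine's address and size from the map file for a booting and a hanging build shows whether the few added bytes pushed it across a line boundary.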

It could also be reading of garbage strings, code, constant data. I have also seen one section of code round MMU rights and another read to the byte. Sometimes this rounding is wrong and a "bus error" happens due to memory not being sized right.

hth, Bill Pringlemeir.

You have the right to remain silent -- so shut up! vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"

MetalHead 20 years ago

Another possibility is that errant code is corrupting memory during the boot process. The commonest case is the "wild pointer", where an uninitialized pointer is used to write data. Other possibilities would be over-running the reserved stack area, or using pointers to buffers that have been returned to the buffer pool and re-used. I have also seen incorrect function prototypes cause this type of problem. If you are using vector tables in RAM, walking on them will cause this type of problem too.
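One cheap way to test for an over-run of a reserved stack area is to pre-fill it with a known pattern and later count how much of the pattern survives. The sketch below assumes a stack that grows downward toward `base`; the pattern value and function names are arbitrary choices for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Arbitrary fill pattern, chosen to be unlikely as real data. */
#define GUARD_PATTERN 0xDEADBEEFu

/* Fill a reserved region with the pattern before it is put to use. */
void region_fill(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = GUARD_PATTERN;
    }
}

/* Count the words at the low end of the region that still hold the pattern.
 * For a stack growing downward toward base, a result of zero means the
 * stack reached its limit and may have run past it. */
size_t region_untouched_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == GUARD_PATTERN) {
        n++;
    }
    return n;
}
```

The same fill-and-check approach can be applied to an unused gap placed next to a RAM vector table to see whether something is walking on it.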

The way I would attempt to solve this problem is with a logic analyzer. Start out by finding where the code hangs. Then see if the instruction sequence to get there took any un-explainable jumps. See if the departure point for the unexplainable sequence values match the expected values for that address. If they don't match the expected values, use writes to those locations to trigger the logic analyzer and you should be able to locate the errant code. The departure from expected execution could also be un-initialized or corrupted vectors in the vector table.

I am not familiar with the particular VME card you mentioned, but memory management hardware could protect you from a number of the things I described. Because it is a boot sequence problem, memory management hardware may not be operational at this point.

Another place to look would be the linker command file. Are all of the segments large enough and in non-overlapping regions of memory? The logic analyzer approach would lead you to this type of problem, but it could be a painful path that could be avoided by careful study.
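The overlap check on the linker regions can also be done mechanically. This is a sketch only: the `mem_region` structure is hypothetical, and the base and size values would have to be copied by hand from the linker command file or the map file.

```c
#include <stdint.h>
#include <stddef.h>

/* One memory region as described in the linker command file (hypothetical). */
struct mem_region {
    const char *name;
    uint32_t    base;
    uint32_t    size;
};

/* Nonzero if the half-open ranges [base, base + size) intersect. */
int regions_overlap(const struct mem_region *a, const struct mem_region *b)
{
    uint64_t a_end = (uint64_t)a->base + a->size;
    uint64_t b_end = (uint64_t)b->base + b->size;
    return a->size != 0u && b->size != 0u &&
           a->base < b_end && b->base < a_end;
}

/* Index of the first region that overlaps a later one, or -1 if none do. */
int first_overlap(const struct mem_region *r, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (regions_overlap(&r[i], &r[j])) {
                return (int)i;
            }
        }
    }
    return -1;
}
```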

Good Luck, Bob

ssubbarayan 20 years ago

Bill/Others, excellent and enlightening explanations from you all. We are facing a similar kind of issue with an STMicroelectronics proprietary board, and we were using a proprietary OS. Though the OS is different, I believe the problem is similar to the one being addressed here.

We faced a situation where just adding a printf inside one function, or introducing a single i=1 (though we did not use the variable 'i' anywhere else), made the feature work, and removing the statement made us lose the feature. We tried hard to figure out the problem until one day we inspected the cache and found that with the data cache disabled the feature worked just fine.

Now the question I would like answered is: what is the best way to figure out whether the problem is with cache memory? One more behaviour I have observed is that when we debug with a breakpoint the feature works fine, but when we use the binary production version of the same code it never works! This made debugging even more difficult. Could the cache have something to do with this difference between the debug and production versions?

I would like to avoid such problems in future, so it would be helpful if some of you enlightened ones could explain this to me.

I am also posting the query to comp.arch.embedded, as this will help me get input from a lot of experienced people. Pardon me in case I am wrong.

Looking forward to all your replies, and thanks in advance,

Regards, s.subbarayan

Bo 20 years ago

So when you determine for certain that cache is the problem, what typically is the solution? Sprinkling cache flushes throughout the code? or what?

Bo

Bill Pringlemeir 20 years ago

This is *unlikely*, as the OP noted that adding un-executed code would cause the problem. If the code were directly corrupting memory, changing un-executed code would be unlikely to introduce the problem, especially if the added code makes no allocations and no writes to memory. If simply switching the cache on or off causes the crash, I find it extremely unlikely that it is memory corruption.

So there is a quick way to rule this out. Disable/enable the cache with a crashing image. Often you can arrange the code so that the size is the same, just a constant has changed to disable/enable the cache.
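A sketch of how such a size-neutral switch might look. The cache control routines are host-side stand-ins with hypothetical names; on a target they would be replaced by whatever the BSP or OS provides. The flag is data rather than a preprocessor conditional, so both branches are always compiled in.

```c
#include <stdint.h>

/* Simulated cache state, standing in for the real hardware. */
static int icache_on;
static int dcache_on;

/* Stand-ins for the platform's cache control calls (hypothetical names). */
static void hw_cache_enable(void)  { icache_on = 1; dcache_on = 1; }
static void hw_cache_disable(void) { icache_on = 0; dcache_on = 0; }

/* The one value to change between builds. It is volatile so the compiler
 * cannot fold it and drop a branch, which would change the code size. */
volatile const uint32_t boot_cache_flag = 1u;

/* Both paths are present in every build; only the flag value differs. */
void boot_apply_cache_policy(uint32_t enable)
{
    if (enable) {
        hw_cache_enable();
    } else {
        hw_cache_disable();
    }
}

/* Nonzero when both simulated caches are on. */
int caches_enabled(void)
{
    return icache_on && dcache_on;
}
```

The boot code would call `boot_apply_cache_policy(boot_cache_flag)`. Comparing the map files of the two builds is a reasonable way to confirm that nothing actually moved.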

fwiw, Bill Pringlemeir.

Anyone who trades liberty for security deserves neither liberty nor security - Benjamin Franklin vxWorks FAQ, "http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html"

MetalHead 20 years ago

I have seen this happen in the past in this manner. By adding code to the code segment, you move the relative position of things around. Even if the code you added does not get executed, if the I/O drivers are at the opposite end of the link map from the boot code, just increasing or decreasing the relative separation of components can cause the corruption to land in a place that does not get executed during the boot process, or cause a different kind of problem. C libraries are another good candidate for winding up at the far end of the link map. If you are lucky, this will show up as an illegal instruction trap; if you are unlucky, it shows up as branches to nowhere or tight loops.

This would be a good first step. The OP sounded like he was fishing for ideas, so I threw out a couple that I have run into in the past.

Bob

Didi 20 years ago

I would also bet on cache handling problems in the code. A forgotten flush of the i-cache is something I have had to chase in my early versions. There is one more possibility I know of. If the processor is a 405, check its errata sheet. I recently discovered (while considering the device; I opted not to use it) a late-published erratum saying, basically, that you may not use its cache in copyback mode because it does not work. Use write-through....
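For the forgotten i-cache flush, the usual pattern is to synchronize the caches whenever instructions have been written to memory as data (copying code to RAM, loading a module, patching). A minimal sketch using the GCC/Clang builtin `__builtin___clear_cache`; `install_code` is a made-up name, and whether the builtin does the right thing on a particular embedded toolchain is an assumption that needs checking. On PowerPC the underlying operation is typically a dcbst/sync/icbi/isync sequence over the affected lines.

```c
#include <string.h>
#include <stddef.h>

/* Copy instruction bytes into dst and make them safe to execute.
 * Without the cache synchronization, the CPU may fetch stale contents
 * from the i-cache, or the new bytes may still sit in the d-cache. */
void *install_code(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);

    /* Compiler builtin (GCC and Clang): synchronizes the data and
     * instruction caches for the given address range. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);

    return dst;
}
```

Whether a stale fetch actually occurs depends on which lines happen to be cached, which is one way a bug of this kind comes and goes as the image layout shifts.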

Dimiter

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

------------------------------------------------------

mrfirmware 20 years ago

We have never used write-through on the 405GPr, and it has had nary a problem with copy-back for at least the past 4 years of the product life (thousands of blade servers). Do you have an erratum number or a document I could look at with regard to this cache bug? If you are referring to CPU_213, you need only set CCR0 as specified. Setting write-through mode is simply too big a hammer (for us).

- Mark

Jim Stewart 20 years ago

Reading your post, it's not clear how many different physical units you've tried this on. If the answer is one, the problem could be a bad byte with a bad bit of flash memory.
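A bad flash bit can be checked for by reading the programmed image back and comparing it with the file that was built. The sketch below assumes both are available as byte buffers; `first_mismatch` is an illustrative name.

```c
#include <stdint.h>
#include <stddef.h>

/* Compare a reference image with the bytes read back from flash.
 * Returns the offset of the first differing byte, or len if the buffers
 * match. If bad_bits is not NULL it receives the XOR of the two bytes,
 * i.e. a mask of the bits that differ (0 when the buffers match). */
size_t first_mismatch(const uint8_t *ref, const uint8_t *readback,
                      size_t len, uint8_t *bad_bits)
{
    for (size_t i = 0; i < len; i++) {
        if (ref[i] != readback[i]) {
            if (bad_bits != NULL) {
                *bad_bits = (uint8_t)(ref[i] ^ readback[i]);
            }
            return i;
        }
    }
    if (bad_bits != NULL) {
        *bad_bits = 0u;
    }
    return len;
}
```

A stuck bit would tend to show up at the same offset and with the same mask across different images, whenever the image happens to need the opposite value there.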

Didi 20 years ago

This is what I was referring to, apparently you have it under control. It was enough to stop me from using the 405 (I opted for the 5200).

Dimiter

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

------------------------------------------------------

eon_blue_80 20 years ago

Thank you everyone for all of your suggestions. These suggestions will be a great help when troubleshooting future problems.

As far as the original problem goes, using I/O probing we were able to successfully narrow the error down to a relatively large segment of the BSP. Apparently there is a problem in the SCSI section of the BSP (wild pointer or out of order type operation??) that causes the image to hang when the bytes of the image are aligned in just the right way. We have made a decision to disable SCSI support within the OS (which has corrected the problem). Hopefully, if time ever becomes available, we can look into the SCSI section of the BSP; and find the exact bug.
