Same code, same data, different results

- Tim Wescott

Wed, Oct 7, 2015 12:35 AM

This is about code that clings to "embedded" by its fingernails -- it's running on a fast PC-compatible single-board computer, under Windows, as a DLL. So it's not exactly some little thing shoehorned into 4kB of flash.

At any rate:

I have a rather complicated algorithm that I've coded up, to do marvelous stuff for my customer. It recently grew quite a bit, and in the process I've introduced some subtle bugs. I'm looking for ideas on things to look for to see if I can figure out what's going on.

Here's the deal:

First, some time this spring I got a shiny new machine, and went ahead and loaded 64-bit Linux onto it, with all its 64-bit appurtenances. This did not, at the time, cause problems.

I coded up a bunch of changes, tested it on my 64-bit machine, and happily shipped it off to my customer -- who reported that it broke, horribly.

Oh drat. On top of this, at some point the MinGW stream library broke, so my test code no longer worked under Wine -- I could only test with the Linux version.

After much trial and tribulation, I managed to get Linux 32 and 64-bit versions, and Windows 32-bit versions all working. I tracked down my problems (size_t and unsigned int are not the same size in gcc 64 bit for Linux), fixed them, and shipped.

So now I'm getting four different results from three different software loads and two different circumstances. I can't go into detail, but I'm going to give a general story 'cause I'm looking for general things to look for:

Under Linux 32-bit I get behavior A (correct operation)

Under Linux 64-bit I get behavior B (correct operation, just different)

Under Wine running a 32-bit Windows program I get behavior B

My customer calls my DLL from LabVIEW. Nine times out of ten he gets some correct behavior -- he's not sophisticated enough that I can know whether it's A, B or something else. The tenth time the thing fails to work correctly.

So, I suspect that I've got some uninitialized memory someplace. But, I'm running the Linux versions under Valgrind and it's not finding any problems (Valgrind is great, by the way -- great enough that for my embedded ARM stuff I do unit testing under Linux and Valgrind).

I'm going through the code with a fine-toothed comb, and so far I've only found a few very minor problems that border on the stylistic, although one of the changes that I made did improve things a bit.

So -- other than picking through the code line by line, can you guys suggest anything that I can do or look for in specific?

Also, does anyone know of a Linux tool that'll randomly populate the heap with junk then call a program? I suspect that I'm not seeing the "sometimes it is, sometimes not" behavior that my customer is because of the different environment, not because Linux is magically fixing my bugs. Suggestions on how to make the bugs apparent would be helpful.

Thanks for reading, suggestions welcome -- I'm becoming a candidate for a rubber room over this one.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com

- Rob Gaddi

Wed, Oct 7, 2015 1:36 AM

Without getting into the A/B specifics, is the difference something that could be chalked up to floating point error?

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com 
Email address domain is currently out of order.  See above to fix.

- Dimiter_Popoff

Wed, Oct 7, 2015 1:54 AM

Hi Tim, I do not think you can get real help on this from anything but banging your head against the rubber walls for as long as it takes. I am in a similar state at the moment: I have been chasing for hours WHY the same code, after some modifications within the loop, produces one line too many at times. It is way simpler than yours, it only has to list a page of parameters, but it has yet to work so I can go to sleep. Hopefully knowing you are not alone banging your head against the walls is some consolation, as help is not on the horizon... :D.

Dimiter

- Paul Rubin

Wed, Oct 7, 2015 2:35 AM

Have you run the code with undefined behaviour and address sanitizers turned on?

- Clifford Heath

Wed, Oct 7, 2015 2:59 AM

When chasing really difficult "Heisenbugs" like this that involve data structures, I've sometimes had joy by adding invariant-checking code - it can be as costly and slow as you like - to verify that everything which should add up at any point in time actually does. Sort of like an "fsck" for the data structure.

Every such bug has a symptom, and that symptom is associated with some specific precursor. If you can test for that precursor condition before the completion of the algorithm, you have a tool to narrow down the source of the problem.

Add a few calls to the invariant-checker in assorted places and run the code until you start to narrow down where it's failing.
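A minimal sketch of such an invariant checker, assuming a made-up ring buffer that keeps a running sum (the structure, `ring_check` and `ring_push` are hypothetical names for illustration, not anything from Tim's code). The checker recomputes everything the slow way and compares it against the incrementally maintained state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical structure for illustration: a fixed-capacity ring buffer. */
#define RING_CAPACITY 8

struct ring {
    int    data[RING_CAPACITY];
    size_t head;   /* index of the oldest element */
    size_t count;  /* number of stored elements */
    long   sum;    /* running sum of stored elements, updated incrementally */
};

/* Slow but thorough "fsck": recompute from scratch and compare. */
bool ring_check(const struct ring *r)
{
    if (r->head >= RING_CAPACITY || r->count > RING_CAPACITY) {
        return false;
    }
    long sum = 0;
    for (size_t i = 0; i < r->count; i++) {
        sum += r->data[(r->head + i) % RING_CAPACITY];
    }
    return sum == r->sum;
}

/* Append a value, dropping the oldest one when the buffer is full.
 * The invariant is checked on entry and on exit, so a failing assert
 * brackets the code that corrupted the structure. */
void ring_push(struct ring *r, int value)
{
    assert(ring_check(r));
    if (r->count == RING_CAPACITY) {
        r->sum -= r->data[r->head];
        r->head = (r->head + 1) % RING_CAPACITY;
        r->count--;
    }
    r->data[(r->head + r->count) % RING_CAPACITY] = value;
    r->count++;
    r->sum += value;
    assert(ring_check(r));
}
```

Building with `NDEBUG` defined removes the `assert` calls, so the checks cost nothing in a release build.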

Clifford Heath.

- Roberto Waltman

Wed, Oct 7, 2015 3:14 AM

Let Gimpel's PC-Lint do that for you.

formatting link

R.W.

- Tim Wescott

Wed, Oct 7, 2015 3:19 AM

Between A and B, yes. In fact, it was tweaks to some floating point calculations to make them more kosher that caused the change in the Windows version.

However, the customer's one out of ten problem is, I'm pretty sure, different -- first, because it's a failure and not just a little difference, and second, he's running the same file through all the time, and occasionally it's spitting up. I don't know what could cause that in my code other than using an uninitialized variable.

It may possibly be a bug on his side, but I don't want to start pointing at his side of things unless I'm pretty certain of mine.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com

- Tim Wescott

Wed, Oct 7, 2015 3:20 AM

The only such sanitizer I know of is valgrind. Do you have other tools to suggest?

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com

- Paul Rubin

Wed, Oct 7, 2015 3:22 AM

The most obvious debugging suggestion once you're digging into it is to log intermediate results in the algorithm in all environments, run with the same inputs, and compare the logs to see where the intermediate results diverge. There are even some automated tools to bisect for the behaviour: see some of the links at

formatting link

The old "Ask Igor" site was amazing and now you can download the code that ran it (linked from the article above).

Interesting question. Nothing immediately comes to mind but I'm not up to date with this stuff. Couldn't you just patch the program to do that, e.g. wrap malloc etc.? Also turn on all the stack guard and other sanitization options in GCC and maybe also try Clang (which has some different sanitization features iirc).
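A minimal sketch of the "wrap malloc" idea, assuming the code under test can be pointed at a replacement allocator in debug builds. The name `junk_malloc` and the fill pattern are made up for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* Arbitrary non-zero fill pattern (an assumption; any value that is
 * unlikely to look like valid data will do). */
#define JUNK_BYTE 0xA5

/* Hypothetical debug allocator: behaves like malloc, but hands back
 * memory filled with junk instead of whatever happened to be there,
 * so code that relies on "fresh memory is zero" fails every time
 * rather than one run in ten. */
void *junk_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL) {
        memset(p, JUNK_BYTE, size);
    }
    return p;
}
```

On glibc it may also be worth looking at the `MALLOC_PERTURB_` environment variable, which is documented as filling allocated and freed blocks with a byte pattern without any code changes; check the malloc documentation for your version before relying on it.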

Generally this sounds like run of the mill debugging. It's time consuming but don't get discouraged. The basic method is run the program under GDB until you can see that it's going wrong, then reason backwards to figure out what made it go wrong.

- Paul Rubin

Wed, Oct 7, 2015 3:27 AM

I mean compiler flags like -fsanitize=address. Clang has some other ones like bounds checking, but I've only used GCC.

You could look at Frama-C (frama-c.com) which is sort of a lint on steroids. I've never tried it myself but have been wanting to look into it.

- rickman

Wed, Oct 7, 2015 3:31 AM

Looks like Linux 32 is your gold standard. Rather than try to debug this by looking hard at one use case by standard test methods, is it feasible for your program to print to a file info from internal points of the program so you can compare the output across the different platforms? This may help you narrow down the section of code that is failing.

If the differences pop up in different areas randomly, that will tell you perhaps that it is not a coding error per se, but rather a problem between the program and its environment.
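A minimal sketch of such checkpoint logging, assuming a C99 `printf` that supports `%a` (older Windows C runtimes may not). The helper names and the line format are hypothetical. Printing doubles with `%a` records the exact bits in hexadecimal, so two logs can be compared with a plain `diff` without rounding in the text hiding a difference:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format one checkpoint as "tag step value".
 * "%a" prints the exact binary value of the double in hex notation. */
int trace_format(char *buf, size_t size,
                 const char *tag, unsigned step, double value)
{
    return snprintf(buf, size, "%s %u %a", tag, step, value);
}

/* Hypothetical helper: append one checkpoint line to an open log file.
 * Does nothing if logging is disabled (log == NULL). */
void trace_double(FILE *log, const char *tag, unsigned step, double value)
{
    char line[128];
    if (log != NULL && trace_format(line, sizeof line, tag, step, value) > 0) {
        fprintf(log, "%s\n", line);
    }
}
```

Sprinkling calls such as `trace_double(log, "filter_out", i, y)` through the algorithm and diffing the Linux 32, Linux 64 and Windows logs should show the first checkpoint at which the runs part ways.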

--

Rick

- upsidedown

Wed, Oct 7, 2015 4:28 AM

Exactly what kind of processor is the target using, and from which manufacturer? There might be some minor differences, e.g. in IEEE floating point, such as the handling of non-normalized values.

Does an interrupt sometimes occur during your code and sometimes not? Any bugs in the interrupt processing (either HW or SW) would cause such problems.

- Les Cargill

Wed, Oct 7, 2015 4:32 AM

That's a big red flag, unless you know what causes this. Now, it easily could be some interaction between WINE and MinGW but the stream library ( I presume you mean the stuff in file.h? stdio.h? ) should *always* work. If you mean C++ >> stuff, then that should pretty much always work, too.

Can you mock out your .dll and see if it still breaks in WINE?

Really fishy. How hard would it be to change *every* int in the whole shebang to an int32_t or uint32_t from stdint.h?

This should not matter because ints are all 4 bytes in gcc unless they aren't. But the story about size_t makes me think....
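One cheap way to find out whether they aren't is to have every build report its own type widths. A minimal sketch, with `abi_report` as a hypothetical helper name; the values are cast to `unsigned` so the format string does not itself depend on the width of `size_t`:

```c
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: describe this build's basic type widths in bits.
 * Returns what snprintf returns (number of characters, or negative on error). */
int abi_report(char *buf, size_t size)
{
    return snprintf(buf, size,
                    "int=%u long=%u size_t=%u pointer=%u",
                    (unsigned)(sizeof(int) * CHAR_BIT),
                    (unsigned)(sizeof(long) * CHAR_BIT),
                    (unsigned)(sizeof(size_t) * CHAR_BIT),
                    (unsigned)(sizeof(void *) * CHAR_BIT));
}
```

Writing this line at the top of a log from each of the Linux 32, Linux 64 and Windows builds makes it obvious which assumptions differ between them.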

Verifying that everything is initialized by hand shouldn't be that horrible.
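One pattern that can make the by-hand check easier is a single init function per structure plus a marker that every entry point tests. This is only a sketch under assumptions: `filter_state`, its fields and the magic value are invented for illustration and have nothing to do with the real algorithm.

```c
#include <assert.h>

#define STATE_MAGIC 0x5AFEC0DEu  /* arbitrary marker value (assumption) */
#define HISTORY_LEN 4

/* Hypothetical algorithm state. */
struct filter_state {
    unsigned magic;                 /* set only by filter_state_init */
    double   gain;
    double   history[HISTORY_LEN];  /* most recent input first */
};

/* The one place where the state is set up. Assigning a zeroed compound
 * literal clears every member, including ones added later. */
void filter_state_init(struct filter_state *s, double gain)
{
    *s = (struct filter_state){0};
    s->gain  = gain;
    s->magic = STATE_MAGIC;
}

/* Gain times the mean of the current input and the previous
 * HISTORY_LEN - 1 inputs. The assert trips if the caller forgot to
 * call filter_state_init (unless stale memory happens to hold the
 * magic value, which is unlikely but not impossible). */
double filter_step(struct filter_state *s, double x)
{
    assert(s->magic == STATE_MAGIC);
    double sum = x;
    for (int i = HISTORY_LEN - 1; i > 0; i--) {
        s->history[i] = s->history[i - 1];
        sum += s->history[i];
    }
    s->history[0] = x;
    return s->gain * sum / HISTORY_LEN;
}
```

The same check is worth putting on every function the DLL exports, since the caller's side of the interface is the part that cannot be inspected from here.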

This ain't a line by line - this is a Heisenbug. I'd seriously consider mocking the whole thing out, routine by routine - but you can't tell if behavior B is the precursor to failure or not. And to really get it to fail, your customer is in the loop.

Can you get a test vector from the customer?

??? First line of main() is "char *p = malloc()"?

--
Les Cargill

- John Temples

Wed, Oct 7, 2015 4:38 AM

If I'm understanding correctly, your customer is using your code with LabVIEW with Windows, yet you're not testing your code under Windows or with LabVIEW? I can understand there might be difficulty testing with LabVIEW (e.g., expensive hardware involved), but do you have reason to believe that testing with Wine is the same as testing with Windows? Is there some reason you can't test it under Windows using the same version as the customer?

- David Brown

Wed, Oct 7, 2015 8:15 AM

Floating point can be difficult to get /exactly/ the same between different systems. In particular, even on the same x86 cpu, you can get slightly different results (but still within IEEE specs) if the calculations are done in the SSE registers, with SIMD instructions, or with x87 floating point. Your two options are to make sure you use the same compiler, the same flags, the same code, and avoid passing floating point data directly between functions, or to accept that floating point results are not bit-for-bit repeatable. (The issue with passing floats and doubles into and out of functions is that Windows and Linux use different calling conventions, as do 32-bit and 64-bit.)
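If the second option is taken, test code has to compare results with a tolerance rather than bit for bit. A minimal sketch, with `nearly_equal` and `float_eval_method` as hypothetical helper names and the tolerances left to the caller; `FLT_EVAL_METHOD` is the standard C99 macro from `<float.h>`:

```c
#include <float.h>
#include <stdbool.h>

/* Reports how this build evaluates floating-point expressions:
 * 0 = in the declared type, 2 = in long double (typical of x87 code),
 * -1 = indeterminable. Worth logging once per platform. */
int float_eval_method(void)
{
    return FLT_EVAL_METHOD;
}

static double abs_double(double x)
{
    return x < 0.0 ? -x : x;
}

/* Hypothetical helper: true if a and b agree to within rel_tol relative
 * to the larger magnitude, or to within abs_tol (useful near zero).
 * Returns false if either value is NaN. */
bool nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    double diff  = abs_double(a - b);
    double mag_a = abs_double(a);
    double mag_b = abs_double(b);
    double scale = mag_a > mag_b ? mag_a : mag_b;
    return diff <= abs_tol || diff <= rel_tol * scale;
}
```

A difference between behavior A and behavior B that passes this kind of check is probably rounding; one that fails it by orders of magnitude is more likely a real bug.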

- David Brown

Wed, Oct 7, 2015 8:25 AM

There are bigger differences than that. In particular, "long" is 64-bit in 64-bit Linux, but only 32-bit in 64-bit Windows. And size_t is going to be 64-bit in 64-bit Linux and Windows, but 32-bit in 32-bit Linux and Windows. And while size_t may happen to be the same size as unsigned int on some combinations, don't forget that it is not necessarily the same type.

Your best way forward here is to treat your programming as carefully as you would for embedded programming. Never make any assumptions about the relationships between types, other than as given by the law (the C or C++ standards, as appropriate). When you want something of a particular size, use the <stdint.h> types.
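A minimal sketch of what that can look like in C11, assuming the code stores counts in 32-bit fields. `count_to_u32` is a hypothetical helper; the point is that assumptions are either checked at compile time or turned into an explicit runtime error instead of a silent truncation:

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Compile-time checks: the build fails on a platform that breaks them. */
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
static_assert(sizeof(size_t) >= sizeof(uint32_t),
              "size_t is narrower than 32 bits");

/* Hypothetical helper: store a size_t count in a 32-bit field.
 * Returns 0 on success, -1 if the value does not fit. */
int count_to_u32(size_t count, uint32_t *out)
{
    if (count > UINT32_MAX) {
        return -1;
    }
    *out = (uint32_t)count;
    return 0;
}
```

On a 32-bit build the range check can never fail and costs next to nothing; on a 64-bit build it catches exactly the size_t versus unsigned int mismatch described earlier in the thread.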

The other big question here is what language(s) you are using, and what compiler(s) you are using. That would be helpful to know.

If you are using up-to-date gcc or clang, you have a variety of "sanitize" options that can help. AFAIK some of them only run on 64-bit Linux, since they use memory management tricks that require the wider address space and greater flexibility, but they should still be helpful.

- glen herrmannsfeldt

Wed, Oct 7, 2015 8:54 AM

Another one I remember from some time ago, I believe on Windows, is not initializing the x87 control register, such that rounding modes and precision are different between runs. (That is, whatever the previous program left.)

If you use a test system that detects all attempts to use memory that hasn't been given a value, it likely won't notice x87 registers.

-- glen

- Tim Wescott

Wed, Oct 7, 2015 8:27 PM

gcc 4.8.4 (it's what came with Ubuntu). "-fsanitize=address" and "-fsanitize=thread" seem to be the only sanitize options -- and it passes those.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com

- Tim Wescott

Wed, Oct 7, 2015 8:28 PM

I did not know about this feature -- thanks.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com

- David Brown

Thu, Oct 8, 2015 7:32 AM

I haven't heard of Frama-C before, but I've found the website, and it is now high up on my list of things to read as soon as I get the time. Thanks for the pointer.