Same code, same data, different results

It should not be hard to install a newer gcc and get access to a lot more -fsanitize options. I haven't tried them myself, but they might help you out.

David Brown

Coming rather late to this, but occasional random FP errors were occurring in a system I wrote. It turned out that the standard ISR preamble/postamble did not save the FP state properly, which was normally fine, as I tended to avoid FP in ISRs (for reasons which are now probably considered prehistoric).

Bill Davy

Tim said he was using MinGW (GCC) and I don't recall that being an issue in either MinGW or Cygwin's GCC.

It definitely was a problem in Microsoft's compilers, but I recall it only in 16-bit versions. All the 32-bit and later compilers do initialize the x87. However ...

... the x87 is initialized only if it is used. Since the Pentium 4 (SSE2), compilers have defaulted to using SIMD for most floating point; the x87 isn't used unless you specifically enable it, e.g., to get extended-precision transcendentals. That can cause problems if the program does not use the x87 but calls libraries which do.

This may have nothing whatsoever to do with Tim's problem, but it's a good practice always to initialize the x87 even if you don't plan on using it, because libraries are allowed to assume that the program has done so.

Or any other registers.
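A portable approximation of "initialize the FP state yourself" in C++ is to reset the floating-point environment at each library entry point with std::fesetenv(FE_DFL_ENV) from <cfenv>. This is a sketch under assumptions, not an exact equivalent of the x87 fninit instruction: what FE_DFL_ENV contains is defined by the C runtime, and the entry-point name dll_entry_update is hypothetical.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical library entry point; the name is illustrative only.
// Resets the floating-point environment (rounding mode, exception flags,
// and on x86 runtimes typically the x87 control word and SSE MXCSR) to the
// C runtime's defaults before doing any floating-point work.
double dll_entry_update(double x)
{
    std::fesetenv(FE_DFL_ENV);
    return x * 0.5;
}
```

Whatever the caller left in the rounding mode, the function runs with the runtime's default (round-to-nearest).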

George

George Neuner

Memory alignment? #pragma pack in a third-party lib? I vaguely recall memory page boundaries in shared DLLs and padding structs accordingly, but that may not apply here.
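To illustrate the packing concern: if one module is compiled with #pragma pack(1) in effect and the other is not, the two disagree about the size and field offsets of the same struct. A minimal sketch with hypothetical struct names; both gcc and MSVC accept #pragma pack(push, n) / #pragma pack(pop).

```cpp
#include <cassert>
#include <cstddef>

// Layout as seen by a module compiled with default alignment.
struct SampleNatural {
    char   flag;
    double value;   // padded: typically starts at offset 4 or 8
};

// The same fields as seen by a module with #pragma pack(1) in effect.
#pragma pack(push, 1)
struct SamplePacked {
    char   flag;
    double value;   // starts at offset 1, no padding
};
#pragma pack(pop)
```

If each side of a DLL boundary uses a different one of these definitions, value is read from the wrong offset. A static_assert on sizeof/offsetof in the shared header is one cheap way to catch such a mismatch at compile time.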

Cheers

Martin Riddle

That reminds me of one possible issue with DLLs in Windows. gcc generates code that keeps the stack aligned on 16-byte boundaries, to allow better cache line usage and faster SIMD instructions. But Windows and MS compilers use 4-byte stack alignment on 32-bit Windows. There is no problem within a MinGW-compiled program, since the startup code sorts out the stack alignment. But if your functions are called from somewhere else, as exported DLL functions or as callbacks from Windows, then the stack alignment may be bad.

The way to fix this is to apply the gcc function attribute "force_align_arg_pointer" to functions that could be called from outside. Alternatively, you can use the "-mstackrealign" compiler flag to make all functions properly align the stack if they need it (the 16-byte alignment is only actually necessary for some SIMD instructions, but these could be generated for code that moves a lot of data around).
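A minimal sketch of the attribute approach for a MinGW-built DLL. The exported function sum_samples is hypothetical, and the macros only apply the Windows- and gcc/x86-specific pieces where they exist, so the same source also builds elsewhere.

```cpp
#include <cassert>

#if defined(_WIN32)
#define DLL_EXPORT extern "C" __declspec(dllexport)
#else
#define DLL_EXPORT extern "C"
#endif

// gcc x86 attribute: realign the stack on entry, because callers built with
// other compilers (or Windows itself, for callbacks) may only guarantee
// 4-byte alignment on 32-bit Windows.
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define REALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define REALIGN_STACK
#endif

// Hypothetical exported function, called from code not built by MinGW.
DLL_EXPORT REALIGN_STACK double sum_samples(const double *x, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += x[i];
    }
    return total;
}
```

Only the entry points need the attribute; functions they call internally inherit a properly aligned stack. Building everything with -mstackrealign is the blunter alternative.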

Stack misalignments are more likely to cause a crash than other incorrect behaviour, but perhaps exception handling or other error trapping is hiding the real issue.

There are also gcc options for controlling the details of floating point, which may have different default settings on Windows and Linux or in 32-bit and 64-bit modes, leading to marginal differences in some calculation results.
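One quick way to check which evaluation mode a given build actually uses is to log FLT_EVAL_METHOD from <cfloat>: 0 means operations are evaluated in their own type (typical of SSE2 code and the x86-64 default), while 2 means intermediates are kept in long double (typical of 32-bit x87 code). On 32-bit x86, gcc's -mfpmath=sse together with -msse2 is one of the options that changes it. A sketch, with a hypothetical function name, for comparing the 32-bit and 64-bit builds:

```cpp
#include <cassert>
#include <cfloat>
#include <string>

// Describes how this build evaluates floating-point intermediates.
std::string eval_method_description()
{
    switch (FLT_EVAL_METHOD) {
    case 0:  return "intermediates in the operand type (e.g. SSE2)";
    case 1:  return "float intermediates promoted to double";
    case 2:  return "intermediates in long double (e.g. x87)";
    default: return "indeterminate or implementation-defined";
    }
}
```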

David Brown

For anyone following this saga, I have resorted to weirdness: I've overloaded new to pack the allocated space with random data before the constructor gets to it.

I haven't found a problem so far in 12000 runs, each with a different random number generator seed. So either it's a Windows thing that Wine does not replicate, or it's in my customer's code.

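#include <cstdlib>
#include <new>

// Replacement global operator new: fills each allocation with pseudo-random
// junk so that any read of uninitialized heap data changes the results.
void * operator new (std::size_t size) {   // no throw() spec: it may throw bad_alloc
    void * p = std::malloc(size ? size : 1);   // new of 0 bytes must still return a unique pointer
    if (p == 0) { throw std::bad_alloc(); }
    for (std::size_t n = 0; n < size; ++n) { static_cast<unsigned char *>(p)[n] = std::rand(); }
    return p; }

// Matching operator delete, since the memory now comes from malloc().
void operator delete (void * p) noexcept { std::free(p); }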

--
Tim Wescott 
Wescott Design Services 
Tim Wescott

I agree with the people who suggested you duplicate the customer's environment as much as possible, so you have a better chance of reproducing the problem. Have you figured out what causes the discrepancy between the 32-bit and 64-bit versions on your own system? If it's significant and you're not doing something numerically unstable, that seems worth chasing down. I think that if the problem were uninitialized data, valgrind memcheck should have found it. How much code are you talking about?

Paul Rubin

On 09.10.2015 at 19:25, Tim Wescott wrote:

That does raise one question: just how sure are you that the effect is even happening on the heap, and not the stack?
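One way to test the stack hypothesis in the same spirit as the randomized operator new is to "paint" the stack with junk just before calling into the code under test, so that an uninitialized local variable picks up garbage instead of whatever happened to be left there. A minimal sketch; paint_stack and its 64 KiB size are hypothetical choices, and it relies on the next call reusing the same stack region, which is typical but not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

// Fill up to 64 KiB of stack with pseudo-random bytes, then return so the
// next function called reuses (most of) the same region. 'volatile' stops
// the compiler from discarding the stores; noinline keeps it a real call.
// Returns the number of bytes actually painted.
NOINLINE std::size_t paint_stack(std::size_t bytes = 64 * 1024)
{
    volatile unsigned char junk[64 * 1024];
    if (bytes > sizeof junk) {
        bytes = sizeof junk;
    }
    for (std::size_t i = 0; i < bytes; ++i) {
        junk[i] = static_cast<unsigned char>(std::rand());
    }
    return bytes;
}
```

Usage would be to call paint_stack() immediately before each call into the routine under test, with a different srand() seed per run, just as with the heap version.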

Hans-Bernhard Bröker

Does it crash every 10th time you run it with the same parameters, or what?

Anyway, if the fatal problem occurs on some Windows machine (desktop or embedded Windows?), why do you insist on using Linux, or a Windows emulator on Linux, to try to figure out what is wrong in a Windows system?

If you can't get exactly the same configuration, at least use some native Windows version on your own test machine.

Using different versions of MS compilers can get you into trouble. If the .EXE and .DLL are compiled with different versions of the compiler, you may encounter problems, such as when allocating dynamic memory in the .exe and freeing it in the .dll.

You should find out which compiler (and version) LabView was compiled with, and preferably use the same compiler (with the same version and settings) for compiling your DLL. We have had lots of problems due to differing compiler versions and settings.

Look carefully at which compiler settings LabView suggests for your DLL. Make sure you use the same LabView version as your customer.

Are you using DllMain to attach to the process (and to threads, if you are using multithreading)?

If it is multithreaded, are all the libraries thread-safe? Some standard C functions are not thread-safe and require special caution if used in a multithreaded environment.
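A common example is rand(): it keeps hidden global state, and the C standard does not require it to avoid data races. One sketch of a replacement, assuming C++11, gives each thread its own generator; the name thread_rand and the fixed seed are hypothetical choices (a fixed seed keeps runs reproducible, but every thread then sees the same sequence).

```cpp
#include <cassert>
#include <random>

// Each thread gets its own generator object, so concurrent callers never
// share (and race on) generator state the way they can with rand().
int thread_rand(int lo, int hi)
{
    thread_local std::mt19937 gen{12345u};
    std::uniform_int_distribution<int> dist(lo, hi);
    return dist(gen);
}
```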

Another can of worms, when running multithreaded applications, is that one system is truly multicore and the other is not.

With multithreaded applications, the different scheduling algorithms in Windows and Linux can give different results if there are bugs, such as race conditions, in the application.

An unrelated device driver in the final target system could handle interrupts improperly (such as failing to save and restore some registers in interrupts), which will generate random problems. In the final target system, preferably disable all device drivers to check whether that affects the result of your code.

I don't think so. If the program reports different results depending on the time of day (or phase of the moon) with _exactly_ the same parameters and sequences, this should not happen.

After all, both the Windows and Linux virtual memory systems will create zeroed pages for the dynamic memory manager, so if a virtual memory page is delivered to the C dynamic memory manager, it will be zeroed whether malloc() or calloc() is called. Only if a block of memory in a process has first been released with free() and is then handed out again by malloc() may it contain random data, but I always use calloc() to get properly initialized dynamic memory areas in all cases.
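For comparison: calloc() guarantees zero-filled memory, whereas malloc() only happens to return zeros when the block comes straight from fresh pages. A small sketch using a hypothetical state struct:

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical per-instance state for a filter.
struct FilterState {
    int  count;
    long accumulator;
    int  history[8];
};

// calloc() zero-fills every byte, so all the integer fields start at 0.
// (All-bits-zero is guaranteed to mean 0 for integer types; for double and
// pointers it means 0.0 / null on common platforms, but the C standard
// does not promise that.)
FilterState *make_filter_state()
{
    return static_cast<FilterState *>(std::calloc(1, sizeof(FilterState)));
}
```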

While you may have identified some of the potential problems, there are dozens of alternative explanations to your problems.

upsidedown

Not to mention about half a dozen other possibilities (see my other post:-).

upsidedown

On average, yes.

Because I don't have any spare pots of money lying around with which to buy a new machine, Labview, and Windows.

It may come to that, but I'd rather avoid the expense.

We are not there, thankfully.

Thank you, no, I am not a Microsoft shop. If it comes to that I'll send them code for them to compile.

I have no clue. I am emailing a dll to my customer.

This may be a worthwhile path to pursue. My understanding is that they have my DLL running in its own thread, but what's happening outside of that is unknown.

Alternative explanations are good -- it'll help me figure out what's going on.

--
Tim Wescott 
Wescott Design Services 
Tim Wescott

You can get a Windows remote desktop (not a super powerful one) for 3 months for free here:


There are also tons of commercial providers who rent stuff like that for a few cents an hour. I guess that doesn't help with Labview, but can't you mock that out?

How does their code communicate with yours?

Are you anywhere near the customer physically, so you can debug at their site, and is that workable? If yes that may be the simplest.

Paul Rubin

If you're mixing MinGW-compiled code and MSVC-compiled code in the same process, one factor to consider is that the ABIs aren't quite the same. Alignment constraints can differ, and some versions of MinGW have an 80-bit "long double", whereas MSVC has a 64-bit "long double".
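A cheap defence is to keep long double out of any interface that crosses the module boundary and convert to double at the edge. The sketch below also reports what the current compiler means by long double: 64 mantissa bits for the x87 80-bit format, 53 when it is the same as double (other targets may use other formats). Both function names are hypothetical.

```cpp
#include <cassert>
#include <limits>

// Number of mantissa bits this compiler uses for long double.
int long_double_mantissa_bits()
{
    return std::numeric_limits<long double>::digits;
}

// Hypothetical exported function: plain double in and out, so MinGW and
// MSVC callers agree on size and format; any long double work stays inside.
extern "C" double scaled_sum(double a, double b, double scale)
{
    long double acc = static_cast<long double>(a) + b;
    return static_cast<double>(acc * scale);
}
```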

Also, don't try to transfer ownership of heap blocks between DLLs (or between DLLs and the EXE). Whichever module created a block with malloc/calloc/realloc is the only one which can safely call realloc() or free() on it. The reason is that each module can link to a different version of the MSVCRT DLL, each with a separate heap. Similar issues may exist for other types, e.g. FILE*.
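The usual pattern is for the DLL to export its own allocation and release functions, so memory always goes back to the heap it came from. A sketch with hypothetical names:

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical exported API. The caller never calls free() on the buffer
// itself; it hands the pointer back to the module that allocated it.
extern "C" double *dll_alloc_buffer(std::size_t count)
{
    return static_cast<double *>(std::calloc(count, sizeof(double)));
}

extern "C" void dll_free_buffer(double *buf)
{
    std::free(buf);   // same module, same CRT heap as the calloc() above
}
```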

If it was an uninitialised data bug in your code, I think valgrind would find it.

If you have numerically-unstable floating-point algorithms, try calling fesetround() explicitly (if that's available).
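A minimal sketch of doing that defensively: set the rounding mode on entry and restore the caller's mode on exit, so neither side depends on what the other left behind. The class name is hypothetical. Note that gcc also needs -frounding-math before its optimizer will respect run-time rounding-mode changes in general.

```cpp
#include <cassert>
#include <cfenv>

// Sets the rounding mode for the lifetime of the object and restores the
// previous mode afterwards, e.g. at the top of a DLL entry point.
class RoundingGuard {
public:
    explicit RoundingGuard(int mode = FE_TONEAREST)
        : saved_(std::fegetround()), ok_(std::fesetround(mode) == 0) {}
    ~RoundingGuard() { std::fesetround(saved_); }
    RoundingGuard(const RoundingGuard &) = delete;
    RoundingGuard &operator=(const RoundingGuard &) = delete;
    bool ok() const { return ok_; }   // false if the mode is not supported
private:
    int  saved_;   // declared first, so it is captured before fesetround()
    bool ok_;
};
```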

Nobody

That's version dependent. Up to 2005, MSVC supported the 80-bit extended type. Versions from 2005 onward still support using the x87 with appropriate architecture switches, but they do not support loading or storing extended values in memory.

George

George Neuner

I don't believe any of the 32-bit versions of MSVC ever supported 80-bit long doubles. AFAIK, all of the 16-bit versions did, but those are not relevant for linking with MinGW.
Robert Wessel

I tried to keep that part of the interface fairly simple -- they call an "update" function which returns a status, the status indicates when various bits of information are available, and there's various functions to fetch the information that's available.
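For readers trying to picture it, an interface of that general shape might look roughly like the sketch below. Every name, status bit, and behaviour here is hypothetical and not the actual DLL's API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical status bits returned by update().
enum : std::uint32_t {
    STATUS_NONE           = 0,
    STATUS_POSITION_READY = 1u << 0,
    STATUS_RATE_READY     = 1u << 1,   // needs at least two samples
};

namespace {
double g_last = 0.0;
double g_rate = 0.0;
int g_samples = 0;
}

// Feed one sample; the return value says which results can be fetched.
extern "C" std::uint32_t update(double sample)
{
    if (g_samples > 0) {
        g_rate = sample - g_last;
    }
    g_last = sample;
    ++g_samples;
    std::uint32_t status = STATUS_POSITION_READY;
    if (g_samples >= 2) {
        status |= STATUS_RATE_READY;
    }
    return status;
}

// Fetch functions return 0 on success, -1 if the value is not available yet.
extern "C" int get_position(double *out)
{
    if (g_samples < 1) return -1;
    *out = g_last;
    return 0;
}

extern "C" int get_rate(double *out)
{
    if (g_samples < 2) return -1;
    *out = g_rate;
    return 0;
}
```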

All of the guts are well encapsulated in the DLL, and the problem is happening in the guts -- not in the interface part.

14 hour flight and an international border. It's not totally out of reason, but buying a cheap Windows machine and Labview would be easier and cheaper.
--
Tim Wescott 
Wescott Design Services 
Tim Wescott

Just to throw in my two cents.

  1. Reentrancy of your code is certainly not an issue?

  2. I presume that the endianness of your machine and the customer's has been taken into account? (Just mentioning it because it's not exactly impossible.)
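On the endianness point: code that reads multi-byte values from raw bytes (files, packets, shared buffers) can sidestep host byte order by assembling the value explicitly. A minimal sketch with hypothetical function names:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Decode a 32-bit little-endian value from bytes; gives the same result on
// big- and little-endian hosts because it never reinterprets memory.
std::uint32_t read_le32(const unsigned char *p)
{
    return static_cast<std::uint32_t>(p[0])
         | static_cast<std::uint32_t>(p[1]) << 8
         | static_cast<std::uint32_t>(p[2]) << 16
         | static_cast<std::uint32_t>(p[3]) << 24;
}

// Reports the host's byte order, e.g. for a startup log line.
bool host_is_little_endian()
{
    const std::uint32_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}
```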

Regards, Evgeny.

Evgeny Filatov

In the sense that it should only ever be called from one thread, yes. In the sense that it is 100% reentrant -- I suspect not.

If that was an issue I would expect the problems to be immediate and horrible, not merely false results every once in a while.

--
Tim Wescott 
Wescott Design Services 
Tim Wescott

Yuggh.

One issue there is it's best if you can run the customer's entire application on your Windows computer: would they allow that? If not, maybe you can get remote access to their machine somehow. Labview is pretty expensive too, if you have to buy it. I wonder if there are cloud instances available.

Paul Rubin

...

The first-ever port of Unix to a big-endian architecture was done at the University of Melbourne. The bootstrap loaders of the time would print the name of the file to be loaded - "unix" then load it into memory and jump to the start.

The first (and only) thing that the Interdata 8/32 (big-endian) port said?

(... silence)

I think that "nuxi!" should be adopted as the correct expletive for anyone who encounters endian problems. :)

Clifford Heath.

Clifford Heath
