The HV output wasn't required. The silliness was the effort to save a package of *gates* (or an AOI, etc.) -- at the expense of puzzling anyone who had to look at the circuit thereafter (Can you rattle off the decoding functions of any particular
7seg decoder, "off the top of your head"? e.g., does 6 have a tail? 9? how are the non-decimal codes handled? etc. By contrast, some *gates* wired together would be incredibly obvious to damn near anyone looking at the schematic!)
If you are looking at the general principle, rather than this particular example (since the two versions don't do the same thing), then the answer is ... it depends :-)
These sorts of things often depend on the code before and after the tests, the number of tests, the hardware, the compiler, and the optimisation flags. Testing the sign of a floating point number is just a single bit test - in theory, the compiler could do this by loading a and b (or the appropriate byte from them) into integer registers, xor'ing it, then doing a test for that single bit (assuming normal floating point, rather than NaNs, etc.).
Code before or after this section may affect the speed - perhaps the compiler already knows something about a and b, or can make use of parts of the calculation at a later stage. Perhaps /you/ know something, such as that a is likely to be non-negative - and by telling the compiler about it in the conditional, you'll get the likeliest path executed fastest.
Some processors are good at handling conditional jumps (or conditional instructions) smoothly, and have no problem with a series of tests. Others are better with a single multiply and a single jump.
Different compilers may handle the ordering differently. One thing to be sure of is to use good optimisation (typically -Os or -O2), make sure you target the actual hardware (-march=native, if that's not the default), and use "-ffast-math" to tell gcc not to be painfully pedantic about IEEE rules for re-arranging and optimising floating point.
I would not have been a bit surprised if gcc did in fact figure this out, but it appears it does not. In any case, on a *86 machine, the time is entirely dominated by the time it takes to load the two cache lines. Whether logical operations on the sign bits, arithmetic comparisons, or the floating point multiplier is used to compute the result, all three will spend most of their time waiting for memory loads.
Try compiling the following file using "gcc -O6 -S foo.c":
The first thing we do when we find someone has that attitude is to revert their clever codez, and the second one is to fire them. So in this case, your "job security" has a negative value.
Your memory is far better than mine! That was the *one* instance where I used a TTL 7segment decoder in my career. The only other times were CMOS devices (e.g., 14543, 4511, etc.) driving non-muxed LCD or PGD displays.
Any other "display interfaces" had the decoding done in software lookup tables (i.e., direct control of each segment so you could make "special characters").
Just as a sidenote, GCC's optimization levels top out at -O3. The generated code will also depend on the used -march, -mtune, -mfpmath and other options.
The 7447 was the first IC I ever bought, with a single digit display. It absorbed my pocket money saved for a month, so you can imagine my dismay when I found I'd bought a common-cathode display that wouldn't work with the 7447 - and the supplier wouldn't take either of them back. That was after fretting for a while about how to get 5V, when batteries only came in 1.5V increments. I really needed a mentor or a book.
I never did end up building the digital dice (die?) circuit I had designed, but I did design and build a lot of other stuff.
Still annoyed about this one, because it just happened to me (last night) and I sank quite some time into it: I initially (and wrongly) thought I had done something stupid to my code while changing it.
MIPS gcc cross-compiler, targeting an M4K core, -Os in effect.
Made a change to a routine (called from lots of places) which _reduced_ the amount of code in the routine. End result: final binary _grew_ from
3520 bytes to 4844 bytes and hence smashed through the 4K SRAM limit available for the code to execute in.
Even though -Os was in use, it appears gcc decided to inline my new smaller routine; I guess the final binary size wasn't considered by gcc for some reason.
-fno-inline fixed the problem. (As did -O1 :-))
This was with the MIPS sourced version of the compiler which is quite old by now (gcc 4.4.6 IIRC) so I don't know if it's a problem on current gcc versions.
The message for Tim is that you need to see the generated code in the final binary at the optimisation level you are _actually_ using before you can decide which one is best. (And don't assume higher optimisation level equals "better". :-))
Simon.
--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world
That kind of "nonsense" is one reason why the best /general purpose/ language/environment is Java. HotSpot optimises what is /actually/ executing, as opposed to what the compiler can /safely guess/ /might/ be executing. (GP = in the absence of other constraints such as might be found in embedded systems.)
HP's Dynamo from the 1990s (a dynamic binary optimiser, not a compiler as such) is a remarkable demonstration of that kind of thing:
- take a processor X, and emulate that processor on that processor
- take optimised CPU-bound C code (Oc) running in the emulator, to discover what code is actually executing
- change the binary to optimise what is actually executing (Oe)
- measure performance of the original code (Oc) running natively on X => Px
- measure performance of the changed binary (Oe) running in the emulator of X running on processor X => Pe
- result: Px ≈ Pe
So, statically compiled C gives up as much performance as emulating a processor costs! Quite remarkable, and one reason why Java isn't as slow as naive oldies assume.
Or use __attribute__((noinline)) on the function in question.
gcc uses a number of heuristics and hints to decide whether or not a function should be inlined. These include optimisation options (such as
-Os, -O2 or -O3), knowledge about the code usage (a single-use static function is almost always inlined with modern gcc), and estimations of how the total code size will change by inlining the code. It is often not clear whether inlining will cause code to grow or shrink, since inlining eliminates function call overhead (which can be large in some cases), and can allow for many other optimisations (keeping data in registers before and after the call, constant propagation, etc.).
gcc usually does a reasonable job, but certainly not always - it sometimes gets things wrong due to bugs (suboptimal code generation is still correct code generation, and so such issues are harder to spot and lower priority for fixing), limited functionality (such as overly simplistic models of code size), or simply because the balancing heuristics about when a function is "too big to inline" or "small enough to always inline, even with -Os" don't match your particular requirements.
More recent versions of gcc generally have better tuning (and fewer bugs) than older ones, but it can never be right for everyone. You can fiddle with many of the heuristic parameters manually, but usually the best method is a few "noinline" or "always_inline" (or even "flatten") attributes in critical functions.
Look at it this way - it is one of the many little quirks that keep our jobs from getting boring!
Many modern C/C++ compilers (e.g. GCC, Visual Studio, Clang, ICC) have similar features, usually called "profile guided optimization". Having to generate and export the execution profile, however, means it's not usually an option for smaller embedded systems.
There is an additional problem with profile-guided optimisation. It is based on the idea that you measure how often functions (and to a lesser extent, loops) are executed - and thus where the optimisation effort should be concentrated. This plays badly with modern compiler optimisation techniques such as inlining, function cloning, partial inlining, hot/cold partitioning, and link-time optimisation, all of which blur the concept of separate functions and the match between source-code functions and object-code functions.
If I had to process these values in bulk, and didn't have to care for things like NaNs, I'd probably try to play some tricks with (integer) SIMD instructions.
Am 05.08.2015 um 19:31 schrieb snipped-for-privacy@gmail.com:
That's easy to believe, but still wrong. Benchmarking a micro-optimization like that, particularly if done in any context other than the actual project, is clearly pointless. _Any_ difference between the benchmark platform and the actual one will render the benchmark result useless. A 10% difference like that can easily be inverted by just about any unrelated change to the surrounding source code, or any change to the tools and their settings.
The old truth still holds: premature optimization _is_ the root of all evil.
If you enable the optimiser, neither of these two executes the loop. Since a and b don't change inside the loop, j will always be either unchanged or incremented by 10000000, and the compiler figures that out and removes the loop.