Slightly OT: speed of operation on a PC

The HV output wasn't required. The silliness was the effort to save a package of *gates* (or an AOI, etc.) -- at the expense of puzzling anyone who had to look at the circuit thereafter. (Can you rattle off the decoding functions of any particular 7seg decoder, "off the top of your head"? E.g., does 6 have a tail? 9? How are the non-decimal codes handled? Etc. By contrast, some *gates* wired together would be incredibly obvious to damn near anyone looking at the schematic!)
Reply to
Don Y

(snip, I wrote)

This one I still remember from years ago. The 7447 does not have tails, the 74247 does.

-- glen

Reply to
glen herrmannsfeldt

If you are looking at the general principle, rather than this particular example (since the two versions don't do the same thing), then the answer is ... it depends :-)

These sorts of things often depend on the code before and after the tests, the number of tests, the hardware, the compiler, and the optimisation flags. Testing the sign of a floating point number is just a single bit test - in theory, the compiler could do this by loading a and b (or the appropriate bytes from them) into integer registers, xor'ing them, then testing that single bit (assuming normal floating point, rather than NaNs, etc.).
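A rough sketch of that idea in C (untested, and assuming IEEE-754 doubles with no NaN handling; the function name is made up):

#include <stdint.h>
#include <string.h>

/* Nonzero if a and b have opposite sign bits.
   Assumes IEEE-754 doubles; ignores NaN and -0.0 subtleties. */
static int signs_differ(double a, double b)
{
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof ua);   /* bit-copy; avoids aliasing trouble */
    memcpy(&ub, &b, sizeof ub);
    return ((ua ^ ub) >> 63) != 0;  /* the sign bit is bit 63 */
}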

Code before or after this section may affect the speed - perhaps the compiler already knows something about a and b, or can make use of parts of the calculation at a later stage. Perhaps /you/ know something, such as that a is likely to be non-negative - and by telling the compiler about it in the conditional, you'll get the likeliest path executed fastest.

Some processors are good at handling conditional jumps (or conditional instructions) smoothly, and have no problem with a series of tests. Others are better with a single multiply and a single jump.

Different compilers may handle the ordering differently. One thing to be sure of: use good optimisation (typically -Os or -O2), make sure you target the actual hardware (-march=native, if that's not the default), and use "-ffast-math" to tell gcc not to be painfully pedantic about IEEE rules when re-arranging and optimising floating point.
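For example (file name made up), something like:

gcc -O2 -march=native -ffast-math -S test.c

and then read the generated test.s to see what the compiler actually did.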

Reply to
David Brown

I would not have been a bit surprised if gcc did in fact figure this out, but it appears it does not. In any case, on a *86 machine, the time is entirely dominated by the time it takes to load the two cache lines. Whether the result is computed with logical operations on the sign bits, arithmetic comparisons, or the floating point multiplier, all three will spend most of their time waiting for memory loads.

Try compiling the following file using "gcc -O6 -S foo.c":

int foo(double a, double b) { return (a >= 0.0 && b < 0.0); }

Reply to
Clifford Heath

The first thing we do when we find someone has that attitude is to revert their clever codez, and the second one is to fire them. So in this case, your "job security" has a negative value.

Reply to
Clifford Heath

Your memory is far better than mine! That was the *one* instance where I used a TTL 7segment decoder in my career. The only other times were with CMOS devices (e.g., 14543, 4511, etc.) driving non-muxed LCD or PGD displays.

Any other "display interfaces" had the decoding done in software lookup tables (i.e., direct control of each segment so you could make "special characters").

Reply to
Don Y

Just as a sidenote, GCC's optimization levels top out at -O3. The generated code will also depend on the -march, -mtune, -mfpmath and other options used.

-a

Reply to
Anders.Montonen

The 7447 was the first IC I ever bought, along with a single-digit display. It absorbed my pocket money saved for a month, so you can imagine my dismay when I found I'd bought a common-cathode display that wouldn't work with the 7447 - and the supplier wouldn't take either of them back. That was after fretting for a while about how to get 5V, when batteries only came in 1.5V increments. I really needed a mentor or a book.

I never did end up building the digital dice (die?) circuit I had designed, but I did design and build a lot of other stuff.

Reply to
Clifford Heath

You have no idea just how true that is. :-( :-(

Still annoyed about this one because it's just happened to me (last night), and I sank quite some time into it because I initially (and wrongly) thought I had done something stupid to my code while changing it.

MIPS gcc cross-compiler, targeting an M4K core, -Os in effect.

Made a change to a routine (called from lots of places) which _reduced_ the amount of code in the routine. End result: the final binary _grew_ from 3520 bytes to 4844 bytes and hence smashed through the 4K SRAM limit available for the code to execute in.

Even though -Os was in use, it appears gcc decided to inline my new smaller routine; I guess the final binary size wasn't considered by gcc for some reason.

-fno-inline fixed the problem. (As did -O1 :-))

This was with the MIPS sourced version of the compiler which is quite old by now (gcc 4.4.6 IIRC) so I don't know if it's a problem on current gcc versions.

The message for Tim is that you need to see the generated code in the final binary at the optimisation level you are _actually_ using before you can decide which one is best. (And don't assume higher optimisation level equals "better". :-))
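With GNU binutils, for instance, inspecting the final binary might look roughly like this (the toolchain prefix and file name are made up; yours will differ):

mips-elf-size firmware.elf          # section sizes: did the code really shrink?
mips-elf-objdump -d firmware.elf    # disassembly: was the routine inlined?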

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world
Reply to
Simon Clubley

Yes indeed.

That kind of "nonsense" is one reason why the best /general purpose/ language/environment is Java. HotSpot optimises what is /actually/ executing, as opposed to what a compiler can /safely guess/ /might/ be executing. (GP = in the absence of other constraints, such as might be found in embedded systems.)

HP's Dynamo dynamic optimiser from the 1990s is a remarkable demonstration of that kind of thing:
- take a processor X, and emulate that processor on itself
- run optimised CPU-bound C (Oc) in the emulator to discover what code is actually executing
- change the binary to optimise what is actually executing (Oe)
- measure the performance of the original C (Oc) running natively on X => Px
- measure the performance of the changed binary (Oe) running in the emulator of X on processor X => Pe
- result: Px ≈ Pe

So, C slows down code as much as emulating a processor! Quite remarkable, and one reason why Java isn't as slow as naive oldies assume.

Reply to
Tom Gardner

Or use __attribute__((noinline)) on the function in question.
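Something like this (a sketch only; the routine name is made up):

/* Keep this out-of-line even when gcc's -Os heuristics
   would otherwise inline it at every call site. */
__attribute__((noinline))
void update_state(int x)
{
    (void)x;
    /* ... routine body, called from lots of places ... */
}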

gcc uses a number of heuristics and hints to decide whether or not a function should be inlined. These include optimisation options (such as -Os, -O2 or -O3), knowledge about the code usage (a single-use static function is almost always inlined with modern gcc), and estimates of how the total code size will change by inlining the code. It is often not clear whether inlining will cause code to grow or shrink, since inlining eliminates function call overhead (which can be large in some cases) and can allow for many other optimisations (keeping data in registers before and after the call, constant propagation, etc.).

gcc usually does a reasonable job, but certainly not always. It sometimes gets things wrong due to bugs (suboptimal code generation is still correct code generation, so such issues are harder to spot and a lower priority to fix), limited functionality (such as overly simplistic models of code size), or simply because the balancing heuristics about when a function is "too big to inline" or "small enough to always inline, even with -Os" don't match your particular requirements.

More recent versions of gcc generally have better tuning (and fewer bugs) than older ones, but it can never be right for everyone. You can fiddle with many of the heuristic parameters manually, but usually the best method is a few "noinline" or "always_inline" (or even "flatten") attributes in critical functions.

Look at it this way - it is one of the many little quirks that keep our jobs from getting boring!

Reply to
David Brown

Many modern C/C++ compilers (e.g., GCC, Visual Studio, Clang, ICC) have similar features, usually called "profile-guided optimization". Having to generate and export the execution profile, however, means it's not usually an option for smaller embedded systems.
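With gcc, for instance, the usual two-step build looks roughly like this (the file and input names are made up):

gcc -O2 -fprofile-generate prog.c -o prog
./prog < typical_input   # run a representative workload; writes .gcda profile data
gcc -O2 -fprofile-use prog.c -o prog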

-a

Reply to
Anders.Montonen

There is an additional problem with profile-guided optimisation. It is based on the idea that you measure how often functions (and to a lesser extent, loops) are executed - and thus where the optimisation effort should be concentrated. This plays badly with modern compiler optimisation techniques such as inlining, function cloning, partial inlining, hot/cold partitioning, and link-time optimisation, all of which blur the concept of separate functions and the match between source-code functions and object-code functions.

Reply to
David Brown

Mine would probably have been a `138 (the ubiquitous "address decoder" for small systems), or `74's and `157's (a DRAM controller).

Most of my "discrete" logic designs were specialty CPU's. So, any "junk logic" was typically implemented in bipolar ROMs (e.g., microcode store).

Reply to
Don Y

If I had to process these values in bulk, and didn't have to care about things like NaNs, I'd probably try to play some tricks with (integer) SIMD instructions.
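One possible shape of that trick, using SSE2 intrinsics (a sketch only: assumes x86, n a multiple of 2, and no NaN handling; the bitwise ops here are on double lanes, but the same idea works with integer vectors):

#include <emmintrin.h>  /* SSE2 */

/* Count the index pairs where a[i] and b[i] have opposite signs. */
static int count_sign_differences(const double *a, const double *b, int n)
{
    int count = 0;
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        /* XOR the raw bits: the result's sign bits mark sign differences. */
        __m128d vx = _mm_xor_pd(va, vb);
        int mask = _mm_movemask_pd(vx);   /* bits 0..1 = the two sign bits */
        count += (mask & 1) + (mask >> 1);
    }
    return count;
}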

Stefan

Reply to
Stefan Reuther

I just wrote a test program. The first statement is 10% faster than the second. You can argue all you want, but nothing beats benchmarking.

Reply to
edward.ming.lee

On 05.08.2015 at 19:31, snipped-for-privacy@gmail.com wrote:

That's easy to believe, but still wrong. Benchmarking a micro-optimization like that, particularly in any context other than the actual project, is clearly pointless. _Any_ difference between the benchmark platform and the actual one will render the benchmark result useless. A 10% difference like that can easily be inverted by just about any unrelated change to the surrounding source code, or any change to the tools and their settings.

The old truth still holds: premature optimization _is_ the root of all evil.

Reply to
Hans-Bernhard Bröker

What code is generated? Seems like 10% is close enough that "faster" may change with different hardware and tools.

--

Rick
Reply to
rickman

I timed two loops, looking at the user time.

t1.c: for(i=0; i<10000000; i++) { if (a >= 0.0 && b < 0.0) j++; }

t2.c: for(i=0; i<10000000; i++) { if (a*b < 0.0) j++; }

Reply to
edward.ming.lee

If you enable the optimiser, neither of these two executes the loop. Since a and b don't change inside the loop, j will always be either unchanged or incremented by 10000000, and the compiler figures that out and removes the loop.
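One way around that (a sketch, untested) is to make a and b opaque to the compiler, e.g. with volatile, so the test has to execute every iteration:

#include <stdio.h>

volatile double va = 1.0, vb = -1.0;  /* volatile: must be reloaded each pass */

int main(void)
{
    long j = 0;
    for (long i = 0; i < 10000000; i++) {
        double a = va, b = vb;   /* these loads can't be hoisted out */
        if (a >= 0.0 && b < 0.0)
            j++;
    }
    printf("%ld\n", j);          /* use j so it isn't optimised away */
    return 0;
}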

Clifford Heath.

Reply to
Clifford Heath
