code optimization in embedded systems

There is no such thing as an "optimal sequence of machine code". Optimal with respect to what? Size? Speed? Debuggability? Code generation is always a trade-off between various conflicting goals. I agree that compilers usually pick their preferences differently than assembler programmers do.

Such code *might* be generated with an assembler if you've got an assembler guru with you and infinite time. Just doing it in assembler doesn't mean it's automatically good.

Stefan

Reply to
Stefan Reuther

Good point, although I also have experience of this situation, which I didn't bother writing about earlier, but here it goes:

~10 years ago I was involved in the Amiga demo scene, mostly with realtime 3D graphics. For linear interpolation (good results for Gouraud shading, not so good for texture mapping due to the lack of perspective division.. :)) the addx.l (add with extend) instruction was used. It allowed linear interpolation of fixed-point values without right-shifting the interpolated value:

Gouraud shading with 16.16 precision (this precision wasn't very useful with only 16 colours, so even the gouraudTable was dithered), but the quality was quite good, and even a lowly Amiga 1200 with a 14MHz 68020 could achieve quite nice frame rates:

    lea     gouraudTable,a0
    lea     chipRAMBuffer,a1
    ...
.innerloop
    move.l  (a0,d0.w*4),(a1)+   # 4 bits per pixel: 8 pixels drawn from a LUT
    addx.l  d1,d0               # g += dg;
    dbf     d2,.innerloop

(edges had to be handled separately with and/or)

Gouraud shading, "chunky mode", up to 24.8 precision: lea chunkyBuffer,a0 ... .innerloop move.b d0,(a0)+ addx.l d1,d0 dbf d2,.innerloop

Texture mapping (16.16), "chunky mode" (byte == one pixel, 256 colours):

    lea     textureMap,a0       # 256x256 texture map
    lea     chunkyBuffer,a1     # a buffer to be c2p'ed to chip RAM,
                                # as fast as just copying from fast2chip
    ...

    # capital letters denote the integer part, lowercase the fraction
    # d0: u:  %vvvvvvvv0000000000000000UUUUUUUU
    # d1: v:  %uuuuuuuu00000000VVVVVVVVvvvvvvvv
    # d2: du: %vvvvvvvv0000000000000000UUUUUUUU
    # d3: dv: %uuuuuuuu00000000VVVVVVVVvvvvvvvv

    sub.w   d2,d0               # reduce error from using
    addx.l  d3,d1               # two addx.l
.innerloop
    move.w  d1,d4               # v is good enough.

And if you know how to think like a compiler, you can also

For most transformations (assuming a modern compiler), you don't have to "think like a compiler". The compiler does it for you.

If access to a static variable is an issue in time-critical code, you are doing something funny (ISRs and volatile are of course a different issue).

Anyway, related to your "packing the variables into a struct", modern compilers are quite often able to do this:

/* before optimization */
static int key[SIZE];
static int value[SIZE];

/* afterwards */
struct merged { int key, value; };

static struct merged kv[SIZE];

Loop optimizations - for better cache usage - that are usually done include loop interchange and loop fusion. Pretty standard stuff.

Examples:

(better cache hit rate by having better spatial locality)

/* before loop interchange */
for (j = 0; j < 100; j++)
    for (i = 0; i < 100; i++)
        x[i][j] = 2 * x[i][j];

/* after */
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

(better cache hit rate because a[i][j] and c[i][j] are touched again while still in cache - better temporal locality)

/* before loop fusion */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1 / b[i][j] * c[i][j];

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* after */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1 / b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

(These examples are from my seminar report from 2003; sadly it is in Finnish..) In general, programming languages that have pointers suffer from a problem called aliasing, which may prevent the optimizer from working at 100%. This is one reason why just-in-time (JIT) compiled Java can be even faster than statically compiled C or C++ code. Another reason is that the JIT compiler can analyze the behaviour of the running program. Statically compiling compilers can do this too, with the limitation that there is only one profile (== one case of input) of the executing program to optimize against, while the JIT compiler can do it all the time.
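To make the aliasing problem concrete, here is a minimal C sketch (the function and names are made up). Without 'restrict' the compiler must assume that stores through dst may modify src, so it reloads src values and cannot unroll or vectorize freely; the C99 'restrict' qualifier promises the two arrays are disjoint:

void scale(float *restrict dst, const float *restrict src, int n)
{
    /* thanks to 'restrict', the store to dst[i] cannot change
       src[i+1], so the loads can be scheduled ahead freely */
    for (int i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}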

--
Jyrki Saarinen
http://koti.welho.com/jsaari88/
Reply to
Jyrki Saarinen

Yep. The problem is that the optimization of a switch-case statement usually happens at a higher level, on the intermediate representation (IR) - in "the middle-end" - where there's no information about the target arch.

Done that (implemented a state machine for a fire detection system, in one part of the SW), although with not so many states. And it wasn't a performance-critical part of the system at all.

In OOP, this kind of pattern can be implemented using the State design pattern. The code is much more readable than a series of switch-cases, and the performance is likely to be the same. Instead of the compiler emitting a jump table or a binary search over the case values, a "vtable" (an indirect jump, no conditional jumps at all) is used.
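In plain C the same dispatch can be sketched with function pointers acting as the vtable; the states and events below are made up:

enum { EV_START, EV_STOP };

struct state {
    const struct state *(*handle)(int event);  /* one-slot "vtable" */
};

static const struct state *idle_handle(int event);
static const struct state *running_handle(int event);

static const struct state idle    = { idle_handle };
static const struct state running = { running_handle };

static const struct state *idle_handle(int event)
{
    return event == EV_START ? &running : &idle;
}

static const struct state *running_handle(int event)
{
    return event == EV_STOP ? &idle : &running;
}

/* dispatch is a single indirect call - no compare/branch chain */
const struct state *dispatch(const struct state *s, int event)
{
    return s->handle(event);
}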


--
Jyrki Saarinen
http://koti.welho.com/jsaari88/
Reply to
Jyrki Saarinen

OK, got what you meant.

As I stated in an earlier post, the problem is that these kinds of optimizations are done on IR code, without knowing the details of the target (in a retargetable compiler, that is - a C compiler built for _only_ the H8/S might be a different situation).

--
Jyrki Saarinen
http://koti.welho.com/jsaari88/
Reply to
Jyrki Saarinen

This tip is based on an assumption you have no way to back up, and it has no particular relevance to embedded systems.

C compilers have all kinds of different strategies to turn switch() statements into machine code. Some of them will optimally reorder the cases themselves if you provide them with profiling data from typical executions of the program.

Again, in the generality you state it, this tip is based on an unsupportable assumption.

... typically achieves quite exactly nothing, in any state-of-the-art compiler.

Reply to
Hans-Bernhard Bröker

No, nothing using tables or subroutine calls. I meant in-line code that compares the switch-index to the constant numbers of the cases, but organized in a "binary search tree": first compare to the middle value; if equal that's your case; if less, handle the lesser values recursively (recursively at compile-time, that is); if greater, handle the greater values recursively likewise.

Just as gcc 3.4.4 does it in Stefan Reuther's post. On most processors this is almost certain to be faster, on average, than comparisons in ascending numerical order. (The exception: if the comparisons in ascending order can use faster instructions, perhaps smaller literals, for example by successively subtracting the "deltas" between the case numbers from the switch-index variable.)
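As an illustration, a switch over the (hypothetical) case values 1, 4, 9, 16 and 25, hand-lowered into such a compare tree - fewer comparisons on average than testing the values in ascending order:

extern void case1(void), case4(void), case9(void),
            case16(void), case25(void);

void dispatch(int x)
{
    if (x == 9)             /* middle value first */
        case9();
    else if (x < 9) {       /* then each half, recursively */
        if (x == 1)         case1();
        else if (x == 4)    case4();
    } else {
        if (x == 16)        case16();
        else if (x == 25)   case25();
    }
}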

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .
Reply to
Niklas Holsti

Strong agreement. Except in very special cases.

Unfortunately the cases are often identified by enumerated types, so it is cumbersome to look up the numerical order -- and it might change later...

If you use Ada, the compiler will tell you which values you left out, and will reject the program -- unless you have a default ("others") case. One of the nicest ways in which Ada supports the fallible programmer.

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .
Reply to
Niklas Holsti

The time (and so reasonableness) is very much dependent on the problem size.

The most obvious targets for hand optimisation are small fragments of time critical or very frequently executed code.

You write compilers? Is 100% of your runtime library written in C?

Reply to
nospam

I don't mind sounding argumentative, but 10 times is just ridiculous: it is simply impossible for a compiler to produce code that bad. Even unoptimized code is rarely twice as slow as optimized code, so a compiler would have to emit many NOPs after every instruction to get anywhere near a 10-times slowdown versus assembler.

Modern compilers routinely beat all but the most experienced assembly programmers. On ARM, for example, you'd have difficulty getting more than a 10% speedup over compiled code, and getting more than that is only possible in extremely rare cases. However, even a 2-times speedup for a non-trivial piece of code is impossible.

You're completely misunderstanding his claim. He is saying that EVERY program in assembler has an equivalent C variant (on his compiler). That means that any assembly program you write can be translated back into a C program with identical performance, ie. it is pointless to make X times faster claims as they are equally fast.

At the end of the day writing optimal code is an art which few people master. Whether you use your talents on C or assembler is a matter of preference rather than one of performance.

Wilco

Reply to
Wilco Dijkstra

You'd still be looking at a factor of 2, maybe 3. Most compilers have intrinsics nowadays so you can use most instructions from C. Very difficult targets often mean it's harder for the assembly programmer as well as the compiler (you already mentioned deep interlocked pipelines and large register files, but what about rotating register files, conditional execution, complex addressing modes, etc?). For every gifted assembly programmer there is a gifted C programmer, so you can't get an advantage that way. For equally gifted programmers, the C one will win as he can concentrate more on the algorithm rather than the low-level details.

Actually you can: optimise the C code or use a better compiler and pretend you wrote it :-)

Wilco

Reply to
Wilco Dijkstra

Compilers have to handle all border cases, and thus may generate code you don't want. Simplest example: "i/4". You want a right-shift because you know i is positive. The compiler assumes it might be negative and generates code to handle that. Solution: declare i as unsigned.
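A minimal sketch of that border case (the exact instructions vary by target): the two functions below are not equivalent, because C division truncates toward zero while an arithmetic shift rounds toward minus infinity, so the signed version needs extra fix-up code:

int      sdiv4(int i)      { return i / 4; }  /* shift plus sign fix-up:
                                                 sdiv4(-1) must be 0, but
                                                 an ASR alone yields -1 */
unsigned udiv4(unsigned i) { return i / 4; }  /* one logical shift */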

That's exactly what I'm doing :-) The ISR fires, looks into my device object to see whether there's another request, and posts that. And if I get my ISR down from 600 to 500 cycles using that trivial optimisation, I'll happily do it. In addition, the code gets smaller and leaves more precious L1 memory available.

What compiler does this *in practice*? I haven't seen one do that yet. I haven't even seen one that dissects an array into a set of registerizable scalars, even if possible (such as the state array of a 3-tap FIR filter).

(I have, however, seen compilers that merge the static variables into a single allocation unit and use a pointer register and offset, as if they were a struct. But I haven't seen any compiler change the layout of my variables.)
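For the record, the FIR transformation done by hand, with made-up coefficients - the point being that the scalar version lets the state live in registers for the whole loop:

/* before: 3-tap FIR state in an array, typically left in memory */
void fir3_array(const short *x, short *y, int n)
{
    short z[3] = { 0, 0, 0 };
    int i;
    for (i = 0; i < n; i++) {
        z[2] = z[1]; z[1] = z[0]; z[0] = x[i];
        y[i] = (short)((3 * z[0] + 5 * z[1] + 3 * z[2]) >> 4);
    }
}

/* after: the same filter with the state dissected into scalars */
void fir3_scalar(const short *x, short *y, int n)
{
    short z0 = 0, z1 = 0, z2 = 0;
    int i;
    for (i = 0; i < n; i++) {
        z2 = z1; z1 = z0; z0 = x[i];
        y[i] = (short)((3 * z0 + 5 * z1 + 3 * z2) >> 4);
    }
}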

[snip]

Another fine example of why you should think like a compiler :-) A regular C compiler will not perform these optimisations because it has to assume aliasing between all the arrays. Either use 'restrict' (might help in some cases), or let the compiler know that the arrays are disjoint, perhaps by packing them in a structure. Packing into a structure helps with VisualDSP++, for example.

Stefan

Reply to
Stefan Reuther

Indeed.

I once profiled a system running a TCP/IP stack on an ARM platform and about 50% of the CPU time was spent in the IP checksum routine, so I decided to re-write it in assembly. Just for fun, I did it without first looking at the compiler generated code. My first several attempts were considerably slower than GCC's code. After a day or so, I managed to equal the speed of the C version.

After about a week's work and some help from somebody who knew every last corner of the ARM instruction set, I managed about a 25-30% speed increase compared to the compiled code. I was only able to achieve that because the IP checksum algorithm's 1's complement arithmetic allows an obscure trick with the carry bit when done in assembly language.
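For reference, a portable C version of the algorithm (RFC 1071 style). The assembly-only win comes from summing 32-bit words with add-with-carry, so that the carry flag performs the end-around carry for free - something C cannot express:

unsigned short ip_checksum(const unsigned char *p, int len)
{
    unsigned long sum = 0;

    while (len > 1) {            /* sum 16-bit big-endian words */
        sum += (unsigned long)((p[0] << 8) | p[1]);
        p   += 2;
        len -= 2;
    }
    if (len)                     /* odd trailing byte */
        sum += (unsigned long)(p[0] << 8);
    while (sum >> 16)            /* fold the end-around carries */
        sum = (sum & 0xffffUL) + (sum >> 16);
    return (unsigned short)~sum;
}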

I really can't believe that, in general, the average assembly language programmer is going to be able to beat (or even equal) a modern compiler.

--
Grant Edwards                   grante             Yow! Should I get locked
                                  at               in the PRINCICAL'S
                               visi.com            OFFICE today -- or have
                                                   a VASECTOMY??
Reply to
Grant Edwards

Then it must be taught about the target architecture :-) A simple heuristic may also do.

It's not performance critical, but the compiler doesn't know that. Why should it generate worse code than it would generate if this state machine were the ISR for a 100 kHz bit-banging protocol?

I know, but that would increase code size (both source and object code) even more. If you have 1000 events, every vtbl would have 1000 entries. (Borland once had "dynamic methods", where only the implemented methods occurred in the method table, but I haven't seen that again since.)

Stefan

Reply to
Stefan Reuther

I do. I know various compilers that have all runtime libraries written in C except for a small amount of startup code to set up the C stack etc. It is often done when supporting many different architectures, where the cost of writing assembler is prohibitive on all but a few key targets. Doing it in C means any effort you put into improving the compiler pays off on all targets.

You can write much of the C library in C and get good performance. Even low-level routines like division or memcpy can be done in C with performance as good as assembler's. For example, my C floating-point multiply code beats the fastest optimized assembler code available on ARM.
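A simplified sketch of what such library code can look like (a real version would also unroll and handle mutually misaligned pointers, and it relies on implementation-specific aliasing guarantees):

#include <stddef.h>
#include <stdint.h>

void *memcpy_c(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* copy a 32-bit word at a time when both sides are aligned */
    if ((((uintptr_t)d | (uintptr_t)s) & 3u) == 0) {
        while (n >= 4) {
            *(uint32_t *)d = *(const uint32_t *)s;
            d += 4; s += 4; n -= 4;
        }
    }
    while (n--)                  /* byte tail (or unaligned case) */
        *d++ = *s++;
    return dst;
}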

Wilco

Reply to
Wilco Dijkstra

All of this can be done in a machine independent way. If you have 20+ non-dense cases it is pretty obvious that linear search is much worse than binary search. If you want to do better, then the traditional way is to let the backend decide when the linear search becomes better. This may depend not only on the target but also on the actual case values or the optimization goal.

Wilco

Reply to
Wilco Dijkstra

Even without profiling information, control flow analysis will yield at least three approaches: zero-based indexes in a block with few gaps, nonzero-based indexes with few gaps, and sparse indexes. The target architecture will tune the implementation choices.
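Hand-rendered in C, the three shapes look roughly like this (case values and handlers h0..h3 are hypothetical):

static void h0(void) {}
static void h1(void) {}
static void h2(void) {}
static void h3(void) {}

static void (* const tbl[4])(void) = { h0, h1, h2, h3 };

void dense_zero(int x)      /* cases 0..3: index the table directly */
{
    if ((unsigned)x < 4u)
        tbl[x]();
}

void dense_biased(int x)    /* cases 100..103: subtract the base first */
{
    if ((unsigned)(x - 100) < 4u)
        tbl[x - 100]();
}

void sparse(int x)          /* cases 2, 70, 900: compare tree */
{
    if (x == 70)       h1();
    else if (x < 70) { if (x == 2)   h0(); }
    else             { if (x == 900) h2(); }
}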

This is going to be a long thread...

Regards

--
Walter Banks
Byte Craft Limited
Tel. (519) 888-6911
Fax (519) 746-6751


Reply to
Walter Banks

If the compiler can prove it is positive (eg. if (i > 10) i = i / 4), it will use a single shift. However, you raise a good point: in order to write efficient code you need to understand how compilers work as well as the details of the microarchitecture you're targeting.

The Sun compiler does this to optimise for better caching effects; however, it is only interesting on large systems. Compilers routinely split structures and arrays into scalars, but only if they are not address-taken or indexed by a variable (ie. a[1] is OK, but a[i] is not).

That's routine as well on compilers targeting RISC architectures. Reordering variables to minimise base pointer offsets and alignment padding is often done too (much to the disappointment of people who believe it is their right to access y in int x, y as *(&x + 1) :-).

Yes, it is usually possible to avoid aliasing. The most common technique is to cache pointer or array accesses in scalars. Like restrict, you may need to do some of this by hand, but this transformation can make the code easier to understand.
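A small before/after of that technique (names made up): since out and gain point to compatible types, the compiler must assume each store through out could change *gain and reload it on every iteration; caching it in a local removes the reloads:

/* before: *gain is reloaded inside the loop */
void apply_gain(int *out, const int *in, const int *gain, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * *gain;
}

/* after: one load up front, then a register-resident copy */
void apply_gain2(int *out, const int *in, const int *gain, int n)
{
    int g = *gain;
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * g;
}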

Wilco

Reply to
Wilco Dijkstra

Everybody agrees the original advice was incorrect, but what is the correct advice? This advice applies to most CPUs, from simple 8-bit cores to 64-bit superscalar monsters:

Put performance-critical cases in separate if statements before the switch (an if statement is always faster than a switch, and a switch doesn't guarantee which particular case is tested first). Try using mostly contiguous case values within a small range so that a branch table can be used.
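For example (message names hypothetical):

enum { MSG_DATA, MSG_OPEN, MSG_CLOSE };
extern void handle_data(void), handle_open(void),
            handle_close(void), handle_unknown(void);

void process(int msg)
{
    if (msg == MSG_DATA) {            /* the hot case, tested first */
        handle_data();
    } else {
        switch (msg) {
        case MSG_OPEN:  handle_open();    break;
        case MSG_CLOSE: handle_close();   break;
        default:        handle_unknown(); break;
        }
    }
}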

First optimise the algorithm, then the design, and finally the C implementation. Use language extensions and intrinsics as necessary. If performance still isn't good enough, don't forget to enable the optimizer and select the best possible compiler options (this alone can easily help 10%), or buy a better compiler. Use assembler as a last resort, and only if you are an expert in writing assembly code, or you'll do worse than the compiler.

Never use the register keyword as it is ignored. For local variables, use the natural data type for the CPU rather than the size of the data (eg. char on 8/16-bit but int on 32-bit CPUs). Use unsigned rather than signed arithmetic where possible. Avoid taking the address of scalar variables. Avoid using more than 4-8 distinct variables (reusing the same variable for different purposes doesn't improve things). Do use const for global constant data. Do use volatile when accessing global variables from interrupts (a very common mistake). Don't make any assumptions about global data ordering or alignment.
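The volatile tip in its most common form, sketched with a hypothetical UART receive flag:

static volatile unsigned char rx_ready;   /* shared with the ISR */

void uart_rx_isr(void)          /* installed as an interrupt handler */
{
    rx_ready = 1;
}

int main(void)
{
    for (;;) {
        while (!rx_ready)       /* without volatile, the compiler may
                                   read rx_ready once and spin forever */
            ;
        rx_ready = 0;
        /* ... process the received byte ... */
    }
}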

Do use C library functions rather than reinventing the wheel (many people write their own slow/buggy memcpy, for example). Use the efficient initialization of global variables at startup when possible. Only use floating point when absolutely required, as it is emulated in software on most embedded systems and few people understand how to deal with rounding errors. Avoid general C library functions where more specific ones exist (eg. str[n]cpy/str[n]cat rather than s[n]printf). 4*atan(1) looks cool but takes a lot of code and cycles for a constant...

Wilco

Reply to
Wilco Dijkstra

Not to be offensive, but does data-flow analysis ring a bell?

Even gcc does this, for some architectures, and it isn't rocket science. This stuff has been common knowledge for at least 20 years (I'm only 30, so I was a little boy when RISC was "invented").

Caches have been around for a long time, predating RISC machines. Even my lowly Amiga 1200 with a 14MHz 68020 had an instruction cache of 256 bytes.

Exactly. Compilers are able to do things.

gcc has a switch for very strict aliasing. It might break some badly written code, but it will likely produce better machine code.

DSP: I have only studied the theory, not the practice, so I will not participate in this discussion.

--
Jyrki Saarinen
http://koti.welho.com/jsaari88/
Reply to
Jyrki Saarinen

So it generates an arithmetic right shift instruction if "i" is signed, and a logical right shift if "i" is unsigned. Same as an assembly programmer would.

I don't get your point. If "i" is always positive, then the ASR generated by the compiler is just as correct as an LSR.

--
Grant Edwards                   grante             Yow!  I'm a nuclear
                                  at               submarine under the
                               visi.com            polar ice cap and I need
                                                   a Kleenex!
Reply to
Grant Edwards
