Code optimization in embedded systems

  1. Using the switch keyword properly

Remember to put the most frequently used case first among the switch-case branches. This helps reduce the time spent determining which branch to execute.

  2. Using assembly language for time-critical operations

Replacing code written in C with assembly language can greatly improve performance, especially when the code is executed frequently. Note, however, that using assembly language may reduce the readability of the code.

  3. Using the register keyword

Using a register variable tells the compiler to keep that variable in a CPU register instead of on the stack. Reading a variable held in a register is faster than reading one stored in memory.

  4. Avoiding some standard C library functions

Standard C functions are designed for general use, so they must handle every possibility, and this increases code size (e.g., sprintf is oversized for some embedded systems). You can write your own standard-C-like replacements with the unneeded features left out.

for more info, refer to

formatting link

Reply to
yusibin

We have found this to be largely false. Dropping into assembly interrupts C optimization, and can thereby eliminate any savings.

We claim that C can match and beat assembly, especially when using small, clear statements. For a quick introduction to the way we verify this, see

formatting link

Don't give up on C, but do choose the correct C data types for your programming problem. All things being equal, this makes the developer more efficient as well. Say you need 16 bits of resolution for your calculations. Cudgeling an int16 into a role where a _Fract would be more natural won't make much of a difference program-wise, but it will add to your burden later on, and to that of programmers down the road.

Kirk.

--
Kirk Zurell
Byte Craft Limited
Reply to
Kirk Zurell

Any reasonable compiler will compile a switch statement into a jump table, so the ordering doesn't really matter (well, if the CPU has an instruction cache and prefetching, it might behave a little differently).

Depends on the CPU. On a RISCy architecture with 32 registers and pipelines (isn't ARM a RISC-based design?), it is likely that the compiler produces much more efficient code than the average assembly programmer. Even good assembly programmers will have a hard time keeping track of up to 32 registers and scheduling the instructions to keep the pipeline(s) busy.

Of course this doesn't apply to very limited architectures.

No, it doesn't. The "register" modifier is just a hint to the compiler. Usually the compiler's register allocator does a better job than a human specifying "register" modifiers on variables. Not specifying "register" doesn't imply that the variable will be stored on the stack, either.

This is probably true, although sprintf() is very unsafe because of the possibility of buffer overflows. The same applies to hand-written "standard-C-function-like" functions.

Anyhow, your remarks about C compilers sound like something from the 70s.

--
Jyrki Saarinen
http://koti.welho.com/jsaari88/
Reply to
Jyrki Saarinen

"Especially when using small, clear statements". I don't agree with this part of your article. Any decent optimizer, with local and global optimizations done at the level of the AST and IR code, will eliminate redundancies, resulting in "small and clean statements".

Anyhow, even though I replied seriously to the original poster, I have now started to think that perhaps he is a troll..

--
Jyrki Saarinen
http://koti.welho.com/jsaari88/
Reply to
Jyrki Saarinen

This is true. I was thinking about the same thing, but forgot to write about it..

By "balanced tree of comparisons", do you mean something similar to the following:

Build a sorted array from the case values at compile time and use binary search to jump to the destination associated with the matching case value. Just wondering if any compiler does this kind of thing.. guaranteed O(log n) complexity, where n is the number of cases.

Implementation might look something like this:

typedef unsigned int CaseValueType;

typedef unsigned int ArrayLengthType;

struct arrayElement { CaseValueType caseValue; void *destinationAddress; };

sizeof(struct arrayElement) should probably be 2^n, n > 0, to eliminate a multiplication when indexing the array.

then

void *searchJumpAddress(CaseValueType caseValue, struct arrayElement *array, ArrayLengthType arrayLength)
{
    ArrayLengthType index = binarySearch(caseValue, array, arrayLength);
    return array[index].destinationAddress;
}

(now the problem is how to handle case default: ... maybe one could make the array size +1, make *array point to the second element, and have binarySearch() return -1 when caseValue is not found, i.e. for the default case).

Since CaseValueType is unsigned, negative case values cannot be handled, but that could be fixed either by making it signed and pointing *array at the proper element, or by offsetting caseValue at compile time so that it is always >= 0.

Anyway, this discussion should probably be in comp.compilers..

--
Jyrki Saarinen
http://koti.welho.com/jsaari88/
Reply to
Jyrki Saarinen

Except when the cases in the switch statement are numbered sparsely, with large gaps in the interval from the smallest to the largest number. A jump table is then very large and sparsely filled (unless a hashed table is used, which is rare and complex). It is certainly reasonable to compile such sparse switch statements into a sequence of comparisons, making the case matching the first comparison the fastest case.

Some compilers generate the comparison sequence in numerical order, not in lexical order, and then the case with the smallest number will be fastest. It is not always possible to give the smallest number to the most frequent case. A balanced tree of comparisons would be more efficient, on the average, if the number of cases is largish.

Another consideration is whether you want to minimize the average execution time (the frequent cases should be in the fastest positions) or the worst-case execution time (the heaviest cases should be in the fastest positions, to keep the sum of the execution times for "choose case" and "execute case" small).

Some compilers have options, or pragmas, to select the way in which all switch statements, or particular switch statements, should be implemented in the generated code. If switch statements are critical for performance, you should experiment with different options for the best space-time combination. I've seen a factor 10 difference in the worst-case execution time between such options.

--
Niklas Holsti
Tidorum Ltd
Reply to
Niklas Holsti

gcc does it (version 3.4.4 in case that matters).

----8<----

Reply to
Stefan Reuther

Not to sound argumentative, but my experience (and a very good living over the last 30+ years) has repeatedly demonstrated that, in an embedded environment, someone familiar with both disciplines can routinely achieve an order of magnitude improvements in speed of selected functions. The trick is to be able to determine which functions to rewrite in assembler.

You can claim whatever you choose but that doesn't make it true. Make it worth my while and I'll be happy to demonstrate... Oh, that's right, people already make it worth my while to do it on a daily basis.

Note to OP: Only true for people who cannot read assembly language.

I wouldn't dream of it. That's been the "good" part of my very good living. The "very" part comes from being able to make ISRs and frequently-used functions "scream".

Ken Asbury

Reply to
Ken Asbury

This must apply only to architectures for which it is difficult for the compiler to emit good code. If you get an order of magnitude improvement in performance when hand-coding a function in assembly, the compiler must be quite bad. The reverse statement might apply on architectures with deep pipelines, several functional units, and a large register set.

As for your last sentence: one should profile before doing anything else..

--
Jyrki Saarinen
http://koti.welho.com/jsaari88/
Reply to
Jyrki Saarinen

Oh, this is quite surprising (at least to me). Although it has been a while since I last disassembled a compiler's output..

3.4.4 is quite old, by the way.
Reply to
Jyrki Saarinen

Replying to myself: deciding this might be difficult for the compiler, since it requires a lot of information about the target architecture, while these kinds of optimizations are usually (to my understanding; I might be wrong, since I am not an expert in compiler technology) made at the intermediate-code level.

--
Jyrki Saarinen
http://koti.welho.com/jsaari88/
Reply to
Jyrki Saarinen

I have found it to be mostly true.

I guess you mean C-like assembly instructions.

For a quick introduction to the way we

Hasn't verified anything.

Reply to
cbarn24050

An order of magnitude is pretty much the norm.

The reverse statement might apply on architectures with deep

Not likely but you'd have to be a brave man to try assembly on these processors.

Reply to
cbarn24050

An obviously false claim. For a given function there is one (or more than one, equally) optimal sequence of machine code. Such optimal code *can* be generated with an assembler; such optimal code *might* be generated by a compiler (C or other).

The OP's claim that replacing C code with assembler *will* greatly improve performance is also false.

Reply to
nospam

Nice Link :):)

Karthik Balaguru

Reply to
karthikbalaguru

That should be a comparatively simple decision. A "jmp" costs X cycles, a "cmp" costs Y cycles, and then count instructions.

Happens to me all the time. Granted, it's not "thousands", but it's a nontrivial amount. Our device's HMI has several dozen states, and each is a function containing a switch() over incoming events. Incoming events can be timers, button presses/releases, hardware signals, etc. Each function handles the events it can handle (a subset of the full set), and calls its parent state if it can't. This finally yields many sparse switch()es.

Stefan

Reply to
Stefan Reuther

There may be instructions the compiler doesn't know about (how do you write down a bit-reversed addition or a rotate-through-carry in C?). And if you do assembly, you can ignore calling conventions and other rules that limit your creativity. I have a boot loader in assembly, which does LZ77 unpacking and decryption, and is faster than memcpy would be (i.e. compression saves more memory bandwidth than it costs CPU time). It juuuuuuust fits into the processor's register file. That would be hard in C.

Yes. My experience is that 99% of the time, the compiler generated code is good enough. And if you know to think like a compiler, you can also do a number of optimisations on the C code itself.

For example, I usually pack the static variables of my modules into a struct. Advantage 1: on RISC architectures, all variables can from then on be addressed using a single pointer register and an offset. If they were individual variables, the compiler would have to reload the address into a pointer register all the time. Advantage 2: I can see the state of my module by observing just a single variable with the debugger. Disadvantage: a little more typing, and the linker can no longer eliminate unused variables.

Stefan

Reply to
Stefan Reuther

... snip ...

I would expect such to arise from something controlled by an enumerated state. This would make the enumerations contiguous, and the transfer table would be dense and quite feasible.

--
 Chuck F (cbfalconer at maineline dot net)
   
Reply to
CBFalconer

Please give an example of such a bad compiler. An order of magnitude implies at least a 10x speedup, which I find possible only under the following conditions:

- very poor compiler

- very difficult target

- the target has instructions that cannot be expressed in C, for example

- the assembly programmer is very gifted

Now, an example where all these conditions are true at the same time would be interesting.

Even better example of the reverse statement: Itanium.

You might want to participate in a challenge: let's say that the target is a RISCy architecture, PPC perhaps. Then we choose a non-trivial piece of algorithm, maybe a symmetric cipher (AES?), and a compiler which isn't at its best on the chosen platform - gcc (at least this used to be the case at one point; I don't know how much the gcc back-end for PPC has been improved). I am quite sure that you are not able to beat the compiler in assembly _in reasonable time_.

One realization of AES you can find here:

formatting link

--
Jyrki Saarinen
http://koti.welho.com/jsaari88/
Reply to
Jyrki Saarinen

Actually, emitting optimal code is an NP-complete problem (i.e. not solvable in reasonable time; this is a simplification of NP-completeness, but that is another issue, since this is not comp.theory).

Find more information here:

formatting link
formatting link

--
Jyrki Saarinen
http://koti.welho.com/jsaari88/
Reply to
Jyrki Saarinen
