code optimization in embedded systems

yusibin · 2007-11-15T13:53:08+00:00

1. using switch kerword properlyRemember that put the most frequently used case branch on the firstplace of switch-case branches. It helps to reduce time spent ondetemining which branch to execute. 2. using assembly language for time-critical operationsReplace codes written in C language with assembly language wouldgreatly improve the performance especially when the code is executedfrequently. Note that using assembly language may decrease thereadability of the code3. using register keywordUsing a register variable tell the compiler to put this variable inthe CPU's registers instead of stack. The speed to read variablesstored in registers is faster than read variables store in memory. 4. Avoid using some standard C library functionsStandard C functions are designed for general use. They must considerall the possibility, this increase the code size(e.g, springf isoversized for some embedded systems). You can write your own standard-C-fuction-like functions, with some features ignored.for more info, refer to

J

Jyrki Saarinen 18 years ago

If you don't understand what NP means, read my statement again. Also, I provided URLs which may explain this issue.

I'm not a compiler writer. In Finland, it is very rare that one gets such a job. In Nokia it might be possible, but whatever. You simply don't seem to understand. That is not my problem.

In general, I'm just interested in implementing a modern aggressively optimizing compiler (also re-targetable), and I chose Pascal from the

70's. It has pretty simply syntax and semantics. Also, because of my background in the demo scene, I find it interesting to create a compiler that would finally schedule insn for MC68060 (gcc doesn't do that, neither vbcc).

Jyrki Saarinen http://koti.welho.com/jsaari88/

Vote

J

Jyrki Saarinen 18 years ago

But, now you are challenging the masters of assembly! It is a magniture faster than crappy code emitted by a compiler.. ;)

Jyrki Saarinen http://koti.welho.com/jsaari88/

Vote

J

Jyrki Saarinen 18 years ago

(this really should be in comp.compilers, but perhaps this could be educational for .. people)

For your statement, on the level of IR code, how are you supposed you know whether it does pay off to do linear search or binary search?

Jyrki Saarinen http://koti.welho.com/jsaari88/

Vote

J

Jyrki Saarinen 18 years ago

Well, that is the common way to do that. But if I want to keep the "middle-end" free of target specific details?

Ok. I have to confess, that I have never been on that freq. On the Amiga, OTOH, everything was synched into the 50Hz screen. But that is an another issue again, which isn't really related to this group.

So does the compiler when it'll decide to make your switch-case

-statement into a table? I don't quite understand what do you mean by increasing the source size. Normally, when one applies the State pattern, the source gets quite small. This pattern can be made in C, even though it is PITA.

Jyrki Saarinen http://koti.welho.com/jsaari88/

Vote

J

Jyrki Saarinen 18 years ago

I don't agree.

By the way, are you related to "the" Dijkstra?

Now, you are being a troll. As it has been stated _several_ times in this thread (which was created by a troll), you need to do this only with very bad compilers.

A hint: Single Statement Assignment (SSA)

The first sentence is quite nice. I have seen (or forced) to refactor code in C++, which did not use STL. Instead, some wise guy did his own implementation of string and hashtable. Which is nice.

Jyrki Saarinen http://koti.welho.com/jsaari88/

Vote

R

robertwessel2 18 years ago

Indeed: The usual approach is to turn a sparse tree into a hard coded binary search for the upper part of the tree, and then a linear search for the lower part. The point where you switch depends on the cost of mispredicted branches. In the binary tree part, about half the branches will mispredict (not counting the branch-eq leg of the common compare / branch-gt / branch-eq construction), in the sequential part, only one (the one that's actually taken) will mispredict. Each extra layer of the binary search adds a compare, and half each of a predicted and mispredicted branch, but halves the number of compares and correctly predicted branches in the sequential part.

And of course switches with both spares and dense sections often generate code that does computed branches in the dense parts of the tree.

Vote

G

Gene S. Berkowitz 18 years ago

Well, there's your problem right there. Here in the USA, we had to do everything 20% faster than you because our Amigas refreshed at

60 Hz...

--Gene

Vote

J

Jyrki Saarinen 18 years ago

Right on.. in the US Amiga wasn't that popular. Though you used to make several kinds of peripherals, at very good quality.

Now we are once again out of context..

Jyrki Saarinen http://koti.welho.com/jsaari88/

Vote

N

Neil 18 years ago

Is this for Inline assembler or, A Full ASM function in its own module?

Vote

W

Wilco Dijkstra 18 years ago

No, but I'm planning to be more famous so people stop asking that question.

The word contiguous doesn't imply linearly incrementing, it means no gaps. Compilers always sort case values so the order is irrelevant. If your case labels are not dense (ie. have large gaps) then you're unlikely to get a branch table.

Now you're trolling. SSA has absolutely nothing to do with switches... FYI, few compilers use SSA today and the ones that do only use it in a limited set of optimizations.

C++ and STL is a different matter altogether. I would not advocate STL in programs that need to be efficient, use as little memory as possible and be small (ie. most embedded systems and many big desktop applications). It's great for fast development or if you simply don't care about memory use or performance.

STL often uses too much resources even for desktop systems, so many performance critical applications avoid it. I've seen cases where STL string concatenation slowed down command-line parsing so much that command-line parsing it became the most performance critical part of a compiler...

Many performance critical programs use their own memory allocation strategies as new/malloc are not good enough. So in specific situations it is a good idea to create your own implementation - but you'd have to have a really good reason and do significantly better than existing solutions.

Wilco

Vote

W

Wilco Dijkstra 18 years ago

The code generated for i/4 goes like this (for 32-bit int):

res = (i + ((unsigned)(i >> 31) >> 30)) >> 2;

ASR cannot be used for division by powers of 2 as it rounds towards minus infinity rather than zero, so an explicit round step done to round the result up if it is negative.

Wilco

Vote

K

karthikbalaguru 18 years ago

I came across the following w.r.t ARM processor w.r.t Efficient C:-

1) Unsigned types are faster during divisions. 2) Do not use implicit/explicit narrowing casts as they will cost extra cycles. 3) Do not use char and short types for function arguments or return values. Try, using the int type even if the range of the parameter is smaller. This makes the compiler not to perform unnecessary casts.

Karthik Balaguru

Vote

K

karthikbalaguru 18 years ago

Optimization techniques largely vary from processor to processor w.r.t architecture, compiler, linker , OS and many other factors.

Karthik Balaguru

Vote

W

Wilco Dijkstra 18 years ago

comp.compilers is not easy to post to so many of us don't bother.

As I said, you query the backend. The intermediate code is usually very close to the final code, so you can ask questions like: "how many cycles/ bytes is this compare going to take?". A good compiler asks the backend similar questions all the time so that most target-specific decisions are made well before the IR enters the backend.

However in this case it is good enough to ask at how many case labels one should change from binary into linear search for the remaining cases.

Wilco

Vote

J

Jyrki Saarinen 18 years ago

I wouldn't consider intermediate code being "close to the target". Quite often it is a "three-address code" in SSA form. Your other statements apply, of course.

Jyrki Saarinen http://koti.welho.com/jsaari88/

Vote

J

Jyrki Saarinen 18 years ago

:)

I was trying to make a point of SSA enabling some (actually quite many) optimizations which are algorithmically more difficult to do in non-SSA form.

STL has very carefully selected algorithms, and being based on templates, it is also likely to be faster than any custom made data structure(s).

I don't undestand how a custom made linked list for integers (for example) should be any faster compared to:

formatting link

You must be kidding (about the last sentence). How many command line options did you have, a million?

Jyrki Saarinen http://koti.welho.com/jsaari88/

Vote

W

Walter Banks 18 years ago

Yes. Our libraries and startup code is written in C.

Regards

-- Walter Banks Byte Craft Limited Tel. (519) 888-6911 Fax (519) 746 6751

formatting link

snipped-for-privacy@bytecraft.com

Vote

W

Wilco Dijkstra 18 years ago

Actually he is right. NP complete problems can be solved in reasonable time as long as the problem size stays small. It only becomes impossible to solve as the size grows.

The fact is compilers and humans can produce optimal code for small functions (have you heard about super optimizers?). It only becomes impractical for large programs.

If you are interested in such a job and don't mind going abroad then I know several UK companies who are keen to hire.

Indeed Pascal is not bad choice to start with.

Wilco

Vote

W

Wilco Dijkstra 18 years ago

it

Most targets use 3-address instructions or a subset thereof, so it is a good match. In a compiler I worked on the intermediate code could represent almost every feature of the target, including some of the more complex features like conditional execution and addressing modes. So even for CISC targets you'd be close to final code.

Wilco

Vote

W

Wilco Dijkstra 18 years ago

Actually the quality and performance of STL varies greatly between implementations. Some functions have algorithmic complexity guarantees, but even that doesn't necessarily result in good performance. If it guarantees an operation in O(log(N)) time but you need it in O(1) time you're stuck.

Just look at the size() operation - it takes O(N) time. Are you saying that it is not possible to do this in O(1) in a custom implementation?

No, just the usual ~5 arguments. The compiler slowed down by about a factor of 2, and the startup time was noticeably slower. The commandline options were stored in a string hash (key->value) and needed various string concatenations to calculate the key and return the result. The STL version was a literal translation from C into STL C++. The slowdown was caused by doing lots of lookups during compilation based on the assumption they were fast operations - which was true for the C implementation but not for STL.

At the time nobody would believe STL could be that slow, but I compared STL strings with C strings, and you do indeed get a 30-50 times slowdown when doing strcat and similar operations. The total slowdown of the lookup was initially around 400 times and after a week of hard work the author got it down to only 200 times slower... That was the point I decided to ban the use of STL even in non-performance critical code.

Wilco

Vote

code optimization in embedded systems

Join the Discussion

Didn't find your answer?