reducing flash size in embedded processors?

Then I guess I've never seen "an experienced and good assembly programmer": somebody who can take advantage of delayed branches, pipelines, register windows -- somebody who can keep track of a dozen different variables and intermediate values in registers and on the stack, and all the other things that compilers are good at. I suspect those people are so rare these days that the chances of finding one are negligible.

--
Grant Edwards                   grante             Yow!  Does someone from
                                  at               PEORIA have a SHORTER
                               visi.com            ATTENTION span than me?
Reply to
Grant Edwards

... on *tiny* examples only, and even then by a small margin (say < 20%). This is typically because a particular trick is used that is not available to the compiler. I've used the stack pointer as a general-purpose register by saving it in a global variable. Not something compilers will ever do (especially not if you still need to take interrupts!), but great if you're trying to squeeze out the last few percent of a bit blitter. However, such tricks only apply in very specialized circumstances.

When you take any non-trivial amount of code (say 10+K lines) then it is obvious even the best assembly programmer is going to lose big time against a compiler. Even if he is able to apply some of those tricks, he cannot routinely apply all the global transformations compilers do, even given an infinite amount of time. Humans simply cannot do it - just like we can't compute the 100th digit of sin(x) in our head.

How would that matter? The C++ programmers were most likely not experts anyway. They moved to a new language and architecture and made a significant improvement doing so. A lot of the improvement was due to the compiler being able to aggressively inline functions and remove redundant code exposed by inlining. Not something assembly programmers can do.

Wilco

Reply to
Wilco Dijkstra

I think this discussion is going down the wrong path. There is no sense in "assembly vs. compiled" thinking: both approaches, taken as a religion, are plain wrong! If possible, IMHO a better approach is to code in a higher-level language like C or the like, then, if the specs require it, make an in-depth performance analysis of the resulting code. Take a very close look at the code that is time-critical and called frequently. So far I have always been able to optimise these cases. Sometimes the optimisation was done by choosing a better algorithm or by using a different approach to the "sub-problem", reversing the order things are done in - you name it.

Then there are also those rare cases where parts of the code were hand-crafted in assembly. There ARE examples where assembly can be better. Just imagine a situation where you have to send data from, say, CompactFlash out to the LAN as fast as possible. If you use assembly, you can create code which, at the same time it reads the CF data register, calculates the TCP checksum in an additional CPU register before the data is stored into the network controller. Such an optimisation avoids fetching the data twice. I don't think it would be easy, or possible at all, to do the same thing in C, so IMHO it's sometimes better to use the brain instead of following a dogma blindly :-)
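For what it's worth, the fused read-and-checksum idea can at least be approximated in C. This is only an illustrative sketch with made-up names (buffer-to-buffer; real code would read the CF data register and write the NIC instead), accumulating the Internet one's-complement checksum in the same pass that moves the data:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: src stands in for the CF data stream, dst for
 * the network controller. Each 16-bit word is fetched once, stored
 * once, and folded into the checksum on the way through. */
static uint16_t copy_and_checksum(uint16_t *dst, const uint16_t *src,
                                  size_t nwords)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < nwords; i++) {
        uint16_t w = src[i];   /* one read...          */
        dst[i] = w;            /* ...one write...      */
        sum += w;              /* ...checksum for free */
    }
    while (sum >> 16)          /* fold carries (one's-complement add) */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Whether a compiler fuses the copy and the checksum this tightly is compiler- and target-dependent; the point of the post stands - in assembly you can guarantee it.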

Just my 2¢ of course.

Markus, running to get a flame proof suit :-)

Reply to
Markus Zingg

They are probably the ones who wrote the compiler code generator and optimizer.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!
Reply to
CBFalconer

Well, this has been the mantra ever since we had languages more advanced than assembly language, and it used to be unequivocally true, no doubt. However code for modern 32-bit(+) micros is very difficult to hand-optimize, especially for speed; I'm no longer firmly convinced that this statement is globally true. And I'm /utterly/ convinced that sometimes a HLL is overall the most efficient way of doing things, even sometimes on "small" (8-bit) systems.

There are so many subtle things to remember on these complex modern cores. I doubt there are many people, if any, who can sit down and hand-write "the optimized loop" for a given function on x86 with the same degree of one-pass optimization that, say, a really experienced 6502 programmer could show. *Maybe*, given enough time and analysis, a really dedicated and skilful programmer could beat the world's best compiler. The human will probably be able to perceive the global (system-wide) context better, hence can perform some cross-module optimization. But this argues that the algorithm should be restructured so the compiler can do the hard work.
Reply to
Lewin A.R.W. Edwards

... snip ...

No argument. I was simply pointing out a theoretical limit. It is almost always cheaper to spend money on more hardware than on the last ounce of efficiency and compression.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!
Reply to
CBFalconer

Obviously not, because a well-tooled assembly programmer can always run the compiler and inspect its output to find ways to improve it (or keep it as is and call it "assembler source code"). The compiler can never do anything like that, so the human programmer has an unfair advantage.

--
Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.
Reply to
Hans-Bernhard Broeker

A much more interesting exercise is recoding an assembly routine in C and looking at the difference in generated code from a good compiler. The first time I did this as a serious exercise was with a math library for our compiler that supports the Microchip PIC. The C version of the library was as tight as assembler and would run on all variations and addressing modes of the PIC.

w..

CBarn24050 wrote:

Reply to
Walter Banks


Had you tried the exercise with a math library different from the C standard you would have found a relatively large difference in favor of the assembly program. It stands to reason that the C compiler is (well, should be) very good at doing C things. The further you deviate from those things, the better assembly looks.

Best regards, Spehro Pefhany

--
"it's the network..."                          "The Journey is the reward"
speff@interlog.com             Info for manufacturers: http://www.trexon.com
Embedded software/hardware/analog  Info for designers:  http://www.speff.com
Reply to
Spehro Pefhany

Finally. The right point. Thanks!

Jon

Reply to
Jonathan Kirwan

Whilst it may be true that compilers are improving, it is an unfortunate fact that programmers are not. Particularly on smaller embedded systems where resources like RAM and MIPS are in short supply, it is all too easy for programmers brought up on HLLs alone to create algorithms which make poor use of the available resources. A compiler may well be able to produce code as good as any assembly programmer given the algorithm, but that is not where the problem often lies. IME, assembly language programmers are much better at creating resource-efficient algorithms, whether coded in HLL or assembler, than programmers with no assembler experience.

Ian

--
Ian Bell
Reply to
Ian Bell

My experience with many compilers for many different languages over the years has been that a decent (not a super) ASM programmer can reduce a program's size and/or increase its speed by about a factor of 4. The only exceptions I've ever encountered were the compilers DEC produced for the VAX -- it was very difficult to beat those.

[I fed one of the "standard" benchmark programs to the VAX Fortran compiler one time for fun. The compiler analyzed the source, realized that the program did nothing useful, and optimized it to zero code and zero running time.]
Reply to
Everett M. Greene

It clearly depends on the unit cost increase multiplied by the number to be manufactured versus the cost of developing the more efficient version. If the quantities are large, it is almost always cheaper to spend the money on additional development.

Ian

--
Ian Bell
Reply to
Ian Bell

While I fully agree that even an experienced assembly programmer could not consistently keep up with a good C-compiler in a large system, a good assembly programmer can improve the performance of some device-specific operations, e.g. copying data between normal memory and video RAM compared to the standard memcpy library routine.

First of all, the assembly programmer can force the critical data to be aligned according to memory width or cache line width. Of course, the C-compiler could do this alignment, but if applied to all data, quite a lot of memory would be lost. Often the assembler programmer knows how much data is actually going to be transferred, which helps to unroll the loop for best performance.
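The known-length point can be sketched in C as well (the names here are illustrative, not from any real library): if the programmer knows the transfer is, say, a multiple of four aligned 32-bit words, the copy loop can be unrolled by hand:

```c
#include <stdint.h>
#include <stddef.h>

/* Caller guarantees: dst/src are word-aligned and nwords is a
 * multiple of 4. Moving four words per iteration cuts the loop
 * overhead (counter update, compare, branch) to a quarter. */
static void copy_words_unrolled4(uint32_t *dst, const uint32_t *src,
                                 size_t nwords)
{
    for (size_t i = 0; i < nwords; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}
```

Forcing the alignment itself is compiler-specific; under GCC, for instance, something like `__attribute__((aligned(32)))` on the buffer declaration would do it.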

Prefetching - i.e. touching (but not actually using) one byte in each cache line to load the line into the cache, and only then starting the actual copy - avoids some memory bandwidth bottlenecks.
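The preload trick might look like this in C (a sketch: CACHE_LINE is an assumed, target-dependent value, and the volatile read keeps the compiler from deleting the "useless" touch):

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 32u   /* assumption: real line size is target dependent */

/* Touch one byte per cache line first, so the source lines are
 * (hopefully) resident before the copy loop starts hitting them. */
static void copy_with_preload(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += CACHE_LINE)
        (void)*(volatile const uint8_t *)(src + i);  /* pull in the line */

    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```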

While a compiler writer could include such features, using such features in the generated code might not be sensible, if it would apply only to one processor and it might in fact be counterproductive on some other processor version.

It seems to be a sport to write very efficient memcpy routines, but in general, simple memory to memory transfers do not make much sense in real life, since most of such transfers could have been eliminated with better data structure design. However, with wildly different memory read, write or read/write performance, it could make a lot of sense to transfer large amounts of data as a block transfer, especially when no DMA transfer is available.

Paul

Reply to
Paul Keinanen



Maybe asm programmers are better at this than pure (or asm-unaware) HLL programmers, but here are a few key points that both must be familiar with:

- asm optimization won't do much for an initially poor algorithm (I mean, if it's implemented with O(N**2) MIPS cost when O(N*log(N)) is possible, it doesn't matter what language is chosen, unless your N is a small number and there are no outer loops)

- redundant operations will lower the performance no matter what language they are written in, so any calculation or access that can be avoided should be (with asm, though, you have better control here)

- if there are parts in the program that need data memory at mutually exclusive periods of time, it makes sense to share memory between them. Even C/C++ has "union" keyword helping to do it (in asm you can always do things like that).

- if your memory isn't strictly divided into two distinct categories, program and data (the Harvard architecture), and can be used for both, then you can share the stack/heap/uninitialized-yet-bss with some of the startup code that runs just once at the beginning of the program. And this mostly has to do with the linker, so the language is again up to you, because this sort of size optimization is language independent

- finally, without resorting to asm optimization, one could choose HLL constructs that fit the target CPU instructions better. I mean, if branching is expensive, unroll the loops and replace conditional branching by logic. If logical operations are performed worse than arithmetical ones (which may happen with e.g. DSPs that have quick addition and multiplication and such), consider replacing (A && B) by (A*B), etc. etc.

In practice, when embedded systems are considered and the cost should be real low, all kinds of optimizations and tricks are handy. If the production volumes are big, higher development cost will be covered by the savings.

And I'd say that algorithm replacement/refinement is the ultimate optimization. Then you can further squeeze and speed things up in asm to the absolute minimum. A bad algorithm, especially one optimized in asm, can cost as much as several algorithms because (a) it's bad and (b) if you later replace it, the earlier asm optimization is simply thrown away.

Alex

Reply to
Alexei A. Frounze

snip of much sensible stuff

Which is just what I was trying to say. The key question is 'how do we best implement this function on this platform?' To answer this you need to understand 'this platform' in sufficient depth and know how to create an efficient algorithm.

Ian

--
Ian Bell
Reply to
Ian Bell

Again, agreed. I just haven't spent my life in the large volume situation, but I have spent it in the reliability situation. Now consider Microsoft ....

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!
Reply to
CBFalconer

Sometimes but not always...

That is very target and particular part of application specific, whereas most compilers are general purpose so will have flaws for some circumstances.

Or for that matter in other applications on the same processor.

Depends how you have saved configuration data for that specific unit and how big that data set is. Having no hard drive or media card is a good enough reason to save setup data in Flash or even EEPROM, data that may well be read by memcpy. A typical printer (even networked) does not need a large dataset of customisation variables.

It also depends on a lot of things. For example, I quite often use memcpy to copy some small sets of system defaults between external Flash and internal RAM, because the overhead of setting up a DMA transfer would take longer. However, if I am copying large chunks of data from Flash to dual-ported hardware RAM or similar, then I do use DMA transfer, even for a memory to memory-mapped transfer. Often followed by memcmp to validate the data set, then a smaller C routine I have control over to isolate the actual failing location and values, to report and act upon.

--
Paul Carpenter | snipped-for-privacy@pcserviceselectronics.co.uk
PC Services
GNU H8 & mailing list info
For those web sites you hate

Reply to
Paul Carpenter


In what way could a math library be C specific? The math functions are the same independently of the language - at the end of the day you pass in some floating point value(s), some computation happens, and you get a result - that's it. So it's no surprise that the same libraries are used for various languages such as C, C++, Java and Fortran, as they all define a common set of math functions.

Wilco

Reply to
Wilco Dijkstra
