reducing flash size in embedded processors?

Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)
On 5 Oct 2004 04:43:53 -0700, snipped-for-privacy@larwe.com (Lewin A.R.W. Edwards)


While I fully agree that even an experienced assembly programmer could
not consistently keep up with a good C compiler in a large system, a
good assembly programmer can improve the performance of some device-
specific operations, e.g. copying data between normal memory and video
RAM, compared to the standard memcpy library routine.

First of all, the assembly programmer can force the critical data to
be aligned according to the memory width or cache line width. Of course,
the C compiler could do this alignment, but if applied to all data,
quite a lot of memory would be wasted. Often the assembly programmer
knows how much data is actually going to be transferred, which helps
in unrolling the loop for best performance.

Prefetching, i.e. touching (but not transferring) one byte in each
cache line, loads that line into the cache; starting the actual copy
only after this preload avoids some memory bandwidth bottlenecks.
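The preload idea can be sketched in C. This is a hypothetical example: the 32-byte line size is an assumption, and whether the touch loop helps at all is entirely target specific.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32   /* assumed line size; target specific */

/* Touch one byte per cache line of the source so the lines are
   (hopefully) resident before the word-wide copy loop runs.  The
   volatile qualifier keeps the compiler from deleting the otherwise
   "useless" reads. */
void copy_with_preload(uint32_t *dst, const uint32_t *src, size_t bytes)
{
    const volatile uint8_t *touch = (const volatile uint8_t *)src;
    for (size_t i = 0; i < bytes; i += CACHE_LINE)
        (void)touch[i];                            /* preload touch */

    for (size_t i = 0; i < bytes / sizeof(uint32_t); i++)
        dst[i] = src[i];                           /* aligned word copy */
}
```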

While a compiler writer could include such features, emitting them in
the generated code might not be sensible if they apply to only one
processor; they might in fact be counterproductive on some other
processor version.

It seems to be a sport to write very efficient memcpy routines, but in
general, simple memory-to-memory transfers do not make much sense in
real life, since most such transfers could have been eliminated
with better data structure design. However, with wildly different
memory read, write or read/write performance, it can make a lot of
sense to transfer large amounts of data as a block transfer,
especially when no DMA transfer is available.

Paul


Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)


And whilst it may be true that compilers are improving, it is an unfortunate
fact that programmers are not.  Particularly on smaller embedded systems
where resources like RAM and MIPS are in short supply, it is all too easy
for programmers brought up on HLLs alone to create algorithms which make
poor use of the available resources.  A compiler may well be able to
produce code as good as any assembly programmer given the algorithm, but
that is not where the problem often lies.  IME, assembly language
programmers are much better at creating resource-efficient algorithms,
whether coded in HLL or assembler, than programmers with no assembler
experience.

Ian

--
Ian Bell

Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)

Maybe asm programmers are better at this than pure (or asm-unaware) HLL
programmers, but here are a few key points that both must be familiar with:

- asm optimization won't do much for an initially poor algorithm (I mean,
if it's implemented with O(N**2) cost where O(N*log(N)) is possible, it
doesn't matter what language is chosen, unless your N is a small number
and there are no outer loops)

- redundant operations (no matter in which language they are written) lower
performance, so any calculation or access that can be avoided should be, no
matter what language is used (with asm, though, you have better control
here)

- if there are parts of the program that need data memory at mutually
exclusive periods of time, it makes sense to share memory between them. Even
C/C++ has the "union" keyword to help do this (in asm you can always do
things like that).
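A minimal sketch of that sharing in C (the buffer names and sizes are made up for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Two buffers that are live at mutually exclusive times can share one
   allocation instead of costing RAM twice. */
union scratch {
    uint8_t rx_frame[64];   /* used while a frame is being received */
    char    log_line[64];   /* used later, while formatting a log line */
};

static union scratch scratch;   /* 64 bytes of RAM, not 128 */

void on_frame(const uint8_t *data, size_t n)
{
    memcpy(scratch.rx_frame, data, n);              /* phase 1: receive */
}

void on_log(const char *msg)
{
    strncpy(scratch.log_line, msg,                  /* phase 2: logging */
            sizeof scratch.log_line - 1);
    scratch.log_line[sizeof scratch.log_line - 1] = '\0';
}
```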

- if your memory isn't strictly divided into two distinct categories,
program and data (the Harvard architecture), and can be used for both, then
you can share the stack/heap/not-yet-initialized bss with some of the
startup code that runs just once at the beginning of the program. This
mostly has to do with the linker, so the language is again up to you: this
sort of size optimization is language independent

- finally, without resorting to asm optimization, one could use HLL
constructs that fit the target CPU instructions better. I mean, if branching
is expensive, unroll the loops and replace conditional branching by logic.
If logical operations perform worse than arithmetical ones (which may happen
with e.g. DSPs that have quick addition and multiplication and other
things), consider replacing (A && B) by (A*B), etc. etc.
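As an illustration of replacing a branch by logic (a sketch; whether the comparison itself compiles branch-free depends on the target):

```c
#include <stdint.h>

/* Branching version: a conditional jump the CPU must resolve. */
uint8_t max_branch(uint8_t a, uint8_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Branchless version: build an all-ones or all-zeros mask from the
   comparison and select with AND/OR instead of jumping. */
uint8_t max_logic(uint8_t a, uint8_t b)
{
    uint8_t mask = (uint8_t)-(a > b);   /* 0xFF if a > b, else 0x00 */
    return (uint8_t)((a & mask) | (b & (uint8_t)~mask));
}
```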

In practice, when embedded systems are considered and the cost should be
real low, all kinds of optimizations and tricks are handy. If the production
volumes are big, higher development cost will be covered by the savings.

And I'd say that algorithm replacement/refinement is the ultimate
optimization. Then you can further squeeze and speed things up in asm to the
absolute minimum. A bad algorithm, especially one optimized with asm, can be
as expensive as several algorithms, because (a) it's bad and (b) if you
consider replacing it, the earlier asm optimization is just thrown away.

Alex



Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)


snip of much sensible stuff

Which is just what I was trying to say.  The key question is 'how do we best
implement this function on this platform?' To answer this you need to
understand 'this platform' in sufficient depth and know how to create an
efficient algorithm.

Ian
--
Ian Bell

Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)


It is certainly true that a programmer who is familiar with the target's
assembly language is going to be able to write C code that compiles to
smaller and faster object code.  Whenever I start working with a new
architecture, I make a point of writing at least some code in assembly, just
to get the feel of it.  When coding in C, I like to look at the generated
assembly code to see what is being produced.  There are plenty of occasions
when the generated code is not critical (remember Knuth's second law of
optimisation (for experts only): "Don't do it yet").  But when it is
critical, and especially on small micros, knowing the target assembly is
essential.  You can shave a few cycles off a loop by choosing while() or
for() loops, or having your index variables count up or down, depending on
the target and the compiler.  On 8-bitters, it is particularly easy to spot
code written by people who don't understand their target: the code is full
of ints instead of uint8_t or similar.
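For example (hypothetical functions; the code-size difference only shows up on an 8-bit target, where the compiler must emit multi-byte arithmetic for the wider type):

```c
#include <stdint.h>

/* With an int index, an 8-bit core must do 16-bit increments and
   compares on the counter - typically several instructions each. */
uint16_t sum_int_index(const uint8_t *buf)
{
    uint16_t sum = 0;
    for (int i = 0; i < 100; i++)
        sum += buf[i];
    return sum;
}

/* With a uint8_t index, the counter fits in one register on an
   8-bitter; identical behaviour, smaller and faster object code. */
uint16_t sum_u8_index(const uint8_t *buf)
{
    uint16_t sum = 0;
    for (uint8_t i = 0; i < 100; i++)
        sum += buf[i];
    return sum;
}
```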




Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)
On Tuesday, in article


Sometimes but not always...

[...]

That is very target and particular part of application specific, whereas
most compilers are general purpose so will have flaws for some circumstances.

[...]

Or for that matter in other applications on the same processor.

[...]

Depends how you have saved configuration data for that specific unit
and how big that data set is. Having no hard drive or media card
is a good enough reason to save setup data in Flash or even EEPROM,
data that may well be read with memcpy. A typical printer (even a
networked one) does not need a large dataset of customisation variables.

[...]

It also depends on a lot of things. For example, I quite often use
memcpy to copy some small sets of system defaults between external
Flash and internal RAM, because the overhead of setting up a DMA
transfer would take longer. However, if I am copying large chunks
of data from Flash to dual-ported hardware RAM or similar, then I
do use a DMA transfer, even for a memory to memory-mapped transfer,
often followed by memcmp to validate the data set, and then a small
C routine I have control over to isolate the actual failing location
and values, to report and act upon.

--
Paul Carpenter          | snipped-for-privacy@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/ PC Services
Re: Assembly vs Compiled


Obviously not, because a well-tooled assembly programmer can always
run the compiler and inspect its output to find ways to improve it (or
keep it as is and call it "assembler source code").  The compiler can
never do anything like that, so the human programmer has an unfair
advantage.

--
Hans-Bernhard Broeker ( snipped-for-privacy@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.

Re: reducing flash size in embedded processors?


Whilst it may be true that compilers are improving, it is an unfortunate
fact that programmers are not.  Particularly on smaller embedded systems
where resources like RAM and MIPS are in short supply, it is all too easy
for programmers brought up on HLLs alone to create algorithms which make
poor use of the available resources.  A compiler may well be able to
produce code as good as any assembly programmer given the algorithm but
that is not where the problem often lies.  IME, assembly language
programmers are much better at creating resource efficient algorithms,
whether coded in HLL or assembler, than programmers with no assembler
experience.

Ian

--
Ian Bell

Re: reducing flash size in embedded processors?

Eh.  I have yet to see a good C compiler, then.  For example, in one
project, I had a simple data copying loop written in C.  I tried two
different compilers and they both compiled the loop to eight (8) machine
language instructions (although the instructions used were different ones).
Any asm programmer would have used only two (2) instructions for the same
loop.

This is an extreme example; it isn't that bad all the time.  Sometimes
C compilers know tricks that an asm programmer would never bother with -
but a few instructions later, they do something utterly stupid and waste
both memory and execution cycles.

I have been told that a C compiler can beat an asm programmer for almost
20 years, but looking at the code produced by compilers, this _still_
isn't true.  This doesn't matter so much any more, thanks to faster CPUs
and larger memories, though.

And yes, I'm using C for almost everything for portability and other
reasons, but it still makes me cry when I disassemble an interrupt
routine compiled by a modern C compiler..

  -jm

Re: reducing flash size in embedded processors?


What architecture was this? It's well known that most 8- and 16-bit CISC
architectures are hard (sometimes terrible) compiler targets, and so there
are few professional compilers available. Things are very different in the
32-bit world, especially on the RISCs.

In any case, why didn't you use memcpy? Most compilers would then
understand that you wanted to copy some bytes and emit efficient code
to do so (memcpy is used a lot in embedded applications). Advanced
compilers do pattern matching and replace copy loops with the real
thing, but this isn't widely done yet.
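The contrast the poster describes, sketched in C (a naive compiler emits the byte loop literally, while the library memcpy typically copies word-at-a-time and small fixed-size calls are often inlined):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hand-rolled byte loop: worst case, one byte per iteration. */
void copy_loop(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* memcpy: lets the compiler/library use the efficient version. */
void copy_memcpy(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}
```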

[...]

Perhaps you should complain to your compiler vendor. Especially with
free/cheap compilers, if people don't show inefficient examples, code
quality won't improve.

[...]

In the embedded space it's true for ARM at least, and I wouldn't be
surprised if it's true for PowerPC, SH and MIPS too. Outside the embedded
world I don't think there are many people who understand the basics of the
CPU they are programming on (instruction cycle timings have been obsolete
for almost 15 years), let alone write highly optimal code. x86 and IA-64
are examples of complicated architectures with even more complicated
implementations where assembly programming is almost exclusively done
for fun by the 10 or so people left who know how to do it!

Wilco



Re: reducing flash size in embedded processors?


Even the more recent 8 and 16 bit architectures (AVR, H8, etc.)
are much better targets than something like the 6811 (mediocre)
and the 8051 (awful).  The last "6811" compiler I used (Introl)
just wasn't very good.  It didn't even use the second index
register, and a brain-dead assembly-language peephole optimizer
with a 3-instruction window could have made significant
improvements in the code. Today, the HC11/HC12 port of GCC
generates much better code than the commercial compilers I used
a few years ago.

--
Grant Edwards                   grante             Yow!  Is it NOUVELLE
                                  at               CUISINE when 3 olives are
Re: reducing flash size in embedded processors?
A much more interesting exercise is recoding an assembly routine in C and
looking at the difference generated by a good compiler. The first time I did
this as a serious exercise was with a math library for our compiler that
supports the Microchip PIC. The C version of the library was as tight as the
assembler and would run on all variations and addressing modes of the PIC.

w..

CBarn24050 wrote:



Re: reducing flash size in embedded processors?
On Tue, 05 Oct 2004 09:05:19 -0400, the renowned Walter Banks


Had you tried the exercise with a math library different from the C
standard one, you would have found a relatively large difference in favor
of the assembly program. It stands to reason that the C compiler is
(well, should be) very good at doing C things. The further you deviate
from those things, the better assembly looks.


Best regards,
Spehro Pefhany
--
"it's the network..."                          "The Journey is the reward"
snipped-for-privacy@interlog.com             Info for manufacturers: http://www.trexon.com
Re: reducing flash size in embedded processors?
On Tue, 05 Oct 2004 10:09:37 -0400, Spehro Pefhany


Finally.  The right point.  Thanks!

Jon

Re: reducing flash size in embedded processors?
On Tue, 05 Oct 2004 17:15:53 GMT, Jonathan Kirwan


As the years go by, I seem to be coding less and less assembly, even
though the chips aren't getting any bigger...

Just a datapoint.  Conclusions I leave to the reader.

I looked through some recent projects.  These are the uses for
assembly I have had in the last year or so:

1) Modifying C startup code to perform a RAM test before initializing
the data segment.  You can't do that in C.

2) Calling HC908 resident ROM routines to erase and write Flash pages.
The assembly simply puts the parameters into the correct registers and
performs a call to the absolute address.  I might be able to do that in C
by using special extensions, but I'm not sure.

3) Reversing the bits in a byte before writing them to a resistor
ladder DAC that was wired in backwards.  This could be done quite
easily in C, so I wrote a quickie comparison (avr-gcc):

--- begin included file ---
typedef unsigned char U8;

U8 reverse_a(U8 v)
{
    U8 retval, bit_ctr;

    asm (

        "ldi    %1, 8           \n\t"
    "reverse_0:                 \n\t"
        "ror    %2              \n\t"
        "rol    %0              \n\t"
        "dec    %1              \n\t"
        "brne   reverse_0       \n\t"
        /* ldi needs an upper register, hence the "d" constraint on the
           counter; v is modified in place, so mark it read-write. */
        : "=&r" (retval), "=&d" (bit_ctr), "+r" (v)

        );

    return retval;
}


U8 reverse_c(U8 v)
{
    U8 retval = 0;   /* must be initialised: it is built up by shifting */
    U8 count = 8;

    do
    {
        retval <<= 1;
        if (v & 1) retval |= 1;
        v >>= 1;

    } while (--count);

    return retval;
}

---end included file---

With -Os, the inline assembly version was 8 words long, compared to 10
for the C version (AVR instruction memory is 16 bits wide).  Both of the
extra instructions are in the loop, so the C version is approximately 25%
slower.  The assembly version benefits from the use of the carry bit,
which allows it to combine the test-and-OR operation with a shift
(this is the source of the two extra instructions).

Note: The assembly version is tested and works. The C version is
untested, but took less time to write (the gcc inline assembly syntax
is arcane and I don't use it very often).  The function probably isn't
called often enough nor is memory tight enough to justify dropping
into assembly, but maintenance is a non-issue (a new spin of the board
wired the DAC correctly, and the routine was excised from the code.)

Regards,

                               -=Dave
--
Change is inevitable, progress is not.

Re: reducing flash size in embedded processors?


Not as a specific response to you, but just using your comment as a foil to say
something else....

Reasons I've used assembly:

(1) On a DSP with only 1k word, but costing only a few dollars, I wrote an
application entirely in assembly.  No hardware FP, so I wrote custom FP routines
for the need -- a 32/16 divide, with normalization from an existing barrel
shifter; a specialized least squares fitting routine; etc.  There is (and was)
no possible way to write this in C and still fit the necessary features.
(You'll just have to take my word on that fact.)  The system was written in
1990, is still in use in various products (as is this DSP), and the code has
been ported into new products with very little change to existing routines
and only a few additional ones.  Well fitted to the application space in the
first design effort.  Entirely in assembly.  No C.

(2) In a case where I needed to again fit into a 2k word micro space (cost of
the larger 32k chips would have 'broken' the budget), wrote an application
entirely in assembly.  Specialized use of PIC features (direct modification of
the low order instruction pointer byte for a calculated jump) allowed me to
shovel in a very sophisticated state machine needed for operation.  None of the
C compilers available could have used that feature on this chip and, as a
result, still fit the application into the space available.  Entirely in
assembly.  No C.  Necessity.

(3) In a retrofit task, I needed to support high speed resource negotiation on
two I/O lines (all that was available) for a widely varying number of entirely
asynchronous devices all competing for a single common resource -- an RS-232
level shifter connected to a PC serial port.  Variability in timing, putting out
a 1 bit or a 0 bit, had to be added to overall wait time for each processor
trying to negotiate (actually, 2X that uncertainty.)  Since this was wire-OR and
since the PIC requires different logic for handling a 1 or 0 in that case, the
IF and ELSE cases for the bit value had to either have exactly the same timing
or else the variability in the two would have to be doubled and then added to
the total bit time.  C has no #pragma on the compilers I've seen to permit
specifying that the IF and ELSE clauses have the same timing.  In this case,
that's important.  Again, in assembly -- NOPs easily added as needed.  (Took me
only the morning to completely write and retrofit the code, we were running
correctly in the afternoon, with dozens of devices which previously had never
been connected in this way, working fine.)

(4) A case where extremely low power, tiny size, sophisticated performance and
exact timing requirements... the whole thing fits into a tiny TO-5 can in
wire-bonded die form.  Going to C would have necessitated a larger die for the
micro.  No room, sorry.  Using C would have forced a more expensive can, larger,
probably some more power which would mean ... more heat to remove ... etc.  It
would have been less competitive.  Again?  Assembly.  No C.  Necessity.

And many, many other reasons, as well.

Of course, many of my applications also use C and/or C++.  But there is a
time and a place for assembly.

A key point is that many embedded applications are highly cost sensitive; or
highly size sensitive; or highly power-consumption sensitive (battery); or
combinations of these and other qualities.  Even when these don't necessarily
dictate a specific approach, they may mean that your relative competitiveness
is different if you choose C, because of the impact it may have on one or
more of these characteristics.  I've had a case where the change in die size
needed to support C development would cause: (1) a larger die; (2) leading to
a larger metal can package; with (3) more power consumption; (4) leading to
more peltier TEC cooling power being required; resulting in (5) much higher
system cost, larger size, and (6) very much lower competitiveness and
utility.

"If one is willing to narrow their vision and refuse to be comprehensive in
their perspective, they might well be willing to believe the world is flat."

When you are looking at the product from a comprehensive system engineering
view, whether or not to use C or assembly or some mix of those or other things
becomes only one factor of many to consider.  It is in this fuller context,
taken together with other priorities such as power consumption, cost, size, and
perhaps even performance speed -- it is in balancing all these and comparing the
overall difference in the __total time__ to produce the product -- that one
could reasonably make the choice to use assembly instead of C.  Looking only at
the difference in time of C vs assembly coding effort for a single function, for
example, tells you very little about this larger reality.

Jon

Re: reducing flash size in embedded processors?

You can do that perfectly well in C.  The last time I wrote specialised
startup code (for an MPC561), virtually everything was in C.  I had to have
assembly macros for accessing special function registers, since there is no
way to use them in C.  Other than that, I have three assembly instructions:
two to set up a stack pointer, and one to jump to the start of the C startup
routine.  All the code to set up chip selects, clear out the bss, copy the
program from flash to RAM, etc., is a C routine, ending with a call to
main().

It might still be better to write such code in assembly - I've done both.
But you don't have to have your data segment ready before using C - you only
have to have it ready before main() starts.
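A host-testable sketch of that startup sequence. The arrays here stand in for the linker-provided section symbols (whose real names, e.g. `__bss_start`, vary by toolchain), so the shape of the code can be checked without a target:

```c
#include <stdint.h>
#include <string.h>

/* Stand-ins for the linker symbols: the bss region, the flash image of
   .data, and the RAM run address of .data. */
static uint8_t bss[16] = {1, 2, 3};                  /* pretend-dirty RAM */
static const uint8_t data_load[8] = {1, 2, 3, 4, 5, 6, 7, 8};
static uint8_t data_ram[8];

/* Runs after a minimal assembly stub has set up the stack pointer;
   in the real thing it would end with a call to main(). */
void c_startup(void)
{
    memset(bss, 0, sizeof bss);                   /* clear out the bss */
    memcpy(data_ram, data_load, sizeof data_ram); /* copy .data flash->RAM */
    /* ...set up chip selects, clocks, etc., then call main() */
}
```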


[...]

For that sort of thing, go for assembly every time - that way you are sure
you get it right.

<snip example details>

Some algorithms, particularly ones that work best with access to a carry bit
and rotate instructions, are a lot more efficient in assembly; C simply
lacks any way to express these concepts properly.  They can often be easier
to write and understand in assembly, as well as being smaller and faster.
CRC routines are another good example.
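For instance, a bitwise CRC-8 over the standard 0x07 polynomial: in C, each iteration must rederive from the top bit the "carry out" that a rotate-through-carry instruction would hand over for free in assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-8, polynomial 0x07, init 0x00 (the parameters used by
   e.g. the SMBus PEC).  The (crc & 0x80) test is the C substitute for
   the carry flag an assembly rotate would provide directly. */
uint8_t crc8(const uint8_t *p, size_t n)
{
    uint8_t crc = 0;
    while (n--) {
        crc ^= *p++;
        for (uint8_t bit = 0; bit < 8; bit++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}
```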





Re: reducing flash size in embedded processors?
On Wed, 6 Oct 2004 08:57:49 +0200, "David Brown"

[...I wrote...]


With linker tricks you can write C code that will be executed before
main.  But as you note later, you need a stack to call it.  And
writing a proper RAM test in C is, ummm, more than difficult.

[...]

My routine to erase a page of flash looks like this:

   void FLASH_erase_page(void High *page)
   {
   // Low byte of page is in a, high byte in x
   //
       #asm
       pshx                            ; push high byte
       pulh                            ; pull into h
       tax                             ; copy low byte into x
       lda     #OPERATING_FREQUENCY
       sta     _CPUSPD
       bclr    MASS_ERASE,_CTRLBYT     ; set up for page erase
       jsr     ERARNGE                 ; erase page
       #endasm
   }

Where the symbols are defined earlier in the file.  Some compilers offer
extensions that might allow this to be written as

   void FLASH_erase_page(void High *page)
   {
      CPUSPD = OPERATING_FREQUENCY;
      CTRLBYTE &= ~MASS_ERASE_MASK;
      __HX = page;
      ERARNGE();
   }

FWIW, I don't believe the compiler I'm using provides such extensions.

[...]

The point of this example seems to be that this appeared to be such a
case, but the actual gain was smaller than might be expected.

However, a few years back I posted a pathological checksum function
(involving the rotate instruction and carry bit) that was six
instructions long in assembly, but I couldn't get to less than about
40 instructions using C.  See
http://groups.google.com/groups?selm36%a8c1b6.10586772%40192.168.2.34
for details if you're interested.

Regards,

                               -=Dave
--
Change is inevitable, progress is not.

Re: reducing flash size in embedded processors?


In what way could a math library be C specific? The math functions
are the same independently of the language: at the end of the day
you pass in some floating point value(s), some computation happens,
and you get a result - that's it. So it's no surprise that the same
libraries are used for various languages such as C, C++, Java and
Fortran, as they all define a common set of math functions.
Wilco



Re: reducing flash size in embedded processors?

Math functions for embedded work are often tailored to the specific
application set. E.g. I might need an arctan that *never* takes longer than
100 instructions, to 10 ppm worst-case accuracy, without an FPU, and that
can be called in an interrupt. I code that myself in assembly using tables.
The C libs I've seen are *not* re-entrant and might take as long as 1000
instructions. (The numbers are all ball-park, but the example is real.) The
C versions are more accurate, but I don't need that accuracy. I need speed
and I need it *now*. I also don't really need or want floating point, so I
usually output the result in a fixed radix ready to scale to a DAC.

Bob



