reducing flash size in embedded processors?

Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)
On 5 Oct 2004 04:43:53 -0700, snipped-for-privacy@larwe.com (Lewin A.R.W. Edwards)


While I fully agree that even an experienced assembly programmer could
not consistently keep up with a good C compiler in a large system, a
good assembly programmer can improve the performance of some device-
specific operations, e.g. copying data between normal memory and video
RAM, compared to the standard memcpy library routine.

First of all, the assembly programmer can force the critical data to
be aligned according to the memory width or cache line width. Of course,
the C compiler could do this alignment, but if applied to all data,
quite a lot of memory would be wasted. Often the assembly programmer
knows how much data is actually going to be transferred, which helps
in unrolling the loop for best performance.

Prefetching, i.e. touching (but not transferring) one byte in each
cache line, loads that line into the cache; starting the actual copy
only after this preload avoids some memory bandwidth bottlenecks.
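The preload idea can be sketched in C. This is a hypothetical example: the 32-byte line size is an assumption, and whether the touch loop helps at all is entirely target specific.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32   /* assumed line size; target specific */

/* Touch one byte per cache line of the source so the lines are
   (hopefully) resident before the word-wide copy loop runs.  The
   volatile qualifier keeps the compiler from deleting the otherwise
   "useless" reads. */
void copy_with_preload(uint32_t *dst, const uint32_t *src, size_t bytes)
{
    const volatile uint8_t *touch = (const volatile uint8_t *)src;
    for (size_t i = 0; i < bytes; i += CACHE_LINE)
        (void)touch[i];                            /* preload touch */

    for (size_t i = 0; i < bytes / sizeof(uint32_t); i++)
        dst[i] = src[i];                           /* aligned word copy */
}
```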

While a compiler writer could include such features, emitting them in
the generated code might not be sensible if they apply to only one
processor; they might in fact be counterproductive on some other
processor version.

It seems to be a sport to write very efficient memcpy routines, but in
general, simple memory-to-memory transfers do not make much sense in
real life, since most such transfers could have been eliminated
with better data structure design. However, with wildly different
memory read, write or read/write performance, it can make a lot of
sense to transfer large amounts of data as a block transfer,
especially when no DMA transfer is available.

Paul


Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)


And whilst it may be true that compilers are improving, it is an unfortunate
fact that programmers are not.  Particularly on smaller embedded systems
where resources like RAM and MIPS are in short supply, it is all too easy
for programmers brought up on HLLs alone to create algorithms which make
poor use of the available resources.  A compiler may well be able to
produce code as good as any assembly programmer given the algorithm, but
that is not where the problem often lies.  IME, assembly language
programmers are much better at creating resource-efficient algorithms,
whether coded in HLL or assembler, than programmers with no assembler
experience.

Ian

--
Ian Bell

Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)

Maybe asm programmers are better at this than pure (or asm-unaware) HLL
programmers, but here are a few key points that both must be familiar with:

- asm optimization won't do much for an initially poor algorithm (I mean,
if it's implemented with O(N**2) cost where O(N*log(N)) is possible, it
doesn't matter what language is chosen, unless your N is a small number
and there are no outer loops)

- redundant operations (no matter in which language they are written) lower
performance, so any calculation or access that can be avoided should be, no
matter what language is used (with asm, though, you have better control
here)

- if there are parts of the program that need data memory at mutually
exclusive periods of time, it makes sense to share memory between them. Even
C/C++ has the "union" keyword to help do this (in asm you can always do
things like that).
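A minimal sketch of that sharing in C (the buffer names and sizes are made up for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Two buffers that are live at mutually exclusive times can share one
   allocation instead of costing RAM twice. */
union scratch {
    uint8_t rx_frame[64];   /* used while a frame is being received */
    char    log_line[64];   /* used later, while formatting a log line */
};

static union scratch scratch;   /* 64 bytes of RAM, not 128 */

void on_frame(const uint8_t *data, size_t n)
{
    memcpy(scratch.rx_frame, data, n);              /* phase 1: receive */
}

void on_log(const char *msg)
{
    strncpy(scratch.log_line, msg,                  /* phase 2: logging */
            sizeof scratch.log_line - 1);
    scratch.log_line[sizeof scratch.log_line - 1] = '\0';
}
```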

- if your memory isn't strictly divided into two distinct categories,
program and data (the Harvard architecture), and can be used for both, then
you can share the stack/heap/not-yet-initialized bss with some of the
startup code that runs just once at the beginning of the program. This
mostly has to do with the linker, so the language is again up to you: this
sort of size optimization is language independent

- finally, without resorting to asm optimization, one could use HLL
constructs that fit the target CPU instructions better. I mean, if branching
is expensive, unroll the loops and replace conditional branching by logic.
If logical operations perform worse than arithmetical ones (which may happen
with e.g. DSPs that have quick addition and multiplication and other
things), consider replacing (A && B) by (A*B), etc. etc.
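As an illustration of replacing a branch by logic (a sketch; whether the comparison itself compiles branch-free depends on the target):

```c
#include <stdint.h>

/* Branching version: a conditional jump the CPU must resolve. */
uint8_t max_branch(uint8_t a, uint8_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Branchless version: build an all-ones or all-zeros mask from the
   comparison and select with AND/OR instead of jumping. */
uint8_t max_logic(uint8_t a, uint8_t b)
{
    uint8_t mask = (uint8_t)-(a > b);   /* 0xFF if a > b, else 0x00 */
    return (uint8_t)((a & mask) | (b & (uint8_t)~mask));
}
```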

In practice, when embedded systems are considered and the cost should be
real low, all kinds of optimizations and tricks are handy. If the production
volumes are big, higher development cost will be covered by the savings.

And I'd say that algorithm replacement/refinement is the ultimate
optimization. Then you can further squeeze and speed things up in asm to the
absolute minimum. A bad algorithm, especially one optimized with asm, can be
as expensive as several algorithms, because (a) it's bad and (b) if you
consider replacing it, the earlier asm optimization is just thrown away.

Alex



Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)


snip of much sensible stuff

Which is just what I was trying to say.  The key question is 'how do we best
implement this function on this platform?' To answer this you need to
understand 'this platform' in sufficient depth and know how to create an
efficient algorithm.

Ian
--
Ian Bell

Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)


It is certainly true that a programmer who is familiar with the target's
assembly language is going to be able to write C code that compiles to
smaller and faster object code.  Whenever I start working with a new
architecture, I make a point of writing at least some code in assembly, just
to get the feel of it.  When coding in C, I like to look at the generated
assembly code to see what is being produced.  There are plenty of occasions
when the generated code is not critical (remember Knuth's second law of
optimisation (for experts only): "Don't do it yet").  But when it is
critical, and especially on small micros, knowing the target assembly is
essential.  You can shave a few cycles off a loop by choosing while() or
for() loops, or having your index variables count up or down, depending on
the target and the compiler.  On 8-bitters, it is particularly easy to spot
code written by people who don't understand their target: the code is full
of ints instead of uint8_t or similar.
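For example (hypothetical functions; the code-size difference only shows up on an 8-bit target, where the compiler must emit multi-byte arithmetic for the wider type):

```c
#include <stdint.h>

/* With an int index, an 8-bit core must do 16-bit increments and
   compares on the counter - typically several instructions each. */
uint16_t sum_int_index(const uint8_t *buf)
{
    uint16_t sum = 0;
    for (int i = 0; i < 100; i++)
        sum += buf[i];
    return sum;
}

/* With a uint8_t index, the counter fits in one register on an
   8-bitter; identical behaviour, smaller and faster object code. */
uint16_t sum_u8_index(const uint8_t *buf)
{
    uint16_t sum = 0;
    for (uint8_t i = 0; i < 100; i++)
        sum += buf[i];
    return sum;
}
```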




Re: Assembly vs Compiled (Was: reducing flash size in embedded processors?)
On Tuesday, in article


Sometimes but not always...

[...]

That is very target and particular part of application specific, whereas
most compilers are general purpose so will have flaws for some circumstances.

[...]

Or for that matter in other applications on the same processor.

[...]

Depends how you have saved configuration data for that specific unit
and how big that data set is. Having no hard drive or media card
is a good enough reason to save setup data in Flash or even EEPROM,
data that may well be read with memcpy. A typical printer (even a
networked one) does not need a large dataset of customisation variables.

[...]

It also depends on a lot of things. For example, I quite often use
memcpy to copy some small sets of system defaults between external
Flash and internal RAM, because the overhead of setting up a DMA
transfer would take longer. However, if I am copying large chunks
of data from Flash to dual-ported hardware RAM or similar, then I
do use a DMA transfer, even for a memory to memory-mapped transfer,
often followed by memcmp to validate the data set, and then a small
C routine I have control over to isolate the actual failing location
and values, to report and act upon.

--
Paul Carpenter          | snipped-for-privacy@pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/ PC Services
Re: Assembly vs Compiled


Obviously not, because a well-tooled assembly programmer can always
run the compiler and inspect its output to find ways to improve it (or
keep it as is and call it "assembler source code").  The compiler can
never do anything like that, so the human programmer has an unfair
advantage.

--
Hans-Bernhard Broeker ( snipped-for-privacy@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.

Re: reducing flash size in embedded processors?


Whilst it may be true that compilers are improving, it is an unfortunate
fact that programmers are not.  Particularly on smaller embedded systems
where resources like RAM and MIPS are in short supply, it is all too easy
for programmers brought up on HLLs alone to create algorithms which make
poor use of the available resources.  A compiler may well be able to
produce code as good as any assembly programmer given the algorithm but
that is not where the problem often lies.  IME, assembly language
programmers are much better at creating resource efficient algorithms,
whether coded in HLL or assembler, than programmers with no assembler
experience.

Ian

--
Ian Bell

Re: reducing flash size in embedded processors?

Eh.  I have yet to see a good C compiler, then.  For example, in one
project, I had a simple data copying loop written in C.  I tried two
different compilers and they both compiled the loop to eight (8) machine
language instructions (although the instructions used were different ones).
Any asm programmer would have used only two (2) instructions for the same
loop.

This is an extreme example; it isn't that bad all the time.  Sometimes
C compilers know tricks that an asm programmer would never bother with -
but a few instructions later, they do something utterly stupid and waste
both memory and execution cycles.

I have been told that a C compiler can beat an asm programmer for almost
20 years, but looking at the code produced by compilers, this _still_
isn't true.  This doesn't matter so much any more, thanks to faster CPUs
and larger memories, though.

And yes, I'm using C for almost everything for portability and other
reasons, but it still makes me cry when I disassemble an interrupt
routine compiled by a modern C compiler..

  -jm

Re: reducing flash size in embedded processors?


What architecture was this? It's well known that most 8- and 16-bit CISC
architectures are hard (sometimes terrible) compiler targets, and so there
are few professional compilers available. Things are very different in the
32-bit world, especially on the RISCs.

In any case, why didn't you use memcpy? Most compilers would then
understand that you wanted to copy some bytes and emit efficient code
to do so (memcpy is used a lot in embedded applications). Advanced
compilers do pattern matching and replace copy loops with the real
thing, but this isn't widely done yet.
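The contrast the poster describes, sketched in C (a naive compiler emits the byte loop literally, while the library memcpy typically copies word-at-a-time and small fixed-size calls are often inlined):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hand-rolled byte loop: worst case, one byte per iteration. */
void copy_loop(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* memcpy: lets the compiler/library use the efficient version. */
void copy_memcpy(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}
```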

[...]

Perhaps you should complain to your compiler vendor. Especially with
free/cheap compilers, if people don't show inefficient examples, code
quality won't improve.

[...]

In the embedded space it's true for ARM at least, and I wouldn't be
surprised if it's true for PowerPC, SH and MIPS too. Outside the embedded
world I don't think there are many people who understand the basics of the
CPU they are programming on (instruction cycle timings have been obsolete
for almost 15 years), let alone write highly optimal code. x86 and IA-64
are examples of complicated architectures with even more complicated
implementations where assembly programming is almost exclusively done
for fun by the 10 or so people left who know how to do it!

Wilco



Re: reducing flash size in embedded processors?


Even the more recent 8 and 16 bit architectures (AVR, H8, etc.)
are much better targets than something like the 6811 (mediocre)
and the 8051 (awful).  The last "6811" compiler I used (Introl)
just wasn't very good.  It didn't even use the second index
register, and a brain-dead assembly-language peephole optimizer
with a 3-instruction window could have made significant
improvements in the code. Today, the HC11/HC12 port of GCC
generates much better code than the commercial compilers I used
a few years ago.

--
Grant Edwards                   grante             Yow!  Is it NOUVELLE
                                  at               CUISINE when 3 olives are
Re: reducing flash size in embedded processors?
A much more interesting exercise is recoding an assembly routine in C and
looking at the difference generated by a good compiler. The first time I did
this as a serious exercise was with a math library for our compiler that
supports the Microchip PIC. The C version of the library was as tight as the
assembler and would run on all variations and addressing modes of the PIC.

w..

CBarn24050 wrote:



Re: reducing flash size in embedded processors?
On Tue, 05 Oct 2004 09:05:19 -0400, the renowned Walter Banks


Had you tried the exercise with a math library different from the C
standard one, you would have found a relatively large difference in favor
of the assembly program. It stands to reason that the C compiler is
(well, should be) very good at doing C things. The further you deviate
from those things, the better assembly looks.


Best regards,
Spehro Pefhany
--
"it's the network..."                          "The Journey is the reward"
snipped-for-privacy@interlog.com             Info for manufacturers: http://www.trexon.com
Re: reducing flash size in embedded processors?
On Tue, 05 Oct 2004 10:09:37 -0400, Spehro Pefhany


Finally.  The right point.  Thanks!

Jon

Re: reducing flash size in embedded processors?
On Tue, 05 Oct 2004 17:15:53 GMT, Jonathan Kirwan


As the years go by, I seem to be coding less and less assembly, even
though the chips aren't getting any bigger...

Just a datapoint.  Conclusions I leave to the reader.

I looked through some recent projects.  These are the uses for
assembly I have had in the last year or so:

1) Modifying C startup code to perform a RAM test before initializing
the data segment.  You can't do that in C.

2) Calling HC908 resident ROM routines to erase and write Flash pages.
The assembly simply puts the parameters into the correct registers and
performs a call to the absolute address.  I might be able to do that in C
by using special extensions, but I'm not sure.

3) Reversing the bits in a byte before writing them to a resistor
ladder DAC that was wired in backwards.  This could be done quite
easily in C, so I wrote a quickie comparison (avr-gcc):

--- begin included file ---
typedef unsigned char U8;

U8 reverse_a(U8 v)
{
    U8 retval, bit_ctr;

    asm (

        "ldi    %1, 8           \n\t"
    "reverse_0:                 \n\t"
        "ror    %2              \n\t"
        "rol    %0              \n\t"
        "dec    %1              \n\t"
        "brne   reverse_0       \n\t"
        /* ldi needs an upper register, hence the "d" constraint on the
           counter; v is modified in place, so mark it read-write. */
        : "=&r" (retval), "=&d" (bit_ctr), "+r" (v)

        );

    return retval;
}


U8 reverse_c(U8 v)
{
    U8 retval = 0;   /* must be initialised: it is built up by shifting */
    U8 count = 8;

    do
    {
        retval <<= 1;
        if (v & 1) retval |= 1;
        v >>= 1;

    } while (--count);

    return retval;
}

---end included file---

With -Os, the inline assembly version was 8 words long, compared to 10
for the C version (AVR instruction memory is 16 bits wide).  Both of the
extra instructions are in the loop, so the C version is approximately 25%
slower.  The assembly version benefits from the use of the carry bit,
which allows it to combine the test-and-OR operation with a shift
(this is the source of the two extra instructions).

Note: The assembly version is tested and works. The C version is
untested, but took less time to write (the gcc inline assembly syntax
is arcane and I don't use it very often).  The function probably isn't
called often enough nor is memory tight enough to justify dropping
into assembly, but maintenance is a non-issue (a new spin of the board
wired the DAC correctly, and the routine was excised from the code.)

Regards,

                               -=Dave
--
Change is inevitable, progress is not.

Re: reducing flash size in embedded processors?


Not as a specific response to you, but just using your comment as a foil to say
something else....

Reasons I've used assembly:

(1) On a DSP with only 1k word, but costing only a few dollars, I wrote an
application entirely in assembly.  No hardware FP, so I wrote custom FP routines
for the need -- a 32/16 divide, with normalization from an existing barrel
shifter; a specialized least squares fitting routine; etc.  There is (and was)
no possible way to write this in C and still fit the necessary features.
(You'll just have to take my word on that fact.)  The system was written in
1990, is still in use in various products (as is this DSP), and the code has
been ported into new products with very little change to existing routines
and only a few additional ones.  Well fitted to the application space in the
first design effort.  Entirely in assembly.  No C.

(2) In a case where I needed to again fit into a 2k word micro space (cost of
the larger 32k chips would have 'broken' the budget), wrote an application
entirely in assembly.  Specialized use of PIC features (direct modification of
the low order instruction pointer byte for a calculated jump) allowed me to
shovel in a very sophisticated state machine needed for operation.  None of the
C compilers available could have used that feature on this chip and, as a
result, still fit the application into the space available.  Entirely in
assembly.  No C.  Necessity.

(3) In a retrofit task, I needed to support high speed resource negotiation on
two I/O lines (all that was available) for a widely varying number of entirely
asynchronous devices all competing for a single common resource -- an RS-232
level shifter connected to a PC serial port.  Variability in timing, putting out
a 1 bit or a 0 bit, had to be added to overall wait time for each processor
trying to negotiate (actually, 2X that uncertainty.)  Since this was wire-OR and
since the PIC requires different logic for handling a 1 or 0 in that case, the
IF and ELSE cases for the bit value had to either have exactly the same timing
or else the variability in the two would have to be doubled and then added to
the total bit time.  C has no #pragma on the compilers I've seen to permit
specifying that the IF and ELSE clauses have the same timing.  In this case,
that's important.  Again, in assembly -- NOPs easily added as needed.  (Took me
only the morning to completely write and retrofit the code, we were running
correctly in the afternoon, with dozens of devices which previously had never
been connected in this way, working fine.)

(4) A case where extremely low power, tiny size, sophisticated performance and
exact timing requirements... the whole thing fits into a tiny TO-5 can in
wire-bonded die form.  Going to C would have necessitated a larger die for the
micro.  No room, sorry.  Using C would have forced a more expensive can, larger,
probably some more power which would mean ... more heat to remove ... etc.  It
would have been less competitive.  Again?  Assembly.  No C.  Necessity.

And many, many other reasons, as well.

Of course, many of my applications also use C and/or C++.  But there is a
time and a place for assembly.

A key point is that many embedded applications are highly cost sensitive; or
highly size sensitive; or highly power-consumption sensitive (battery); or
combinations of these and other qualities.  Even when these don't necessarily
dictate a specific approach, they may mean that your relative competitiveness
is different if you choose C, because of the impact it may have on one or
more of these characteristics.  I've had a case where the change in die size
needed to support C development would cause: (1) a larger die; (2) leading to
a larger metal can package; with (3) more power consumption; (4) leading to
more peltier TEC cooling power being required; resulting in (5) much higher
system cost, larger size, and (6) very much lower competitiveness and
utility.

"If one is willing to narrow their vision and refuse to be comprehensive in
their perspective, they might well be willing to believe the world is flat."

When you are looking at the product from a comprehensive system engineering
view, whether or not to use C or assembly or some mix of those or other things
becomes only one factor of many to consider.  It is in this fuller context,
taken together with other priorities such as power consumption, cost, size, and
perhaps even performance speed -- it is in balancing all these and comparing the
overall difference in the __total time__ to produce the product -- that one
could reasonably make the choice to use assembly instead of C.  Looking only at
the difference in time of C vs assembly coding effort for a single function, for
example, tells you very little about this larger reality.

Jon

Re: reducing flash size in embedded processors?

You can do that perfectly well in C.  The last time I wrote specialised
startup code (for an MPC561), virtually everything was in C.  I had to have
assembly macros for accessing special function registers, since there is no
way to use them in C.  Other than that, I have three assembly instructions:
two to set up a stack pointer, and one to jump to the start of the C startup
routine.  All the code to set up chip selects, clear out the bss, copy the
program from flash to RAM, etc., is a C routine, ending with a call to
main().

It might still be better to write such code in assembly - I've done both.
But you don't have to have your data segment ready before using C - you only
have to have it ready before main() starts.
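A host-testable sketch of that startup sequence. The arrays here stand in for the linker-provided section symbols (whose real names, e.g. `__bss_start`, vary by toolchain), so the shape of the code can be checked without a target:

```c
#include <stdint.h>
#include <string.h>

/* Stand-ins for the linker symbols: the bss region, the flash image of
   .data, and the RAM run address of .data. */
static uint8_t bss[16] = {1, 2, 3};                  /* pretend-dirty RAM */
static const uint8_t data_load[8] = {1, 2, 3, 4, 5, 6, 7, 8};
static uint8_t data_ram[8];

/* Runs after a minimal assembly stub has set up the stack pointer;
   in the real thing it would end with a call to main(). */
void c_startup(void)
{
    memset(bss, 0, sizeof bss);                   /* clear out the bss */
    memcpy(data_ram, data_load, sizeof data_ram); /* copy .data flash->RAM */
    /* ...set up chip selects, clocks, etc., then call main() */
}
```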


[...]

For that sort of thing, go for assembly every time - that way you are sure
you get it right.

<snip example details>

Some algorithms, particularly ones that work best with access to a carry bit
and rotate instructions, are a lot more efficient in assembly; C simply
lacks any way to express these concepts properly.  They can often be easier
to write and understand in assembly, as well as being smaller and faster.
CRC routines are another good example.
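For instance, a bitwise CRC-8 over the standard 0x07 polynomial: in C, each iteration must rederive from the top bit the "carry out" that a rotate-through-carry instruction would hand over for free in assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-8, polynomial 0x07, init 0x00 (the parameters used by
   e.g. the SMBus PEC).  The (crc & 0x80) test is the C substitute for
   the carry flag an assembly rotate would provide directly. */
uint8_t crc8(const uint8_t *p, size_t n)
{
    uint8_t crc = 0;
    while (n--) {
        crc ^= *p++;
        for (uint8_t bit = 0; bit < 8; bit++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
    }
    return crc;
}
```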





Re: reducing flash size in embedded processors?
On Wed, 6 Oct 2004 08:57:49 +0200, "David Brown"

[...I wrote...]


With linker tricks you can write C code that will be executed before
main.  But as you note later, you need a stack to call it.  And
writing a proper RAM test in C is, ummm, more than difficult.

[...]

My routine to erase a page of flash looks like this:

   void FLASH_erase_page(void High *page)
   {
   // Low byte of page is in a, high byte in x
   //
       #asm
       pshx                            ; push high byte
       pulh                            ; pull into h
       tax                             ; copy low byte into x
       lda     #OPERATING_FREQUENCY
       sta     _CPUSPD
       bclr    MASS_ERASE,_CTRLBYT     ; set up for page erase
       jsr     ERARNGE                 ; erase page
       #endasm
   }

Where the symbols are defined earlier in the file.  Some compilers offer
extensions that might allow this to be written as

   void FLASH_erase_page(void High *page)
   {
      CPUSPD = OPERATING_FREQUENCY;
      CTRLBYTE &= ~MASS_ERASE_MASK;
      __HX = page;
      ERARNGE();
   }

FWIW, I don't believe the compiler I'm using provides such extensions.

[...]

The point of this example seems to be that this appeared to be such a
case, but the actual gain was smaller than might be expected.

However, a few years back I posted a pathological checksum function
(involving the rotate instruction and carry bit) that was six
instructions long in assembly, but I couldn't get to less than about
40 instructions using C.  See
http://groups.google.com/groups?selm36%a8c1b6.10586772%40192.168.2.34
for details if you're interested.

Regards,

                               -=Dave
--
Change is inevitable, progress is not.

Re: reducing flash size in embedded processors?


In what way could a math library be C specific? The math functions
are the same independently of the language: at the end of the day
you pass in some floating point value(s), some computation happens,
and you get a result - that's it. So it's no surprise that the same
libraries are used for various languages such as C, C++, Java and
Fortran, as they all define a common set of math functions.
Wilco



Re: reducing flash size in embedded processors?

Math functions for embedded work are often tailored to the specific
application set. E.g. I might need an arctan that *never* takes longer than
100 instructions, to 10 ppm worst-case accuracy, without an FPU, and that
can be called in an interrupt. I code that myself in assembly using tables.
The C libs I've seen are *not* re-entrant and might take as long as 1000
instructions. (The numbers are all ball-park, but the example is real.) The
C versions are more accurate, but I don't need that accuracy. I need speed
and I need it *now*. I also don't really need or want floating point, so I
usually output the result in a fixed radix ready to scale to a DAC.

Bob



