Code size reduction migrating from PIC18 to Cortex M0

Hi

We are digging deeper into the Cortex M0 processor versus a PIC18.

Seemingly objective material (CoreMark data) on page 32 of:

formatting link

lists a reduction in code size from PIC18 to M0 by a factor of 2.

But does anyone have real-life experience of the possible code size reduction?

Thanks

Klaus

Reply to
Kvik


I have written code generators for both processors. This specific PowerPoint has been around for a while, and it shows more about what the Cortex is good at, and the Microchip PICs less so.

Cortex is smaller than PIC18 in some embedded applications that require 32 bit math. In many applications Cortex has higher RAM requirements.

Reply to
Walter Banks

This code size reduction from PIC to Cortex-M seems about right.

The 8-bit PICs have lousy code density according to many studies. In my own comparison (see my blog post "Insects of Computer World" at

formatting link

) I've got something like a factor of 5 (!) code size difference between the PIC18 and Cortex-M3. This was for the source code of a small RTOS-like state machine framework. The PIC18 code was created by the free student edition of the Microchip C18 compiler. I suspect that the paid edition can do somewhat better code size optimization.

But the truth remains that the 8-bit PICs have the worst code density in the industry. Also, contrary to widespread misconceptions, the 8-bitters are not inherently efficient in memory usage. It turns out that code density has nothing to do with the register file width (8, 16, or 32 bits), but with how old a given CPU design is. The old designs, such as the 8-bit PIC and the 8051, are lousy. New designs, regardless of register size, are much better. The ARM Cortex-M parts are pretty good. So is the MSP430. But I think that the current winner in terms of best code density could be the new Renesas RX.

Miro Samek

formatting link

Reply to
Miro Samek


Hi Miro

That's a great link, thank you very much :-)

I will take some representative code on a PIC18, compare it to the M3, and post the results back in the forum, just for fun.

Regards

Klaus

Reply to
Kvik


These sorts of things are always written with a purpose in mind. There are three sorts of lies...

When looking at ram requirements, it's worth noting the ratio of flash to ram sizes on common devices. Typically, microcontrollers with 8-bit cores have much more flash per byte of ram than those with 32-bit cores. Though there is obviously lots of variation and different types, it is common for 8-bit devices to have 8 to 32 times as much flash as ram, while for 32-bit devices the range is perhaps 2 to 8. This means that the bigger ram requirements caused by things like 32-bit values being more common than 8-bit, pointers moving from 16-bit to 32-bit, and greater stack space requirements for functions and interrupts, have little impact in real-world usage.

Reply to
David Brown


As David Brown's reply implies, when comparing any attribute of any product, the manufacturer's own data is the last place to look. You are always best off taking your own measurements for the use case you are actually interested in, because the results are use-case specific.

Following Miro's email: the free versions of the PIC compilers do not (normally) include the best optimisation, except during their evaluation period.

Regards, Richard.

  • formatting link
    Designed for microcontrollers. More than 7000 downloads per month.

  • formatting link

15 interconnected trace views. An indispensable productivity tool.
Reply to
FreeRTOS info


Good point about rom/ram ratios. Your numbers are consistent with our experience. To take the point forward: the high rom/ram ratios on many small 8-bit micros change the way code is generated for these parts, trading ram savings for rom and execution cycles.

Walter..

Reply to
Walter Banks


Yes, ram is "cheaper" on many 32-bit devices than 8-bit devices. On the other hand, sometimes /accessing/ ram is more expensive (maybe you can't do direct addressing but must first load a pointer register, maybe you've got ram that takes multiple cpu clock cycles, maybe you have code running out of ram as well). If it were easy getting these balances right, your job wouldn't be half as much fun!

Having more ram on hand also changes the way users write code, and gives the programmer more freedom.

Reply to
David Brown

I've ported a fairly large app from a PIC18 to a Cortex M3 (which I believe is just a superset of the M0), and the code size actually INCREASED, from about 62K to 129K. This was plain C code, without any processor-specific optimizations or tricks, cut & pasted from one compiler to the other. While the Cortex does get better density on things like 32 x 32 multiplies or divides, it suffers horribly on simple control structures.

For example, clearing a timer interrupt flag:

On the PIC18 this takes 2 bytes:

    PIR1 &= ~TMR1IF;
    2108:  BCF   F9E.0

On the Cortex M3 it takes 40 bytes:

    TIM1->SR &= ~TIM_SR_UIF;
    F6424200  movw  r2, #0x2C00
    F2C40201  movt  r2, #0x4001
    F6424300  movw  r3, #0x2C00
    F2C40301  movt  r3, #0x4001
    8A1B      ldrh  r3, [r3, #16]
    B29B      uxth  r3, r3
    4619      mov   r1, r3
    F64F73FE  movw  r3, #0xFFFE
    F2C00300  movt  r3, #0
    EA010303  and.w r3, r1, r3
    4619      mov   r1, r3
    460B      mov   r3, r1
    8213      strh  r3, [r2, #16]

A simple countdown:

On the PIC18 it takes 6 bytes:

    if (--timeout) return;
    210A:  DECF  x3B,F
    210C:  BZ    2110
    210E:  BRA   2114

On the Cortex M3 it takes 40 bytes:

    if (--timeout) return;
    F2400360  movw  r3, #0x60
    F2C20300  movt  r3, #0x2000
    7B5B      ldrb  r3, [r3, #13]
    F10333FF  add.w r3, r3, #0xFFFFFFFF
    B2DA      uxtb  r2, r3
    F2400360  movw  r3, #0x60
    F2C20300  movt  r3, #0x2000
    735A      strb  r2, [r3, #13]
    F2400360  movw  r3, #0x60
    F2C20300  movt  r3, #0x2000
    7B5B      ldrb  r3, [r3, #13]
    2B00      cmp   r3, #0
    D128      bne   0x08000F92

This may not be a very fair comparison since both compilers (CCS for the PIC and gcc for the Cortex) are set to non-optimized mode, but even when gcc is set to optimize it only drops from 129K down to 104K, which is not much of a savings and still worse than the PIC18. When I first started this exercise I was quite disappointed by the poor density, so I tried a simple experiment: I took one single C function that had more than doubled in size and rewrote it to take advantage of the Cortex's strengths. I made heavy use of 32-bit variables, careful use of the "register" keyword, always accessed global variables through a pointer, combined bit shifts with other arithmetic operations, used bit-banding for IO registers wherever possible, etc. In the end I managed to get it down to almost half its size, but still couldn't match the PIC18.

Perhaps the final answer depends on what kind of application you're writing. In my case it's very IO intensive with a lot of peripherals being used and a simple touchscreen UI with very little math involved. Perhaps the Cortex was not the best choice here.

Reply to
peter_gotkatov

Did you look at the map file to see why? If using GCC, did you set the compile options to remove dead code (most linkers will do it automatically). If using GCC, did you avoid using libraries that were written for a much larger class of processor?

"The Cortex-M3 processor has a feature known as "bit-banding". This allows an individual bit in a memory-mapped mailbox or peripheral register to be set/cleared by a single store/load instruction to a bit-band aliased memory address, rather than using a conventional read/modify/write instruction sequence."

Regards, Richard.

Reply to
FreeRTOS info

I actually spent a lot of time looking at the map file. The total "overhead" including the vector table, C startup, and the two library functions that I actually use (printf and memcpy) is around 2K, the rest is all my code.

I've used bit banding in some spots, and although it's great for RAM variables it's not very elegant for the peripheral registers, since the header files define bits by value, not position, whereas any bit-banding C macro that I could come up with would require the bit number, not the value. So while TIM1->CCER |= TIM_CCER_CC4E clearly enables the timer's CC4 output, BITBANDSET(TIM1->CCER, 12) is less intuitive. Instead of a macro I thought about making an ASM inline function that would use the CLZ instruction to do this efficiently, but for some reason gcc didn't want to inline any of my functions (C or ASM) in debug mode, so I just gave up at that point.

Reply to
peter_gotkatov

I think there are two things going on here:

  1. The GCC compiler isn't very good at producing compact code.

I tried one of your examples on IAR EW-ARM with optimization set to low (my usual default):

    119          TIM1->SR &= TIM_SR_UIF;
    \   00000010 0x....        LDR.N  R0,??DataTable6_6  ;; 0x40010010
    \   00000012 0x8800        LDRH   R0,[R0, #+0]
    \   00000014 0xF010 0x0001 ANDS   R0,R0,#0x1
    \   00000018 0x....        LDR.N  R1,??DataTable6_6  ;; 0x40010010
    \   0000001A 0x8008        STRH   R0,[R1, #+0]

That's just 10 bytes----4 times better than the GCC result.

  2. Cortex IO registers may be 16 or 32 bits, and there are enough of them that you need 32-bit pointers to get at them. Loading those pointers is going to take more code.

I suspect that the IAR compiler would reduce the code expansion to about a factor of 1.5. Since a lot of Cortex MCUs have up to 1MB of flash while the PIC18 maxes out at 128KB, the ratio of program size to available flash may be better on the Cortex than on the PIC18.

Mark Borgerson

Reply to
Mark Borgerson

Saying you use a compiler but don't enable optimisation, then complaining about the code generated, is like saying you drive a car but never bother changing out of first gear and then complaining about the lack of speed.

When you say you tried using the "register" keyword, I have to assume you learned C from a 30 year old book. One thing that is worth learning about modern toolchains (for the PIC, the Cortex, or whatever) is that they generate better code from well-written C using a clear, modern style, and using appropriate command-line switches. Don't try and second-guess your tools by adding irrelevant keywords (like "register") or "hand-optimising" by using extra pointers. Learn to use the tools properly, then let them do their job.

When you say you use "gcc", which version? There are still some people using ancient versions of gcc which were very poor for ARM code (which has led to a long-lasting myth that gcc is bad for ARM).

To test your issues, I compiled this test code:

#include <stdint.h>

typedef struct {
    uint16_t padding[8];
    volatile uint16_t SR;
} TIM_t;

#define TIM1 (((TIM_t*)(0x40012c00)))
#define TIM_SR_UIF 0x0002

#define timeout (*((uint8_t*)(0x20000060)))

void test2(void)
{
    if (--timeout)
        return;
    TIM1->SR &= ~TIM_SR_UIF;
}

I used gcc 4.6.1 (from CodeSourcery Lite version 2011.09-69), with flags "-mcpu=cortex-m3 -mthumb -S".

Even with no optimisation, I am failing to generate code quite as bad as you have.

With -Os (which is the norm for embedded systems), I get:

test2:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    ldr   r2, .L3
    ldrb  r1, [r2, #0]    @ zero_extendqisi2
    subs  r0, r1, #1
    uxtb  r3, r0
    strb  r3, [r2, #0]
    cbnz  r3, .L1
    ldr   r2, .L3+4
    ldrh  ip, [r2, #16]
    bic   r1, ip, #2
    lsls  r0, r1, #16
    lsrs  r3, r0, #16
    strh  r3, [r2, #16]   @ movhi
.L1:
    bx    lr
.L4:
    .align 2
.L3:
    .word 536871008
    .word 1073818624

Real-world code will be even better, as the compiler can re-use base pointers and otherwise optimise larger code sections.

Reply to
David Brown

That is unlikely to be true, but without knowing your code or the map file, there is no way to be sure. It is not a surprise that the code size has increased in moving from the PIC18 - differences here will vary wildly according to the type of code. But it /is/ a surprise that you only have 2K of startup, vector tables, and library code.

Code clarity is more important than code efficiency. But if code efficiency is important, then put such code in little "static inline" functions with appropriate comments.

First off, you should not need to resort to assembly to get basic instructions working - the compiler should produce near-optimal code as long as you let it (by enabling optimisations and writing appropriate C code).

Secondly, don't use "ASM functions" - they are normally only needed by more limited compilers. If you need to use assembly with gcc, use gcc's extended "asm" syntax.

Finally, if you are not getting inlining when debugging it is because you have got incorrect compiler switches. You should not have different "debug" and "release" (or "optimised") builds - do a single build with the proper optimisation settings (typically -Os unless you know what you are doing) and "-g" to enable debugging. You never want to be releasing code that is built differently from the code you debugged.

Reply to
David Brown

Not all that surprising; here are the sizes in bytes:

    .vectors    304
    .init       508
    __putchar    40
    __vprintf  1498
    memcpy       56

I've tried several ways of writing a countleadingzeroes() function that would use the Cortex CLZ instruction, but even with optimization turned on it still wouldn't do it.


There are some things like the bootloader that need to be ASM functions in their own separate .S file anyway since they need to copy portions of themselves to RAM in order to execute. But a bootloader is a special case and I do agree that normal code shouldn't need to rely on ASM functions. I must say I'm not familiar with gcc's extended asm syntax and although I did look at it briefly it seemed like it was more complicated than a plain old .S file and it was mostly geared towards mixing C and ASM together in the same function and accessing variables by name etc. Not something I needed for a simple bootloader.


I was fighting with this for a while when I was first handed this toolchain, and it seems that in debug mode there is no -O switch at all, while in release mode it defaults to -O1. When I change this to -Os it does produce the same code as the sample that you posted above from gcc 4.6.1 (mine is 4.4.4, by the way). However, even with manually adding the -g switch I still don't get source annotation in the ELF file unless I use debug mode. This effectively limits any development/debugging to unoptimized code, which still has to fit into the 256K somehow.

As for using register keywords and accessing globals through pointers, I normally don't do this (haven't used the register keyword in years) and I certainly wouldn't be doing it at all if it didn't have such a significant effect on the code size:

unsigned long a,b;

void test(void) {
    B4B0      push  {r4-r5, r7}
    AF00      add   r7, sp, #0
----------------------------------------
    register unsigned long x, y;
    a = b+5;
    F2402314  movw  r3, #0x214
    F2C20300  movt  r3, #0x2000
    681B      ldr   r3, [r3, #0]
    F1030205  add.w r2, r3, #5
    F240231C  movw  r3, #0x21C
    F2C20300  movt  r3, #0x2000
    601A      str   r2, [r3, #0]
----------------------------------------
    x = y+5;
    F1040505  add.w r5, r4, #5
----------------------------------------
}
    46BD      mov   sp, r7
    BCB0      pop   {r4-r5, r7}
    4770      bx    lr
    BF00      nop

Reply to
peter_gotkatov

The difference is due to the fact that a, b are global and x, y are local.

Try removing the 'register' keyword, but leaving everything else the same. You should get the same code (assuming optimization enabled)

Reply to
Arlet Ottens

I still think it is surprising, because these library functions often pull in other library code (such as for floating point support), and quite often there is library code for small "helper" functions. But it depends on the configuration, and what is in the rest of your source code.

Did you try using the "__builtin_clz()" function described in the gcc manual?

I have written bootloaders for several microcontrollers (though not for a Cortex). I write them in C.

I have written startup code for several microcontrollers, handling the setup of the stack, memory, the C environment, clearing bss, copying constants, etc. I write such code in C.

You can't avoid writing two or three of the instructions in assembly, but usually it's not more than that. It's certainly not enough to bother with separate .S files - normally not even individual assembly functions (though sometimes I've used "naked" C functions as wrappers for a few lines of pure assembly).

Usually I write startup code when I am unhappy with the code supplied by the toolchain vendor - which is invariably written in assembly. Re-writing it in C gives code that is far clearer, and smaller and faster (sometimes many times faster).

It is aimed at mixing C and assembly, yes. It lets you do the minimal work in assembly, while letting the compiler handle as much as possible, including optimising around your assembly code. Let the compiler do the things it is good at.

This is some limitation or misunderstanding of your IDE or other tools, not gcc. Most likely it is a misunderstanding rather than a limitation, but without knowing your particular toolchain it is hard to give specific help.

Most serious developers use something like -Os (which is -O2 with an emphasis on size). The only reason to use -O1 is if you have a very slow computer and a very large code base, as it is faster than -Os/-O2, or very occasionally in testing or debugging. The only reason to use no optimisation is because you don't understand your tools.

And sometimes it is useful to use higher optimisations, or enable specific optimisations, because of particular effects. They don't tend to have much effect on most code, but can make a big difference to particular parts (perhaps unrolling a loop, or re-arranging nested loops to fit cache line sizes, etc.). I tend to use "optimize" function attributes or pragmas for such special cases.

The "register" keyword has always been ignored in gcc except in -O0 mode, unless of course you are using the extended syntax to specify a particular register.

You are seeing a difference in the code because one set of variables is global, and must be accessed externally, while the other set is local and uses registers.

And if you had enabled optimisations, the "x = y + 5;" would have been eliminated entirely because it has no effect.

And if you had enabled warnings, as you always should, the compiler would have complained about the code using variables before they are initialised, and about setting a variable that has no effect.

mvh.,

David

Reply to
David Brown

(36 bytes of code)

I'd be surprised if your compiler did not complain about x and y not being initialized before the addition. EW-ARM warned me that y was used before its value was set and that x was set but never used.

Here is the code it gave me----with some extra comments added afterwards

    111  void test(void){
    112    register unsigned long x,y;
    113    a = b+5;
    \  test:
    \  00000000 0x....  LDR.N  R1,??DataTable8_6
    \  00000002 0x6809  LDR    R1,[R1, #+0]
    \  00000004 0x1D49  ADDS   R1,R1,#+5
    \  00000006 0x....  LDR.N  R2,??DataTable8_7
    \  00000008 0x6011  STR    R1,[R2, #+0]
    114    x = y+5;
    \  0000000A 0x1D40  ADDS   R0,R0,#+5    // R0 is y; sum not saved
    115  }
    \  0000000C 0x4770  BX     LR           ;; return
    116

(14 bytes of code)

Apparently, R0, R1, R2 are scratch registers for IAR and don't need to be saved and restored.

Adding actual initialization to x and y and saving the result in b produced the following:

In section .text, align 2, keep-with-next

    110  void test(void){
    111    register unsigned long x=3,y=4;
    \  test:
    \  00000000 0x2003  MOVS   R0,#+3
    \  00000002 0x2104  MOVS   R1,#+4
    112    a = b+5;
    \  00000004 0x....  LDR.N  R2,??DataTable8_6
    \  00000006 0x6812  LDR    R2,[R2, #+0]
    \  00000008 0x1D52  ADDS   R2,R2,#+5
    \  0000000A 0x....  LDR.N  R3,??DataTable8_7
    \  0000000C 0x601A  STR    R2,[R3, #+0]
    113    x = y+5;
    \  0000000E 0x1D49  ADDS   R1,R1,#+5
    \  00000010 0x0008  MOVS   R0,R1        // x = sum
    114    b = x;                           // this time save the result
    \  00000012 0x....  LDR.N  R1,??DataTable8_6
    \  00000014 0x6008  STR    R0,[R1, #+0]
    115  }
    \  00000016 0x4770  BX     LR           ;; return
    116

Still accomplished with scratch registers----no need to save any on the stack. I changed from my default optimization of 'low' to 'none' and got exactly the same code.

Finally, I took out the 'register' keyword before x and y----and got exactly the same result as above.

It seems that GCC just doesn't match up to IAR at producing compact code at low optimization levels. OTOH, given that EW_ARM costs several KBucks, it SHOULD do better!

Mark Borgerson

Reply to
Mark Borgerson

The problems here don't lie with the compiler - they lie with the user. I'm sure that EW_ARM produces better code than gcc (correctly used) in some cases - but I am also sure that gcc can do better than EW_ARM in other cases. I really don't think there is going to be a big difference in code generation quality - if that's why you paid K$ for EW, you've probably wasted your money. There are many reasons for choosing different toolchains, but generally speaking I don't see a large difference in code generation quality between the major toolchains (including gcc) for 32-bit processors. Occasionally you'll see major differences in particular kinds of code, but for the most part it is the user that makes the biggest difference.

One place where EW_ARM might score over the gcc setup this user has (he hasn't yet said anything about the rest - is it home-made, CodeSourcery, Code Red, etc.?) is that EW_ARM might make it easier to get the compiler switches correct, and avoid this "I don't know how to enable debugging and optimisation" or "what's a warning?" nonsense.

It hardly needs saying, but when run properly, my brief test with gcc produces the same code here as you get with EW_ARM, and the same warnings about x and y.

I'm sure that EW_ARM has a similar option, but gcc has a "-fno-common" switch to disable "common" sections. With this disabled, definitions like "unsigned long a, b;" can only appear once in the program for each global identifier, and the space is allocated directly in the .bss inside the module that made the definition. gcc can use this extra information to take advantage of relative placement between variables, and generate addressing via section anchors:

Command line:

    arm-none-eabi-gcc -mcpu=cortex-m3 -mthumb -S testcode.c -Wall -Os -fno-common

test:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    ldr   r3, .L6
    ldr   r0, [r3, #4]
    adds  r2, r0, #5
    str   r2, [r3, #0]
    bx    lr
.L7:
    .align 2
.L6:
    .word .LANCHOR0
    .size test, .-test
    .global b
    .global a
    .bss
    .align 2
    .set .LANCHOR0,. + 0
    .type a, %object
    .size a, 4
a:
    .space 4
    .type b, %object
    .size b, 4
b:
    .space 4
    .ident "GCC: (Sourcery CodeBench Lite 2011.09-69) 4.6.1"

It's all about learning to use the tools you have, rather than buying more expensive tools.

mvh.,

David

Reply to
David Brown

One of the reasons I like the EW_ARM system is that the IDE handles all the compiler and linker flags with a pretty good GUI. You can override the GUI options with #pragma statements in the code----which I haven't found reason to do for the most part.

That's comforting in a way. While I now use EW_ARM for most of my current projects, I spent about 5 years using GCC_ARM on a project based on Linux. I would hate to think that I was producing crap code all that time! I had some experienced Linux users to set up my dev system and show me how to generate good make files, so I probably got pretty good results there.

I'm using EW_ARM for projects that don't have the resources of a Linux OS, and I prefer it for these projects.

Which reminds me----when counting bytes in code like this, it's easy to forget the bytes used in the constant tables that provide the addresses of variables. A 16-bit variable may require a 32-bit table entry.

I started with EW_ARM about three years before I started on the Linux project. The original compiler was purchased by the customer---who had no preferences, but was developing a project with fairly limited hardware resources. They asked what compiler I'd like and I picked EW_ARM. At that time, I'd been using CodeWarrior for the M68K for many years, and EW_ARM had the same 'feel'. When it came time to do the Linux project, the transition to GCC took MUCH longer than the transition from CodeWarrior to EW_ARM. Of course, much of that was in setting up a virtual machine on the PC and learning Linux so that I could use GCC.

One thing that I missed on the Linux project is that I didn't have a debugger equivalent to C-Spy that is integrated into EW_ARM. Debugging on the Linux system was mostly "Save everything and analyze later".

Of course, the original poster is discussing the type of code that few Linux programmers write----direct interfacing to peripherals. My recent experience with Linux and digital cameras was pretty frustrating. I was dependent on others to provide the drivers--and they often didn't work quite right with the particular camera I was using. That's a story for another time, though.

Mark Borgerson

Reply to
Mark Borgerson
