A Challenge for our Compiler Writer(s)

Hey Walter (et al., if you're out there):

With the GNU tools, optimizations on, and an ARM Cortex-M3, this goes a _lot_ faster when you precede it with

#define ASSEMBLY_WORKS

than when you don't.

Yet you say that an optimizer should eat up the C code and spit out assembly that's better than I can do.

How come the difference? Is it the tools? I know it's not because it's the World's Best ARM Assembly, because I've learned a bit since I did it and could probably speed it up -- or at least make it cleaner.

CFractional CFractional::operator + (CFractional y) const
{
#ifdef ASSEMBLY_WORKS
  int32_t a = _x;
  int32_t b = y._x;
  asm (
    "adds %[a], %[b]\n"                    // add
    "bvc .sat_add_vc\n"                    // check for overflow
    "ite mi\n"
    "ldrmi %[a], .sat_add_maxpos\n"        // set to max positive
    "ldrpl %[a], .sat_add_maxneg\n"        // set to max negative
    "b .sat_add_ret\n"
    ".sat_add_maxpos: .word 0x7fffffff\n"
    ".sat_add_maxneg: .word 0x80000001\n"
    ".sat_add_forbid: .word 0x80000000\n"
    ".sat_add_vc:\n"
    "bpl .sat_add_ret\n"
    "ldr %[b], .sat_add_forbid\n"
    "cmp %[a], %[b]\n"
    "it eq\n"
    "moveq %[a], %[b]\n"
    ".sat_add_ret:\n"
    : [a] "=r" (a), [b] "=r" (b)
    : "[a]" "r" (a), "[b]" "r" (b));

  return CFractional(a);
#else
  int32_t retval = _x + y._x;

  // Check for underflow and saturate if so
  if (_x < 0 && y._x < 0 && (retval >= 0 || retval < -INT32_MAX))
  {
    retval = -INT32_MAX;
  }

  // check for overflow and saturate if so
  if (_x > 0 && y._x > 0 && retval

Tim Wescott

Not an answer to your question, but couldn't you use the SSAT instruction to your advantage here?

Arlet Ottens

If it's what I think it is -- very possibly. As I said, this isn't super-optimized assembly code, here.

--
My liberal friends think I'm a conservative kook.
My conservative friends think I'm a liberal kook.
Why am I not happy that they have found common ground?

Tim Wescott, Communications, Control, Circuits & Software
http://www.wescottdesign.com
Tim Wescott

I don't have an ARM handy for testing speed, but I've just tried compiling some test code with the latest Code Sourcery "lite" arm compiler (gcc 4.6.1), with the command line:

arm-none-eabi-gcc test.c -c -std=gnu99 -Wa,-ahlsd=test.lst -fverbose-asm -Os -mcpu=cortex-m4 -mthumb

(I tried with cortex-m4 because it supports saturating arithmetic.)

There might be differences about saturating negative values to -INT32_MAX or to INT32_MIN - I don't know which is standard or required here.
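
To make the two conventions concrete, here is a minimal sketch that does the add in 64-bit arithmetic and then clamps, so the only difference between the variants is the lower bound. The function names are invented for illustration and are not from this thread, and widening to 64 bits is just the simplest way to show the behaviour, not a claim about what is fastest on a Cortex-M3.

```c
#include <assert.h>
#include <stdint.h>

/* Add two 32-bit values in 64-bit arithmetic, then clamp to [lo, hi].
 * Hypothetical helper, named here only for illustration. */
int32_t sat_add32_clamped(int32_t x, int32_t y, int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    int64_t sum = (int64_t)x + (int64_t)y;   /* cannot overflow in 64 bits */
    if (sum < lo) return lo;
    if (sum > hi) return hi;
    return (int32_t)sum;
}

/* Symmetric convention: results stay within -INT32_MAX .. INT32_MAX. */
int32_t sat_add32_symmetric(int32_t x, int32_t y)
{
    return sat_add32_clamped(x, y, -INT32_MAX, INT32_MAX);
}

/* Full-range convention: results stay within INT32_MIN .. INT32_MAX. */
int32_t sat_add32_full(int32_t x, int32_t y)
{
    return sat_add32_clamped(x, y, INT32_MIN, INT32_MAX);
}
```

The two variants agree everywhere except when the exact sum is at or below INT32_MIN: the symmetric one returns -INT32_MAX there, the full-range one returns INT32_MIN.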

As can be seen from the code below, your C code is not optimal. I would be very interested to know how the speed of satadd2() below compares to your hand-made assembly.

However, this all raises bigger questions - why are you making your own code for this? Modern compilers (such as gcc) support fractional types (from ISO/IEC TR 18037). If you use them, as in satadd3(), the compiler will generate optimal code for processors with hardware support (such as the Cortex-M4). For other processors, such as the Cortex-M3, the compiler automatically uses a library routine. You can expect such library routines to be pretty optimal for the architecture in question (for the M3, the library code is the same as for satadd2() below, which is hardly surprising given the source of that function).

So by using "signed long sat fract" types you get fast library code on M3 and before, and when you switch to an M4 with DSP functionality, a re-compile gives you optimal use of the hardware without having to re-write your assembly.

Tell me again why assembly is so great in this case?

mvh.,

David

// test.c

#include #include

int32_t satadd1(int32_t x, int32_t y)
{
    int32_t retval = x + y;
    if ((x < 0) && (y < 0) && ((retval >= 0) || (retval < -INT32_MAX))) {
        retval = -INT32_MAX;
    }
    if ((x > 0) && (y > 0) && (retval

David Brown

Damn. And I thought I was so smart.

So, how long must I have been sleeping?

Tim Wescott

I was going to try out that code on the IAR EWARM compiler at various optimization levels -- until I realized that

"CFractional CFractional::operator + (CFractional y) const"

doesn't look like C to me. Am I missing something??

Could you include enough information to make that example directly compilable in standard C?

Mark Borgerson

Mark Borgerson

It is clearly C++, but it would seem that CFractional is a class containing an int32_t member "_x" which is the fractional value in question. Think of it as syntactic sugar around the function

int32_t add_sat_frac(int32_t a, int32_t b);

(Or see my re-write of the code in C in my other post.)
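
For readers who have avoided C++, the following is a rough C stand-in for what such a class amounts to: a struct holding one int32_t that is read as a fraction in [-1, 1), plus a saturating add on the raw value. Everything except the add_sat_frac signature is an assumption made for illustration (the frac32 type, the conversion helpers, and the choice to saturate symmetrically at -INT32_MAX as the C fallback in the first post does). The overflow check uses __builtin_add_overflow, which is a GCC/Clang builtin rather than standard C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C stand-in for the CFractional class: one Q1.31 value. */
typedef struct { int32_t x; } frac32;

/* Saturating add on the raw values (symmetric range assumed).
 * __builtin_add_overflow is a GCC/Clang extension, not ISO C. */
int32_t add_sat_frac(int32_t a, int32_t b)
{
    int32_t sum;
    if (__builtin_add_overflow(a, b, &sum)) {
        /* Overflow can only happen when both operands share a sign. */
        return (a < 0) ? -INT32_MAX : INT32_MAX;
    }
    /* Keep the range symmetric: never return the raw value 0x80000000. */
    return (sum == INT32_MIN) ? -INT32_MAX : sum;
}

frac32 frac_add(frac32 a, frac32 b)
{
    frac32 r = { add_sat_frac(a.x, b.x) };
    return r;
}

/* Read the raw value as a fraction: value = x / 2^31. */
double frac_to_double(frac32 f)
{
    return (double)f.x / 2147483648.0;
}

/* Build a fraction from a double strictly inside (-1, 1). */
frac32 frac_from_double(double v)
{
    assert(v > -1.0 && v < 1.0);
    frac32 f = { (int32_t)(v * 2147483648.0) };
    return f;
}
```

With these helpers, adding 0.25 and 0.5 gives 0.75, while adding 0.75 to itself saturates at the largest representable value, just below 1.0.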

mvh.,

David

David Brown

I did look at the C code and the compiler outputs. It seems that compilers have come a long way since I wrote some 68K assembly because the compiler refused to use the most efficient decrement-test-and-loop instruction (DBNE D0, Dest, I think).

It is clear to me that the compiler writers are way ahead of me for the ARM and ARM-Cortex chips. Even on the simpler MSP430, I seldom use assembly outside the startup code. I still look at the assembly listing in the debugger, though.

Mark Borgerson

Mark Borgerson

Yup. In fact, that's an awful lot like what the call looks like when I need to do this in C (except that I'm going to be investigating just how ubiquitous fractional support is, now that I've been made aware of it).

Sorry for not elucidating -- I thought it would be obvious.

--
Tim Wescott
Control system and signal processing consulting
www.wescottdesign.com
Tim Wescott

That reminds me of a situation where C was much better than assembly for startup code. This was 15 years ago, and the compiler in question is about 20 years old now. The toolchain-provided startup code for clearing the bss was written in assembly, as is common. And it was slow and inefficient - also a very common situation for toolchain-provided assembly code. I re-wrote it in C - the result was clearer source, object code half the size, and a run time something like 10 times faster. Ironically, that was because the compiler generated a DBNE instruction, which the assembly code did not use.

(The compiler will only be able to generate DBNE instructions for a 16-bit counter, not a 32-bit "int" counter. It's one of the few 16-bit only instructions on the m68k.)
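
The startup job described above, zeroing the bss with a 16-bit down-counter so that an m68k compiler has the chance to use a decrement-and-branch instruction, could look roughly like this in C. This is a sketch and not the code from the post: the function names are made up, real startup code would take the region bounds from linker-script symbols whose names vary by toolchain, and whether a given compiler actually emits a DBcc instruction for this loop is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero (count_minus_1 + 1) 32-bit words starting at dst.
 * The counter is deliberately 16 bits wide and counts down past zero,
 * which is the loop shape a decrement-and-branch instruction expects. */
void clear_words16(uint32_t *dst, uint16_t count_minus_1)
{
    uint16_t n = count_minus_1;
    do {
        *dst++ = 0;
    } while (n-- != 0);
}

/* Zero the words in [start, end), in chunks of at most 65536 words so
 * that the inner counter always fits in 16 bits. */
void clear_region(uint32_t *start, uint32_t *end)
{
    assert(start <= end);
    size_t remaining = (size_t)(end - start);
    while (remaining > 0) {
        size_t chunk = (remaining > 65536u) ? 65536u : remaining;
        clear_words16(start, (uint16_t)(chunk - 1));
        start += chunk;
        remaining -= chunk;
    }
}
```

In a real startup file the call would be something like clear_region applied to the start and end of the bss, with those two addresses supplied by the linker script.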

David Brown

This might be slightly off topic, but at least it shows that embedded systems are used in quite different environments.

While it might be critical that the startup can be done in 1 ms in some cases, in other cases a startup time of 1 s, 1 minute or even 1 hour might be acceptable, if the system is expected to run for the next 1-30 years without restarts.

upsidedown

It is obvious to people who are familiar with C++ - but gobbledegook to people who have managed to avoid it!

If you find out anything interesting about support for "fract", "sat", etc., it would be interesting to hear about it. I know it is supported in gcc for many processors (either with hardware support, or library calls), but I haven't looked further than that.

I also know that the numpties that wrote the specs have, as usual, underspecified them. How can they possibly have been so stupid as to write things like "the minimum formats for each type are..."? Did they not notice that the embedded world is a great fan of types like int16_t, and used their own types like u8 before that? I would much prefer to have seen standardised names like "fract16_t", "ufract32_t", etc., from the start - before people make them up themselves.

mvh.,

David

David Brown

True enough - and startup time is seldom critical. But it is silly for a toolchain provider to have startup code in assembly that is both longer and slower than the equivalent C code, while being also less clear and flexible.

Sometimes I think toolchain providers don't realise it is possible to have C code that runs before main() starts. They write this crap in assembly once, for one member of the processor family, and re-use it ever after because no one can be bothered re-writing it optimally for different members. So the startup code you get for your Coldfire v4 is restricted to being able to run on a 68000 from 30 years ago. Or if they do modify it, they try to minimise the changes - resulting in an incomprehensible mixture.

(Sorry if that sounds like a bit of a rant - I've had to deal with some very messy low-level assembly code recently. The toolchain is otherwise good, but some of the junk created as part of the "project wizard" setup is ridiculous.)

David Brown

The last time I was involved in the cross compiler business was in the 1980's so I might be a bit out of date :-).

In those days, when you hired a new person, what would be her/his first duty? Typically writing examples and documents for different platforms in order to get some familiarity with the company products.

By no means would you use your best programmers for this duty; they were instead engaged in writing the actual compiler.

I guess that the situation has not changed a lot since those days.

upsidedown

Whenever I see Wizard or similar with anything I immediately think "oh shit, yet another 'there are only three possible ways to use this tool'".

Wizards for anything are the bane of my life.

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
 Timing Diagram Font
  GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
 For those web sites you hate
Paul
