How to store 32x32->64 result on ARM

I'm working with an XScale ARM processor (armv5te architecture). I have a GCC cross compiler.

I'm trying to store the output of an SMULL instruction into a 'long long' using inline assembly and the STRD instruction. However, I can't seem to get it right. SMULL places two 32-bit parts into separate registers (high and low). How do I get these values into my 'long long' data type? I tried using the STRD instruction with a pointer to the 64-bit datatype, but I get correct results only at optimization levels 1 and 3.

So I have two questions:

  1. How do I tell the compiler with inline assembly that I am going to be changing the value that's pointed to by an input operand? My input operand is a pointer to the 64-bit datatype, but GCC seems to do whatever it wants with the actual value. In inline assembly I can specify the operand constraints, but in this case I want to inform the compiler about the referenced value, and not only the actual operand (pointer).

  2. In general, how do people concatenate the high and low words of 64-bit results to create a 'long long' type? I know I can use shifts and addition, but that is just ridiculous...

I can post my current code if necessary.

Thanks!

Leonitis

You could try declaring the pointer as "pointer to volatile". That might make the compiler do what you want.

Doesn't the gnu compiler already define a 'long long' data type?

--
http://www.wescottdesign.com
Tim Wescott

On Thu, 02 Jul 2009 15:09:06 +0200, Leonitis wrote:

Wrong. What makes you think that you need inline assembler? Did you look at your compiler output?

The following code:

long long smull (long a, long b) { return (long long) a * b; }

on my GCC 3.4.4 compiles into a SMULL instruction both without optimization and with -Os. Then, if you're not happy with the compiler output (my GCC allocated R4 even though R2 is not used), you can replace the body with optimized assembler code:

long long smull (long a, long b) { __asm ("smull r2, r3, r0, r1\n mov r1, r3\n mov r0, r2"); }

But this will prevent the compiler from being able to inline it.
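The plain C version above depends on the cast being applied to an operand before the multiplication, not to the finished product. A minimal sketch of the difference, using hypothetical function names and the fixed-width types from <stdint.h> instead of long and long long:

```c
#include <stdint.h>

/* Widening multiply: one operand is converted to 64 bits first, so the
   whole product is computed in 64 bits. This is the form a compiler can
   map to a single SMULL on ARM. */
int64_t mul_widening(int32_t a, int32_t b)
{
    return (int64_t)a * b;
}

/* For contrast: the multiply happens in 32 bits and only the already
   truncated result is widened. Unsigned arithmetic is used so that the
   wraparound is well defined; the final conversion back to int32_t is
   implementation-defined (GCC wraps modulo 2^32). */
int64_t mul_truncating(int32_t a, int32_t b)
{
    uint32_t low = (uint32_t)a * (uint32_t)b;   /* wraps modulo 2^32 */
    return (int64_t)(int32_t)low;
}
```

With a = b = 100000, the first function returns 10000000000 while the second returns only the low 32 bits of that product.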

--
Made with Opera's revolutionary e-mail program:
http://www.opera.com/mail/
Boudewijn Dijkstra

  1. Don't use inline assembler. Write the assembly function as a separate module and pass the parameters according to your compiler's calling convention.
  2. If you still want to use the inline assembler, then read the manual on how to pass the parameters and clobber the registers. The syntax for that can be very cryptic.
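As a rough illustration of the parameter and clobber syntax in question, here is a sketch of an SMULL followed by STRD through a pointer operand. The function name smull_store is hypothetical, and the sketch assumes GCC, ARM (not Thumb) mode, little-endian word order and an 8-byte-aligned destination; on any other target it falls back to plain C.

```c
#include <stdint.h>

/* Hypothetical helper: multiply a by b and store the 64-bit product to *dst. */
static inline void smull_store(int64_t *dst, int32_t a, int32_t b)
{
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__) && !defined(__ARMEB__)
    /* r2/r3 are hard-coded because STRD wants an even/odd register pair.
       Naming them as clobbers keeps the compiler from placing any operand
       in them. The "memory" clobber states that the asm writes memory the
       operand list does not describe, so the compiler must not keep a stale
       copy of *dst in a register or reorder accesses around the asm. */
    __asm__ volatile (
        "smull r2, r3, %1, %2\n\t"
        "strd  r2, [%0]"
        : /* no register outputs */
        : "r" (dst), "r" (a), "r" (b)
        : "r2", "r3", "memory");
#else
    /* Portable fallback so the sketch also builds on non-ARM hosts. */
    *dst = (int64_t)a * b;
#endif
}
```

A narrower alternative to the "memory" clobber is an "=m" (*dst) output operand, which names only the object that is written.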

RTFM

union { struct { u32 lo; u32 hi; } foo; u64 bar; } foobar;

If your compiler doesn't support 64-bit types, create a C++ class to do that.

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant

Vladimir Vassilevsky

I'd both agree and disagree with that. Don't use assembler until you are absolutely sure that you can't get the code you want with C (another poster gave ideas there).

If you do decide on assembler, use inline assembler, not a separate assembly module, for functions like this. Assembly modules are useful if you need to be absolutely sure of the exact code and the exact registers used, such as for some unavoidable timing-critical or register critical code. For optimisation code, use inline assembler - it lets the compiler pick the best registers and do other optimisations (assuming you are using a good compiler, like gcc, which can optimise well with inline assembly - not all compilers are as flexible).

Also look around for examples on web sites.

I believe your problem centres around three things. First, data passed into and out of registers is treated as raw data. If you pass a pointer to inline assembly, then the inline assembly gets the pointer - there is no automatic dereferencing. Second, you'll want to check the syntax for registers used for a long long in ARM inline assembly (if there is such a choice). If you can't explicitly specify a register pair for a long long return, use two 32-bit return registers and the union trick given by Vladimir. Finally, check the syntax for modifiable and in/out register specifications, and possibly for clobber specifications, for inline assembly.
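The "two 32-bit return registers" idea can be sketched roughly as follows. The name smull_pair is hypothetical, the sketch assumes GCC in ARM (not Thumb) mode, and it joins the halves with a shift and OR instead of the union so that it does not depend on byte order. On other targets it falls back to plain C.

```c
#include <stdint.h>

/* Hypothetical helper: 32x32->64 signed multiply returning the full product. */
static inline int64_t smull_pair(int32_t a, int32_t b)
{
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
    uint32_t lo;   /* low word of the product  */
    int32_t  hi;   /* high word, carries the sign */
    /* "=&r" (early clobber) keeps both outputs out of the input registers;
       before ARMv6, SMULL requires RdLo, RdHi and Rm to be distinct. */
    __asm__ ("smull %0, %1, %2, %3"
             : "=&r" (lo), "=&r" (hi)
             : "r" (a), "r" (b));
    return (int64_t)(((uint64_t)(uint32_t)hi << 32) | lo);
#else
    /* Portable fallback so the sketch also builds on non-ARM hosts. */
    return (int64_t)a * b;
#endif
}
```

Because the compiler chooses the registers here, the function can still be inlined and scheduled together with the surrounding C code.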

gcc inline assembly syntax is very powerful, but pretty cryptic. Check it well, and test it well if you are not entirely sure.

David Brown

  1. If you have to ask how inline assembler works then you probably shouldn't be using it. It's a very dangerous feature even for experts. If you want to optimize some code, optimize it in C first, and when you are certain you can beat the compiler (very few people are capable of that), translate the whole function into assembler. Don't mix C and inline assembler: even if you can get it to work, it is unlikely to result in efficient code. If you wrote this:

void f(long long *p, int x, int y) { *p = (long long)x * y; }

GCC4.2 produces this:

    stmfd   sp!, {r4, r5}
    smull   r4, r5, r2, r1
    strd    r4, [r0]
    ldmfd   sp!, {r4, r5}
    bx      lr

Not 100% optimal as it could have done smull r2,r3,r1,r2 to avoid the save/restore.

However on XScale there is no performance gain in using STRD, so why do you want to use it?

  2. In general you use a cast or a union. On GCC the union approach results in the most efficient code. I typically define a few static inline functions to create a long long/double from a pair, or to extract the high or low word. This way you can easily produce efficient code on different compilers and avoid endianness issues.
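A minimal sketch of what such helper functions might look like with the union approach. All names here (s64_parts, make_s64, s64_high, s64_low) are hypothetical, and the member order is selected with GCC's __BYTE_ORDER__ macro; if that macro is not defined, little-endian is assumed.

```c
#include <stdint.h>

/* Overlay a 64-bit integer with its two 32-bit halves. The order of the
   halves in memory depends on the target's byte order. */
typedef union {
    int64_t whole;
    struct {
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
        int32_t  hi;   /* high word first on big-endian targets */
        uint32_t lo;
#else
        uint32_t lo;   /* low word first on little-endian targets (assumed default) */
        int32_t  hi;
#endif
    } part;
} s64_parts;

/* Build a signed 64-bit value from a high and a low word. */
static inline int64_t make_s64(int32_t hi, uint32_t lo)
{
    s64_parts u;
    u.part.hi = hi;
    u.part.lo = lo;
    return u.whole;
}

/* Extract the signed high word. */
static inline int32_t s64_high(int64_t v)
{
    s64_parts u;
    u.whole = v;
    return u.part.hi;
}

/* Extract the unsigned low word. */
static inline uint32_t s64_low(int64_t v)
{
    s64_parts u;
    u.whole = v;
    return u.part.lo;
}
```

Keeping the byte-order decision inside these three functions means the calling code never needs to know which half comes first in memory.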

Wilco

Wilco Dijkstra

Thanks everyone for your input! Why I decided to write this using inline assembly was because I couldn't understand what GCC was doing. It was branching to __muldi3 which I couldn't see. Now, however, it displays the assembly. (could someone please clarify what the branch to __muldi3 or other __"..." is?)

I'm using GCC 4.3.2 and it produced a 'stmia' instruction instead of 'strd'. I've read before that 'strd' doesn't improve performance on the XScale, but then why is it included on the processor? (it's good for processors with 64-bit data transfers, right?)

I'll stick to C for simple functions but there's definitely an advantage to using inline assembly to access the DSP enhanced instructions, which the compiler uses rarely.

Leonitis

Frankly, if you can't understand what GCC is doing, you ought to stop yourself from meddling with inline assembly.

It does exactly what you're trying to do, without you worrying about the details. __muldi3 is one of many helper functions supplied by the GCC runtime to implement things that the compiled code needs, but the CPU can't do (well) on its own. This particular one obviously implements multiplication of big integers.

Because processor designers don't always get to choose which instructions they have to support. XScale is a re-implementation of ARM. It's a new CPU core designed to run the same code that regular ARMs do. It can't just not support entire parts of the ARM machine language without counteracting its purpose.

Hans-Bernhard Bröker

On Mon, 06 Jul 2009 16:48:50 +0200, Leonitis wrote:

Compiler/system internal functions. __muldi3 is defined in gcc/libgcc2.c

Because it doesn't matter.

The processor is a conforming implementation of a specific ARM architecture so that compiler makers don't have to do extra work.

--
Made with Opera's revolutionary e-mail program:
http://www.opera.com/mail/
Boudewijn Dijkstra

__muldi3 is the long long multiply helper function (its name under the ARM EABI is __aeabi_lmul, which is clearer). If you see this being called, you were using the wrong options.

LDRD/STRD were invented specifically for XScale to improve performance over the badly implemented LDM/STM instructions. So they are useful from that perspective. LDRD takes 1 cycle so it is better than doing 2 separate LDRs, but STRD takes 2 cycles so the only improvement over 2 STRs is codesize.

Check you were using the correct options, as I would expect a compiler optimized for XScale to either generate STRD when possible or 2 STRs (and emit STMIA only when optimizing for size). I optimized armcc for XScale to do exactly this, but it might be that GCC isn't that optimized.

If you need DSP instructions which the compiler can't generate automatically then you should use the builtin intrinsics.

Wilco

Wilco Dijkstra
