Any Large Integer Libraries Out There for the ARM?

Op Mon, 23 Apr 2012 20:36:48 +0200 schreef David T. Ashley :

Note that I did not perform exhaustive testing on the below code.

#include

typedef struct uint96_s { uint32_t w0; uint32_t w1; uint32_t w2; } uint96_t;

typedef struct uint128_s { uint32_t w0; uint32_t w1; uint32_t w2; uint32_t w3; } uint128_t;

void umul_96x32c(uint96_t *a, uint32_t b, uint128_t *r) { uint64_t p0, p1, p2; uint32_t r0, r1a, r1b, r2a, r2b, r3; uint64_t r1, r2;

p0 = (uint64_t)a->w0 * (uint64_t)b; r0 = p0; r1a = p0 >> 32; r->w0 = r0; p1 = (uint64_t)a->w1 * (uint64_t)b; r1b = p1; r2a = p1 >> 32; r1 = (uint64_t)r1a + (uint64_t)r1b; r->w1 = r1; p2 = (uint64_t)a->w2 * (uint64_t)b; r2b = p2; r3 = p2 >> 32; r2 = (r1 >> 32) + (uint64_t)r2a + (uint64_t)r2b; r->w2 = r2; r3 += r2 >> 32; r->w3 = r3; }

The obvious thing to do is to add manual ADC instructions.

#pragma inline=never void umul_96x32i(uint96_t *a, uint32_t b, uint128_t *r) { uint64_t p0, p1, p2; uint32_t r0, r1a, r1b, r2a, r2b, r3; uint32_t r1, r2;

p0 = (uint64_t)a->w0 * (uint64_t)b; r0 = p0; r1a = p0 >> 32; r->w0 = r0; p1 = (uint64_t)a->w1 * (uint64_t)b; r1b = p1; r2a = p1 >> 32; r1 = r1a + r1b; r->w1 = r1; p2 = (uint64_t)a->w2 * (uint64_t)b; r2b = p2; r3 = p2 >> 32; __asm("adcs %[Rd],%[Rn],%[Rm]" : [Rd]"=r"(r2) : [Rn]"r" (r2a), [Rm]"r" (r2b) : "cc"); r->w2 = r2; __asm("adc %[Rd],%[Rn],#0" : [Rd]"=r"(r3) : [Rn]"r" (r3)); r->w3 = r3; }

The only further possible optimization was register allocation.

umul_96x32: PUSH {R4} LDR R3,[R0, #+0] UMULL R3,R12,R1,R3 STR R3,[R2, #+0] LDR R3,[R0, #+4] UMULL R3,R4,R1,R3 ADDS R12,R3,R12 STR R12,[R2, #+4] LDR R0,[R0, #+8] UMULL R0,R1,R1,R0 ADCS R3,R4,R0 STR R3,[R2, #+8] ADC R0,R1,#0 STR R0,[R2, #+12] POP {R4} BX LR ;; return

--
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)
Reply to
Boudewijn Dijkstra
Loading thread data ...

The somewhat less obvious optimization is to use umlal, which handles the 64-bit add.

I'm not sure what the pipeline looks like, but spacing producers and consumers, by using more temporary registers /might/ slightly improve performance.

Regards.

Reply to
Noob

In the parameters, you should provide the value of in_u32, instead of a pointer to the value, because the compiler is reloading the value every time it is needed.

You didn't specify. Which CPU(s) are you targeting?

Regards.

Reply to
Noob

Indeed. But I'm not sure which is faster.

No, it wouldn't. Cortex-M3 only stalls on memory operations.

--
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)
Reply to
Boudewijn Dijkstra

Thanks for all the help. The code posted allowed me to optimize my multiply function to almost as efficient as hand-written assembly-language.

I did post my original code elsewhere on this thread. I handled carries badly.

Thanks and best regards, Dave Ashley

Reply to
David T. Ashley

According to ARM's "Cortex-M3 Technical Reference Manual",

UMULL/SMULL/UMLAL/SMLAL use early termination depending on the size of source values. These are interruptible (abandoned/restarted), with worst case latency of one cycle. MLAL versions take four to seven cycles and MULL versions take three to five cycles. For MLAL, the signed version is one cycle longer than the unsigned.

and

Cycle count based on input sizes. That is, ABS(inputs) < 64K terminates early.

Therefore, we are comparing (1 umull + 2 umlal) vs (3 umull + 3 add) = 11-19 vs 12-18 Hard to give a definitive answer...

I don't understand. Are multi-cycle instructions (such as umull) pipelined or not?

If I write

umull r4, r5, r0, r3 add r6, r7, r8

add can be executed one cycle after umull (mult and ALU are two different hardware blocks).

If I write

umull r4, r5, r0, r3 add r6, r5, r4

the pipeline must stall to wait for the result of umull to become available. What am I missing?

Regards.

Reply to
Noob

AFAIK Cortex-M does not feature parallel execution paths. Cortex-R does.

--
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)
Reply to
Boudewijn Dijkstra

The compiler is set for Thumb. (I believe the Cortex-M3 handles only Thumb-2 instructions, which are a superset of Thumb.)

The parameter observation is interesting. I'd have to look at the code in more detail (which I won't do now).

But you'd think when you have a CPU with 16 registers, maybe 12 of which you are free to use assuming you push a few, the compiler could allocate 4 registers just to buffer the pass-by-reference inputs.

DTA

Reply to
David T. Ashley

Why all compiler manufacturers assume you are making a tablet/phone/laptop therefor just put in more RAM to blaot around the problem. Just like the chip manufacturers, nothing is made in quantities less than 10,000 and always ahs TerraBytes of RAM.

Thats what everybody does of course..

[Hint just a tad of sarcasm in the above statements]
--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
 Timing Diagram Font
  GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
 For those web sites you hate
Reply to
Paul

At least GCC does it - R0 to R3.

--

Tauno Voipio
Reply to
Tauno Voipio

All ARM/Thumb compilers do it, at least if they are conforming to the ARM procedure call specifications.

Reply to
Arlet Ottens

The issue is more subtle.

I gave the compiler a *UINT32 and a *UINT96. It surely keeps the pointers in registers.

But when you have a const * input (and it is either const * because you declared it that way or because the compiler can determine that you don't modify it), there is really no point in keeping the pointer.

Assuming that you use the object at least once, the compiler should just use the pointer to put the underlying object in registers and then forget all about the pointer.

Doesn't seem that hard.

DTA.

Reply to
David T. Ashley

It could, but it would not necessarily be a good idea. Each of the 3 words of your u96 construct is only read once. So what could possibly be gained (other than some memory bus congestion) by loading them all immediately?

I'm afraid you're overlooking the aliasing issue. Just because _your_ function doesn't modify those values doesn't mean they won't change mid-stride, which could actually _require_ the function to keep reading the original value through the pointer.

E.g. what do you think would happen if I did something (admittedly) crazy equivalent to

uint32_t mynum[10]; mul_32_96(/*in32*/mynum + i, /*in96*/ mynum + 1, /*out*/mynum);

i.e. if one of the 4 words of output area is actually in the same place as your input? That way, as soon as you start writing into your output, the input might change under your nose. Note that 'const u32 *foo' doesn't really mean "pointer to an immutable value" --- it just means that you cannot change the underlying object object via this particular pointer. Const-qualified can point to quite non-const data.

In case you ever read C99 language reference material and wondered what all that "restrict" business was about --- you've just found out.

Reply to
Hans-Bernhard Bröker

Or maybe we should all use Fortran, where "restrict" is implied in such cases.

It always bugs me that C compilers are forced to produce poorer code just because some sod wants to write code like that.

Reply to
David Brown

Thanks for the info.

I was not aware of the "restrict" keyword. This is a C99 addition:

formatting link

With some C90 compilers, I've seen various warning notes about enabling certain optimizations unless certain constraints on pointers were followed. This was apparently enough of a common occurrence that the "restrict" keyword showed up in C99 so that it could be handled in a more regular framework.

OK, your post constitutes my training for the year. I try to learn exactly one new thing every 365.25 days. I'll print this post and send it to our HR person and I'm good until 2013 ...

DTA

Reply to
David T. Ashley

cf. also

formatting link

Those who cannot afford the official standard may grab n1256.pdf

formatting link

AFAIR, gcc supports restrict even in (non-pedantic) C89 mode.

Regards.

Reply to
Noob

Doesn't the C strict aliasing rule make this unspecified behaviour? Pointers to different types neither of which are (signed|unsigned|blank) char and which aren't related by one being an element of the other are assumed not to alias.

Chris

Reply to
Christopher Head

Could you define "pedantic" in this usage?

I looked up the word in a dictionary and also looked at compiler options, and the meaning of the word really isn't making sense to me.

What is "pedantic"?

Thanks, DTA

Reply to
David T. Ashley

"Pedantic" means "particularly fussy or detailed". With gcc, using "pedantic" means that the compiler will be much stricter about supporting the standards, and /only/ the standards. For example, "restrict" and "inline" are keywords in later C standards, but were not available in earlier ones. gcc likes to give you more features than the standards demand, so that you can use these features when compiling in C89 mode. But to allow for people who happened to have a function called "restrict" and a typedef called "inline", which is legal in C89 but not C99, you can compile in "pedantic" mode. (You can still access newer features as "__restrict__" or "__inline__" for example.) "pedantic" mode is also useful for checking strict standards compatibility to aid portability of code.

Reply to
David Brown

The types in question are 32-bit integers or arrays of 32-bit integers, and these /are/ allowed to alias each other (AFAIUI).

mvh.,

David

Reply to
David Brown

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.