Any Large Integer Libraries Out There for the ARM?

- B
- Boudewijn Dijkstra
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 25, 2012 7:23 AM

Op Mon, 23 Apr 2012 20:36:48 +0200 schreef David T. Ashley :

Note that I did not perform exhaustive testing on the below code.

#include

typedef struct uint96_s { uint32_t w0; uint32_t w1; uint32_t w2; } uint96_t;

typedef struct uint128_s { uint32_t w0; uint32_t w1; uint32_t w2; uint32_t w3; } uint128_t;

void umul_96x32c(uint96_t *a, uint32_t b, uint128_t *r) { uint64_t p0, p1, p2; uint32_t r0, r1a, r1b, r2a, r2b, r3; uint64_t r1, r2;

p0 = (uint64_t)a->w0 * (uint64_t)b; r0 = p0; r1a = p0 >> 32; r->w0 = r0; p1 = (uint64_t)a->w1 * (uint64_t)b; r1b = p1; r2a = p1 >> 32; r1 = (uint64_t)r1a + (uint64_t)r1b; r->w1 = r1; p2 = (uint64_t)a->w2 * (uint64_t)b; r2b = p2; r3 = p2 >> 32; r2 = (r1 >> 32) + (uint64_t)r2a + (uint64_t)r2b; r->w2 = r2; r3 += r2 >> 32; r->w3 = r3; }

The obvious thing to do is to add manual ADC instructions.

#pragma inline=never void umul_96x32i(uint96_t *a, uint32_t b, uint128_t *r) { uint64_t p0, p1, p2; uint32_t r0, r1a, r1b, r2a, r2b, r3; uint32_t r1, r2;

p0 = (uint64_t)a->w0 * (uint64_t)b; r0 = p0; r1a = p0 >> 32; r->w0 = r0; p1 = (uint64_t)a->w1 * (uint64_t)b; r1b = p1; r2a = p1 >> 32; r1 = r1a + r1b; r->w1 = r1; p2 = (uint64_t)a->w2 * (uint64_t)b; r2b = p2; r3 = p2 >> 32; __asm("adcs %[Rd],%[Rn],%[Rm]" : [Rd]"=r"(r2) : [Rn]"r" (r2a), [Rm]"r" (r2b) : "cc"); r->w2 = r2; __asm("adc %[Rd],%[Rn],#0" : [Rd]"=r"(r3) : [Rn]"r" (r3)); r->w3 = r3; }

The only further possible optimization was register allocation.

umul_96x32: PUSH {R4} LDR R3,[R0, #+0] UMULL R3,R12,R1,R3 STR R3,[R2, #+0] LDR R3,[R0, #+4] UMULL R3,R4,R1,R3 ADDS R12,R3,R12 STR R12,[R2, #+4] LDR R0,[R0, #+8] UMULL R0,R1,R1,R0 ADCS R3,R4,R0 STR R3,[R2, #+8] ADC R0,R1,#0 STR R0,[R2, #+12] POP {R4} BX LR ;; return

--
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 25, 2012 8:19 AM

The somewhat less obvious optimization is to use umlal, which handles the 64-bit add.

I'm not sure what the pipeline looks like, but spacing producers and consumers, by using more temporary registers /might/ slightly improve performance.

Regards.

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 25, 2012 8:23 AM

In the parameters, you should provide the value of in_u32, instead of a pointer to the value, because the compiler is reloading the value every time it is needed.

You didn't specify. Which CPU(s) are you targeting?

Regards.

- B
- Boudewijn Dijkstra
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 25, 2012 12:21 PM

Indeed. But I'm not sure which is faster.

No, it wouldn't. Cortex-M3 only stalls on memory operations.

--
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 25, 2012 1:06 PM

Thanks for all the help. The code posted allowed me to optimize my multiply function to almost as efficient as hand-written assembly-language.

I did post my original code elsewhere on this thread. I handled carries badly.

Thanks and best regards, Dave Ashley

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 25, 2012 3:56 PM

According to ARM's "Cortex-M3 Technical Reference Manual",

UMULL/SMULL/UMLAL/SMLAL use early termination depending on the size of source values. These are interruptible (abandoned/restarted), with worst case latency of one cycle. MLAL versions take four to seven cycles and MULL versions take three to five cycles. For MLAL, the signed version is one cycle longer than the unsigned.

and

Cycle count based on input sizes. That is, ABS(inputs) < 64K terminates early.

Therefore, we are comparing (1 umull + 2 umlal) vs (3 umull + 3 add) = 11-19 vs 12-18 Hard to give a definitive answer...

I don't understand. Are multi-cycle instructions (such as umull) pipelined or not?

If I write

umull r4, r5, r0, r3 add r6, r7, r8

add can be executed one cycle after umull (mult and ALU are two different hardware blocks).

If I write

umull r4, r5, r0, r3 add r6, r5, r4

the pipeline must stall to wait for the result of umull to become available. What am I missing?

Regards.

- B
- Boudewijn Dijkstra
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Thu, Apr 26, 2012 9:11 AM

AFAIK Cortex-M does not feature parallel execution paths. Cortex-R does.

--
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Thu, Apr 26, 2012 1:52 PM

The compiler is set for Thumb. (I believe the Cortex-M3 handles only Thumb-2 instructions, which are a superset of Thumb.)

The parameter observation is interesting. I'd have to look at the code in more detail (which I won't do now).

But you'd think when you have a CPU with 16 registers, maybe 12 of which you are free to use assuming you push a few, the compiler could allocate 4 registers just to buffer the pass-by-reference inputs.

DTA

- P
- Paul
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Thu, Apr 26, 2012 6:30 PM

Why all compiler manufacturers assume you are making a tablet/phone/laptop therefor just put in more RAM to blaot around the problem. Just like the chip manufacturers, nothing is made in quantities less than 10,000 and always ahs TerraBytes of RAM.

Thats what everybody does of course..

[Hint just a tad of sarcasm in the above statements]

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
 Timing Diagram Font
  GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
 For those web sites you hate

- T
- Tauno Voipio
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Thu, Apr 26, 2012 6:37 PM

At least GCC does it - R0 to R3.

--

Tauno Voipio

- A
- Arlet Ottens
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Thu, Apr 26, 2012 6:41 PM

All ARM/Thumb compilers do it, at least if they are conforming to the ARM procedure call specifications.

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Thu, Apr 26, 2012 7:45 PM

The issue is more subtle.

I gave the compiler a *UINT32 and a *UINT96. It surely keeps the pointers in registers.

But when you have a const * input (and it is either const * because you declared it that way or because the compiler can determine that you don't modify it), there is really no point in keeping the pointer.

Assuming that you use the object at least once, the compiler should just use the pointer to put the underlying object in registers and then forget all about the pointer.

Doesn't seem that hard.

DTA.

- H
- Hans-Bernhard Bröker
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Thu, Apr 26, 2012 8:59 PM

It could, but it would not necessarily be a good idea. Each of the 3 words of your u96 construct is only read once. So what could possibly be gained (other than some memory bus congestion) by loading them all immediately?

I'm afraid you're overlooking the aliasing issue. Just because _your_ function doesn't modify those values doesn't mean they won't change mid-stride, which could actually _require_ the function to keep reading the original value through the pointer.

E.g. what do you think would happen if I did something (admittedly) crazy equivalent to

uint32_t mynum[10]; mul_32_96(/*in32*/mynum + i, /*in96*/ mynum + 1, /*out*/mynum);

i.e. if one of the 4 words of output area is actually in the same place as your input? That way, as soon as you start writing into your output, the input might change under your nose. Note that 'const u32 *foo' doesn't really mean "pointer to an immutable value" --- it just means that you cannot change the underlying object object via this particular pointer. Const-qualified can point to quite non-const data.

In case you ever read C99 language reference material and wondered what all that "restrict" business was about --- you've just found out.

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
11 years ago

Fri, Apr 27, 2012 10:40 AM

Or maybe we should all use Fortran, where "restrict" is implied in such cases.

It always bugs me that C compilers are forced to produce poorer code just because some sod wants to write code like that.

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
11 years ago

Fri, Apr 27, 2012 12:43 PM

Thanks for the info.

I was not aware of the "restrict" keyword. This is a C99 addition:

formatting link

With some C90 compilers, I've seen various warning notes about enabling certain optimizations unless certain constraints on pointers were followed. This was apparently enough of a common occurrence that the "restrict" keyword showed up in C99 so that it could be handled in a more regular framework.

OK, your post constitutes my training for the year. I try to learn exactly one new thing every 365.25 days. I'll print this post and send it to our HR person and I'm good until 2013 ...

DTA

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
11 years ago

Fri, Apr 27, 2012 1:08 PM

cf. also

formatting link

Those who cannot afford the official standard may grab n1256.pdf

formatting link

AFAIR, gcc supports restrict even in (non-pedantic) C89 mode.

Regards.

- C
- Christopher Head
  
  Contact options for registered users
Vote on answer
posted
11 years ago

Fri, Apr 27, 2012 1:40 PM

Doesn't the C strict aliasing rule make this unspecified behaviour? Pointers to different types neither of which are (signed|unsigned|blank) char and which aren't related by one being an element of the other are assumed not to alias.

Chris

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
11 years ago

Fri, Apr 27, 2012 1:55 PM

Could you define "pedantic" in this usage?

I looked up the word in a dictionary and also looked at compiler options, and the meaning of the word really isn't making sense to me.

What is "pedantic"?

Thanks, DTA

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
11 years ago

Fri, Apr 27, 2012 2:22 PM

"Pedantic" means "particularly fussy or detailed". With gcc, using "pedantic" means that the compiler will be much stricter about supporting the standards, and /only/ the standards. For example, "restrict" and "inline" are keywords in later C standards, but were not available in earlier ones. gcc likes to give you more features than the standards demand, so that you can use these features when compiling in C89 mode. But to allow for people who happened to have a function called "restrict" and a typedef called "inline", which is legal in C89 but not C99, you can compile in "pedantic" mode. (You can still access newer features as "__restrict__" or "__inline__" for example.) "pedantic" mode is also useful for checking strict standards compatibility to aid portability of code.

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
11 years ago

Fri, Apr 27, 2012 2:23 PM

The types in question are 32-bit integers or arrays of 32-bit integers, and these /are/ allowed to alias each other (AFAIUI).

mvh.,

David