Any Large Integer Libraries Out There for the ARM?

- D
- David T. Ashley
  
  Contact options for registered users
posted
12 years ago

Wed, Apr 18, 2012 2:22 PM

I needed to multiply U128 = U96 * U32 on an ARM the other day, and did a hack job in C. It is logically correct, but it probably takes 3-10 times as long as assembly-language.

Are there any large integer and similar libraries out there for the ARM?

The GMP is a great resource, of course, but it isn't very applicable to embedded systems because of:

(a)Dynamic allocation.

(b)Variable-size operands, which mean looping and branching, which causes pipelined RISC machines not to perform as well as they would if the code were written in a linear way.

Anything out there?

Thanks, Dave Ashley

- M
- mwilson
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 18, 2012 2:54 PM

I've been working on untangling the OpenSSL bignum library free from OpenSSL. I've found it confusing because of a lot of upward dependencies, and the fact that the ./configure intermingles option setting for the whole package. One target that may be nearby is to group the options that only bignum needs into a single .h file.

Mel.

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 18, 2012 3:12 PM

Hi Mel,

I have obtained this domain name:

formatting link

and I'm slugging away on the weekends.

If there is enough synergy where you feel we might be able to work together or exchange technical information, please write me directly.

My feeling based on the embedded work I've done is that there need to be two variants of the library--classic ARM and Thumb. That assessment may need to be adjusted downward or upward.

Thanks and best regards, Dave A.

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 18, 2012 3:15 PM

Also, I'm not sure if my news client actually provides information to write me directly.

My e-mail address is:

gmail.com@dashley

and I'm sure you'll figure out the sophisticated encryption algorithm used to prevent harvesting robots from making use of my e-mail address.

Thanks, Dave A.

- M
- mwilson
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 18, 2012 3:59 PM

I'm still working to get my head around the physical requirements for packaging a release. It seems as though a source->.a process would suit my needs best. That way, ARM/THUMB can be decided project-by-project.

I've just started some due diligence and registered with github.

Somebody there has under a BSD license which might be what we're both looking for.

Otherwise, it might be a plan to put up a BSD licensed project based on the BSD licensed OpenSSL code.

Mel.

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 18, 2012 5:43 PM

Hi Mel,

I have the same thoughts. Assuming that ".a" is the standard suffix for an ARM library, a library would be built from the smallest compileable modules (typically a single function each), and the linker will automatically bring in only what it needs to satisfy references.

With few exceptions, the source code would consist of source files that each contain a single function or a single constant or variable data structure.

There would have to be two libraries--one for ARM classic and one for THUMB.

Nearly all of the assembly-language modules would be different for the ARM vs. the THUMB libraries, but nearly all of the C functions would be the same for both libraries, with the development tool settings compiling for ARM or THUMB.

Thanks. I will look at that.

Thanks. I'm not sure about the BSD license. I'll look at that.

I believe it was Steve Ballmer of Microsoft who once wrote that "open-source licenses attach themselves like a cancer to anything they touch".

I disagree with that point of view (I fully support open-source development), but they just aren't practical for embedded systems.

Someone who buys a washing machine doesn't want to see information about software licenses anywhere, including the manufacturer's web page. It doesn't help them to use or understand the product.

Open-source is great. Open-source licenses applied to small embedded systems is not.

DTA

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Wed, Apr 18, 2012 7:00 PM

A library (.a) with a header file is a convenient format to use - and as you say, each source file should only contain a few functions so that only those needed are linked in.

However, this will lead to sub-optimal code because it does not allow cross-function and cross-file optimisations. If you are happy giving out the source code, then people can use that to get the smallest and fastest code.

Where possible, it is best to write such a library in C, using inline assembly when it makes a big difference. If you can make your functions as inlineable C code, then it will often result in smaller and faster object code, as the compiler can do things like re-arrange registers, re-use results, avoid function call overheads, etc. When you make the routines in fixed assembly, the compiler is forced to move data into and out of registers to suit the assembly code - this overhead can quickly outweigh the benefits of the tuned assembly code.

Where you do need assembly, it is often worth writing it as inline assembly using advanced syntaxes supported by the compiler. Then at least the compiler has the freedom to pick registers and include the code in its optimisations. gcc syntax is supported by several compilers, including (obviously) all gcc-based tools and CodeWarrior. I don't know if other tools like IAR and Keil support them, or if they have their own syntax.

Don't take advice about open-source licenses from Steve Ballmer - the only thing he knows about them is that they might slightly reduce his gazillion dollar income. Anything he says is pure FUD.

What you are thinking about here is pure GPL - if you make your library GPL, then the user (i.e., another programmer) has to give the end user the right to all the source code, and the person buying the washing machine gets confusing information about the software license.

With BSD-style licenses, no restrictions are imposed on your users (other developers), and the end user knows nothing about it. The only requirement is that no one can remove the copyright notices you put on the files and claim they wrote them. It's a perfectly good license for embedded libraries. (And Microsoft likes it too - they took the BSD-licensed network stack from FreeBSD to form the first version of the Windows TCP/IP stack. Apple likes it even more - they took the entire BSD-licensed FreeBSD OS, put a nice face on it, and called it "OS X".)

An alternative is to use a modified GPL which forces developers to publish enhancements and modifications to the library code, without affecting anything in their own code. See FreeRTOS for an example.

- G
- George Neuner
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Thu, Apr 19, 2012 3:52 AM

That's an exaggeration. It's true that OS X contains some FreeBSD code, but Apple certainly did not "take FreeBSD and put a nice face on it".

OS X is based on NeXTSTEP, which itself was based on CMU's Mach 2.0 operating system. NeXTSTEP took Mach and replaced the existing Unix compatibility layer with new code that included ports of both FreeBSD and NetBSD modules. Apple chose NeXTSTEP to be the basis for the new Mac OS and purchased NeXT to get it.

[And yes, I know Steve Jobs founded NeXT - after he was ousted from Apple. Jobs came back to Apple after a 10 year absence as part of the deal to purchase NeXT. Jobs did not retake control of Apple until the company narrowly avoided bankruptcy 2 years later.]

George

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Thu, Apr 19, 2012 8:06 AM

Errr... You provide your true e-mail address in the message headers (From: field). Do you think crawlers ignore the headers?

- R
- Rocky
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Thu, Apr 19, 2012 11:51 AM

However, gmail is pretty good at filtering spam. I use a gmail address for most of my public activity and gmail successfully gets rid of

99.999 percent of spam

- B
- Boudewijn Dijkstra
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Thu, Apr 19, 2012 10:57 PM

Op Wed, 18 Apr 2012 16:22:10 +0200 schreef David T. Ashley :

Unlikely. With Cortex-M3 and the IAR compiler I came to the following results:

C-only: 63 cycles

C-with-asm: 50 cycles

asm-only: 39 cycles

In how many cycles does your version run?

--
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)

- B
- boB
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Fri, Apr 20, 2012 5:29 AM

Was that with the GNU ARM compiler ??

If so, that's why I use IAR also. Looks like I better see if I can optimize some stuff like Boudewijn has done.

boB

- M
- mwilson
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Mon, Apr 23, 2012 4:51 PM

[ ... ] I've taken a highly opportunistic cargo-cult-style swipe at/of the OpenSSL bignum code. The results can be got via

git clone

formatting link

Results don't look bad for a couple of days work, but clearly have to be massively cleaned and made consistent before they constitute a real library.

Mel.

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Mon, Apr 23, 2012 6:36 PM

Can you post your C code that led to 63-cycle execution?

And your C with ASM?

And your ASM-only?

If you've posted it and I missed it, I apologize.

Thanks, Dave Ashley

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Tue, Apr 24, 2012 11:55 AM

gcc "understands" the intent of the following code, and produces fairly decent assembly code for SH-4:

#include typedef uint32_t U32; typedef U32 U96[3]; typedef U32 U128[4]; void mul(U32 u, U96 v, U128 res) { uint64_t m0 = (uint64_t)u * v[0]; res[0] = m0;

uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32); res[1] = m1;

uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32); res[2] = m2;

res[3] = (m2 >> 32); }

Do you get acceptable results with this function?

Regards.

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Tue, Apr 24, 2012 12:48 PM

I'll give that a try.

What I used was this ... similar but different.

Your handling of carries looks more efficient.

----------

void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128, const APP_BIGINT_U96_T *in_u96, const APP_BIGINT_U32_T *in_u32) { APP_BIGINT_U64_T term_96w0_32w0; APP_BIGINT_U64_T term_96w1_32w0; APP_BIGINT_U64_T term_96w2_32w0; APP_BIGINT_U32_NATIVE_T carry; APP_BIGINT_U64_NATIVE_T temp_u64;

//****************************************************************************************************

//**************************************************************************************************** //Calculate all of the intermediate terms. It was discovered that the compiler does a good //job of using the UMULL instruction. // //96w0 * 32w0 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; term_96w0_32w0.w[1] = temp_u64 >> 32; term_96w0_32w0.w[0] = temp_u64; // //96w1 * 32w0 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; term_96w1_32w0.w[1] = temp_u64 >> 32; term_96w1_32w0.w[0] = temp_u64; // //96w2 * 32w0 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; term_96w2_32w0.w[1] = temp_u64 >> 32; term_96w2_32w0.w[0] = temp_u64; //

//****************************************************************************************************

//**************************************************************************************************** //Add together the terms. The best way to envision this is the way that children are sometimes //taught to multiply with a rectangle and then adding diagonally. // //Word [0] of the result. No addition is required, because this term comes from the least significant //term alone. out_u128->w[0] = term_96w0_32w0.w[0]; // //Word [1] of the result. temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0] + (APP_BIGINT_U64_NATIVE_T)term_96w0_32w0.w[1]; carry = temp_u64 >> 32; out_u128->w[1] = temp_u64;

//Word [2] of the result. temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry + (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[0] + (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[1]; carry = temp_u64 >> 32; out_u128->w[2] = temp_u64;

//Word [3] of the result. temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry + (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[1]; out_u128->w[3] = temp_u64; }

Thanks, DTA

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Tue, Apr 24, 2012 1:55 PM

IMHO, the biggest economy, in terms of code density, comes from the umlal instruction (32x32->64_mul followed by add_64). (On SH-4, it's all done "by hand" using 4-6 instructions.)

FWIW, here's the assembly code gcc 4.6.1 generates for Cortex-M3. What CPU are you targeting? ARM or Thumb mode?

mul: push {r4, r5, r6, r7, r8, r9} ldr r6, [r1, #0] movs r5, #0 umull r6, r7, r6, r0 str r6, [r2, #0] ldr r3, [r1, #4] mov r8, r7 mov r9, r5 umlal r8, r9, r0, r3 str r8, [r2, #4] ldr r3, [r1, #8] mov r6, r9 mov r7, r5 umlal r6, r7, r0, r3 str r6, [r2, #8] str r7, [r2, #12] pop {r4, r5, r6, r7, r8, r9} bx lr

AFAICT, the register copies before the 2 umlal are unnecessary, which means we could shave 3 instructions.

Regards.

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Tue, Apr 24, 2012 3:32 PM

Thanks.

Examining your code ...

The possibility of a carry makes me nervous. I'd first like to verify that the sum

can't exceed 64 bits and generate a carry.

It doesn't need to be verfied that

can't generate a carry, because there is a simple proof of that relying on the operand sizes and the fact that this is the most significant product.

Squaring the maximum 32-bit integer yields:

(2^32-1) * (2^32-1) = 2^64 - 2^33 + 1.

If the second term is also a maximum square, adjusting for the placement, its maximum value is:

2^96 - 2^65 + 2^32

The question of whether there can be a carry is equivalent to the question of whether.

2^96 - 2^65 + 2^32 + 2^64 - 2^33 + 1 < 2^96

Looking at the terms, this is definitely the case.

OK, I believe that

cannot generate a carry. This seems plausible because one of the arguments is 32-bits ... I'm just not sure about how larger multiplications (more terms to add) would work out.

I believe your implementation is correct.

Below is the source code and assembly-language for my original plus your improved version.

Quite a size and speed difference!

Thanks!

DTA

--------

void APP_BIGINT_UmulU128U96U32Old( APP_BIGINT_U128_T *out_u128, const APP_BIGINT_U96_T *in_u96, const APP_BIGINT_U32_T *in_u32) { APP_BIGINT_U64_T term_96w0_32w0; APP_BIGINT_U64_T term_96w1_32w0; APP_BIGINT_U64_T term_96w2_32w0; APP_BIGINT_U32_NATIVE_T carry; APP_BIGINT_U64_NATIVE_T temp_u64;

//****************************************************************************************************

//**************************************************************************************************** //Calculate all of the intermediate terms. It was discovered that the compiler does a good //job of using the UMULL instruction. // //96w0 * 32w0 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; term_96w0_32w0.w[1] = temp_u64 >> 32; term_96w0_32w0.w[0] = temp_u64; // //96w1 * 32w0 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; term_96w1_32w0.w[1] = temp_u64 >> 32; term_96w1_32w0.w[0] = temp_u64; // //96w2 * 32w0 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; term_96w2_32w0.w[1] = temp_u64 >> 32; term_96w2_32w0.w[0] = temp_u64; //

//****************************************************************************************************

//**************************************************************************************************** //Add together the terms. The best way to envision this is the way that children are sometimes //taught to multiply with a rectangle and then adding diagonally. // //Word [0] of the result. No addition is required, because this term comes from the least significant //term alone. out_u128->w[0] = term_96w0_32w0.w[0]; // //Word [1] of the result. temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0] + (APP_BIGINT_U64_NATIVE_T)term_96w0_32w0.w[1]; carry = temp_u64 >> 32; out_u128->w[1] = temp_u64;

//Word [2] of the result. temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry + (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[0] + (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[1]; carry = temp_u64 >> 32; out_u128->w[2] = temp_u64;

//Word [3] of the result. temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry + (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[1]; out_u128->w[3] = temp_u64; }

APP_BIGINT_UmulU128U96U32Old PROC ;;;621 ;;;622 void APP_BIGINT_UmulU128U96U32Old( APP_BIGINT_U128_T

*out_u128, 0001e0 e92d47f0 PUSH {r4-r10,lr} ;;;623 const APP_BIGINT_U96_T *in_u96, ;;;624 const APP_BIGINT_U32_T *in_u32) ;;;625 { 0001e4 b090 SUB sp,sp,#0x40 0001e6 4604 MOV r4,r0 0001e8 460d MOV r5,r1 0001ea 4616 MOV r6,r2 ;;;626 APP_BIGINT_U64_T term_96w0_32w0; ;;;627 APP_BIGINT_U64_T term_96w1_32w0; ;;;628 APP_BIGINT_U64_T term_96w2_32w0; ;;;629 APP_BIGINT_U32_NATIVE_T carry; ;;;630 APP_BIGINT_U64_NATIVE_T temp_u64; ;;;631 ;;;632 //**************************************************************************************************** ;;;633 //**************************************************************************************************** ;;;634 //**************************************************************************************************** ;;;635 //Calculate all of the intermediate terms. It was discovered that the compiler does a good ;;;636 //job of using the UMULL instruction. ;;;637 // ;;;638 //96w0 * 32w0 ;;;639 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; 0001ec 6828 LDR r0,[r5,#0] 0001ee 6831 LDR r1,[r6,#0] 0001f0 fba00101 UMULL r0,r1,r0,r1 0001f4 e9cd0108 STRD r0,r1,[sp,#0x20] ;;;640 term_96w0_32w0.w[1] = temp_u64 >> 32; 0001f8 2100 MOVS r1,#0 0001fa 9809 LDR r0,[sp,#0x24] 0001fc e9cd0106 STRD r0,r1,[sp,#0x18] 000200 900f STR r0,[sp,#0x3c] ;;;641 term_96w0_32w0.w[0] = temp_u64; 000202 9808 LDR r0,[sp,#0x20] 000204 900e STR r0,[sp,#0x38] ;;;642 // ;;;643 //96w1 * 32w0 ;;;644 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; 000206 6868 LDR r0,[r5,#4] 000208 6831 LDR r1,[r6,#0] 00020a fba00101 UMULL r0,r1,r0,r1 00020e e9cd0108 STRD r0,r1,[sp,#0x20] ;;;645 term_96w1_32w0.w[1] = temp_u64 >> 32; 000212 2100 MOVS r1,#0 000214 9809 LDR r0,[sp,#0x24] 000216 e9cd0106 STRD r0,r1,[sp,#0x18] 00021a 900d STR r0,[sp,#0x34] ;;;646 term_96w1_32w0.w[0] = temp_u64; 00021c 9808 LDR r0,[sp,#0x20] 00021e 900c STR r0,[sp,#0x30] ;;;647 // ;;;648 //96w2 * 32w0 ;;;649 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; 000220 68a8 LDR r0,[r5,#8] 000222 6831 LDR r1,[r6,#0] 000224 fba00101 UMULL r0,r1,r0,r1 000228 e9cd0108 STRD r0,r1,[sp,#0x20] ;;;650 term_96w2_32w0.w[1] = temp_u64 >> 32; 00022c 2100 MOVS r1,#0 00022e 9809 LDR r0,[sp,#0x24] 000230 e9cd0106 STRD r0,r1,[sp,#0x18] 000234 900b STR r0,[sp,#0x2c] ;;;651 term_96w2_32w0.w[0] = temp_u64; 000236 9808 LDR r0,[sp,#0x20] 000238 900a STR r0,[sp,#0x28] ;;;652 // ;;;653 //**************************************************************************************************** ;;;654 //**************************************************************************************************** ;;;655 //**************************************************************************************************** ;;;656 //Add together the terms. The best way to envision this is the way that children are sometimes ;;;657 //taught to multiply with a rectangle and then adding diagonally. ;;;658 // ;;;659 //Word [0] of the result. No addition is required, because this term comes from the least significant ;;;660 //term alone. ;;;661 out_u128->w[0] = term_96w0_32w0.w[0]; 00023a 980e LDR r0,[sp,#0x38] 00023c 6020 STR r0,[r4,#0] ;;;662 // ;;;663 //Word [1] of the result. ;;;664 temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0] 00023e 2000 MOVS r0,#0 000240 f8dd9030 LDR r9,[sp,#0x30] 000244 e9cd9004 STRD r9,r0,[sp,#0x10] 000248 f8dd803c LDR r8,[sp,#0x3c] 00024c 4603 MOV r3,r0 00024e 4601 MOV r1,r0 000250 e9cd8006 STRD r8,r0,[sp,#0x18] 000254 eb190008 ADDS r0,r9,r8 000258 4159 ADCS r1,r1,r3 00025a e9cd0108 STRD r0,r1,[sp,#0x20] ;;;665 + ;;;666 (APP_BIGINT_U64_NATIVE_T)term_96w0_32w0.w[1]; ;;;667 carry = temp_u64 >> 32; 00025e 2100 MOVS r1,#0 000260 9809 LDR r0,[sp,#0x24] 000262 4607 MOV r7,r0 000264 e9cd0106 STRD r0,r1,[sp,#0x18] ;;;668 out_u128->w[1] = temp_u64; 000268 9808 LDR r0,[sp,#0x20] 00026a 6060 STR r0,[r4,#4] ;;;669 ;;;670 //Word [2] of the result. ;;;671 temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry 00026c 46ba MOV r10,r7 00026e 2000 MOVS r0,#0 000270 e9cd7002 STRD r7,r0,[sp,#8] 000274 f8dd9028 LDR r9,[sp,#0x28] 000278 4603 MOV r3,r0 00027a 4601 MOV r1,r0 00027c e9cd9004 STRD r9,r0,[sp,#0x10] 000280 eb170009 ADDS r0,r7,r9 000284 4159 ADCS r1,r1,r3 000286 e9cd0100 STRD r0,r1,[sp,#0] 00028a 2000 MOVS r0,#0 00028c f8dd8034 LDR r8,[sp,#0x34] 000290 4603 MOV r3,r0 000292 e9cd8006 STRD r8,r0,[sp,#0x18] 000296 9800 LDR r0,[sp,#0] 000298 eb100008 ADDS r0,r0,r8 00029c 4159 ADCS r1,r1,r3 00029e e9cd0108 STRD r0,r1,[sp,#0x20] ;;;672 + ;;;673 (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[0] ;;;674 + ;;;675 (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[1]; ;;;676 carry = temp_u64 >> 32; 0002a2 2100 MOVS r1,#0 0002a4 9809 LDR r0,[sp,#0x24] 0002a6 4607 MOV r7,r0 0002a8 e9cd0106 STRD r0,r1,[sp,#0x18] ;;;677 out_u128->w[2] = temp_u64; 0002ac 9808 LDR r0,[sp,#0x20] 0002ae 60a0 STR r0,[r4,#8] ;;;678 ;;;679 //Word [3] of the result. ;;;680 temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry 0002b0 46b8 MOV r8,r7 0002b2 2000 MOVS r0,#0 0002b4 e9cd7004 STRD r7,r0,[sp,#0x10] 0002b8 f8dd902c LDR r9,[sp,#0x2c] 0002bc 4603 MOV r3,r0 0002be 4601 MOV r1,r0 0002c0 e9cd9006 STRD r9,r0,[sp,#0x18] 0002c4 eb170009 ADDS r0,r7,r9 0002c8 4159 ADCS r1,r1,r3 0002ca e9cd0108 STRD r0,r1,[sp,#0x20] ;;;681 + ;;;682 (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[1]; ;;;683 out_u128->w[3] = temp_u64; 0002ce 9808 LDR r0,[sp,#0x20] 0002d0 60e0 STR r0,[r4,#0xc] ;;;684 } 0002d2 b010 ADD sp,sp,#0x40 0002d4 e8bd87f0 POP {r4-r10,pc} ;;;685 ENDP

---------

void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128, const APP_BIGINT_U96_T *in_u96, const APP_BIGINT_U32_T *in_u32) { APP_BIGINT_U64_NATIVE_T m2, m1, m0;

m0 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; out_u128->w[0] = m0;

m1 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m0 >> 32); out_u128->w[1] = m1;

m2 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m1 >> 32); out_u128->w[2] = m2;

out_u128->w[3] = (m2 >> 32); }

APP_BIGINT_UmulU128U96U32 PROC ;;;602 //---------------------------------------------------------------------------------------------------- ;;;603 void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T

*out_u128, 00018e b570 PUSH {r4-r6,lr} ;;;604 const APP_BIGINT_U96_T *in_u96, ;;;605 const APP_BIGINT_U32_T *in_u32) ;;;606 { 000190 b088 SUB sp,sp,#0x20 ;;;607 APP_BIGINT_U64_NATIVE_T m2, m1, m0; ;;;608 ;;;609 m0 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; 000192 680b LDR r3,[r1,#0] 000194 6814 LDR r4,[r2,#0] 000196 fba33404 UMULL r3,r4,r3,r4 00019a e9cd3402 STRD r3,r4,[sp,#8] ;;;610 out_u128->w[0] = m0; 00019e 9b02 LDR r3,[sp,#8] 0001a0 6003 STR r3,[r0,#0] ;;;611 ;;;612 m1 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m0 >> 32); 0001a2 684c LDR r4,[r1,#4] 0001a4 6815 LDR r5,[r2,#0] 0001a6 2600 MOVS r6,#0 0001a8 9b03 LDR r3,[sp,#0xc] 0001aa e9cd3600 STRD r3,r6,[sp,#0] 0001ae fbe43605 UMLAL r3,r6,r4,r5 0001b2 e9cd3604 STRD r3,r6,[sp,#0x10] ;;;613 out_u128->w[1] = m1; 0001b6 9b04 LDR r3,[sp,#0x10] 0001b8 6043 STR r3,[r0,#4] ;;;614 ;;;615 m2 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m1 >> 32); 0001ba 688c LDR r4,[r1,#8] 0001bc 6815 LDR r5,[r2,#0] 0001be 2600 MOVS r6,#0 0001c0 9b05 LDR r3,[sp,#0x14] 0001c2 e9cd3600 STRD r3,r6,[sp,#0] 0001c6 fbe43605 UMLAL r3,r6,r4,r5 0001ca e9cd3606 STRD r3,r6,[sp,#0x18] ;;;616 out_u128->w[2] = m2; 0001ce 9b06 LDR r3,[sp,#0x18] 0001d0 6083 STR r3,[r0,#8] ;;;617 ;;;618 out_u128->w[3] = (m2 >> 32); 0001d2 2400 MOVS r4,#0 0001d4 9b07 LDR r3,[sp,#0x1c] 0001d6 e9cd3400 STRD r3,r4,[sp,#0] 0001da 60c3 STR r3,[r0,#0xc] ;;;619 } 0001dc b008 ADD sp,sp,#0x20 0001de bd70 POP {r4-r6,pc} ;;;620 ENDP

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Tue, Apr 24, 2012 3:41 PM

Hand-written version (direct translation of the C source)

mul: push {r4, r5}

ldr r3, [ r1, #0 ] umull r4, r5, r0, r3 str r4, [ r2, #0 ]

ldr r3, [ r1, #4 ] mov r4, #0 umlal r5, r4, r0, r3 str r5, [ r2, #4 ]

ldr r3, [ r1, #8 ] mov r5, #0 umlal r4, r5, r0, r3 str r4, [ r2, #8 ] str r5, [ r2, #12 ]

pop {r4, r5} bx lr

Depending on the specific CPU being targeted, it might be possible to use more registers to improve performance.

- D
- David T. Ashley
  
  Contact options for registered users
Vote on answer
posted
12 years ago

Tue, Apr 24, 2012 3:46 PM

One more note ...

By changing to one temporary variable (rather than 3) the compiler seems to have shaved out two instructions (one at the start to allocate stack space and one at the end to deallocate it).

I believe that the definition of C guarantees that

temp = expression + temp;

will use the "old" value of temp before updating the new one, so that should be OK.

Here is the new code and assembly-language (with two fewer instructions).

I really appreciate the help. The C version you provided is SUBSTANTIALLY more efficient.

DTA

---------

void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128, const APP_BIGINT_U96_T *in_u96, const APP_BIGINT_U32_T *in_u32) { APP_BIGINT_U64_NATIVE_T temp64;

temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; out_u128->w[0] = temp64;

temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32); out_u128->w[1] = temp64;

temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32); out_u128->w[2] = temp64;

out_u128->w[3] = (temp64 >> 32); }

APP_BIGINT_UmulU128U96U32 PROC ;;;602 //---------------------------------------------------------------------------------------------------- ;;;603 void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T

*out_u128, 00018e b57f PUSH {r0-r6,lr} ;;;604 const APP_BIGINT_U96_T *in_u96, ;;;605 const APP_BIGINT_U32_T *in_u32) ;;;606 { ;;;607 APP_BIGINT_U64_NATIVE_T temp64; ;;;608 ;;;609 temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0]; 000190 680b LDR r3,[r1,#0] 000192 6814 LDR r4,[r2,#0] 000194 fba33404 UMULL r3,r4,r3,r4 000198 e9cd3402 STRD r3,r4,[sp,#8] ;;;610 out_u128->w[0] = temp64; 00019c 9b02 LDR r3,[sp,#8] 00019e 6003 STR r3,[r0,#0] ;;;611 ;;;612 temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32); 0001a0 684c LDR r4,[r1,#4] 0001a2 6815 LDR r5,[r2,#0] 0001a4 2600 MOVS r6,#0 0001a6 9b03 LDR r3,[sp,#0xc] 0001a8 e9cd3600 STRD r3,r6,[sp,#0] 0001ac fbe43605 UMLAL r3,r6,r4,r5 0001b0 e9cd3602 STRD r3,r6,[sp,#8] ;;;613 out_u128->w[1] = temp64; 0001b4 9b02 LDR r3,[sp,#8] 0001b6 6043 STR r3,[r0,#4] ;;;614 ;;;615 temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] * (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32); 0001b8 688c LDR r4,[r1,#8] 0001ba 6815 LDR r5,[r2,#0] 0001bc 2600 MOVS r6,#0 0001be 9b03 LDR r3,[sp,#0xc] 0001c0 e9cd3600 STRD r3,r6,[sp,#0] 0001c4 fbe43605 UMLAL r3,r6,r4,r5 0001c8 e9cd3602 STRD r3,r6,[sp,#8] ;;;616 out_u128->w[2] = temp64; 0001cc 9b02 LDR r3,[sp,#8] 0001ce 6083 STR r3,[r0,#8] ;;;617 ;;;618 out_u128->w[3] = (temp64 >> 32); 0001d0 2400 MOVS r4,#0 0001d2 9b03 LDR r3,[sp,#0xc] 0001d4 e9cd3400 STRD r3,r4,[sp,#0] 0001d8 60c3 STR r3,[r0,#0xc] ;;;619 } 0001da bd7f POP {r0-r6,pc} ;;;620 ENDP