Division algorithm optimisation

Hi,

I'd like to ask how I can optimise a binary division algorithm if the divisor has only half bits as compared to the dividend (ie. 128bit/64bit,

64bit/32bit, 32bit/16bit, 16bit/8bit). I remember vaguely that you divide the higher bits of the dividend and somehow use the result in the division of the dividend lower bits, but don't remember the necessary details. I'm asking this because I'm developing an integral division algorithm for an 8bit controller and as I deal only with the above cases, I'd like to speed it up a bit.

thanks

Reply to
SM
Loading thread data ...

I do this routinely and have, recently. How many different divisions are you developing (64/32, 32/16, 16/8, 8/4, etc.), what are their fractional bits in their results (none is common for this, but it's worth asking), and what is the processor? (I can probably help suggest something, but the details are important.)

Jon

Reply to
Jonathan Kirwan

If this is for a PIC you might be interested in some macros I wrote which generate arbitrary division and multiplication functions:

formatting link

If you can puzzle out what it's doing it takes shortcuts for signed vs unsigned, slightly shorter (not just half) divisors etc.

--
Ben Jackson

http://www.ben.com/
Reply to
Ben Jackson

It's quite impossible to discuss that meaningfully without knowing what operations you have available as building blocks of such an algorithm.

Do you have a 16/8 division operator? An 8/8 one? None at all? Essentially, you'ld be using the same technique that you should have been taught in primary school for dividing large numbers on paper, where the "native" operations are 1*1 decimal digit multiplication,

1+1 addition, 1-1 subtraction and 2/1 division.

And then there's the more serious hacks, like doing 16/8 division with a compile-time fixed divisor by a 16*8 multiply and a shift. E.g. integer x/9 is the same as (x*57)>>9.

--
Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.
Reply to
Hans-Bernhard Broeker

you

in

64/32, 32/16 and 16/8. I'd like to, if possible, keep it in one generic form, though.

is the

None.

If it matters, I need an unsigned version and a signed one, with the remainder having the same sign as the dividend. For now I just do the unsigned division and correct the signs later.

Probably one of the ST7 family controllers. For now I develop the code in C.

Reply to
SM

Thanks, I'll check it out

Reply to
SM

I program using C, so there is a 16/8 operator, but it's emulated (there's no div in the ST7 instruction set). I believe the 'pen-and-paper' technique using shifts and subtractions would be faster than using this C operator somehow for 64/32 and 32/16.

Sure, I use that, it's just that it's not optimised for eg. 16/8, it's as generic as 16/16. And AFAIR there's a technique for an optimisation in this field, not only when dividing by constants.

Reply to
SM

I don't use C for something like this. And it certainly defeats your earlier point, which you said was "I'd like to speed it up a bit." Are you prepared to read some materials and write this in assembly? (An optimal algorithm in written in C is NOT the same as an optimal algorithm in assembly, because the available semantics are different in each case.) Also, do you want a rounded result? (In other words, is 1/2 going to return 1 or 0?)

I also prefer to have routines that normalize both the numerator and denominator before applying the division algorithm and provide an exponent to reflect the net of the shifting process. This way, I'm guaranteed to retain maximum precision and can use this as part of a floating point routine, later. But I gather you aren't interested in that, at all.

Here's an example algorithm in MSP430 assembler, for example:

; -------------------------------------------------------------------------- ; DIV32u16uQR 32u/16u --> 16u R 16u ; -------------------------------------------------------------------------- ; ; This routine provides division of an unsigned, 32-bit dividend by an ; unsigned, 16-bit divisor and produces an unsigned, 16-bit quotient and ; an unsigned, 16-bit remainder. Essentially, this is ( R14:R15 / R13 ). ; ; No normalization takes place in this routine and no exponent is produced. ; ; Inputs: ; ; R13 unsigned 16-bit divisor ; R14 unsigned high-order 16 bits of the dividend ; R15 low-order 16 bits of the dividend ; ; Outputs: ; ; R14 unsigned 16-bit remainder ; R15 unsigned 16-bit quotient ; ; Scratches: ; ; R12 counter

Div32u16uQR mov #16, R12 ; set up the shift counter Div32u16uQR_0 rla R15 rlc R14 jc Div32u16uQR_1 sub R13, R14 jc Div32u16uQR_2 add R13, R14 dec R12 ; done? jnz Div32u16uQR_0 ; no -- continue ret ; yes -- return Div32u16uQR_1 sub R13, R14 Div32u16uQR_2 inc R15 dec R12 ; done? jnz Div32u16uQR_0 ; no -- continue ret ; yes -- return

A nice feature here is that it *does* produce the remainder. That is handy if you want to round the result, for example. Another version provides a little more speed, pairing bits into a restoring version. At the expense of some bookkeeping code.

; -------------------------------------------------------------------------- ; DIV32u16uQR 32u/16u --> 16u R 16u ; -------------------------------------------------------------------------- ; ; This routine provides division of an unsigned, 32-bit dividend by an ; unsigned, 16-bit divisor and produces an unsigned, 16-bit quotient and ; an unsigned, 16-bit remainder. ; ; No normalization takes place in this routine and no exponent is produced. ; ; Inputs: ; ; R13 unsigned 16-bit divisor ; R14 unsigned high-order 16 bits of the dividend ; R15 low-order 16 bits of the dividend ; ; Outputs: ; ; R14 unsigned 16-bit remainder ; R15 unsigned 16-bit quotient ; ; Scratches: ; ; R12 counter

Div32u16uQR mov #8, R12 ; set up the shift counter Div32u16uQR_0 rla R15 rlc R14 jc Div32u16uQR_3 sub R13, R14 jc Div32u16uQR_4 add R13, R14 ; restore rla R15 rlc R14 jnc Div32u16uQR_5 Div32u16uQR_1 sub R13, R14 Div32u16uQR_2 inc R15 dec R12 ; done? jnz Div32u16uQR_0 ; no -- continue ret ; yes -- return Div32u16uQR_3 sub R13, R14 Div32u16uQR_4 inc R15 rla R15 rlc R14 jc Div32u16uQR_1 Div32u16uQR_5 sub R13, R14 jc Div32u16uQR_2 add R13, R14 ; restore dec R12 ; done? jnz Div32u16uQR_0 ; no -- continue ret ; yes -- return

There's actually quite a lot of information out there on integer division, because this is and has been a long going concern for those designing computers and micros -- not just software people. You'll find a lot of good information in the VHDL and Verilog areas (the so called "sequential dividers" as opposed to "combinatorial dividers".)

If you cannot extract the algorithmic logic from the above and make sense of it, feel free to ask. It's not hard to see, if you lay this out on paper and work an example. I just don't want to have to write out a tutorial here, when it's likely that there are already many pages on the web on this subject that go into the gory details and I figure you can do some work, here. But once you "get it," it is almost 2nd nature and you can adapt the above to produce various fixed point versions or enhance it to support floating point (which will almost always be faster and better than the libraries you are 'given' -- a true, recurring fact that happens for reasons I cannot really fathom, unless those writing the FP libraries are more interested in just putting out boilerplate than thinking a little.)

Good luck, Jon

Reply to
Jonathan Kirwan

--------------------------------------------------------------------------

Hi Jon,

I can't see the reason for this jump. If I'm not mistaken, it is only taken if the result doesn't fit into 16-bit.

Reply to
Jan-Hinnerk Reichert

As long as we are bandying old assembly code, here is a thoroughly tested version of mine from 30 years ago. It runs on an 8080, z80, z180, etc. and doesn't require the added z80 registers etc. The break up provides useful routines for other operations. The head comments show net registers disturbed. It is extremely simple.

I later built a faster version around a 24/16 bit module, with the identical external appearance.

; ; 2's complement (BC), leave (B) in (A) ; A,B,C c2bc: dcx b ; " " ; 1's complement (DE), leave (B) in (A) ; A,D,E c1bc: mov a,c cma mov c,a mov a,b cma mov b,a ret ; ; Integer (unsigned pos.) divide (DEHL)/(BC)=>(DE) ; Remainder appears in (HL) ; Carry for overflow, when registers unchanged ; Divisor, remainder and quotient range 0 to 65535 ; Dividend range 0 to 4295*10^6 (approx.) ; F,D,E,H,L idiv: push psw mov a,e; Check for overflow sub c mov a,d sbb b jc idiv1; No overflow pop psw; Restore (A) stc; Mark overflow ret idiv1: push b call c2bc; Change (BC) sign xchg; Do arithmetic in (HL) mvi a,-16; Iteration count idiv2: push psw; Save iteration count dad h; Left shift (HLDE) rar; Save Carry out xchg dad h xchg jnc idiv3; No Carry into L inx h idiv3: ral; Regain Carry from H jc idiv4; Yes, generate quotient bit mov a,l add c; Test for quotient bit mov a,h adc b jnc idiv5; No bit idiv4: dad b; Subtract inx d; Insert quotient bit idiv5: pop psw; Get iteration count inr a jm idiv2; Not done pop b; Restore BC pop psw; Restore A ora a; Clear any Carry, no overflow ret

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!
Reply to
CBFalconer

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.