I'd like to ask how I can optimise a binary division algorithm if the divisor has only half bits as compared to the dividend (ie. 128bit/64bit,
64bit/32bit, 32bit/16bit, 16bit/8bit). I remember vaguely that you divide the higher bits of the dividend and somehow use the result in the division of the dividend lower bits, but don't remember the necessary details. I'm asking this because I'm developing an integral division algorithm for an
8bit controller and as I deal only with the above cases, I'd like to speed it up a bit.
I do this routinely and have, recently. How many different divisions are you developing (64/32, 32/16, 16/8, 8/4, etc.), what are their fractional bits in their results (none is common for this, but it's worth asking), and what is the processor? (I can probably help suggest something, but the details are important.)
It's quite impossible to discuss that meaningfully without knowing what operations you have available as building blocks of such an algorithm.
Do you have a 16/8 division operator? An 8/8 one? None at all? Essentially, you'ld be using the same technique that you should have been taught in primary school for dividing large numbers on paper, where the "native" operations are 1*1 decimal digit multiplication,
1+1 addition, 1-1 subtraction and 2/1 division.
And then there's the more serious hacks, like doing 16/8 division with a compile-time fixed divisor by a 16*8 multiply and a shift. E.g. integer x/9 is the same as (x*57)>>9.
--
Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.
64/32, 32/16 and 16/8. I'd like to, if possible, keep it in one generic form, though.
is the
None.
If it matters, I need an unsigned version and a signed one, with the remainder having the same sign as the dividend. For now I just do the unsigned division and correct the signs later.
Probably one of the ST7 family controllers. For now I develop the code in C.
I program using C, so there is a 16/8 operator, but it's emulated (there's no div in the ST7 instruction set). I believe the 'pen-and-paper' technique using shifts and subtractions would be faster than using this C operator somehow for 64/32 and 32/16.
Sure, I use that, it's just that it's not optimised for eg. 16/8, it's as generic as 16/16. And AFAIR there's a technique for an optimisation in this field, not only when dividing by constants.
I don't use C for something like this. And it certainly defeats your earlier point, which you said was "I'd like to speed it up a bit." Are you prepared to read some materials and write this in assembly? (An optimal algorithm in written in C is NOT the same as an optimal algorithm in assembly, because the available semantics are different in each case.) Also, do you want a rounded result? (In other words, is 1/2 going to return 1 or 0?)
I also prefer to have routines that normalize both the numerator and denominator before applying the division algorithm and provide an exponent to reflect the net of the shifting process. This way, I'm guaranteed to retain maximum precision and can use this as part of a floating point routine, later. But I gather you aren't interested in that, at all.
Here's an example algorithm in MSP430 assembler, for example:
; -------------------------------------------------------------------------- ; DIV32u16uQR 32u/16u --> 16u R 16u ; -------------------------------------------------------------------------- ; ; This routine provides division of an unsigned, 32-bit dividend by an ; unsigned, 16-bit divisor and produces an unsigned, 16-bit quotient and ; an unsigned, 16-bit remainder. Essentially, this is ( R14:R15 / R13 ). ; ; No normalization takes place in this routine and no exponent is produced. ; ; Inputs: ; ; R13 unsigned 16-bit divisor ; R14 unsigned high-order 16 bits of the dividend ; R15 low-order 16 bits of the dividend ; ; Outputs: ; ; R14 unsigned 16-bit remainder ; R15 unsigned 16-bit quotient ; ; Scratches: ; ; R12 counter
Div32u16uQR mov #16, R12 ; set up the shift counter Div32u16uQR_0 rla R15 rlc R14 jc Div32u16uQR_1 sub R13, R14 jc Div32u16uQR_2 add R13, R14 dec R12 ; done? jnz Div32u16uQR_0 ; no -- continue ret ; yes -- return Div32u16uQR_1 sub R13, R14 Div32u16uQR_2 inc R15 dec R12 ; done? jnz Div32u16uQR_0 ; no -- continue ret ; yes -- return
A nice feature here is that it *does* produce the remainder. That is handy if you want to round the result, for example. Another version provides a little more speed, pairing bits into a restoring version. At the expense of some bookkeeping code.
; -------------------------------------------------------------------------- ; DIV32u16uQR 32u/16u --> 16u R 16u ; -------------------------------------------------------------------------- ; ; This routine provides division of an unsigned, 32-bit dividend by an ; unsigned, 16-bit divisor and produces an unsigned, 16-bit quotient and ; an unsigned, 16-bit remainder. ; ; No normalization takes place in this routine and no exponent is produced. ; ; Inputs: ; ; R13 unsigned 16-bit divisor ; R14 unsigned high-order 16 bits of the dividend ; R15 low-order 16 bits of the dividend ; ; Outputs: ; ; R14 unsigned 16-bit remainder ; R15 unsigned 16-bit quotient ; ; Scratches: ; ; R12 counter
Div32u16uQR mov #8, R12 ; set up the shift counter Div32u16uQR_0 rla R15 rlc R14 jc Div32u16uQR_3 sub R13, R14 jc Div32u16uQR_4 add R13, R14 ; restore rla R15 rlc R14 jnc Div32u16uQR_5 Div32u16uQR_1 sub R13, R14 Div32u16uQR_2 inc R15 dec R12 ; done? jnz Div32u16uQR_0 ; no -- continue ret ; yes -- return Div32u16uQR_3 sub R13, R14 Div32u16uQR_4 inc R15 rla R15 rlc R14 jc Div32u16uQR_1 Div32u16uQR_5 sub R13, R14 jc Div32u16uQR_2 add R13, R14 ; restore dec R12 ; done? jnz Div32u16uQR_0 ; no -- continue ret ; yes -- return
There's actually quite a lot of information out there on integer division, because this is and has been a long going concern for those designing computers and micros -- not just software people. You'll find a lot of good information in the VHDL and Verilog areas (the so called "sequential dividers" as opposed to "combinatorial dividers".)
If you cannot extract the algorithmic logic from the above and make sense of it, feel free to ask. It's not hard to see, if you lay this out on paper and work an example. I just don't want to have to write out a tutorial here, when it's likely that there are already many pages on the web on this subject that go into the gory details and I figure you can do some work, here. But once you "get it," it is almost 2nd nature and you can adapt the above to produce various fixed point versions or enhance it to support floating point (which will almost always be faster and better than the libraries you are 'given' -- a true, recurring fact that happens for reasons I cannot really fathom, unless those writing the FP libraries are more interested in just putting out boilerplate than thinking a little.)
As long as we are bandying old assembly code, here is a thoroughly tested version of mine from 30 years ago. It runs on an 8080, z80, z180, etc. and doesn't require the added z80 registers etc. The break up provides useful routines for other operations. The head comments show net registers disturbed. It is extremely simple.
I later built a faster version around a 24/16 bit module, with the identical external appearance.
; ; 2's complement (BC), leave (B) in (A) ; A,B,C c2bc: dcx b ; " " ; 1's complement (DE), leave (B) in (A) ; A,D,E c1bc: mov a,c cma mov c,a mov a,b cma mov b,a ret ; ; Integer (unsigned pos.) divide (DEHL)/(BC)=>(DE) ; Remainder appears in (HL) ; Carry for overflow, when registers unchanged ; Divisor, remainder and quotient range 0 to 65535 ; Dividend range 0 to 4295*10^6 (approx.) ; F,D,E,H,L idiv: push psw mov a,e; Check for overflow sub c mov a,d sbb b jc idiv1; No overflow pop psw; Restore (A) stc; Mark overflow ret idiv1: push b call c2bc; Change (BC) sign xchg; Do arithmetic in (HL) mvi a,-16; Iteration count idiv2: push psw; Save iteration count dad h; Left shift (HLDE) rar; Save Carry out xchg dad h xchg jnc idiv3; No Carry into L inx h idiv3: ral; Regain Carry from H jc idiv4; Yes, generate quotient bit mov a,l add c; Test for quotient bit mov a,h adc b jnc idiv5; No bit idiv4: dad b; Subtract inx d; Insert quotient bit idiv5: pop psw; Get iteration count inr a jm idiv2; Not done pop b; Restore BC pop psw; Restore A ora a; Clear any Carry, no overflow ret
--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
Available for consulting/temporary embedded and systems.
USE worldnet address!
ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here.
All logos and trade names are the property of their respective owners.