I know most processors have an instruction for division and certainly Forth includes division words. But how are they implemented? I am asking because I need to do a division in hardware in an FPGA.
Are there processors that don't include a division instruction? Has anyone had to implement division using something other than a native CPU division instruction?
I worked on an attached array processor many years ago (a floating point coprocessor in a rack cabinet) and I believe it did division using Newton-Raphson iteration. It converged quickly and took some 10 or 12 cycles to complete. Considering the multiply needed to carry it out took more than one clock cycle, that's not bad.
Are there better techniques for division? I guess there is long division: "shift and subtract", as compared to shift and add for multiplication. Fortunately multipliers are still common.
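For reference, that shift-and-subtract loop is the same one you'd do on paper. A minimal Python sketch of the restoring-division version (widths and names here are just illustrative, modeling one quotient bit per clock):

```python
def long_divide(dividend, divisor, bits=16):
    """Restoring ("shift and subtract") long division, one quotient bit
    per step -- the mirror image of shift-and-add multiplication.
    The 16-bit width is an arbitrary assumption for the sketch."""
    assert divisor != 0
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        # Shift the next dividend bit (MSB first) into the remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:   # trial subtraction fits
            remainder -= divisor
            quotient |= 1          # record a 1 in the quotient
    return quotient, remainder
```

In hardware each loop pass is one compare/subtract/mux stage, so it maps naturally onto either a sequential divider (one bit per clock) or a fully unrolled combinational one.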
Besides the simple long division you are describing, this also covers a couple of other techniques: SRT, which can produce multiple quotient bits per step, and Newton-Raphson, which works well for big numbers since each pass doubles the number of correct bits you have (but each pass takes two multiplies).
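To make the Newton-Raphson point concrete, here is a hypothetical fixed-point sketch of the reciprocal iteration x &lt;- x*(2 - d*x): two multiplies per pass, and each pass roughly doubles the correct bits. The Q1.32 format and the 48/17 - (32/17)*d seed are my assumptions, not anything from the posts above:

```python
FRAC = 32                  # Q1.32 fixed point: value = raw / 2**32
ONE = 1 << FRAC

def nr_reciprocal(d_fixed, passes=5):
    """Approximate reciprocal of d_fixed, for a normalised divisor
    in [0.5, 1.0), i.e. ONE//2 <= d_fixed < ONE."""
    # Standard linear seed 48/17 - (32/17)*d: worst-case relative
    # error about 1/17, so a handful of passes reaches 32 bits.
    x = (48 * ONE) // 17 - (32 * d_fixed) // 17
    for _ in range(passes):
        dx = (d_fixed * x) >> FRAC          # multiply #1
        x = (x * (2 * ONE - dx)) >> FRAC    # multiply #2
    return x

# A division n / d then costs one more multiply:
#   q = (n * nr_reciprocal(d_fixed)) >> FRAC
```

The relative error squares every pass, which is the "doubles the number of bits" behaviour; truncation in the two shifts leaves the result within a few ULP of the true reciprocal.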
There are /lots/ of processors that don't have division instructions. It was pretty much standard practice for small microcontrollers to avoid hardware division instructions until the Cortex M3 became the standard. (Cortex M0, used for the cheapest and lowest power ARM microcontrollers, and Cortex M1 for soft cores in FPGAs, do not have hardware division.)
Hardware division instructions are expensive in die space and relatively rarely used in software - they are simply not worth the cost for many microcontroller systems.
Also, hardware division is almost invariably multi-cycle, and the division execution unit does not lend itself to pipelining, which fits poorly in a RISC processor when you are trying to keep a simple, predictable, single-cycle core. Some RISC designs skip division entirely; others have a "division step" instruction that is simple enough to be acceptable in small, low-power hardware but speeds up software division routines.
Even on larger devices, division is a pain - requiring multiple cycles and without pipelining, a division is often an order of magnitude slower in throughput and latency than a multiplication, so compilers do their best to avoid them (preferring to multiply by the reciprocal where possible).
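The reciprocal trick for a constant divisor comes down to one multiply and one shift. A sketch (the 16-bit width is an assumption; with FRAC = 2*WIDTH and a reciprocal of ceil(2**FRAC / d), the multiply-shift reproduces floor(n/d) exactly for every WIDTH-bit n, which is essentially what compilers emit for something like `n / 7`):

```python
WIDTH = 16          # unsigned operand width (assumed for this sketch)
FRAC = 2 * WIDTH    # enough fractional bits: n*(d-1) < 2**FRAC => exact

def reciprocal_constant(d):
    """ceil(2**FRAC / d): the 'magic number' a compiler bakes in."""
    return ((1 << FRAC) + d - 1) // d

RECIP7 = reciprocal_constant(7)

def div7(n):
    # One multiply plus one shift instead of a (much slower) divide.
    return (n * RECIP7) >> FRAC
```

The exactness condition is that the reciprocal's rounding error, scaled by the largest numerator, stays below one quotient step: n * (d - 1) &lt; 2**FRAC, which 2*WIDTH fractional bits always satisfies.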
If my history is correct, the Motorola 68k family had hardware division instructions until someone realised (on the 68020, I think) that pure software division routines were /faster/ than the hardware division instructions.
(Very small microcontrollers don't even have multiply instructions, but these are getting rare now.)
"Better" depends on your requirements. Do you need to minimise hardware size, maximise throughput, or minimise latency? Do you need exact results? (Not every use-case does.) Will the denominator be re-used for multiple divisions, or even constant? What are your operand sizes?
As Richard says, the Wikipedia article can give you some suggestions (in addition to the ones you mentioned here). But if you can take advantage of additional information about the situation, you might be able to do better than standard general-purpose algorithms.
For clocks under ~100 MHz and divisors up to about 11 bits, just brute force it with long division. Yes, in a single clock cycle. If you need faster, then pipeline it.
For divisors over 11 bits I'd start getting more exotic: ROMs holding lookup tables of inverse values, then on to even more exotic Newton-Raphson (perhaps combined with the lookup ROMs).
I've only ever needed to just brute force it (with pipelining).
Of course if you're doing this in a smaller FPGA, then you may need to consider bit-serial or similar sequential algorithms that take multiple clocks per division, trading throughput (and somewhat more complicated control logic) for much less hardware.
I looked at a bit serial multiplier once and it didn't save much, if anything, over an 8 bit sequential (shift and add) multiplier. The control logic is much more complex for bit serial.
I plan to do this in a small device, but don't have a feel yet for what needs to go in it. The part does have fast 16 bit hardware multipliers and the data rate is 100,000 times slower, so why not use the available resources.
I guess I was really asking if there were better algorithms that work like Newton-Raphson but converge more quickly or something. From reading about Newton-Raphson, if it is seeded with a small table lookup it will converge quickly, but since time is in much greater supply than memory, it won't really matter much to start with a fixed value for every computation. In fact, it would be perfectly acceptable to calculate worst-case numbers for the iterations required and simply loop that many times regardless. Once the error is zero the result stops changing.
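That worst-case count falls straight out of the quadratic convergence: the relative error squares on every pass, so you can solve e0**(2**n) &lt;= 2**-bits up front and hard-wire that many loops. A small sketch (the seed error and target precision are assumed inputs):

```python
def passes_needed(seed_rel_error, target_bits):
    """Smallest n with seed_rel_error**(2**n) <= 2**-target_bits,
    i.e. how many Newton-Raphson passes to hard-wire for the worst case."""
    n = 0
    e = seed_rel_error
    while e > 2.0 ** -target_bits:
        e = e * e   # relative error squares each Newton-Raphson pass
        n += 1
    return n
```

With the common 48/17 linear seed (worst-case error about 1/17) only three passes reach 32 bits; even a crude seed with 0.5 error needs just five, so a fixed loop count costs very little.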
This is also simplified by the fact that this is a way to match the ADC input range to the sensor output, which are based on separate power supplies. It's not necessary to use an accurate reference on the ADC if that same ADC is used to measure the sensor power supply, which provides the basis for the gain setting in the sensor output.
So the divisor will be in a narrow range. Maybe a table lookup for that narrow range is the best way. With a 12 bit input, even 200 entries will cover it.
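A sketch of that table approach, with the divisor range, the 12-bit width, and the fixed-point format all assumptions (the post doesn't pin them down): store one fixed-point reciprocal per possible divisor and divide with a single multiply and shift. Choosing FRAC so that n*d &lt;= 2**FRAC makes the result exact.

```python
FRAC = 25                    # 12-bit n times ~12-bit d fits below 2**25
D_MIN, D_MAX = 3900, 4100    # hypothetical narrow divisor range

# ~200 entries of ceil(2**FRAC / d) each: a small block RAM / ROM.
RECIP = {d: ((1 << FRAC) + d - 1) // d for d in range(D_MIN, D_MAX + 1)}

def table_divide(n, d):
    """floor(n / d) via one ROM lookup and one multiply-shift."""
    return (n * RECIP[d]) >> FRAC
```

No iteration at all: each division is one ROM read feeding one of the part's hardware multipliers, which suits a data rate far below the clock rate.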
Darn, I keep simplifying away the fun stuff.
I hear you. Sometimes these "do elementary school maths with the technology at hand, under this set of conditions" problems make very satisfying small side projects. Easy concepts, but sometimes tricky conditions to work under. This may frustrate some folks, but I often find this sort of task quite satisfying...
Recalling an FPGA project I did about 10 years ago, I was able to implement FPGA processing blocks that could be integrated into the embedded core and presented as custom instructions. It worked very well for me. Fewer than 10 clocks would produce a single precision floating point result. The requirement was to compute a standard deviation in less than 500 us. Emulated floating point divide and sqrt were time-consuming operations at the time.
Details: It was an Altera Cyclone III FPGA. I was using the Quartus tool set. I embedded a NIOS II core. Quartus was a huge tool featuring many different ways to get something done. The choice I made (custom instruction plus various Quartus supplied floating point math blocks) was not obvious but I eventually stumbled across my solution combo and used it successfully.