Efficient division algorithm?

I'm going to talk to a potential new client about using FPGAs to accelerate part of their system.

As part of what needs done there could be a significant amount of division(s) done.

Previously I've been able to multiply by a reciprocal then scale to make division a double clock operation so this can be easily pipelined. This is only achieveable if the divisor is pre-known and the reciprocal can be pre-calculated.

With what's coming up I'm not sure that I can do this, I know that

Are there any clever techniques for streamlining divisions that make them deterministic and don't use a big wodge of logic?

Thanks for any pointers,

Nial

Reply to
Nial Stewart
Loading thread data ...

Do you need high throughput and short latency? If latency isn't that big issue, I have successfully used the serial devider from opencores.Otherwise there are more divider projects at opencores.

--
Uwe Bonnes                bon@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik  Schlossgartenstrasse 9  64289 Darmstadt
--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------
Reply to
Uwe Bonnes

Check into the lpm_divide function that is part of the lpm package. It's synchronous and accepts new inputs on every clock. It is a standard, although people seem to rail against lpm as being mainly 'Altera'.

Kevin Jennings

Reply to
KJ

That's a trade-off with speed. A subtract loop is simple and deterministic but slow. A binary search multiply-subtract is possible with dsp blocks.

-- Mike Treseler

Reply to
Mike Treseler

I don't know yet :-)

I'll have a look at Opencores, I hadn't thought of trying there.

Thanks,

Nial.

Reply to
Nial Stewart

I'll have a look at this but my impression with this is that it uses a _lot_ of logic?

Nial

Reply to
Nial Stewart

It makes a_whole_lotta_luts in one lump. Requires a slow system clock or a multi-cycle constraint. Quartus will infer lpm_divide from from int or signed /.

-- Mike Treseler

Reply to
Mike Treseler

That of course depends on how much you think is *a lot*. It does let you tradeoff latency for performance though and depending on how you set that parameter it uses different amounts of logic...if I recall correctly, 0 latency (i.e. a combinatorial implementation of division) results in the largest logic usage, a latency equal to the width of the numerator (or denominator, maybe? fuzzy brain neuron currently being accessed) resulted in a good tradeoff in my application where I had three dividers but needed somewhat high performance (75 MHz in a Stratix I).

If you need to be able to process full divisions on a system clock cycle, it will do the trick. If your system clock can run much faster so you can take several clocks to do the divide a divider that works on single bits at a time might be more appropriate.

Kevin Jennings

Reply to
KJ

I ran it at 75 MHz on a Stratix I without any multi-cycle constraints. You want to use the pipelining parameter to lpm_divide to tell it how many pipeline stages it can use. That affects how it implements the logic and allows for tradeoffs between the size of the lump of logic, the latency and the clock cycle but does not require specifying of any multi-cycle timing constraints. In any case, it shouldn't require any dramatic system clock sloooowing.

Doing so though will result in lpm_divide being parameterized with 0 latency which will result in the largest logic lump and the slowest clock cycle performance. Depending on the application it might be better to instantiate the lpm_divide and not have it inferred from "/".

Kevin Jennings

Reply to
KJ

That's one that I've always wanted to try just to see how well it does, but haven't had the need to do so of late. One would expect it to get high throughput, low logic usage and probably be the best in today's world of nearly free DSP block multipliers in FPGAs.

Any benchmark data?

Kevin Jennings

Reply to
KJ

Yes, that one's on my white board too.

It would be a good use for hardware multipliers that often left unwired.

Well it couldn't do any better than log2(quotient) ticks with one multiplier. Hmmm.. could divide that time 2**m for m multipliers and parallel searches...

-- Mike Treseler

Reply to
Mike Treseler

Thanks for the clarifications. I might also try using "/" with pipeline registers and turn on synthesis register retiming.

-- Mike Treseler

Reply to
Mike Treseler

Hi all,

Take a look at

formatting link
On page 61, it describes two different algorithms. I've implement the first one without DSP block. It can perform a division of N bits in N cycles. It runs quite fast; on Spartan-3A-5, 100MHz for 32 bits (200MHz for 8 bits).

Best regards.

Reply to
Alain

It doesn't have to be. If you explicitly delay the result, it will infer pipeline stages (I haven't tried it for divide, but other lpm works this way) . Eg.

res3

Reply to
Tommy Thorn

Nial,

Your reciprocal division is reasonably close to a low latency divider I've used many times in the past. Instead of just one reciprocal, use the upper bits of the denominator to address a BRAM containing reciprocals, then use the lower bits of the denominator to interpolate between the stored reciprocals to obtain a corrected reciprocal for more precision. Use that to multiply the numerator to obtain the quotient. It can be pipelined to run at the max clock for the multipliers, and has a latency of a BRAM plus two multipliers plus an adder.

For example a divider with 17 bit denominator and numerator and 16 bit quotient can use the upper 10 bits of the denominator to address a 1Kx18 reciprocal ROM to get an approximate reciprocal. Interpolate using a second 1Kx18 ROM (holds the slope at each stored reciprocal) and a multiplier to get a correction equal to the product of the 6 LSBs of the denominator and the entry from the interpolation rom. That gets added to the reciprocal ROM output to get a corrected reciprocal. Multiply that corrected reciprocal by the numerator and round to get the quotient. The entire 17 bit divide uses two multipliers, two BRAMs and two adders plus registers to match delays to give a low latency divider. You can improve the accuracy further if you can normalize the denominator and numerator before the divide and then denormalize by the difference of the normalizing shifts afterwards (basically floating point).

Reply to
Ray Andraka

I've heard that the CORDIC can be used for division. I'm not sure how favorably this compares with other methods.

-Kevin

Reply to
Kevin Neilson

Cordic division is similar to non-restoring sequential division. It uses a lot of adder-subtractors, which means it is either quite slow or it has a large clock latency.

Reply to
Ray Andraka

used many times in the

to address a BRAM

interpolate between the

that to multiply the

for the multipliers,

quotient can use the upper

approximate reciprocal.

reciprocal) and a multiplier

the entry from the

corrected reciprocal.

quotient. The entire 17

match delays to give a

normalize the denominator

normalizing shifts

Thanks Ray,

This is exactly the sort of technique I was talking about.

This will defenitely be worth a look in I end up having to do multiple divisions (as I think I might have to).

Nial.

Reply to
Nial Stewart

The NiosII blurb says that it can implement a single instruction

32 bit multiply or divide.

I wonder what sort of architecture they're using?

Nial.

Reply to
Nial Stewart

What's unusual about a single instruction multiply or divide? Surely it's just a question of writing the microcode to do it.

Mike

Reply to
MikeShepherd564

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.