Efficient division algorithm?

Question

I'm going to talk to a potential new client about using FPGAs to accelerate part of their system.

As part of what needs done there could be a significant amount of division(s) done.

Previously I've been able to multiply by a reciprocal then scale to make division a double clock operation so this can be easily pipelined. This is only achieveable if the divisor is pre-known and the reciprocal can be pre-calculated.

With what's coming up I'm not sure that I can do this, I know that

Are there any clever techniques for streamlining divisions that make them deterministic and don't use a big wodge of logic?

Thanks for any pointers,

Nial

Uwe Bonnes · Accepted Answer

Do you need high throughput and short latency? If latency isn't that bigissue, I have successfully used the serial devider from opencores.Otherwisethere are more divider projects at opencores.-- Uwe Bonnes                Institut fuer Kernphysik  Schlossgartenstrasse 9  64289 Darmstadt--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------

KJ · Answer

Check into the lpm_divide function that is part of the lpm package. It's synchronous and accepts new inputs on every clock. It is a standard, although people seem to rail against lpm as being mainly 'Altera'.

Kevin Jennings

Mike Treseler · Answer

That's a trade-off with speed.A subtract loop is simple and deterministic but slow.A binary search multiply-subtract is possible with dsp blocks.     -- Mike Treseler

Nial Stewart · Answer

I don't know yet :-)I'll have a look at Opencores, I hadn't thought of trying there.Thanks,Nial.

Nial Stewart · Answer

I'll have a look at this but my impression with this is that ituses a _lot_ of logic?Nial

Mike Treseler · Answer

It makes a_whole_lotta_luts in one lump.Requires a slow system clock or a multi-cycle constraint.Quartus will infer lpm_divide from from int or signed /.      -- Mike Treseler

KJ · Answer

That of course depends on how much you think is *a lot*. It does let you tradeoff latency for performance though and depending on how you set that parameter it uses different amounts of logic...if I recall correctly, 0 latency (i.e. a combinatorial implementation of division) results in the largest logic usage, a latency equal to the width of the numerator (or denominator, maybe? fuzzy brain neuron currently being accessed) resulted in a good tradeoff in my application where I had three dividers but needed somewhat high performance (75 MHz in a Stratix I).

If you need to be able to process full divisions on a system clock cycle, it will do the trick. If your system clock can run much faster so you can take several clocks to do the divide a divider that works on single bits at a time might be more appropriate.

Kevin Jennings

KJ · Answer

I ran it at 75 MHz on a Stratix I without any multi-cycle constraints. You want to use the pipelining parameter to lpm_divide to tell it how many pipeline stages it can use. That affects how it implements the logic and allows for tradeoffs between the size of the lump of logic, the latency and the clock cycle but does not require specifying of any multi-cycle timing constraints. In any case, it shouldn't require any dramatic system clock sloooowing.

Doing so though will result in lpm_divide being parameterized with 0 latency which will result in the largest logic lump and the slowest clock cycle performance. Depending on the application it might be better to instantiate the lpm_divide and not have it inferred from "/".

Kevin Jennings

KJ · Answer

That's one that I've always wanted to try just to see how well it does, but haven't had the need to do so of late. One would expect it to get high throughput, low logic usage and probably be the best in today's world of nearly free DSP block multipliers in FPGAs.

Any benchmark data?

Kevin Jennings

Mike Treseler · Answer

Yes, that one's on my white board too.It would be a good use for hardware multipliersthat often left unwired.Well it couldn't do any better than log2(quotient) tickswith one multiplier. Hmmm.. could divide that time 2**m form multipliers and parallel searches...          -- Mike Treseler

Mike Treseler · Answer

Thanks for the clarifications.I might also try using "/"with pipeline registersand turn on synthesis register retiming.       -- Mike Treseler

Efficient division algorithm?

Join the Discussion

Didn't find your answer?