modulo 2**32-1 arith

I

Ilya Kalistru 10 years ago

Hello. I need to add two unsigned numbers modulo 2**32-1. Now it's done in very inefficient way: at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that. Does anybody know a better way to do that?

Vote

G

GaborSzakacs 10 years ago

Sounds like you want an "end around carry." That means that the carry out of a 32-bit full adder wraps back to its carry in. In that way a 32-bit full adder would do what you want in one cycle.

Gabor

Vote

G

GaborSzakacs 10 years ago

Hmmm... On second thought, that would not handle the case where the inputs add up to exactly 2**32-1, since there would be no carry out, but you would want to add one to end up with zero.

Gabor

Vote

R

rickman 10 years ago

Not sure how efficient your implementation would be this way. You can achieve the same result by adding one to the sum and adding the high bit of this result to the original sum to get your answer. This should give a simpler result because of the comparison your approach uses requires a full adder rather than the incrementer of this approach.

signal sum, temp, mod_result : unsigned (32 downto 0); signal answer : unsigned (31 downto 0);

sum

Vote

K

KJ 10 years ago

le addition of two 32-bit unsigned numbers with 33-bit result and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that.

Maybe you should revisit the need for 'modulo 2**32-1' instead of 'modulo 2

**32'. Synthesis results of your method and Rickman's method indicate that the modulo portion is consuming significantly more logic than the addition itself. Given the code posted below, here are the synthesis results using Quartus to target a Cyclone IV GX:

Method# Logic Elements Notes ======= ============== == ===

0 32 Sum

Vote

K

KJ 10 years ago

On Tuesday, December 15, 2015 at 3:38:57 PM UTC-5, KJ wrote: OOPS, correction to the Method3 code: Original: C

Vote

R

Rob Gaddi 10 years ago

Vote

R

rickman 10 years ago

I'm not clear on how "Ilya's method" uses only 76 LEs. I assume an LE is a 4 input LUT and/or a register. I count at least 131. Producing Sum uses 33, producing Sum_P1 uses another 33, evaluating (Sum >= MAX) uses

33 then there are 32 used in the mux to select the result. How can that reduce to 76 LEs?

That would reduce to the same complexity as my method as implemented above. My approach can be optimized by producing temp (Y1) from the two inputs. Then in parallel adding bit 32 into the sum of the two inputs that produces Y. Take Y modulo 2^n as the final step using 66 LEs. Like this...

signal Y: unsigned(31 downto 0); signal Y, Y1, A, B: unsigned(32 downto 0); Y1

Vote

K

KJ 10 years ago

The target device family was stated.

It reduces because synthesis does not work how you how you have described i t above.

Your updated algorithm is actually the same as your first. There are also a few problems with your code that indicate that you did not bother to comp ile your design to produce actual results so your conclusions are simply sp eculating (and they are incorrect).

- You've declared 'Y' twice, once to be a 32 bit vector, the other 33 bits. Minor thing, easily fixed

- The calculation of Y1 is not correct and will result in Y1(32) always bei ng 0 which then affects the computation of Y in the next line. This error functionally changes the output to simply be the 32 bit sum of the inputs m od 2^32, not mod 2^32-1 as requested.

- The code you posted with the fix to remove the declaration of Y as a 33 b it number results in 32 logic cell usage not 66 as you speculated. However this number is not really meaningful because it is not computing what the OP wanted due to the second error. When that second error is corrected as shown in the code posted below, it result in the same 97 logic cells as you r originally posted method.

Bottom line is that what you have outlined for your method takes more logic simply because there is an additional adder required for your method that is not needed with Ilya's method.

You and Rob are also not correct in thinking that computing Sum+1 as A+B+1 will save anything. In the updated code I've posted at the end, I've added another generic control called SP1_METHOD which is used to control how 'Su m + 1' gets computed. Sum + 1 can now be computed as: Sum_P1

Vote

K

KJ 10 years ago

Could be right, at least there is some possible context now to the question . Thanks.

See my earlier post today. Quartus will implement 'Y1=A+B+1' exactly the same as 'Y1=Y+1'.

Interesting idea, but I don't think it saves anything over Ilya's method. Ilya's method as originally specified (i.e. spread out over two clock cycle s) takes essentially 2 logic cells per bit (but it takes ~2.4 per bit for s ingle clock cycle). When it comes time to periodically update the overflow s that you mentioned, you will end up with another adder and consume anothe r 32 logic cells (one per bit).

Kevin Jennings

Vote

T

thomas.entner99 10 years ago

Am Dienstag, 15. Dezember 2015 11:32:14 UTC+1 schrieb Ilya Kalistru:

Which logic family? What is "better"? Less ressources? Faster? Do you need a result every cycle?

If you can do with a result every other cycle, you could try following:

- Make an 32b adder with carry in and carry out

- First cycle: Add the two numbers with carry.

- Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result.

This should be below 40 LEs. But I have not tried it out...

Regards,

Thomas

formatting link

- Home of EEBlaster

Vote

R

Rob Gaddi 10 years ago

Y0=A+B is a 32-bit adder with CIN=0. Y1=A+B+1 is a second 32-bit adder with CIN=1. Choosing between them is a 32-bit 2:1 mux. So my rough math puts my answer at 96 LEs and 32 flops. But the worst case propagation path is from the LSB, through the 32-bit Y1 carry chain, to the mux select, to the output, which should be pretty screamingly fast.

Whereas Ilya's (and rickman's as well, if I'm reading it right) use the carry-out of the A+B+1 add as the carry in to the "real" add. That gives you the total prop delay of a 64-bit carry chain, plus some slop.

The OP, who seems to have vanished off into the ether, never specified what "efficiency" he was trying to optimize on, or whether pipeline delays were acceptable, etc.

My above comment again. If I'm right about the checksum, then Ilya's method (at least on a reading) gives you an operating duty cycle of

50%, you can put new data no faster than every other clock because you're waiting for the result of that "add one more decision" to percolate back around. Leave out the pipeline registers and your fmax gets ugly.

If you just carry the excess and cook it periodically, you get a "data accept" duty cycle of well past 99%. There may be some cleverer way to get to a 100% duty cycle pipeline, but with only one cup of coffee in me I don't see it.

Rob Gaddi, Highland Technology -- www.highlandtechnology.com Email address domain is currently out of order. See above to fix.

Vote

R

rickman 10 years ago

The tool was not using the carry in. Here is a version that uses the carry in. 66 LEs in one clock cycle. It can be reduced to about 40 LEs if done in two clock cycles.

BTW, did you test any of the designs you synthesized? How do you know they actually work?

library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all; use std.textio.all;

ENTITY FPGA IS GENERIC ( WIDTH : POSITIVE := 32 ); port( -- Board Clock A, B : in unsigned(WIDTH - 1 downto 0); Y : out unsigned(WIDTH - 1 downto 0) ); END FPGA;

ARCHITECTURE behavior OF FPGA IS

signal Y1 : unsigned(WIDTH downto 0); signal temp : unsigned(WIDTH downto 0);

BEGIN

Y1

Vote

K

KJ 10 years ago

On Wednesday, December 16, 2015 at 12:12:01 PM UTC-5, rickman wrote:

Here is the fitter summary showing 98 logic cells for this design that you posted. Presumably you're using a different tool or different target devic e to come up with 66 so that number only becomes relevant if you use that s ame tool and target device to synthesize all of the other variants.

+--------------------------------------------------------------------------

-------+ ; Fitter Summary ;

+------------------------------------+-------------------------------------

-------+ ; Fitter Status ; Successful - Wed Dec 16 12:31:58 201

5 ; ; Quartus II 64-Bit Version ; 13.1.0 Build 162 10/23/2013 SJ Web E dition ; ; Revision Name ; Junk ; ; Top-level Entity Name ; FPGA ; ; Family ; Cyclone IV GX ; ; Device ; EP4CGX22CF19C6 ; ; Timing Models ; Final ; ; Total logic elements ; 98 / 21,280 ( < 1 % ) ; ; Total combinational functions ; 98 / 21,280 ( < 1 % ) ; ; Dedicated logic registers ; 0 / 21,280 ( 0 % ) ; ; Total registers ; 0 ; ; Total pins ; 96 / 167 ( 57 % ) ; ; Total virtual pins ; 0 ; ; Total memory bits ; 0 / 774,144 ( 0 % ) ; ; Embedded Multiplier 9-bit elements ; 0 / 80 ( 0 % ) ; ; Total GXB Receiver Channel PCS ; 0 / 4 ( 0 % ) ; ; Total GXB Receiver Channel PMA ; 0 / 4 ( 0 % ) ; ; Total GXB Transmitter Channel PCS ; 0 / 4 ( 0 % ) ; ; Total GXB Transmitter Channel PMA ; 0 / 4 ( 0 % ) ; ; Total PLLs ; 0 / 4 ( 0 % ) ; +------------------------------------+-------------------------------------

-------+

Kevin Jennings

Vote

K

KJ 10 years ago

On Wednesday, December 16, 2015 at 9:23:49 AM UTC-5, snipped-for-privacy@gmail.com

I have yours coming in at 36 if I coded up what you said correctly, not tested. You're in the lead now with Ilya in second.

Kevin Jennings

==== START OF CODE ==== library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all;

entity Custom_Adder is generic( SP1_METHOD: integer range 0 to 1; -- Controls how 'Sum + 1' is calculated METHOD: integer range 0 to 7); port( Clock: in std_ulogic; A: in unsigned(31 downto 0); B: in unsigned(31 downto 0); C: out unsigned(31 downto 0); Valid: out std_ulogic ); end Custom_Adder;

-- Results:

-- Logic Elements

-- Method# SP1=0 SP1=1 Notes

-- 0 32 32 C

Vote

R

rickman 10 years ago

I don't know what to tell you. I used Lattice Diamond 3.3. The design is simple, two 33 bit adders, 1 LUT per bit. There is nothing special about either the tool or the device (standard 4 input LUTs with carry support for adders). I think your Quartus tool is broken. Try looking at the actual logic produced. BTW, doesn't the Cyclone IV have 6 input LUTs? What is the Altera tool doing that is so inefficient?

Rick

Vote

K

KJ 10 years ago

Correction to Method 7: Posted: Carry_In

Vote

K

KJ 10 years ago

Ilya's method (Method=6 in the code I posted) should be computing one out put every clock cycle. The output will just have a one clock cycle latency compared with the pure combinatorial logic version (Method=2). Thomas' method (Method=7) though will only accept one input every other clock cyc le but consumes roughly half the logic resource.

Kevin Jennings

Vote

R

rickman 10 years ago

Vote

I

Ilya Kalistru 10 years ago

I thank you all for this very interesting and useful discussion. There is a lot to think about. But I should apologize: later on it turns out that the definition of "modulo 2**32-1" which is used in the algorithm description other than I had thought. It is weird from the mathematical point of view: A+B if A+B=2**32 I manged to meet timings by using Sum(32) to decide what sum I should use.

Sum

Vote

modulo 2**32-1 arith

Join the Discussion

Didn't find your answer?