Sum of 8 numbers in FPGA

How do I most efficiently add 8 numbers in an FPGA? What is the best way to save LUTs? How does data width affect LUT consumption? Thanks in advance.

Reply to
b2508

With an adder. You haven't stated any requirements so any answer here would be OK. Consider:

- You didn't specify your latency or processing speed requirements

- You didn't specify your efficiency metric (i.e. power? LUTs? Something else?)

Using an accumulator and streaming the numbers in sequentially might use fewer LUTs.

More LUTs will be used when you increase the data width; with the usual carry-chain adders, expect roughly one LUT per result bit per adder.
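As a rough illustration of the accumulator idea, here is a minimal Python sketch of the behaviour (a software model, not HDL). It assumes unsigned n-bit inputs arriving one per clock; the function name `accumulate` is made up for this example. The register needs n + ceil(log2(count)) bits so that the running sum cannot overflow.

```python
def accumulate(samples, n):
    """Model one shared adder plus a register, fed one unsigned n-bit input per clock."""
    # n + ceil(log2(count)) bits are enough to hold the full sum.
    acc_width = n + (len(samples) - 1).bit_length()
    mask = (1 << acc_width) - 1
    acc = 0
    for s in samples:
        assert 0 <= s < (1 << n), "input does not fit in n bits"
        acc = (acc + s) & mask  # the register keeps only acc_width bits
    return acc, acc_width


if __name__ == "__main__":
    total, width = accumulate([255] * 8, n=8)
    print(total, width)
```

The trade-off is latency: one adder is reused for every input, so eight inputs take eight clocks instead of one.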

Kevin

Reply to
KJ

The most efficient adder is the carry save adder.

But the actual implementation depends on many other details, such as the timing of the availability of the numbers, and also the bit width.
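For readers who have not met the term: a carry-save adder takes three operands and produces two (a bitwise sum word and a carry word) without propagating carries, so only the final addition needs a full carry chain. A minimal Python sketch of the idea follows; it is a software model rather than HDL, and the helper names `csa` and `carry_save_sum` are invented for this example.

```python
def csa(a, b, c):
    """3:2 compressor: three operands in, two out, no carry propagation."""
    partial_sum = a ^ b ^ c                       # per-bit sum, carries ignored
    carries = ((a & b) | (a & c) | (b & c)) << 1  # per-bit carry, moved up one place
    return partial_sum, carries


def carry_save_sum(values):
    """Reduce the operands to two with carry-save stages, then do one real add."""
    operands = list(values)
    while len(operands) > 2:
        s, c = csa(operands[0], operands[1], operands[2])
        operands = operands[3:] + [s, c]
    return sum(operands)  # stands in for the single carry-propagate adder


if __name__ == "__main__":
    print(carry_save_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

For eight operands this uses six carry-save stages and one ordinary adder. Whether that beats plain adders in a given FPGA depends on the details mentioned above.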

-- glen

Reply to
glen herrmannsfeldt

This sounds like a homework problem. In an FPGA there aren't many ways to save LUTs for adders. Unless you can process your data serially, the only thing I can think of is to do the additions in a tree structure. Compared to chaining the additions one after another, the tree saves a very few LUTs because the result grows by fewer bits in the early stages, and it's also faster. (((a+b)+(c+d))+((e+f)+(g+h))) vs. ((((((a+b)+c)+d)+e)+f)+g)+h
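To put a number on "a very few LUTs", here is a small Python sketch that totals the result bits of all seven adders in each arrangement. It assumes unsigned inputs, each adder sized to the minimum width its result needs, and an input count that is a power of two for the tree; result bits are only a proxy for LUTs, and the function names are made up for this example.

```python
def sum_width(k, n):
    """Bits needed to hold the sum of k unsigned n-bit numbers."""
    return (k * ((1 << n) - 1)).bit_length()


def chain_bits(count, n):
    """((a+b)+c)+...: the adder that adds the k-th input produces a k-input sum."""
    return sum(sum_width(k, n) for k in range(2, count + 1))


def tree_bits(count, n):
    """Balanced binary tree; count is assumed to be a power of two."""
    total, group = 0, 2
    while group <= count:
        total += (count // group) * sum_width(group, n)
        group *= 2
    return total


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, chain_bits(8, n), tree_bits(8, n))
```

Under these assumptions the tree comes out six result bits smaller than the chain for eight inputs at the widths printed, which matches the "very few" above; the larger gain is the shorter logic depth.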

--

Rick
Reply to
rickman

At the risk of doing someone else's homework:

The latest parts from Xilinx and Altera will add three numbers at a time using a single carry chain.

A: Doing serial arithmetic using block RAM to hold inputs & outputs.

B: Using DSP adders in place of LUT carry chains.

Jim

Reply to
jim.brakefield

Yes, but even so, leaving lots of unknowns.

If you have 8 n-bit inputs and need the sum as fast as possible, there aren't a huge number of choices, though it does depend a little on n.

In this case, there are two choices. You can process the data bit serial, or word serial. (Or, I suppose somewhere in between.)

Choosing one of those would depend on how the data is supplied and, again, how fast you need the result. Also, is there only one set of eight, or many?

If you just chain adders, the usual tools will optimize them.

But you might also want some registers in there, too.

Also, this could be a lab homework problem, where the student is supposed to try things out and see what happens.

-- glen

Reply to
glen herrmannsfeldt

Why not try it out? Run one of the tool chains and see what happens when you build the adder in different ways, and then, if it's not what you expect, come and ask on here. The tool chains will show you what the LUT usage is. I was a tad surprised to find that when I coded:-

byteout ---------------------------------------

Define efficiency. Almost all efficiency is a trade off between space and performance.

In this modern world of optimising tool chains, why not just put them all in one expression and let the tool chain work out what is best for the chip?

A classical trade-off of speed, as it's now serial, for gates used. If you do it serially then you may need to do 7 separate serial additions, which will need more LUTs for the carry latches.

Assuming your chip has one?

Just my two cents/pence/yuan... And Jim, Nothing personal, your comments seemed a suitable place to hang my hat....

Dave

Reply to
David Wade

Yes, the optimizers can likely figure that one out.

Some years ago, I needed a 36 bit population count. That is, how many '1' bits there are in a 36 bit word.

The usual way to make one is with carry save adders, so I built one up, I think first for 8 bits, and then combined those.

It was a little unusual, since I needed to know 0, 1, 2, 3, more than 3.

It wasn't hard to make, but it turns out that if you just say:

p=x[0]+x[1]+x[2]+x[3]+ ... x[35];

it works just about as well. It might be that I had to pipeline it also, but it still would have been easier to write.
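For reference, the function being described can be modelled in a few lines of Python. This is only a behavioural sketch of "0, 1, 2, 3, more than 3" for a 36-bit word, not the original design; the name `popcount_sat` and the choice of 4 to encode "more than 3" are assumptions made for this example.

```python
def popcount_sat(word, width=36, limit=4):
    """Count the '1' bits in word, saturating at limit (4 stands for 'more than 3')."""
    count = 0
    for i in range(width):
        bit = (word >> i) & 1
        count = min(count + bit, limit)  # stop growing once 'more than 3' is reached
    return count


if __name__ == "__main__":
    print(popcount_sat(0b1011))
    print(popcount_sat((1 << 36) - 1))
```

Because the count saturates, the result fits in 3 bits instead of the 6 a full 36-bit population count would need.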

(snip)

You mean ones with 6 input LUTs? I haven't looked at those much yet.

(snip)

My favorite test of the optimizer is when I make a tiny mistake, which turns out to cause some signal to never change, and the optimizer optimizes out all the logic! Nothing at all left!

-- glen

Reply to
glen herrmannsfeldt

6LUTs are a favorite of mine:

- One 4-to-1 mux

- or two 2-to-1 muxes

- 2-to-1 mux and an add/subtract

IMHO their reason for being is that they reduce the number of logic levels. Routing delay is now larger than logic delay, so reducing logic levels is a big speed win, more so than the greater logic capability.

The ALUT/ALM is somewhat different and more complicated. Not currently using it, but does appear to have overall characteristics similar to the 6LUT.

Jim

Reply to
jim.brakefield

I expect it to be most efficient to use 8 adders in parallel when the incoming data does not always fully occupy its vector width, since the compiler might discover unused bits and shorten the carry chain lengths appropriately.

To meet timing, I always add FFs behind the adders and use register balancing and retiming, giving the compiler all options for optimization.

Reply to
carstenherr
