Fast 28x28 multiplier + adder in Virtex4

Hi, we are using a Virtex4 FPGA to prototype a DSP processor that will be implemented in an ASIC. We are using the ISE flow and everything works fine, except that we can't prototype at full speed. We can only run at about 65MHz, which is far from the 150MHz target. The longest combinational path is in the MAC, which contains a 28x28 multiplier followed by a 56x56 adder. I created the multiplier and the adder using Core Generator.

Is there a way to speed this up? The Virtex4 has those XtremeDSP slices, but I can't find a way to make good use of them, since our datapath is so wide.

Thank you, David

Reply to
gretzteam

Virtex4 has 18x18 multiplier hardware. Your 28x28 may be built from several of them, but you need to pipeline it, and also add a pipeline stage before the adder. My guess is that this gets you to 150MHz, but you will have to try it to find out.
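
To illustrate how a wide multiply can be assembled from narrower multipliers, here is a minimal Python sketch of the arithmetic only. The 17-bit split is an assumption (an 18x18 signed multiplier can take 17-bit unsigned operands), and the names `split` and `mul28` are hypothetical. It does not model what Core Generator actually produces.

```python
# Arithmetic sketch: a 28x28 unsigned multiply built from four partial
# products, each small enough for one 18x18 signed multiplier.
# CHUNK = 17 is an assumption made for this illustration.
CHUNK = 17
MASK = (1 << CHUNK) - 1


def split(x):
    """Return (low 17 bits, remaining high bits) of an unsigned value."""
    return x & MASK, x >> CHUNK


def mul28(a, b):
    """Multiply two 28-bit unsigned values using four narrow products."""
    a_lo, a_hi = split(a)
    b_lo, b_hi = split(b)
    # Four partial products; in hardware each could map to one multiplier.
    p_ll = a_lo * b_lo
    p_lh = a_lo * b_hi
    p_hl = a_hi * b_lo
    p_hh = a_hi * b_hi
    # Shift each partial product into place and sum them.
    return p_ll + ((p_lh + p_hl) << CHUNK) + (p_hh << (2 * CHUNK))
```

In hardware, the partial products and the additions that combine them are natural places to put pipeline registers.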

-- glen

Reply to
glen herrmannsfeldt

If you use the XtremeDSP slices properly, with all of their dedicated interconnects, you should be able to do a 34x34 multiply using 4 pipelined slices at full rate (450-500MHz, depending upon part speed). You might need an extra two slices to do the 56-bit accumulate. Look for the "XtremeDSP Design Considerations" guide on the Xilinx site; it describes how to do this. I'm not sure exactly what CoreGen is producing, but it might not be completely optimized. It might be using CLB fabric for some of the operations.

-Kevin

Reply to
Kevin Neilson

Right now I'm not using anything fancy. I created a 28x28 multiplier and a 56x56 adder with CoreGen and wired them together. I used the multiplier component, which is supposed to use the XtremeDSP slices. Maybe it is not smart enough to make use of the other dedicated interconnects. I will look at this "XtremeDSP Design Considerations" guide.

Thank you, David

Reply to
gretzteam

Pipelining is the magic word (Coregen calls it registered inputs and outputs)

Regards Falk

Reply to
Falk Brunner

I can't really use pipelining here. The MAC is all combinational; I receive inputs at time 0, and I need an answer by time x. I don't see how pipelining would help.

Thanks, Dave

Reply to
gretzteam

What is x?

If x is one clock cycle then you need either faster logic or a lot more of it. I believe this can be done easily with a three-cycle pipeline, so that you get an answer out every cycle, with each one taking three cycles.

-- glen

Reply to
glen herrmannsfeldt

Hi, I guess I don't understand something about pipelining. In my case, the whole system runs at the master clock, which I would like to be 100MHz or more. Right now, the whole MAC unit is combinational logic and needs to produce an answer for each clock cycle (time x = 1/100MHz). Are you guys saying that if I ran the MAC at 3 times the master clock (300MHz) with a three-stage pipeline, I could compute the answer fast enough?

Thanks, David

Reply to
David

Howdy David,

Using different terms, let's try another analogy on this Saturday: imagine an automobile assembly line. It puts out a certain number of cars per hour. If you add another step in the assembly process, you can still get the same number of cars per hour out - it just takes a little longer for it to roll off the assembly line. Circuits work the same way.

If your main requirement is to be able to handle a certain number of calculations per second, you can possibly break the calculations up into smaller parts which are easier to do in series: rather than doing a multiply and an accumulate in the same cycle, do the multiply in one cycle, and the addition in the next cycle. While the accumulation is occurring during this 2nd cycle, the 2nd piece of data is being multiplied. On the 3rd cycle, the 2nd piece of data is now in the accumulator and a 3rd piece of data enters the multiplier. You get the same number of calculations per second out of the circuit (or perhaps even more, since you can meet timing now!), but each result takes 20 ns rather than 10 ns to come out. If you can't stand the extra delay, then you may need to up the clock rate (and then you will sure enough have to pipeline!).
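
The cycle-by-cycle schedule described above can be traced with a small Python sketch. It only tracks which data item occupies which stage on each clock; the function name `pipeline_trace` and the stage labels are hypothetical, chosen for this illustration.

```python
def pipeline_trace(items, stage_names=("multiply", "accumulate")):
    """Return one (cycle, occupancy) row per clock cycle.

    occupancy[i] is the item currently in stage i, or None if the
    stage is empty on that cycle.
    """
    depth = len(stage_names)
    stages = [None] * depth
    feed = list(items)
    rows = []
    # Run long enough for the last item to leave the final stage.
    for cycle in range(len(feed) + depth - 1):
        new_item = feed[cycle] if cycle < len(feed) else None
        # Every item advances one stage; a new item enters stage 0.
        stages = [new_item] + stages[:-1]
        rows.append((cycle + 1, tuple(stages)))
    return rows


for cycle, occupancy in pipeline_trace(["d1", "d2", "d3"]):
    print(cycle, occupancy)
```

Each item spends two cycles in the pipeline, yet after the first cycle one item finishes on every clock, which is the latency versus throughput trade-off described above.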

Hope that helps,

Marc

Reply to
Marc Randolph

Hi, I understand what you mean. However, I don't think it works in my case because I have a loop (it is a MAC). In order to start the next calculation, I need the answer to the previous one. I guess the only solution is faster logic. I thought that a Virtex4 would be able to give us that kind of calculation speed...

Dave

Reply to
gretzteam

Unless the result from the accumulator goes back as an input to the multiplier, it should pipeline just fine. Using the built-in multipliers, it should take two or three stages. The answers will come out one per clock cycle, two or three clocks later.
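
As a rough check of this point, the following Python sketch models a MAC whose feedback loop contains only the accumulator, with the multiplier pipelined in front of it. The names `mac_combinational` and `mac_pipelined` and the `mult_stages` parameter are hypothetical; this is a behavioural model under those assumptions, not a description of the DSP48 slice.

```python
from collections import deque


def mac_combinational(pairs):
    """Reference: multiply and accumulate in the same cycle."""
    acc = 0
    for a, b in pairs:
        acc += a * b
    return acc


def mac_pipelined(pairs, mult_stages=2):
    """Cycle-level model with a pipelined multiplier.

    Only the accumulator feeds back on itself, so the multiplier can be
    split into mult_stages register stages. Returns (result, cycles).
    """
    pipe = deque([0] * mult_stages)  # multiplier registers, initially zero
    acc = 0
    cycles = 0
    # Extra zero inputs flush the last real products out of the pipeline.
    inputs = list(pairs) + [(0, 0)] * mult_stages
    for a, b in inputs:
        pipe.append(a * b)       # new product enters the pipeline
        acc += pipe.popleft()    # oldest product reaches the accumulator
        cycles += 1
    return acc, cycles


pairs = [(3, 4), (5, 6), (7, 8)]
print(mac_combinational(pairs), mac_pipelined(pairs))
```

The pipelined model reaches the same sum as the combinational one and simply needs `mult_stages` extra cycles at the end, because no product depends on the accumulator value.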

-- glen

Reply to
glen herrmannsfeldt
