high number of multipliers / low cost

Hello, I need the highest possible number of multiplication operations per second at low cost. I know that several factors affect the overall performance, but since I have no idea which FPGA chips might be worth considering, I'd like to ask which chip you think has the lowest ratio

R = (price of chip) * (delay time) / (number of multipliers)

18x18-bit multipliers seem to be quite common, so let's assume this design for the estimate.

For example, for the Spartan XC3S1000 (~60$, 24 multipliers, 4 ns delay) I get R = 10$ per (billion multiplications/s). The Cyclone EP2C70 (~230$, 150 multipliers, 4 ns delay) has R = 6.13$ per (billion multiplications/s).
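(Just to make the units explicit, here is the working for the Spartan figure; the EP2C70 number follows the same way:

R = \frac{60\,\$ \times 4\,\mathrm{ns}}{24} = 10\,\$\cdot\mathrm{ns} = 10\,\$ \text{ per } (10^{9}\ \mathrm{mult/s})

)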

Do other FPGAs exist that are perhaps specialized for multiplication-intensive tasks and therefore much cheaper?

Best regards, Ryan

Reply to
ryan_usenet

Look at the Cyclone III EP3C family that was just released.

Reply to
Rob

Wow, FPGA marketing departments are going to be swarming all over you like it's Christmas!

Wait, are you sure you're real...? ;-)

-Ben-

Reply to
Ben Jones

Look at the Xilinx Virtex-5 SXT series.

The V5 SX95T, for example, has 640 multipliers running at 450 MHz. I don't know its cost, but even if it's 750$ (and it's probably less), that would make R = 2.6$ per (billion multiplications/s).

You can also look at the Spartan-3A DSP series. The 3SD3400A costs around 60$ and can do 30 billion multiplications per second, so that would make R = 2.
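(For anyone checking the arithmetic behind those two figures:

R_{SX95T} \approx \frac{750\,\$}{640 \times 450\times 10^{6}\ \mathrm{mult/s}} \approx 2.6\,\$ \text{ per } (10^{9}\ \mathrm{mult/s})

R_{3SD3400A} \approx \frac{60\,\$}{30 \times 10^{9}\ \mathrm{mult/s}} = 2\,\$ \text{ per } (10^{9}\ \mathrm{mult/s})

)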

Sylvain

Reply to
Sylvain Munaut

Thank you for your replies so far.

It is no problem if the chip is ~1000$, because the aim is some 10^12 multiplications/s. The simplistic estimate needs 3 V5 SX95T chips or 27 EP2C70 chips. For an easier design, a lower number of chips is preferred.
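(Spelling out the arithmetic behind those chip counts, treating 450 MHz and 250 MHz (4 ns) as the usable multiplier clocks:

n_{SX95T} \approx \frac{10^{12}}{640 \times 450\times 10^{6}} \approx 3.5

n_{EP2C70} \approx \frac{10^{12}}{150 \times 250\times 10^{6}} \approx 27

so call it 3-4 SX95T devices depending on how hard the multipliers can really be driven.)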

Is there some more comprehensive overview of FPGA prices, especially for the larger devices, than what I find at

formatting link

Although your suggestions have already been useful, I'd appreciate more hints, perhaps concerning exotic manufacturers that I have never heard of. Or are Xilinx and Altera definitely the only choices?

Best regards, Ryan

(Why do you doubt this? Because the question sounds stupid? I don't pretend that I am close to production of the planned FPGA board, but I need an idea of what will be possible at what costs.)

Reply to
ryan_usenet

Hi Ryan, Don't forget FPGAs have other resources apart from 'hard' multipliers. Check this page on Mr. Andraka's website about distributed arithmetic. You can make a _lot_ of multipliers out of the ordinary fabric of FPGAs. HTH, Syms.

formatting link

Reply to
Symon

Ryan,

Your metric is extremely simple. Perhaps too simple? What is it that you wish to do? "1E12 multiplies per second" by itself is a bit too simplistic.

You need to consider the 'care and feeding' of this 'monster'.

What is the resolution required? 18x18? Or 9x9? Or really 25x18?

Is it all multiplies, and no accumulates? Hard to imagine a problem where no addition whatsoever is required. Multipliers alone may not suffice. What about accumulators? What number of bits? The DSP48 blocks in the Xilinx architectures are intended to cover the most common needs. The DSP48 in V5 also has the traditional 4-bit control (16-function) ALU as part of the DSP block. Very wide AND, OR, XOR, etc. are all provided.
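To make that concrete, here is a rough VHDL sketch (entity name, widths and the clear port are only illustrative, not a quote of any Xilinx template) of the kind of multiply-accumulate that synthesis tools will usually map onto a single DSP48 slice:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mac18 is
  port (
    clk : in  std_logic;
    clr : in  std_logic;                        -- synchronous clear of the accumulator
    a   : in  signed(17 downto 0);
    b   : in  signed(17 downto 0);
    acc : out signed(47 downto 0)
  );
end entity;

architecture rtl of mac18 is
  signal prod    : signed(35 downto 0) := (others => '0');
  signal acc_reg : signed(47 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      prod <= a * b;                            -- 18x18 multiply, one pipeline stage
      if clr = '1' then
        acc_reg <= (others => '0');
      else
        acc_reg <= acc_reg + resize(prod, 48);  -- 48-bit accumulate, as in the DSP48
      end if;
    end if;
  end process;
  acc <= acc_reg;
end architecture;

One new product is folded into the 48-bit accumulator every clock; the extra accumulator width is what gives the "room for growth" before anything overflows.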

You should also consider how much SRAM is on chip. If you have insufficient RAM, you cannot "feed your monster."

Similarly, IO is required to feed the RAM and to get the results out. You need to consider whether the IO is a bottleneck (limits your performance).

Austin

Reply to
Austin Lesea

Well, part of the other logic is of course needed for accumulation, registers, etc. If huge amounts of logic are free, my plan is to use them as additional 'software' multipliers.

However, I have no idea how efficient these multipliers built from logic blocks can be. As far as I know, you have the choice of implementing either parallel or sequential 'software' multipliers. The parallel architecture has the advantage that it can calculate one product per clock, but it requires lots of logic, so not many of them could be used. The sequential architecture needs about 1/nth of the logic space, but finishes after about n clocks, so I don't expect either of these solutions to contribute significantly to the dedicated 'hardware' multipliers' performance.

If the sequential multiplier could be used in 'pipeline' operation (meaning the result is delayed by n clocks, but then a new result is produced in each cycle), this would be interesting, but I guess it cannot?

Thanks Ryan

Reply to
ryan_usenet

Another chip to consider: the Lattice ECP2(M) series.

The Lattice approach was to stuff the inexpensive chips with DSP resources, since many massively parallel algorithms can't afford the super-performance chips. Their sysDSP blocks can be configured as 8x9-bit, 4x18-bit or 1x36-bit multipliers per block. A MAC structure is included for FIR-style applications. For 18-bit multipliers, an 88-multiplier solution has a "marketing" price around $35 (ECP2-70). I think your 18-bit R value is about $1.60, but I have a little difficulty figuring out the multiplier speeds.
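(As a back-of-envelope check, and purely my own inference from those numbers rather than a Lattice datasheet figure, the $1.60 value would imply a multiplier clock of roughly:

f \approx \frac{35\,\$}{88 \times 1.6\,\$ / (10^{9}\ \mathrm{mult/s})} \approx 250\ \mathrm{MHz}

)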

The families with the most multipliers at reasonable cost - Altera Cyclone III, Xilinx Spartan-3A DSP, Lattice ECP2 - are all rather new. The costs might be higher in the near term than the marketing announcements suggest. You really need to get conversations going with the sales reps from these three companies so that they can work the numbers toward your goal. Once they understand your pipeline needs, they should be able to give you a true attainable frequency and a price or price range that would let you compare against your economies of scale.

Something else that might interest you is the FPOA from MathStar. Their "Field Programmable Object Array" has MACs, ALUs, and register files as the distributed elements, capable of 1 GHz speeds according to their literature. Since this product is far outside my technical needs, I haven't delved too far into it, but your application might be one of the few FPGA-style designs that can seriously leverage this not-so-mainstream technology.

Reply to
John_H

You would have to define "lots". Also, you might calculate one product per clock with a fabric-based multiplier, but it will be a pretty long clock cycle.

No, but the "parallel" multiplier architecture could. Indeed, it must be, or your clock speed would be ridiculously slow.

It would be interesting to know what you think "a multiplier" is - i.e. what word-length, and are you talking about fixed-point or floating-point operations?

-Ben-

Reply to
Ben Jones

Definitely too simple. But I need a first guess at what might be possible, not the final answer about the ideal FPGA. Therefore I don't want to discuss all the details, only the multiplication rate, which is one of the limiting factors and which will narrow down the choice. As for "Is it all multiplies, and no accumulates?": I do need accumulators.

I wasn't asking you to imagine the problem, but whether you know chips that can do high-speed multiplication at low cost.

Yes, I guess the 640 DSP48 slices of the XC5VSX95T are the multipliers that Sylvain mentioned. I don't mind if the FPGA is filled with additional (then unused) logic, unless it uses too much power. So you're welcome to suggest XtremeDSP chips even if their multipliers sit inside DSP slices.

With 1000 18x18 multipliers I won't need more than 8x1000x18 = 144 kbit of RAM.

It is not, because the computationally intensive part is localized on each chip.

Ryan

Reply to
ryan_usenet

Simulations show that 36-bit fixed-point is sufficient. However, for later improvements we might change to floating-point calculation. Do FPGAs with floating-point multipliers/adders exist, or how can one estimate the (emulated) floating-point performance given the fixed-point performance?

Ryan

Reply to
ryan_usenet

Well, it's tricky and your mileage may vary a great deal, but you might want to take a look at the Xilinx Floating-Point Operator core datasheet. This will tell you how big and how fast the various FP operations will be, depending on your desired wordlength.

Multiplication in floating-point is not really too much of an overhead relative to fixed-point, but addition certainly is much, much bigger and slower.

Good luck,

-Ben-

Reply to
Ben Jones

Google for Karatsuba; you can do it with just 3 ;)
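That is, split each 36-bit operand into 18-bit halves; the Karatsuba rearrangement then needs only three 18x18 products instead of four (the middle product is really 19x19, so it needs one spare bit of multiplier width):

x = x_1 2^{18} + x_0, \qquad y = y_1 2^{18} + y_0

xy = x_1 y_1\, 2^{36} + \left[(x_1 + x_0)(y_1 + y_0) - x_1 y_1 - x_0 y_0\right] 2^{18} + x_0 y_0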

Reply to
Sylvain Munaut

If you want to infer pipelined multipliers with ISE, the basic coding template is simple and is the same for both hardware and fabric multipliers:

process(clk) begin if (rising_edge(clk)) then multpipe0 <= a * b; end if; end process;
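Filled out into a self-contained unit (entity and signal names here are just illustrative; the multpipe0/multpipe1 registers follow the same idea as the snippet above, and whether the multiply lands in a hard MULT18X18/DSP48 or in fabric is steered by synthesis attributes and constraints), a pipelined version might look like:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pipe_mult is
  port (
    clk : in  std_logic;
    a   : in  signed(17 downto 0);
    b   : in  signed(17 downto 0);
    p   : out signed(35 downto 0)
  );
end entity;

architecture rtl of pipe_mult is
  signal a_r, b_r  : signed(17 downto 0) := (others => '0');
  signal multpipe0 : signed(35 downto 0) := (others => '0');
  signal multpipe1 : signed(35 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      a_r       <= a;              -- register the operands
      b_r       <= b;
      multpipe0 <= a_r * b_r;      -- the multiply itself
      multpipe1 <= multpipe0;      -- extra output register lets the tools pipeline/retime
    end if;
  end process;
  p <= multpipe1;
end architecture;

One new product comes out every clock, three clocks after its operands went in, which is exactly the "delayed by n clocks but one result per cycle" behaviour asked about earlier in the thread.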

Reply to
Daniel S.

You can also time-share the multipliers. The DSP48 elements in Virtex-4 can be clocked at 400 MHz in the slow speed grade part. With careful design, the fabric can also support 400 MHz (well, except for the carry logic, which is hard pressed at 400 MHz for anything but simple counters). My gigasample floating-point FFT design runs on a 400 MHz clock in a V4SX55. That design is highlighted in this month's Xilinx DSP magazine:

formatting link
and I have the floorplan for that design on my website at
formatting link

Reply to
Ray Andraka

What is your application? There is usually more than one way to approach a problem. For example, distributed arithmetic is an elegant solution for handling a sum of products with constant or nearly constant coefficients, and it does a great job of compacting the area required for fabric multipliers. The precipitation radar design in my website gallery (formatting link) does about 82 multiply-accumulates per clock cycle at 133 MHz, which works out to over 10 billion multiplies/sec in an FPGA that has zero hardware multipliers (XCV1000) and is rather small and slow compared to current FPGA offerings. In order to find a possible alternative approach, you need to dissect your algorithm to see if there are other constructs that might get you the performance you want in a reasonable amount of hardware.
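(For the record, the identity behind distributed arithmetic: with constant coefficients c_k and B-bit inputs x_k (unsigned here; signed inputs need a sign correction on the top bit), the sum of products is reorganized bit-plane by bit-plane, so each plane only needs a small LUT addressed by one bit from every input plus a scaling accumulator, and no multipliers at all:

y = \sum_{k} c_k x_k = \sum_{b=0}^{B-1} 2^{b} \sum_{k} c_k\, x_k[b]

where the inner sum is precomputed in a lookup table indexed by the bit vector (x_1[b], ..., x_K[b]).)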
Reply to
Ray Andraka

No, there are currently no FPGAs with built-in floating-point support. However, with care, floating point can be done at speeds similar to fixed point at the price of additional logic. See my article in this month's Xilinx DSP magazine on a floating-point FFT (formatting link). Floating point does not have to be done on every elemental operation; instead it can be done after a series of operations, or around a more complex fixed-point operation, without losing any precision, provided the fixed-point kernel surrounded by the floating-point extensions has enough room for growth to prevent overflow. Taking that approach can greatly reduce the amount of hardware needed to support floating point. The expensive things in floating point are the normalize and denormalize functions. Those can be done at DSP48 speeds by using the DSP48 multipliers as the shifters.
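(The multiplier-as-shifter trick is simply that a left shift by n places is a multiplication by a power of two:

x \ll n = x \times 2^{n}

so the data goes to one multiplier port and a decoded one-hot value 2^n, derived from the shift amount, goes to the other, turning a DSP48 into a single-cycle barrel shifter for the normalize/denormalize step.)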

Reply to
Ray Andraka

Thanks to all for your very informative replies. Since I only have experience with one rather small FPGA application, every bit of information, like the Lattice family, the DSP48 slices, and Ray's publications, promises to be very useful, and I'll evaluate your suggestions in greater detail than I could at first glance.

I hope you understand that I am reluctant to describe the project more precisely. I'll start with a downscaled solution, out of curiosity and as a proof of concept, as a kind of low-budget hobby project, but I don't want to give away the chance of a profitable project later. Of course, since I've given only very little information, I can't expect you to tell me the perfect solution. But I am already very happy with your answers.

Thank you Ryan

Reply to
ryan_usenet
