What's the smallest/fastest CLB/Slice/Lut based unsigned integer 64 bit by 64 bit multiplier/squarer design?

I've been looking at the various core/macro generators and they all seem horribly large and slow, almost like student designs. Has anyone seriously taken a good look at hand-fitting multipliers and squarers into Altera/Xilinx FPGAs?

Reply to
Totally_Lost

I believe Dan Bernstein has, in the context of fast factorisation machines, but I don't think he's published yet.

formatting link
is his rather disorganised Web page.

Tom

Reply to
Thomas Womack

If you haven't already, you might look at the multiplication in FPGAs page on my website at

formatting link
The partial product LUT multipliers are about as efficient as you are going to get with the FPGA LUT structure. If you can use multiple cycles, then you can combine that with sequential multiplication to make the multiplier smaller at the expense of more clock cycles per product.
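This is not Ray's actual construction, but a behavioral sketch (in Python, for clarity) of the partial-product LUT idea: small digit products come out of a lookup table, and an adder tree sums the shifted results. The 2-bit digit width and the table shape are my own choices for illustration.

```python
# Precomputed 2-bit x 2-bit digit products -- the sort of function
# a small LUT can hold directly in the FPGA fabric.
PP = [[i * j for j in range(4)] for i in range(4)]

def lut_mult(x, y, n=64):
    """Behavioral model of a partial-product LUT multiplier:
    every pair of 2-bit digits of x and y is multiplied via the
    lookup table, and the results are summed with shifts (the
    adder tree, in hardware)."""
    acc = 0
    for i in range(0, n, 2):
        for j in range(0, n, 2):
            acc += PP[(x >> i) & 3][(y >> j) & 3] << (i + j)
    return acc
```

A sequential version would process only a few digit columns per clock, reusing the same adders, which is the area-for-cycles trade mentioned above.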

Reply to
Ray Andraka

I agree that partial-product LUT multipliers are pretty much optimal for multiplying three-bit quantities. For wide multipliers, I think the relevant literature is the multi-precision arithmetic stuff by people like Knuth; you can multiply two 64-bit numbers by using three 32x32 multipliers in parallel. Relevant search terms are Karatsuba and Toom-Cook.

(Trying to multiply 2^32*A+B by 2^32*C+D: compute AC, BD, and (A+B)*(C+D), then form 2^64*AC + BD + 2^32*((A+B)*(C+D) - AC - BD). This takes:

- one 32-bit add delay, in which you get A+B and C+D
- one 33x33->66 multiply delay for (A+B)*(C+D), with AC and BD computed in parallel
- one 64-bit add delay to get AC+BD
- one 66-bit subtract delay to get (A+B)*(C+D) - AC - BD
- one 64-bit add delay to add the top 34 bits of that product to AC

And you can decompose the 33x33->66 multiply in the same kind of way; if you're clever, you can probably pipeline the wide additions.)
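As a sanity check on that decomposition, here is a one-level Karatsuba multiply as a Python behavioral model (not RTL; the function and variable names are mine):

```python
def karatsuba64(x, y):
    """One level of Karatsuba for two unsigned 64-bit operands.

    Three smaller multiplies (two 32x32 and one 33x33) replace
    one 64x64, exactly as in the decomposition above."""
    MASK32 = (1 << 32) - 1
    a, b = x >> 32, x & MASK32      # x = 2^32*a + b
    c, d = y >> 32, y & MASK32      # y = 2^32*c + d

    ac = a * c                      # 32x32 -> 64
    bd = b * d                      # 32x32 -> 64
    mid = (a + b) * (c + d)         # 33x33 -> 66

    return (ac << 64) + ((mid - ac - bd) << 32) + bd
```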

Toom-Cook lets you multiply two (A*B)-bit numbers by using (2A-1) multipliers a bit wider than B, and a lot of rather strange additional tidying-up circuitry. The idea is that multiplying numbers is a lot like multiplying polynomials, and you can reconstruct the coefficients of a polynomial of degree 2d from its values at 2d+1 places by multiplying by a constant matrix; pick the places to evaluate to make the coefficients in the matrix nice.
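For concreteness, a minimal Toom-3 sketch in Python (behavioral only; the limb base, the names, and the evaluation points 0, 1, -1, 2, infinity are my choices, and hardware would need the "strange tidying-up circuitry" for the exact divisions by 2 and 3 in the interpolation):

```python
def toom3(x, y, b=1 << 22):
    """Toom-3 multiply of two nonnegative integers of up to 3 limbs
    in base b (b = 2^22 covers 64-bit operands).  Treat each operand
    as a degree-2 polynomial in b, evaluate at t = 0, 1, -1, 2 and
    infinity, multiply pointwise, then interpolate the 5 product
    coefficients by a constant matrix."""
    x0, x1, x2 = x % b, (x // b) % b, x // (b * b)
    y0, y1, y2 = y % b, (y // b) % b, y // (b * b)

    # Pointwise products at the five evaluation points.
    r0 = x0 * y0                                    # t = 0
    r1 = (x0 + x1 + x2) * (y0 + y1 + y2)            # t = 1
    r2 = (x0 - x1 + x2) * (y0 - y1 + y2)            # t = -1
    r3 = (x0 + 2*x1 + 4*x2) * (y0 + 2*y1 + 4*y2)    # t = 2
    r4 = x2 * y2                                    # t = infinity

    # Interpolate c0..c4 (all the divisions are exact).
    c0, c4 = r0, r4
    c2 = (r1 + r2) // 2 - c0 - c4
    s = (r1 - r2) // 2                   # = c1 + c3
    t = (r3 - c0 - 4*c2 - 16*c4) // 2    # = c1 + 4*c3
    c3 = (t - s) // 3
    c1 = s - c3
    return c0 + c1*b + c2*b**2 + c3*b**3 + c4*b**4
```

Here A=3, B=22 in the (2A-1) notation above: five multipliers of slightly-wider-than-22-bit quantities, plus the irregular add/subtract/divide network.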

Tom

Reply to
Thomas Womack

Hi Ray,

I certainly took another cruise past that web site, as it's a useful resource. The logic depths to build a wide 64x64=>128 bit multiplier that way do get a bit excessive, though some pipelining helps. The last time I faced this, for a smaller 20x20=>40 bit multiplier, I did better with an odd carry-save construction using the carry chain in a non-standard way, but it took a couple of days in the FPGA editor to pack cleanly. The 64x64 isn't looking nearly as fun, so I was looking to see if someone has already been down this path.
Reply to
Totally_Lost

Thanks Tom,

I'll explore optimizations around that structural model as well, and compare with the carry-save model, optimized around the extra half adders and mux in the carry chain hardware, that worked well for the 20x20 multiplier.
Reply to
Totally_Lost

Hi Ray ... Just for reference, what would you consider an optimal number of LUTs/Slices using this approach for an 8x8=>16 and 16x16=>32 multiplier? Thanks

Reply to
Totally_Lost

Tricks like FFT-based multiplication or Toom-Cook are often not beneficial in the range up to 64 bits, because any improvement in LUTs (both area and depth) is killed by the area and delay of the irregular routing they require.

Therefore almost any hardware multiplier around is based on addition trees that sum up partial products (Booth, Wallace tree, array, sequential, etc.). All of these architectures require N*N/K 1-bit adders when multiplying N-bit numbers in K clock cycles.
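A behavioral Python model (not RTL) of the sequential flavor of that trade-off, retiring N/K multiplier bits per clock; in hardware each cycle's digit-by-N product is itself a small block of partial-product rows, giving roughly N*N/K adders overall:

```python
def sequential_mult(x, y, n=64, k=8):
    """Shift-add multiplier model: k 'clock cycles', each consuming
    n//k bits of the multiplier y and adding the corresponding
    partial product of x into the accumulator."""
    d = n // k                                # bits retired per cycle
    acc = 0
    for cycle in range(k):
        digit = (y >> (cycle * d)) & ((1 << d) - 1)
        acc += (digit * x) << (cycle * d)     # one partial-product add per cycle
    return acc
```

With k=1 this degenerates to the fully parallel array; with k=n it is the classic one-bit-per-cycle shift-and-add.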

It turns out that all these architectures can be created from each other by just swapping wires around. This means that, if you are the type that uses RLOCs, you can optimize the routing after placement has been fixed to equalize the critical path. (See our paper on that topic:

formatting link
)

A script that implements this with a greedy algorithm is not extremely difficult to write.

Kolja Sulimma

Reply to
comp.arch.fpga

Hi Kolja,

Without a doubt the product of two numbers of length M and N yields at most M*N partial products, full adders, and LUTs, by most decompositions we teach students ... or N*N = N^2 in your form. For the product of two 64-bit numbers, that yields roughly 4096 LUTs using the basic student forms taught in most EE courses. More precisely, it's (N-1) rows of adders of length N, or 4032 full adders or LUTs.

However, that is about what we expect from students and novice engineers who fail to study best-practice algorithms and to think about the problem in a bit more detail.

Karatsuba's form requires two products of (N/2) bits plus a product of ((N/2)+1) bits, plus three more adders of length (N/2)+1, plus some careful merging of the data streams. Roughly 992+992+1056+99=3139 full adders, or LUTs, which is a significant savings over N^2. A good student would produce this result without question.

An excellent student would recursively implement the two 32-bit multiplies using Karatsuba's form as well, saving roughly another 378 full adders or LUTs, for a rough total of 2761 LUTs; and probably one more level, to pick up another hundred or so full adders or LUTs, but also to better constrain the input routing and locality. One of the side effects of using Karatsuba's form is that it reduces the fan-in required for the input digits, and greatly relieves input routing over a basic student-form solution, where every input term is in a full N-bit row or column.
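The recursive form can be sketched in a few lines of Python (a behavioral model; the recursion threshold is a hypothetical stand-in for the width below which a direct LUT partial-product array wins):

```python
def karatsuba(x, y, n=64, threshold=8):
    """Recursive Karatsuba multiply of nonnegative integers.
    Below `threshold` bits, fall back to a direct multiply
    (in hardware, a small partial-product LUT array)."""
    if n <= threshold:
        return x * y
    h = n // 2
    mask = (1 << h) - 1
    a, b = x >> h, x & mask       # x = 2^h * a + b
    c, d = y >> h, y & mask       # y = 2^h * c + d
    ac = karatsuba(a, c, n - h)
    bd = karatsuba(b, d, h)
    mid = karatsuba(a + b, c + d, n - h + 1)   # the slightly wider middle product
    return (ac << (2 * h)) + ((mid - ac - bd) << h) + bd
```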

It's with this starting point, significantly better than N^2 (about 60% the size of the basic student adder tree form), that we can do even better with careful optimization and layout, using a host of salty dog tricks to implement a slightly better variation of this model.
Reply to
Totally_Lost

This has been an interesting thread. But I still wonder why you want to implement a LUT-based 64x64 multiplier?

The reason the multiplier generator core in Xilinx Coregen doesn't use a whole lot of tricks in this configuration is that it seems like a very unlikely use case, and not one worth concentrating a lot of effort on. Ever since hard multiplier/DSP blocks began appearing in FPGAs, they have been the best building block for these large multiplier structures.

Maybe you can share some details of why exactly you need such a macro?

Cheers,

-Ben-

Reply to
Ben Jones

The application requires one processed data point per clock, pipelined ... some 17 multiplies and 6 squarings, per point. Not even close to enough multiplier blocks in the XC3S2000 FPGA (just 40), but nearly enough LUTs (just under 41K).

The LUT squarers are simpler, as many terms fall out. The application's data has some specific relationships, which also allow some optimizations of the multipliers, with an estimated savings in area of about 20%. Plus a few multiplies are with scaled integers, allowing a few LSBs to be omitted in the results.
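For what it's worth, a Python sketch of why squarer terms fall out: the partial products x_i*x_j and x_j*x_i coincide, so the off-diagonal terms can be merged at double weight, roughly halving the partial-product array versus a general multiplier (the function name is mine):

```python
def square_pp(x, n=64):
    """Behavioral squarer built from its partial-product structure:
    x^2 = sum_i x_i*2^(2i)  +  2 * sum_{i<j} x_i*x_j*2^(i+j),
    i.e. one diagonal bit per input bit plus each unordered pair
    counted once at weight 2^(i+j+1)."""
    acc = 0
    for i in range(n):
        if (x >> i) & 1:
            acc += 1 << (2 * i)                  # diagonal term x_i^2 = x_i
            for j in range(i + 1, n):
                if (x >> j) & 1:
                    acc += 1 << (i + j + 1)      # merged symmetric pair
    return acc
```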

Reply to
Totally_Lost
