Floating point multiplication on Spartan3 device

Hi, All.

These days I'm focusing on building a design for multiplying matrices whose elements are floating-point numbers.

What I want to do is -

M(0,0) M(0,1) M(0,2)
M(1,0) M(1,1) M(1,2)
M(2,0) M(2,1) M(2,2)

multiplied by the vector

V(0,0)
V(1,0)
V(2,0)

and when I get the result below,

R(0,0)
R(1,0)
R(2,0)

I want to divide R(0,0) and R(1,0) by R(2,0) for normalization.
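As a plain software reference for the computation described above (the function name and example values are illustrative, not from the post), here is a minimal Python sketch of the 3x3 matrix-vector multiply followed by the normalization step:

```python
# 3x3 matrix times 3x1 vector, then normalize the first two result
# components by the third -- 9 multiplies, 6 adds, 2 divides total.

def mat_vec_normalize(M, V):
    # R[i] = sum over j of M[i][j] * V[j]
    R = [sum(M[i][j] * V[j] for j in range(3)) for i in range(3)]
    # Normalization: divide R(0,0) and R(1,0) by R(2,0)
    return (R[0] / R[2], R[1] / R[2])

M = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 4.0]]
V = [3.0, 5.0, 2.0]
print(mat_vec_normalize(M, V))  # -> (0.375, 1.25)
```

This is the behavior any hardware implementation (floating or fixed point) would have to reproduce.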

Are this multiplication and division possible on a Spartan-3 device with VHDL? Or should I convert each floating-point number to an integer and do all the calculation that way?

I'm sure any advice will help me a lot. Thanks.

Reply to
codejk

codejk,

How fast and precise does it have to be?

You could put a MicroBlaze on there with the single-precision floating-point instructions enabled and do it in software.

Stephen

Reply to
Stephen Craven

Thank you for your interest. I want to get a result every 80 ns (pipelining can be used). Sorry, but although I have worked with Xilinx devices for several years, I have no experience with MicroBlaze. So could you please recommend some materials about MicroBlaze?

Reply to
codejk

codejk,

What kind of floating-point number is it? IEEE754? Single or double precision?

Are those the actual dimensions of your matrices?

It looks to me like you have 9 multiplications, 6 additions and 2 divisions to do every 80ns. I work this out to be 212.5MFLOPS.
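The throughput figure above can be checked with a one-line calculation (a sketch of the arithmetic, nothing more):

```python
# 9 multiplies + 6 adds + 2 divides = 17 FLOPs per result,
# one result required every 80 ns.
ops = 9 + 6 + 2                  # 17 FLOPs per result
period_ns = 80.0                 # result period in nanoseconds
mflops = ops * 1e3 / period_ns   # (FLOPs per ns) * 1e3 = MFLOPS
print(mflops)  # -> 212.5
```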

According to:

formatting link

The best performance you're going to get from an FPU-enabled MicroBlaze running on a V4 is 33 MFLOPS (clocking at 200 MHz). I work out about 22-23 MFLOPS being the most you would get for your particular instruction mix. Reduce this again for the lower performance of the Spartan-3, and you're looking at being a long way off from what you need.

So an FPU-enabled MicroBlaze is a good suggestion, but when you factor in the 'result every 80 ns' constraint it's unfortunately not viable.

All is not lost though, even with Spartan devices these days you can get well into the GFLOPS in terms of floating-point performance. To do it you need to take advantage of the vast real-estate and data-throughput capabilities of the devices.

What I would suggest is floating-point cores: pipelined, with multiple instantiations. With these you could easily obtain the performance you need.

You would have to pay for the IP, but once you had it you could very easily achieve your design with minimum headaches.

Of course, if you don't have the cash to fork out on this, then you'll have no choice but to look at another way of doing it, fixed point for example...

Tell us how you get on,

Robin

Reply to
Robin Bruce

Hi codejk,

A library of floating-point IP blocks is available in the latest release of Coregen. If you have the ISE tools (not web-pack), go take a look. A fixed-point implementation may be smaller and simpler, but you might need bigger bit-widths to cope with the dynamic range.

What you specified (3x3 matrix times 3x1 vector, then normalize the result) might be possible, but it depends on your processing requirements. You would need at least one FP multiplier, one FP adder and one FP divider. You said in another post that you needed a result every 80 ns, but that pipelining was not a problem (i.e. latency doesn't hurt, I guess).

You have 9 multiplies, 6 adds and two divides to do. The adder and multiplier hardware could easily be time-multiplexed between those operations (~9 ns cycle time for the multiplier is easily enough). Size-wise, the FP matrix-multiply portion of your design would need 4 MULT18x18s and fewer than 500 slices (assuming single precision).
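A rough feasibility check of time-multiplexing a single pipelined FP multiplier over the 9 products (the 125 MHz clock is an assumed figure for illustration, not a number from this post):

```python
# One pipelined multiplier issues one product per clock. At an
# assumed 125 MHz (8 ns period), 9 products fit inside the 80 ns
# budget; latency is hidden by pipelining.
clock_ns = 8.0                   # assumed 125 MHz multiplier clock
issue_cycles = 9                 # one product issued per clock
issue_time_ns = issue_cycles * clock_ns
budget_ns = 80.0                 # one result required every 80 ns
print(issue_time_ns, issue_time_ns <= budget_ns)  # -> 72.0 True
```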

The divide is the killer. A fully-parallel FP divider will currently set you back around 3000 slices. So add on your sequencing logic (state machine) to schedule all this compute, plus some buffering registers, and you're looking at ~5000 slices running at (say) 120-150MHz. So probably the very smallest S3 device would be too small. Depends what else you need in there, and how you're interfacing it to the rest of your system.

Incidentally, what is your application? 3-D graphics, at a guess?

Cheers,

-Ben-

Reply to
Ben Jones

I apologize that my question was not clear. Each element of the matrix is just a single-precision real number. Both floating-point and fixed-point implementations are allowed. Since this is only a small project, I cannot pay for the IP, so now I'm trying to use some of the related Coregen libraries. Thanks a lot for your advice, Robin. Any further guidance would be a great help. Thanks again.

Reply to
codejk

Hi Ben, thank you for your advice.

Now I'll try to use Coregen IPs in ISE 6.1. I cannot use the latest release of ISE (7.1.03, I mean) because it seems to have some bugs. (I'm not sure about that; maybe my circuit has some bugs too.)

And... my application is related to 2D graphics. Specifically, it's about the self-configuration of an image sensor using several images, but right now I have too many obstacles :) Thank you so much.

Reply to
codejk

It sure is possible, and you've got plenty of real estate if you are clever about the design. Floating-point multiplies don't require much more than a fixed-point multiply, so that isn't going to overload the device. The floating-point multiplies, assuming normalized inputs, only require an adder for the exponents and a conditional one-bit shift after the multiplier to renormalize.

The adds are a little more complicated because they require a denormalize, a fixed-point add, and a renormalize. For the adds in a 3x3 matrix multiply, you've essentially got a 3-input adder, so all three of the multiplier outputs have to be denormalized to the same weight. This is for all intents a conversion to fixed point where the common part of the exponent is stripped off and retained for later. You just look at the individual products, shift each right by the difference between its exponent and the largest exponent among the row products, and then add them together. The shift can be envisioned as a 24x24 multiply (assuming single precision here; you may very likely need less) of the previous multiplier product by 2^(24-shift). The denormalized row products are added together with a pair of fixed-point adders (you might want a few extra LSBs), paying attention to the signs, and the sum is then left-shifted to eliminate leading zeros. The left shift is easiest with a layered barrel shift because you get the leading-one detection as part of the shifting.

You can use the embedded multipliers and run them at 125 MHz easily, in which case you get 10 clocks per matrix multiply, pipelined. That means you can use 4 embedded multipliers to make up a 24x24 multiply, and then perform the 9 multiplications sequentially. A second set of 4 embedded multipliers, or a barrel shifter made up of about 125 LUTs, will do the denormalize on the products. You'll need a delay queue (made up of SRL16s) between the multiplier and the denormalizer to be able to detect the maximum exponent in each row and then derive the shift distance. The matrix multiply using floating point as described does not take up much room.
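The denormalize/add/renormalize scheme for the row sums can be modeled in software. This is a minimal sketch, assuming values are held as (sign, mantissa, exponent) triples with 24-bit normalized mantissas in [2^23, 2^24); the representation and function name are illustrative:

```python
# Model of the row-sum scheme: align each product to the largest
# exponent in the row (right shift), add as fixed point, then
# left-shift the sum to eliminate leading zeros (renormalize).

def fp_row_sum(products):
    # products: list of (sign, mantissa, exponent) triples
    emax = max(e for (_, _, e) in products)
    total = 0
    for s, m, e in products:
        # Denormalize: right-shift by the exponent difference
        # (the 24x24 multiply by 2^(24-shift) trick in hardware).
        mag = m >> (emax - e)
        total += -mag if s else mag
    # Renormalize the signed sum.
    sign = total < 0
    total = abs(total)
    while total >= (1 << 24):   # sum overflowed one bit
        total >>= 1
        emax += 1
    while total and total < (1 << 23):  # eliminate leading zeros
        total <<= 1
        emax -= 1
    return (sign, total, emax)

# 1.0 + 1.0 + 2.0 = 4.0, i.e. mantissa 2^23 with exponent 2:
print(fp_row_sum([(False, 1 << 23, 0),
                  (False, 1 << 23, 0),
                  (False, 1 << 23, 1)]))  # -> (False, 8388608, 2)
```

In hardware the two while loops collapse into the layered barrel shifter with built-in leading-one detection that the post describes.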

The divide is actually easier as floating point, especially if you don't need a lot of precision (you may not, since it is the last operation). You've got 10 clocks for each divide, so for full single precision you only need to compute about 3 new bits per clock. Having the divider's inputs normalized gives the divider a smaller range to work over.
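The "3 new bits per clock" rate corresponds to a radix-8 digit recurrence. A software sketch of that idea on normalized mantissas (integer model; the function name and bit widths are illustrative, and real hardware would use a restoring/non-restoring recurrence rather than Python's divmod):

```python
# Radix-8 digit-recurrence divide on normalized 24-bit mantissas.
# Each loop body retires 3 quotient bits, i.e. one clock in the
# scheme above: 24 fraction bits in 8 clocks.

def divide_normalized(n, d, bits=24):
    # n, d: integers in [2^23, 2^24) (normalized mantissas),
    # so the quotient n/d lies in (0.5, 2).
    # Returns (int_bit, frac): quotient = int_bit + frac / 2**bits.
    int_bit, rem = divmod(n, d)       # leading quotient bit (0 or 1)
    frac = 0
    for _ in range(0, bits, 3):
        rem <<= 3
        digit, rem = divmod(rem, d)   # one radix-8 digit, 0..7
        frac = (frac << 3) | digit
    return int_bit, frac

print(divide_normalized(1 << 23, 1 << 23))   # 1.0 / 1.0 -> (1, 0)
print(divide_normalized(12582912, 8388608))  # 1.5 / 1.0 -> (1, 8388608)
```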

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com  
http://www.andraka.com  

 "They that give up essential liberty to obtain a little 
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759
Reply to
Ray Andraka
