algorithm: square operation

There are a lot of square operations in my FPGA projects (using Xilinx VirtexE). Currently I implement them with a multiplier, but this method is slice-consuming and slow. As we know, the square operation has many properties that the multiplication of two arbitrary numbers does not have. Could we exploit these properties to compute squares more economically and faster? Thanks for any advice. Best regards

Reply to
fwj_733

There may be shortcuts if you need a series of squares, for example. You can compute that as

sq = 0; for (i = 1; i < n; ++i) { sq = sq + (i
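A minimal sketch of one such shortcut, assuming the intended idea is the standard identity i^2 = (i-1)^2 + (2i - 1), so that each new square costs only additions (the truncated loop above may have differed in detail). The function name `fill_squares` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill out[0..n-1] with 0^2, 1^2, ..., (n-1)^2 using additions only.
 * Consecutive squares differ by consecutive odd numbers:
 * (i+1)^2 - i^2 = 2*i + 1. */
void fill_squares(uint64_t *out, size_t n)
{
    uint64_t sq = 0;   /* current square, starts at 0^2 */
    uint64_t odd = 1;  /* next difference: 1^2 - 0^2 */

    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        out[i] = sq;
        sq += odd;     /* step from i^2 to (i+1)^2 */
        odd += 2;      /* the difference grows by 2 each step */
    }
}
```

In hardware this would correspond to two registers and two adders per stream of sequential squares, which only helps when the operands really do arrive in arithmetic progression.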

Reply to
Ben Jackson


You really need to give us more information. Are the squares sequential (e.g., squares of an arithmetic progression of values)? How many clock cycles per square are available? How many bits? Do you have BRAM available? Do you have area restrictions? As with many other problems, there are many ways to approach this.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
Reply to
Ray Andraka

Sorry for not giving sufficient info. The squares are computed in parallel, that is, one clock cycle per square. The operand bit widths are about 10 to 20. VirtexE does have BRAM; I use a Virtex600E, and the maximum remaining capacity for the square operation is 128 Kb. I hope the square of a 20-bit number can be completed in fewer than 30 VirtexE slices and run above 70 MHz. I still wonder, for a multiplier written as: use IEEE.std_logic_arith; ... C

Reply to
fwj_733

Generally, modern synthesis tools generate very high-quality results for basic operations like add and multiply. Both Xilinx and Altera already have highly optimized macros for these operations (remember LPMs?); they are usually hand-optimized and very efficient. The synthesis tool selects from a library of these pre-made macros for these operations. Also the output generated for C

Reply to
Arash Salarian

Thank you for your reply. I made a test with two instances, one calculating A*B and the other calculating A*A, with 20-bit operands. I found the results are almost the same. After map, both occupy 215 slices (VirtexE), with only a small difference in delay. ISE translates the square operation into a multiplier.

Reply to
fwj_733

I think there was a post a year or two ago about doing an efficient square operation by splitting the number in half, so that an N-bit square could be done with some N-bit lookups and some additions instead of a 2N-bit table lookup. For example, a 10-bit square would only require a 10-bit lookup table instead of a 20-bit one. I think it went something like this:

Take a 10-bit number to be squared and represent it as a + 32b, where a and b are 5-bit numbers. Then (a + 32b)(a + 32b) = a^2 + 64ab + 1024b^2. Now you have to do three of the smaller table-lookup multiplies and some shifting and adding to get the result.
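As a rough software model of that decomposition (a sketch only, not the original poster's design), the C below squares a 10-bit value with one 5-bit-addressed square table used twice and one 10-bit-addressed product table for the cross term. The names `sq5`, `mul5`, `init_tables` and `square10` are hypothetical, and how the tables map onto BRAM or LUTs is left open.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative lookup tables (names are assumptions, not from the thread). */
static uint16_t sq5[32];       /* sq5[v] = v*v, 5-bit address          */
static uint16_t mul5[32][32];  /* mul5[a][b] = a*b, 10-bit address     */
static int tables_ready = 0;

/* Fill the tables once; in hardware these would be ROM contents. */
static void init_tables(void)
{
    for (uint32_t a = 0; a < 32u; ++a) {
        sq5[a] = (uint16_t)(a * a);
        for (uint32_t b = 0; b < 32u; ++b)
            mul5[a][b] = (uint16_t)(a * b);
    }
    tables_ready = 1;
}

/* Square a 10-bit value x = a + 32*b as a^2 + 64*a*b + 1024*b^2. */
uint32_t square10(uint32_t x)
{
    assert(x < 1024u);
    if (!tables_ready)
        init_tables();

    uint32_t a = x & 31u;  /* low 5 bits  */
    uint32_t b = x >> 5;   /* high 5 bits */

    /* The multiplications by 64 and 1024 are plain shifts. */
    return (uint32_t)sq5[a]
         + ((uint32_t)mul5[a][b] << 6)
         + ((uint32_t)sq5[b] << 10);
}
```

For example, `square10(33)` splits into a = 1, b = 1 and sums 1 + 64 + 1024. The cross-term table is the expensive part; the two square terms share one small table.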

Reply to
Jeff Cunningham
