There is a lot of square operations in my FPGA projects (using Xilinx VitexE). Now, I complete them by multiple, but this method is slice consuming and slow. As we know, square operation has many characters, which multiple operation of any random numberic don' have, then could we utilize these features to calculate square operations more economical and more fast? Thanks for all advice. Best Regards
Now, I complete them by multiple, but this method is slice consuming and slow. As we know, square operation has many characters, which multiple operation of any random numberic don' have, then could we utilize these features to calculate square operations more economical and more fast? Thanks for all advice. Best Regards
You really need to give us more information. Are the squares sequential (eg squares of an arithmetic progression of values)? How many clock cycles per square are available? How many bits? Do you have BRAM available? Do you have area restrictions? etc. As with many other problems, there are many ways to approach this.
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com
"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
Sorry for not giving sufficient info. The squares works parallel, that is, one clock cycle per square. Bits width of Operand are about 10~20. VirtexE do have BRAM, I use Virtex600E, and the remaining maximum capacity for square operation is 128 Kb. I hope square of a 20bit number can be completed by less than 30 VirtexE slices, and work above 70MHz. I still wonder, for a multiplier written as: use IEEE.std_logic_arith; ... C
Generally modern synthesize tools generate a very high quality result for basic operatinos like add, multiply. Both Xilinx and Altera have already highly optimized macros for these operations (remember LPMs?) and usually they are hand optimized and very efficient. The synthesize tool selects from a library of these pre-made macros for these operations. Also the output generated for C
Thank you for your reply. I made a test, two instance, one calculate A*B, the other calculate A*A, oprand is 20 bit. I found they are almost the same result. After map, both occupy 215 slices (VirtexE). And only small difference in the time delay. ISE translate the square operation to multiplier.
I think there was a post a year or two ago about doing an efficient square operation by splitting the number in half such that a N bit square could be done with some N bit lookups and some additions instead of a 2N table lookup. Such that a 10 bit square only required a 10 bit lookup table instead of 20 bit. I think it went something like this:
Take a 10 bit number to be squared and represent it as a+32b where a and b are 5 bit numbers. (a+32b)(a+32b) = a^2 + 64ab+1024b^2. Now you have to do 3 of the smaller table lookup mpys and some shifting and adding to get the result.
ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here.
All logos and trade names are the property of their respective owners.