Xilinx Square Root Unit

Hi guys,

I was wondering if anyone here knows the technique that the Xilinx floating-point square-root core employs to get its results. I ask because we've got a unit that does the same job with near-identical latency and clock speed but uses far fewer resources (330 slices, versus Xilinx's 624 slices, at single precision).

Interestingly, my paper on this unit was rejected from FCCM on the grounds that "more recent developments in the field of square root calculation" have been made. Could anyone cast any light on what these are? The algorithm I used is 400 years old, but why should that make any difference if it provides the correct answer for less resource use?

If anyone knows of a unit that provides pipelined single-precision square-root calculations over the full range of possible inputs, I'd be interested to hear about it.

Thanks for your attention,

Robin Bruce

Reply to
Robin Bruce

Perhaps someone other than Xilinx has a better one.

No difference at all if your goal is to use a square root on an FPGA. If your goal is to research existing algorithms and find a better one, then do that.

Here's some related discussion:

formatting link

-- Mike Treseler

Reply to
Mike Treseler

I posted a few hours ago but it has yet to show, so I'll try again. Most modern square root algorithms do multiple bits per clock cycle. There was a great paper out of Thailand (I believe) a few years back with a very simple square root algorithm that does two bits per clock cycle. I was unable to locate the paper, but I did find some C code I used to test it. Using it I was able to build a fine FP square root operator using half the resources of the Xilinx core and having half the latency, for both pipelined and unpipelined versions. I actually think I could cut the latency in half again if I worked at it and used just a few more resources to do two adds in a clock cycle. If anyone recognizes the C code below, please post the source!

#include <stdio.h>
typedef unsigned long DWORD;

DWORD n = 90;                  // input, 8 bits worth: 90 should give 9r9
DWORD bl = sizeof(n) * 8;      // bitlen
DWORD hibit;                   // sign bit of the running remainder
DWORD q = 0, r = 0;            // root and remainder
int i;

int main(void)
{
    hibit = (DWORD)1 << (bl - 1);
    // non-restoring square root: one root bit per iteration,
    // consuming two bits of the input each time
    for (i = (int)(bl >> 1) - 1; i >= 0; i--) {
        r = (r << 2) | ((n >> (2 * i)) & 3);
        if (!(r & hibit))
            r -= (q << 2) | 1; // remainder non-negative: subtract 4q+1
        else
            r += (q << 2) | 3; // remainder negative: add back 4q+3
        q = (q << 1) | !(r & hibit);
    }
    if (r & hibit)             // final fix-up if the remainder ended negative
        r += (q << 1) | 1;
    printf("%lur%lu\n", q, r); // prints 9r9 for n = 90
    return 0;
}

Reply to
Brannon

Hi, I am interested in the algorithm too.

A Google search at the following address turns up 16,700 articles:

formatting link

Weng

Reply to
Weng Tianxiang

Would it be this paper?

formatting link

Good post, thanks.

Robin

Reply to
Robin Bruce

Hi Robin,

It is not the one Brannon mentioned. The paper you list was published in 1997.

Weng

Reply to
Weng Tianxiang

I found the correct paper:

formatting link

It is entitled "An FPGA Implementation of a Fixed-Point Square Root Operation". Of course it works for floating point too; you just run it on the mantissa.
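
In case it's useful, here's a rough C sketch of that reduction (my own toy code, not the paper's): pull out the mantissa, force the exponent even so it halves exactly, take the fixed-point square root of the widened mantissa, and halve the exponent. The names isqrt64 and fp_sqrt are just mine; it handles positive normal inputs only and truncates rather than rounding to nearest.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* fixed-point square root, floor(sqrt(n)): stands in for the paper's core */
static uint64_t isqrt64(uint64_t n)
{
    uint64_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {           /* two input bits per step */
        r = (r << 2) | ((n >> (2 * i)) & 3);
        uint64_t t = (q << 2) | 1;            /* trial subtrahend 4q+1   */
        if (r >= t) { r -= t; q = (q << 1) | 1; }
        else        {         q = q << 1;       }
    }
    return q;
}

static float fp_sqrt(float x)                 /* positive normals only */
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    int      k = (int)((bits >> 23) & 0xFF) - 127 - 23; /* x = m * 2^k */
    uint64_t m = (bits & 0x7FFFFFu) | (1u << 23);       /* implicit 1  */

    if (k & 1) { m <<= 1; k--; }              /* make the exponent even   */
    uint64_t root = isqrt64(m << 24);         /* ~24-bit root of mantissa */
    return ldexpf((float)root, (k - 24) / 2); /* reattach halved exponent */
}

int main(void)
{
    printf("%f %f\n", fp_sqrt(2.0f), fp_sqrt(90.0f)); /* ~1.414213 ~9.486832 */
    return 0;
}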

Reply to
Brannon

Hi Brannon,

Thank you for your help.

It is great when several people with a shared interest in an algorithm cooperate to rediscover its best published implementation and then share it with everyone.

Weng

Reply to
Weng Tianxiang

Hi Robin,

I expect that it is very similar to the algorithm your own unit is using. I don't think I can go into details though.

As you will see from the product datasheet, this core is based on IP that Xilinx licensed from QinetiQ (UK) Ltd. We have been working on improvements since we brought this in-house, but the focus has been on the more "mainstream" floating-point operators (e.g. add, multiply). So it's good to know that there is at least some interest in this square-root core too!

To cut a long story short: we know where the inefficiencies in the current implementation are, and our core should indeed be a fair bit smaller when fully pipelined than it is today. The engineering team is aware of this, so there should be a further-optimized version available in a future IP release (watch this space).

Incidentally, how come you have so many square-roots to do? We racked our brains and couldn't think of a "killer app" that would require a ~300MHz fully pipelined square-root unit. Perhaps this use case is more common than we thought?

Thanks a lot for your feedback in any case.

Cheers,

-Ben-

(@xilinx)

Reply to
Ben Jones

Hi Ben,

My guess was that it was some sort of SRT-division-related technique. I'm about a million miles away from being an expert on this, but it was my understanding that that approach is best suited to the case where a division unit and a square-root unit share resources. I might be wrong on that, though.

The technique one of my esteemed colleagues settled on was derived from Napier's bones. From what I can tell so far, it performs well in terms of clock frequency and resource use in comparison to other techniques being deployed. It has a similar latency to the core you guys have developed, but I understand now that there are other techniques out there that can reduce latency by calculating two bits of the result per cycle. I imagine that would reduce the achievable clock frequency when fully pipelined, though.
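
For anyone who hasn't seen it, the binary digit-by-digit recurrence (the base-2 cousin of the pencil-and-paper method that Napier's rods mechanise) needs only one compare-and-subtract per result bit, which is what makes it map so cheaply onto the fabric. Here's a rough C sketch of that general family (my illustration only, not our actual unit):

#include <stdio.h>

/* classic restoring digit-by-digit square root:
   one result bit per iteration, floor(sqrt(n)) */
static unsigned isqrt32(unsigned n)
{
    unsigned q = 0;
    unsigned bit = 1u << 30;      /* highest power of 4 that fits  */
    while (bit > n) bit >>= 2;    /* align with the input          */
    while (bit) {
        if (n >= q + bit) { n -= q + bit; q = (q >> 1) + bit; }
        else              {               q >>= 1;            }
        bit >>= 2;
    }
    return q;                     /* n now holds the remainder     */
}

int main(void)
{
    printf("%u\n", isqrt32(90));  /* prints 9 */
    return 0;
}

In hardware each iteration becomes one pipeline stage, so single precision costs roughly one stage per mantissa bit.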

Not to take anything away from what you guys have done, though: when you've made such an effort to allow the cores to be customisable to the degree that they are, it's perfectly understandable that they're not fully optimised in every way :)

The interest I've got in the square root core is from the perspective of wanting to provide a design environment to HPC programmers who want to use FPGAs to get serious computational speed-up. For that I need fully pipelined floating-point cores for single and double precision. These floating point cores would then be linked together to implement specific HPC functions that are maximally pipelined. Given that data sets are going to have to be big to justify the use of FPGAs, and that the functions will be pipelined, the cores' main qualities should be, in order of importance:

1) High maximum clock frequency
2) Low resource use
3) Low latency (a distant third)

Latency isn't so much of a concern: for a pipelined HPC function with a large data set, the core latencies add a fixed handful of cycles to a run of millions. For example, a 20-cycle pipeline adds just 20 cycles to the roughly 10^6 cycles needed to stream a million operands through, which makes it insignificant in a lot of cases.

Anyway, I think I've rambled on a bit. The point is to say that I don't know what the application is either just yet, but I'm taking a "Build it and they will come" style approach.

How much are you guys considering the emerging Reconfigurable HPC crowd? I would have thought that given your experience with the floating-point cores it would be right up your street.

Cheers,

Robin

(@Nallatech)

Reply to
Robin Bruce

Hi Robin,

Same here. The safest thing for me to do is to say as little as possible. :-)

We've looked at quite a few algorithms for division and square-root operations, and how they map to the FPGA fabric. Mainly our conclusion at the moment is that more exotic approaches using hard multipliers don't tend to buy you anything in area or speed terms, and can often make the latency *worse* (because of the pipelining requirements). We do endeavour to keep our eyes open, though.

Yes, there's a lot of configurability in there. The variable data rate/pipeline depth is the most complex feature (allowing you to choose between fully pipelined and fully sequential operation, or various options in between). The main issue right now is that SRL16 pipes are sometimes not being used when they should be.

Interesting that power consumption doesn't figure at all (it probably makes #17, just below "looks attractive when viewed in FPGA Editor" :-)).

There's certainly a lot going on in this arena: most of the "early adopters" of FP on FPGAs seem to be attacking just these sorts of problems. You'll know all about the Edinburgh Parallel Computing Centre's "FPGA supercomputer" undertaking, of course (since our guys talk to your guys about it on a regular basis).

Of course, now and then someone starts complaining that they're not seeing any revenue from these sorts of projects... but we try not to let that stop us! :)

Cheers,

-Ben-

Reply to
Ben Jones

Cheers Ben,

Well, it's true that power consumption isn't really keeping me up at night. I suppose it's because I'm envisaging this stuff in a workstation-style environment. No matter how much power the FPGA is consuming, I know it's still going to look good next to the 100W+ that commodity microprocessors consume.

Well, this has certainly been a model discussion for comp.arch.fpga! No flaming, no rattles being thrown out of prams and no interminable arguments about standards of English ;-)

Cheers,

Robin

Reply to
Robin Bruce

Brannon wrote: "I posted a few hours ago but it has yet to show so I'll try again. Most modern square root algorithms do multiple bits per clock cycle. There was a great paper out of Thailand (I believe) a few years back with a very simple square root algorithm that does two bits per clock cycle. I was unable to locate the paper but I did find some C code I used to test it. Using it I was able to do a fine FP square root operator using half the resources of Xilinx and having half the latency for both pipelined and unpipelined versions. I actually think I could cut the latency in half again if I worked at it and used just a few more resources to do two adds in a clock cycle. If anyone recognizes the C code below, please post the source!"

I have read the paper out of Thailand, but the credit should go to the algorithm's inventors: Y. Li and W. Chu of the University of Aizu, Japan.

Please note that the Thailand paper has a serious typo in the algorithm in Fig. 1. The subtraction step in the loop should read R = R - ((Q << 2) | 1).

Reply to
Weng Tianxiang

Hi Brannon,

Can you remember what clock rate you were getting, and on which architecture, when you implemented this algorithm?

Cheers,

Robin

Reply to
Robin Bruce

I just compiled the single-precision version of it at 150MHz on a V2 part, speed grade -4; 166MHz just barely fails. It uses 490-522 slices (275 of which contain LUTs) depending on mapper settings, not counting registers on each end. It has a pipeline length of 17 clock cycles: two bits per clock cycle plus a few for the left justification. For comparison, the double-precision version runs at the same speed and uses approximately 1940 slices. In both cases, more than twice as many registers are used as LUTs. The double-precision version has a pipeline length of 31 clock cycles.
Reply to
Brannon
