# Hardware floating point?

• posted

So, just doing a brief search, it looks like Altera is touting a floating point slice in at least one of their lines.

Is this really a thing, or are they wrapping some more familiar fixed-point processing with IP to make it floating point?

And, anything else you know.

TIA.

--
Tim Wescott
Wescott Design Services
• posted

I'm not sure what you are asking. What do you think floating point is, exactly? The core of floating point is just fixed point arithmetic with an extra bit (uh, rereading this, I need to make clear this is the British "bit" meaning part :) to express the exponent of a binary multiplier. To add or subtract floating point numbers, the mantissas need to be aligned, meaning the bits must be lined up so that corresponding bits have equal weight. This requires adjusting one of the exponents so the two are equal while shifting that mantissa to match. Then the addition can be done on the mantissas and the result normalized so the msb of the mantissa is in the correct alignment.

Multiplication is actually easier in that no alignment is required beforehand; the exponents are added and the result is adjusted for correct alignment of the mantissa.

So the heart of a floating point operation is a fixed point ALU with barrel shifters before and after.
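As a rough software model of the align/add/normalize sequence described above, here is a minimal sketch. The 8-bit mantissa, the `(mant, exp)` tuple layout and the function names are assumptions made for illustration; it handles unsigned values only and truncates instead of rounding. The `while` loops stand in for what a barrel shifter would do in a single step.

```python
MANT_BITS = 8  # toy mantissa width, chosen only for illustration


def normalize(mant, exp):
    """Post-shifter: put the mantissa MSB at bit MANT_BITS-1 (truncating)."""
    if mant == 0:
        return 0, 0
    while mant >= (1 << MANT_BITS):        # carry out of the adder: shift right
        mant >>= 1
        exp += 1
    while mant < (1 << (MANT_BITS - 1)):   # leading zeros: shift left
        mant <<= 1
        exp -= 1
    return mant, exp


def fp_add(a, b):
    """Add two unsigned toy floats given as (mant, exp), value = mant * 2**exp."""
    (ma, ea), (mb, eb) = a, b
    if ma == 0:
        return normalize(mb, eb)
    if mb == 0:
        return normalize(ma, ea)
    if ea < eb:                            # make `a` the operand with the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb >>= (ea - eb)                       # pre-shifter: align b to a's exponent
    return normalize(ma + mb, ea)          # fixed-point add, then renormalize


def to_float(x):
    mant, exp = x
    return mant * 2.0 ** exp


# 1.5 is (192, -7) and 0.25 is (128, -9) in this toy format
print(to_float(fp_add((192, -7), (128, -9))))
```

The only non-integer part of the model is `to_float`, which exists purely to print the result.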

--

Rick C
• posted

I think you oversimplify FP. It works a lot better with dedicated hardware.

• posted

Not sure what your point is. The principles are the same in software or hardware. I was describing hardware I have worked on: the ST-100 from Star Technologies. I became very intimate with its inner workings.

The only complications are from the various error and special case handling of the IEEE-754 format. I doubt the FPGA is implementing that, but possibly. The basics are still the same. Adds use a barrel shifter to denormalize the mantissa so the exponents are equal, an integer adder, and a normalization barrel shifter to produce the result. Multiplies use a multiplier for the mantissas and an adder for the exponents (with adjustment for exponent bias) followed by a simple shifter to normalize the result.
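The multiply path can be sketched the same way. This is a simplified model, not an IEEE-754 implementation: it assumes single-precision-like fields (bias 127, 24-bit significand with the hidden bit made explicit), ignores sign, zero, infinities, NaNs and exponent overflow, and truncates instead of rounding. The names `fp_mul` and `to_float` are made up for the example.

```python
BIAS = 127       # exponent bias, as in IEEE-754 single precision
FRAC_BITS = 23   # fraction bits; the significand carries one extra hidden bit


def fp_mul(a, b):
    """a, b: (biased_exp, significand) with significand in [2**23, 2**24)."""
    (ea, ma), (eb, mb) = a, b
    exp = ea + eb - BIAS                   # both inputs carry the bias; remove one copy
    prod = ma * mb                         # integer multiplier: up to 48-bit product
    if prod >> (2 * FRAC_BITS + 1):        # product is >= 2.0: shift by one position
        prod >>= 1
        exp += 1
    return exp, prod >> FRAC_BITS          # truncate back to a 24-bit significand


def to_float(x):
    exp, sig = x
    return (sig / 2 ** FRAC_BITS) * 2.0 ** (exp - BIAS)


one_and_a_half = (127, 0xC00000)           # 1.5 * 2**0
print(to_float(fp_mul(one_and_a_half, one_and_a_half)))
```

Because both significands lie in [1, 2), the product lies in [1, 4), so the normalizing shift is at most one position. That is why a simple shifter is enough here, where the adder needs a full barrel shifter.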

Both add and multiply are about the same level of complexity, as a barrel shifter is almost as much logic as the multiplier.

Other than the special case handling of IEEE-754, what do you think I am missing?

--

Rick C
• posted

Altera claims it IS IEEE-754 compliant, but it is surprisingly hard to find any more detailed facts. And we all know how FPGA marketing works, so a bit of doubt is very understandable...

The best I could find is this:

In short: It appears that infinities and NaNs are supported, however sub-normals are treated as 0 and only one rounding-mode is supported...
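To make "sub-normals are treated as 0" concrete, here is a small sketch of flush-to-zero on single-precision bit patterns. It only models the behaviour as summarized above and is not taken from any vendor documentation; the function name is an assumption.

```python
import struct


def flush_subnormal_f32(x):
    """Return x as a float32 value, with subnormals replaced by a signed zero."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0 and fraction != 0:    # subnormal: exponent field 0, fraction non-zero
        bits &= 0x80000000                 # keep only the sign bit
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(flush_subnormal_f32(1e-40))          # below the smallest normal float32
print(flush_subnormal_f32(1.5))            # normal values pass through
print(flush_subnormal_f32(float("inf")))   # infinities are left alone
```

Infinities and NaNs have an all-ones exponent field, so they pass through this check untouched.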

Somewhere there is a video which shows that using the floating-point DSPs cuts the LE-usage by about 90%, so if you need floating point, I think Arria/Stratix 10 are really the best way to go...

Regards,

Thomas

- Home of EEBlaster and JPEG-Codec

• posted

That video may be for the *entire* floating point unit in the fabric. Most FPGAs have dedicated integer multipliers which can be used for both the multiplier and the barrel shifters in a floating point ALU. The adders and random logic would need to be in the fabric, but will be *much* smaller.
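The idea of using an integer multiplier as a barrel shifter can be sketched as follows: multiply by a one-hot value and take either the low or the high half of the product. The 16-bit width and the function names are assumptions for illustration. In hardware the one-hot operand would come from a small decoder; `1 << k` merely models that decoder here.

```python
WIDTH = 16  # assumed operand width, for illustration only


def mul_shift_left(x, k):
    """Left shift by k: multiply by 2**k and keep the low WIDTH bits."""
    assert 0 <= x < (1 << WIDTH) and 0 <= k < WIDTH
    product = x * (1 << k)                 # one-hot second operand
    return product & ((1 << WIDTH) - 1)    # low half of the product


def mul_shift_right(x, k):
    """Right shift by k: multiply by 2**(WIDTH-k) and keep the high half."""
    assert 0 <= x < (1 << WIDTH) and 0 <= k < WIDTH
    product = x * (1 << (WIDTH - k))       # k = 0 needs one extra operand bit (or a bypass)
    return product >> WIDTH                # high half of the product


print(hex(mul_shift_left(0x00B5, 4)), hex(mul_shift_right(0xB500, 4)))
```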
--

Rick C
• posted

Xilinx and Altera both support "DSP blocks" that do a multiply and add (they say multiply and accumulate, but it's more versatile than that).

According to the above paper, Altera has paired up their DSP blocks and added logic to each pair so that they become a floating-point arithmetic block. Personally I think that for most "regular" DSP uses you're going to know the range of the incoming data and will, therefore, only need fixed-point -- but it looks like they're chasing the "FPGA as a supercomputer" market (hence, the purchase by Intel), and for that you need floating point just as a selling point.

--
Tim Wescott
Control systems, embedded software and circuit design
• posted

If you look around I think you will find many uses for floating point in the DSP market. It's not just a selling gimmick. I don't think the many floating point DSP devices are sold because they look good in the product's spec sheet.

Heck, back in the day when DSP was done on mainframes the hot rods of computing were all floating point. Cray-1, ST-100...

--

Rick C
• posted

I am attempting to design a 40-bit single and 80-bit double hardware-expressed form of an n-bit floating point "unum" (universal number) engine, as per the design by John Gustafson:

I intend an FPU, and 4x vector FPU for SIMD:

In my Arxoda CPU (design still in progress):

Thank you, Rick C. Hodgin

• posted

> If you look around I think you will find many uses for floating point in the DSP market.

There was a rule of thumb in voice compression that floating point DSP took a third fewer operations than fixed point DSP. Plus probably faster code development, not having to keep track of the scaling.
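The "keeping track of the scaling" burden can be illustrated with a fixed-point multiply. This sketch assumes the common Q15 format (15 fractional bits), which the post does not name; the helper names are likewise made up.

```python
Q = 15  # assumed Q15 format: values are stored as integer * 2**-15


def to_q15(x):
    return int(round(x * (1 << Q)))


def q15_mul(a, b):
    # The raw product carries 30 fractional bits; the programmer has to
    # remember to shift it back to 15 after every multiply.
    return (a * b) >> Q


def q15_to_float(a):
    return a / (1 << Q)


fixed = q15_to_float(q15_mul(to_q15(0.5), to_q15(0.25)))
floating = 0.5 * 0.25        # the exponent field tracks the scale automatically
print(fixed, floating)
```

Forgetting the `>> Q`, or applying it at the wrong point in a chain of operations, is the kind of bookkeeping a floating point unit removes.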

Jim Brakefield

• posted

It just all works better with dedicated hardware. Finding the leading one for normalization is somewhat slow in the FPGA and is something that benefits from dedicated hardware. Using a DSP48 (if we're talking about Xilinx) for a barrel shifter is fairly fast, but requires 3 cycles of latency, can only shift up to 18 bits, and is overkill for the task. You're using a full multiplier as a shifter; a dedicated shifter would be smaller and faster. All this stuff adds latency. When I pull up CoreGen and ask for the basic FP adder, I get something that uses only 2 DSP48s but has 12 cycles of latency. And there is a lot of fabric routing so timing is not very deterministic.

• posted

I'm not sure how much you know about multipliers and shifters. Multipliers are not magical. Multiplexers *are* big. A multiplier has N stages with a one bit adder at every bit position. A barrel multiplexer has nearly as many bit positions (you typically don't need all the possible outputs), but uses a bit less logic at each position. Each bit position still needs a full 4 input LUT. Not tons of difference in complexity.

The multipliers I've seen have selectable latency down to 1 clock. Rolling a barrel shifter will generate many layers of logic that will need to be pipelined as well to reach high speeds, likely many more layers for the same speeds.

What do you get if you design a floating point adder in the fabric? I can only imagine it will be *much* larger and slower.

--

Rick C
• posted

A 32-bit barrel shifter can be made with 5 steps, each step being a set of 32 two-input multiplexers. Dedicated hardware for that will be /much/ smaller and more efficient than using LUTs or a full multiplier.

Normalisation of FP results also requires a "find first 1" operation. Again, dedicated hardware is going to be a lot smaller and more efficient than using LUTs.

So a DSP block that has dedicated FP support is going to be smaller and faster than using integer DSP blocks with LUTs to do the same job.
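For reference, one way to model the "find first 1" step is a five-stage binary search that shifts the word left as it goes, so the normalised mantissa and the shift count fall out together. This is a sketch of one possible structure, not a description of any particular FPGA block; the function name is an assumption.

```python
def normalize32(x):
    """Return (x shifted so its leading 1 is at bit 31, number of leading zeros)."""
    assert 0 <= x < (1 << 32)
    if x == 0:
        return 0, 32
    count = 0
    for step in (16, 8, 4, 2, 1):          # one stage per bit of the shift count
        if (x >> (32 - step)) == 0:        # are the top `step` bits all zero?
            x = (x << step) & 0xFFFFFFFF   # yes: shift them out
            count += step
    return x, count


print(normalize32(0x00010000))
```

Each stage is a zero-detect on the top bits plus a row of 2-input multiplexers, which is the same shape as one stage of the 5-step barrel shifter described above.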

• posted

If I understand, you can do a barrel shifter with log2(n) complexity, hence your 5 steps, but you will have the combinational delays of 5 muxes, which could limit your maximum clock frequency. A brute force approach will use more resources but will probably allow a higher clock frequency.

• posted

The "brute force" method would be 1 layer of 32 32-input multiplexers. And how do you implement a 32-input multiplexer in gates? You basically have 5 layers of 2-input multiplexers.

If the depth of the multiplexer is high enough, you might use tri-state gates but I suspect that in this case you'd implement it with normal logic.
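The point about a 32-input multiplexer decomposing into five layers of 2-input multiplexers can be checked with a small model. The function names are assumptions for illustration.

```python
def mux2(sel, a, b):
    """2-input multiplexer: a when sel is 0, b when sel is 1."""
    return b if sel else a


def mux32(inputs, sel):
    """32-input multiplexer built as a tree of 2-input multiplexers."""
    assert len(inputs) == 32 and 0 <= sel < 32
    layer = list(inputs)
    for bit in range(5):                   # one layer per select bit
        s = (sel >> bit) & 1
        layer = [mux2(s, layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]                        # layers hold 16, 8, 4, 2, 1 muxes


print(mux32(list(range(100, 132)), 7))
```

The tree uses 16 + 8 + 4 + 2 + 1 = 31 two-input multiplexers per output bit, so a brute-force 32-bit shifter built this way needs 32 × 31 = 992 of them and still has five levels of delay, against 5 × 32 = 160 for the staged shifter.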

• posted

Yes, I stand corrected. Still, it is hardly a "waste" of multipliers to use them for multiplexers.

Find first 1 can be done using a carry chain which is quite fast. It is the same function as used in Gray code operations.
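The post does not spell out how the carry chain is used. One common trick, shown here purely as an assumption about what is meant, is that `x AND (-x)` isolates the lowest set bit, with the `+1` of the two's-complement negation rippling through an adder's carry chain. Finding the highest set bit then only needs the input and output bits reversed, which is free wiring in hardware.

```python
def isolate_lowest_one(x, width=32):
    """One-hot mask of the lowest set bit, via two's-complement negation."""
    mask = (1 << width) - 1
    return x & (-x & mask)                 # the negation's +1 is the carry chain


def reverse_bits(x, width=32):
    """Bit reversal; in hardware this is just wiring, not logic."""
    return int(format(x, f"0{width}b")[::-1], 2)


def isolate_highest_one(x, width=32):
    """One-hot mask of the highest set bit, using the same carry-chain trick."""
    reversed_x = reverse_bits(x, width)
    return reverse_bits(isolate_lowest_one(reversed_x, width), width)


print(hex(isolate_lowest_one(0xF0)), hex(isolate_highest_one(0xF0)))
```

The result is a one-hot mask, so a floating point normaliser would still need to encode it into a shift count.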

Who said it wouldn't be? I say exactly that below. My point was just that floating point isn't too hard to wrap your head around and not so horribly different from fixed point. You just need to stick a few functions onto a fixed point multiplier/adder.

I was responding to:

"Is this really a thing, or are they wrapping some more familiar fixed- point processing with IP to make it floating point?"

The difference between fixed and floating point operations is a few functions beyond the basic integer operations, which we have been discussing. Floating point is not magic or incredibly hard to do. It has not been included on FPGAs up until now because the primary market is integer based.

Some 15 years ago I discussed the need for hard IP in FPGAs and was told by certain Xilinx employees that it isn't practical to include hard IP because of the proliferation of combinations and wasted resources that result. The trouble is the ratio of silicon area required for hard IP vs. FPGA fabric gets worse with each larger generation. So as we see now, FPGAs are including all manner of function blocks... like other devices.

What I don't get is why FPGAs are so special that they are the last holdout against becoming system-on-chip devices.

--

Rick C
• posted

Technically N log(N).

--

Rick C
• posted

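A barrel shifter is simpler than that. I believe that, much as in computing an FFT, intermediate terms in a barrel shifter can be shared between stages to allow this. (pseudo-VHDL)

    -- Left barrel shifter; the function name is added for illustration.
    -- Assumes "use ieee.numeric_std.all" for the unsigned type.
    function barrel_shl (indata : unsigned(31 downto 0);
                         sel    : unsigned(4 downto 0)) return unsigned is
      variable a, b, c, d, e : unsigned(31 downto 0);
    begin
      -- Each stage shifts left by a power of two, or passes the previous stage through.
      if sel(0) = '1' then a := indata(30 downto 0) & '0'; else a := indata; end if;
      if sel(1) = '1' then b := a(29 downto 0) & "00";     else b := a;      end if;
      if sel(2) = '1' then c := b(27 downto 0) & "0000";   else c := b;      end if;
      if sel(3) = '1' then d := c(23 downto 0) & x"00";    else d := c;      end if;
      if sel(4) = '1' then e := d(15 downto 0) & x"0000";  else e := d;      end if;
      return e;
    end function;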

--

Rick C
• posted

Yep true, thanks for the clarification

• posted

Well, if the multipliers are already there and you don't have alternative dedicated hardware, then I agree you are not wasting the multipliers in using them for a shifter.

It is not something I have looked into, but I'll happily take your word for it. However, like pretty much /any/ function, it will be smaller and faster in dedicated hardware than in logic blocks.

Fair enough.

Okay.

I think this has come up before in this newsgroup. But I can't remember if any conclusion was reached (probably not!).
