So, just doing a brief search, it looks like Altera is touting a floating
point slice in at least one of their lines.
Is this really a thing, or are they wrapping some more familiar fixed-
point processing with IP to make it floating point?
And, anything else you know.
I'm not sure what you are asking. What do you think floating point is
exactly? The core of floating point is just fixed point arithmetic with
an extra bit (uh, rereading this I need to make clear this is the
British "bit" meaning part :) to express the exponent of a binary
multiplier.  To perform addition or subtraction on floating point
numbers the mantissas need to be aligned, meaning the bits must be
lined up so that corresponding bits have equal weight.  This requires
adjusting one of the exponents so the two are equal while shifting that
operand's mantissa to match.  Then the addition can be done on the
mantissas and the result normalized so the msb of the mantissa is in
the correct alignment.
Multiplication is actually easier in that normalization is not required,
but exponents are added and the result is adjusted for correct alignment
of the mantissa.
So the heart of a floating point operation is a fixed point ALU with
barrel shifters before and after.
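
As a rough illustration of the "shifter, fixed point add, shifter" structure described above, here is a minimal Python sketch. The toy format (positive values only, value = mant * 2**exp, a 24-bit mantissa, truncation instead of rounding) and the names MANT_BITS and fp_add are assumptions made for this example, not any particular device's format.

```python
MANT_BITS = 24  # assumed mantissa width, including the leading 1


def fp_add(exp_a, mant_a, exp_b, mant_b):
    """Add two positive toy floats, each worth mant * 2**exp."""
    # Pre-shift: make the exponents equal by shifting the smaller operand right.
    if exp_a < exp_b:
        exp_a, mant_a, exp_b, mant_b = exp_b, mant_b, exp_a, mant_a
    mant_b >>= exp_a - exp_b            # alignment shifter (truncates low bits)

    # Plain fixed point add on the aligned mantissas.
    total = mant_a + mant_b
    exp = exp_a
    if total == 0:
        return 0, 0

    # Post-shift: bring the msb back to bit MANT_BITS - 1.
    while total >= (1 << MANT_BITS):        # carry out, shift right
        total >>= 1
        exp += 1
    while total < (1 << (MANT_BITS - 1)):   # leading zeros, shift left
        total <<= 1
        exp -= 1
    return exp, total


# 1.5 + 0.75 = 2.25
print(fp_add(-23, 0xC00000, -24, 0xC00000))
```

The two while loops stand in for the normalization barrel shifter; real hardware would find the shift distance in one step rather than iterating.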
Not sure what your point is. The principles are the same in software or
hardware. I was describing hardware I have worked on. ST-100 from Star
Technologies. I became very intimate with the inner workings.
The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing that,
but possibly.  The basics are still the same.  Adds use a barrel shifter
to denormalize the mantissa so the exponents are equal, an integer adder
and a normalization barrel shifter to produce the result.  Multiplies
use a multiplier for the mantissas and an adder for the exponents (with
adjustment for exponent bias) followed by a simple shifter to normalize.
Both add and multiply are about the same level of complexity, as a barrel
shifter is almost as much logic as the multiplier.
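
The multiply path can be sketched the same way. This is only an illustration: it assumes positive, normalized inputs in the IEEE-754 single precision layout (biased exponent, 24-bit mantissa with the leading 1 made explicit), truncates instead of rounding, and ignores overflow and special cases. The names BIAS, MANT_BITS and fp_mul are made up for the example.

```python
BIAS = 127       # IEEE-754 single precision exponent bias
MANT_BITS = 24   # 23 stored bits plus the implied leading 1


def fp_mul(exp_a, mant_a, exp_b, mant_b):
    """Multiply two positive normalized values given as (biased exp, mantissa)."""
    # Both inputs carry the bias, so one copy has to be removed after adding.
    exp = exp_a + exp_b - BIAS
    product = mant_a * mant_b               # up to 48 bits wide

    # The product of two values in [1, 2) lies in [1, 4), so the "simple
    # shifter" only ever has to choose between two positions.
    if product >> (2 * MANT_BITS - 1):      # top bit set: product is in [2, 4)
        product >>= MANT_BITS
        exp += 1
    else:                                   # product is in [1, 2)
        product >>= MANT_BITS - 1
    return exp, product


# 1.5 * 1.5 = 2.25, i.e. 1.125 * 2**1
print(fp_mul(127, 0xC00000, 127, 0xC00000))
```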
Other than the special case handling of IEEE-754, what do you think I am
Altera claims it IS IEEE-754 compliant, but it is surprisingly hard to find any more detailed facts. And we all know how FPGA marketing works, so a bit of doubt is very understandable...
The best I could find is this:
In short: It appears that infinities and NaNs are supported, however sub-normals are treated as 0 and only one rounding mode is supported...
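
To make "sub-normals are treated as 0" concrete, here is a small Python sketch of a flush-to-zero step on single precision bit patterns. It only illustrates the general idea; whether the hardware keeps the sign of the flushed value is an assumption here, and flush_subnormal is a name invented for the example.

```python
import struct


def flush_subnormal(x):
    """Round x to float32 and replace a subnormal result with signed zero."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0 and fraction != 0:   # subnormal encoding
        bits &= 0x80000000                # keep only the sign bit
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(flush_subnormal(1e-40))   # below the smallest normal float32
print(flush_subnormal(1.5))     # normal values pass through unchanged
```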
Somewhere there is a video which shows that using the floating-point DSPs cuts the LE-usage by about 90%, so if you need floating point, I think Arria/Stratix 10 are really the best way to go...
That video may be for the *entire* floating point unit in the fabric.
Most FPGAs have dedicated integer multipliers which can be used for both
the multiplier and the barrel shifters in a floating point ALU. The
adders and random logic would need to be in the fabric, but will be
Xilinx and Altera both support "DSP blocks" that do a multiply and add
(they say multiply and accumulate, but it's more versatile than that).
According to the above paper, Altera has paired up their DSP blocks and
added logic to each pair so that they become a floating-point arithmetic
block. Personally I think that for most "regular" DSP uses you're going
to know the range of the incoming data and will, therefore, only need
fixed-point -- but it looks like they're chasing the "FPGA as a
supercomputer" market (hence, the purchase by Intel), and for that you
need floating point just as a selling point.
If you look around I think you will find many uses for floating point in
the DSP market. It's not just a selling gimmick. I don't think the
many floating point DSP devices are sold because they look good in the
product's spec sheet.
Heck back in the day when DSP was done on mainframes the hot rods of
computing were all floating point. Cray-1, ST-100...
> If you look around I think you will find many uses for floating point in
> the DSP market.
There was a rule of thumb in voice compression that floating point DSP took a third fewer operations than fixed point DSP. Plus code development is probably faster when you don't have to keep track of the scaling.
It just all works better with dedicated hardware.  Finding the leading one
for normalization is somewhat slow in the FPGA and is something that
benefits from dedicated hardware.  Using a DSP48 (if we're talking about
Xilinx) for a barrel shifter is fairly fast, but requires 3 cycles of
latency, can only shift up to 18 bits, and is overkill for the task.
You're using a full multiplier as a shifter; a dedicated shifter would be
smaller and faster.
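
For anyone wondering how a multiplier stands in for a shifter: a left shift by n is a multiply by the one-hot constant 2**n. A minimal Python sketch follows; the 18-bit operand width is taken from the limit mentioned above, and the names WIDTH and shift_left_via_multiply are invented for the example.

```python
WIDTH = 18  # assumed multiplier operand width


def shift_left_via_multiply(value, amount):
    """Shift left by multiplying with the one-hot constant 2**amount."""
    if not 0 <= amount < WIDTH:
        raise ValueError("shift amount exceeds the multiplier operand width")
    one_hot = 1 << amount   # in hardware this is just a small decoder
    return value * one_hot  # the multiplier does the actual shifting


print(bin(shift_left_via_multiply(0b1011, 3)))
```

The whole multiplier array is tied up to compute something a few layers of multiplexers could do, which is why it counts as overkill.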
All this stuff adds latency.  When I pull up CoreGen and ask for the basic
FP adder, I get something that uses only 2 DSP48s but has 12 cycles of
latency.  And there is a lot of fabric routing so timing is not very
deterministic.
I'm not sure how much you know about multipliers and shifters.
Multipliers are not magical. Multiplexers *are* big. A multiplier has
N stages with a one bit adder at every bit position. A barrel
multiplexer has nearly as many bit positions (you typically don't need
all the possible outputs), but uses a bit less logic at each position.
Each bit position still needs a full 4 input LUT. Not tons of
difference in complexity.
The multipliers I've seen have selectable latency down to 1 clock.
Rolling a barrel shifter will generate many layers of logic that will
need to be pipelined as well to reach high speeds, likely many more
layers for the same speeds.
What do you get if you design a floating point adder in the fabric? I
can only imagine it will be *much* larger and slower.
A 32-bit barrel shifter can be made with 5 steps, each step being a set
of 32 two-input multiplexers. Dedicated hardware for that will be
/much/ smaller and more efficient than using LUTs or a full multiplier.
Normalisation of FP results also requires a "find first 1" operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUTs.
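
As a sketch of what that normalisation step computes, assuming a 32-bit mantissa register (the name normalise and the default width are choices made for this example): find the position of the first 1 from the MSB end, then shift left by that many places.

```python
def normalise(mantissa, width=32):
    """Shift mantissa left until its top bit is set; return (shifted, count)."""
    if mantissa == 0:
        return 0, 0
    # "Find first 1" from the msb end, expressed as a leading zero count.
    # Assumes 0 <= mantissa < 2**width.
    leading_zeros = width - mantissa.bit_length()
    shifted = (mantissa << leading_zeros) & ((1 << width) - 1)
    return shifted, leading_zeros


shifted, count = normalise(0x00012345)
print(hex(shifted), count)
```

The shift count is also what gets subtracted from the exponent.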
So a DSP block that has dedicated FP support is going to be smaller and
faster than using integer DSP blocks with LUTs to do the same job.
If I understand, you can do a barrel shifter with log2(n) complexity, hence
your 5 steps, but you will have the combinational delays of 5 muxes, which
could limit your maximum clock frequency. A brute force approach will use
more resources but will probably allow a higher clock frequency.
The "brute force" method would be 1 layer of 32 32-input multiplexers.
And how do you implement a 32-input multiplexer in gates? You basically
have 5 layers of 2-input multiplexers.
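
To illustrate the point, here is a small Python model of a 32-input multiplexer built purely from 2-input multiplexers. The function names mux2 and mux32 are invented for the example; it models the structure, not the timing.

```python
def mux2(a, b, sel):
    """2-input multiplexer: returns b when sel is 1, otherwise a."""
    return b if sel else a


def mux32(inputs, sel):
    """Select inputs[sel] from 32 inputs using five layers of mux2."""
    assert len(inputs) == 32 and 0 <= sel < 32
    layer = list(inputs)
    for bit in range(5):                  # one layer per select bit
        s = (sel >> bit) & 1
        layer = [mux2(layer[i], layer[i + 1], s)
                 for i in range(0, len(layer), 2)]
    return layer[0]                       # 16 + 8 + 4 + 2 + 1 = 31 mux2 cells


print(mux32(list(range(100, 132)), 7))
```

So the "single layer" of 32-input multiplexers still has a depth of five 2-input multiplexers per output bit, and uses 31 of them per bit instead of 5.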
If the depth of the multiplexer is high enough, you might use tri-state
gates but I suspect that in this case you'd implement it with normal logic.
Yes, I stand corrected. Still, it is hardly a "waste" of multipliers to
use them for multiplexers.
Find first 1 can be done using a carry chain which is quite fast. It is
the same function as used in Gray code operations.
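
One common way to get a find-first-1 out of a carry chain is the two's complement trick x AND (-x), where the carry ripples through the trailing zeros and stops at the first 1. The Python sketch below shows that trick; it may not be exactly the circuit meant above, and WIDTH, lowest_one, reverse_bits and leading_one are names made up for the example. For the leading 1 the bits are simply reversed first, which in hardware is only wiring.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def lowest_one(x):
    """One-hot mask of the least significant set bit of x."""
    # ~x + 1 is the two's complement negate; the +1 ripples through the
    # trailing zeros exactly as a carry chain would.
    return x & ((~x + 1) & MASK)


def reverse_bits(x):
    """Reverse the bit order of a WIDTH-bit value."""
    return int(format(x, "0{}b".format(WIDTH))[::-1], 2)


def leading_one(x):
    """One-hot mask of the most significant set bit of a WIDTH-bit value."""
    return reverse_bits(lowest_one(reverse_bits(x)))


print(hex(leading_one(0x00012345)))
```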
Who said it wouldn't be? I say exactly that below. My point was just
that floating point isn't too hard to wrap your head around and not so
horribly different from fixed point. You just need to stick a few
functions onto a fixed point multiplier/adder.
I was responding to:
"Is this really a thing, or are they wrapping some more familiar fixed-
point processing with IP to make it floating point?"
The difference between fixed and floating point operations is a few
functions beyond the basic integer operations which we have been
discussing. Floating point is not magic or incredibly hard to do. It
has not been included on FPGAs up until now because the primary market
is integer based.
Some 15 years ago I discussed the need for hard IP in FPGAs and was told
by certain Xilinx employees that it isn't practical to include hard IP
because of the proliferation of combinations and wasted resources that
result. The trouble is the ratio of silicon area required for hard IP
vs. FPGA fabric gets worse with each larger generation. So as we see
now FPGAs are including all manner of function blocks.... like other
What I don't get is why FPGAs are so special that they are the last
holdouts against becoming system on chip devices.
A barrel shifter is simpler than that.  I believe that, in a way somewhat
parallel to computing an FFT, the intermediate terms in a barrel shifter
can be shared between outputs to allow this.  (pseudo vhdl)
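function barrel_shift_left (indata : unsigned(31 downto 0);
                            sel    : unsigned(4 downto 0))
    return unsigned is
    variable a, b, c, d, e : unsigned(31 downto 0);
begin
    -- Each stage takes the previous stage's result and shifts it left
    -- by 2**i places when sel(i) is set, otherwise passes it through.
    if sel(0) = '1' then a := indata(30 downto 0) & "0"; else a := indata; end if;
    if sel(1) = '1' then b := a(29 downto 0) & "00";     else b := a;      end if;
    if sel(2) = '1' then c := b(27 downto 0) & "0000";   else c := b;      end if;
    if sel(3) = '1' then d := c(23 downto 0) & x"00";    else d := c;      end if;
    if sel(4) = '1' then e := d(15 downto 0) & x"0000";  else e := d;      end if;
    return e;
end function;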
Yep true, thanks for the clarification
Well, if the multipliers are already there and you don't have
alternative dedicated hardware, then I agree you are not wasting the
multipliers in using them for a shifter.
It is not something I have looked into, but I'll happily take your word
for it. However, like pretty much /any/ function, it will be smaller
and faster in dedicated hardware than in logic blocks.
I think this has come up before in this newsgroup. But I can't remember
if any conclusion was reached (probably not!).