Cascaded floating-point reduction?

y1 = 1.5f*y0 - x*y0*y0*y0

...Note that all quantities are in single-precision floating point. I can't write this equation in behavioral form for the synthesizer to optimize, because it has to be broken down and fed into FP multipliers. I have both y0 and x available in one clock cycle, ready to be plugged into this equation... Now I'm stuck on reducing this equation to the fewest cycles... Right now I have cascaded series FP multipliers feeding into a final FP subtractor, each multiplication consuming one clock cycle.

What, in your opinion, is the best way to map this equation to hardware? Is there an alternative form of this equation that would be more suitable for implementation?

Regards.

Reply to
Saad Zafar

The usual one would be to factor out a y0, so

y1 = (1.5f-x*y0*y0)*y0;

That saves one multiplier, but maybe the same number of pipeline stages.

If you factor it as

y1 = 1.5f*y0 - (x*y0)*(y0*y0);

then you can do it in one fewer pipeline stage, but with no fewer multipliers.

-- glen

Reply to
glen herrmannsfeldt

I think the IEEE floating-point library's "*" operator is synthesizable, but synthesis would try to build the FP multipliers out of fixed-point multipliers (e.g. DSP blocks) itself, which may take more than one clock cycle.

If the above works, then you could enable retiming & pipelining, and then use your original expression, and run the result through multiple pipeline stages. Retiming/pipelining can redistribute the operations and/or logic among the pipeline stages.

I have seen cases where synthesis tools did this automatically when assembling smaller fixed point multipliers into one larger multiplier, so long as there were pipeline register stages (clock cycles) available to spread across.

Andy

Reply to
jonesandy

Not in any synthesizer I know. Floating point types aren't handled at all, much less operations like multiplication on them. I wouldn't expect them to do so *EVER*. Too much overhead, and too small a customer base would need/want it.

Regards,

Mark

Reply to
Mark Curry

Mark,

Ok, I checked our FPGA synthesis tool's documentation.

The Synplify Pro reference guide states the following regarding the built-in "real" data type:

"When one of the following constructs is encountered, compilation continues, but will subsequently error out if logic must be generated for the construct.

- real data types (real data expressions are supported in VHDL-2008 IEEE float_pkg.vhd)
- real data types are supported as constant declarations or as constants used in expressions as long as no floating point logic must be generated"

Thus, you cannot use the built-in real data type or expressions thereof to generate logic.

However, the reference guide also states the following:

"The following packages are supported in VHDL 2008:
- fixed_pkg.vhd, float_pkg.vhd, fixed_generic_pkg.vhd, float_generic_pkg.vhd, fixed_float_types.vhd
- IEEE fixed and floating point packages
...
String and text I/O functions in the above packages are not supported. These functions include read(), write(), to_string()."

Significantly, it states no other limitations on the support for float_pkg.

float_generic_pkg (the generic package that float_pkg instantiates) defines the "*" operator for type float.

From ieee.float_generic_pkg-body.vhdl, the following indicates that the package is synthesizable:

-- This deferred constant will tell you if the package body is synthesizable
-- or implemented as real numbers, set to "true" if synthesizable.
constant fphdlsynth_or_real : BOOLEAN := true; -- deferred constant

So, while I have not tried it to see, it appears that there are at least definite plans, if not the current ability, to synthesize floating point hardware long before *EVER* gets here.

The resulting hardware may not be particularly efficient, and may not be operable in a single clock cycle at any reasonable clock rate, but that is where retiming and pipelining come in.

Andy

Reply to
jonesandy

Andy,

I stand corrected. Being a Verilog user, I wasn't familiar with these updates for VHDL-2008.

Looks like they've done it correctly. There's default support for IEEE 754 32-bit and IEEE 754 64-bit. But users can (and very likely should) use the generic float types, specifying all the settings including exponent width, fraction width, rounding options, normalization options, etc. One wonders, however, how exceptions (i.e. NaN, etc.) will be handled in synthesis.

The generic 32-bit (and worse, 64-bit) IEEE 754 floating point is rarely EVER appropriate for FPGA (and even ASIC) designs. For both, you're almost always designing something for a specific problem. There are not going to be many valid cases where a specific wire is going to need all that dynamic range. For generic processors (and DSPs), yeah, it may be appropriate.

But more controlled "floating point" like these libraries provide might be useful. I tend to think they'll also be dangerous in the hands of inexperienced HW designers, who will just take the defaults and go.

Thanks for the pointer.

Mark

Reply to
Mark Curry

(snip)

Most of the time, you want internal pipelining in the floating point operations. There is nowhere to specify that with the usual arithmetic operators, but it is easy if you reference a module to do it.

-- glen

Reply to
glen herrmannsfeldt

Most of the time you will need the extra pipelining if you want to infer built-in multipliers.

This is where retiming and pipelining synthesis optimizations come in handy. If you follow up (and/or precede) the expression assignment with a few extra clock cycles of latency (pipeline register stages), the synthesis tool can distribute the HW across the extra clock cycles automatically.

Whether synthesis can do it as well as you can manually, I don't know. But if it is good enough to work, does it really need to be as good as you could have done manually? I'd rather have the maintainability of the mathematical expression, if it will work.

Andy

Reply to
jonesandy

(snip regarding pipelining)

Which tools do that? That sounds pretty useful.

As I am not the OP, the things that I try to do are different. One that I have wondered about is the ability to add extra register stages to speed up the critical path. I work on very long, fixed-point pipelines, so usually there are, at some point, some very long routes which limit the speed. If I could put registers in them, it could run a lot faster.

Well, for really large problems every ns counts. For 5% difference, maybe I wouldn't worry about it, but 20% or 30% is worth working for.

-- glen

Reply to
glen herrmannsfeldt

In Xilinx XST, the switch you're looking for is:

-register_balancing yes

I now leave it on by default; it rarely makes things worse. It seems to help: I notice in the log file that it moves flops forward and backward through the combinational logic in an attempt to better balance the pipeline paths. How well it does the job, I've not dug in that deep.

Sounds just like what the tool is targeting. If you have access to it, I'd suggest giving it a shot.

Regards,

Mark

Reply to
Mark Curry

Glen,

I know Synplify Pro has a retiming/pipelining option (for Xilinx and Altera targets), and I think Altera's and Xilinx's own tools do as well.

The last time I checked, straight retiming may only move logic into an adjacent clock cycle, but pipelining of functions such as multipliers or multiplexers can spread that logic over several clock cycles. I have seen examples where a large multiply (larger than a DSP block could handle) was automatically partitioned and pipelined to use multiple DSP blocks.

Since straight retiming may be limited to adjacent clock cycles, it might be best to provide additional clock cycles of latency before and after the expression, so that two empty, adjacent clock cycles are available. Note that retiming does not need empty clock cycles to share logic across, but there does need to be positive slack in those adjacent clock cycles in order to "make room" for any retimed logic.

As far as timing or utilization is concerned, as long as I have positive slack in both, with any margin requirements met, I prefer to have the most understandable, maintainable description possible, even if a lesser description would cut either (or both) by half. This was very hard to do when I started VHDL-based FPGA design many years ago (just meeting timing and utilization was tougher in those devices and with those tools, and the "optimizer" in me was hard to re-calibrate). I now try to optimize for maintainability whenever possible.

Andy

Reply to
jonesandy
