With regard to the speed-optimised single-precision floating-point core, look at:
Table 6: Latency of Speed Optimized Core (Page 10)
Table 10: Characterization of Speed-Optimized Single-Precision Core (Page 18)
These are the characteristics of the speed-optimised floating-point cores from Xilinx. For Virtex-4 there's a multiplier version that uses a single DSP48 block, as opposed to the standard version that uses four DSP48s. My question is: what's the point of it? The pure-logic version listed there uses fewer slices than the single-DSP48 one, has a lower latency and runs at as-near-as-damn-it the same frequency. More to the point, you're not throwing away a DSP48...
I've spoken to colleagues about this, and our best guess is that there's a typo, most likely in the slice count for the single DSP48 + slices version. Does anyone know better?
Cheers,
Robin Bruce