Adding "super-LUTs" to FPGA, good idea ?

Hi,

A thought crossed my mind ...

I've been working a lot with Virtex-4 lately, and getting fast (~300-350 MHz) logic for the datapath isn't really hard. But making the control logic go that fast is a whole lot trickier; just a 10-bit comparator becomes "a lot" at that speed ... and some control signals have high fanout, which puts the net delay in the 1-1.5 ns range, half of the period ...

So what if, every now and then in the FPGA fabric, there were a small cluster (say one CLB) of "super LUTs" with much faster logic (but no special functions like SRL or distributed RAM) and "bigger" drivers to charge/discharge the nets faster and propagate the control signals?

Maybe it's infeasible for some reason; it's just a thought ...

Sylvain

Reply to
Sylvain Munaut

"Sylvain Munaut" schrieb im Newsbeitrag news:4399cf94$0$9070$ snipped-for-privacy@news.skynet.be...

I guess Altera would claim they have it in the Stratix ALM.

AL (Antti Lukats)

Reply to
Antti Lukats

They do? I'm gonna check that out ...

Sylvain

Reply to
Sylvain Munaut

"Sylvain Munaut" schrieb im Newsbeitrag news:4399e34b$0$10953$ snipped-for-privacy@news.skynet.be...

Not quite, but they claim 7-input LUT capability for better logic optimization.

antti

Reply to
Antti Lukats

I think if you look at the logic that is not making speed, it is probably using the carry chain (comparators over 7 bits do, for example). General logic is quite fast in V4. The carry chain is very slow comparatively, which has been a beef of mine. Simply speeding up the carry chain so that reasonable sized adders (16-24 bits) can run at speeds similar to the block rams and DSP slices would make all the difference. (yes Austin, I know the "simply" isn't all that easy).

You already do have "super LUTs" in the Virtex4. They are called RAMB16, and can be used for logic functions with up to 14 inputs, at clock rates of 400 MHz in a -10 part.

The other option you do have is to optimize your control logic to reduce the reliance on difficult structures such as carry. For example, if your control is using a compare to decode a count, consider instead using a down counter so that the terminal count is the most significant bit. Also consider other counter architectures, such as linear feedback shift register counters to eliminate wide logic functions.
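
As a rough behavioral sketch of that down-counter trick (Python rather than RTL; the function name and the period of 10 are purely illustrative): load N-2 and count down through zero, so the terminal count is simply the counter's MSB going high and no wide comparator is needed.

# Behavioral sketch of "down counter, terminal count = MSB".
# In RTL the TC flag would just be the counter's sign/MSB bit,
# so there is no wide equality compare sitting on the carry chain.

def down_counter_tc(n_cycles, total_steps=24):
    load = n_cycles - 2                        # load N-2, count down through 0 to -1
    width = max(load.bit_length(), 1) + 1      # one extra bit so the MSB acts as a sign
    count = load
    for step in range(total_steps):
        tc = (count >> (width - 1)) & 1        # terminal count is just the MSB
        print(f"cycle {step:2d}  count={count & ((1 << width) - 1):0{width}b}  tc={tc}")
        count = load if tc else count - 1      # reload on terminal count, else count down

down_counter_tc(10)                            # tc pulses once every 10 cycles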

Reply to
Ray Andraka

... Well, 400 MHz if you register both sides and don't have too much logic before and after.

A block RAM without the output register is around 2.1 ns clock-to-out, plus roughly 0.5 ns of net delay after that. With the output register it's 0.9 ns clock-to-out, but sometimes you just can't afford a 1 or 2 clock cycle latency ...

And here I was referring more to the drive strength than to the number of input nets. For example, if you have to generate a clock enable combinatorially (just a single LUT level, but still) and it controls something like 50 FFs, the net takes about 1.5 ns of propagation ... half of my period ...

Well, yes, optimizing the control is good but sometimes very hard ... I've basically spent the last few days doing just that to finally meet timing. My comparators are not for counters but to detect an "empty" condition in a FIFO-like block ("FIFO-like" because it's quite a bit more complicated than a simple FIFO).

Sylvain

Reply to
Sylvain Munaut

Agreed about the BRAM speed. You pretty much have to use the DO_Reg for a 400 MHz design in a -10 part. There shouldn't be any logic between the previous register and inputs to the BRAM, and the outputs can go through a single level of logic, but placement isn't critical.

As I said, the real stumbling block for fast fabric stuff is the carry chain. If you are using an SX part, you can use the DSP48's to get faster arithmetic, but at a considerable cost.

I stand by my contention that if the carry chains were faster (more specifically, the time to get on and off them), you'd probably find it a lot easier to make timing in your design.

Reply to
Ray Andraka

I don't know if your initial idea is feasible or not, but it sounds good to me.

In the meantime, you can reduce the fanout by using logic duplication. If you duplicate the driver and let each copy drive only half of the flip-flops, that should improve your timing (at an increased cost in area). You can do that in one of two ways:

  1. Manually (in your code) create two signals, and set options so that your synthesis tool does not optimize away the redundant logic
  2. Turn on logic duplication, and hope the synthesis tool recognizes that the critical path can be improved by duplicating that piece of logic

Fred

Reply to
juendme

Sylvain Munaut wrote:

Well, there are a couple of 14-input LUTs in their newer devices. The speed is about 2 ns in Virtex-4. They call them BRAMs.

Kolja Sulimma

Reply to
Kolja Sulimma

And one of those dual-ported BRAMs can be either two identical, but independently addressable, 14-input LUTs, or two completely different, independent 13-input LUTs. Naturally...

Peter Alfke
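
As a rough sketch of how one might generate the contents for such a BRAM "LUT" (Python; the function name is made up here, and mapping the resulting bits into the actual INIT attributes depends on your tool flow and is not shown): evaluate the Boolean function over all 2^14 address values and store one output bit per address in a 16K x 1 configuration.

# Sketch: build the 16 Kbit truth table for an arbitrary 14-input Boolean
# function, for a RAMB16 configured as 16K x 1 (address = the 14 inputs,
# data out = the function value). Writing the bits into INIT_xx attributes
# is flow-dependent and not shown.

def bram_truth_table(func, n_inputs=14):
    table = []
    for addr in range(1 << n_inputs):
        bits = [(addr >> i) & 1 for i in range(n_inputs)]   # input i = address bit i
        table.append(1 if func(bits) else 0)
    return table

# Example: a 14-input all-ones detector, i.e. a wide decode that would
# otherwise end up on the slow carry chain.
tt = bram_truth_table(lambda b: all(b))
print(sum(tt), "of", len(tt), "addresses map to 1")         # expect exactly 1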

Reply to
Peter Alfke

An important "Danger Will Robinson" observation on using BRAMs:

If you violate setup/hold on the address inputs of an enabled BRAM, EVEN IF WE IS INACTIVE, BRAM contents can (will) be corrupted.

This means:
- No multicycles (unless you use EN).
- No async inputs.
- TIMING CONSTRAINTS ARE A NECESSITY!!!

IF BRAM TIMING CONSTRAINTS ARE NOT SET PROPERLY, AND MET, BRAM CONTENTS WILL BE CORRUPTED!!!

See Answer Record 21870 "Virtex-II/-II Pro/-4 block RAM - Do the setup/hold times for the Address inputs need to be met, even if the output is unused and WE is deasserted?"

Brian

Reply to
Brian Davis

Brian, I think you overdramatize this. (I was involved in finding and explaining this behavior a few weeks ago).

Anybody who writes into the BRAM must of course abide by the address set-up time requirement. Anybody who reads from the BRAM must also abide by the address set-up time requirement. The surprising, non-obvious requirement is that, if the BRAM is enabled, a violation of the address set-up time can corrupt data, even though WE remained disabled. So, do NOT change the address right before the enabled active clock edge. You would obviously not do this when you are writing, and you wouldn't do it when you are reading, but you must also not do it when you have the BRAM clock-enabled and read-enabled and you really do not care about the result of the uncontrolled read operation. The easy way out of it is to disable the clock, not just WE.

This is a highly unusual (but explainable) restriction, so unusual that neither Xilinx nor any customer found it for many years.

Peter Alfke, Xilinx Applications

Reply to
Peter Alfke

Hardly - it should be mentioned front and center in the BRAM sections of the datasheet and user guides, in bold print, with circles and arrows and a paragraph on the back explaining the problem.

Adopting the same head-in-the-sand, "it's in an Answer Record somewhere" mentality that has pervaded Xilinx's approach to documenting serious problems in recent years does not help your customers one whit.

The thread in question was about using BRAMS as logic.

Who would expect a ROM to clobber its own contents due to an address setup violation?

Brian

Reply to
Brian Davis

I have to agree with Brian. This is a big deal. I expected that violating read address setup time would screw up the read result that cycle; I was amazed to find out that violating address timing could actually change the contents of the RAM. I imagine that anyone using the BRAM as a ROM, with WE arc-welded to ground, would be doubly surprised.

I have a design in which different buses supply the read and write addresses to a large number of BRAMs. The write address is synchronized; the read address isn't, because I saw no reason to, at least until I saw Answer Record 21870 (which I saw only by accident, thanks to a tip from another designer). Even then, it took about a week and a half working with the Hotline and an FAE before I found out what the Answer Record actually meant, the original version having been more vague than the current one. So I've got a design that I have to redo.

The "easy way out of the problem" is easy only if you know there's a problem in the first place.

I heard from the Hotline that the data sheets for the affected families would be amended. If amended data sheets haven't been released already, I hope they will be soon.

I guess the thing that bothers me the most is that once the problem was identified, no one at Xilinx seemed to know that when RAMs don't work like RAMs, it's potentially a Big Damn Deal for at least some designers, and deserving of something more than to be hidden away in an Answer Record that you might or might not see.

Bob Perlman
Cambrian Design Works

Reply to
Bob Perlman

Valid point. Reminds me of an oops Philips made in their UARTs: a test mode that was entered by a READ [?!] of a certain address. So, yes, without care on the selection lines, this could go out to lunch. Took them a while to admit to it ...

Another issue here: if this IS loaded/used as a ROM, what happens during a brownout, when it is quite possible that timing MAY be violated? Sounds like there could be a lot of ?? space between the 'Let's Reconfigure' decision point and the 'Inside Specs' operating point.

-jg

Reply to
Jim Granville

Or if the BlockROM clock is sourced by a DCM which goes unlocked, thus rendering all BlockROM contents unreliable until the device has been reconfigured.

Oops.

Better not use that XST BRAM_MAP logic-into-BRAM mapping option any time soon...

Brian

Reply to
Brian Davis

I'm with Brian and Bob on this. As designers, we need to have limitations like this, as well as those with the FIFO16s, printed in bold right in the user guides so that they can be avoided by design rather than discovered in the lab. Finding it in the lab is too late in the design cycle. The question is, what other gems like this are hidden away in obscure answer records?

Reply to
Ray Andraka

.... (stuff deleted)

I have the same problem with the high clock-to-output. I found that when putting the signal through a delay (SRL16s), I can actually detect the zero condition BEFORE going in, and shift the zero signal through the same delay. In theory I can detect some bits at each shift, making it very fast. When using a RAMB I can also detect the zero condition on any port and reserve a bit for that.
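
A small behavioral sketch of that idea (Python rather than hardware; the depth of 4 is arbitrary): the zero flag is computed once at the input side and travels through a shift register of the same depth as the data, so nothing wide has to be decoded on the fast output side.

# The zero flag is precomputed at the input and delayed alongside the data.
# In hardware both pipes would be SRL16-style shift registers of equal depth.

from collections import deque

DEPTH = 4

data_pipe = deque([0] * DEPTH, maxlen=DEPTH)
zero_pipe = deque([1] * DEPTH, maxlen=DEPTH)

def clock(data_in):
    # one clock tick: return the delayed data and its matching delayed flag
    data_out, zero_out = data_pipe[0], zero_pipe[0]
    data_pipe.append(data_in)
    zero_pipe.append(1 if data_in == 0 else 0)   # cheap check done early
    return data_out, zero_out

for word in [3, 0, 7, 0, 0, 5, 1, 0]:
    print(clock(word))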

Reply to
Morten Leikvoll
