Adding "super-LUTs" to FPGA, good idea ?

Hi,

A thought crossed my mind ...

I've been working a lot with Virtex-4 lately, and getting fast (~300-350 MHz) logic for the datapath isn't really hard. But making the control logic go that fast is a whole lot trickier; just a 10-bit comparator becomes "a lot" at that speed ... and some control signals have high fanout, which puts the net delay in the 1-1.5 ns range, half of the period ...

So what if, every now and then in the FPGA fabric, there were a small cluster (say one CLB) of "super LUTs" with much faster logic (but no special functions like SRL or distributed RAM) and "bigger" drivers to charge/discharge the nets faster and propagate the control signals?

Maybe it's infeasible for some reason; it's just a thought ...

Sylvain

Reply to
Sylvain Munaut

"Sylvain Munaut" schrieb im Newsbeitrag news:4399cf94$0$9070$ snipped-for-privacy@news.skynet.be...

I guess Altera would claim they have it in the Stratix ALM.

AL (Antti Lukats)

Reply to
Antti Lukats

They do? I'm gonna check that out ...

Sylvain

Reply to
Sylvain Munaut

"Sylvain Munaut" schrieb im Newsbeitrag news:4399e34b$0$10953$ snipped-for-privacy@news.skynet.be...

Not quite, but they claim 7-input LUT capability for better logic optimization.

antti

Reply to
Antti Lukats

I think if you look at the logic that is not making speed, it is probably using the carry chain (comparators over 7 bits do, for example). General logic is quite fast in V4. The carry chain is very slow comparatively, which has been a beef of mine. Simply speeding up the carry chain so that reasonable sized adders (16-24 bits) can run at speeds similar to the block rams and DSP slices would make all the difference. (yes Austin, I know the "simply" isn't all that easy).

You already do have "super LUTs" in the Virtex4. They are called RAMB16, and can be used for logic functions with up to 14 inputs, at clock rates of 400 MHz in a -10 part.

The other option you do have is to optimize your control logic to reduce the reliance on difficult structures such as carry. For example, if your control is using a compare to decode a count, consider instead using a down counter so that the terminal count is the most significant bit. Also consider other counter architectures, such as linear feedback shift register counters to eliminate wide logic functions.
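
As a rough behavioral sketch of that down-counter trick (Python rather than RTL; the function name and the period of 10 are purely illustrative): load N-2 and count down through zero, so the terminal count is simply the counter's MSB going high and no wide comparator is needed.

# Behavioral sketch of "down counter, terminal count = MSB".
# In RTL the TC flag would just be the counter's sign/MSB bit,
# so there is no wide equality compare sitting on the carry chain.

def down_counter_tc(n_cycles, total_steps=24):
    load = n_cycles - 2                        # load N-2, count down through 0 to -1
    width = max(load.bit_length(), 1) + 1      # one extra bit so the MSB acts as a sign
    count = load
    for step in range(total_steps):
        tc = (count >> (width - 1)) & 1        # terminal count is just the MSB
        print(f"cycle {step:2d}  count={count & ((1 << width) - 1):0{width}b}  tc={tc}")
        count = load if tc else count - 1      # reload on terminal count, else count down

down_counter_tc(10)                            # tc pulses once every 10 cycles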

Reply to
Ray Andraka

... Well, 400 MHz if you register both sides and don't have too much logic before and after.

A block RAM without the output register is around 2.1 ns clock-to-out, plus roughly 0.5 ns of net delay after that. With the output register it's 0.9 ns clock-to-out, but sometimes you just can't afford a 1 or 2 clock cycle latency ...

And here I was referring more to the drive strength than to the number of input nets. For example, if you have to generate a clock enable combinatorially (just a single LUT level, but still) and it controls something like 50 FFs, the net takes about 1.5 ns of propagation ... half of my period ...

Well, yes, optimizing the control is good but sometimes very hard ... I've basically spent the last few days doing just that to finally meet timing. My comparators are not for counters but to detect an "empty" condition in a FIFO-like block ("FIFO-like" because it's quite a bit more complicated than a simple FIFO).

Sylvain

Reply to
Sylvain Munaut

Agreed about the BRAM speed. You pretty much have to use the DO_Reg for a 400 MHz design in a -10 part. There shouldn't be any logic between the previous register and inputs to the BRAM, and the outputs can go through a single level of logic, but placement isn't critical.

As I said, the real stumbling block for fast fabric stuff is the carry chain. If you are using an SX part, you can use the DSP48's to get faster arithmetic, but at a considerable cost.

I stand by my contention that if the carry chains were faster (more specifically, the time to get on and off them), you'd probably find it a lot easier to make timing in your design.

Reply to
Ray Andraka

I don't know if your initial idea is feasible or not, but it sounds good to me.

In the meantime, you can reduce the fanout by using logic duplication. If you duplicate the driver and let each copy drive only half of the flip-flops, that should improve your timing (at an increased cost in area). You can do that in one of two ways:

  1. Manually (in your code) create two signals, and set options so that your synthesis tool does not optimize away the redundant logic
  2. Turn on logic duplication, and hope the synthesis tool recognizes that the critical path can be improved by duplicating that piece of logic

Fred

Reply to
juendme

Sylvain Munaut wrote:

Well, there are a couple of 14-input LUTs in their newer devices. The speed is about 2 ns in Virtex-4. They call them BRAMs.

Kolja Sulimma

Reply to
Kolja Sulimma

And one of those dual-ported BRAMs can be either two identical, but independently addressable, 14-input LUTs, or two completely different, independent 13-input LUTs. Naturally...

Peter Alfke
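
As a rough sketch of how one might generate the contents for such a BRAM "LUT" (Python; the function name is made up here, and mapping the resulting bits into the actual INIT attributes depends on your tool flow and is not shown): evaluate the Boolean function over all 2^14 address values and store one output bit per address in a 16K x 1 configuration.

# Sketch: build the 16 Kbit truth table for an arbitrary 14-input Boolean
# function, for a RAMB16 configured as 16K x 1 (address = the 14 inputs,
# data out = the function value). Writing the bits into INIT_xx attributes
# is flow-dependent and not shown.

def bram_truth_table(func, n_inputs=14):
    table = []
    for addr in range(1 << n_inputs):
        bits = [(addr >> i) & 1 for i in range(n_inputs)]   # input i = address bit i
        table.append(1 if func(bits) else 0)
    return table

# Example: a 14-input all-ones detector, i.e. a wide decode that would
# otherwise end up on the slow carry chain.
tt = bram_truth_table(lambda b: all(b))
print(sum(tt), "of", len(tt), "addresses map to 1")         # expect exactly 1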

Reply to
Peter Alfke

An important "Danger Will Robinson" observation on using BRAMs:

If you violate setup/hold on the address inputs of an enabled BRAM, EVEN IF WE IS INACTIVE, BRAM contents can (will) be corrupted.

This means:
- No multicycles (unless you use EN).
- No async inputs.
- TIMING CONSTRAINTS ARE A NECESSITY!!!

IF BRAM TIMING CONSTRAINTS ARE NOT SET PROPERLY, AND MET, BRAM CONTENTS WILL BE CORRUPTED!!!

See Answer Record 21870 "Virtex-II/-II Pro/-4 block RAM - Do the setup/hold times for the Address inputs need to be met, even if the output is unused and WE is deasserted?"

Brian

Reply to
Brian Davis

Brian, I think you overdramatize this. (I was involved in finding and explaining this behavior a few weeks ago).

Anybody who writes into the BRAM must of course abide by the address set-up time requirement. Anybody who reads from the BRAM must also abide by the address set-up time requirement. The surprising, non-obvious requirement is that, if the BRAM is enabled, a violation of the address set-up time can corrupt data, even though WE remained disabled. So, do NOT change the address right before the enabled active clock edge. You would obviously not do this when you are writing, and you wouldn't do it when you are reading, but you must also not do it when you have the BRAM clock-enabled and read-enabled and you really do not care about the result of the uncontrolled read operation. The easy way out of it is to disable the clock, not just WE.

This is a highly unusual (but explainable) restriction, so unusual that neither Xilinx nor any customer found it for many years.

Peter Alfke, Xilinx Applications

Reply to
Peter Alfke

Hardly - it should be mentioned front and center in the BRAM sections of the datasheet and user guides, in bold print, with circles and arrows and a paragraph on the back explaining the problem.

Adopting the same head-in-the-sand, "it's in an Answer Record somewhere" mentality that has pervaded Xilinx's approach to documenting serious problems in recent years does not help your customers one whit.

The thread in question was about using BRAMS as logic.

Who would expect a ROM to clobber its own contents due to an address setup violation?

Brian

Reply to
Brian Davis

I have to agree with Brian. This is a big deal. I expected that violating read address setup time would screw up the read result that cycle; I was amazed to find out that violating address timing could actually change the contents of the RAM. I imagine that anyone using the BRAM as a ROM, with WE arc-welded to ground, would be doubly surprised.

I have a design in which different buses supply the read and write addresses to a large number of BRAMs. The write address is synchronized; the read address isn't, because I saw no reason to, at least until I saw Answer Record 21870 (which I saw only by accident, thanks to a tip from another designer). Even then, it took about a week and a half working with the Hotline and an FAE before I found out what the Answer Record actually meant, the original version having been more vague than the current one. So I've got a design that I have to redo.

The "easy way out of the problem" is easy only if you know there's a problem in the first place.

I heard from the Hotline that the data sheets for the affected families would be amended. If amended data sheets haven't been released already, I hope they will be soon.

I guess the thing that bothers me the most is that once the problem was identified, no one at Xilinx seemed to know that when RAMs don't work like RAMs, it's potentially a Big Damn Deal for at least some designers, and deserving of something more than to be hidden away in an Answer Record that you might or might not see.

Bob Perlman
Cambrian Design Works

Reply to
Bob Perlman

Valid point. Reminds me of an oops Philips made in their UARTs: a test mode that was entered by a READ [?!] of a certain address. So, yes, without care on the selection lines, this could go out to lunch. Took them a while to admit to it ...

Another issue here: if this IS loaded/used as a ROM, what happens during a brownout, when it is quite possible that timing MAY be violated? Sounds like there could be a lot of ?? space between the 'Let's Reconfigure' decision point and the 'Inside Specs' operating point.

-jg

Reply to
Jim Granville

Or if the BlockROM clock is sourced by a DCM which goes unlocked, thus rendering all BlockROM contents unreliable until the device has been reconfigured.

Oops.

Better not use that XST BRAM_MAP logic-into-BRAM mapping option any time soon...

Brian

Reply to
Brian Davis

I'm with Brian and Bob on this. As designers, we need to have limitations like this, as well as those with the FIFO16s, printed in bold right in the user guides so that they can be avoided by design rather than discovered in the lab. Finding it in the lab is too late in the design cycle. The question is, what other gems like this are hidden away in obscure answer records?

Reply to
Ray Andraka

.... (stuff deleted)

I have the same problem with the high clock-to-output. I found that when putting the signal through a delay (SRL16s), I can actually detect the zero condition BEFORE going in, and shift the zero signal through the same delay. In theory I can detect some bits at each shift, making it very fast. When using a RAMB I can also detect the zero condition on any port and reserve a bit for that.
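
A small behavioral sketch of that idea (Python rather than hardware; the depth of 4 is arbitrary): the zero flag is computed once at the input side and travels through a shift register of the same depth as the data, so nothing wide has to be decoded on the fast output side.

# The zero flag is precomputed at the input and delayed alongside the data.
# In hardware both pipes would be SRL16-style shift registers of equal depth.

from collections import deque

DEPTH = 4

data_pipe = deque([0] * DEPTH, maxlen=DEPTH)
zero_pipe = deque([1] * DEPTH, maxlen=DEPTH)

def clock(data_in):
    # one clock tick: return the delayed data and its matching delayed flag
    data_out, zero_out = data_pipe[0], zero_pipe[0]
    data_pipe.append(data_in)
    zero_pipe.append(1 if data_in == 0 else 0)   # cheap check done early
    return data_out, zero_out

for word in [3, 0, 7, 0, 0, 5, 1, 0]:
    print(clock(word))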

Reply to
Morten Leikvoll
