LUT6 FPGAs and Carry Logic

Hello.

Some questions about Xilinx LUT6 FPGAs (my WebPack Toolchain is a little outdated, and the newer LUT6-FPGAs don't seem to show up correctly in fpga_editor).

  • Is there really no carry-bypass option in LUT6-paths like the CYMUX(any,cin,1) living in LUT4 paths, apart from constraining the LUTs?

  • SLICEMs and SLICELs always have a carry chain, while SLICEXs have neither RAM nor a carry option?

  • Within a CLB, SLICEMs are paired with SLICEXs (if there are SLICEXs in the device)? Sounds strange to me: if a LUT is configured to be dynamic, it is very likely that additional carry logic isn't used, compared to static LUTs (with LUT4s, one rare reason for this is using the carry chain to implement a post-invert option for the RAM...). Have you ever seen a dynamic LUT6 really gain something from also using carry?

  • What about production? Does it look like Xilinx might stop selling and developing new LUT4-FPGAs in the near future? I personally don't have enough overview about these two FPGA classes, so I can't see the detailed pros and cons.

Regards

Jan Bruns

--
A few photos: http://abnuto.de/gal/

The datasheets and user manuals show everything you would need, I think... see UG384 for example. Pages 9-11 show the various slices in some detail.

I'm not sure what you mean by constraining the LUTs. There are various muxes shown in Figs. 3, 4 and 5 - can you achieve what you want with them?

Correct

It seems to me that there are "big slices", "medium slices" and "small slices" - the silicon area taken up by the carry chain may well be "free" compared to the rest of the big/medium slices.

Additionally, SLICEMs can be used for dynamic filter-coefficient storage, the arithmetic logic is also useful then.

Xilinx will have pushed an awful lot of existing and potential designs through this architecture and decided it's a win overall. Whether it's a win for your particular designs and style is immaterial to them (unless you are an *enormous* customer!)

Selling... I doubt they'll stop selling Spartan 3 (for example) for a very long time yet - Xilinx have a long history of keeping old families going for many many years after it was sensible to design them into new systems.

Developing... Spartan 3 (and its E, A and A DSP variants) was the last LUT4 generation, so yes, I think it's stopped!

My understanding of the 7 Series goal is to make as much of the user-visible logic as possible identical across the three ranges (Artix, Kintex and Virtex). There are differences in power/speed tradeoffs and the mix of memory, DSP, gigabit IO, logic etc. But the fundamental blocks are the same throughout. Unlike in the V5/S3 era, when the LUTs, DSPs, BRAMs and IOs were all different between the two families!

I'm not sure there's much to care about pros and cons. LUT6 is here, unless you want to design with relatively old chips.

Cheers, Martin

--
martin.j.thompson@trw.com 
TRW Conekt - Consultancy in Engineering, Knowledge and Technology

Martin Thompson:

Thanks. I was looking at the wrong datasheets, then.

The select-Input of the main Carry-Select Muxes is directly connected to the LUT-output, without an option to put another signal on it. If you make use of the Carry Logic, the function you put on the LUT will always become part of the Carry calculation.

Xilinx LUT4 FPGAs had an option to make the main carry-select mux always forward cin to cout, no matter what the LUT said. This was pretty useful, because it made it possible to have relatively large logic feed the carry chain without ever crossing CLB boundaries.

For example, within a SLICE, it was possible to have one LUT act as 16-bit RAM, and have it added (or whatever) to some external value on the other LUT. The RAM-LUT's output was not expected to connect directly to the carry logic, but had relatively fast routes to the arithmetic.

However, I'd expect many of the reasons to use "partially populated" carry chains to be gone with LUT6.
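To make the carry-mux behaviour discussed above concrete, here is a minimal behavioural sketch in Python. It is only an illustration, not vendor code: the function names (`muxcy`, `carry_chain_add`, `bypassed_stage`) are made up for this example, and the model assumes the usual arrangement where the LUT computes the propagate term (a XOR b), the carry mux picks between cin and one operand bit, and an XOR gate forms the sum bit.

```python
def muxcy(select, data_in, carry_in):
    """Carry mux: forward carry_in when select is 1, otherwise data_in."""
    return carry_in if select else data_in


def carry_chain_add(a, b, width=4, carry_in=0):
    """Ripple-add a and b, modelling one LUT, one carry mux and one XOR per bit."""
    total = 0
    carry = carry_in
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        propagate = a_i ^ b_i                  # LUT output, wired to the mux select
        total |= (propagate ^ carry) << i      # XOR gate forms the sum bit
        carry = muxcy(propagate, a_i, carry)   # carry mux forms the next carry
    return total, carry


def bypassed_stage(carry_in):
    """LUT4-style bypass: select forced to 1, so the LUT output is ignored."""
    return muxcy(1, 0, carry_in)


if __name__ == "__main__":
    print(carry_chain_add(9, 5))   # 4-bit sum and carry-out
    print(bypassed_stage(1))       # cin passes straight through
```

In this model the LUT6 situation corresponds to `select` always being the LUT output, while the LUT4 bypass corresponds to `bypassed_stage`, where the stage leaves the chain untouched and the LUT is free for something else.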

Hmn, sounds like that's only one theory of yours.

Hm, what about details, then? A dynloadable LUT1 calculating "external signal xor stored bit"?

Compared to what? LUT4 vs. LUT6, given the same silicon process? What would you expect the term "win" to represent, then?

I don't believe there's no market for LUT4 FPGAs using current silicon process.

Argh. So all these valuable customers have to rework all parts of their highly optimized, huge module database, just because Xilinx engineers thought it might be less work for them to ever put LUT6 in silicon?

Regards

Jan Bruns

--
A few photos: http://abnuto.de/gal/

Market: Maybe. Reasonable facts to support such an architecture: No.

The problem here is that users tend to evaluate the capabilities of an FPGA mainly as logic, while really you pay mostly for routing. Logic is a very small portion of the silicon area. Of course the vendors don't publish the numbers, but university research suggests the area of LUT and LUT configuration is only a few percent of total area.

Therefore when going from 4-LUT to 6-LUT you don't get a 4x area increase (16 entries to 64 entries) but more like a 60% increase (going from 4 inputs that must be routed to 6 inputs that must be routed, in a routing area that grows somewhat worse than linearly). This is offset by the fact that routing now gets a lot simpler.

Routing increases faster than linear with the number of wires. Therefore with bigger FPGAs the percentage of logic goes down. The optimum LUT size therefore tends to go up with technology improvements.
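The arithmetic behind the "4x versus roughly 60%" comparison can be sketched as follows. This is only an illustration: `lut_config_bits` is a name invented for this example, and the 60% figure itself is the estimate given above, not something the code derives. The code just shows the two raw ratios the argument rests on.

```python
def lut_config_bits(inputs):
    """Number of truth-table (SRAM) bits in a LUT with the given input count."""
    return 2 ** inputs


bits_ratio = lut_config_bits(6) / lut_config_bits(4)   # storage ratio
input_ratio = 6 / 4                                    # routed-input ratio

print(f"config bits:   {lut_config_bits(4)} -> {lut_config_bits(6)} ({bits_ratio:.1f}x)")
print(f"routed inputs: 4 -> 6 ({input_ratio:.1f}x)")
```

If routing dominates the area, the cost of a LUT tracks the second ratio far more closely than the first, which is why the quoted estimate sits just above 1.5x rather than anywhere near 4x.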

Research shows that the efficiency curve for FPGA technologies is relatively flat around the optimum, i.e. for a given technology there are multiple LUT sizes that give you almost the same area efficiency. Because performance tends to be better for the larger LUTs and because the software runtimes go down for larger LUTs (mapping is polynomial time, routing exponential), a typical design decision would be to choose the largest LUT size within the flat region of the curve, expecting that future implementations of the architecture would move the optimum spot in that direction.

This is exactly what FPGA vendors did: in the early 90s the sweet spot was consistently shown to be between 3-LUTs and 4-LUTs, so most vendors chose 4-LUTs.

Newer research shows the flat region to go from 4-LUTs to 6-LUTs. While 4-LUTs would probably still be a good choice, it is clear that there must be a switch to 6-LUTs at some time, and one might just as well do the switch now, getting much better EDA software run times.

Kolja Sulimma cronologic.de


Yes, I agree. No doubt there will be *some* designs which don't work out so well in the newer architectures.

Well, yes, it is - you'll have to wait for someone from Xilinx for anything better than that :)

Well, I only offer it as a possibility (haven't done an actual comparison), but distributed arithmetic FIR filters were what I was thinking of.

Don't ask me - I'm not making the decisions. Ultimately, Xilinx presumably decided it was a "win" in business terms: "We'll make the most money doing it this way."

No-one is saying there is not a market. Just that it's not big enough for Xilinx to be targeting it.

That's progress :)

This is how bare-metal-assembly-language programmers felt as processors developed and their highly-tuned routines needed to be rewritten. Of course, the processors were faster and compilers were better, so the smart ones just wrote straightforward, portable C-code which turned out to be good-enough most of the time. And that code was much more re-usable.

I'm sure it wasn't done on a whim! There are sound business reasons for how it's been done. Sounds like they just don't fit what you'd like :(

Cheers, Martin

--
martin.j.thompson@trw.com 
TRW Conekt - Consultancy in Engineering, Knowledge and Technology

(snip)

Well, they do have some competition. If they don't design and build what works for their customers, they will lose out.

As I understand it, 6LUT is better for larger chips.

For smaller ones, it likely doesn't make so much difference. For synthesis software there is some advantage in keeping the number of different architectures to a minimum.

Still, 4LUT chips should be around for a while.

-- glen


Kolja Sulimma:

That's what I expected. This becomes pretty obvious if you imagine a LUT2 FPGA, where everyone should intuitively understand that the entire silicon would be filled up with routing resources. And LUT4 can't be far off.

So let's compare Spartans:

Spartan6 LUT6: about 7 ins, about 3 outs = 10 ports
Spartan3 Slice: about 10 ins, about 6 outs = 16 ports

The port count for the Spartan3 Slice doesn't include the FXMUX path, but does include the full XB/YB (I doubt this path has/needs full routing capabilities, anyway).

So from what you said about area with taking routing resources into account, the Spartan3 Slice might very well consume a little more area, although it has only about half the SRAM bits.

What do we get for that?

For SLICEL, I think of:

2*any 4 inp-func: LUT4: yes, LUT6: no
2*any 4 inp-func, paired invert: LUT4: yes, LUT6: no
any 5-inp func: both
any 6-inp func: LUT4: no, LUT6: yes
MUX4: both
half/partial populated Carry: LUT4: yes, LUT6: no
2 Bit full Adder: both
2 Bits of long Adder: LUT4: yes, LUT6: one, but 2?
2 Bits of long MulAdder: LUT4: yes, LUT6: one, but 2?
1 Bit ALU (fast Carry): maybe both
--with dual Ext-feedin: LUT4: yes (paired with DPram), LUT6: no
Large Chain Logic: LUT4: 8Bit/Slice, LUT6: 6Bit/LUT
DblLUTed Chain Logic: LUT4: no, BX only, LUT6: yes

For SLICEM, I also think of:

64x1 RAM: LUT4: no, LUT6: yes
32x2 RAM: LUT4: no, LUT6: yes
32x1 RAM: LUT4: yes, LUT6: yes
16x2 RAM: LUT4: yes, LUT6: yes
16x1 RAM+Adder: LUT4: yes, LUT6: no

Well, for the SLICEM part, the LUT6 might be a better choice, but for SLICEL I'd still prefer the LUT4, even given 50% area overhead, although I do miss a few more static MUXes and FF paths (independent clock inverters, or something).

Regards

Jan Bruns

--
A few photos: http://abnuto.de/gal/

I believe that is what it comes down to. Given the fact that routing is a huge percentage of the chip area (and so cost) this becomes a more important factor as the chips get larger. After all, routing does go up at a faster rate than linear. So minimizing routing is more important in larger chips. The tradeoff provides for lower costs with LUT6 in larger devices.

The other side of the coin is more "wasted" logic when larger LUTs are underutilized. So it would seem that we have reached the point where the LUT6 is optimal for many if not the vast majority of designs.

I don't know that there is a performance penalty in using LUT6. I would expect it to be minimal, since the muxes in the LUTs are done with transmission gates with very little delay, but I don't really know. If so, the only issue then becomes cost. So if your design is one of the minority designs that can indeed be done more efficiently in a LUT4 architecture, then you will pay a bit more for a LUT6 based part... but given the advantages of smaller feature size you will likely get lower costs with the newer parts than sticking with an old generation.

As to design reworks required to optimize a design for a newer part, I expect that would be done for speed and/or cost. My experience is that Xilinx is more than willing to help you with that, especially if it means a design win over a competitor. But would anyone really expect much lost ground from a LUT4 design to a current LUT6 design? Software changes can greatly impact results, but I can't see needing to touch a design from a Spartan 3 to get it to run well in a newer device given the large improvements in the hardware from using a much smaller process. I suppose if you have used hard constraints you may have to remove them. But you knew the risk when you used those features, no?

Rick


(snip, I wrote)

(snip)

One that I am interested in, though, is that 6LUT should be much better for building the MUXes needed for barrel shifters. A 4LUT makes a two-input MUX, but a 6LUT can make a four-input MUX (with two select lines). Other than that, I haven't thought much about how useful different sizes are. The less logic between FFs, the less advantage to larger ones.
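As a rough illustration of the MUX point, the sketch below packs a 4:1 mux into a 64-entry truth table and a 2:1 mux into a 16-entry one. The helper names (`lut_init`, `mux4`, `mux2`) are invented for this example, and the pin assignment (data on the low inputs, select on the high ones, truth-table bit n holding the output for input pattern n) is an assumption made for the sketch.

```python
def lut_init(func, inputs):
    """Pack func's truth table into an integer, with bit n holding func(n)."""
    init = 0
    for address in range(2 ** inputs):
        init |= (func(address) & 1) << address
    return init


def mux4(address):
    """4:1 mux in a 6-input LUT: data on inputs 0..3, select on inputs 4..5."""
    select = (address >> 4) & 0b11
    return (address >> select) & 1


def mux2(address):
    """2:1 mux in a 4-input LUT: data on inputs 0..1, select on input 2."""
    select = (address >> 2) & 1
    return (address >> select) & 1


if __name__ == "__main__":
    print(f"LUT6 as 4:1 mux: {lut_init(mux4, 6):016X}")
    print(f"LUT4 as 2:1 mux: {lut_init(mux2, 4):04X}")
```

The 4:1 mux uses all six inputs of the larger LUT, whereas the plain 2:1 mux leaves one input of the 4-input LUT unused, so each barrel-shifter stage built from 6LUTs can select among twice as many sources per LUT.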

Well, they have to be designed not to glitch when switching between entries with the same output value. That doesn't naturally happen with an SRAM. Also, with transmission gates you can't go through too many without a buffer, but presumably that is part of optimizing the cell.

-- glen


No. It will consume a lot more area if you include routing. Routing grows faster than linearly (look up "Rent exponent"). Of course it can cover more flexible circuit areas, because you can choose many more combinations of input signals with two 4-LUTs than with one 6-LUT (except if you have high-fanin random logic). But the area is much larger.
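For readers who want to play with the "faster than linear" claim, here is a small sketch based on Rent's rule, T = t * g^p, where g is the number of blocks in a region, t the average terminals per block and p the Rent exponent. The function names and the example parameter values are made up for illustration. The wirelength scaling used in `relative_total_wirelength` is a commonly quoted estimate for two-dimensional placements with p above 0.5, and should be treated as an assumption here rather than a measured result.

```python
def rent_terminals(blocks, terminals_per_block, rent_exponent):
    """Rent's rule estimate of external terminals for a region of `blocks` blocks."""
    return terminals_per_block * blocks ** rent_exponent


def relative_total_wirelength(scale, rent_exponent):
    """Assumed scaling of total wirelength when the block count grows by `scale`.

    Uses the estimate that average wire length grows like g**(p - 0.5) for
    p > 0.5, so total wirelength grows like g**(p + 0.5).
    """
    return scale ** (rent_exponent + 0.5)


if __name__ == "__main__":
    p = 0.75  # hypothetical Rent exponent, chosen only for illustration
    for scale in (2, 4, 8):
        growth = relative_total_wirelength(scale, p)
        print(f"{scale}x the blocks -> about {growth:.2f}x the wiring")
```

With any exponent above 0.5 the wiring estimate grows faster than the block count, which is the effect being described.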

The point is: It does not matter if a LUT-6 on average has lower utilization, as LUT area is virtually free. What matters is routing utilization.

There is research that clearly shows that, from an efficiency standpoint, the best FPGAs are those that can't achieve 100% LUT utilization because they have sparse routing.

The reasons why vendors choose to provide lots of routing anyway are:

a) Customers don't understand this and tend to start whining when they don't get 100% LUT utilization instead of being happy that they get better wire utilization. (Remember: wires are the expensive part.)

b) It gets hard to predict what can be implemented and what can't.

c) Software gets harder to write and slower with worse routing resources.

So you pay a premium to be able to reliably plan your design and to simplify marketing.

Back to LUT size: Have a look at figure 3.3 in this:

formatting link

Area is virtually constant in that analysis for LUT sizes from 4 to 6. But with LUT size 6 you get much better software runtimes.

Kolja


Kolja Sulimma:

Take some area A of silicon and put n_1 blocks of type T_1 into it. Take another area A of silicon and put n_2 blocks of a similar type T_2 into it.

If n_1*portcount(T_1) = n_2*portcount(T_2) then portcount(A) won't depend on what blocktype was implemented, and I don't see any reason why one or the other should consume more routing overhead.
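Plugging the rough port counts from earlier in the thread (about 10 ports per Spartan6 LUT6, about 16 per Spartan3 slice) into this equality gives a quick feel for it. The constant and function names below are made up for this sketch, and the block counts 8 and 5 are simply the smallest pair that balances.

```python
SPARTAN6_LUT6_PORTS = 7 + 3     # about 7 inputs, about 3 outputs
SPARTAN3_SLICE_PORTS = 10 + 6   # about 10 inputs, about 6 outputs


def total_ports(block_count, ports_per_block):
    """portcount(A) for an area filled with block_count identical blocks."""
    return block_count * ports_per_block


if __name__ == "__main__":
    print(total_ports(8, SPARTAN6_LUT6_PORTS))    # eight LUT6 blocks
    print(total_ports(5, SPARTAN3_SLICE_PORTS))   # five Spartan3 slices
```

Under these estimates, eight LUT6 blocks expose the same number of ports to the routing as five Spartan3 slices.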

If the utilization of a given LUT goes low, the routing will on average become less "localized", so that wires become longer.

Thanks for sharing that link.

However, my understanding from that presentation is that LUT4..6 give the same overall area utilization, LUT>6 would give the shortest delays, and LUT4..6 all give the same best area*delay product.

Overall area (including routing) doesn't significantly change from LUT4 to LUT6, and even the delay was similar from LUT4 to LUT6.

But these results don't reflect the fact that the Xilinx LUT4 design is an enormously good fit for many practically relevant problems (for example, adders and bus muxes are very frequently used). Even the software-generated technology mapping makes heavy use of these additional LUT4 features, which are almost free compared to the theoretical, simple LUT4 design.

The technology mapping might become easier for synthesis software, if the CLB design comes nearer to the bare LUT (with LUT6, the Carry seems to become the only additional specialized circuit), but the Xilinx software is already able to make good use of their LUT4 specials, it's only that it doesn't always notice the ideal, obvious solution.

Regards

Jan Bruns

--
A few photos: http://abnuto.de/gal/

(snip)

I have done place and route on pipelined arrays with different numbers of cells per chip, and found that speed goes fairly close to inversely proportional to the number of cells, over a fairly wide range.

-- glen


glen herrmannsfeldt:

Some pipeline control signals crossing the data-path and getting slower with wider fanouts?

Regards

Jan Bruns

--
A few photos: http://abnuto.de/gal/

(snip, I wrote)

It is a linear array of fairly simple cells. I believe it is that the routes get longer and slower as things get more tightly packed together.

-- glen


Just some days ago, I had a similar problem. There was a horizontal data flow, with the parallel data lines vertically aligned.

The bottleneck was one CLB column using a couple of "control signals" sourced elsewhere. The timing scaled down heavily with bus size, and timing analysis showed a couple of ns of routing delay just for the control.

Luckily, the critical CLB row had some unused regs, so I used them to replicate the most critical controls.

At first, this didn't work out as expected. It even got worse than without the replication. This was caused by the way I've arranged the replicates, with more vertical direct lines than available. So the router came up with solutions like routing a critical, local CLB signal once around that CLB (a lot of hops through a handful of neighbor switch matrices).

A simple rearrangement of the replicate usage, however, fully solved that further problem (by halving the direct neighbor route consumption).

Although there are now some more signals on the switches (remember the original signals still need to go to the replicate regs), all the replicates now have direct neighbor connects (or better) to the LUTs. So timing doesn't scale anymore with bus width.

Regards

Jan Bruns

--
A few photos: http://abnuto.de/gal/

Yes, the 4LUT can be finagled by using the fourth input as an enable, which is in essence the AND gate of the next mux stage; then you can use all four inputs of a LUT as the OR gate to combine 8 inputs in two levels. So the 4LUT is more like 1.5 two-input muxes.
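The two-level arrangement described above can be checked with a small behavioural model. The function names are invented for this sketch, and it assumes the four enable signals arrive already decoded (one-hot) from the upper select bits, which is logic the description above does not cover.

```python
def lut4_mux_with_enable(a, b, select, enable):
    """First level: a 2:1 mux gated by an enable, using all four LUT inputs."""
    return (b if select else a) & enable


def lut4_or(w, x, y, z):
    """Second level: a 4-input OR in a single LUT."""
    return w | x | y | z


def mux8_two_levels(data, select):
    """8:1 mux from four gated 2:1 muxes plus one OR; data is a list of 8 bits."""
    low = select & 1
    pair = select >> 1
    # Assumption: the one-hot enables are decoded elsewhere from the upper select bits.
    enables = [1 if pair == i else 0 for i in range(4)]
    stage1 = [
        lut4_mux_with_enable(data[2 * i], data[2 * i + 1], low, enables[i])
        for i in range(4)
    ]
    return lut4_or(*stage1)


if __name__ == "__main__":
    bits = [0, 1, 1, 0, 1, 0, 0, 1]
    print([mux8_two_levels(bits, s) for s in range(8)])
```

Only the enabled first-level LUT can drive a 1 into the OR stage, so the output follows the selected input.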

The glitching is from logic race conditions. Using transmission gates pretty much eliminates that as long as you use break before make connections. Then the capacitance of the line retains the last value until the new value comes up.

Rick

