2048 input or gate ?

- M
- mk
  
  Contact options for registered users
posted
17 years ago

Mon, Jul 17, 2006 1:17 AM

Hi everyone, I am trying to 'or' a 2K vector in Virtex4. Looking at the problem as a first approximation, it would need 6 levels of 4 input lookup tables. So far I have tried XST but it seems to be using the initial

512 LUT4s and then 56 levels of MUXCY. Any ideas why it's using the MUXCYs? They seem to be quite fast at 45ns each but number of levels is quite high. I'm curious what the timing would look like if I could force it to use only LUT4s but I really don't want to code it by hand and I am too lazy to write a perl script to do it either. Any suggestions ?

Thanks.

PS Here is what I am using as a test module. I am trying to map it to a virtex4-10.

module orlt(clk, in, out); input clk; input [2047:0] in; output out;

reg [2047:0] inr; reg out; wire outw;

orl u0(inr, outw);

always @(posedge clk) begin out

- R
- rickman
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Jul 17, 2006 3:45 AM

I was thinking, what is hard about this??? Then I looked at your code and realized that you are not using VHDL. In VHDL you can use a GENERATE statement to lay out the functional elements exactly how you want them with looping to define it all without the tedium. Isn't there a similar construct in Verilog?

- J
- John Adair
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Jul 17, 2006 6:55 AM

I am rusty on Verilog so can't remember if you have a generate statement available but another way to cut work is to have a layered component such that the bottom level has say four 4ip OR gates in it. The layer above has 4 of the supper gate and so on. If you start with at the bottom with a or gate instantiation and do the same all the way up with component instatiations the synthesiser won't be able to do much to insert other gates.

The MUXCY is probably being used as the carry chain is a fast route compared to general routing and can be used to make a wide OR function with 2 or more LUTs. To a degree this may be the fastest way to get you OR but probaby tempered with some imposed structure. As a guess the synthesiser is currently generating a number of 220-228 i/p OR gates then putting the output together in another OR function.

John Adair Enterpo> Hi everyone,

- S
- Symon
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Jul 17, 2006 9:01 AM

Your synthesiser is using the MUXCYs because it uses less resource (about

75% of the tree method) and is faster. If the MUXCY propagation delay was 45ns, I'd be worried, but it's really only 45ps! :-) If you build a tree, it'll be slower. It's not just the LUT delay, it's all that routing you need for a wide OR gate. To show it, you could try synthesising a 2k XOR gate. Your synthesiser might struggle to implement that with a carry structure. HTH, Syms.

- R
- rickman
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Jul 17, 2006 2:27 PM

Is that 45 ps per LUT of the carry or 45 ps per CLB in the carry chain? If I use the 56 elements that the OP said, I get 2.52 ns total carry delay. That is pretty remarkable if it is correct.

Increasing that to 45 ps per each of the 512 LUTs the carry delay is still only 23.04 ns. A combination approach combining say 16 LUTs with the carry then using an 8 input OR gate should be a bit faster. 16 carries is about the same speed as a LUT. I have not looked at the Virtex 4 architecture so I don't know for sure if this is needed or if the carry delay is 45 ps per CLB.

- S
- Symon
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Jul 17, 2006 2:50 PM

Hi Rick, Yes, that's 45ps per LUT. I believe the carry is actually implemented as a two bit look ahead, so that each CLB is a two bit carry with delay of 90ps. But, now you mention it, I don't understand the 56 levels thing.

Thinking about it a bit harder, and after reading your post, I reckon the synthesiser must be doing what you suggest, dividing the chain up into sections, and oring together the output. Cheers, Syms.

- J
- John_H
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Jul 17, 2006 4:39 PM

More specifically the synthesizer is probably splitting into two levels of carry chains. Rather than 512 LUTs feeding a carry chain that's 128 rows high (there are 2 carry chain paths in a CLB, 4 LUTs per carry chain) using

2 levels of carry chains with the first at 5 MUXCY stages (32 inputs) and the second at 6 MUXCY stages (64 inputs, specifying 64 initial carry chains) the delay ends up being shorter still. The Tbyp value, by the way, is about 103 ps in the Spartan3E (-5 speed grade) and corresponds to 2 LUTs worth of carry chain since the bypass is on a slice-by-slice basis. ***** Dadgummit. The 8.2.01i speedprint numbers for Tbyp doesn't match my Timing Analyzer numbers (which did seem to correspond in speedprint 8.1.03i). I've submitted a case to Xilinx on this. *****

In the Spartan3E -5 speed grade, for instance, using timing numbers from my

8.2.01i Timing Analyzer (a mixbag of SliceM and SliceL values so the actual numbers will vary) the 6-level OR would end up

Tcko+5*(Tnet+Tilo)+Tnet+Tfck = 0.567+6*Tnet+5*0.660+0.776 = 4.643+6*Tnet an average Tnet of 1ns (routing to logic of 56% to 44% which is much better than what I'd expect for a wide distribution of inputs) gives = 10.643 ns

While a single carry chain across 128 CLB rows would be Tcko+Tnet+Topcyf+255*Tbyp+Tcinck = 0.567+Tnet+1.011+255*(0.103)+0.518 = 28.561+Tnet or probably under = 29.561 ns

Which is much worse than the tree or for 2 levels of carry chains which would be

Tcko+Tnet+Topcyf+2*Tbyp+Tnet+Topcyf+2*Tbyp+Tcinck = 0.567+Tnet+1.011+2*0.103+Tnet+1.011+2*0.103+0.518 = 3.519 + 2*Tnet or around = 5.519 ns

Two levels of carry chains use significantly fewer resources than an OR tree while the delay is about half what the tree would need.

The key to the number of carry chains the tool generates for the longest delay would be the number of Topcyf (or Topcyg) values in the path as reported by Timing Analyzer.

Ain't optimization fun?

- B
- Ben Jones
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Jul 17, 2006 4:54 PM

90ps.

If you think about it just a tiny bit harder, the structure of the optimal circuit comes down to an assessment of the relative performance of the LUT delay + routing, and the carry chain delay. Intuitively, the best circuit will have minimal disparity between the fastest and slowest path. Say for the sake of argument that four stages of carry-OR takes as long as one LUT-OR. Then an extremely coarse rendition of the fastest circuit to do a big OR will look a bit like this (L = LUT, ^ = carry-mux OR, inputs [not shown] on left):

L-L-L-^ (top (result)) L-L-L-^ L-L-L-^ L-L-L-^ L-L-^ L-L-^ L-L-^ L-L-^ L-^ L-^ L-^ L-^ (bottom)

The further up the carry chain you get, the more the inputs to the carry-mux elements are just "waiting around" for the carry propagation. Eventually it reaches the point where you can squeeze in an extra level of LUTs in these higher stages, and thus reduce the total size of the carry chain. Go further up, and you can afford two extra levels, and so on. I'd hope that at least some tools are clever enough to exploit this.

(Note: in reality, the ratio of LUT:CY speed in this context is somewhere in the 12:1 to 16:1 ballpark for most Xilinx architectures.)

Hope this makes sense... perhaps someone can take it a step further and work out where the 56 levels thing really comes from (and thus deduce what this particular synthesis tool believes the LUT:CY speed ratio is!).

Cheers,

-Ben-

- S
- Symon
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Jul 17, 2006 5:55 PM

Hi Ben, Thanks for that, it made sense to me. I think we might need to know what part the design was in because the carry chain length is limited by the number of rows in the FPGA. Smaller parts have smaller maximum length chains. Also, as a BTW, I see from the datasheet that the ORCY structure that was in V2PRO has been dropped from the V4. That made wide gates even faster. Cheers, Syms.

- J
- John_H
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Jul 17, 2006 6:04 PM

I thought through this too quickly. The first stage in the example I was drawing out could do 64-wide ORs with the first carry chain which is 8 slices or 7*Tbyp, not 2*Tbyp. The second stage would be from 32 carry chains for 4 slices of MUXCY-based OR for 3*Tbyp, not 2*Tbyp so the timing would be more like 6.137 ns, still significantly better than the LUT tree.

I missed the 56 elements mentioned initially; this is probably just poor partitioning, relying instead on a "maximum carry width" value.

I'd manually partition the OR into two sets based on the 2 levels of carries. The generate can be used to shorthand the 32 intermediate values. The KEEP attribute may be what's needed in XST - I use the syn_keep=1 in the synplicity synthesizer. This synthesized okay but I didn't put a wrapper around it to get into a physiacl part (2k I/O is too much for me).

module orlt(clk, in, out); input clk; input [2047:0] in; output out;

reg [2047:0] inr; reg out; wire outw;

orl u0(inr, outw);

always @(posedge clk) begin out

- M
- mk
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Jul 17, 2006 7:54 PM

On Mon, 17 Jul 2006 18:04:33 GMT, "John_H" wrote: ...

...

Thanks John and everyone else, So far I tried all three options. It turns out a LUT4 tree is slightly faster at 6.26ns than what XST comes up with (6.613ns) where as the number of LUT4s go from 515 to 811. John's two level LUT4+muxcy on the other hand has a delay of 4.94ns at 648 LUT4s. In terms of generating the LUT4 tree by hand, I used 5 different generate statements with keeps on the outputs which convinces XST to give me what I wanted. By the way 32x64 vs 64x32 partition does not make a difference but 64x32 is very slightly larger.

- J
- John_H
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Jul 17, 2006 8:19 PM

I would have thought the result would be 512+16+1 LUTs -- 2048/4 LUTs feeding 64 carry chains, 64/4 LUTs feeding the final carry chain, and 1 to register the carry at the top of the chain -- for 539 total, not 648. _____

For the OR tree, rather than 5 generates you could be creative with one big wire and do one generate loop:

(*KEEP*) wire [681:0] ORs; // 512+128+32+8+2 intermediate OR results wire [2729:0] XtraWideOR = {ORs,inr}; generate genvar i; for( i=0; i

- R
- rickman
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Jul 18, 2006 1:39 AM

It makes some sense, but I think the fastest approach is to use combinations in layers. With the large discrepancy in speed, it makes sense to use a lot of input in a single carry chain. But the carry chain is by nature serial and the delays keep adding. The delays with LUTs can be done in parallel using a pyramid structure. I think the fastest approach would be to combine the two in a pyramid of LUTs and carry chains like they have been discussing above.

One way to compare the two is to consider building up from a single LUT. With four inputs to a LUT, you can combine four single LUTs to another LUT for 16 inputs with two LUT delays. Using the 12:1 approximation for the ratio of delays in the two, you can combine 12 LUTs for 48 inputs and get two LUT delays using the carry chain. Clearly this is faster on the first level.

So let's consider increasing the size by a factor of four by combining four, 48-input carry chains using a LUT. This will give 192 inputs with three LUT delays. To do that in a longer carry chain, you would get 5 LUT delays. So clearly it is better at some point to break the carry chains and combine their outputs with a LUT. I am assuming that you can not run one carry chain into another without using a LUT. In fact, I seem to recall that to get an output from a carry chain you need to use a LUT. So maybe my delay calculations are flawed since the

12:1 ration does not account for the required end of chain LUT. In fact, I am pretty sure that makes the carry chain slower than a pyramid of LUTs.

Anyone know the details of connecting the output of the carry chain in V4 parts?

- J
- John_H
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Jul 18, 2006 4:39 AM

In the Spartan3E parts (things may vary for the Virtex4) the exit from a MUXCY (as opposed to an XORCY used in an adder) can be directly onto routing. The timing report will go straight from a Tbyp delay to the net.

For 4^n inputs ORed in a LUT tree (or pyramid) the additive delay will be n*Tilo+(n-1)*Tnet. For one carry chain, the additive delay will be Topcyf+((4^n)/8-1)*Tbyp. For my Spartan3E, -5 speed grade, the carry chain is better at n=2-4; I'm surprised the numbers work for n=2 but that comes from a history where carry chains were tough to get on and off (here we only have to sweat getting on the carry chain, off is free).

If the problem were a 256-wide OR or a 16k-wide OR, an extra level of LUTs would work out well with one or two carry chain stages, respectively. The numbers may skew slightly for other families but the timing reports will show what the makeup is for the various paths - including routing - to make a better determination of what's "best."

- B
- Ben Jones
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Jul 18, 2006 9:05 AM

The quickest way off the carry chain, believe it or not, is to use the XORCY element (with a 0 input from the fabric, or a 1 if you want inversion). This is faster than going up to the next LUT and using the dedicated output. This still takes a few hundred picoseconds though (check the timing files for the exact number).

In Virtex 5, the carry chain structure has got *much* faster relative to the rest of the fabric, both in terms of propagation delay and time to get on/off the chain... :-)

Cheers,

-Ben-

- J
- Jim Granville
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Jul 18, 2006 11:02 AM

Hi Ben, Great, so does that mean Webpack will (soon?) allow designers to actually use the Virtex 5 family ?

-jg

- S
- Symon
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Jul 18, 2006 12:24 PM

Hi Ben, C'mon, dish the numbers! :-) Is it looking ahead further, or has the chain itself got faster? Cheers, Syms.

- B
- Ben Jones
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Jul 18, 2006 2:45 PM

to

So, each slice now has four LUTs and four FFs. Consequently, there are four MUXCY/XORCY pairs in each slice's subchain. The Tbyp (i.e. cin-to-cout delay) for the slice is on the order of 80ps. So that works out to around

20ps per LUT, or twice as fast as in Virtex-4.

I am not a sub-micron designer, so I don't really know how they achieved it :) but I'll bet it's a combination of the two factors you mentioned.

Cheers,

-Ben-

- B
- Ben Jones
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Jul 18, 2006 2:49 PM

relative

to

I, like you, am surprised that it doesn't (even in the 8.2i release). Doubtless there is a justification somewhere in terms of reduced burden on the technical support hotline or some-such... bah!

Then again, the full ISE suite isn't all that expensive... :)

Cheers,

-Ben-

- S
- Symon
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Jul 18, 2006 4:35 PM

Just as well, you wouldn't be able to reach your keyboard! :-)

Yep, sounds like it. Thanks for that heads-up. Cheers, Syms.