8:1 MUX implementation in XILINX and ALTERA

- prav
  

Mon, Apr 10, 2006 9:15 AM

Hi all,

I wanted to know how many CLBs an 8:1 mux implementation takes in an Altera and a Xilinx device. I would also like details of the internal implementation (such as how many LUTs are used).

One more question: if the depth of the multiplexer increases, can the LUTs be shared?

Regards, Prav

- Ben Jones
  

Mon, Apr 10, 2006 12:08 PM

Hi Prav,

This depends very much on exactly what family you are talking about. Different device families have different capabilities, and also different numbers of slices per CLB. You should look at the datasheets for details.

In all Xilinx FPGAs since Spartan-II & Virtex, there are dedicated multiplexor resources that make 4:1 and larger muxes quite efficient. A 1-bit 8:1 mux will require 4 LUTs to implement the four first-stage 2:1 muxes. Then there are two second-stage 2:1 muxes, which can use an "F5" mux, and a single final-stage 2:1 mux which can be an "F6".

So the total is 2 slices (the Fx muxes are "free" within the slice). In older parts that equates to 1 CLB; in anything newer than Virtex-II, that is 0.5 CLB. In any case, this will be a very fast function since the routing to and from the Fx multiplexors is dedicated.
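The three-stage structure described above can be modelled in a few lines of Python. This is only a behavioural sketch: the function names are hypothetical, and the assignment of s0 to the LUT stage, s1 to the F5 stage and s2 to the F6 stage is an assumption made for illustration, not something stated in the post.

```python
from itertools import product

def mux2(a, b, sel):
    """2:1 mux: returns a when sel is 0, b when sel is 1."""
    return b if sel else a

def mux8_fx_tree(d, s0, s1, s2):
    """8:1 mux built from 4 LUT-level 2:1 muxes, 2 F5 muxes and 1 F6 mux."""
    # Stage 1: four LUTs, each acting as a 2:1 mux on s0 (assumed assignment).
    lut = [mux2(d[2 * i], d[2 * i + 1], s0) for i in range(4)]
    # Stage 2: two F5 muxes combine pairs of LUT outputs on s1.
    f5 = [mux2(lut[0], lut[1], s1), mux2(lut[2], lut[3], s1)]
    # Stage 3: one F6 mux combines the two F5 outputs on s2.
    return mux2(f5[0], f5[1], s2)

def check_fx_tree():
    """Exhaustively compare the tree with direct indexing d[s0 + 2*s1 + 4*s2]."""
    for bits in product((0, 1), repeat=11):
        d, (s0, s1, s2) = bits[:8], bits[8:]
        if mux8_fx_tree(d, s0, s1, s2) != d[s0 + 2 * s1 + 4 * s2]:
            return False
    return True
```

Calling `check_fx_tree()` walks all 2048 input combinations, which is enough to confirm that the tree is a complete 8:1 mux.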

As for whether the LUTs can be shared as the depth of the multiplexer increases: I'm not sure what you mean by that. I *think* the answer is no.

Cheers,

-Ben-

- Kolja Sulimma
  

Mon, Apr 10, 2006 1:40 PM

Ben Jones wrote:

Maybe he means the width. There are indeed some optimizations that you can do for wider muxes. If you have an N-bit wide M-to-1 mux, there is a certain N for each M at which it becomes beneficial to use a first stage that outputs either 0 or the input value, depending on the select value, and then reduce these outputs with an OR tree or the carry chain. Muxes reduce at a rate of 2-to-1 per LUT while OR reduces at a rate of 4-to-1.

Also note that a multiplier or a BRAM can implement a MUX.
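The gate-then-OR scheme for wide muxes can be sketched behaviourally as follows. The function names and the use of Python integers as N-bit words are assumptions made for illustration; the sketch shows only that the two forms are functionally equivalent and makes no claim about LUT counts on any particular device.

```python
def mux_direct(inputs, sel):
    """Reference M-to-1 mux: pick one word by index."""
    return inputs[sel]

def mux_and_or(inputs, sel, width):
    """M-to-1 mux as a gating stage followed by an OR reduction."""
    mask = (1 << width) - 1
    # One-hot decode of the select value; in hardware this decode is
    # shared by all N bits of the word.
    enables = [sel == i for i in range(len(inputs))]
    # First stage: each word is either passed through or forced to 0.
    gated = [word & (mask if en else 0) for word, en in zip(inputs, enables)]
    # Second stage: OR-reduce the gated words (OR tree or carry chain).
    out = 0
    for g in gated:
        out |= g
    return out

def check_and_or(width=4, m=8):
    """Compare both forms for every select value on a fixed test pattern."""
    mask = (1 << width) - 1
    inputs = [(i * 37 + 5) & mask for i in range(m)]
    return all(mux_and_or(inputs, s, width) == mux_direct(inputs, s)
               for s in range(m))
```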

Kolja Sulimma

- Mike Hutton
  

Sat, Apr 15, 2006 12:51 AM

I don't think I've posted anything in years, but I just couldn't resist adding to this one because I played with it for some time.

As the previous posters said -- it depends on the device. But I'd also add (in some detail) that it depends on context even within one device.

To answer the question most directly, an 8:1 mux requires two slices in Virtex IV or two Adaptive Logic Modules (ALMs) in Stratix II. But whether you actually get that in a full system depends on the structure of your design.

The Virtex IV version is easy to see because it's just the output of the F6 mux provided as dedicated hardware. Spartan III is a cost-reduced Virtex IV, so it should behave identically.

In Stratix II we can do it without the need for dedicated hardware but it's a bit trickier to synthesize:

For Z = mux(d0,d1,d2,d3,d4,d5,d6,d7; s0,s1,s2) synthesis will give you:

    y0 = mux(d0,d1,d2; s0,s1)
    y1 = mux(d4,d5,d6; s0,s1)

which are two 5-input functions that pack into a single ALM.

In the second ALM

    z0 = (s0 & s1 & d3) # !(s0 & s1) & y0
    z1 = (s0 & s1 & d7) # !(s0 & s1) & y1
    Z = mux(z0,z1; s2)

will be generated using 7-LUT mode (here # is OR and ! is NOT).

I attached Verilog at the end if you want to run it through Quartus; you can look at the result in the equation file and see what I just described. Note that, depending on what else is in the design, the 5-LUTs might get packed or synthesized differently. For example, Quartus may prefer to pack the two 5-LUTs with two unrelated 2- or 3-LUTs to make two 7-input ALMs rather than one 8-input ALM and a second 6-input ALM, or it may synthesize differently at the cost of area to hit a delay constraint.
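The two-ALM decomposition for Stratix II can be checked exhaustively with a short Python model. This is a sketch under assumptions: the select encoding (s0 as the least significant bit, so that s0 = s1 = 1 picks d3 or d7) and the choice of 0 for the unused case of the 3-input muxes are mine, and the function names are hypothetical.

```python
from itertools import product

def mux8(d, s0, s1, s2):
    """Reference 8:1 mux with s0 as the least significant select bit."""
    return d[s0 + 2 * s1 + 4 * s2]

def mux3(a, b, c, s0, s1):
    """3-input mux; the s0 = s1 = 1 case is a don't-care (0 chosen here)."""
    return (a, b, c, 0)[s0 + 2 * s1]

def mux8_two_alm(d, s0, s1, s2):
    # First ALM: two 5-input functions that share s0 and s1.
    y0 = mux3(d[0], d[1], d[2], s0, s1)
    y1 = mux3(d[4], d[5], d[6], s0, s1)
    # Second ALM: one 7-input function of s0, s1, s2, d3, d7, y0, y1.
    both = s0 & s1
    z0 = (both & d[3]) | ((1 - both) & y0)
    z1 = (both & d[7]) | ((1 - both) & y1)
    return z1 if s2 else z0

def check_two_alm():
    """Compare the decomposition with the reference on all 2048 inputs."""
    for bits in product((0, 1), repeat=11):
        d, (s0, s1, s2) = bits[:8], bits[8:]
        if mux8_two_alm(d, s0, s1, s2) != mux8(d, s0, s1, s2):
            return False
    return True
```

Because d3 and d7 override y0 and y1 whenever s0 = s1 = 1, the value the 3-input muxes produce in that case never reaches the output, which is what lets each of them fit in five inputs.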

On older devices (Altera Stratix, Cyclone; Xilinx Spartan I, 4000) and on MAX II and Cyclone II, you can basically use "4-LUT" in the discussion below, though it will depend on other issues in practice. I haven't thought about PTERM devices like MAX 7000.

But this brings me to the bigger discussion. I would stress that in practice it makes a big difference what the surrounding context is, and also if you have more than one mux in your design, because in a mux system like a barrel shifter or crossbar the amortized cost of k muxes in Stratix II is less than k times the cost of one (which is a benefit over Virtex IV).

In a generic 4-LUT architecture with no dedicated hardware, a simple 2:1 mux is a 3-input function and takes one LUT (with one input going unused). A 4:1 mux would take two LUTs (not three -- exercise to the reader; it's easier than the 8:1 above). An 8:1 mux requires five vanilla 4-LUTs because it's two 4:1 muxes and one 2:1. But it's arguably something like 4.5 LUTs (see two paragraphs down).
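One possible answer to the two-LUT exercise is sketched below in Python; it is not necessarily the decomposition the poster had in mind. The trick assumed here is that the first LUT forwards s0 when s1 = 0, so the second LUT never needs s0 as a separate input. Each `lut_*` function takes exactly four inputs, mirroring a 4-LUT, and all names are hypothetical.

```python
from itertools import product

def lut_a(s0, s1, c, d):
    """First 4-LUT: forwards s0 when s1 = 0, otherwise selects c or d."""
    return (d if s0 else c) if s1 else s0

def lut_b(t, s1, a, b):
    """Second 4-LUT: when s1 = 0, t carries s0 and selects a or b;
    when s1 = 1, t is already the final answer."""
    return t if s1 else (b if t else a)

def mux4_two_luts(a, b, c, d, s0, s1):
    """4:1 mux in two 4-LUTs; select index is s0 + 2*s1."""
    return lut_b(lut_a(s0, s1, c, d), s1, a, b)

def mux8_five_luts(d, s0, s1, s2):
    """8:1 mux as two 4:1 muxes (2 LUTs each) plus one 2:1 mux (1 LUT)."""
    lo = mux4_two_luts(d[0], d[1], d[2], d[3], s0, s1)
    hi = mux4_two_luts(d[4], d[5], d[6], d[7], s0, s1)
    # The final 2:1 mux uses only 3 of its LUT's 4 inputs.
    return hi if s2 else lo

def check_five_luts():
    """Compare against direct indexing on all 2048 input combinations."""
    for bits in product((0, 1), repeat=11):
        d, (s0, s1, s2) = bits[:8], bits[8:]
        if mux8_five_luts(d, s0, s1, s2) != d[s0 + 2 * s1 + 4 * s2]:
            return False
    return True
```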

I already mentioned the Virtex IV hardware. Stratix and some earlier Altera architectures have hardware that facilitates other special cases, e.g. a set of mux(a,b,c,0; s0,s1) can be implemented in a LAB cluster by stealing functionality from the LAB-wide SLOAD hardware before the DFF. So you can get a restricted 4:1 mux in one LE instead of 2 (that's the "basically" in the above).

When I said context I meant this: if an 8:1 mux is followed by an AND gate (e.g. Z = mux(a,b,c,d; s0,s1) & e), then the AND gate would be a "free" addition to the 5 4-LUT implementation in the vanilla architecture (because there's a leftover input on the last LE), but would cost a new LE using the Virtex IV hardware. So F5 gives a maximal 20% savings for a lone 8:1 mux, but depending on the surrounding logic the relative benefit could disappear. That's not a deficiency; you just can't count on getting the benefit in all cases. Note that if it's a 3-input AND gate, the situation reverses and the dedicated hardware is again ahead by one LE.

In reality, though, you probably don't care about one simple mux; you care about systems of muxes that consume huge numbers of LUTs. For example, a simple 16-bit barrel shifter
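The LE counting in this argument can be written out as simple arithmetic. The sketch below is an illustration only: the function names are hypothetical, and it assumes an AND gate with at most three inputs besides the mux output, so that any extra LE needed is a single 4-LUT.

```python
def vanilla_les(extra_and_inputs):
    """LEs for a 5-LUT 8:1 mux followed by an AND with the given number
    of extra inputs (0 to 3) in a plain 4-LUT architecture."""
    base = 5   # two 4:1 muxes (2 LUTs each) plus one 2:1 mux
    spare = 1  # the final 2:1 mux LUT uses 3 of its 4 inputs
    return base if extra_and_inputs <= spare else base + 1

def dedicated_les(extra_and_inputs):
    """LEs when the 8:1 mux uses 4 LUTs plus dedicated F5/F6 muxes;
    any trailing AND gate needs an LE of its own."""
    base = 4
    return base if extra_and_inputs == 0 else base + 1

# Lone mux: 5 vs 4 LEs, the 20% saving.
# Z = mux & e: 5 vs 5, the saving disappears.
# Z = mux & e & f (3-input AND): 6 vs 5, dedicated hardware is ahead again.
cases = [(k, vanilla_les(k), dedicated_les(k)) for k in range(3)]
```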

out[15:0] = in[15:0]