8:1 MUX implementation in XILINX and ALTERA

Hi all,

I wanted to know how many CLBs an 8:1 mux implementation takes in an
ALTERA and a XILINX device. I would also like the details of the
internal implementation (e.g. how many LUTs are used).

One more question: as the depth of the multiplexer increases, can the
LUTs be shared?


Regards,
Prav


Re: 8:1 MUX implementation in XILINX and ALTERA
Hi Prav,

This depends very much on exactly what family you are talking about.
Different device families have different capabilities, and also different
numbers of slices per CLB. You should look at the datasheets for details.

In all Xilinx FPGAs since Spartan-II & Virtex, there are dedicated
multiplexor resources that make 4:1 and larger muxes quite efficient. A
1-bit 8:1 mux will require 4 LUTs to implement the four first-stage 2:1
muxes. Then there are two second-stage 2:1 muxes, which can use an "F5" mux,
and a single final-stage 2:1 mux which can be an "F6".

So the total is 2 slices (the Fx muxes are "free" within the slice). In
older parts that equates to 1 CLB; in anything newer than Virtex-II, that is
0.5 CLB. In any case, this will be a very fast function since the routing to
and from the Fx multiplexors is dedicated.
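
As a rough illustration of the structure described above, here is a minimal sketch of a 1-bit 8:1 mux written as an explicit tree of 2:1 muxes. The module and signal names are made up for this example, and whether a given tool actually maps the second and third stages onto F5/F6 resources is an assumption that depends on the family and the synthesis settings.

```verilog
// Sketch only: 8:1 mux as three stages of 2:1 muxes.
// Names are illustrative, not taken from any vendor library.
module mux8_tree (
    input  wire [7:0] d,
    input  wire [2:0] s,
    output wire       y
);
    // Stage 1: four 2:1 muxes selected by s[0] (one LUT each)
    wire [3:0] m1;
    assign m1[0] = s[0] ? d[1] : d[0];
    assign m1[1] = s[0] ? d[3] : d[2];
    assign m1[2] = s[0] ? d[5] : d[4];
    assign m1[3] = s[0] ? d[7] : d[6];

    // Stage 2: two 2:1 muxes selected by s[1] (candidates for F5 muxes)
    wire [1:0] m2;
    assign m2[0] = s[1] ? m1[1] : m1[0];
    assign m2[1] = s[1] ? m1[3] : m1[2];

    // Stage 3: one 2:1 mux selected by s[2] (candidate for an F6 mux)
    assign y = s[2] ? m2[1] : m2[0];
endmodule
```

In practice a plain `d[s]` or a case statement would normally be enough; the explicit tree is only meant to make the 4 + 2 + 1 structure visible.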

As for sharing LUTs as the mux gets deeper: I'm not sure what you mean
by that. I *think* the answer is no.

Cheers,

    -Ben-



Re: 8:1 MUX implementation in XILINX and ALTERA
Ben Jones wrote:
Maybe he means the width. There are indeed some optimizations that you
can do for wider muxes.
If you have an N-bit wide M-to-1 mux, there is a certain N for each M at
which it becomes beneficial to use a first stage that outputs either 0
or the input value, depending on the select value, and then reduce these
outputs with an OR-tree or the carry chain. Muxes reduce at a rate of
2-to-1 per LUT, while ORs reduce at a rate of 4-to-1.
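
A minimal sketch of that AND-then-OR structure, under a few assumptions of mine: the parameter names N, M and SW, the flat packing of the M input words into one vector, and the module name are all invented for the example. It shows only the logic structure; whether it beats a plain mux depends on N, M and the target family, as noted above.

```verilog
// Sketch only: wide mux built as "gate each word, then OR everything".
// Parameter and port names are assumptions made for this illustration.
module mux_and_or #(
    parameter N  = 16,  // data width
    parameter M  = 8,   // number of input words
    parameter SW = 3    // select width, 2**SW >= M
) (
    input  wire [M*N-1:0] d,  // word i lives in d[i*N +: N]
    input  wire [SW-1:0]  s,
    output reg  [N-1:0]   y
);
    integer i;
    always @* begin
        y = {N{1'b0}};
        for (i = 0; i < M; i = i + 1)
            // First stage: word i passes only when selected, else 0.
            // The running OR stands in for the OR-tree or carry chain.
            y = y | (d[i*N +: N] & {N{s == i}});
    end
endmodule
```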

Also note that a multiplier or a BRAM can implement a MUX.

Kolja Sulimma

Re: 8:1 MUX implementation in XILINX and ALTERA

I don't think I've posted anything in years, but I just couldn't resist
adding to this one because I played with it for some time.

As the previous posters said -- it depends on the device.  But I'd also
add (in some detail) that it depends on context, even within one device.

To answer the question most directly: an 8:1 mux requires two slices in
Virtex IV or 2 Adaptive Logic Modules (ALMs) in Stratix II.  But whether
you actually get that in a full system depends on the structure of your
design.

The Virtex IV version is easy to see because it's just the output of the
F6 mux provided as dedicated hardware.  Spartan III is a cost-reduced
Virtex IV, so it should behave identically.

In Stratix II we can do it without the need for dedicated hardware but
it's a bit trickier to synthesize:

For Z = mux(d0,d1,d2,d3,d4,d5,d6,d7; s0,s1,s2) synthesis will give you:
   y0 = mux(d0,d1,d2; s0,s1)
   y1 = mux(d4,d5,d6; s0,s1)
which are two 5-input functions that pack into a single ALM.

In the second ALM
   z0 = (s0 & s1 & d3) # !(s0 & s1) & y0
   z1 = (s0 & s1 & d7) # !(s0 & s1) & y1
   Z = mux(z0,z1; s2)
will be generated using 7-LUT mode.
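
To make the decomposition above easier to check, here is a sketch that writes the same equations structurally, plus a small exhaustive testbench. The module names are invented, and the choice of what y0 and y1 return when s0 = s1 = 1 (a don't-care in the equations) is my assumption. Nothing here forces Quartus to pack the logic into exactly two ALMs; it only demonstrates that the equations add up to an 8:1 mux.

```verilog
// Sketch only: the two-ALM decomposition written out as plain logic.
module mux8_alm_sketch (
    input  wire [7:0] d,
    input  wire [2:0] s,
    output wire       z
);
    wire s01 = s[0] & s[1];

    // First ALM: two 3:1 muxes, 5 inputs each, sharing s[0] and s[1].
    // For s[1:0] == 2'b11 the value is a don't-care; here it repeats d2/d6.
    wire y0 = s[1] ? d[2] : (s[0] ? d[1] : d[0]);
    wire y1 = s[1] ? d[6] : (s[0] ? d[5] : d[4]);

    // Second ALM: 7 distinct inputs (s0, s1, s2, d3, d7, y0, y1)
    wire z0 = (s01 & d[3]) | (~s01 & y0);
    wire z1 = (s01 & d[7]) | (~s01 & y1);
    assign z = s[2] ? z1 : z0;
endmodule

// Exhaustive comparison against a reference d[s] over all 2**11 inputs
module mux8_alm_sketch_tb;
    reg  [7:0] d;
    reg  [2:0] s;
    wire       z;
    integer    i, errors;

    mux8_alm_sketch dut (.d(d), .s(s), .z(z));

    initial begin
        errors = 0;
        for (i = 0; i < 2048; i = i + 1) begin
            {s, d} = i[10:0];
            #1;
            if (z !== d[s]) errors = errors + 1;
        end
        $display("mismatches: %0d", errors);
        $finish;
    end
endmodule
```

If the decomposition is right, the testbench should report no mismatches in any Verilog-2001 simulator.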

I attached Verilog at the end if you want to run it through Quartus; you
can look at the result in the equation file and will see what I just
described.  Note that depending on what else is in the design, the 5-LUTs
might get packed or synthesized differently, i.e. Quartus may prefer to
pack the two 5-LUTs with two unrelated 2- or 3-LUTs to make two 7-input
ALMs rather than one 8-input ALM and a second 6-input ALM, or it may
synthesize differently at the cost of area to hit a delay constraint.

On older devices (Altera Stratix, Cyclone; Xilinx Spartan I, 4000) and
on MAX II and Cyclone II, you can basically use "4-LUT" in the
discussion below, though it will depend on other issues in practice.  I
haven't thought about PTERM devices like MAX 7000.

But this brings me to the bigger discussion.  I would stress that in
practice it makes a big difference what the surrounding context is, and
also if you have more than one mux in your design, because in a mux
system like a barrel shifter or crossbar the amortized cost of k muxes
in Stratix II is less than k times the cost of one (which is a benefit
over Virtex IV).

In a generic 4-LUT architecture with no dedicated hardware, a simple 2:1
mux is a 3-input function and takes one LUT (with one input going
unused).  A 4:1 mux would take two LUTs (not three -- exercise to the
reader; it's easier than the 8:1 above).  An 8:1 mux requires five
vanilla 4-LUTs because it's two 4:1 muxes and one 2:1.  But it's arguably
something like 4.5 LUTs (see two paragraphs down).
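
For anyone who wants to check the two-LUT claim for the 4:1 mux, here is one decomposition that works. It is a sketch of my own and not necessarily the one the poster had in mind; the module name and the intermediate signal t are invented. Each assignment depends on exactly four signals, so each fits a single 4-LUT.

```verilog
// Sketch only: 4:1 mux, mux(a,b,c,d; s0,s1), as two 4-input functions.
module mux4_two_lut4 (
    input  wire a, b, c, d,
    input  wire s0, s1,
    output wire y
);
    // LUT 1, inputs (a, b, s0, s1):
    //   s1 = 0 -> ordinary a/b mux on s0
    //   s1 = 1 -> forward s0 itself to the second LUT
    wire t = s1 ? s0 : (s0 ? b : a);

    // LUT 2, inputs (t, c, d, s1):
    //   s1 = 0 -> t already holds the answer
    //   s1 = 1 -> t carries s0, so use it to pick between c and d
    assign y = s1 ? (t ? d : c) : t;
endmodule
```

The trick is that when s1 = 1 the first LUT has nothing useful to compute, so it is reused to pass s0 along, which frees an input on the second LUT.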

I already mentioned the Virtex IV hardware.  Stratix and some earlier
Altera architectures have hardware that facilitates other special cases,
e.g. a set of mux(a,b,c,0; s0,s1) can be implemented in a LAB cluster by
stealing functionality from the LAB-wide SLOAD hardware before the DFF.
So you can fit a restricted 4:1 mux in one LE instead of 2.  (That's the
"basically" in the above.)

When I said context I meant this:  if an 8:1 mux is followed by an
AND gate (e.g. Z = mux(d0,...,d7; s0,s1,s2) & e), then the AND gate would
be a "free" addition to the five 4-LUT implementation in the vanilla
architecture (because there's a leftover input on the last LE), but would
cost a new LE using the Virtex IV hardware.  So the F5/F6 hardware gives
a maximal 20% savings (4 LUTs instead of 5) for a lone 8:1 mux, but
depending on the surrounding logic the relative benefit could disappear.
That's not a deficiency; you just can't count on getting the benefit in
all cases.  Note that if it's a 3-input AND gate, the situation reverses
and the dedicated hardware is again ahead by one LE.

In reality, though, you probably don't care about one simple mux; you
care about systems of muxes that consume huge numbers of LUTs.  For
example, a simple 16-bit barrel shifter

   out[15:0] = in[15:0] << k[3:0]

results in 16 16:1 muxes, or 16x5 4:1 muxes = 16x5x2 LUTs = 160 LUTs
synthesized in the obvious way, or 16x4 2:1 muxes = 64 LUTs synthesized
properly into an n*log(n) shifter network of 2:1 muxes.  The Virtex
hardware would get some savings from this vs. the vanilla 4-LUT, but it
bounces between 0 and 20% based on round-off and arrangement issues in
the size of the barrel shifter, and because the advantage is lost for all
the shifter bits that source a zero in the shifter network and go
non-symmetric.
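
The n*log(n) network mentioned above can be written out by hand as four stages of 2:1 muxes, one per bit of the shift amount. This is only a sketch with invented module and stage names; a synthesis tool given `in << k` may or may not arrive at the same structure on its own.

```verilog
// Sketch only: 16-bit zero-filling left shifter as a 4-stage network.
// Each stage is 16 2:1 muxes, so 4 x 16 = 64 muxes in total.
module barrel16_log (
    input  wire [15:0] in,
    input  wire [3:0]  k,
    output wire [15:0] out
);
    wire [15:0] st0 = k[0] ? {in[14:0],  1'b0} : in;   // shift by 1
    wire [15:0] st1 = k[1] ? {st0[13:0], 2'b0} : st0;  // shift by 2
    wire [15:0] st2 = k[2] ? {st1[11:0], 4'b0} : st1;  // shift by 4
    assign out      = k[3] ? {st2[7:0],  8'b0} : st2;  // shift by 8
endmodule
```

The low-order bits of each stage have a constant 0 on one mux input, so those muxes degenerate into AND gates. These are the bits that "source a zero" and where dedicated mux hardware stops helping.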

I should mention that it's also not technically correct to compare #LEs
in the presence of any dedicated hardware, because you use fewer LEs but
the cost of an LE changes.  From the architecture point of view you have
to multiply #LEs * sizeof(LE) (even better, #LABs * sizeof(LAB) or
#CLBs * sizeof(CLB)) to evaluate whether the HW is beneficial to put in
the device (or simply compare the dollar price of the smallest Virtex IV
or Stratix II device your complete design fits in).

Although 64 LEs from a simple one-line statement sounds like a lot, it's
actually worse because usually in and out are w-bit words, so everything
gets repeated w times.  A properly synthesized w=16, 16x16 barrel
shifter, for example, requires 16x64 = 1024 4-LUTs.  The dedicated
hardware in Virtex gets 16x58 half-slices or about 9% better than a
4-LUT implementation, and Stratix II can do this in 16x32 ALUTs or 50%
fewer -- see full data below.

Note that a rotating barrel shifter (second version I attached code for)
will require more resources in both.  This is because of the wrap-around
data -- none of the muxes collapse due to zeroed inputs.  You can see
this in an ALU, but the zero-padded version will be more common in
commercial designs.

On to crossbars.  A crossbar is like a barrel shifter, except that you
can't re-synthesize it into a shifter network, you're stuck with the k
k:1 muxes.  So a 16x16 crossbar with 16 4-bit select inputs actually
requires 16 independent 16:1 muxes, again times data-width.  Because
there is no re-expression of this that isn't a plain mux, the F5 and F6
hardware should be more beneficial here on average (closer to the 20%).

When we designed the Stratix II architecture, we spent a lot of time
looking at crossbar, barrel shifters and multiplexor structures.  But
you might have figured that out by now.  What we came up with is
particularly beneficial for systems with many muxes -- the sub-linear
growth I mentioned earlier.

The Stratix II ALM is an 8-input fracturable logic block that can
implement (among other combinations not listed)
a) two independent 4-LUTs
b) independent 5-LUT and 3-LUT
c) two 5-LUTs that share 2 common inputs
d) a single 6-LUT
e) some 7-LUTs
f) two 6-LUTs that have 4 common inputs, and additionally the same
   LUT-mask

Note that for (a) an ALM is (all other things equal)  equivalent to two
Stratix LEs or one Virtex slice, for (b,f) it's always better, and for
(c,d,e) usually but not guaranteed to be better.  But you can find this
in the ALM vs.  slice discussion from a year or two ago.

Way off topic, but even the word "better" is a bit abstract -- it's
dependent on other issues like the tech-mapping algorithm and the
relative routability of the device and Si area.  For example, though an
nxn xbar might fit in f(n) cells, a (2n)x(2n) may not fit in the optimal
number f(2n) cells because a lack of routability in the device forces
the placer to spread the design out.  E.g. interconnect doesn't scale as
smoothly in older architectures like Altera Apex or Xilinx 4000 (we've
gotten better at it, but it's also a function of modern designs).

Since a 4:1 mux is a 6-input function, it can fit in one ALM.  With the
tricks described above using (c) and (e) an 8:1 fits in two ALMs.  A
16:1 mux requires 4 ALMs + a 2:1 mux, which is 4.5 ALMs (though, again
the 3-input function has two or more additional inputs to absorb more
logic, so you could argue this is 4.25 ALMs instead of 4.5).

Item (f) is where the real benefit comes in for muxes.  The
decomposition of crossbars and barrel shifters into primitive muxes
results in large numbers of 4:1 muxes that have either (i) similar data
and common select bits in the case of barrel shifters, or (ii) common
data and different select bits in the case of xbars.  By the latter, I
mean mux(a,b,c,d; s0,s1) and mux(a,b,c,d; t0,t1).  Not by coincidence,
this fits into the template of two 6-input functions with 4 common
inputs and the same LUT-mask, so a single ALM can implement two 4:1 muxes
arising from such a mux system, which makes it roughly 2X the efficiency
for powers of 4 and between 1.5X and 2X for odd powers of 2 (e.g. 8:1).
That's a generalization, because it also depends on whether barrel
shifters are rotating or shift in zeros, and whether all the outputs are
used (in packet processing you might do a 3n->2n type shifter so some of
the bits get dropped).  Same as the discussion above on F5 and F6 -- as
soon as you introduce 0's on the mux inputs you have leftover
neighbouring logic to slurp up and the numbers get fuzzy.

But we can at least look at the bottom line of all this using output
from Quartus II and ISE.  I ran this more than a year ago, so both tools
have newer versions.

16x16 zero-shifting barrel shifter
   Cyclone, Stratix, 4-LUT    64  LUTs (LEs)
   Virtex IV                  59  half-slices (packs to 47 slices)
   Stratix II ALM             32  ALUTs (or half-ALMs) (packs to 23 ALMs)

16x16 xbar
   Cyclone, Stratix, 4-LUT   160  LEs
   Virtex IV                 128  half-slices
   Stratix II ALM             88  half-ALMs

(for w-bit datapaths just take all the numbers and multiply by w).

Again, I included the Verilog below, in case someone says I'm cheating,
and both ISE and Quartus are available in free versions.  So try it
yourself.

Note that neither Quartus nor ISE will guarantee perfect packing
(half-slice to slice or ALUT to ALM).  This is due either to things like
the placer choosing to split up two sub-blocks that could be packed in
order to improve delay, or to other reasons.  For example, ISE used 47
slices to implement the 59 half-slices after placement, but at least
some of the 35 unused half-slice partners are likely available to be
packed with 2-, 3- or 4-input functions from elsewhere in the design,
were the design bigger.  Quartus II uses 23 ALMs for the 32 ALUTs,
meaning that 6 ALUTs are still potentially available for other logic
without consuming further ALMs.

For a common sub-design like a SPI4.2 PHY interface, component pieces
such as I mentioned above contain modules like an M-bit xbar to a 2M-1:1
shifter into a 3M-bit buffer from which 2M bits are selected.  I
synthesized such a design in each of Stratix, V4 and Stratix II.

   Stratix:    907 LEs
   Virtex IV: 1368 half slices (741 full slices after placement)
   Stratix II: 536 ALUTs (514 ALMs after placement)

(Sorry, can't provide Verilog for this one because it's part of the IP
core.)

You have to treat the synthesis of small designs carefully.  The XST
solution is non-optimal for Virtex IV -- I can hand-map this design into
the hardware and use fewer slices.  For example, it's nearly trivial to
get the 907 that I got in Stratix, though that also uses the 3:1 mux
trick I mentioned above, but XST isn't doing it for some reason.

Finally, bus-muxes.  This is when you have e.g. a simple 8:1 mux where
all the inputs are 16 bits wide.  Synthesis often re-structures these
for delay vs. area tradeoffs because you can play games with the
selects to amortize different structures through the datapath.  So be
careful trying to analyze these for area out of context.  There are a
couple of publications on this that I listed below.  The FPL paper below
also talks about crossbar and barrel shifter synthesis into the ALM.

I also didn't understand the question about sharing LUTs, but I agree
with the previous poster that the answer is probably "no" all around.
You might mean resource sharing as in making the mux iterative /
multi-cycle, but that would probably be more expensive in area.  In
terms of delay, you can always pipeline.  Also, as someone else said, a
multiplier can be used for a barrel shifter (multiply data by unary k,
i.e. by 2^k) if you have no other purpose for the dedicated DSP block.
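
The multiplier trick can be sketched in a few lines. The module name is invented, and whether the tool actually maps the `*` onto a dedicated multiplier or DSP block is an assumption that depends on the device and synthesis options.

```verilog
// Sketch only: left shift by k done as a multiply by 2**k.
module barrel16_mult (
    input  wire [15:0] in,
    input  wire [3:0]  k,
    output wire [15:0] out
);
    // One-hot ("unary") encoding of the shift amount: bit k is set
    wire [15:0] onehot  = 16'd1 << k;

    // Full 16x16 product is 32 bits wide
    wire [31:0] product = in * onehot;

    // The low half is in << k with zeros shifted in
    assign out = product[15:0];
endmodule
```

The bits shifted out of the low half end up in product[31:16], so OR-ing the two halves would give a rotating shifter instead of a zero-filling one.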

All this information is in published papers; below are some references.

The first three references are on the general mux synthesis topic.  The
other two are on the Stratix II ALM and architecture and discuss some of
the barrel-shifter/xbar material I repeated above.

Paul Metzgen and Dominic Nancekievill, "Multiplexor Restructuring for
FPGA Implementation Cost Reduction".  Design Automation Conference,
June, 2005.

Dominic Nancekievill and Paul Metzgen, "Factorizing Multiplexers in the
Datapath to Reduce Cost in FPGAs".  IWLS, June 2005.

Jennifer Stephenson and Paul Metzgen, "Logic Optimization Techniques for
Multiplexors", in Mentor user2user conference, 2004.

Mike Hutton, Jay Schleicher, David Lewis, Bruce Pedersen, Richard Yuan,
Sinan Kaptanoglu, Gregg Baeckler, Boris Ratchev, Ketan Padalia, Mark
Bourgeault, Andy Lee, Henry Kim and Rahul Saini, "Improving FPGA
Performance and Area Using an Adaptable Logic Module", Proc.
14th International Conference on Field-Programmable Logic, Antwerp,
Belgium, pp. 135-144, Sept 2004.  LNCS 3203.

David Lewis, Elias Ahmed, Gregg Baeckler, Vaughn Betz, Mark Bourgeault,
David Cashman, David Galloway, Mike Hutton, Chris Lane, Andy Lee, Paul
Leventis, Sandy Marquardt, Cameron McClintock, Ketan Padalia, Bruce
Pedersen, Giles Powell, Boris Ratchev, Srinivas Reddy, Jay Schleicher,
Kevin Stevens, Richard Yuan, Richard Cliff, Jonathan Rose, "The
Stratix-II Routing and Logic Architecture".  2005 Int'l Symposium on
FPGAs (FPGA, Feb 2005).

Regards,

Mike Hutton
Altera Corp
San Jose CA
<firstinitial><lastname>@altera.com

Note:  Please don't bother sending email to the yahoo account in the
header, I won't read it.  My real email is in the signature.

------------------------------

Here's the Verilog for the 8:1 mux, barrel shifters and crossbars.

// Simple 8:1 mux
// M. Hutton, Altera Corp, 2006
module mux(in,out,s, clk);
    input [7:0] in;
    input [2:0] s;
    input clk;
    output out;
    reg out;

    always@ (posedge clk)
    begin
        case (s)
          3'b000: out <= in[0];
          3'b001: out <= in[1];
          3'b010: out <= in[2];
          3'b011: out <= in[3];
          3'b100: out <= in[4];
          3'b101: out <= in[5];
          3'b110: out <= in[6];
          3'b111: out <= in[7];
        endcase
    end
endmodule


// Simple barrel shifter with no rotation
// M. Hutton, Altera Corp, 2003
module barrel (data_in, data_out, shift_by, clk) ;
   input [15:0] data_in ;
   input [3:0]  shift_by ;     // 4 bits cover shifts of 0..15
   input clk;
   output [15:0] data_out ;
   reg [15:0]      data_out ;

   reg [15:0]      reg_data_in ;
   reg [3:0]       reg_shift_by ;

   always @(posedge clk)
   begin
        reg_data_in <= data_in ;
        reg_shift_by <= shift_by ;
        // Non-blocking, like the other registers in this block
        data_out <= reg_data_in << reg_shift_by;
   end
endmodule

// Simple 16-bit barrel shifter with rotation.
// Mike Hutton, Altera Corp. 2003
module barrel16 (data_in, data_out, shift_by, clk) ;
   input [15:0] data_in ;
   input [3:0]       shift_by ;     // matches the 4-bit case labels below
   input      clk;
   output [15:0] data_out ;
   reg [15:0]      data_out ;
   reg [3:0]       reg_shift_by;

   reg [15:0]      reg_data_in ;

   always @(posedge clk)
   begin
    reg_data_in <= data_in ;
    reg_shift_by <= shift_by;

    case (reg_shift_by)
       4'b0000: data_out <= reg_data_in [15:0] ;
       4'b0001: data_out <= {reg_data_in[0], reg_data_in[15:1]};
       4'b0010: data_out <= {reg_data_in[1:0], reg_data_in[15:2]};
       4'b0011: data_out <= {reg_data_in[2:0], reg_data_in[15:3]};
       4'b0100: data_out <= {reg_data_in[3:0], reg_data_in[15:4]};
       4'b0101: data_out <= {reg_data_in[4:0], reg_data_in[15:5]};
       4'b0110: data_out <= {reg_data_in[5:0], reg_data_in[15:6]};
       4'b0111: data_out <= {reg_data_in[6:0], reg_data_in[15:7]};
       4'b1000: data_out <= {reg_data_in[7:0], reg_data_in[15:8]};
       4'b1001: data_out <= {reg_data_in[8:0], reg_data_in[15:9]};
       4'b1010: data_out <= {reg_data_in[9:0], reg_data_in[15:10]};
       4'b1011: data_out <= {reg_data_in[10:0], reg_data_in[15:11]};
       4'b1100: data_out <= {reg_data_in[11:0], reg_data_in[15:12]};
       4'b1101: data_out <= {reg_data_in[12:0], reg_data_in[15:13]};
       4'b1110: data_out <= {reg_data_in[13:0], reg_data_in[15:14]};
       4'b1111: data_out <= {reg_data_in[14:0], reg_data_in[15]};
    endcase
   end
endmodule


// Simple 16-bit crossbar with one-bit width
// M. Hutton, Altera Corp, 2003
module xbar(in,out,s,clk);
    input [15:0] in;
    input [63:0] s;
    input clk;
    output [15:0] out;
    reg [15:0] out;
    reg [15:0] out1;
    integer k;

    reg [15:0] inreg;

    always@ (posedge clk)
    begin
        inreg <= in;
        for (k = 0; k < 16; k = k+1)
        begin
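            // Assumed indexing: the index expression is missing in the
            // post; output k uses its own 4-bit select field s[4k+3:4k].
            out1[k] <= inreg[s[4*k +: 4]];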
        end
        out <= out1;
    end
endmodule

