add/sub 2:1 mux and ena in a single LE (Cyclone)

- Martin Schoeberl

Thu, Oct 7, 2004 7:31 PM

I want to realize an add/subtract function, a 2:1 mux between this adder and a load value, and an enable of the register in a single LE. As far as I can see in the data sheet (Cyclone) this should be possible: there is an extra input addnsub to decide between add and subtract. Two inputs of the LUT are used for the add/sub, and the remaining two inputs can perform the 2:1 mux. The register has an additional ena input. However, with the following VHDL I get 2 LEs per bit instead of 1. Any ideas?

Martin

VHDL example:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity alu is

generic (
    width : integer := 32    -- one data word
);
port (
    clk      : in std_logic;

    lmux     : in std_logic_vector(width-1 downto 0);
    b        : in std_logic_vector(width-1 downto 0);

    sel_sub  : in std_logic;    -- 0..add, 1..sub
    sel_amux : in std_logic;    -- 0..sum, 1..lmux
    ena_a    : in std_logic;    -- 1..store new value
    dout     : out std_logic_vector(width-1 downto 0)
);
end alu;

architecture rtl of alu is

    signal a : std_logic_vector(width-1 downto 0);

begin

-- this add/sub, the sum/lmux mux and the enable should fit into
-- a single LE.

process(clk, ena_a)
begin
    if rising_edge(clk) then
        if ena_a='1' then
            if sel_amux='0' then
                if sel_sub='0' then
                    a

- Symon

Thu, Oct 7, 2004 8:34 PM

Martin, I'll guess. Is it because addnsub is low for an add? So try:-

if sel_sub='1' then a

- Philip Freidin

Fri, Oct 8, 2004 12:07 AM

Too many inputs to the LUT.

1 for the A operand
1 for the B operand
1 for the add/sub signal
1 for the load value
1 for the carry in from the previous bit.

I faced this problem back in 1990 when designing R16 for the XC4005. My solution was to make the load value come in on the "A" operand path.

In the older Xilinx architectures this is not too hard, as the carry logic is separate from the LUT (but very close by), and the CE for the FFs is a separate signal. While the topology of the CLBs has changed significantly since then, I believe this can still be done in the more recent Xilinx devices.

Many of the earlier Altera architectures implied this was possible in their data sheets, but the architecture could not actually support it, because the data sheets showed mutually exclusive functions, while not explaining that they were mutually exclusive. For example, the CE signal shared an input with the LUT, and the LUT was broken into two 3 input LUTs to implement ADD (1 for the sum bit and one for the carry out), so you could not even do add/sub in one LE. I believe these deficiencies have been corrected in more recent products, but you would need to look very carefully at what a LAB can really do.

Philip

Philip Freidin Fliptronics

- Paul Leventis (at home)

Fri, Oct 8, 2004 4:54 AM

Hi Philip, Martin:

Cyclone (and Stratix and MAX II) should be able to perform this function in 1 LE per bit.

I *think* you have uncovered a bug in Quartus 4.1 synthesis. I'll confirm this with the synthesis team tomorrow. Basically, it looks like Quartus will automatically make a loadable adder flop, or an adder-subtractor flop, but not a loadable adder-subtractor flop. If I make a very simple VHDL design that implements a synchronously loadable adder+flop, I get 1 LE/bit as expected. If I add an add/subtract selector, I get 2 LE/bit for no apparent reason. The LE (as shown in Figure 2-5 of the Cyclone databook) should be able to implement an sloadable adder/subtractor in 1 LE/bit. Explanation of why below.

Cyclone has a LAB-wide addnsub signal that can be used to control whether the A operand to each LE in the LAB is inverted or not. In addition, addnsub can also enter the carry chain at bit 0 -- so you get complement(A) + B + addnsub, or (complement(A) + 1) + B = -A + B.
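The identity behind this trick can be checked with a small behavioral model. This is only a sketch in Python, not a description of the LE hardware; the function name `invert_and_add`, the parameter `carry_in` and the 8-bit width are assumptions chosen for illustration.

```python
WIDTH = 8                      # illustrative word width (assumption)
MASK = (1 << WIDTH) - 1


def invert_and_add(a: int, b: int, carry_in: int) -> int:
    """Return (complement(a) + b + carry_in) modulo 2**WIDTH."""
    return ((~a & MASK) + b + carry_in) & MASK


# Exhaustive check: inverting A and injecting a carry of 1 at bit 0
# gives the same result as B - A in two's complement arithmetic.
for a in range(1 << WIDTH):
    for b in range(1 << WIDTH):
        assert invert_and_add(a, b, 1) == (b - a) & MASK
```

The check passes for every operand pair, which is why a programmable inverter on one operand plus a carry into bit 0 is enough to turn an adder into a subtractor.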

If you consider the 4-LUT as two 3-LUTs followed by a 2:1 mux, then you get the following assignments of signals (using some pseudo-VHDL; I'm rusty):

-- The programmable inversion of the A input
Aprime(i)

- Martin Schoeberl

Fri, Oct 8, 2004 7:57 AM


and three lab-wide signals: addnsub, sload and ena of the FF. I've checked again with the data sheet. In figure 2-7 you can see the LE in 'dynamic arithmetic mode', and the resources are there for this kind of function.

When we take a look in the Analysis & Synthesis Equations we get:

--a[0] is a[0]
--operation mode is normal
a[0]_lut_out = lmux[0] & (C1_result[0] # sel_amux) # !lmux[0] & C1_result[0] & !sel_amux;
a[0] = DFFEA(a[0]_lut_out, clk, VCC, , ena_a, , );

And it should be:

a[0]_lut_out = C1_result[0];
a[0] = DFFEA(a[0]_lut_out, clk, VCC, , ena_a, sel_amux, lmux[0]);

However, the last two parameters for this DFFEA are the asynchronous inputs to the FF and we want a synchronous load. Why is there such a thing as asynchronous inputs to a FF?
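To see what the extra LE is doing, the reported a[0]_lut_out equation can be compared with a plain 2:1 mux. The following Python sketch assumes the usual reading of the equation syntax ('#' is OR, '!' is NOT, and '&' binds tighter than '#'); the function names are made up for this illustration.

```python
from itertools import product


def reported_lut(lmux: int, c1_result: int, sel_amux: int) -> int:
    """The a[0]_lut_out equation from the report, written with Python operators."""
    return (lmux & (c1_result | sel_amux)) | ((1 - lmux) & c1_result & (1 - sel_amux))


def mux2(lmux: int, c1_result: int, sel_amux: int) -> int:
    """Plain 2:1 mux: sel_amux=1 selects lmux, sel_amux=0 selects the adder result."""
    return lmux if sel_amux else c1_result


# Compare both functions over the full truth table (8 input combinations).
for lmux, c1_result, sel_amux in product((0, 1), repeat=3):
    assert reported_lut(lmux, c1_result, sel_amux) == mux2(lmux, c1_result, sel_amux)
```

Under that reading the second LUT contains nothing but the sum/lmux mux, which is exactly the part that a synchronous load input on the register could absorb.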

Perhaps the synthesizer should generate LPM_FF, where the synchronous load is available.

This function also uses 2 LCs per bit in a Spartan-3. As I'm not used to reading the Xilinx diagram of the LC, I don't know whether the resources of one LC could implement this function.

You're welcome

As you will notice, this question is related to the JOP optimizing contest ;-)

Martin

----------------------------------------------
JOP - a Java Processor core for FPGAs

- Sylvain Munaut

Fri, Oct 8, 2004 1:37 PM

Don't think so. Not in this form at least.

If I understand correctly the SLICE view of DS099-2 page 11, the muxes like CYINIT, CY0F, are configured during configuration of the FPGA.

So if you enable the carry logic for a bunch of slices, it stays active all the time. Then for the load operation to work, you must ensure your b input is all '0', then you can do it in 1 LC/bit.

If not, the carry will pollute the output ...

But this is only a simplified slice view and I don't know where to find the complete one. With this view I don't see how it implements an addsub in 1 LC/bit, although it can. In the view, I see nothing capable of inverting F1 or F2 so that the carry logic knows that one operand is inverted.

Sylvain

- Jan Gray

Sat, Oct 9, 2004 4:38 AM

In Virtex-derived architectures, you can implement
  o = add ? (a + b) : c;
or
  o = sel ? (a + b) : (a + c);
or even
  o = addsub ? (addand ? a+b : a-b) : (addand ? a&b : a^b);
in one LUT per bit.

The trick is to use a MULT_AND to kill the carry propagation when add=0.

But as Philip points out, you'd need five input signals to do o = sel ? (add ? a + b : a - b) : c; and I don't think that can be done in one LUT per bit.
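The MULT_AND trick for the first form, o = add ? (a + b) : c, can be sketched as a bit-level model. This Python sketch assumes the commonly described Virtex carry structure: the LUT output selects between propagating the incoming carry and a "generate" value, and is XORed with the incoming carry to form the output bit. The function name and the 8-bit width are illustrative assumptions, not taken from any vendor primitive library.

```python
WIDTH = 8                      # illustrative word width (assumption)
MASK = (1 << WIDTH) - 1


def add_or_load(a: int, b: int, c: int, add: int) -> int:
    """Bit-serial model of o = add ? (a + b) : c with one 4-input LUT per bit."""
    carry = 0                               # carry into bit 0
    result = 0
    for i in range(WIDTH):
        ai, bi, ci = (a >> i) & 1, (b >> i) & 1, (c >> i) & 1
        lut = (ai ^ bi) if add else ci      # 4-input LUT over a, b, c, add
        mult_and = add & ai                 # AND of two LUT inputs feeds the carry mux
        result |= (lut ^ carry) << i        # output bit: LUT output XOR carry-in
        carry = carry if lut else mult_and  # carry mux: propagate or generate
    return result


# Spot-check both modes over a spread of operands.
for a in range(0, 256, 5):
    for b in range(0, 256, 7):
        for c in (0x00, 0x5A, 0xFF):
            assert add_or_load(a, b, c, 1) == (a + b) & MASK
            assert add_or_load(a, b, c, 0) == c
```

With add=0 the AND gate forces the generate value to 0, so the carry chain stays at 0 and the LUT output passes through the XOR unchanged; that is the sense in which the carry is "killed".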

Jan Gray

- Martin Schoeberl

Sat, Oct 9, 2004 9:31 AM

"Jan Gray" wrote in message news:CXJ9d.11048$ snipped-for-privacy@newsread3.news.atl.earthlink.net...

The original request needs even six inputs. In your notation I want to achieve the following function:

d = ena ? (sload ? c : (addnsub ? a+b : a-b)) : d

However, the Cyclone LC has LAB-wide signals for addnsub, sload and ena. You only need three of the LUT inputs for a, b and c, which are available in arithmetic mode. For the Spartan LC I can see only the CE signal as an additional 'global' input that can serve as ena. There are two inputs (FIXINA/B) for the register load, but it seems to me that GYMUX is statically configured, so it can't be used for the sload part.
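For reference, the target function can be written as a next-state model of the register. This Python sketch is only a behavioral restatement of the expression for d; the function name `next_d` and the 32-bit width are assumptions for illustration and say nothing about how the LE maps it.

```python
WIDTH = 32                     # illustrative word width (assumption)
MASK = (1 << WIDTH) - 1


def next_d(d: int, a: int, b: int, c: int,
           ena: int, sload: int, addnsub: int) -> int:
    """Value of the register d after one clock edge."""
    if not ena:                # enable low: hold the old value
        return d
    if sload:                  # synchronous load wins over the adder
        return c & MASK
    if addnsub:                # addnsub=1: add
        return (a + b) & MASK
    return (a - b) & MASK      # addnsub=0: subtract (two's complement wrap)
```

Counting arguments other than d gives the six inputs mentioned: a, b and c are per-bit data, while ena, sload and addnsub are shared by every bit of the word.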

Martin


- Paul Leventis (at home)

Fri, Oct 15, 2004 4:08 AM

Hi All,

First of all, I should point out that this is sub-optimal synthesis, NOT a "bug" -- the design will function, it just uses more logic elements than necessary. We *may* fix this in a future release of Quartus, but the solution will not be easy to implement so don't hold your breath. The value is rather limited due to the input limitations explained below, and the relative rarity of this combination of functions.

In the meantime, there is a work-around. You can directly instantiate "stratix_lcells" (the WYSIWYG cell for Stratix/Cyclone LEs). Below I give the code (thanks to a helpful synthesis guy) for a registered adder/subtractor with oodles of extras. Features:
- Implements A - B or A + B (depending on signal "addnsub")
- Registers are synchronously loadable with "data" when synchronous load "sload" is asserted
- There is shared clock "clk", clock enable "ena", synchronous clear "sclr", asynchronous clear "aclr"

A couple of caveats:
- There are only 26 non-global inputs to each LAB in Cyclone (and 30 in Stratix). So the fitter will have to split the design over multiple LABs if you use more than 7 bits in Cyclone, since you need 3 inputs/bit (A, B, sload_data) plus 4 local control signals and 2 global signals. Assuming aclr and clk are global, and the others are local, that's 4 extra signals you need.
- When you stress the number of inputs on a LAB, you run the risk of having reduced routability, resulting in longer run-times, poor performance, or unroutable designs in the worst case. You should try to keep the number of LAB inputs around 22-24.

When Quartus splits the carry-chain, it must insert extra logic elements to end the chain and begin the next. For example, to implement a 10-bit add/sub/load/ena/aclr/sclr/sload requires 13 LEs. Still better than 20 LEs, but not 1:1. Also, the remaining unused LEs in the LAB will not be too useful, since the LAB inputs are nearly saturated.

If you have no sload or a constant sload, you can implement 10 bits/LAB since you only need 2n + 4 lab lines.
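The input budget above can be turned into a quick calculation. This Python sketch only restates the arithmetic from the caveats; the function name, its parameters and the default of 10 LEs per LAB are assumptions made for the illustration (the 10 is chosen to match the "10 bits/LAB" figure).

```python
def max_bits_per_lab(lab_inputs: int, inputs_per_bit: int,
                     local_controls: int = 4, les_per_lab: int = 10) -> int:
    """Largest adder slice that fits in one LAB under the input budget."""
    # Inputs left for data after the shared local control signals are routed in.
    by_inputs = (lab_inputs - local_controls) // inputs_per_bit
    # A LAB cannot hold more bits than it has logic elements.
    return min(by_inputs, les_per_lab)


print(max_bits_per_lab(26, 3))  # Cyclone, A + B + sload_data per bit: 7
print(max_bits_per_lab(30, 3))  # Stratix, same assumptions: 8
print(max_bits_per_lab(26, 2))  # Cyclone, no sload data (2n + 4 lines): 10
```

The first and last results reproduce the 7-bit and 10-bit limits quoted for Cyclone; the Stratix figure is simply what the same formula gives for 30 inputs and is not a number from the post.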

Hope this helps!

Paul Leventis Altera Corp.

************************* VERILOG CODE ******************

// Thanks to Gregg Baeckler for code!

module addsub (clk,a,b,addnsub,sload,sclr,aclr,ena,data,out);
parameter WIDTH = 7;

input [WIDTH-1:0] a;     // Operand A
input [WIDTH-1:0] b;     // Operand B (+B or -B based on addnsub)
input [WIDTH-1:0] data;  // Data to load upon sload
input clk;               // Clock
input addnsub;           // ADD=1, SUBTRACT=0
input sload;             // Triggers synchronous load of register
input sclr;              // Synchronous clear
input aclr;              // Asynchronous clear
input ena;               // Clock enable

output [WIDTH-1:0] out;
wire [WIDTH-1:0] out;
wire [WIDTH-1:0] cout_wires;

// The first cell CIN is special since it has no carry-in.
// Its carry-in will be the addnsub signal
stratix_lcell first_cell (
    .dataa(b[0]),
    .datab(a[0]),
    .datac(data[0]),
    .sload(sload),
    .sclr(sclr),
    .ena(ena),
    .aclr(aclr),
    .clk(clk),
    .inverta(addnsub),
    .regout(out[0]),
    .cout(cout_wires[0])
);
defparam first_cell .operation_mode = "arithmetic";
defparam first_cell .synch_mode = "on";
defparam first_cell .sum_lutc_input = "cin";
defparam first_cell .lut_mask = "96b2";
defparam first_cell .output_mode = "reg_only";

// fill in the rest of the cells in this loop
genvar i;
generate
for (i=1; i

- Martin Schoeberl

Sun, Oct 17, 2004 3:23 PM

Paul,

thanks for your suggestion. However, I will stay with plain VHDL and wait for the synthesizer update :-)

I was never thinking that this is a 'bug' in the sense that it produces wrong results.

However, if the LAB global inputs such as 'sload' and 'ena' are not available for the synthesizer you're 'wasting' resources. Do you use these signals for other functions (perhaps the loadable counter)?

BTW: Do we really need asynchronous signals such as PRN/ALD, ADATA and CLRN (ok, this one for the asynchronous reset) these days? Isn't that a waste of resources, useful only for the few designers who do asynchronous design?

Is there some documentation about these AYSIAYG lcells? I was looking for such an entity in the Megafunctions/LPM help of Quartus (before you provided the solution) to implement this function. However, I did not find this basic megafunction.

Martin


- Paul Leventis (at home)

Fri, Oct 22, 2004 12:36 PM

Hi Martin,


This problem is specific to combining the addsub feature with the sload signal (I think). When you write a vanilla loadable counter or other such code that requires an sload, enable, etc. Quartus should be using the LE properly.

Note: There are some circumstances where we do not use an sload or other control signal even though we could. For example, I don't think we recommend that synthesis tools use sload or enable as general logic signals, since these signals are shared LAB-wide and if you end up with oodles of independent sload or enable signals in a design, there will be poor packing. That's an example where even though the synthesis will use fewer LEs, the number of LABs required will be higher and thus the synthesis is "larger".


The quick answer is no, no one should be using many asynchronous control signals. But users do, and if the user writes their (bad?) HDL so that they have async signals, and we don't have the hardware, our only choice is to emulate the async functionality using logic elements. And then you get into some difficulty with the start-up condition of these soft flip-flops, potential glitch issues if not careful, etc.

If you look carefully at Cyclone II, we have removed the aload capability since this is cheaper to implement in soft logic for those few times people use it -- we might as well tax the users with aloads rather than burden every user with the slight silicon bloat there would be with hard support.


There is no "user" documentation on the WYSIWYG (I like the all version ;-)). We are working on some documentation that may be released in the future. In the meantime, you can also download the QUIP toolkit (search for QUIP). This is a package we make available for academics who are designing CAD tools. It provides a document describing the Stratix (and thus Cyclone and MAX II) LE WYSIWYGs.

Regards,

Paul Leventis Altera Corp.

- Sylvain Munaut

Fri, Oct 22, 2004 11:40 PM

After rechecking more closely, I better understand.

More generally:

o = addsub ? (add_aux ? a+b : a-b) : ( F(a,b,add_aux) )

with F(a,b,add_aux) any function ...

One solution would be not to load from a third bus, but to implement a "load operand A" or "load operand B" (or even both, selecting which to load via add_aux).

Also, depending on where the load operand comes from: if it comes from a mux and that mux has a 'spare' input, connect that spare input to the add or sub selector. Then send the load via the add_aux signal. When add_sub is used, use the add or sub signal on your mux.

Sylvain