CPU design uses too many slices

- Jürgen Böhm

Tue, Nov 27, 2007 6:36 PM

Hi,

Currently I am designing (as an amateur project) a 32-bit stack-oriented CPU with two stack pointers (data stack and return stack) and some additional registers, some purely auxiliary and some dedicated to the CPU's intended purpose as a specialized Lisp processor. The control is microcoded, and the greater part of the microcode is already written and successfully tested (in simulation with Icarus). Still missing are parts of the ALU functions and the complete interrupt/exception logic. Nevertheless the design (done in Verilog), when synthesized, already occupies about 1100 slices in a Spartan 3 FPGA, which I feel is a bit heavy for what seems to me a very simple design.

Below I give the output of the Xilinx ISE WebPACK synthesis tool:

Logic Utilization                                 Used   Available   Utilization
Number of Slice Flip Flops                         621       3,840           16%
Number of 4 input LUTs                           2,561       3,840           66%

Logic Distribution
Number of occupied Slices                        1,517       1,920           79%
Number of Slices containing only related logic   1,517       1,517          100%
Number of Slices containing unrelated logic          0       1,517            0%
Total Number 4 input LUTs                        2,751       3,840           71%

(About 400 to 500 slices can be subtracted from the above figures, as they result from accompanying structures such as the VGA driver.)

What catches my eye is how small the utilization of slice flip-flops is compared to the utilization of slices. Can this be an expression of the fact that there is much combinatorial logic (adders, multiplexors) and, relative to that, few registers/state elements? Are adders in particular, which I used quite generously to speed up the instructions, a source of slice consumption? Or are multiplexors with many alternative inputs more likely the culprits?

I would be very happy if someone with more experience than me (being just a hobbyist) could look at the Verilog source of the CPU and give me some hints on how to lower the amount of resources needed by the design.

Greetings,

Jürgen

--
Jürgen Böhm                                            www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream ?"  R. Thom

- Jon Elson

Tue, Nov 27, 2007 8:57 PM

Yes, precisely.

Yes, wide adders use a lot of LUTs. Multiplexers use up LUTs too. A single 4-input LUT could form a single bit of a 2-input mux, wasting one input. If you need more inputs, then you have to combine several LUTs to perform one bit's worth of multiplexer.

Xilinx has pretty detailed info on what the basic structure of their chips is, and you should be able to see how one would form basic logic functions out of that. It may be that Virtex would give more resources for this particular task than Spartan.

Jon

- Gabor

Tue, Nov 27, 2007 10:37 PM

Another point to make is that unless you change some defaults, the mapper will not pack slices to capacity until the whole part becomes mostly full. So the number of occupied slices does not necessarily represent the most compact placement of your design. The statistics for LUTs and flip-flops are more useful for determining your actual logic usage.

However given the fact that your number of slices is not a whole lot more than half the number of LUTs, I'd say that further packing of "unrelated logic" won't make your design much smaller.
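As a rough check of this point, the figures from the report at the top of the thread can be compared directly. The sketch below assumes two 4-input LUTs per Spartan-3 slice; the variable names are only illustrative.

```python
import math

# Figures taken from the utilization report quoted at the top of the thread.
total_luts = 2751
occupied_slices = 1517

# Assumption: each Spartan-3 slice holds two 4-input LUTs.
LUTS_PER_SLICE = 2

# Fewest slices that could hold all LUTs if every slice were fully packed.
min_slices = math.ceil(total_luts / LUTS_PER_SLICE)

# Slices that tighter packing could recover at best.
slack = occupied_slices - min_slices

print(min_slices, slack)
```

Under that assumption, even perfect packing would free only a small fraction of the occupied slices, which is consistent with the remark above.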

To benefit from changing families, you probably need to go to Virtex 5, which has 6-input LUTs. Other Virtex families look very similar to Spartan 3 from the viewpoint of the fabric.

- Jim Granville

Tue, Nov 27, 2007 10:50 PM

You could download the Lattice Mico32 and reality-check against that, as it is open source. Most FPGAs these days have multiport RAM, so it makes sense to optimise your architecture to use that: in your case for registers, and maybe even for microcode storage.

-jg

- Jürgen Böhm

Wed, Nov 28, 2007 3:41 AM

Thank you for your answer. Indeed I use RAMB16_S36 for microcode storage; the final design will probably need four of them, as the microcode is more than 36 bits wide. The idea from the other posters to change to Virtex FPGAs is currently not an option for me, as I really want to develop for the cheaper Spartan platform, for which a lot of affordable boards are offered. If necessary I will buy a board with the next larger Spartan 3 on it.

Greetings,

Jürgen


- Jürgen Böhm

Wed, Nov 28, 2007 4:04 AM

Currently I have predominantly three muxes of (5-bit select) x (32-bit data), with only 16 of the select alternatives actually used (I overdimensioned the muxes, as I did not know exactly, before having written the microcode, how many inputs would be necessary). Are these muxes realized by cascaded LUTs, and does your remark above imply that a 5-stage-deep chain of LUTs (one stage for every select bit) will be used?

Greetings,

Jürgen


- Joseph Samson

Wed, Nov 28, 2007 5:23 PM

[snip]

The synthesized results are really the worst case scenario. Before worrying about a design, take it through mapping; that's where most of the logic optimization and signal trimming happens. We have designs that are over 100% utilized after synthesis that fit just fine after mapping.

--
Joe Samson
Pixel Velocity

- rickman

Wed, Nov 28, 2007 7:56 PM

If you are trying to fit a given device, then you need to use the full map and place portions of the tools as well. Only then will you know for sure that your design won't fit. But what part is on your board? You are using about 75% of available resources. I can't say for sure about your design, but ALU logic can be very light if designed properly. So the rest of your design may fit easily in the part.

I designed my own 16 bit CPU to have minimal size and it was about 500 LUTs, IIRC. Like you, most of the logic was from muxes, so I kept them as small as possible, even to the point of eliminating some instructions. Having an extra, unused select line makes them twice as large. BTW, any unused inputs will be optimized out by the tools. So if you don't connect the select input or data inputs, that logic will not be generated.

- Jon Elson

Wed, Nov 28, 2007 10:13 PM

I think it probably does a little better than that. Really, it breaks it down into basic boolean equations, and then minimizes them. So, it may make much more efficient use than what you describe above, and it probably gets better the more inputs you have. I think three LUTs can do a 4-input MUX, you can almost do it with 2 but are one input short. If you had 5 separate select inputs (like if you were originally designing for 5 tri-state drivers on a bus) that might be less efficient than using a 3-bit binary address for the MUX. But, if a binary address is decoded somewhere in your logic to the 5 select lines, that will all fall out in the logic minimization.
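To put rough numbers on this, here is a small sketch of structural LUT counts for the three 32-bit muxes mentioned earlier in the thread. It is an estimate under stated assumptions, not what the tools will actually produce after minimization: `mux_luts_tree` assumes one 2:1 mux per 4-input LUT and nothing else, while `mux_luts_with_fmux` assumes the slice's dedicated multiplexers (MUXF5 and wider) combine LUT outputs at no LUT cost. Both function names are made up for illustration.

```python
def mux_luts_tree(n_inputs: int) -> int:
    """LUTs per output bit for an n:1 mux built only from 2:1 muxes,
    one 2:1 mux per 4-input LUT (two data inputs plus one select used)."""
    return n_inputs - 1


def mux_luts_with_fmux(n_inputs: int) -> int:
    """LUTs per output bit when each LUT implements one 2:1 mux and
    dedicated slice multiplexers (an assumption) merge the LUT outputs."""
    return (n_inputs + 1) // 2


WIDTH = 32    # data width from the thread
N_MUXES = 3   # number of wide muxes mentioned in the thread

for n in (4, 16, 32):
    tree = mux_luts_tree(n) * WIDTH * N_MUXES
    fmux = mux_luts_with_fmux(n) * WIDTH * N_MUXES
    print(f"{n:2d}:1  LUT-only tree = {tree:4d}   with dedicated muxes = {fmux:4d}")
```

For a single bit of a 4-input mux this gives three LUTs in the LUT-only case and two with a dedicated combining mux, and it shows each extra select line roughly doubling the cost.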

Jon

- Jon Elson

Wed, Nov 28, 2007 10:15 PM

Yup, the low-cost Spartan was my choice for some designs, too, as I really had no need for the special structures that the Virtex features.

Jon

- Eric Smith

Wed, Nov 28, 2007 10:33 PM

You can use a single BRAM as a 72-bit wide single-ported RAM, if you only need half the "depth". For instance, normally the maximum width of a Spartan 3 BRAM would be 512x36, but you can combine the two ports to get 256x72.

Obviously if you need greater depth or dual-port this won't help you.

Eric

- Jürgen Böhm

Thu, Nov 29, 2007 9:33 PM

I use a Spartan-3 starter kit with a XC3S200. The utilization figures I gave above refer to this component. Map and Place I already did, too, but it did not shrink the design significantly.

Considering the ALU, it seems that it can become quite heavy. The utilization figures above are with an ALU that lacks some operations which I really would have liked to implement, especially an r/lshift(x,y) operation which shifts the 32-bit word x by an amount of y[4:0]. As long as I kept this in the ALU I had nearly 90% device utilization and, what is even worse, a maximum speed of only 46 MHz for the CPU.

Here I would like to ask a question: if I write the following

    wire [4:0] sel;

    case (sel)
      0:  case0;
      ..
      15: case15;
    endcase

then obviously one specific select-line (sel[4]) won't be used and, following your argumentation and common-sense intuition, the size of the multiplexor should be halved. But will this be also the case with

    wire [4:0] sel;

    case (sel)
      3: case3;
      7: case7;
      8: case8;
      ..
      m: casem;
    endcase

where 3,7,8,..,m form a more or less arbitrary 16-element set from the range 0..31 ?


- Jürgen Böhm

Thu, Nov 29, 2007 9:47 PM

Right, I need the full depth, but there are two other points coming into play here:

I noticed that using a dual-port RAMB instead of a single-port one increases (slightly) the number of used slices, even if only one port of the RAMB is used. I do not know the reason for this; maybe it is because some external dual-port logic has to be generated and added.

More importantly, the access delay seems to be shorter for a single-port BRAM: I could lift my design from 46 MHz above the 50 MHz barrier only by replacing the dual-port BRAM with a single-port one.

Jürgen


- rickman

Thu, Nov 29, 2007 10:33 PM

Yes, an n stage barrel shifter is a very logic intensive function. It can easily be larger than all of the other ALU functions combined. If you consider what is required, you in essence need to build a mux with an input for each possible shift on every bit. If you are shifting in zeros instead of rotating the other bits back on the other end, you can cut your mux roughly in half. But it is still huge. If you want to be able to shift both left and right it is doubled again and if you want to shift right either arithmetic or logical it is larger yet and if you want to rotate as well it is even larger.

If you check the details of the slice logic, there should be some additional gates to allow a pair of 4LUTs to be used to make a 4 input mux. I would expect the tools to use this automatically, but I never trust the tools and I check. If there is any logic driving the select inputs rather than being connected to register outputs, that logic can get mixed in with the mux and make quite an ugly picture. I don't know that it is any less efficient, but I can no longer verify how good it is. I like to verify the logic my HDL is generating.

Unless you specifically specify don't care for cases 16 to 31, I don't know what the tool assumes. I expect it will add sel[4] as an enable. But my point is that it won't use the data inputs case16 through case31 which should cut the number of mux LUTs in half.

In fact (I am very rusty in Verilog, working mostly in VHDL), the above logic may well generate a latch. That is what happens with incompletely specified functions, no? So sel[4] may end up as an enable to a latch at the output of the mux. In VHDL you can't use a case statement without specifying all possible cases or using an otherwise case. If the otherwise is spec'd to output a zero, then sel[4] will be an enable. To have sel[4] ignored you would have to spec the cases from 16 to 31 to give the same outputs as 0 to 15 respectively.

My original statement about the logic being automatically optimized away would only apply if you designed the mux to have 32 data inputs and did not drive half of them. Again, that likely is not legal, but I don't recall what any particular compiler will do for Verilog.

The second design (the one with the arbitrary set of select values) will use all of the select inputs even if only half of the data inputs are used. So again the mux logic will be reduced, but each data input will be enabled by a full decode of all the sel inputs. I don't think the logic will be halved in this case, however; it depends on how it is implemented. Again, without a fully spec'd case, it may generate a latch on the output.
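The difference between the two case statements can be sketched with a small model. It assumes that every select value not listed is treated as a don't-care (which, as noted above, the tool may not assume unless told so) and that each listed value picks a different data input. Under those assumptions a select bit can be ignored only if no two listed values differ in exactly that bit. The function name and the irregular example set are hypothetical.

```python
def droppable_select_bits(used_values, width):
    """Return the select bits a minimizer could ignore, assuming every
    select value NOT in used_values is a don't-care and every used value
    picks a different data input."""
    used = set(used_values)
    droppable = []
    for bit in range(width):
        mask = 1 << bit
        # The bit matters if two used values differ in exactly this bit.
        if not any((v ^ mask) in used for v in used):
            droppable.append(bit)
    return droppable


dense = range(16)                     # cases 0..15 of a 5-bit select
scattered = list(range(15)) + [16]    # hypothetical irregular 16-element set

print(droppable_select_bits(dense, 5))      # [4]
print(droppable_select_bits(scattered, 5))  # []
```

With cases 0 to 15 only sel[4] can fall away, while an irregular set of 16 values can leave every select bit in use.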

- rickman

Thu, Nov 29, 2007 10:38 PM

This sounds odd to me, but obviously your dual port design is different from the single port design in other ways than just the ram. You need two address busses and control signal sets, not to mention the two data paths. How did you connect the dual port ram that was different from the single port ram? I am pretty sure the block ram itself fully implements the dual port memory and does not require any slices to be used.

- Jürgen Böhm

Fri, Nov 30, 2007 12:30 AM

Actually I just used "dummy signals" at the unused port B ADDR, DI, DO, ... signals. That is, first I wrote (out of laziness, I just copied and pasted) something like

RAMB16_S36_S36 micro_store ( ... ,.DOB(dummydob), ... );

and used only the port A. (dummydob is a signal left undeclared).

Secondly I wrote explicitly

RAMB16_S36 micro_store ( .. );

and got the results with faster timing and fewer slices used.

- Jürgen


- rickman

Fri, Nov 30, 2007 3:11 AM

I don't know the impact of dummydob. I would expect the tools to use LUTs to source fixed-value signals, since you are instantiating a fixed RAM block, but I don't really know what they would do with that. If they don't provide signal drivers, they would have to minimize the dual-port RAMs to single-port RAMs, since that is all that is being used. I don't think the RAM blocks can ignore inputs, but again, I don't know for sure. One of the Xilinx guys could tell you for sure.

- Brian Drummond

Fri, Nov 30, 2007 1:58 PM

One trick with the barrel shifter is to use the multiplier blocks to implement it: for a 32*16 shifter, 2 multipliers are enough. Simply decode the shift distance to one-of-16. This works unless the multipliers are already fully employed elsewhere.
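The arithmetic behind this trick can be sketched as a behavioral model: shifting left by s is the same as multiplying by 2^s, so the shift distance is decoded to a one-hot factor and each 16-bit half of the word goes through its own multiplier. The sketch assumes 16-bit unsigned operands per multiplier and handles left shifts of 0 to 15 only; the function name is made up, and this is a model of the idea rather than a description of any particular implementation.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def shl32_by_multipliers(x: int, shift: int) -> int:
    """Left-shift a 32-bit word by 0..15 using two 16-bit multiplies,
    modelling two hardware multiplier blocks."""
    if not 0 <= shift <= 15:
        raise ValueError("shift must be in 0..15")
    one_hot = 1 << shift        # shift distance decoded to one-of-16
    lo = x & MASK16             # lower half feeds the first multiplier
    hi = (x >> 16) & MASK16     # upper half feeds the second multiplier
    p_lo = lo * one_hot         # each product is at most 31 bits wide
    p_hi = hi * one_hot
    # Recombine the partial products; bits pushed past bit 31 are dropped.
    return (p_lo + (p_hi << 16)) & MASK32


print(hex(shl32_by_multipliers(0x12345678, 4)))  # 0x23456780
```

The only LUT logic left in this scheme is the one-of-16 decoder and the adder that recombines the two products.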

- Brian

- Peter Alfke

Fri, Nov 30, 2007 7:19 PM

Just make sure that you force the Enable and the Write Enable inputs of the unused port Low. Nothing else.

BTW: Use the 18 x 18 multipliers to implement barrel shifters. It saves many LUTs and is faster. (Multiply ...)

- glen herrmannsfeldt

Fri, Nov 30, 2007 9:22 PM

rickman wrote: (snip)

The more usual barrel shifter would be log2(n) stages of 2-input muxes, which is still a lot of CLBs, or maybe log4(n) stages of 4-input muxes.
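A behavioral sketch of the log2(n) arrangement: stage k either passes the word through or shifts it by 2^k, depending on bit k of the shift amount, so a 32-bit shifter needs five stages of 32 two-input muxes. This models a logical left shift only and assumes the width is a power of two; the function names are illustrative.

```python
def barrel_shift_left(x: int, amount: int, width: int = 32) -> int:
    """Logical left shift built from log2(width) stages of 2:1 muxes.
    Assumes width is a power of two and 0 <= amount < width."""
    mask = (1 << width) - 1
    stages = width.bit_length() - 1     # 5 stages for a 32-bit word
    x &= mask
    for k in range(stages):
        shifted = (x << (1 << k)) & mask
        # One 2:1 mux per bit, selected by bit k of the shift amount.
        x = shifted if (amount >> k) & 1 else x
    return x


def mux_count(width: int = 32) -> int:
    """Number of 2:1 muxes in the staged shifter (width per stage)."""
    return width * (width.bit_length() - 1)


print(mux_count(32))
```

For 32 bits this comes to 160 two-input muxes, compared with 32 x 31 = 992 if every output bit had its own flat 32-input mux built from 2:1 muxes.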

It is barrel shifters that make floating point addition and subtraction so expensive in FPGAs. You need one for prenormalization (shift to align the radix point), and one for postnormalization (remove leading zeros in the result and adjust the exponent).

-- glen