XST Help - Device Utilization Woes

Hello,

I'm synthesizing a design in XST and I'm having a hard time figuring out what's consuming all of the device's resources.

I wrote mostly structural VHDL, so I decided to synthesize each component separately to get a better idea of the low-level utilization. I haven't seen any option in XST to see a hierarchical analysis of area... Anyway, I estimated the resource consumption of my design, excluding routing, the FSM, and some other small amounts of logic and multiplexing:

          Slice Count   Slice FFs   4-input LUTs
          -----------   ---------   ------------
used:           10936       29048          12406
total:          23616       47232          47232
          -----------   ---------   ------------
               46.31%      61.50%         26.27%

Here is the actual:

Number of Slices:             45523 out of 23616   192% (*)
Number of Slice Flip Flops:   22611 out of 47232    47%
Number of 4 input LUTs:       78378 out of 47232   165% (*)
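For anyone wanting to reproduce the percentages, here is a small Python sketch that recomputes them from the counts quoted above. The helper name `utilization` is made up for illustration. The reported actual figures (192%, 47%, 165%) are consistent with XST truncating rather than rounding, although that is only inferred from these numbers.

```python
def utilization(used, total):
    """Return resource utilization as a percentage."""
    return 100.0 * used / total

# (used, total) pairs copied from the two tables above.
estimate = {"slices": (10936, 23616), "ffs": (29048, 47232), "luts": (12406, 47232)}
actual = {"slices": (45523, 23616), "ffs": (22611, 47232), "luts": (78378, 47232)}

for name in estimate:
    est_used, total = estimate[name]
    act_used, _ = actual[name]
    print(
        f"{name:7s} estimate {utilization(est_used, total):7.2f}%  "
        f"actual {utilization(act_used, total):7.2f}%  "
        f"actual/estimate {act_used / est_used:5.2f}x"
    )
```

The last column makes the mismatch easy to see: slices come out at roughly 4x the estimate and LUTs at roughly 6x, while the flip-flop count goes down.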

When looking in the synthesis report, I noticed some warnings indicating that duplicate FFs were removed, so that explains the reduction in FF count. However, I cannot explain the HUGE increase in LUT and Slice usage. What can I infer from this?

The report also tells me that some of my 6-bit counter signals are being replicated (once or twice). What is the cause of this? High fan-out?

FlipFlop cnt_dout_ins_cnt_v_0 has been replicated 2 time(s)
FlipFlop cnt_dout_ins_cnt_v_1 has been replicated 1 time(s)
FlipFlop cnt_hreg_ins0_cnt_v_0 has been replicated 2 time(s)
FlipFlop cnt_hreg_ins0_cnt_v_1 has been replicated 1 time(s)
FlipFlop cnt_hreg_ins10_cnt_v_0 has been replicated 2 time(s)
FlipFlop cnt_hreg_ins10_cnt_v_1 has been replicated 1 time(s)
FlipFlop cnt_hreg_ins11_cnt_v_0 has been replicated 2 time(s)

Is there any way to decipher the cell usage count, perhaps? Does anyone have a URL that includes an explanation of all the cell names? I also checked the macro statistics and everything is accounted for in that table.

Thanks.

-Brandon

Brandon

Hi Brandon, The floorplanner tool might help you track down where most of your usage is. HTH, Syms.

Symon

Brandon,

I would suggest taking a look at the synthesis warnings. Maybe you instantiated the same component twice, or maybe you selected the wrong device size...

But if you did everything OK, then the only thing that could be happening here is that the estimate you get is (sometimes) not close to the actual placement results.

Any details on this?

Vladislav Muravin

Replication is usually from two sources...

1/ the fanout as you suggest
2/ speed improvement
3/ borg

I would suggest there is little hope for your design as you have too much logic. The correct answer is to get a bigger device :-)

However, take a look at any memories: if they are distributed rather than block RAM, they will eat LUTs. Then look at shift registers (SRL16?). Think about what you are trying to achieve and see if there's a simpler solution.

Simon

Simon Peacock

I checked all the warnings and none of them seem significant.

OK, I'm trying to synthesize a design for an N-tap complex MAC FIR. The design serially loads the complex coefficients (currently 64 taps) via a 64-deep, 16-bit-wide shift register. Here is the synthesis blurb of that unit:

Synthesizing Unit .
    Related source file is "/../../../Modeltech_6.0c/projects/espfep/work/srsipo.vhd".
    Found 1024-bit register for signal .
INFO:Xst:738 - HDL ADVISOR - 1024 flip-flops were inferred for signal . You may be trying to describe a RAM in a way that is incompatible with block and distributed RAM resources available on Xilinx devices, or with a specific template that is not supported. Please review the Xilinx resources documentation and the XST user manual for coding guidelines. Taking advantage of RAM resources will lead to improved device usage and reduced synthesis time.
    Summary:
        inferred 1024 D-type flip-flop(s).
Unit synthesized.

Since I was worried about the size of this unit, I synthesized it alone and noticed that it consumed very few slices/FFs (2%, 4%), so I don't think that this is the problem. This HDL ADVISOR message is just an FYI, correct? I do not believe I'd get any benefit from using the on-chip RAM resources?

Anyway, here is the final report for the design.. if it helps.

=========================================================================
*                            Final Report                               *
=========================================================================
Final Results
RTL Top Level Output File Name     : mac_fircplx_wrapper.ngr
Top Level Output File Name         : mac_fircplx_wrapper
Output Format                      : NGC
Optimization Goal                  : Area
Keep Hierarchy                     : NO

Design Statistics
# IOs                              : 102

Macro Statistics :
# Registers                        : 3651
#      1-bit register              : 2817
#      16-bit register             : 258
#      32-bit register             : 256
#      33-bit register             : 128
#      38-bit register             : 128
#      6-bit register              : 64
# Counters                         : 2
#      6-bit up counter            : 2
# Multiplexers                     : 130
#      16-bit 64-to-1 multiplexer  : 130
# Adders/Subtractors               : 320
#      33-bit adder carry in       : 128
#      38-bit adder                : 128
#      6-bit subtractor            : 64
# Multipliers                      : 256
#      16x16-bit multiplier        : 256
# Xors                             : 128
#      1-bit xor3                  : 128

Cell Usage :
# BELS                             : 159480
#      BUF                         : 4
#      GND                         : 1
#      INV                         : 367
#      LUT1                        : 64
#      LUT2                        : 4304
#      LUT3                        : 73793
#      LUT4                        : 217
#      MUXCY                       : 9164
#      MUXF5                       : 33281
#      MUXF6                       : 16640
#      MUXF7                       : 8320
#      MUXF8                       : 4160
#      VCC                         : 1
#      XORCY                       : 9164
# FlipFlops/Latches                : 22611
#      FDC                         : 130
#      FDCE                        : 22114
#      FDCPE                       : 15
#      FDP                         : 64
#      FDPE                        : 288
# Clock Buffers                    : 1
#      BUFGP                       : 1
# IO Buffers                       : 101
#      IBUF                        : 67
#      OBUF                        : 34
# MULTs                            : 256
#      MULT18X18                   : 256
=========================================================================

Device utilization summary:
---------------------------

Selected Device : 2vp50ff1152-5

 Number of Slices:             45523 out of 23616   192% (*)
 Number of Slice Flip Flops:   22611 out of 47232    47%
 Number of 4 input LUTs:       78378 out of 47232   165% (*)
 Number of bonded IOBs:          102 out of   692    14%
 Number of MULT18X18s:           256 out of   232   110% (*)
 Number of GCLKs:                  1 out of    16     6%

WARNING:Xst:1336 - (*) More than 100% of Device resources are used

=========================================================================

I'm not worried about the multiplier over-utilization. I'll probably just reduce the number of taps once I get the other numbers down... I've yet to find any info on the Xilinx cell primitives, i.e., what's the difference between FDC, FDCE, FDP, etc.? Does anyone have any technical documentation on these? If anyone is interested, I could provide a design schematic...

Much thanks,

-Brandon

Brandon

There are 130 16-bit-wide, 64-to-1 multiplexers in your design. That is where your excess logic utilization is coming from. What are they for?

Each 64-to-1 mux is going to be 32 LUTs plus some MUXFXs per bit.

32*16*130 = 66560 LUTs.
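This estimate can be cross-checked against the Cell Usage table in the report. The Python sketch below assumes each mux bit is built from a first stage of LUTs acting as 2:1 muxes, followed by one level each of MUXF5 through MUXF8, with each level halving the number of signals. That mapping is an assumption (the report does not show the actual netlist), and the function name `mux_bit_cells` is made up for illustration.

```python
MUX_COUNT = 130    # 16-bit 64-to-1 multiplexers, from Macro Statistics
MUX_WIDTH = 16     # bits per multiplexer
MUX_INPUTS = 64    # inputs per multiplexer

def mux_bit_cells(inputs):
    """Cells for ONE bit of an inputs-to-1 mux under the assumed mapping:
    a LUT per pair of inputs, then MUXF5..MUXF8 each halving the signal count."""
    signals = inputs // 2
    cells = {"LUT": signals}
    for name in ("MUXF5", "MUXF6", "MUXF7", "MUXF8"):
        signals //= 2
        cells[name] = signals
    return cells

per_bit = mux_bit_cells(MUX_INPUTS)
predicted = {name: n * MUX_WIDTH * MUX_COUNT for name, n in per_bit.items()}

# Counts copied from the Cell Usage section of the synthesis report.
reported = {"LUT": 73793, "MUXF5": 33281, "MUXF6": 16640, "MUXF7": 8320, "MUXF8": 4160}

for name, count in predicted.items():
    print(f"{name:6s} predicted {count:6d}   reported {reported[name]:6d}")
```

Under that assumption the predicted MUXF5 through MUXF8 counts land within one cell of the reported figures (the reported LUT row here is the LUT3 count), which supports the conclusion that the multiplexers account for nearly all of the excess logic.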

Regards,

John McCaskill

John McCaskill

I believe it is the Xilinx "Libraries Guide" you are after for documentation on FDC, FDCE, etc. Look in the ISE install directory and you should find it under xilinx\doc\usenglish\books\docs\lib\lib.pf. Or, in ISE 7.x, go to Help->online documentation; in the left-hand pane you will see "libraries guide".

What is the sample rate and number of bits in your input data? You may have already considered it, but a "distributed arithmetic" filter uses far fewer FPGA resources than a fully parallel filter. The cost is in sample rate: a distributed arithmetic approach to the FIR filter lets you trade off sample rate against FPGA resource usage.
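To make that trade-off concrete, here is a minimal Python sketch of the distributed-arithmetic idea for one output sample with real integer coefficients. It is a behavioral model only, not a description of any Xilinx core; the names `build_da_table` and `da_dot` are made up for illustration. The multipliers are replaced by a lookup table of coefficient sums, and the result is accumulated one input bit position per step, so the step count grows with the input word width.

```python
def build_da_table(h):
    """table[addr] = sum of h[k] for every bit k that is set in addr."""
    n = len(h)
    return [sum(h[k] for k in range(n) if (addr >> k) & 1) for addr in range(1 << n)]

def da_dot(table, xs, bits):
    """Compute sum(h[k] * xs[k]) bit-serially.

    xs are two's-complement integers that fit in `bits` bits.
    One table lookup and one add per bit position; no multipliers.
    """
    acc = 0
    for b in range(bits):
        # Gather bit b of every input sample into a table address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        term = table[addr] * (1 << b)
        # The most significant bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc

h = [3, -1, 4, 2]        # example coefficients (arbitrary)
xs = [5, -7, 0, -8]      # example 4-bit signed samples (arbitrary)
table = build_da_table(h)
print(da_dot(table, xs, bits=4), sum(c * x for c, x in zip(h, xs)))
```

The table has 2^N entries for N taps, so practical designs split a long filter into several small tables; a complex filter would also need one such structure per real multiply.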

Regards Andrew

Andrew FPGA

John,

I need those multiplexers to multiplex the coefficients h[0] through h[63] to each MAC b input. There are 64 complex MACs, so I need 64x2 = 128 64-to-1 multiplexers for the complex 'b' input. There are also two 64-to-1 multiplexers to multiplex the complex accumulator outputs s[0] through s[63] to the output y. That accounts for all 130.

Here is how the timing goes for first two samples: __@ t = 0__ b[0]

Brandon

Hi Brandon, What is the signal you are analysing/filtering? Presumably you have ruled out decimation to get the rate down?

Regards Andrew

Andrew FPGA

Brandon,

I'm not 100% clear on what your filter structure looks like, but for a high throughput, large FIR like this, you should be using a transposed direct form I structure. If you have fanout problems with this, and can tolerate additional latency, you can use a systolic structure. Xilinx has a nice depiction of both in (pp. 84-85):

formatting link

For either structure, you can simply string a shift register together to load your tap coefficients. I hope this helps.
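As a rough behavioral illustration of the transposed structure (not HDL, and not taken from the Xilinx document mentioned above), here is a Python sketch. Every tap multiplies the same incoming sample by its own fixed coefficient and the products are summed through a chain of registers, so no coefficient multiplexing is needed. The function names `transposed_fir` and `direct_fir` are made up for illustration.

```python
def transposed_fir(h, xs):
    """Transposed-form FIR: broadcast each sample to all taps, sum via a register chain."""
    n = len(h)
    state = [0] * (n - 1)  # registers between the adders
    out = []
    for x in xs:
        products = [c * x for c in h]  # each tap keeps its own coefficient
        out.append(products[0] + (state[0] if state else 0))
        # Each register takes its tap product plus the register after it.
        state = [
            products[k + 1] + (state[k + 1] if k + 1 < n - 1 else 0)
            for k in range(n - 1)
        ]
    return out

def direct_fir(h, xs):
    """Reference: y[n] = sum over k of h[k] * x[n-k], with zero initial conditions."""
    return [
        sum(h[k] * xs[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(xs))
    ]

h = [1 + 2j, -3 + 0j, 0 + 1j]                    # example complex coefficients (arbitrary)
xs = [2 - 1j, 0 + 4j, 1 + 1j, -2 + 0j, 3 - 3j]   # example complex samples (arbitrary)
print(transposed_fir(h, xs) == direct_fir(h, xs))
```

In hardware, the coefficient list would correspond to the serially loaded shift register, with each stage feeding one multiplier directly.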

cheers, aaron

aholtzma
