V2p block ram clock -> Q delay help

Hi all, I have a long combinational path in my FPGA design and I am looking for ways to shorten it. One of the biggest contributors is the clock to Q delay from memory on some of the inputs to the path. The memory (block RAM) is currently very wide and not deep. Is there a way to optimize the size or any other parameters to decrease the clock to Q time?

Thanks

Matt

Matthew E Rosenthal

Pipelining is the most obvious and most popular way to reduce long delays. When it can be used, it is great...

Peter Alfke

Peter Alfke

Unfortunately that cannot be implemented. I was hoping for something specific to the BRAM clock -> Q delay.

Matt

> Pipelining is the most obvious and most popular way to reduce long delays.

Matthew E Rosenthal


The BlockRAM CLK -> Q can only be improved by going to a faster speed grade. If that is out of the question and adding latency is not an option, then you have two choices: either look elsewhere to reduce the path delays or do not use the BlockRAM.

You mention that you do not need the RAM very deep but do need it very wide. Have you considered using LUT-based RAM (RAM16X1S)? You can configure LUT-based RAM fairly easily in depths of 16, 32 and 64 and will see a better CLK -> Q than in the BlockRAMs. On top of that, you will likely see better placement for wider buses, since the bits are not all tied together as they are in a BlockRAM. Also, LUT-RAMs have asynchronous reads, so if you want to keep that clock cycle of latency for your reads, you can either add a register to the output of the RAM in the same slice, getting that latency back while still getting a good CLK -> Q, or push that register deeper into your critical path, which may give a better balance of registers in that path and thus much better timing.

You can configure the LUT-RAMs to depths greater than 64, but you start to consume a lot of LUT resources and the trade-off is not as great. My suggestion is that if you can get by with a depth of 64 or less per data bit, you might as well go to LUT-RAM. If you need deeper RAMs, stay in the BlockRAM and look at reducing routing delays (you can try adding placement constraints, replicating registers/logic, higher effort levels in Map/Par, etc.) or logic levels for those critical paths (try harder synthesis constraints/options, re-coding that section of the design, etc.).

Good Luck,

-- Brian

Brian Philofsky
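
To put rough numbers on the LUT-RAM trade-off Brian describes, here is a minimal Python sketch. It assumes only that one 4-input LUT stores 16 x 1 bit (as in RAM16X1S) and counts storage LUTs, ignoring any output multiplexing needed for depths above 16. The function name `lut_ram_luts` and the 128-bit example width are hypothetical, chosen purely for illustration.

```python
import math

LUT_BITS = 16  # one 4-input LUT used as RAM holds 16 x 1 bit (e.g. RAM16X1S)


def lut_ram_luts(depth: int, width: int) -> int:
    """Return the number of LUTs used for storage by a depth x width LUT-RAM.

    Only storage is counted; multiplexers that combine several 16-deep
    LUTs into a deeper memory are not included.
    """
    if depth < 1 or width < 1:
        raise ValueError("depth and width must be positive")
    return width * math.ceil(depth / LUT_BITS)


if __name__ == "__main__":
    width = 128  # hypothetical "very wide" memory
    for depth in (16, 32, 64, 128, 256):
        print(f"{depth:4d} x {width}: {lut_ram_luts(depth, width)} LUTs")
```

The storage cost grows linearly with depth, which is why the trade-off against a BlockRAM gets less attractive beyond a depth of about 64.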

That sounds like you are in trouble: a design with no room for pipelining? One thing I want to point out (Ray mentioned it before?): you need to do manual placement so that the flip-flop can sit next to the block RAM. Automatic placement sometimes fails to do that.

thangkho

--

--Ray Andraka, P.E. President, the Andraka Consulting Group, Inc.

401/884-7930 Fax 401/884-7950 email snipped-for-privacy@andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759

Ray Andraka


The clock to Q of the BRAMs is what it is, and it is longer than the clock to Q of the flip-flops in the fabric. The best solution is to pipeline the BRAM outputs by adding a register. For maximum performance, that register should be placed immediately adjacent to the block RAMs to minimize the routing delays out of the block RAM. The automatic placement does an exceptionally poor job at placing pipeline registers on BRAM outputs, so in order to have them be of much use you have to do a little floorplanning.

Now you mentioned you can't afford to pipeline the design, which I'll trust you on for the moment. If that is the case, then you'll have to live with the long clock to Q from the BRAM, although it doesn't mean you also have to live with the routing delays to the logic you have connected to them, nor necessarily the propagation delay through that logic.

First, look at the logic connected to the BRAM outputs. Is it designed for minimum propagation delay to the next flip-flop? Is there anything you can do to reduce the number of LUTs it passes through? Are you using the carry chain (the carry chain can be expensive in terms of propagation delay)?

Next, look at your timing report. It enumerates how much of the delay between the BRAM and the flip-flop is attributed to logic and how much to routing, and gives you the delay for each net in the path. You need to reduce those delays by placing the logic as close to the BRAMs as you can get it. If your design is like many novice FPGA designs, your signal goes through several LUTs before reaching a flip-flop. Each LUT has a flip-flop with it, so pipelining comes for free if you can afford the latency, but I assume you know that.

Anyway, the automatic placer does all right with placing one level of logic (levels of logic are the number of LUTs the signal passes through between flip-flops), but when there are two or more levels of logic, the placer does quite poorly, often placing the LUTs far away from the direct path between the flip-flops. What you need to do is constrain the placement of the flip-flops as well as all the logic between the flip-flops and the BRAM so that it is kept as close to the BRAM as practical. An area constraint on that logic will help, although the ultimate performance will come from hand placing that critical logic.

Another consideration is that the automatic router in recent versions of ISE has gotten lazy compared to the router in versions from 2 years ago. The current router no longer finds the shortest route between well-placed logic; rather, it stops optimizing each route as soon as the route is under the timing constraint. The result is that you wind up with every route being a critical route, and in dense high-performance designs you get congestion so that the router can't find a solution that meets timing. Running the router multiple times in the reentrant mode will sometimes improve the results, but usually will not achieve the level of performance you can get with a hand route, or in the case of VirtexI devices what you could achieve with the version 3 sp8 tools. If placement constraints alone don't get your timing to where it needs to be, you can try doing some hand routing of that circuit using FPGA editor. At the very least, that will tell you how much performance is possible, and if the level of performance you seek is possible with your circuit, it may be the only way to reach it given the current state of the tools without further changes to your design.



Ray Andraka
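
The timing-report breakdown Ray describes (clock to Q, then logic delay per LUT and routing delay per net, then setup at the destination flip-flop) can be sketched as a simple budget. This is only an illustration in Python: the class `PathDelay`, its field names, and every delay value below are hypothetical and are not taken from any data sheet or real timing report.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PathDelay:
    """One register-to-register path, with all delays in nanoseconds."""
    clk_to_q_ns: float          # clock to Q of the source (BRAM or flip-flop)
    logic_ns: Sequence[float]   # delay through each LUT on the path
    route_ns: Sequence[float]   # delay of each net on the path
    setup_ns: float             # setup time of the destination flip-flop

    def total_ns(self) -> float:
        return (self.clk_to_q_ns + sum(self.logic_ns)
                + sum(self.route_ns) + self.setup_ns)

    def slack_ns(self, period_ns: float) -> float:
        """Positive slack means the path meets the clock period."""
        return period_ns - self.total_ns()

    def routing_share(self) -> float:
        """Fraction of the total path delay spent in routing."""
        return sum(self.route_ns) / self.total_ns()


# Hypothetical path: BRAM output through two LUTs, placed by the tools.
auto_placed = PathDelay(clk_to_q_ns=2.0, logic_ns=[0.5, 0.5],
                        route_ns=[1.5, 2.0], setup_ns=0.5)

# Same logic, assuming hand placement next to the BRAM shortens the nets.
hand_placed = PathDelay(clk_to_q_ns=2.0, logic_ns=[0.5, 0.5],
                        route_ns=[0.5, 0.5], setup_ns=0.5)

if __name__ == "__main__":
    period_ns = 6.0  # hypothetical clock period
    for name, path in (("auto", auto_placed), ("hand", hand_placed)):
        print(f"{name}: total {path.total_ns():.1f} ns, "
              f"slack {path.slack_ns(period_ns):+.1f} ns, "
              f"routing {path.routing_share():.0%}")
```

The clock to Q term is the same in both cases, so with these made-up numbers the whole improvement comes from the routing terms, which is the part that placement constraints can influence.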
