Achieving required speed in Virtex-II Pro FPGA


v_mirgorodsky 21 years ago

Hi, ALL!

Several months ago I did a schematic-based design implementing median image filtering in an Altera EP1K30TC144-2. It ran at close to 150MHz without any explicit constraints except the target clock frequency. At that time I did not need that much speed, because my device was providing data at only 60MHz or below.

Now we are busy with another device, capable of running at 150MHz, and we have an XC2VP4 speed grade 6 Xilinx FPGA as the data processing unit within the device. I rewrote the design in VHDL. During verification, the RTL schematic of the synthesized VHDL code looked exactly like the schematic for the Altera ACEX 1K device. The only issue was speed: the VHDL reincarnation of the median filter ran at only 134+MHz. The floorplan editor showed half of the chip polluted with the registers and multiplexers of the design. I tried to set some constraints on the VHDL code to reduce the area where this block is located. I spent about 6 hours playing with various placement/timing/routing attributes and constraints but failed to get anything better.

So, is there any guide on constraint strategy? I read the constraints guide, but there are too many choices. I managed to remove a couple of setup errors by explicitly placing combinatorial logic and registers in adjacent slices, but it would be a horrible idea to do manual chip routing :(

With best regards, Vladimir S. Mirgorodsky


Mike Treseler 21 years ago

That is quite often all you need.

You have changed not just design entry, but the device and the function. There are lots of ways to drop fmax from "something close to 150MHz" to 134+ MHz.

Constraints are for fine tuning. You could evaluate other synthesis tools, but I expect you need to make some design tradeoffs. Maybe a faster part, or use up extra resources for wider datapaths.

-- Mike Treseler


Marc Randolph 21 years ago


Howdy Vladimir,

I'd investigate the timing analyzer output and see what is holding the design back. Do you have 5 failing paths or 100? Are there large fanouts involved on some of the failing paths? Are there too many levels of logic on some of the paths, and if so, can a pipeline stage be moved forward or back to help break up the levels of logic?

Honestly, a -6 speed grade V2Pro should pretty easily meet 150+ MHz if fanout and levels of logic are kept under control.


I agree, and in this case, I suspect it would be unnecessary. What synthesis tool are you using? What clock speed are you telling the tools? You might try a slightly faster target speed to see if it helps you come much closer to meeting your period constraint, in addition to lowering fanout limits and investigating the number of levels of logic.

Good luck,

Marc


v_mirgorodsky 21 years ago


I have only 10 timing errors. I have a 10-bit wide data bus inside the filter, delayed on SRL16 elements to save some triggers. The output of the SRL16 goes directly to two comparators and two muxes driven by the -ge and -lt output bits, and two bits within this bus violate the timing requirements. By manual placement I managed to cure one bit, but got an error in another. The timing report says I have about 5-7 logic levels on the failing paths. The fan-out for the failing bits is 3-4 on average, or at least that is what the tool reports. Actually, I can add another pipeline stage between the 2-to-1 muxes and the comparator outputs, but this will bring another 30+ triggers into the design, which is not good. In the ACEX 1K such an optimization brought the speed up to 200MHz; actually, it was there from the beginning, but we did not need that fast a solution and traded speed for area. It did run in the slower Altera chip, so what should I do to get the same result out of the considerably faster Xilinx chip?
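A minimal sketch of the extra pipeline stage described above, assuming a two-input compare-and-swap cell of the kind a median filter is usually built from. The entity name `sort2_pipelined`, its ports and the 10-bit default width are illustrative assumptions, not the original code. The comparator result and both operands are registered first, so the comparator and the 2-to-1 muxes fall into separate clock cycles, at the cost of 2*WIDTH+1 extra flip-flops per cell and one more cycle of latency.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical compare-and-swap cell with a register stage between the
-- comparator and the output muxes. Names and widths are assumptions.
entity sort2_pipelined is
  generic (
    WIDTH : positive := 10                      -- bus width used in the thread
  );
  port (
    clk : in  std_logic;
    a   : in  unsigned(WIDTH - 1 downto 0);
    b   : in  unsigned(WIDTH - 1 downto 0);
    lo  : out unsigned(WIDTH - 1 downto 0);     -- smaller operand, 2 clocks later
    hi  : out unsigned(WIDTH - 1 downto 0)      -- larger operand, 2 clocks later
  );
end entity sort2_pipelined;

architecture rtl of sort2_pipelined is
  signal a_q      : unsigned(WIDTH - 1 downto 0) := (others => '0');
  signal b_q      : unsigned(WIDTH - 1 downto 0) := (others => '0');
  signal a_ge_b_q : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: register the operands together with the comparator result,
      -- so the comparison is the only logic in this cycle.
      a_q <= a;
      b_q <= b;
      if a >= b then
        a_ge_b_q <= '1';
      else
        a_ge_b_q <= '0';
      end if;

      -- Stage 2: the muxes select on the registered compare bit, using the
      -- operand copies that were captured on the same edge as that bit.
      if a_ge_b_q = '1' then
        lo <= b_q;
        hi <= a_q;
      else
        lo <= a_q;
        hi <= b_q;
      end if;
    end if;
  end process;
end architecture rtl;
```

Any parallel data path that bypasses such a cell would need one extra cycle of delay as well to stay aligned.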

This is the only hope :)


I tried slightly faster clock constraints. Instead of 150MHz I asked the tool to place and route my design to meet something like 166+MHz. The result was exactly the same. 134+MHz is some sort of hard border which is almost never crossed :( I am using ISE 6.3 SP1 for all synthesis, placement and routing operations.

With best regards, Vladimir S. Mirgorodsky


v_mirgorodsky 21 years ago

Dear Mike Treseler,

It WAS running fast enough in the slower Altera chip, so it SHOULD run at the same fmax or better in the faster Xilinx chip, right?

Regards, Vladimir S. Mirgorodsky


Antti Lukats 21 years ago


Wrong! Don't ever assume anything like that.

The faster fmax can most likely be achieved on the faster Xilinx part, but in general, if a design reaches some fmax on one device, retargeting it to a new FPGA architecture may require some adjustment to achieve comparable performance. The ways synthesis tools map a design onto different FPGA architectures are very different.

Antti


v_mirgorodsky 21 years ago

Dear Antti Lukats,

I am just curious: how does one optimize VHDL code for Xilinx versus Altera? Yes, I know some elements may be implemented more efficiently in Xilinx chips, others in Altera chips. You may target your design to use one element or another, but generic triggers, multiplexers and adders cannot be optimized for a particular FPGA architecture within VHDL without using black-box primitives.

My concern about the Xilinx tools is that, with default settings, they do not give performance comparable to the Altera tools.

With best regards, Vladimir S. Mirgorodsky


Marc Randolph 21 years ago


7 is quite a few, but probably not impossible. Unfortunately, that many levels of logic, combined with almost any fanout, gives the tools a chance to make very poor placement choices - as you've seen.


Does the fanout come directly from the LUT that is used for the SRL, or did the tools do the right thing and use a FF? It may or may not be obvious from the timing report - you might have to use FPGA editor to check.

Regardless, you might consider making your SRL one bit shorter and forcing there to be another FF after the SRL. You might even go so far as to fan out the output of the SRL to two or more FFs, and have THOSE feed the rest of your logic. You may need a keep property on the FFs to keep them from being optimized out.

What are triggers? Do you mean FFs? They are basically free in most FPGA designs, and are vital to high-speed designs.


I'm not sure what you're asking. For the same design, the V2Pro part is running considerably faster than the old ACEX part, is it not?

If possible, try to get ahold of Synplify from Synplicity for your synthesis. They will often do eval's so that your purchasing department doesn't have to get involved until AFTER you see the (hopefully better) results.

Good luck,

Marc


Jon Elson 21 years ago

Xilinx has dedicated carry logic that makes anything with carries (adder, magnitude comparator) much faster. I know Spartan much better than Virtex, so what I know may not apply. But, what I'm wondering is if something you are doing in the schematic is excluding the use of the dedicated carry function. In one design I had to laboriously copy the way a Xilinx library macro used the carry components when I made up a slightly different macro. IIRC it also was a magnitude comparator, but I needed a greater than or equal to function. All I can say is the thing works, but I don't actually understand these carry components, I was just copying the thing pretty blindly. But, the macro synthesizes to a much smaller and faster instance on the Spartan chip when it uses these carry blocks.

Jon


Vic Vadi 21 years ago

Hi Vladimir,

In the Xilinx software the default settings are optimized for software run time, not for effort level. Are you using a PC or Unix? I am more familiar with the PC: there, if you right-click on "Synthesize - XST" and select Properties, you can choose the effort level and various other options.

SRL16s are an extremely efficient way to use a single LUT as a 16-deep shift register. If you want to force the use of an actual flipflop in order to meet timing, you can place a reset on the last flipflop. Registers with resets are implemented in regular flipflops, while flipflop chains with no resets can take advantage of the SRL16 feature, which could save you a lot of area.
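A minimal sketch of this idea, assuming a plain VHDL delay line that relies on tool inference rather than instantiated primitives. The entity name `delay_line`, its generics and the default depth of 8 are illustrative assumptions. The first DEPTH-1 stages have no reset, so a synthesis tool is free to map them onto SRL16s; the last stage has a synchronous reset, which an SRL16 cannot provide, so it should end up in an ordinary flip-flop that drives the downstream logic. Whether a given tool version actually infers it this way is worth checking in the synthesis report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical delay line: DEPTH clock cycles of delay on a WIDTH-bit bus.
-- All names and default values are assumptions made for illustration.
entity delay_line is
  generic (
    WIDTH : positive := 10;                 -- bus width used in the thread
    DEPTH : integer range 2 to 17 := 8      -- DEPTH-1 stages fit one SRL16 per bit
  );
  port (
    clk  : in  std_logic;
    rst  : in  std_logic;                   -- synchronous reset, last stage only
    din  : in  std_logic_vector(WIDTH - 1 downto 0);
    dout : out std_logic_vector(WIDTH - 1 downto 0)
  );
end entity delay_line;

architecture rtl of delay_line is
  type taps_t is array (0 to DEPTH - 2) of std_logic_vector(WIDTH - 1 downto 0);
  signal taps : taps_t := (others => (others => '0'));
  signal last : std_logic_vector(WIDTH - 1 downto 0) := (others => '0');
begin
  -- Stages 1 .. DEPTH-1: plain shift with no reset (candidate for SRL16).
  shift : process (clk)
  begin
    if rising_edge(clk) then
      taps(0) <= din;
      for i in 1 to DEPTH - 2 loop
        taps(i) <= taps(i - 1);
      end loop;
    end if;
  end process shift;

  -- Stage DEPTH: resettable register, intended to stay a real flip-flop.
  output_reg : process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        last <= (others => '0');
      else
        last <= taps(DEPTH - 2);
      end if;
    end if;
  end process output_reg;

  dout <= last;
end architecture rtl;
```

The total latency is still DEPTH clocks, so replacing an existing DEPTH-deep shift register with this structure does not change the data alignment.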

- Vic


Phil 21 years ago

Hi Victor,

The quality of your VHDL code definitely influences the achievable frequency of your design. Even the best VHDL synthesis tools are not able to generate a 'good' netlist from 'bad' code.

For the moment, you can try to have somebody experienced with VHDL take a look at your code.

Furthermore, you should do a detailed critical path analysis. It might be that just a tiny piece of VHDL causes the problem (maybe a RAM implemented out of registers, etc.).

You also can try a better synthesis tool such as Synplify Pro from Synplicity.

Regards, Phil


v_mirgorodsky 21 years ago

Hi ALL,

I got the problem solved, though not in a very efficient way. I replaced the SRL16 elements with conventional triggers and now the design flies: the fmax went all the way up to 214+MHz.

The only thing left to figure out is why conventional triggers do such a good job while the "very efficient" SRL16s appeared to mess everything up :(

With best regards, Vladimir S. Mirgorodsky


Antti Lukats 21 years ago


Hm, that's strange. There is usually an unused flip-flop at the 'end' of the SRL16, so making the SRL16 one clock shorter and using that flop should give the same performance as using only flip-flops. If what you say is so, then it must be a bug in the timing estimation??

antti


v_mirgorodsky 21 years ago

Hi Vic,

I am using the PC version of the software. Do you have an idea how to tune XST for speed or for area? During my experiments, any change to the effort/packing/speed controls brought fmax down. I understand that they should all be tuned together in some particular fashion, but there are too many variants, and the relations between the controls are not evident.

With best regards, Vladimir S. Mirgorodsky


Antti Lukats 21 years ago


Hm, stupid question: did you constrain the clock for the speed you need? The clock can be constrained to a higher frequency than the fmax reported when running with no constraints. Also, don't constrain too high, just to the fmax you actually need; if the timing misses the constraint even by a small margin, the result can be far worse than what is achievable.

antti


v_mirgorodsky 21 years ago

There is only one question left in that case: how to instruct ISE to put the unused flip-flop at the end of the SRL16 shifter in the same slice without doing an explicit placement operation? Yes, there is a constraint called RLOC. If you put suitable constraints on the SRL16 block and the trigger, then they may go into the same slice, but this leads to completely unportable code, even between Xilinx chip families.

The design with pure triggers runs fast enough, and I have no clue why the one with SRL16s does not.

With best regards, Vladimir S. Mirgorodsky


v_mirgorodsky 21 years ago

Sure, I constrained the clock path and explicitly told the tools that my Clk line is the clock for my design :) I put a constraint of about 155+MHz on the clock line while requiring only 150MHz; it is always good to have a couple of percent in reserve :)

By manual routing I was able to correct the timing errors on the data bits, so I am wondering why the tool cannot do the same :(


Mike Treseler 21 years ago

Sounds like two good reasons to keep the design generic. I expect that the SRL regs are not quite as fast as the standard registers.

-- Mike Treseler


Jim Granville 21 years ago

Maybe someone from Xilinx can comment on that - does using SRL16's have a real speed impact ?

-jg


Pete Fraser 21 years ago

"Mike Treseler" wrote

I found that on an old 2V1000 design that I did. I was using plenty of SRLs to compensate for filter delays, but I had to stick an extra register on the output of each one to get decent speed.
