Old vs. New FPGAs

I was updating a CPU design I did a few years ago and I was a bit disappointed in the results I see. The CPU was originally targeted to an Altera ACEX part which is 5 volt compatible (to give you an idea of its age). I did my own CPU because Altera does not support their NIOS for that family. I spent a fair amount of time optimizing the architecture to be easy to implement in 4 input LUTs and other basic elements found in FPGAs. I coded it up for the ACEX async memories and got it running. If memory serves me, it clocked in at 55 MHz max and I used it at 40 MHz.

Recently I wanted to see how fast it might run if I redid it for a current FPGA architecture using synchronous memories. I compiled it for a Spartan 3 and got the speed up to 77 MHz using less than 10% of an XC3S400 (315 slices). I am not impressed with the speed. I expected a much larger increase and had hoped for operation at over 100 MHz. I checked the timing analyzer output and the signal paths are pretty much what I expected: no oddball logic generation, and I got carry chains where I wanted them. The slow paths have a few long route times, so although it may approach 100 MHz with careful floorplanning, I don't think this is worth the effort compared to the >> 100 MHz CPU cores you can get from the FPGA vendors.

I was wondering if this small speed up is typical of improvements from one or two generations difference in FPGAs? The ACEX parts are designed for economy, not for speed, just like the Spartans. When I did the initial design 3 or 4 years ago, the ACEX parts were old news then! Given that there was nothing in the design that is tailored for one FPGA family over another, I guess I expected more like a 2X speedup in the current technology chip. Isn't that reasonable given the vast difference in the timing specs in the data sheets?
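To put the gap between the measured and the hoped-for speedup in numbers, here is a minimal Python sketch. It uses only the clock rates quoted above (55 MHz on the ACEX, 77 MHz on the Spartan 3, 100 MHz hoped for); the function names are made up for illustration.

```python
def period_ns(fmax_mhz: float) -> float:
    """Clock period in nanoseconds for a given maximum frequency in MHz."""
    return 1000.0 / fmax_mhz


def speedup(old_mhz: float, new_mhz: float) -> float:
    """Ratio of new to old maximum clock frequency."""
    return new_mhz / old_mhz


ACEX_MHZ = 55.0       # original ACEX result quoted in the post
SPARTAN3_MHZ = 77.0   # Spartan 3 result quoted in the post
TARGET_MHZ = 100.0    # the speed that was hoped for

if __name__ == "__main__":
    print(f"ACEX period:      {period_ns(ACEX_MHZ):.2f} ns")
    print(f"Spartan 3 period: {period_ns(SPARTAN3_MHZ):.2f} ns")
    print(f"Target period:    {period_ns(TARGET_MHZ):.2f} ns")
    print(f"Measured speedup: {speedup(ACEX_MHZ, SPARTAN3_MHZ):.2f}x")
    # A 2X speedup over the ACEX result would need this clock rate:
    print(f"2X would need:    {2 * ACEX_MHZ:.0f} MHz")
```

The same two helpers work for any pair of results from the timing analyzer, which makes it easy to compare families on period rather than on frequency.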

rickman

This does not surprise me. Xilinx seems to have emphasized size over speed of Spartan as they update it. It is very difficult to get Microblaze to run at 100MHz in a Spartan 3E, so 77MHz without trying is about what I would expect.

Alan Nishioka

Alan Nishioka

I'm just curious, and it might not be applicable to your application, but did you try targeting a Stratix II? If you did, what kind of fMax were you able to achieve?

Derek

rickman wrote:

Derek Simmons

Or just a Cyclone-II - the (currently) latest installment in Altera's low-cost offerings. If you're not supplying timing constraints, be sure to take the fitter out of its default Auto Fit mode, or it will simply give you _a_ possible solution with possibly horrible performance.

Altera is boasting (some) performance advantage over Spartan-3, so here's a chance to see some real field feedback.

Best regards,

Ben

Ben Twijnstra

I see what you mean. I checked the Xilinx site and I was confused thinking that MB would run at higher speeds. They list 100 MHz in the -5 high performance versions while I was running my design in the -4 version. So I guess my performance is not so bad considering that it is not pipelined. Of course with a MISC architecture, it requires more instructions to do the same amount of work as the instructions are not as powerful. I may do some other work to see how practical my CPU design will be in the future. I don't mind doing the leg work to support an FPGA CPU core, but not if it does not have advantages. Right now the only advantage is the size, about 600 LUTs vs. 1300 for MB. I'll need to make sure it will do a decent job of keeping up with the clock.

rickman

Alan Nishioka wrote:

I tried a couple of things, but I was not able to use the floorplanner. I get a fatal error and it crashes. This may be due to it not being able to phone home when it tries to reach out and touch someone. My firewall blocks it and when I click the OK button the floorplanner crashes.

I get different failing paths depending on some of the settings I make, like the Starting Placer Cost Table setting. But the longest path is around 13 ns and has about the same amount of logic and routing delay. Is that normal? These paths all start with a 2 ns clock-to-out from the BRAM. Then there are typically two or three routes that are longer than 1 ns; sometimes one is longer than 2 ns. I can't tell what is weird about this since I can't really "see" it. This path is only 5 levels of logic with no carry chain. Others are 4 levels of LUTs plus a carry chain (although typically only the last few bits of a 16 bit adder for some reason).

Timing constraint: TS_SysClk = PERIOD TIMEGRP "SysClk" 10 ns HIGH 50%;

24616 items analyzed, 84 timing errors detected. (84 setup errors, 0 hold errors) Minimum period is 12.915ns.

--------------------------------------------------------------------------------
Slack:              -2.915ns (requirement - (data path - clock path skew + uncertainty))
Source:             InstFtch/Mram_Inst_Ram1.B (RAM)
Destination:        RegPsw/DebugIrqEn (FF)
Requirement:        10.000ns
Data Path Delay:    12.914ns (Levels of Logic = 5)
Clock Path Skew:    -0.001ns
Source Clock:       SysClk rising at 0.000ns
Destination Clock:  SysClk rising at 10.000ns
Clock Uncertainty:  0.000ns

Timing Improvement Wizard

Data Path: InstFtch/Mram_Inst_Ram1.B to RegPsw/DebugIrqEn

Delay type          Delay(ns)   Logical Resource(s)
----------------------------    -------------------
Tbcko                 2.394     InstFtch/Mram_Inst_Ram1.B
net (fanout=0)        1.792     InstFtch/InstReg
Tilo                  0.608     DecodeSlow/DatStkCntl21
net (fanout=19)       0.758     DecodeSlow/N23
Tilo                  0.608     DecodeSlow/FlagsEn11
net (fanout=6)        0.369     DecodeSlow/N56
Tif5x                 0.911     DecodeSlow/FlagsEn_F
                                DecodeSlow/FlagsEn
net (fanout=0)        1.241     DecodeSlow/FlagsEn
Tilo                  0.551     RegPsw/_not00141
net (fanout=5)        1.079     RegPsw/_not0014
Tilo                  0.608     RegPsw/_not00211
net (fanout=1)        1.393     RegPsw/_not0021
Tceck                 0.602     RegPsw/DebugIrqEn
----------------------------    ---------------------------
Total                12.914ns   (6.282ns logic, 6.632ns route)
                                (48.6% logic, 51.4% route)

Is it normal for the routing delays to range so widely and to total as much as the logic delays?
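The totals in the report can be cross-checked by hand. The following Python sketch re-adds the individual delays from the data path above, splits them into logic and routing, and applies the slack formula printed in the report. The delay values are copied from the report; the variable and function names are hypothetical.

```python
# (delay type, delay in ns) pairs copied from the data path in the report.
# Every entry named "net" is a routing delay; everything else is logic.
PATH = [
    ("Tbcko", 2.394), ("net", 1.792),
    ("Tilo", 0.608), ("net", 0.758),
    ("Tilo", 0.608), ("net", 0.369),
    ("Tif5x", 0.911), ("net", 1.241),
    ("Tilo", 0.551), ("net", 1.079),
    ("Tilo", 0.608), ("net", 1.393),
    ("Tceck", 0.602),
]


def split_delays(path):
    """Return (logic_ns, route_ns) summed over the path elements."""
    route = sum(delay for kind, delay in path if kind == "net")
    logic = sum(delay for kind, delay in path if kind != "net")
    return logic, route


def slack_ns(requirement, data_path, clock_skew, uncertainty):
    """Slack as defined in the report header:
    requirement - (data path - clock path skew + uncertainty)."""
    return requirement - (data_path - clock_skew + uncertainty)


logic_ns, route_ns = split_delays(PATH)
total_ns = logic_ns + route_ns
slack = slack_ns(10.000, total_ns, -0.001, 0.000)
min_period_ns = 10.000 - slack          # period at which slack would be zero
fmax_mhz = 1000.0 / min_period_ns

if __name__ == "__main__":
    print(f"logic {logic_ns:.3f} ns, route {route_ns:.3f} ns, total {total_ns:.3f} ns")
    print(f"route share {100 * route_ns / total_ns:.1f}%")
    print(f"slack {slack:.3f} ns, min period {min_period_ns:.3f} ns, fmax {fmax_mhz:.1f} MHz")
```

Six net delays out of thirteen path elements account for slightly more than half of the total, which is the roughly even logic/route split the report shows.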

This is with nothing else in the chip, so I can only imagine that the path delays will get longer as I combine other logic inside the chip.

I'll give it a try in a Virtex4 part over the weekend and see if that is faster.

rickman

Here are a couple more data points. I changed the part to an xc4vlx25-12 and it exceeded the 100 MHz timing requirement; in fact it ran at 110 MHz. But at -10 it failed, only reaching 84 MHz. On the other hand, the XC3S400-5 weighed in at almost 91 MHz. So speed grade can make a moderate difference.
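As a rough illustration of how much the speed grade matters, the sketch below compares the results quoted in this thread. It assumes the earlier 77 MHz figure was for the XC3S400 in the -4 grade, as stated in a previous post, and treats "almost 91 MHz" as 91 MHz. The function name is hypothetical.

```python
def grade_gain_percent(slow_mhz: float, fast_mhz: float) -> float:
    """Percentage increase in fmax going from the slower to the faster grade."""
    return (fast_mhz / slow_mhz - 1.0) * 100.0


# fmax results in MHz as quoted in the thread (rounded by the poster).
RESULTS = {
    "XC3S400": {"-4": 77.0, "-5": 91.0},
    "xc4vlx25": {"-10": 84.0, "-12": 110.0},
}

if __name__ == "__main__":
    for part, grades in RESULTS.items():
        (slow_name, slow), (fast_name, fast) = sorted(
            grades.items(), key=lambda item: item[1]
        )
        gain = grade_gain_percent(slow, fast)
        print(f"{part}: {slow_name} {slow:.0f} MHz -> "
              f"{fast_name} {fast:.0f} MHz (+{gain:.1f}%)")
```

These are single place-and-route runs, so some of the difference could be run-to-run placement variation rather than the speed grade itself.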

The thing that surprised me the most is that in the Spartan 3 parts the routing was about half the delay in the worst case paths. But in the V4 part routing was over 70% of the delay in the worst case paths! So the LUTs got faster between S3 and V4, but not the routing! In fact, the routing delays were longer in absolute terms, but I'm not sure this was a valid comparison as the longest delays were on different nets between the two parts.
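One way to see why the routing share grows when only the LUTs get faster is a simple what-if calculation. The sketch below takes the logic and routing totals from the Spartan 3 report earlier in the thread (6.282 ns logic, 6.632 ns route) and scales only the logic delay. The scale factors are purely hypothetical and are not V4 datasheet values; the routing is assumed to stay unchanged, which the observation above says was not quite the case.

```python
S3_LOGIC_NS = 6.282   # logic total from the Spartan 3 timing report
S3_ROUTE_NS = 6.632   # routing total from the Spartan 3 timing report


def route_share(logic_ns: float, route_ns: float, logic_scale: float) -> float:
    """Fraction of the path delay spent in routing when the logic delay
    is multiplied by logic_scale and the routing delay is left as-is."""
    scaled_logic = logic_ns * logic_scale
    return route_ns / (route_ns + scaled_logic)


def path_fmax_mhz(logic_ns: float, route_ns: float, logic_scale: float) -> float:
    """fmax implied by the scaled path delay alone (ignores skew and uncertainty)."""
    return 1000.0 / (route_ns + logic_ns * logic_scale)


if __name__ == "__main__":
    # Hypothetical logic scale factors: 1.0 is the Spartan 3 result itself.
    for scale in (1.0, 0.75, 0.5, 0.25):
        share = route_share(S3_LOGIC_NS, S3_ROUTE_NS, scale)
        fmax = path_fmax_mhz(S3_LOGIC_NS, S3_ROUTE_NS, scale)
        print(f"logic x{scale:.2f}: route share {100 * share:.1f}%, "
              f"path fmax {fmax:.1f} MHz")
```

With routing held constant, faster logic alone pushes the routing share well above one half, so a higher routing percentage does not by itself mean the routing got slower.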

I also found a bug in the IDE. When you change parts to evaluate differences, the Summary Report does not change the Target Device. All the other info seems to be correct, but the target stayed the same no matter what I did.

rickman

rickman wrote:

You are correct: the main difference between S3 and V4 is the LUT delay. As a matter of fact, the LUT delay is really, really small in V4. When I made measurements to check this delay I did not want to believe it at first, but then I looked at the datasheet timings and it all correlated. I got signals up to 975 MHz within the slowest V4, while in S3 I think I only got to around 370 MHz.

So the routing really matters!

Antti

Antti
