JOP on Spartan-3 Starter Kit

I got the Spartan-3 Starter Kit yesterday from Xilinx. This board is a really good bargain: an XC3S200 and 1 MB of SRAM for just $99. This board makes it hard for guys like Tony Burch or me to sell FPGA boards ;-( Only the Flash is a little bit small... Not too much space left for application data.

However, the board and the documentation are fine. It took me only half a day to port JOP (a Java processor) from the Altera Cyclone to the Spartan (thanks to Ed Anuff, who did the hard part and wrote a memory generator for Xilinx). Only two Xilinx-specific files were needed, for the top level and the memory interface. You can find a Xilinx ISE project under xilinx/s3sk for JOP on this board.

If you have such a board and want to try out JOP:

Download the JOP sources from:

formatting link
Compile the ISE project under ../xilinx/s3sk
Download JOP to the FPGA
Connect a serial cable from your PC to the board
Open a command prompt in ../java/target
Change the COM port in doit.bat
Type: doit test test Clock

That's it, a small Java program should now run on the Spartan!

Martin

---------------------------------------------- JOP - a Java Processor core for FPGAs:

formatting link

Martin Schoeberl

For those who are interested in a short comparison between Cyclone and Spartan-3:

Cyclone EP1C6Q240C6:  fmax: 98 MHz, 2066 LC/Es (34% out of 5980)
Spartan-3 XC3S200-5:  fmax: 82 MHz, 2015 LC/Es (52% out of 3840)

I mean a 4 input LUT with register for the LC/E comparison. The CLB or slice numbers are just confusing. We can see that JOP needs about the same resources in the A and X devices. Both devices used are the fastest speed grade available. Is the Cyclone, although 'older', faster than the Spartan-3?
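The utilization percentages quoted above follow directly from the LC/E counts. A minimal Python sketch, using only the numbers from this post (the function name and dictionary keys are illustrative, not part of any tool):

```python
def utilization(used_lcs: int, total_lcs: int) -> float:
    """Return the share of logic cells used, in percent."""
    return 100.0 * used_lcs / total_lcs


# LC/E counts as reported for JOP in the post above.
results = {
    "Cyclone EP1C6": utilization(2066, 5980),
    "Spartan-3 XC3S200": utilization(2015, 3840),
}

for name, pct in results.items():
    print(f"{name}: {pct:.1f}% of LC/Es used")
```

The post appears to truncate the percentages to whole numbers rather than round them.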

It's interesting when we compare the two devices with respect to LC/Es and memory (In case of memory I count K-Bytes (not bits) and don't care about a 9th parity bit... Why do I need a parity bit for the block RAM? Is there also a parity protection for the SRAM based configuration?):

XC3S50:     1536 LC/Es,  4*2KB   =  8KB,    4 HW multiplier
EP1C3:      2910 LC/Es, 13*0.5KB =  6.5KB
XC3S200:    3840 LC/Es, 12*2KB   = 24KB,   12 HW multiplier
EP1C4:      4000 LC/Es, 17*0.5KB =  8.5KB
EP1C6:      5980 LC/Es, 20*0.5KB = 10KB
XC3S400:    7168 LC/Es, 16*2KB   = 32KB,   16 HW multiplier
EP1C12:    12060 LC/Es, 52*0.5KB = 26KB
XC3S1000:  15360 LC/Es, 24*2KB   = 48KB,   24 HW multiplier
EP1C20:    20060 LC/Es, 64*0.5KB = 32KB
XC3S1500:  26624 LC/Es, 32*2KB   = 64KB,   32 HW multiplier

When we order the parts with respect to LC/E count they alternate in a nice way. Does that mean that our design complexity determines the choice? Not that easy. The X parts have more memory per LC and additional multipliers. However, I don't have prices, a very important 'feature', handy for all these devices :-)
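To make the "more memory per LC" remark concrete, here is a small Python sketch that sorts the devices by LC/E count and computes bytes of block RAM per LC/E. It uses only the figures listed above and assumes 1 KB = 1024 bytes; the names `devices` and `bytes_per_lc` are made up for this illustration.

```python
# (device, LC/Es, block RAM in KB) as listed in the post above.
devices = [
    ("XC3S50", 1536, 8.0),
    ("EP1C3", 2910, 6.5),
    ("XC3S200", 3840, 24.0),
    ("EP1C4", 4000, 8.5),
    ("EP1C6", 5980, 10.0),
    ("XC3S400", 7168, 32.0),
    ("EP1C12", 12060, 26.0),
    ("XC3S1000", 15360, 48.0),
    ("EP1C20", 20060, 32.0),
    ("XC3S1500", 26624, 64.0),
]


def bytes_per_lc(lcs: int, ram_kb: float) -> float:
    """Block RAM bytes available per LC/E (assumes 1 KB = 1024 bytes)."""
    return ram_kb * 1024.0 / lcs


# Order the parts by logic capacity, smallest first.
by_size = sorted(devices, key=lambda d: d[1])

for name, lcs, ram_kb in by_size:
    print(f"{name:9s} {lcs:6d} LC/Es  {bytes_per_lc(lcs, ram_kb):5.2f} bytes/LC")
```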

Martin

Martin Schoeberl

Good to know that people like it, because I'm also "seriously buying" it!

However, being a complete newbie to FPGAs, I would like to know what range of applications this "DO-SPAR3-DK with XC3S200 FT256 Xilinx Spartan-3 FPGA" (just to make sure that we are speaking of the same device!) is good for.

For example, at Xilinx's site there is a list of various (mainly third-party) processor cores, ranging from the MC68000 to the Z80:

formatting link

And for example, CAST Inc.'s C68000 data sheet has a "Table 1: Example Implementation Statistics", where the lowest-end device listed is the Spartan-IIE XC2S400E-7.

Does this mean that it is impossible to fit C68000 into XC3S200 which has only half of the system gates of XC2S400E-7 ? (I don't know whether the gate counts between Spartan-IIE and Spartan-3 series compare linearly.)

Same problem with many other CAST's processor cores mentioned: 80C51, TMS32025 and "Z80 Compatible Microprocessor" CZ80CPU, the data sheets mention only Spartan-3 XC3S400-4 and Spartan-IIE XC2S300E-7 and some larger Virtex-II's as Example Devices on which to implement them.

Does this mean that XC3S200 has not enough logic to implement ANY of these or just that CAST Inc. didn't have XC3S200-device at hand, and thus haven't tested their designs on it?

Also, most of the games and platforms mentioned at:

formatting link
seem to be implemented on at least a 300K-gate device.

So is this 200K-gate XC3S200 thus just a little bit too small for them?

(Hmm... although on "Space Invaders" page:

formatting link
it mentions: "As so few of the available logic elements are used, a much cheaper FPGA could be used along with external memory device(s)." So there is some hope.)

Also, one important question: What is the maximum speed this XC3S200 can be clocked with?

Yours,

Antti.

Antti Karttunen (remove the trailing .do from the address)

Yes, the board is cool, it's just incredibly cheap...

The simplest way to check it out is to download the Xilinx ISE software (it's free) and compile your design. You will see how it fits and whether there are some resources left.

I expect the XC3S200 should do it, since I can easily fit a 32-bit CPU in it.

That depends really on your design. As above, run it through the (free) synthesizer and you will get the numbers.

Martin

Martin Schoeberl
[snip]

As a quick aside, Cyclone has three speed grades, Spartan-3 only two. In general, a speed grade represents about a 15% difference in performance.

Slowest vs. slowest speed grade would be interesting.

---------------------------------
Steven K. Knapp
Applications Manager, Xilinx Inc.
General Products Division
Spartan-3/II/IIE FPGAs
---------------------------------
Spartan-3: Make it Your ASIC

Steven K. Knapp

Hi Martin,

By turning on Minimize Area w/Chains under Fitter Settings/More Settings.../Auto Packed Registers - Cyclone you can cut the LE count to 1868 LEs (from 2066). Quartus doesn't try too hard to put registers & LUTs together unless it runs out of room (or you tell it to with this setting). In my compile, this didn't hurt Fmax (Fmax was 99 MHz). On average, aggressive packing can slightly hurt performance and cause an increase in wiring.

By turning on the "Area" mapping option in synthesis (instead of Balanced), this drops further to 1775 LEs. Fmax = 95 MHz.

Just pointing out that without even looking at the HDL, there are ways to tweak the LE/Fmax trade-off. I'm sure there are some such tricks for Xilinx too. To automatically try out the area optimization tricks in Quartus, run the Design Space Explorer tool, and select "Area Optimization" mode under the Advanced settings. It'll take a while, but this will find you the best settings (for area) for your design.

Yes. This performance result is actually pretty poor as far as Cyclone vs. Spartan-3 goes. We see an average of 80% better performance -- yes, that's 1.8X Fmax -- when comparing the fastest speed grades of the two chips with default "push-button" results from Quartus & ISE over a suite of 49 designs. Another way of looking at it is the slowest Cyclone speed-grade out-performs the fastest Spartan-3 speed-grade by a considerable margin. See
formatting link
for details.

In this particular case, your critical path appears to stretch from a RAM to a RAM (configured as a ROM) with little logic in-between. Logic + routing-rich paths tend to accentuate the speed differences between the two devices, while RAM-heavy paths show a smaller advantage.

And it also depends on which speedgrade you need to buy to meet your performance -- can you get by with a slower speed-grade in Cyclone than you'd need in Spartan-3? Or maybe with the faster Cyclone chips you may be able to get away with a wider bus (less demultiplexing) resulting in fewer LEs but a higher clock speed..

Picking a chip ain't easy... so just go with Altera ;-)

Regards,

Paul Leventis Altera Corp.

Paul Leventis (at home)

Can you try that, and report back the speed gain, and how long it took to find this ? ( IIRC you mentioned +37% in another post ?)

Do you have tips for Martin on how to improve this for Cyclone specific cases ? - ie should the ROM change to logic-based, rather than RAM based, or would a pipeline stage help ?

-jg

Jim Granville

They are often useful for other things.

On FIFOs/buffers: End of packet flag. In-band vs out-of-band signaling.

Used/free flags.

Just plain more bits (wider) for things like table-driven state machines.

-- The suespammers.org mail server is located in California. So are all my other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited commercial e-mail to my suespammers.org address or any of my other addresses. These are my opinions, not necessarily my employer's. I hate spam.

Hal Murray


The critical path is from the bytecode RAM (the instruction cache for the processor), which has a registered address but unregistered data, out through a 'large' table: a jump table that maps bytecode instructions to microcode addresses. I was thinking of adding another pipeline stage in this path. However, then the bytecode branches take one more cycle. When I added a register in this stage, the critical path moved to the ALU and fmax was 106 MHz. Not a big win, and it showed that the pipeline is not so badly balanced.

If you have another good idea, I would be happy to make JOP faster :-)

Martin


Martin Schoeberl


Ok, here it is:

Cyclone slowest (-8): 77.5 MHz
Spartan slowest (-4): 77.8 MHz

Looks now better for X....

And now let's throw in some price numbers. Prices are for single units from arrow.com and avnet.com, both devices in the same package (TQFP144):

Cyclone: EP1C6T144C6: $41.60
Cyclone: EP1C6T144C8: $27.70
Spartan-3: XC3S200-4TQ144C: $19.93 (no price for -5 speed grade)

And relate the price to density and speed in a 'funny' way: price / 1000 LCs / MHz:

EP1C6-6:   $41.60 / 5.980 kLC / 98 MHz   = 7.1 cent / kLC / MHz
EP1C6-8:   $27.70 / 5.980 kLC / 77.5 MHz = 5.98 cent / kLC / MHz
XC3S200-4: $19.93 / 3.840 kLC / 77.8 MHz = 6.7 cent / kLC / MHz

and now it looks again better for A...
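The 'price / 1000 LCs / MHz' figure is easy to recompute. A minimal Python sketch using only the prices, LC counts and fmax values quoted in this thread (the function name and the `parts` table are illustrative):

```python
def cents_per_klc_mhz(price_usd: float, lcs: int, fmax_mhz: float) -> float:
    """Price in cents, normalised by thousands of LCs and by fmax in MHz."""
    return 100.0 * price_usd / (lcs / 1000.0) / fmax_mhz


# (label, single-unit price in USD, LC count, fmax in MHz) from the posts above.
parts = [
    ("EP1C6-6", 41.60, 5980, 98.0),
    ("EP1C6-8", 27.70, 5980, 77.5),
    ("XC3S200-4", 19.93, 3840, 77.8),
]

for label, price, lcs, fmax in parts:
    metric = cents_per_klc_mhz(price, lcs, fmax)
    print(f"{label:10s} {metric:.2f} cent / kLC / MHz")
```

As the thread itself notes, the metric ignores the multipliers and block RAM, so it is only a rough way to rank the parts.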

I did not take into account the multipliers and larger memories in the Spartan, but also not the fact that the Cyclones are available for a longer time (I got my first Cyclone samples 01/2003 and sold the first boards 02/2003 :-)

Martin


Martin Schoeberl

I can't reproduce those numbers (or any of the ones you gave) (for Xilinx, I mean, I don't have Quartus installed). How do you proceed exactly?

Huh? I can buy the XC3S400-4FT256 for $23 apiece in very small quantities (6 pieces in my case), and that's from an Avnet company.

For the XC3S400-4FT256 (assuming the same frequency): $23 / 7.680 kLC / 77.8 MHz = 3.85 cent / kLC / MHz

Numbers ... you can make them tell anything you want ;)

Yup, I think there are 200 LCs used for a Booth multiplier; it should be easy to lower that with a dedicated multiplier (and also go faster, I would guess).

Btw, is there any networking available? I have an Avnet Spartan-3 kit with an Ethernet PHY on board, so it would be nice to get TCP/IP ;)

Sylvain

Sylvain Munaut

Where did you get these numbers from? I get on ISE 6.2.03:

xc3s200-4 Minimum period: 10.428ns (Maximum Frequency: 95.896MHz)
xc3s200-5 Minimum period: 9.503ns (Maximum Frequency: 105.235MHz)

cheers

E.S.

Sylvain,


Set a timing constraint for clk (in this case I used 12 ns). However, this should already be done in the UCF you downloaded with the project. Then look at the 'text-based post-place & route static timing report'. At the end you will find:

Design statistics: Minimum period: 12.848ns (Maximum frequency: 77.833MHz)

Don't let yourself be fooled by the maximum frequency from the synthesis report. These are dummy numbers (in this case 96 MHz...).
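The maximum frequency in the timing report is just the reciprocal of the minimum period. A short Python sketch of that conversion, plus a check against the 12 ns constraint mentioned above (the function names are made up for this example and are not part of ISE):

```python
def fmax_mhz(min_period_ns: float) -> float:
    """Convert a minimum clock period in ns to a maximum frequency in MHz."""
    return 1000.0 / min_period_ns


def meets_constraint(min_period_ns: float, constraint_ns: float) -> bool:
    """True if the achieved minimum period is within the requested period."""
    return min_period_ns <= constraint_ns


# Post place & route result quoted above.
period = 12.848
print(f"fmax = {fmax_mhz(period):.3f} MHz")
print("12 ns constraint met:", meets_constraint(period, 12.0))
```

With the numbers from the report, the 12 ns constraint is missed by a little under 1 ns, which matches the roughly 78 MHz result rather than the synthesis estimate.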


The list price for the XC3S400-5FT256C (they don't have the -4 on the website) at avnet.com is $41. I just compared the prices that are available 'online'. I also got the Cyclone (Q240 package) cheaper: EUR 22.75 instead of $ ??..... for this device there is a 'call for quote' at arrow.com. The lead-free part costs $32.90, but these are more expensive in general.

Btw, does somebody know why the lead-free devices are more expensive? I didn't know up to now that semiconductors contain lead. I only know that it's part of the solder and that, when it's forbidden, it will probably increase the production cost of PCBs.


Yes, that would drop the LC count and I could go with the next smaller Spartan-3. Oops, where is the XC3S100? The multiplication would be faster, but multiplication (the imul bytecode in the JVM) has a dynamic instruction frequency of 0.24% in typical Java programs. That would not compensate for the clock frequency factor of 1.2 between the Cyclone and the Spartan-3.
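A rough way to see why a faster multiplier cannot make up a 1.2x clock difference is to weight cycle counts by instruction frequency. The Python sketch below does that; only the 0.24% imul frequency comes from the post. The cycle counts (35 cycles for a serial multiply, 1 for a hardware multiply, 2 for an average bytecode) are hypothetical placeholders, not JOP's real timings.

```python
def overall_speedup(freq: float, cycles_before: float, cycles_after: float,
                    cycles_other: float) -> float:
    """Speedup in average cycles per bytecode when one instruction gets faster.

    freq is the dynamic frequency of the improved instruction (0..1);
    cycles_other is the average cycle count of all other bytecodes.
    """
    before = (1.0 - freq) * cycles_other + freq * cycles_before
    after = (1.0 - freq) * cycles_other + freq * cycles_after
    return before / after


IMUL_FREQ = 0.0024  # 0.24%, from the post above

# Hypothetical cycle counts, chosen only for illustration.
speedup = overall_speedup(IMUL_FREQ, cycles_before=35.0, cycles_after=1.0,
                          cycles_other=2.0)
print(f"overall speedup from a faster imul: {speedup:.3f}x")
print("compensates for a 1.2x clock ratio:", speedup >= 1.2)
```

Even with these generous assumptions the gain stays at a few percent, well short of the 1.2 clock factor.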


Do you mean available with JOP? Yes, I have a small TCP/IP stack in Java with drivers for the CS8900 (Ethernet), PPP and SLIP. Even a small webserver is running on JOP: http://84.112.19.23 ;-)

Martin


Martin Schoeberl


Don't take the numbers from the synthesizer! Use the frequency after P&R, from the post-P&R static timing report. This mistake is made by many ISE users. Xilinx should change the text in the synthesizer report to state clearly that this number is an estimate!

Martin

Martin Schoeberl

Hi Paul,


Thanks for the hint. It's nice to get it smaller AND faster.


I could not find the 'Design Space Explorer' in Quartus. If you mean the Resource/Timing Opt. Advisor under 'Tools', then I'm out of luck. This function is not available with the web edition of Quartus.


In this case I think it's a logic/routing-rich path. From the memory data out (unregistered) there is an 8-bit 'lookup table' and an adder, resulting in 6 LCs until the next register (in this case the address register of another RAM). Perhaps the ALM structure of the Stratix II would help for this function, but the Stratix devices are too big and too expensive ;-) Or a clock-free ROM, as was available in the ACEX parts. In fact, Quartus implements this structure in a block RAM when targeting the ACEX device (I still have several ACEX boards lying around and collecting dust ;-).

Martin


Martin Schoeberl

You can find a 6809 with VGA, UART and keyboard controller running on the Starter Kit at:

formatting link

Martin

Martin Schoeberl

Design statistics: Minimum period: 17.812ns (Maximum frequency: 56.142MHz)

... Even with effort level high, it's even worse !?

Design statistics: Minimum period: 18.508ns (Maximum frequency: 54.031MHz)

Could you send me the exact files you compile to '246tnt' at the domain gmail dot com? Maybe ISE just doesn't like VMware?

Yup, I know that the warning is pretty clear

--- QUOTE --- NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE. FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT GENERATED AFTER PLACE-and-ROUTE.

------------

The alloys used are more complex and use more precious metals (for the solder balls and the solder plating of the terminals): Sn/Pb before, and now something like nickel/palladium.

;) I'm more interested in the space I win to connect more devices to JOP, like an Ethernet MAC, an I2S master, an LCD controller ...

Sure, I never meant it would fill the gap!

Silly me ... I must have had a window over my browser hiding the link ;) Since you use an external Ethernet controller, I guess I would need a MAC inside the FPGA and the appropriate drivers for it too.

Sylvain

Sylvain Munaut


That's strange, perhaps you have a different version (my ISE is 6.2 as shipped with the board).


Ooh, I'm sorry that I did not read it and complained about missing it in another thread. One excuse: I usually don't read the synthesis results, only the P&R reports. I only had to post about it since I get many of these high fmax reports from Xilinx users (and this was an issue in the MB thread too).

Solder balls ok, but that difference in QFP packages?


But a MAC is a big and difficult beast, and you still need an external chip for the voltage levels. An external Ethernet chip is cheap, works, and you usually get a lot of memory for buffering Ethernet frames.

Martin


Martin Schoeberl

Hi Martin

Thanks for the files. I finally got the same result. The problems were:
- an ISE 6.2 that was not updated
- a 'bad' constraint file

I think the pins are plated with something similar to the solder to get good soldering. That plating probably must be "updated". The reflow temperature is also higher, IIRC.

Yeah, but the devboard I have already has the PHY (standard Avnet Spartan-3 kit). But indeed the MAC seems pretty big ;( about 2000 slices.

Sylvain

Sylvain Munaut

That's an easy one: because they can. It's a good place to do a little cost recovery/price racking, as users will have designed in the devices and are thus held captive by both the legislation and the layout. Plus, many do not compare Pb/Pb-free prices, so that's the ideal time to nudge the prices!

-jg

Jim Granville
