Nios performance

Hello,

How fast can a Nios processor run when embedded in a speed grade 6 Cyclone FPGA? What is the approximate maximum reachable clock frequency?

Best regards Piotr Wyderski

Reply to
Piotr Wyderski

If you're asking how fast your application can run, then you had better test it on an actual Cyclone. You can get a Nios to run at 140 MHz out of on-chip RAM, but each memory access will cost you at least 5 clocks. Running out of SDRAM is possible at 110 MHz, but each memory access will set you back 11+ clocks.
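As a rough back-of-the-envelope check of those figures, the sketch below turns clock frequency and clocks per access into an upper bound on back-to-back CPU reads per second. It is only an illustration: it treats "at least 5" and "11+" as exact, ignores instruction fetches and writes, and the function name is made up for this example.

```python
def reads_per_second(clock_hz: float, clocks_per_read: int) -> float:
    """Upper bound on back-to-back CPU data reads per second."""
    return clock_hz / clocks_per_read

# Figures quoted in the post; real designs will differ.
onchip = reads_per_second(140e6, 5)    # on-chip RAM at 140 MHz, 5 clocks per access
sdram = reads_per_second(110e6, 11)    # SDRAM at 110 MHz, 11 clocks per access

print(f"on-chip RAM: {onchip / 1e6:.0f} M reads/s")
print(f"SDRAM:       {sdram / 1e6:.0f} M reads/s")
```

Under these assumptions the SDRAM case manages roughly a third of the read rate of the on-chip case, even though the clock is only about 20% slower.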

Also, watch out for bit shifts. They can take 1 clock per bit on the Cyclone.

To overcome these limitations we upgraded our application to a Stratix I to get single-clock bit shifts, and wrote a custom instruction that bypasses the Avalon bus and reads memory in 2 clocks. (Writes are always fast.)

Now our app cooks.

Ken

Reply to
Kenneth Land

Unfortunately, I have no working hardware yet, so I cannot test it. To clarify my requirements: I am going to perform digital signal processing on a data stream. There is no need for rapid control flow branching, everything can be pipelined.

16 to 20 bit fixed-point arithmetic is enough. I need a FIR block, as fast as possible, and an interface to SDRAM that only needs to perform pseudo-DMA burst transfers (say, one M4K RAM block at a time). The NIOS CPU will be responsible for quite simple things: synchronization of the hardware "coprocessors" (the FIR block, a TDMA interface to an AC-97 codec, NCO, parallel interface to a USB2.0 coupler etc.), filter coefficient computation (off-line, can be slow), and CF card data transfer handling. I would like to perform some complex operations using NIOS, e.g. div, sqrt etc., so the CPU should be as fast as possible, but no faster; there's no need for a Stratix device. :-)
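For reference, the kind of fixed-point FIR described here can be modelled in a few lines before committing it to hardware. The sketch below is only a software reference model, not the FPGA block itself; it assumes signed 16-bit samples and Q15 coefficients (the low end of the 16 to 20 bit range mentioned above), and the function name and test data are invented for illustration.

```python
def fir_q15(samples, coeffs):
    """Direct-form FIR on signed 16-bit samples with Q15 coefficients.

    Products are accumulated at full precision, then rounded back to
    16 bits and saturated, as a hardware MAC with a wide accumulator would.
    """
    history = [0] * len(coeffs)       # delay line, newest sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]  # shift the new sample in
        acc = sum(c * h for c, h in zip(coeffs, history))  # wide accumulator
        y = (acc + (1 << 14)) >> 15   # round to nearest, remove Q15 scaling
        out.append(max(-32768, min(32767, y)))  # saturate to int16
    return out

# 4-tap moving average: each coefficient is 0.25 in Q15 (8192 / 32768).
coeffs = [8192] * 4
result = fir_q15([1000] * 5, coeffs)
print(result)
```

With a constant input of 1000 the output ramps up over the first four samples and then settles at 1000, which is a quick sanity check for the scaling and rounding.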

No direct communication between NIOS and SDRAM is necessary, a user-controlled "data cache" in the internal memory banks is enough.

BTW, I'd like to connect a 16Mbit 70ns parallel (1M x 16) FLASH memory chip to the FPGA. Where should I connect it? Can it share the lines with the SDRAM interface (+ a simple address decoder)?

But it is fully pipelined, isn't it?

Best regards Piotr Wyderski

Reply to
Piotr Wyderski

DMA transfers with the supplied SOPC DMA peripheral are fast. It is latency aware, and we can DMA in and out of it as fast as one clock per word.

The data master of the Nios is not latency aware, and even hand-coded assembly doing back-to-back reads from sequential addresses in SDRAM will be no faster than 11+ clocks per read.

So here's the answer: DMA transfers and all writes are fast. Avoid non-DMA reads whenever possible.

Originally for my app we DMA'd an external FIFO into SDRAM buffers for processing. We got that transfer down to one clock per 32-bit word, but the 11 clocks to get each word out to process it killed us.

The solution was a custom instruction that can read the FIFO in 2 clocks, process it, then write the result to an SDRAM buffer. (Remember, writes are fast.)
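To put the two read paths side by side, here is a small sketch that counts only the clocks spent fetching words. The 1024-word buffer size is a hypothetical value chosen for illustration, "11+" is treated as exactly 11, and processing and write costs are left out because the post does not give figures for them.

```python
def read_cycles(words: int, clocks_per_read: int) -> int:
    """Clocks spent purely on fetching `words` words, one read each."""
    return words * clocks_per_read

WORDS = 1024  # hypothetical buffer size, not from the post

avalon = read_cycles(WORDS, 11)  # data-master read from SDRAM, 11 clocks each
custom = read_cycles(WORDS, 2)   # custom-instruction FIFO read, 2 clocks each

print(f"Avalon reads: {avalon} clocks")
print(f"Custom instr: {custom} clocks")
print(f"Read-path speedup: {avalon / custom:.1f}x")
```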

The buffers are eventually DMA'd from SDRAM to the USB2.0 controller.

Hope this helps.

Ken

Reply to
Kenneth Land

Hello Piotr,

With a 1C20, speed grade 6, and Quartus physical synthesis, I achieve 116 MHz (with fast-fit, by contrast: 92 MHz); with speed grade 8 (a bit cheaper...) this drops to 89 MHz (typical design, with SDRAM controller). The real fmax of course depends on your design, e.g. which peripherals you are using, how full your chip is, etc.

If you need the CPU only for simple control tasks, you might also consider using our ERIC5. However, there is no support for fast multiplications and divisions; it is more comparable to an ATMEL AVR in performance (but with a higher fmax).

Regards,

Thomas

"Piotr Wyderski" wrote in message news:cv63gb$to$ snipped-for-privacy@news.dialog.net.pl...

Reply to
Thomas Entner

That's bad, why hasn't Altera done something with this?

I'm going to use the internal RAM blocks intensively.

Of course, thanks!

Best regards Piotr Wyderski

Reply to
Piotr Wyderski

"Thomas Entner" wrote in message news:42176496$0$33864$ snipped-for-privacy@newsreader01.highway.telekom.at...

Hi Thomas,

could you please give the Quartus resource utilization for ERIC5 when targeting the EPM240 and executing from UFM? On your website you claim it would be 50% and that ERIC5 was initially targeted at MAX II. I am just curious to see that report :)

Antti

PS: The two other companies that used to offer 9-bit processor IP cores are now dead and gone; I hope you have better luck!

Reply to
Antti Lukats

Hi Antti,

it was very nice to meet you at the Embedded World. I have rechecked my last MAX II ERIC version (as mentioned, it is not supported at the moment, but could be "revitalised"): it needs 120 LCs (50%); when I removed some debug signals, I even got down to 109 (45%). This includes 9-bit output ports and 9-bit input ports and NO RAM, just the 3 internal registers. As mentioned, the missing RAM blocks of MAX II will reduce the usefulness of a CPU in this "CPLD", as RAM is very expensive, even with your tricks.

The core has changed a bit since the MAX II implementation, so the LC count might differ slightly (up or down) if we were to redo it. The Cyclone version needs about 110 LCells; for MAX II we can remove the PC of our core, but need to add the UFM flash interface.

Even if ERIC5 gets no commercial success (how can it, at this pricing ;-), we will survive, don't worry... Our main business is camera design (where we use ERIC5 for our own products) and FPGA design.

Regards,

Thomas

P.S.: The exact resource usage summary:

Logic cells                      109 / 240 ( 45 % )
Registers                        103 / 240 ( 42 % )
Total LABs                       16 / 24 ( 66 % )
Logic elements in carry chains   14
User inserted logic cells        0
Virtual pins                     0
I/O pins                         20 / 80 ( 25 % )
    -- Clock pins                0
Global signals                   1
UFM blocks                       1 / 1 ( 100 % )
Global clocks                    1 / 4 ( 25 % )
Maximum fan-out node             clk
Maximum fan-out                  104
Total fan-out                    601
Average fan-out                  4.62

"Antti Lukats" wrote in message news:cvc4va$2li$04$ snipped-for-privacy@news.t-online.com...

Reply to
Thomas Entner

That was one of Altera's big mistakes on MAX II: they did not see the opening for state-ROM and tiny-CPU coding, or Smart-UART areas, and so designed a part with no RAMs and crippled code FLASH....

Another option to consider would be to add support for serial memory, like Ramtron's FM25x devices. These allow any mix of RAM and CODE, come in sizes from 4Kb to 256Kb, and have 25MHz serial bus speeds.
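To get a feel for what a 25MHz serial bus would mean for a tiny CPU, here is a rough estimate of fetch rates. It is a sketch under stated assumptions: one bit transferred per serial clock, and a 24-bit command-plus-address header on every random access. The header size and the function name are assumptions made for illustration, not data sheet values, so check the actual device before relying on them.

```python
def serial_reads_per_second(sck_hz: float, overhead_bits: int, data_bits: int) -> float:
    """Reads per second when every read pays `overhead_bits` of framing."""
    return sck_hz / (overhead_bits + data_bits)

SCK_HZ = 25e6        # serial clock quoted in the post
OVERHEAD_BITS = 24   # assumed: 8-bit command + 16-bit address per access
DATA_BITS = 8        # assumed: one byte fetched per access

random_rate = serial_reads_per_second(SCK_HZ, OVERHEAD_BITS, DATA_BITS)
burst_rate = serial_reads_per_second(SCK_HZ, 0, DATA_BITS)  # header amortised over a long burst

print(f"random byte reads: {random_rate / 1e3:.0f} k/s")
print(f"burst byte reads:  {burst_rate / 1e6:.3f} M/s")
```

Under these assumptions, code that executes sequentially could stream at about 3 MB/s, while every jump or random data access would cost roughly four byte-times.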

-jg

Reply to
Jim Granville
