Nios II Going Live...

Sorry Austin,

I just asked: for the xc3s400-fg456 I can get CES samples now. I can probably get parts in 14 weeks, so I'll stay with Spartan-II.

So, Xilinx did something right, and it is wrong again? ;-)

Reply to
E.S.

ES,

Yes. Xilinx just can not do anything at all "right."

The most FPGAs shipped in the history of FPGAs (Virtex family), the only FPGA with embedded processors, the first FPGA ever with 10 Gb/s transceivers, lowest interrupt latency of any soft processor core (and even better than most hard processors), 40% speed improvement in our tools, over 250K seats of software shipped, XCell Magazine with a subscription larger than the premier electronics mag......

Such a bummer, I guess we must just keep striving to be better and better!

Austin

Reply to
Austin Lesea

In deeply embedded systems (i.e. no RTOS), the use of the windowed registers is extremely useful due to its speed. When you start using the processor in applications that have an RTOS, it's a different story. Each time you have to do a context switch, unless the RTOS is really clever, you have to save out the whole set of registers associated with the task that is getting swapped out and read in the set of registers for the task that is getting swapped back in.

Initially soft-core microprocessors on FPGAs were used as simple control processors by the HW engineers in place of state-machines. So only rarely did someone want to run an OS on them. But as they have gained more acceptance, engineers want the same tools that other microprocessors provide.

With the original Nios, a compiler option allowed you to choose between using the windowed registers and a flat register set. With Nios II we optimized for size and speed, and the architecture we chose did not use the register windows.

-Joel-

Reply to
Joel A. Seely

The problem here is that because you accept/target a 'less than really clever' RTOS, you also compromise the available peak performance.

...but the first group have not 'gone away' ?

One advantage of an FPGA core is you CAN change it as tools evolve :)

One path that appeals for embedded design is the hyperthread/switched CPU approach that is now appearing in mainstream MPUs (and so tools will follow, over time). E.g. Ubicom divide their newest CPU into (IIRC) 64 time slots, and tasks/processes can have N, M etc. of those slots assigned. The result is good granularity of horsepower allocation, and very hard real-time performance. With an FPGA, you could assign the hard real-time stuff to one core, with register pointer features ON, running the small, time-paranoid code. Time-muxed on the other core, you can run the softer-time stuff, on an RTOS, with register pointer features OFF. In this approach, you are really multiplexing at the slowest memory BUS pivot, rather than context thrashing a single, fast core.
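The slot-assignment idea can be sketched as a small table builder. This is only an illustration, not Ubicom's actual scheduler: the function name `build_slot_table`, the thread names, and the smooth weighted round-robin rule used to spread the slots are all assumptions; only the 64-slot figure comes from the post (and that was quoted from memory).

```python
def build_slot_table(shares, slots=64):
    """Spread each thread's share of time slots evenly over one frame.

    shares: dict mapping a (hypothetical) thread name to its slot count.
    Uses smooth weighted round-robin, an assumed policy for illustration.
    """
    if sum(shares.values()) != slots:
        raise ValueError("shares must add up to the number of slots")
    credit = {name: 0 for name in shares}
    table = []
    for _ in range(slots):
        # Every thread earns credit in proportion to its share...
        for name, share in shares.items():
            credit[name] += share
        # ...and the thread with the most credit gets this slot.
        winner = max(credit, key=credit.get)
        credit[winner] -= slots
        table.append(winner)
    return table


if __name__ == "__main__":
    table = build_slot_table({"hard_rt": 16, "rtos": 48})
    print(table[:8])
```

With shares of 16 and 48, the hard real-time thread lands on every fourth slot, so its worst-case wait for the CPU is fixed by the table and not by what the other thread is doing.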

-jg

Reply to
Jim Granville

Mr. Lesea, this is not a flame, but to correct an error in your statement: "the only FPGA with embedded processors" is far from the truth. The following come to mind immediately (I'm sure I'm forgetting several):

- Nios & Excalibur (Introduced June, 2000, that was FOUR years ago, and a year ahead of the competition; my how time flies!)

- QuickLogic

- The company you just acquired (I'll leave my theories out of this post)

All are processors on FPGAs. These are commercial offerings, there are numerous 3rd party & free cores out there too.

Your comments on ISR latency can be debated if you like, but I won't get into it now; there is already a thread discussing the architectural pros & cons that affect this.

Boy, all this stuff makes me feel like I did yesterday when a guy dropped in on me while surfing.

Regards,

Jesse Kempa Altera Corp. jkempa at altera dot com

Reply to
Jesse Kempa

that must be red rag to a bull for john jackson and the other transputer folk.

and why are there so many transputer people in fpgaland?

Reply to
Tim

Hi Joel,

(trying to tune out the trolls here) I'm going to be porting what you call a "deeply embedded" interrupt-driven application from Nios I to Nios II shortly. Can you contrast the two in terms of interrupt latency?

The app was originally developed on a dual coldfire system and I can say a single Cyclone based NiosI handles things very nicely. I'm looking forward to the new IDE and even more performance.

Whatever the naysayers say, Motorola is not sending me a new higher performance cpu to download to my *existing boards*. This is great stuff!

TIA, Ken

Reply to
Kenneth Land

Jesse,

Processors, plural.

I'm still right.

Austin

Reply to
Austin Lesea

I am just curious Austin, do you think this message helped either you or Xilinx?

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design      URL http://www.arius.com
4 King Ave                               301-682-7772 Voice
Frederick, MD 21701-3110                 301-682-7666 FAX
Reply to
rickman

Actually, it works quite well if used correctly. It isn't used correctly in the implementations I've seen (from Altera and from an OS vendor). I modified the OS to change the register spill strategy: Rather than spilling the entire register set, we only spill one register frame. Restores are done normally. This results in a "run time optimization" of the top of the register window for programs. This works very well in practice because after initialization and task startup, a task's register window is at the top of the register file. For a 256 register file that means you get 14 function calls before a register spill occurs.
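A toy model makes the difference between the two spill strategies concrete. It is a sketch under stated assumptions, not the Nios or OS code: the name `window_traffic` is hypothetical, frames are counted as whole units, and the capacity of 15 resident frames is chosen only so that 14 nested calls fit, matching the figure quoted above. The real register file layout may differ.

```python
def window_traffic(trace, capacity=15, spill_all=False):
    """Count frames spilled to and filled from memory for a call trace.

    trace:     sequence of "call" / "ret" steps for one task.
    capacity:  frames the register file can hold (assumed value).
    spill_all: True  -> on overflow, spill every resident frame;
               False -> on overflow, spill only the oldest frame.
    Fills always restore one frame at a time.
    """
    resident = 1      # the task's initial frame
    in_memory = 0     # frames currently saved on the memory stack
    spilled = filled = 0
    for step in trace:
        if step == "call":
            if resident == capacity:          # window overflow
                n = resident if spill_all else 1
                resident -= n
                in_memory += n
                spilled += n
            resident += 1
        elif step == "ret":
            resident -= 1
            if resident == 0:                 # window underflow
                if in_memory == 0:
                    raise ValueError("return past the initial frame")
                in_memory -= 1
                resident = 1
                filled += 1
        else:
            raise ValueError(f"unknown step: {step!r}")
    return spilled, filled


if __name__ == "__main__":
    deep = ["call"] * 15 + ["ret"] * 15
    print(window_traffic(deep))                  # one-frame policy
    print(window_traffic(deep, spill_all=True))  # whole-file policy
```

In this model a task that nests 15 calls deep moves one frame out and back under the one-frame policy, while the whole-file policy moves all 15 frames out and then restores them one return at a time.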

I'm a little sad that we'll lose the register windows in Nios2. Performance, etc. will make up for it. ;-)

-Rich

Reply to
Richard Pennington

Tee hee. Interrupt latency is a joke number. I wrote a piece about twelve years ago for one of the embedded-system comics, pointing out how insignificant is the processor's own interrupt latency - there are many things that are orders of magnitude more important to interrupt performance. Here as in many other things, the transputer was on the right track. Sadly, limitations of design culture and available technology doomed it to commercial failure.

Just for the record, here's Bromley's First and Second Law of commercial failure in a technological product:

First Law: Probability of commercial failure is increased if the product meets any of the following criteria: 1) It employs concepts and techniques that will become popular more than a decade later. 2) Its design is based on technically, logically or mathematically sound principles. 3) Its creators are British.

Second Law: The probability of commercial failure is unity if two or more of the above criteria are met.

Perhaps because they know a good thing when they see one?

Getting more and more cynical as time rolls by...

--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL, Verilog, SystemC, Perl, Tcl/Tk, Verification, Project Services

Doulos Ltd. Church Hatch, 22 Market Place, Ringwood, BH24 1AW, UK
Tel: +44 (0)1425 471223          mail:jonathan.bromley@doulos.com
Fax: +44 (0)1425 471573                Web: http://www.doulos.com

The contents of this message may contain personal views which 
are not the views of Doulos Ltd., unless specifically stated.
Reply to
Jonathan Bromley

Yes, but without the windows those would have been swapped out to the stack already anyway, so you lose nothing.

Also note how much you gain: For example for a bifurcating recursion even a single level of register windows saves 50% of the register spills, regardless of how deep the recursion is. Two levels save 75%. And so on... For non-recursive scenarios the numbers are even better. (5 levels save almost all spills)
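The 50% and 75% figures can be checked with a small simulation. This is a sketch under assumptions that are not in the post: the function name `count_spills` is hypothetical, the recursion is a full binary tree, one frame is spilled on each overflow and one filled on each underflow, and `frames=1` stands in for "no windows" (every call saves its caller's registers).

```python
def count_spills(depth, frames):
    """Spills during a full binary recursion of the given depth.

    frames: number of register frames that can be resident at once;
            frames=1 models a machine without windows, frames=2 one
            extra window level, and so on (assumed model).
    """
    state = {"lo": 0, "spills": 0}   # lo = shallowest resident depth

    def visit(cur):
        if cur == depth:             # leaf: makes no further calls
            return
        for _ in range(2):           # bifurcating recursion
            child = cur + 1
            if child - state["lo"] >= frames:
                # No free frame: spill the oldest resident one.
                state["spills"] += 1
                state["lo"] += 1
            visit(child)
            if cur < state["lo"]:
                # Our own frame was spilled meanwhile: fill it back.
                state["lo"] = cur

    visit(0)
    return state["spills"]


if __name__ == "__main__":
    depth = 10
    baseline = count_spills(depth, 1)
    for frames in (2, 3, 4):
        spills = count_spills(depth, frames)
        print(frames - 1, "window level(s):", spills, "of", baseline)
```

For a depth of 10 this model gives 2046 spills without windows, 1022 with one extra level, and 510 with two, which is close to the 50% and 75% savings claimed. The reason is that half of all calls in a binary recursion go to leaves, a quarter to their parents, and so on, so each extra level covers half of the remaining calls.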

BTW: This whole discussion is OT and belongs in comp.arch.

Kolja Sulimma

Reply to
Kolja Sulimma

I remember doing a bit of due diligence for a relative who was looking at a job at a company which was making similar claims (they were using a shadow-register setup).

I basically did an Amdahl's law workup and gave the advice of "this is why it is bogus", and the observation that, since the company HAD funding, it might be good for a year but nothing beyond that.

More importantly, if we ever "solve" the tool problem for general purpose computation on FPGAs, we solve it for Transputers.

--
Nicholas C. Weaver                                 nweaver@cs.berkeley.edu
Reply to
Nicholas C. Weaver

Rick,

You are correct. I just lashed out. I apologize (to the newsgroup).

Now that we are the "gorilla" I need to be 5X more humble. We win with listening to customers and always placing them first.

I can't say I won't over-react again, but I can say I will try to improve.

Austin

-snip-

Reply to
Austin Lesea

My sincere apologies. I would drop this, but as it's a public forum, I want the reading public to know the truth. Some further elaboration:

Multiple embedded processorS on an FPGA (plural) have been technologically feasible, supported, and implemented by customers -- with Nios -- since its inception (I'm sure the same could be said of other offerings prior to that date, too), and we continue to support that. That has been extended in the most recent release of our product. As an example, the user can debug many (we have tested up to 8) processorS (plural) simultaneously via a single JTAG connection and a nice IDE environment.

That's the real beauty of an FPGA, as we all know... you have logic you can put to any use, including the same use several times over to do interesting things.

And if for some reason a "soft" processor does not equal a "hard" one, well, I suppose that is a matter of debate. They both take compiled C code and do useful tasks, so I think they're both processorS.

Regards,

Jesse Kempa Altera Corp. jkempa at altera dot com

Reply to
Jesse Kempa

You're really sad? Take a look at the terribly broken setjmp/longjmp implementation for Nios I. Register windows work OK if you never switch stacks (say for threads or to have a separate exception stack). A correct implementation of context switching requires that you spill all the register windows of the task being switched out, and restore the windows of the task being switched in to their previous depth. setjmp/longjmp together should behave as a context switch.

If your interrupt processing model is that all processing related to an interrupt happens in the interrupt service routine, you might be happy with register windows (unless you are unfortunate enough to have the exception occur when the windows are full). On the other hand, if your model is to do only the things that must be done in the service routine, then enable a thread to do the rest, then you probably aren't too happy.

I'm quite pleased that they dumped this feature and took the lean approach.

Geoffrey

Reply to
Geoffrey Brown

The officially defined semantics of setjmp and longjmp do not require that they be usable for switching stacks; they only are defined to unwind a stack.

I ran into exactly this problem when I ported the Telebit Netblazer operating system to the AMD 29000 back in 1991. The 29000 typically uses register windows, although it can also use the entire set of 128 local registers as "normal" non-windowed registers. I had to rewrite the setjmp and longjmp implementation exactly as you describe.

However, I wouldn't claim that this is because the setjmp/longjmp implementation was broken. It was behaving exactly as specified. Rather, the problem is with using setjmp/longjmp for something other than unwinding the stack.

I think a case could be made that the next revision of the C standard should have new library functions for context switching.

Reply to
Eric Smith

Oh, I am asked to say something:)

Ok, I have no idea whose interrupt latency is shortest. Probably the cpu that has the fastest clock rate or the one that's specially designed for int response handling.

I suspect that the several ASIC MT cpus that have recently come along for the wireless set could well have the best int response, esp. one that runs 8 threads at 250MHz (or was it 400MHz), because the threads run all the time, every 8th cycle. And these cpus don't have context to swap since they have N contexts in ram.

Technically Transputers don't have interrupts, that's too low a level of looking at them, but they do service events with an incredibly quick response for a variety of reasons, but that was at 25MHz and 15yrs ago.

Now the R3 cpu, also being a multithreaded (MT) cpu (and also now running baby code, BTW, in the C model), could designate 1 of its 16 threads to poll some HW and take the event home. That would mean about 20-50 cycles of computation might pass before Pn noticed it had to do some work. If Pn can find a way to stay active in the IX engine without branching (which causes a process swap, round robin style) then it could notice an event in ...

Well I don't remember anyone else here that identifies themself as such, most are probably busy elsewhere. And where is Alan C!

Well the answer to that is real simple. Anything FPGAs do today, esp. DSP and comms and whatever, was once done by Transputers. Look at Nallatech and a whole load of UK/European companies that were once Transputer TRAM module houses. Those that survived are all FPGA guys today and in the top tier of high perf engineering. What's a good engineer to do when something runs out of gas? Look for the next obvious replacement.

Also the FPGA and the Transputer more or less came out at the same time, 84++. The Transputer peaked a long time ago; the FPGA really started peaking only a few years ago, and wasn't really much use till 4K or later (sorry)..

That also brings me to the other point. Occam runs on both. Not C. Of course Occam had to resurrect itself in C syntax (HandelC) to be more attractive to the avg EE and to be synthesizable for FPGA. BTW I am not a fan of HandelC, I just mention that its roots go back to Occam.

I will leave it there

regards

johnjakson_usa_com

Reply to
john jakson

Perhaps I am doomed to fail on all 3 counts.

Anyway I may be a US citizen before this thing gets polished and can deny the last rule as everything important has to seem to be invented or reinvented in the US- (sadly).

Since my math isn't so great maybe I can deny the 2nd rule too:).

And 20yrs have passed since I left and the Transputer shipped so I can beat that one too perhaps.

yep

Reply to
john jakson

The Ubicom part claims 0 or 1 cycles.

In a hard-timesliced CPU, I can see two schemes for handling interrupts, that would need slightly different hardware (no problem in an FPGA-CPU :). It is a CPU structure that would seem to fit well into FPGA resource.

- First scheme allows any/(first) available free timeslot to an interrupt thread. This allows good granularity, but does not give the smallest possible INT response.

- Other scheme is careful to leave every second time-slot free, for possible INT. INT response/context sw is MUCH faster (1-2 clocks), but cost is that other threads cannot have more than 50% of the CPU. With time-sliced CPUs threads have zero time-crosstalk, but the peak CPU usage for any single thread is lower.

In most embedded applications, bounding the slowest path, and reducing jitter, can matter more than fastest-possible-speed over a short distance numbers.
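The trade-off between the two schemes can be put in numbers with a toy slot table. This is a sketch only: the helper names `worst_case_wait` and `cpu_share`, the thread names, and the two example tables are assumptions for illustration, with waits counted in whole slots and not in clocks.

```python
def worst_case_wait(table):
    """Most slots an interrupt can wait before it reaches a free slot.

    table: one frame of time slots; None marks a slot left free for
    interrupts. The frame repeats, so the search wraps around.
    """
    if None not in table:
        raise ValueError("no free slot for interrupts")
    n = len(table)
    worst = 0
    for start in range(n):
        wait = 0
        while table[(start + wait) % n] is not None:
            wait += 1
        worst = max(worst, wait)
    return worst


def cpu_share(table, thread):
    """Fraction of the frame's slots owned by one thread."""
    return table.count(thread) / len(table)


# First scheme: threads packed together, free slots wherever left over.
packed = ["hard_rt"] * 16 + ["rtos"] * 40 + [None] * 8
# Second scheme: every second slot is kept free for a possible INT.
interleaved = ["rtos" if i % 2 == 0 else None for i in range(64)]

if __name__ == "__main__":
    for name, table in (("packed", packed), ("interleaved", interleaved)):
        print(name, worst_case_wait(table), cpu_share(table, "rtos"))
```

In these example tables the interleaved layout bounds the wait at one slot but caps the background thread at half the CPU, while the packed layout gives that thread more slots at the cost of a wait of up to 56 slots.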

-jg

Reply to
Jim Granville
