Building the 'uber processor'

-

Hi Goran

What I am really after is a speedy Transputer, better still many of them distributed inside & across FPGAs. Not the original with funny 8-bit opcodes (partly my fault) but a modern design that is RISC & targeted to FPGA, using MicroBlaze as a HW/performance reference. I would budget for about 2x the cost before thinking of FPU, still pretty cheap.

The important part is that the ISA supports process communication transparently, with the scheduler in HW. The physical links, internal or external, are only a part of it. Since many cpus now have these links, and serial speeds can be far in excess of cycle speed, that's nice, but no use if the programmer has to program them themselves. With an improved event-wheel scheduler in HW too, HW simulation becomes possible for HW that might be "hard" or "soft", but then HW in FPGAs is not strictly "hardware" either (see old thread). So if HW & SW can be somewhat interchanged, it becomes easier to migrate large C seq problems gradually into C-Occam par/seq, then into more HDL par, all from inside one (maybe ugly) language. It would be even nicer to start over with a new leaner language that can cover HDL & SW, but it's more practical to fuse together the languages people actually use.

Who is the potential customer for this? Any SW-HW person interested in speeding up SW, like the original poster, or any embedded engineer who wants to customize a cpu with his own HW addons, using Occam-style channels to link them. I could go on, but much work to do.

John

johnjaksonATusaDOTcom

Reply to
john jakson

Hi John,

The new instructions in MicroBlaze for handling these LocalLinks are simple, but there is no HW scheduler in MicroBlaze. I have done a processor before with a complete Ada RTOS in HW, but it would be overkill in an FPGA.

The LocalLinks for MicroBlaze are 32-bit wide, so they are not serial. They can handle a new word every clock cycle.

MicroBlaze has 8 new instructions for handling the LocalLinks (or as I call them FSL, FSL = Fast Simplex Links). There are mainly two instructions, each with four options.

The instruction for reading from an FSL is called GET:

  GET rD, FSL #n

This will get a word from FSL number #n and move the data into register rD.

The instruction for writing a value to an FSL is PUT:

  PUT rA, FSL #n

This will move the value in register rA to FSL #n.

Each FSL has a FIFO which controls whether there is data available on the FSL or whether the FSL is full. If you try to do a GET when the FSL is empty, MicroBlaze will stall until the data is there, and if you try to do a PUT on an FSL which is full, MicroBlaze will stall until space is available on the FSL.

So the normal GET and PUT are blocking, which is great when you communicate with HW, since there is no "read status bits, mask status bits, branch back if not ready" loop to reduce any kind of bandwidth with HW. HW tends to be much faster than SW, so in general the blocking instructions will never block.
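A plain-C sketch of those blocking semantics (this is only a software model for illustration, not the MicroBlaze implementation; the names fsl_put/fsl_get are made up, and the pipeline stall is stood in for by asserts, since a single-threaded model cannot really block):

```c
#include <assert.h>
#include <stdint.h>

/* Software model of one FSL: a small FIFO between producer and consumer.
 * On the real MicroBlaze the processor pipeline stalls on empty/full;
 * in this single-threaded model we assert instead. */
#define FSL_DEPTH 16

typedef struct {
    uint32_t buf[FSL_DEPTH];
    int head, tail, count;
} fsl_t;

/* PUT rA, FSL #n -- blocks (here: asserts) when the FIFO is full */
void fsl_put(fsl_t *f, uint32_t word) {
    assert(f->count < FSL_DEPTH);          /* real HW would stall here */
    f->buf[f->head] = word;
    f->head = (f->head + 1) % FSL_DEPTH;
    f->count++;
}

/* GET rD, FSL #n -- blocks (here: asserts) when the FIFO is empty */
uint32_t fsl_get(fsl_t *f) {
    assert(f->count > 0);                  /* real HW would stall here */
    uint32_t word = f->buf[f->tail];
    f->tail = (f->tail + 1) % FSL_DEPTH;
    f->count--;
    return word;
}
```

Words come back out in the order they went in, which is the whole synchronization contract: no status registers, no polling loop.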

One option on the instructions is a nonblocking version of GET and PUT, called nGET and nPUT. These will try to perform the operation; the carry bit will be set if they were successful and cleared if they failed. This makes it possible to try communication on the FSL.
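The same model, nonblocking: instead of stalling, report success in a flag playing the role of the carry bit (again a hypothetical sketch, not the real instruction set):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Software model of nGET/nPUT: the return value stands in for the
 * carry bit that the real MicroBlaze instructions set or clear. */
#define FSL_DEPTH 4

typedef struct { uint32_t buf[FSL_DEPTH]; int head, tail, count; } fsl_t;

/* nPUT: true (carry set) on success, false if the FIFO was full */
bool fsl_nput(fsl_t *f, uint32_t word) {
    if (f->count == FSL_DEPTH) return false;
    f->buf[f->head] = word;
    f->head = (f->head + 1) % FSL_DEPTH;
    f->count++;
    return true;
}

/* nGET: true (carry set) on success, false if the FIFO was empty */
bool fsl_nget(fsl_t *f, uint32_t *word) {
    if (f->count == 0) return false;
    *word = f->buf[f->tail];
    f->tail = (f->tail + 1) % FSL_DEPTH;
    f->count--;
    return true;
}
```

The caller checks the flag and can go do other work instead of stalling, which is the point of the n-variants.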

The FSL also has an extra bit which is there for sending more than 32-bit data; you can consider it as tag information. We call it "control". A normal PUT will set this signal to '0' but cPUT will set it to '1'. The control signal is stored with the data in the FIFO, so the FIFOs are really 33 bits wide. This extra bit permits more synchronization and control/data flow between two FSL units.
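A sketch of that 33-bit entry (a model only, with made-up names): each FIFO slot carries the 32 data bits plus the control tag, so the consumer can, for example, spot the last word of a message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the 33-bit FSL FIFO entry: 32 data bits + the "control" tag.
 * cPUT sets control=1, a plain PUT sets control=0. */
#define FSL_DEPTH 8

typedef struct { uint32_t data; bool control; } fsl_word_t;
typedef struct { fsl_word_t buf[FSL_DEPTH]; int head, tail, count; } fsl_t;

static void push(fsl_t *f, uint32_t data, bool control) {
    assert(f->count < FSL_DEPTH);
    f->buf[f->head] = (fsl_word_t){ data, control };
    f->head = (f->head + 1) % FSL_DEPTH;
    f->count++;
}

void fsl_put(fsl_t *f, uint32_t d)  { push(f, d, false); }  /* PUT  */
void fsl_cput(fsl_t *f, uint32_t d) { push(f, d, true);  }  /* cPUT */

fsl_word_t fsl_get(fsl_t *f) {
    assert(f->count > 0);
    fsl_word_t w = f->buf[f->tail];
    f->tail = (f->tail + 1) % FSL_DEPTH;
    f->count--;
    return w;
}
```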

The normal usage of FSL is to analyze your C code and find a function where MicroBlaze spends most of its cycles. Move this function into HW (there are more and more C-to-HW tools). Create a wrapper in C with the same function name as the HW function. This wrapper will just take all the parameters and PUT them onto an FSL, and then it does a GET for the result.
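A sketch of that wrapper pattern (the accelerator here is stubbed in software so the sketch runs; fsl_put/fsl_get and the mac example are illustrative names, not the Xilinx API):

```c
#include <assert.h>
#include <stdint.h>

/* The C-wrapper pattern: same signature as the original SW function,
 * body just PUTs the arguments onto the FSL and GETs the result back. */

static uint32_t fsl0[4];            /* tiny stand-in for the FSL FIFO  */
static int fsl0_n;

static void fsl_put(uint32_t w) { fsl0[fsl0_n++] = w; }

/* Stub for the HW block: a multiply-accumulate engine on the far end */
static uint32_t fsl_get(void) {
    uint32_t r = fsl0[0] * fsl0[1] + fsl0[2];
    fsl0_n = 0;
    return r;
}

/* Wrapper keeps the original function name, so callers need no edits */
uint32_t mac(uint32_t a, uint32_t b, uint32_t c) {
    fsl_put(a);
    fsl_put(b);
    fsl_put(c);
    return fsl_get();               /* blocks until the HW result arrives */
}
```

Because the wrapper keeps the name and signature, the rest of the C program never knows the function moved into hardware.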

You could also connect up a massive array of MicroBlazes over FSL a la Transputer, but I think that using the FPGA logic as SW accelerators will be the more popular way, since FPGA logic can be many magnitudes faster than any processor, and with the ease of interconnect that FSL provides it will be the most used case.

Göran


Reply to
Goran Bilski

Hi John,

do you know about this nice stuff developed by Cradle?

formatting link

They have developed something like an FPGA, but the PFUs do not consist of generic logic blocks but of small processors. That's perhaps something you would like :-)

Regards, Mario

Reply to
Mario Trams

Hi Goran

.. now that sounds like something we could chat about for some time. An Ada RTOS in HW certainly would be heavy, but the Occam model is very light. The Burns book on Occam compares them, the gist being that Ada has something for everybody, and Occam is maybe too light. Anyway, they both rendezvous. At the beginning of my Inmos days we were following Ada and the iAPX432 very closely to see where concurrency on other cpus might go (or not, as the case turned out). Inmos went for simplicity, Ada went for complexity.

Thanks for all the gory details.

I am curious what the typical power usage of MicroBlaze is per node, and has anybody actually tried to hook any number of them up? If I wanted a large number of cpus to work on some project that weren't Transputers, I might also look at PicoTurbo, Clearspeed or some other BOPSy cpu array, but they would all be hard to program and I wouldn't be able to customize them. Having lots of cpus in FPGA brings up the issue of how to organize the memory hierarchy. Most US architects seem to favor the complexity of shared memory and complicated coherent caches; Europeans seem to favor strict message passing (as I do).

We agree that if SW can be turned into HW engines quickly and obviously, then for the kernels, sure, they should be mapped right onto FPGA fabric for whatever speedup. That brings up some points. 1st, a P4 outruns a typical FPGA app maybe 50x on clock speed. 2nd, converting C code to FPGA is likely to be a few x less efficient than an EE-designed engine, I guess 5x. IO bandwidth to the FPGA engine from the PC is a killer. It means FPGAs are best suited to continuous streaming engines like real-time DSP. When hooked to a PC, the FPGA would need to be doing between 50-250x more work in parallel just to break even. But then I think most PCs run far slower than Intel/AMD would have us believe, because they too have been turned into streaming engines that stall on cache misses all too often.
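The arithmetic behind that 50-250x range, written out (the 50x clock deficit and the 1x-5x C-to-HW inefficiency are John's rough guesses above, not measurements):

```c
#include <assert.h>

/* Rough break-even parallelism for an FPGA engine vs a fast CPU:
 * the engine must do clock_ratio * inefficiency independent operations
 * per cycle just to match the processor. */
int breakeven_parallelism(int clock_ratio, int inefficiency) {
    return clock_ratio * inefficiency;
}
```

With a hand-designed engine (inefficiency 1x) the floor is 50-way parallelism; with C-compiled logic (5x) it rises to 250-way.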

But SW tends to follow the 80/20 (or whatever xx/yy) rule: some little piece of code takes up most of the time. What about the rest of it? It will still be sequential code that interacts with the engine(s). We would still be forced to rewrite the code, cut it with an axe, and keep one side in C and one part in HDL. If C is used as an HDL, we know that's already very inefficient compared to EE HDL code.
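This is Amdahl's law in action: even an infinitely fast engine for the hot fraction leaves the sequential remainder as the ceiling. A quick sketch (the 80/20 split and the 100x figure are illustrative, not from the thread):

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the runtime is
 * accelerated by a factor s; the remaining (1-p) stays sequential.
 * E.g. amdahl(0.8, 100.0) is about 4.8, nowhere near 100. */
double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

Even as s goes to infinity, the speedup for p = 0.8 can never pass 5x, which is why the leftover sequential code matters so much.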

The Transputer & mixed-language approach allows a middle road between the PC cluster and the raw FPGA accelerator. It uses fewer resources than the cluster but more than the dedicated accelerator. Being more general means code can run on an array of cpus and the decision to commit to HW can be left for later, or never. The less efficient approach also sells more FPGAs or Transputer nodes than one committed engine. In the Bioinformatics case, a whole family of algorithms needs to be implemented, all in C, some needing FP. An accelerator board that suits one problem may not suit others, so does the Bio guy get another board? Probably not. TimeLogic is an interesting case study, the only commercial FPGA solution left for Bio.

My favourite candidate for acceleration is in our own backyard: EDA, esp P/R. I used to spend days waiting for it to finish on much smaller ASICs and FPGAs. I don't see how it can get better, as designs are getting bigger much faster than the Pentium can fake up its speed. One thing EDA SW must do is use ever more complex algorithms to make up the shortfall, but that then becomes a roadblock to turning it into HW, so it protects itself in clutter. Not as important as the Bio problem (growing at 3x Moore's law), but it's in my backyard.

rant_mode_off

Regards

johnjakson_usa_com

Reply to
john jakson

Thanks for the pointer, I hadn't seen it yet; will take a peek.

Reply to
john jakson

This reminds me of the PACT XPP, which is an array of ALUs with reconfigurable interconnect. Basically you replace the LUT with a math component.

New ideas have a hard time unless there's a real advantage over traditional technology. PACT tries to find their niche by offering DSP IP, e.g. for the upcoming UMTS cellular market.

Here's their URL:

formatting link

Marc

Reply to
jetmarc

Hi John,


Actually, the Ada RTOS was not that large. The whole processor, THOR, was created to run Ada as efficiently as possible:

formatting link
The processor only supported 16 tasks in HW, but you could have more tasks that had to be swapped in and out. The other thing was that the processor didn't have interrupts, only external rendezvous. There were also other implementations to handle Ada much better: HW stack control, HW-handled exception handling, ...

Reply to
Goran Bilski

< snip >

Hi,

Been reading this thread. I wonder if, instead of using an FPGA, DSP, general-purpose processor or niche product, it would be possible to use a graphics processor like the ones developed for 3D graphics boards.

It seems that this particular application requires a lot of 3D processing (distance between particles, direction of interacting force, ...) and similar matrix calculations. Graphics processors are good at this. They also have huge memory bandwidth, because they have a 64-bit or 128-bit bus width and DDR, and they support something like 64 or 128 Mbyte of memory, so they should be able to handle large data sets very well.

But I'm not sure if you can program these like a 'normal' processor, and whether it would be feasible for Mike's research group to design a system around such a graphics processor.

Maybe it would be possible to 'hack' a graphics board and change its firmware to run simulations? But are there any development environments available for these chips? If so, they'd probably be all assembler. And they'd be very expensive, I guess.

Regards, Marc

formatting link

Reply to
Marc Van Riet

It's conceivable -- you can do a lot with them nowadays -- see

formatting link
Depends on details of the problem; from the description I'd be surprised if it fit in 128MBytes. And the GPUs are only single-precision at best; the fastest use a custom FP representation which might have only 16 bits in the mantissa.

Tom

Reply to
Thomas Womack
