Building the 'uber processor'

Hello all,

Firstly I would like to say that, other than knowing what an FPGA is at the most basic level, my knowledge of the subject is nil. I am looking at this from the point of view of an application that needs a solution. I have seen add-on boards for PCs that act as co-processors, and this is the interesting bit to me. Our research group is looking into building a computer (perhaps a cluster) for calculating particle dynamics, similar in application to CFD. Our programs are in C/C++ running on Linux (any flavour will do).

My questions are

a) Will FPGA co-processor board(s) offer a speed improvement for our simulation jobs over a 'traditional' cluster (MOSIX/Beowulf)? Bearing in mind that ours will be the only job on the machine, can we reconfigure the FPGA boards to speed up the calculation?

b) Can anyone recommend a good book I can read, so that I can come back and ask more informed questions?

Cheers

Mike

Reply to
mikegw

I don't know of any good books, but FPGAs can run rings around code, especially if you can define exactly what you want them to do; that's the tricky part. And as far as parallel processing is concerned, they will blow your mind... or sit there flashing a light. Xilinx is working on a Java compiler for FPGAs. I think it's a student partnership thing, so I'm not sure how good it is, but it converts Java into hardware.

And FPGAs will eat any cluster, but see above: if you can't define the problem in a way the FPGA can handle, then it will be no faster. FPGAs are literally ORs, ANDs and flip-flops (latches), and that's what you have to start with. They also have adders and even processors, small memories and stuff like that. If you need large memory they can do that too; it's hardware. Want SDRAM? Just connect it up and write a program to access it (just don't forget to refresh it too :-)

There are already a number of super-cluster FPGA projects around. One of the fusion reactor projects uses several hundred of them. I read an article once, but don't remember the web site, sorry.

Simon

Reply to
Simon Peacock


Thanks

Just so I understand you: if I want to "realise" my C code in an FPGA array, can I upload the code, the data and the processing array, run it, and download the results?

The code (not actually mine; I am just seeing if this is all possible) basically applies an equation to a data set, looping over all particles for each time step. The tricky bit (at least in the programming sense) is constantly calculating the relative positions of the particles in order to calculate their effect on each other.

I would really like it if there were a book that could take someone with a C/C++ program and hold their hand through a whole "realisation" of that code.

Cheers

Mike

Reply to
mikegw

Hi Mike

Think of a coprocessor as a black box with input/output channels that sits in your PC. The computing elements may be a fraction of the speed of a 3GHz P4 at some things, or many orders of magnitude faster at others. I am guessing that your app needs FP calculations; maybe IEEE, maybe any ad hoc FP will do. IEEE is still costly to do in an FPGA, but see a previous post for some pointers. An ad hoc FP may be all that's needed, but you would have to do a similar version in SW for an unaccelerated node to get the same results.

Where FPGA boards really shine is when you can arrange for them to be in series with streaming data; that may be orders of magnitude faster than a PC could normally handle. If your data is on HD and has to come through the PCI bus, then you are IO bound. That may be OK if you can perform N million computations per word transferred, as in, say, crypto, but if you only need minimal computation per point, an FPGA can be the wrong solution.

Figure out how much parallelism you can extract. A P4 may run at 3GHz. An FPGA board may run at 50MHz to 200MHz; if you perform integer multiply-add, that may limit you to 100MHz. So you need to be doing at least 30x more in parallel just to match one P4. If you can do an order of magnitude more in parallel than that, then you could be doing fine, as long as you aren't IO bound. Consider a faster PCI bus that will get you a few x more throughput. Consider whether you can dump all the data one time into onboard RAM on the PCI board, i.e. get the PC out of the equation except for basic system support.
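That back-of-the-envelope arithmetic can be written down directly. A toy C helper (the clock figures used below are the illustrative numbers from the post, not measurements of any real board):

```c
/* Minimum number of parallel FPGA operations per cycle needed to
 * match a sequential CPU, given each device's usable clock rate. */
unsigned min_parallelism(double cpu_hz, double fpga_hz)
{
    double ratio = cpu_hz / fpga_hz;
    unsigned ops = (unsigned)ratio;
    return (ratio > (double)ops) ? ops + 1 : ops;  /* round up */
}
```

For a 3GHz P4 against a 100MHz integer multiply-add pipeline this gives the 30x figure quoted above.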

Take a look at the TimeLogic DeCypher board as an example from bioinformatics that gets accelerated at rates similar to your app, though AFAIK it's mostly pattern matching and integer computation.

Can't say I have heard of any books on this matter, as it's still an immature field!

Good luck

johnjaksonATusaDOTcom

Reply to
john jakson

To parallel what John said earlier: the biggest gotcha that seems to bite people is IO bandwidth. It's not necessarily hard to develop highly pipelined FPGA designs that will crunch your numbers at 100M samples/sec, but can you keep them busy?

I read of an interesting approach a while ago; do a search for Pilchard, an FPGA coprocessor board developed at a Hong Kong university. Basically it fits the standard PC memory-module form factor, with custom Linux drivers to access it. The bandwidth on the memory bus is much greater than on PCI.

Regards,

John

Reply to
John Williams

In München, Germany there is a research group that uses Xilinx a lot; they do some 'particle' search. I think the FPGAs are mostly used to filter the data coming from the experiment. As you are also in a heavy research area, it may be a good idea to contact them. I have no addresses, but there are not so many nuclear labs, so the one I mentioned should be easy for you to find.

antti

Reply to
Antti Lukats


As we will be stepping time, the data (particle information, position etc.) will be the output of the previous step. The only bit that might be messy is calculating the relative distances between particles.

I think that these devices might be the way to go. To me it seems odd that we seem to be taking a step back to the old analogue-computer days, when you 'built' your program.

I took a look; it seems fairly interesting. Given my particular data set, I might be on the wrong track thinking of an accelerator card. Maybe a stand-alone device to which the input is uploaded and which is then sent forth to do the job.

So much to learn.........

Mike

Reply to
mikegw

This is getting away from hardware, and you haven't said how much expertise you have there to apply to the problem, but I remember a series of books published by MIT Press in the '90s. Each was the summary of a different PhD thesis. One of them described breakthroughs in the simulation of many-body problems that led to orders-of-magnitude increases in the speed of running the simulation. I don't know whether those results would apply in your case or not.

It seems to be a general rule that hardware can speed up a problem k-fold, where k is usually a modestly small number, but finding a better algorithm can speed up a problem n-fold, where n is the number of items you have to deal with. With both you might get k*n.

Someone once said to me, "It takes six or eight years to really learn something well, and you don't have very many six-or-eights, so don't you go waste one." Now I realize I really should have understood what he meant back then.

Reply to
Don Taylor

Hi Mike,

So that we may better help you, please answer the following questions:

Is the arithmetic Floating Point (FP) or Integer?

If mixed, what is the ratio of the two? (e.g. 10,000 integer ops to every floating-point op.) If the ratio is greater than 100,000:1, could you do the integer work in the FPGAs and the FP in a host x86 processor?

If floating point: does it need to be IEEE FP (i.e. identical to a software execution on the same data set), or will floating point with N bits of mantissa, M bits of exponent, X guard bits, etc. do?

What is the ratio of Mult, Div, Add, Sub, Sqrt, Sin, Cos, Exp, Log, ...? (Are integer approximations useable?)

For integer operations, how many bits of precision are needed? Is this precision required all the way through the algorithm, or can it be adjusted at each step?

How many arithmetic/logic ops per data item?

What is the data set size needed before calculations can start (i.e. 20 3D points, 10 scan lines, a 512 by 512 2D set, ...)?

Can the calculations be partitioned into multiple identical sets that perform the same operation on different parts of the total data set?

If partitioning is possible, how much communication (number of data items) needs to be passed between the separate calculation clusters? How often does this need to happen (what is the inter-processor bandwidth)?

How much local data is created while calculations take place? (What bandwidth is needed to support it?)

How much table/look-up data is required by the algorithm? (What bandwidth is needed to support it?)

Can the data be thought of as a continuous stream in and out, or is it one big chunk that must all arrive, then be calculated on till done, then spat out as a result (what is the size of the input chunk and the output chunk)? Is there a constant flow of chunks (size, arrival rate, expected FP/int ops per chunk)?

Since you want an Über processor, do you have an Über hardware designer? (It takes considerable effort to create one of these, especially if what you start with is an Über software designer. It is an order of magnitude easier to get a HW engineer to write passable SW than it is to get a SW programmer to design passable HW.)

Are you aware that SW is basically written for sequential execution, or at best extremely chunky parallelism (threads)? Hardware design (for Über processors) typically requires ultra-parallelism (100s to 1000s of operations running in parallel), which means that your algorithms will have to be totally re-arranged to match such application-specific hardware. Although this is daunting, there are hundreds of real-life systems that have done it (i.e. the answer to your basic question of whether it makes sense to consider FPGAs for an application-specific co-processor is YES). Implementing these successful systems was never achieved by just taking the SW (C/C++ for example) and re-crafting it as hardware. You will need to go back to the basics of the algorithm's intent, then design for the extreme parallelism that FPGAs offer. This is not always possible, as discussed by others who have answered your original question.

Are you thinking of a single co-processor board in a PC, or something more like a Beowulf cluster with each node having its own accelerator board?

There are many more such questions, but this would be a good start.

Can't answer this without far more information from you. See above :-)

Note that your "so can we reconfigure our FPGA boards to speed calculation?" is no trivial thing. The design of the hardware may take many months, even if you have an Über hardware designer.

There is an annual conference held in Napa, California where all the people who do this type of thing meet: the IEEE FCCM conference. You would be well served by looking at the titles of the proceedings for the last 7 years at

formatting link
. You can probably get copies of the proceedings from the IEEE for way too much money.

Happiness to you too.

Philip

Philip Freidin Fliptronics

Reply to
Philip Freidin

Mike,

Surely, you could put something like a processor into an FPGA and download your code and data to it. But you will very likely not gain much from this, as you are still stuck with your "program code execution" paradigm.

Depending on the application, you might get a little gain by placing a very special processor into the FPGA that is optimised for your application. DSPs are a good example here: they have special features that make them very fast for some algorithms. This would also require a special compiler that compiles the code (that you want to reuse) optimised for your special processor. But many things you would probably need to code in assembly language anyway, because no direct translation from a high-level language to a special machine feature is possible. As far as I know, the same holds for DSPs.

However, a real speed-up you will only achieve by throwing the processor concept overboard and thinking just in terms of distributed state machines. This is a completely different thing from implementing an algorithm in some language. First of all, you have to be an experienced digital designer to do it. (Btw, you have to be the same when designing a special CPU, of course.)

Regards, Mario

Reply to
Mario Trams

Yes. But you are likely to spend a lot of effort designing the processing array.

I guess that if you post the equation (maybe a simplified version), the precision you need and the number of elements in a typical data set you will get a pretty good estimate from this group about how well this can be solved in FPGAs.

Kolja Sulimma

Reply to
Kolja Sulimma

No.

Short answer:

C/Pascal/etc. compile to machine-code instructions that run on a general-purpose processor; only one executes at a time.

VHDL/Verilog compile to a description of many specific-purpose hardware processes, all executing at once.

Longer answer:

Microprocessors execute a single conceptual process at a time.

In the real world there are many processes running concurrently.

Conventional micros and software require blocks of sequential instructions.

Occam was a language to describe processing in terms of communicating sequential processes. These could then be farmed out over multiple processors and done in parallel. The transputer was designed in tandem with occam, optimised for this programming model and communication between processors.
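For readers who haven't met occam's model: its core idea is synchronous channels between concurrent processes. A rough C sketch using POSIX threads follows; the `chan` type and helper names are invented here for illustration and are not occam, transputer, or any library API:

```c
#include <pthread.h>
#include <stddef.h>

/* A minimal one-slot synchronous "channel", loosely in the spirit of
 * occam's communicating sequential processes. */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int value, full;
} chan;

void chan_init(chan *c) {
    pthread_mutex_init(&c->m, NULL);
    pthread_cond_init(&c->cv, NULL);
    c->full = 0;
}

void chan_send(chan *c, int v) {           /* occam's  ch ! v  */
    pthread_mutex_lock(&c->m);
    while (c->full) pthread_cond_wait(&c->cv, &c->m);
    c->value = v; c->full = 1;
    pthread_cond_broadcast(&c->cv);
    pthread_mutex_unlock(&c->m);
}

int chan_recv(chan *c) {                   /* occam's  ch ? v  */
    pthread_mutex_lock(&c->m);
    while (!c->full) pthread_cond_wait(&c->cv, &c->m);
    int v = c->value; c->full = 0;
    pthread_cond_broadcast(&c->cv);
    pthread_mutex_unlock(&c->m);
    return v;
}

static chan ch;
static void *producer(void *arg) {
    (void)arg;
    for (int i = 1; i <= 5; i++) chan_send(&ch, i);
    return NULL;
}
```

The hardware analogy is direct: each process becomes a block of logic, and each channel becomes a set of wires with a handshake.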

My old tutor said that hardware engineers grasped these concepts much faster, because they are already comfortable with thinking in terms of many things happening at once in hardware. Software engineers had to unlearn their usual sequential thinking.

In the past, the general-purpose microprocessor was a great alternative to single-purpose machines. The latter could be much faster but took ages to design, build and modify.

FPGA chips change that balance of power.

Like occam, VHDL and Verilog allow you to describe processing in terms of communicating sequential processes (occam has been used as a hardware description language).

However, instead of creating machine-code instructions to perform a process, they create descriptions of hardware to do all these processes. The 'fitter' then fits the design into particular makes of FPGA.

I can see that conventional programmers would love to be able to just chuck their old C programs into an FPGA and have them run faster, but I feel this is not sensible (although Handel-C seems to be trying it). No pain, no gain.

I didn't find VHDL all that hard to pick up. In fact it is quite liberating to throw off the shackles of conventional software design. Instead of getting a single micro to rapidly poll, process and toggle dozens of real-time inputs and outputs, I can now simply declare dozens of independent hardware processors.

Benefits depend on the problem you want to solve. You can beat microprocessors easily at some tasks but not others. Ideal tasks are simple and easily scaled up, like a systolic processor for finding matches in DNA sequences, or sifting keys for the Enigma machine. The wartime machine weighed tons, used kilowatts, and clocked at 5 kHz; it would still beat many modern chips, which shows the advantage of customised hardware. You might be able to make an equivalent weighing grammes, using milliwatts, and clocked at 50 MHz! I wonder if the government kept the 95% of Enigma messages that they didn't have time to crack? I'm sure military historians would be interested in the contents...

Reply to
kryten_droid

To add more support to IO bandwidth being one of the major issues: one thing that I often see overlooked when people start clustering machines with regular networking is the overhead of just running the network connections. There was an interesting article in EE Times sometime last year (I don't recall which issue) showing how much of a GHz Pentium it took just to run a 1 Gb Ethernet connection. If I recall, it was on the order of 50% of the processor, assuming of course you were keeping the Ether busy. Of course, just plugging in PCI boards has the same issue if all the data has to move on the PCI bus, as the bus itself becomes the bottleneck.

If you are serious about building a monster machine out of multiple processors, don't overlook the data movement aspect.

Now, it just so happens that the architectures we use on our boards have IO capabilities that scale with system size, and that isn't a coincidence, as our customers build large multiprocessor systems out of them. The underlying support for this is inherent in the SHARC and TigerSHARC processors from Analog Devices, which have a built in IO Processor for moving data into and out of the DSP's large internal memory so the core can number crunch while data movement happens in the background. These DSPs also have multiple high speed point to point interconnects called link ports (the TigerSHARC 101S has four 250 MByte/sec links as well as its 64 bit 100 MHz external bus) which can be used for shipping data around. We also use large FPGAs and connect them to the DSPs using these links for the data flows.

While some will argue that the best approach is a bunch of GHz PCs, and others will say use traditional DSP, and yet more will say FPGA, there is no one magic approach that applies to all systems. Usually some combination of these processing types will get the job done, it's a matter of deciding which parts of your system are better served by which. And this of course is dependent on the type of number crunching you need and the associated data movement requirements.

----- Ron Huizen BittWare


Reply to
Ron Huizen

Another area to research could be electric circuit simulation (i.e. SPICE). There are similarities: each circuit node can influence every other (at least potentially). SPICE basically revolves around inverting a mega-matrix, and there's been quite a lot of work put into building hardware accelerators for it. You may be able to leverage off that.


Reply to
David R Brooks


A lot of really good points covered by Philip and also in the above posts. It's clear that many FPGA pros would like to get their grubby hands on such a project, as long as it pays of course. I can only wish there were more of these projects, but it's a bit off the normal path.

I am going to suggest instrumenting the code to find out the answers, unless they are obvious from inspecting the code. The FP question is paramount, though. A lot of DSP written in C uses lazy math because the FP is basically free. If you were to largely eliminate FP the way we used to more than 20 years ago, you can often get just as good a result, and a much better understanding of what's really important and where precision and dynamic range are really needed and where they are wasted.
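The instrumentation suggestion can be as crude as wrapping each arithmetic operator in a counting helper, so a real run reports the FP/integer mix. A minimal sketch (the function names are invented for illustration):

```c
#include <stdio.h>

/* Instrumented arithmetic helpers: same results as the raw operators,
 * but they tally how often each is used, so the operation mix can be
 * measured on a real data set. */
static unsigned long n_fmul, n_fadd;

double fmul(double a, double b) { n_fmul++; return a * b; }
double fadd(double a, double b) { n_fadd++; return a + b; }

void report(void) {
    printf("fmul=%lu fadd=%lu\n", n_fmul, n_fadd);
}
```

Replacing `a * b` with `fmul(a, b)` throughout the engine is tedious but mechanical, and the counts feed directly into the datapath-sizing questions Philip listed.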

In the DSP world, 16-18 bits has proven adequate for most tasks, as long as special care is taken to keep signals in range. Many algorithms can block up the range so that only one exponent is needed for a whole group of points, and this exponent can be as little as a common divide by 1, 2, 4 etc. in the FFT case. I suspect it won't be as easy for your problem.
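That shared-exponent idea is usually called block floating point. A C sketch of the simplest variant (the `block_normalise` name is invented, and real implementations find the shift from the block maximum rather than by repeated halving):

```c
#include <stdint.h>

/* Block floating point: one shared exponent for a whole block.
 * Right-shifts every sample until all fit in 'bits' signed bits,
 * and returns the shift count; the caller tracks that count as the
 * block's common exponent. */
int block_normalise(int32_t *x, int n, int bits)
{
    int32_t limit = (int32_t)1 << (bits - 1);   /* e.g. 1<<15 for 16-bit */
    int shift = 0;
    for (;;) {
        int ok = 1;
        for (int i = 0; i < n; i++)
            if (x[i] >= limit || x[i] < -limit) { ok = 0; break; }
        if (ok) return shift;
        for (int i = 0; i < n; i++) x[i] >>= 1;   /* common divide by 2 */
        shift++;
    }
}
```

In an FFT butterfly this is exactly the "common divide by 1, 2, 4" per stage: one exponent for the whole block instead of one per sample.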

As an aside, most CPUs don't even perform integer math very well for DSP tasks. For instance, when rounding, many uninformed programmers will use >> to divide by 2^N, not realizing this introduces round-off errors that bias the signed signal negative. Doing it correctly requires more integer ops to check MSBs and LSBs etc., but it reduces the need for extra width and can bring the results much closer to an all-FP result converted back to the same-size int.
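The bias is easy to demonstrate in a few lines of C. The arithmetic shift rounds toward minus infinity, so negative samples are pushed down; adding half an LSB before the shift rounds to nearest (ties toward plus infinity in this simple version):

```c
#include <stdint.h>

/* Naive divide-by-2^n via arithmetic shift: rounds toward -infinity,
 * so a zero-mean signed signal picks up a negative bias. */
int32_t div_pow2_trunc(int32_t x, int n) { return x >> n; }

/* Round-to-nearest version: add half an LSB before shifting.
 * (Ties round toward +infinity; unbiased "round half to even"
 * needs the extra MSB/LSB checks the post mentions.) */
int32_t div_pow2_round(int32_t x, int n)
{
    return (x + ((int32_t)1 << (n - 1))) >> n;
}
```

For example, -3/2: the naive shift gives -2, while round-to-nearest gives -1. (Note that `>>` on negative values is implementation-defined in C, though arithmetic shift is near-universal in practice.)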

One approach I used to turn a wavelet algorithm into RTL HW was to rewrite the original C code, along with the Matlab guy, so that the main engine module called small C functions to evaluate each and every operator: *s, +s, /s, and even movs etc. These could also tally the use counts and do more precise math on words of odd bit widths; my busses were 17, 18 and 37 bits wide. By tweaking the widths of the various operators we could reduce the cost of the HW function blocks to an acceptable value in an ASIC. When Dr Matlab was happy with the C code, all the funcs were easily turned into equivalent Verilog (or VHDL) modules; params became ports. The operator counts are needed to design suitable datapaths with fixed/variable arithmetic units. The C program that had been calling these funcs in a C dataflow was then used to construct an FSM that kept data moving between the various arithmetic modules/funcs and multiple memories. It gets harder because the HW is 10 stages of overlapping pipelines, very difficult to express in a C HDL. Even so, it took 6 months just to get from Fortran-C to RTL Verilog. Of course the C and Verilog results were identical, and very close to the FP Matlab model.

Since SHARCs, Transputers and Occam were mentioned, I would note that the ADI links on the SHARCs (and TI chips, IIRC) are a variation of the Transputer links that supported Occam channels across multiple CPUs. If only a modern Transputer existed that was comparable to today's embedded CPUs/DSPs. KROC anyone? Then it would be perfectly reasonable to build the project in Occam or C-CSP and spread it across as many tightly coupled CPUs as needed. The resulting code would be HW-like but with a lot less pain, and could also lead to HW synthesis (Handel-C).

I happen to be working on such an Über Transputer plus compiler, but it's still some way off. The native programming language for this is V++, or Verilog + C + Occam + asm all rolled into one (horrid) language: a mini version of SystemVerilog, but tuned to run natively on the event/process scheduler already in the CPU.

The Verilog part allows HW to be described directly, whether behavioural, dataflow or RTL, and most of that remains synthesizable if written so. Processes can be engineered back and forth between synthesized HW (coprocessors) and SW, if the CPU is in an FPGA. Some might call this a HW accelerator on the cheap, or a simulation engine, but that's like calling conventional CPUs Turing accelerators.

The Occam part is just the ! ? alt/par/seq/chan model in C syntax. The underlying scheduler is not so different from the HW event timing wheel.

The C part adds data types and allows conventional seq programming and importing of code. Asm touches the HW directly. The compiler is based on the lcc design.

I am curious to know what folks think of combining HDL, CSP and C, rather than keeping HW and SW miles apart as in conventional engineering.

regards John

johnjaksonATusaDOTcom (ignore yahoo)

Reply to
john jakson

When using FPGAs instead of CPUs without major changes to your algorithms, you could simply build a CPU with an improved datapath. You may improve the number of operations per cycle, but if you manage to get 6 operations done in the number of cycles a CPU does one, you end up with no gain, as an actual CPU will have ~6 times the clock rate of your FPGA [1].

When inventing an algorithm that exploits the benefits of an FPGA, you could end up with magnitudes of speedup. For example, when solving SAT, you could create an FPGA containing exactly the formula plus a counter, testing one set of variables per cycle, instead of merely speeding up your integer operations. This approach is of course very limited by the size of your FPGA.
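The SAT scheme can be mimicked in software to see the idea: the formula becomes a pure function (the hardware would bake it into combinational logic), and a counter sweeps assignments, one per "cycle". The 3-variable formula below is made up for illustration:

```c
#include <stdint.h>

/* Example formula (invented): (a OR b) AND (NOT a OR c) AND (b OR NOT c).
 * In the FPGA version this is just gates; here it is a function of the
 * counter value, with variables packed one per bit. */
static int formula(uint32_t v)
{
    int a = v & 1, b = (v >> 1) & 1, c = (v >> 2) & 1;
    return (a | b) & (!a | c) & (b | !c);
}

/* Sweep all 2^nvars assignments, like the hardware counter would.
 * Returns the first satisfying assignment, or -1 if none exists. */
int sat_sweep(int nvars)
{
    for (uint32_t v = 0; v < (1u << nvars); v++)
        if (formula(v)) return (int)v;
    return -1;
}
```

On the FPGA the whole formula evaluates in one clock, so the speedup over a CPU executing the formula as a sequence of instructions is roughly the instruction count of `formula`.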

So it doesn't seem 'odd' to me that you have to leave "normal" sequential algorithms behind and think about completely new ideas.

bye Thomas

[1] With IO limitations like PCI this is _very_ optimistic for an FPGA.
Reply to
Thomas Stanka
[snip much]

See Celoxica's products.

And observe how they've backed away from a reasonably pure CSP approach like yours, and put more emphasis on the pure-C thing.

Software people have an irrational and passionate distaste for fine-grained parallelism. I don't think you have much chance of changing their collective mind.

That observation should not distract you from an interesting and (one hopes) ultimately fruitful project.

-- Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how VHDL * Verilog * SystemC * Perl * Tcl/Tk * Verification * Project Services

Doulos Ltd. Church Hatch, 22 Market Place, Ringwood, Hampshire, BH24 1AW, UK Tel: +44 (0)1425 471223 mail: snipped-for-privacy@doulos.com Fax: +44 (0)1425 471573 Web:

formatting link


Reply to
Jonathan Bromley

Just as HW people generally view plain C as an HDL with great discomfort, as can be seen from the number of deceased C-HDL companies. But Celoxica is aiming at a different crowd, not hardcore HW ASIC guys. I am familiar with Handel-C to some degree; I see them a couple of times a year at different shows. Every time I see them I get better insight, but the question remains: why use plain C with CSP semantics (Occam is underneath it, right?) when HDLs are far better at describing HW? Every conversation with Celoxica people tells me that you still have to describe the parallelism directly, which is why it can be synthesized. It would surprise me if it were possible to use Handel-C in a meaningful way without using any of the inherited Occam keywords.

Then again, HDLs are not too good at describing purely sequential processes or SW, so bridging the two worlds is difficult with either an HDL or a general sequential SW language. So I am addressing the audience that is comfortable on both sides and wants to move processes between HW and SW. This is not quite the same thing as SystemVerilog, as that is clearly aimed only at big-$ ASIC engineers. VHDL perhaps already is a bridge language, but I have never been partial to it. If Celoxica included a Transputer IP core with their tool, I imagine Handel-C could also be a bridge, since code could either run as synthesized HW or as plain Occam-style code.

Regards John

johnjaksonATusaDOTcom

Reply to
john jakson

Hi,

I have been following this thread with great interest.

If you need a processor with links to/from the processor register file then MicroBlaze could be the answer.

MicroBlaze has 18 direct links (in the current version, the ISA allows up to 2048) and 8 new instructions for sending or receiving data to/from the register file.

The connection is called LocalLink (or FSL) and has these features:

- Unshared non-arbitrated communication

- Control and Data support

- Uni-directional point-to-point

- FIFO based

- 600 MHz standalone

Reply to
Goran Bilski


I will post the equations if I am able. This particular project is not mine, and as such I do not know if I am allowed to post their work; I will know in the next week. But in the meantime, the basic premise of the calculation is as follows...

From time zero until time x, for each time step, calculate for n particles (typically hundreds to thousands) their position in the next time step. Factors affecting the new position are:

1) interaction between each particle

2) particle velocity and mass

3) the media that the particle is in

Currently the system is simplified by locating the particles into neighbourhoods, so that effects from distant particles are ignored.

Again thanks all for your help

Mike

Reply to
mikegw
