Interesting problems in high-performance computing

There is a software application which demands a huge amount of computation. My aim is to port the software into hardware, so first of all I need to evaluate the required volume of computation.

The situation is that the software needs up to 22 hours on a personal computer (P4 3.06 GHz, 2 GB RAM) to finish the computation, and the operations are mostly floating-point multiplications and additions. I need to finish the computation in hardware in several minutes. Is it possible?

Another problem is which kind of hardware platform is preferable. I have checked that the products provided by Nallatech

formatting link
seem suitable for me. I am still wondering whether I should instead choose a board with multiple DSP processors on it; I think DSP processors are better at floating-point operations.

Reply to
hitsx

Put some counters in the software version to total the required number of operations on each variable in the calculation, and you can get that answer fairly accurately.
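
For example, a minimal C sketch of that instrumentation (the kernel and the counter names here are hypothetical placeholders for the real program):

#include <stdio.h>

/* Hypothetical global counters: bump one next to every arithmetic
   operation in the existing kernel, then print the totals at exit. */
static unsigned long long n_fadd, n_fmul;

static double kernel(const double *a, const double *b, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++) {
        acc += a[i] * b[i];  /* one multiply and one add per pass */
        n_fmul++;
        n_fadd++;
    }
    return acc;
}

int main(void)
{
    double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
    printf("result=%g fadd=%llu fmul=%llu\n",
           kernel(a, b, 4), n_fadd, n_fmul);
    return 0;
}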

Next take a good look at the algorithm, for concurrency and necessary serial operations.

Any place you have a different data value used in each iteration of a loop, you have the chance to calculate two or more values per loop iteration (unrolling the loop) to gain parallelism. If, however, there are serializations which offer no restructuring to parallelize, then the application might well be done best with a fast CPU/cache rather than a slower FPGA. Look at Amdahl's law regarding the possible speedup given the parallel and serial sections of a program.
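
As a toy illustration (not the poster's actual application), unrolling a simple reduction by four turns one long dependency chain into four independent ones that can be evaluated concurrently; it assumes n is a multiple of 4 for brevity:

/* Serial form: every add waits on the previous one through 'sum'. */
double sum_serial(const double *x, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Unrolled by 4: four partial sums advance independently in each
   iteration and are combined once at the end (assumes n % 4 == 0). */
double sum_unrolled(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}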

In some cases even a lengthy serial operation may see some gains from pipelining, as another form of parallelization, so be constructive in looking for parallelism in the algorithm. In other cases, using a completely different algorithm which completes the task equally may be the only means of gaining the required parallelism. This sometimes requires some thought, like replacing recursion with linear code to "unroll" the implied loops, or reordering the sequence of operations ... such as loop combining, where two or more loops process the same data points.
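
As a contrived example of the recursion point (factorial standing in for a real kernel), the linear form makes the implied loop explicit, and associativity then lets it split into independent partial products, just like loop unrolling:

/* Recursive form: each multiply waits on the call below it. */
unsigned long long fact_rec(unsigned n)
{
    return (n <= 1) ? 1ULL : n * fact_rec(n - 1);
}

/* Linear form, split into two independent partial products that a
   second hardware multiplier could compute concurrently. */
unsigned long long fact_iter(unsigned n)
{
    unsigned long long p0 = 1, p1 = 1;
    for (unsigned i = 2; i + 1 <= n; i += 2) {
        p0 *= i;      /* even terms */
        p1 *= i + 1;  /* odd terms  */
    }
    if (n >= 2 && n % 2 == 0)
        p0 *= n;      /* leftover term when n is even */
    return p0 * p1;
}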

In general, most FPGA implementations see speed improvements between 2X and 200X; sometimes worse, rarely much better, unless the main software loop is very small and the algorithm has exceptional parallelism. The reason is that FPGA cycle times are relatively slow compared to GHz multi-issue pipelined CPUs. It doesn't really matter much whether the data is fixed or floating point; what matters is the degree of parallelism you can introduce and exploit in the FPGA. You are looking for a speedup on the order of 22*60/3 = 440X, which rather implies the calculations of the algorithm must have a very small kernel and be extremely easy to parallelize, with little or no serial component (such as memory accesses), to meet your goals in an FPGA ... certainly possible with some careful work, but not necessarily for all algorithms or problems.
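
A quick sanity check of that 440X target against Amdahl's law, speedup = 1 / ((1 - p) + p/s) for parallel fraction p sped up by s; the values of p below are illustrative assumptions, not measurements:

#include <stdio.h>

int main(void)
{
    double target = 22.0 * 60.0 / 3.0;  /* 440X, from the post */
    /* The ceiling as s -> infinity is 1 / (1 - p). */
    printf("target: %.0fX\n", target);
    printf("ceiling at p=0.999: %.0fX\n", 1.0 / (1.0 - 0.999)); /* 1000X */
    printf("ceiling at p=0.997: %.0fX\n", 1.0 / (1.0 - 0.997)); /*  333X */
    return 0;
}

Even 0.3% of residual serial work already caps the speedup below the 440X target, which is why the serial component matters so much here.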

Besides looking at FPGAs, you might also consider GPUs, Cell processors, and a different CPU architecture which has better multi-issue floating point, cache, or memory performance ... including tuning the application for exactly that CPU/memory configuration.

Have fun! John

Reply to
Totally_Lost

Hello

Perhaps this one could suit your needs:

formatting link

Regards

Christian

Reply to
Christian Kirschenlohr

Most engineers still don't realize the processing power of the FPGA. True, today's processors operate at high clock frequencies, but people don't truly understand how much of that raw speed is lost to processor overhead. I created an FPGA process for processing real-time 1280-pixel 32-bit camera scans to identify the leading edge of incoming documents, determine the skew angle, and rotate the image in real time while the image was still being scanned. This process originally required 8 high-speed DSPs with significant propagation delay and large amounts of memory. I have also converted many software processes to the FPGA environment. Each has had dramatic improvements in speed that no processor could ever come close to matching.

Here is a short list of problems your software program may be experiencing.

1. Almost 25% to 50% of your speed is lost to opcode fetch and memory access. This number varies depending on CPU cache efficiency.

2. Depending on the program size and its memory access pattern, you could be forcing excessive cache flushes and reloads, which can reduce speed another 10% to 50%.

3. You must then contend with program efficiency: is the program optimized for speed?

4. Last, the efficiency of your software compiler: does it make extensive use of libraries, or generate inline code?

When a program is converted to hardware you eliminate items 1, 2, and 4. You are left with program efficiency: how well it is translated to hardware.

The first thing to do is reorder the statements in the program. Section the program into stages and identify the loop/repetition structure. Each stage has a dependency on the previous computation, which will usually leave each stage with multiple computations. These can be executed in parallel in the FPGA, and many equations can be executed in just one FPGA clock cycle. After reordering the statements into the proper order for hardware conversion, recompile the program and run it to ensure it still functions correctly. This is now your basic template for conversion.
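
A schematic C sketch of that staging (the variables and operations are hypothetical; the real dependencies come from the actual program). Every assignment inside one stage reads only stage-(N-1) results, so all of a stage's operations can execute in the same FPGA clock when pipelined:

void step(double a, double b, double c, double *out1, double *out2)
{
    /* stage 1: mutually independent, can run in parallel */
    double t1 = a * b;
    double t2 = b + c;

    /* stage 2: depends only on stage-1 results */
    double t3 = t1 - t2;
    double t4 = t1 * t2;

    /* stage 3: depends only on stage-2 results (plus the
       pipeline-forwarded input 'a') */
    *out1 = t3 / 2.0;
    *out2 = t4 + a;
}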

If the program uses amounts of data beyond the capacity of the FPGA, you will need to create a multi-channel DMA controller with cache. This controller will provide access for each stage needing external memory.

Second, when using decimal calculation, determine the maximum acceptable decimal error. Floating point offers many advantages but slows down computation, consumes a large number of resources, and adds to the overall complexity of the hardware. You should use fixed-point computation if possible.

Fixed-point decimal accuracy (decimal portion, not including integer size):

1 byte  = 2 decimal places
2 bytes = 4 decimal places
3 bytes = 7 decimal places
4 bytes = 9 decimal places
5 bytes = 12 decimal places
6 bytes = 14 decimal places
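
A minimal C sketch of the fixed-point idea, assuming a Q16.16 format (16 integer bits plus 16 fraction bits, i.e. the 2-byte row above with roughly 4 decimal places); the type and helper names are illustrative:

#include <stdint.h>

typedef int32_t q16_16;  /* assumed format: 16 integer + 16 fraction bits */

static inline q16_16 q_from_double(double x) { return (q16_16)(x * 65536.0); }
static inline double q_to_double(q16_16 q)   { return q / 65536.0; }

/* Add/subtract are plain integer operations: no normalization step. */
static inline q16_16 q_add(q16_16 a, q16_16 b) { return a + b; }

/* Multiply maps directly onto a hardware multiplier: widen to 64
   bits, multiply, then shift the extra 16 fraction bits back out. */
static inline q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) >> 16);
}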

Xilinx devices have built-in multipliers (18x18) which are fast and save hardware. Division is possible but uses more resources.

A Spartan-3 can do the work. The Virtex-II and Virtex-4 are faster and offer more memory and multipliers, with the addition of embedded CPUs to help with other tasks. A DSP does provide advantages over a standard processor but does not compare to the raw power of the FPGA.

snipped-for-privacy@charter.net

Reply to
evansamuel

We have a number of clients doing the type of thing you are doing with our development hardware, in quite a few application areas.

The first part of your problem is a well designed and implemented FPGA design. Generally you should see a massive increase in performance from an FPGA DSP engine compared to the software running on a generic x86 processor.

The second part of your problem is getting data to and from the DSP processor or FPGA. We see a lot of these applications because a lot of our boards are PCI or PCI-E based and offer a high-bandwidth route to the PC, but you may want to consider USB 2.0 as an alternative. It is usually not as fast as PCI; often the USB controller itself effectively sits on a PCI bus, so it is limited by that if not by its own protocol handling. If you do have a high data-bandwidth requirement, do be careful about what PCI and PCI-E can deliver in reality: 20-50% of the raw bus/link speed is a reasonable first-level guide to what can be achieved, and even then there are lots of factors.

John Adair Enterpoint Ltd.

formatting link

Reply to
John Adair

If you give some hints as to what type of computation you are doing, you will get better answers. Floating point, especially addition, is relatively expensive in an FPGA. You might consider the possibility of using fixed point, even if it is significantly wider. Sometimes block floating point works well, where one exponent applies to many values (like a whole vector). (The pre/post-normalization needed for floating point is expensive.)
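
A rough C sketch of the block-floating-point idea, one shared exponent for a whole vector; the function name and the ~14-bit mantissa headroom are illustrative choices:

#include <math.h>
#include <stdint.h>

/* Pack a vector of doubles into 16-bit mantissas sharing a single
   scale: find the largest magnitude, choose a shift so it fits in
   about 14 bits, and quantize every element with that one shift.
   Reconstruct later with ldexp((double)m[i], -shift). */
static int block_float_pack(const double *x, int16_t *m, int n)
{
    double maxabs = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(x[i]) > maxabs) maxabs = fabs(x[i]);

    int exp = 0;
    frexp(maxabs, &exp);          /* maxabs = f * 2^exp, 0.5 <= f < 1 */
    int shift = 14 - exp;

    for (int i = 0; i < n; i++)
        m[i] = (int16_t)lrint(ldexp(x[i], shift));
    return shift;                 /* the block's shared exponent */
}

Inside the block, all arithmetic then runs as plain fixed point, which is exactly where the pre/post-normalization cost disappears.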

The actual implementation of the algorithm is usually very different in an FPGA. I recommend the systolic array architecture, but there are other possibilities.

There are some systems convenient for implementing such designs. I have seen PCMCIA cards with an FPGA, though that might not be big enough for you.

How many times do you need to run this problem, and how big is your budget?

You might also look at:

formatting link

-- glen

Reply to
glen herrmannsfeldt

Hello John,

Thank you for your quick reply!

To be specific, my program is related to 3D image reconstruction. The input data is floating-point numbers in the form of a 3D array, which I represent with the symbol g(x1,y1,z1), where the ranges of x1, y1, z1 are quite large. The output data is another 3D floating-point array, which I represent with the symbol F(x2,y2,z2); similarly, the ranges of x2, y2, z2 are also quite large.

The estimated memory usage is 2 GB or so, while the required calculations are listed below (totals for the whole computation):

integer add:    2442 Giga operations
float add:       814 Giga operations
float subtract: 2424 Giga operations
float multiply: 1610 Giga operations
float divide:    809 Giga operations

I want these calculations done in one minute, so we can divide the numbers above by 60 to get the required rate per second. As a matter of fact, if the calculations could be done in 5 minutes, that is OK for me, so for the minimum requirement, divide the above numbers by 300 (= 60 * 5).

And your suggestion?

Reply to
hitsx

Hello glen:

Please refer to my reply to John. There is no problem with the budget, and I would prefer to get the whole calculation done in 10 minutes, as I have said in my reply to John.

Reply to
hitsx


Are they single or double precision floats?

Divided by 300, the numbers are not very high:

10.7 G fadd/fsub per second
5.3 G fmul per second
2.7 G fdiv per second

IMO your limiting factor will probably be the memory bandwidth. If you have enough memory bandwidth then it's not a problem for a large FPGA. You can even go faster than that.
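
A back-of-envelope C sketch of why, using the totals posted earlier and the worst-case assumption that every double-precision operand streams from external memory (two 8-byte reads and one 8-byte write per operation, over the 5-minute budget):

#include <stdio.h>

int main(void)
{
    /* float-operation totals from the earlier post, in Giga-ops */
    double fops = 814.0 + 2424.0 + 1610.0 + 809.0;
    double bytes_per_op = 24.0;   /* 2 reads + 1 write, 8 bytes each */
    double seconds = 300.0;       /* the 5-minute budget */

    printf("worst case: %.0f GB/s\n", fops * bytes_per_op / seconds);
    return 0;   /* prints roughly 450 GB/s */
}

That worst case is far beyond any single memory system, so on-chip reuse (block RAM, a systolic structure) has to cut the external traffic by orders of magnitude; hence the bandwidth question comes first.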

Marc

Reply to
Marc Battyani

It is double precision. Can you please explain why that matters? The other question is: since you mentioned memory bandwidth could be an issue, can you estimate the approximate required memory bandwidth?

Reply to
hitsx



Hello,


The numbers should be divided by the reciprocal of 300 (or, for efficient calculation, multiplied by 300), so instead of 10.7 Gflops, 5.3 Gflops, and 2.7 Gflops, the results should be approximately: 244200 Giga floating-point operations per second for additions; 727200 for subtractions; 483000 for multiplications; and 242700 for divisions.

"[..] It is double precision. And can you please explain it for me why it matters?"

It is slower and uses more memory (but is less inaccurate).

"The another question is that as you mentioned memory bandwidth could be an issue, can you evaluate the approximately required memory bandwidth?"

The bigger the values flowing through your gates, the more bits per second you will need to be able to handle. I am surprised that you did not notice the inappropriate calculation of 1610 Gflops / 300 = 5.3 Gflops, so if you plan on spending a lot of money on your new implementation, please, for your own sake, double-check everything.

Regards, Colin Paul Gloster

Reply to
Colin Paul Gloster

I tried to email offline to discuss some of these issues more, but your email bounces :(

John

This is an automatically generated Delivery Status Notification.

Delivery to the following recipients failed.

snipped-for-privacy@hit.edu.cn

Reply to
Totally_Lost

I see what you mean, but what I wrote may have been confusing.

The total computation is described below (again, totals, not rates):

integer add:    2442 Giga operations
float add:       814 Giga operations
float subtract: 2424 Giga operations
float multiply: 1610 Giga operations
float divide:    809 Giga operations

If I needed the operations done in one second, I would need to sustain that number of calculations per second. Since I need them done in one minute, I need to sustain ( that number of calculations / 60 ) per second. For example, the float adds and subtracts together are 814 + 2424 = 3238 Giga operations, and spread over 300 seconds that is about 10.8 Giga operations per second, which is where Marc's figure comes from.

So the numbers are to be divided by the time allowed, not multiplied.

Reply to
hitsx

From what I know of image reconstruction, you should be able to do it in fixed point. It is commonly done in floating point for software implementations, but it still might be better to use fixed point for a hardware implementation. Also, for multiply and divide it makes a big difference whether the multiplier or divisor is a constant or a variable.

You might prescale the floating-point data in software before sending it through the FPGA-based hardware. I would probably consider 24-bit intermediate data for a 16-bit result.
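
A small C sketch of that prescaling step, assuming the dynamic range of the data is known in advance; the function name and scale parameter are placeholders:

#include <math.h>
#include <stdint.h>

/* Prescale a double to 24-bit fixed point on the PC before shipping
   it to the FPGA.  'scale' must be chosen from the data's known
   dynamic range so |x * scale| normally stays below 2^23; values
   that overshoot are saturated rather than wrapped. */
static int32_t prescale24(double x, double scale)
{
    long v = lrint(x * scale);
    if (v >  0x7FFFFF) v =  0x7FFFFF;
    if (v < -0x800000) v = -0x800000;
    return (int32_t)v;
}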

Look for books or papers on systolic array implementations of matrix operations.

About how much can you spend on hardware? Hundreds, thousands, or millions of dollars?

-- glen

Reply to
glen herrmannsfeldt

You could hack this onto a big GeForce GPU in a standard PC... check out GPGPU.

Reply to
PFC

Can you recommend some books about this field? The budget should be tens of thousands of dollars.

Reply to
hitsx

(I wrote)

formatting link

Should get you started.

Tens of thousands of $ might mean that you should use an existing board with FPGA(s) on it. One large one might be about right.

-- glen

Reply to
glen herrmannsfeldt
