iterative algorithms + tightly coupled CPU with cloud of logic in FPGA

- wallge
  
posted

Thu, Jan 4, 2007 5:41 PM

I was wondering if anyone has experience using combinations of FPGA-based CPUs and surrounding logic to perform iterative algorithms. For instance, to implement more complex computer vision algorithms in an embedded system, we may wish to use the parallelism of an FPGA to compute multiple parts of a 2D convolution or matrix operation in parallel. While the FPGA may be able to handle the number-crunching requirements of a given algorithm, it seems to me ill-suited to the iterative (often non-systolic) nature of many advanced image processing algorithms. The more complex computer vision algorithms often seem too complex to be handled solely by FPGA-based logic.

I was thinking of the case where we have an FPGA connected directly to a video source, with data flowing into the system at some fixed rate. We may wish to process this data at several scales, iteratively searching from the low scales up to the higher ones until we have found features of interest in the video stream. Perhaps we wish to mark those features by altering pixels in their local neighborhood.

We may need to iteratively process multiple scales of image data and buffer the original video frame in off-FPGA DRAM, since there will not be enough on-FPGA BRAM to store full images. Once we find the region of interest, we may then wish to retrieve the original to be marked and then sent off as output video. A good example of this process might be, say, face detection.

It seems to me that the iterative nature of these kinds of algorithms needs to be handled by a combination of CPU and FPGA logic. The FPGA would handle the number crunching and parallel data paths, while the CPU would handle the decisions: when to iterate, when to stop, or in general what to do next based on the results of the FPGA's number crunching. The CPU could be built from programmable logic, or placed off-FPGA.
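
To make the split concrete, here is a minimal Python sketch of a coarse-to-fine search. It is only an illustration under stated assumptions: the function names (`downsample`, `peak_in_window`, `find_feature`) are hypothetical, the "feature" is simply the brightest pixel, and the frame dimensions are assumed to be divisible by 2 at every level. The first two functions stand in for the regular, parallel work an FPGA datapath would do; `find_feature` plays the CPU's role of deciding where to look next and when to stop.

```python
def downsample(img):
    """Average 2x2 blocks to halve resolution (stand-in for an FPGA datapath)."""
    rows, cols = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(cols)] for r in range(rows)]


def peak_in_window(img, r0, r1, c0, c1):
    """Return (value, row, col) of the largest pixel in rows r0..r1-1, cols c0..c1-1."""
    return max((img[r][c], r, c) for r in range(r0, r1) for c in range(c0, c1))


def find_feature(frame, levels=3, threshold=0.5):
    """Control loop (the 'CPU' role): search coarse to fine, stop early if nothing is there."""
    # Build the multi-scale pyramid; pyramid[0] is the full-resolution frame.
    pyramid = [frame]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))

    # Search the whole image only at the coarsest (cheapest) scale.
    coarse = pyramid[-1]
    value, r, c = peak_in_window(coarse, 0, len(coarse), 0, len(coarse[0]))

    # Refine: at each finer scale, look only at the 2x2 block under the coarse hit.
    for level in range(levels - 2, -1, -1):
        if value == 0:
            return None  # decision: nothing of interest, stop iterating
        value, r, c = peak_in_window(pyramid[level], 2 * r, 2 * r + 2, 2 * c, 2 * c + 2)

    return (r, c) if value >= threshold else None


# Usage: an 8x8 frame with one bright pixel at row 5, column 2.
frame = [[0.0] * 8 for _ in range(8)]
frame[5][2] = 1.0
location = find_feature(frame)
```

Note that the data-dependent branches (the early exit and the choice of the next window) all sit in `find_feature`, while the two helper functions have fixed, regular access patterns.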

Does anyone have experience with this kind of thing, or know of somewhere I might be able to find more information about optimal ways of coupling heterogeneous processors?

I am aware of Altera's C2H compiler, but have not used it, and don't know how optimally it combines FPGA/CPU resources. I might be in the market to hire a consultant, if one were knowledgeable in this area.

- JJ
  
posted

Thu, Jan 4, 2007 7:05 PM

Some ideas. If you have the dosh, you might consider using the Opteron server boards with the 2nd socket used for an FPGA plug-in module. There is one product for Virtex and another for Stratix; you will need to google for those. They were discussed in this group a year ago when they first came out, but I forget the vendor names. One issue here is that the Opterons communicate with the FPGAs through the HT bus, and the Opterons run at 2GHz and up while the FPGA may be grunting at 300MHz or less, but massively parallel. The Opteron had better be smart about partitioning the problem and not call into the FPGA at too fine a grain, otherwise the HT bus will be the bottleneck and either the CPU or the FPGA may sit idle.
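
The granularity point can be put into rough numbers. The following Python sketch is a deliberately crude cost model, not a measurement: every figure in it (per-item CPU and FPGA times, bus latency, bus transfer time) is a made-up assumption, and it ignores any overlap between transfer and compute. The function name `offload_speedup` is hypothetical.

```python
def offload_speedup(n_items, batch, cpu_ns_per_item, fpga_ns_per_item,
                    bus_latency_ns, bus_ns_per_item):
    """Estimate speedup of FPGA offload over CPU-only processing.

    Each batch sent across the bus pays a fixed latency once, plus a
    per-item transfer cost and a per-item FPGA compute cost.
    """
    n_batches = -(-n_items // batch)  # ceiling division
    cpu_only_ns = n_items * cpu_ns_per_item
    offload_ns = (n_batches * bus_latency_ns
                  + n_items * (bus_ns_per_item + fpga_ns_per_item))
    return cpu_only_ns / offload_ns


# Hypothetical figures: 1024 items, 10 ns each on the CPU, 1 ns each on the FPGA,
# 1000 ns of fixed bus latency per transaction, 1 ns per item of transfer time.
fine_grain = offload_speedup(1024, 1, 10, 1, 1000, 1)       # one item per call
coarse_grain = offload_speedup(1024, 1024, 10, 1, 1000, 1)  # one big batch
```

With these assumed numbers the fine-grained case comes out far slower than the CPU alone, because the fixed latency is paid 1024 times, while the single large batch comes out several times faster.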

The other idea is to consider the soft core processor as a unit you can either customize at the instruction level by adding your own bit twiddly opcodes or add a coprocessor for more complex processing. Adding opcodes usually slows down the cpu since it has already been architected without your new opcodes in mind. The copro route should work better since this support is usually included in the architecture definition.

If soft cores can perform most of their workload from BRAM, with little need to go to external DRAM for code or data, then quite a few of these cores might be placed in the bigger FPGAs. You might then be able to mix and match hardware engines under software control of a local soft CPU that is much closer to them in clock speed. You could think of an FFT butterfly box as a specialized CPU engine that has its instruction set in wired logic; generalize this into a DSP engine and there are many options.

Also consider using real TI/ADI DSP chips with an FPGA as a possible accelerator, and look at nVidia GPUs as a PC accelerator. I haven't been there, but some folks claim impressive speedups, and you probably already have the hardware.

John Jakson transputer guy

- Derek Simmons
  
posted

Fri, Jan 5, 2007 7:02 PM

It sounds like you have a software solution that you want to implement in hardware. I don't think you should be too hung up on the iterative nature of your algorithm; instead, you need to rewrite your algorithm to target hardware or to take advantage of VHDL. You should be looking for things that can be done in parallel. Pipelining can reduce the need for memory.

An example from your description: as areas of interest are identified, instead of marking them, pass them on to the next stage in the solution.
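
As a software analogy for that streaming style, here is a small Python sketch using generators. It is an illustration only: the stage names (`window3_sum`, `detect`) and the 3-sample window are hypothetical choices, not anything from the original design. Each stage consumes one sample at a time and holds only the few values it needs, so no full frame is ever buffered between stages.

```python
from collections import deque


def window3_sum(pixels):
    """Stage 1: sliding sum over 3 samples, holding only a 3-entry buffer."""
    window = deque(maxlen=3)
    for p in pixels:
        window.append(p)
        if len(window) == 3:
            yield sum(window)


def detect(responses, threshold):
    """Stage 2: pass (index, value) of each strong response to the next stage."""
    for idx, value in enumerate(responses):
        if value >= threshold:
            yield idx, value


# Usage: chain the stages; samples flow through one at a time.
stream = iter([0, 1, 5, 5, 1, 0])
hits = list(detect(window3_sum(stream), threshold=11))
```

In hardware the `deque` would correspond to a short shift register (or line buffers for a 2D window), and the detections would be forwarded downstream rather than written back into a stored image.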

I evaluated a couple of different C-to-HDL compilers. Most times they require you to rewrite code to work within their environment. To some extent it is like learning a new or another language. Now don't get me wrong, there are still advantages to using these tools but I found the VHDL that they produce wasn't optimal but 'safe'. I instead decided to spend my time becoming a better VHDL programmer.

My advice is to start by implementing your solution as a state machine. In your algorithm, break up compound statements into simple steps. Each step becomes a state. I developed a technique for implementing for-next loops that I could easily manage. What I found was that, while my solutions required a lot of cycles, I could achieve higher clock frequencies. Once I had a working solution that matched my software implementation, I went back and identified things that I could implement as parallel units and pipeline.
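
The post does not describe the loop technique itself, so the following is only one plausible reading, modeled in Python rather than VHDL. A for loop computing a dot product is broken into simple steps, one state each, with one state transition per simulated clock cycle. The state names and the function `dot_product_fsm` are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()   # i = 0, acc = 0
    MULT = auto()   # prod = a[i] * b[i]
    ACCUM = auto()  # acc = acc + prod
    NEXT = auto()   # i = i + 1, then test the loop condition
    DONE = auto()


def dot_product_fsm(a, b):
    """Run the state machine to completion; return (result, cycles used)."""
    state = State.INIT
    i = acc = prod = 0
    cycles = 0
    while state is not State.DONE:
        cycles += 1  # one state transition per simulated clock cycle
        if state is State.INIT:
            i, acc = 0, 0
            state = State.MULT if len(a) > 0 else State.DONE
        elif state is State.MULT:
            prod = a[i] * b[i]
            state = State.ACCUM
        elif state is State.ACCUM:
            acc += prod
            state = State.NEXT
        elif state is State.NEXT:
            i += 1
            state = State.MULT if i < len(a) else State.DONE
    return acc, cycles


result, cycles = dot_product_fsm([1, 2, 3], [4, 5, 6])
```

Each state does very little work, which is consistent with the trade-off described above: many cycles per result, but a short critical path per cycle. Merging `MULT` and `ACCUM` into one state, or overlapping iterations, would be the kind of later parallelizing and pipelining step the post mentions.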

The other thing I wanted to do was reduce the number of multipliers I was using. So I recoded them as shared hardware. The amount of combinational logic I was using went up, but I was able to reduce the number of multipliers from 36 to 4.
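
The cost of that trade can be sketched in Python. This is a simplified scheduling model, assuming 36 independent multiplications and one result per multiplier per cycle; the function name `shared_multiply` is hypothetical and the actual design in the post is not shown.

```python
def shared_multiply(pairs, n_multipliers=4):
    """Time-multiplex a list of (x, y) multiplications over a few multipliers.

    Returns (products, cycles): each cycle, up to n_multipliers pairs are
    routed to the shared multipliers and their products collected.
    """
    products = []
    cycles = 0
    for start in range(0, len(pairs), n_multipliers):
        group = pairs[start:start + n_multipliers]  # operands selected this cycle
        products.extend(x * y for x, y in group)
        cycles += 1
    return products, cycles


pairs = [(k, k + 1) for k in range(36)]
shared_products, shared_cycles = shared_multiply(pairs, n_multipliers=4)
parallel_products, parallel_cycles = shared_multiply(pairs, n_multipliers=36)
```

Under these assumptions, 4 shared multipliers need 9 cycles for the work that 36 dedicated multipliers finish in 1. The slicing that selects each group corresponds to the extra multiplexing and control logic, which fits the observed increase in combinational logic.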