FPGA or SSE2?

Hi,

I am trying to decide which of these two technologies to invest a good amount of time into, and I have not found much of use on the net - there are lots of figures, but comparing apples to oranges is not much use :(

What I am looking to do is probably a typical use for SSE2: basically image processing and compression (DCT, motion estimation, others). The problem is finding comparisons between SSE2 and high-bandwidth (e.g. HyperTransport or DDR DIMM interface) FPGA implementations of this sort of operation.

If I have to put the FPGA onto a PCI bus, I very much doubt it could possibly compete, but if it was possible to integrate into the Northbridge, it may be a whole new ball game.

I've run the testbench for Intel's SSE libraries, and it seems to be pretty stable at ~370 clocks/pixel when doing a 720x480 '422' baseline JPEG at 75% quality. This gives 0.04 secs per image at 3.6GHz. What I'm trying to figure out is (assuming I had the data bandwidth via a low-latency connection) whether I'd be able to better that using an FPGA. Any takers ?
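
As a quick check on the arithmetic above, the per-image time follows from the cycle cost per pixel, the frame size, and the clock rate. This is only a minimal sketch; the function name is made up for illustration:

```python
def seconds_per_image(clocks_per_pixel, width, height, clock_hz):
    """Time to process one frame at a fixed per-pixel cycle cost."""
    return clocks_per_pixel * width * height / clock_hz

# Figures quoted in the post: ~370 clocks/pixel, 720x480 frame, 3.6 GHz clock
t = seconds_per_image(370, 720, 480, 3.6e9)
print(f"{t:.4f} s/image, {1 / t:.1f} images/s")
```

This comes out at roughly 0.036 s per image, which rounds to the 0.04 s quoted.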

I can find figures for parts of the JPEG algorithm (e.g. Xilinx have an 8x8 DCT that will push 140 Mpixels/sec if pipelined), which is a large part of the problem, but I'd like to see if the RLE/Huffman encoding takes significant time as well. (I'm aware it could/should be pipelined after the DCT, something the P4 can't do; I'm just not sure how much extra overhead it would be.)

So, anyone got any figures for a best-case FPGA implementation ? They'd be much appreciated :)

John


I've done a simplified JPEG encoder (same principle but more specialized) that only handled black and white. Color is basically three times that, plus color space conversion and a different resolution, and in an FPGA you can just duplicate the hardware. It took about 1500 LEs in a basic Altera Cyclone and ran at about 100 MHz, needing 2 clocks per pixel. I didn't try to optimize it much, since the first implementation met the constraint I needed. So if you duplicate the hardware for the other color components, the figure would be 50 Mpixels/s => 144 fps (0.007 s per image).
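
The frame-rate figure can be reproduced with a couple of lines. This sketch assumes the 720x480 frame size from the original question and that each color component gets its own copy of the hardware, so the pixel rate stays at clock / 2; the function name is hypothetical:

```python
def fpga_frames_per_second(clock_hz, clocks_per_pixel, width, height):
    """Frames per second for a pipeline with a fixed cycle cost per pixel."""
    pixels_per_second = clock_hz / clocks_per_pixel
    return pixels_per_second / (width * height)

# ~100 MHz, 2 clocks per pixel, 720x480 frame (frame size is an assumption)
fps = fpga_frames_per_second(100e6, 2, 720, 480)
print(f"{fps:.1f} fps, {1 / fps:.4f} s/image")
```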

So I'd say yes, you can easily beat the P4 for such a dedicated task. The problem, as you stated, is getting the data in and out. 50 Mpixels per second at 24-bit color is over the PCI peak rate (133 MB/s, and that is never reached in practice).
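
To make the bandwidth point concrete, here is a rough sketch of the input-side requirement only (uncompressed pixels going in; the compressed output would add to it). The names are illustrative:

```python
PCI_32_33_PEAK_BYTES_PER_S = 133e6  # theoretical peak quoted in the post

def input_bytes_per_second(pixels_per_second, bits_per_pixel):
    """Raw bandwidth needed to feed uncompressed pixels to the encoder."""
    return pixels_per_second * bits_per_pixel / 8

need = input_bytes_per_second(50e6, 24)
print(f"need {need / 1e6:.0f} MB/s, PCI peak {PCI_32_33_PEAK_BYTES_PER_S / 1e6:.0f} MB/s")
```

At 150 MB/s, the input alone exceeds what 32-bit/33 MHz PCI can deliver even in theory.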

RLE is basically a counter with a barrel shifter. Huffman, a simple encoding/decoding table. Both can be done pretty efficiently.
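
For readers unfamiliar with this stage, here is a small software model of the idea. It is a simplified sketch, not the encoder described above and not real JPEG: actual JPEG codes (run, size) symbols followed by extra value bits and caps runs at 15, and the code table below is a made-up toy, not one of the standard tables. The zero counter corresponds to the hardware counter, and the string concatenation stands in for the barrel shifter that packs variable-length codes into output words.

```python
def run_length_pairs(coeffs):
    """Turn quantized coefficients into (zero_run, value) pairs.

    Trailing zeros collapse into a single end-of-block marker, (0, 0).
    """
    pairs = []
    run = 0  # counts consecutive zeros, like the hardware counter
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    if run:
        pairs.append((0, 0))  # end-of-block marker
    return pairs

# Hypothetical prefix-free toy table, NOT the JPEG standard Huffman tables.
TOY_CODES = {
    (0, 0): "00",
    (0, 1): "01",
    (1, 1): "100",
    (0, 2): "101",
    (2, 1): "110",
}

def encode(pairs, table):
    """Look up each pair and append its variable-length code to the stream."""
    return "".join(table[p] for p in pairs)

pairs = run_length_pairs([1, 0, 1, 2, 0, 0, 1, 0, 0, 0])
bits = encode(pairs, TOY_CODES)
print(pairs)
print(bits)
```

In hardware, both steps are a table lookup plus a shift per symbol, which is why they fit comfortably in a pipeline stage after the DCT.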

Sylvain Munaut

Thanks a lot Sylvain - that's exactly the sort of comparison I needed :) I won't just be doing JPEG, but it's pretty typical of the sort of code so it's a good benchmark, and a speedup of almost 6x is well worth investigation :)

John


True for PCI 32/33, but don't forget about PCI 32/64, PCI-X and PCI-E. You'll only find PCI-X on server-grade motherboards, but at 64-bit/133 MHz you get a bit over 1 GByte/s (not all of which you will get, but still fairly high). You can get PCI-E on newer motherboards as well, with a theoretical bandwidth of 2.5 Gbit/s per lane (312 MByte/s per lane). I'm not too clear on what you can practically achieve with PCI-E; the technology is still fairly new to me. PCI-E is available on non-server-grade motherboards too. I'm figuring that using a standard PCI-* interface is probably easier than trying to integrate the device as closely as the northbridge :)
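
The parallel-bus figures follow from bus width times clock rate. A minimal sketch, assuming one transfer per clock and nominal clocks of 33.33 MHz and 133.33 MHz (the function name is illustrative):

```python
def parallel_bus_peak_bytes_per_s(width_bits, clock_hz):
    """Peak rate of a parallel bus moving one word per clock."""
    return width_bits / 8 * clock_hz

pci_32_33 = parallel_bus_peak_bytes_per_s(32, 33.33e6)     # classic PCI
pcix_64_133 = parallel_bus_peak_bytes_per_s(64, 133.33e6)  # PCI-X
print(f"PCI 32/33:    {pci_32_33 / 1e6:.0f} MB/s")
print(f"PCI-X 64/133: {pcix_64_133 / 1e6:.0f} MB/s")
```

These are theoretical peaks; arbitration and protocol overhead reduce what a device actually sees.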

Jeremy Stringer

Actually, the 2.5 Gb/s is the electrical bit rate on the wire, which includes the overhead of the 8b/10b encoding, so it is not all available as data rate. The best logical bit rate is 2 Gb/s, on top of which you put all the framing, packetization etc., so actual utilization is lower still.
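
The correction can be worked through in a few lines. This sketch only accounts for the 8b/10b line code (8 data bits per 10 wire bits); framing and packet overhead are not modeled, and the function name is made up:

```python
def lane_rate_after_8b10b(line_rate_bps):
    """Logical bit rate of one lane once 8b/10b coding is accounted for."""
    return line_rate_bps * 8 / 10

logical_bps = lane_rate_after_8b10b(2.5e9)
print(f"{logical_bps / 1e9:.1f} Gb/s logical, {logical_bps / 8 / 1e6:.0f} MB/s per lane")
```

So the upper bound per lane is 250 MB/s rather than 312 MB/s, before any protocol overhead.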

m

Whoops :) Shouldn't have missed that one. Had it somehow in my head that it was a 3.125 Gb/s raw bit rate.

Jeremy Stringer
