suitability of systolic architecture on FPGA

Hi,

I understand that, because of the 2D array of CLBs in an FPGA, a systolic architecture can be mapped onto the fabric efficiently. However, what about the high I/O port requirements? If the outputs have to be stored in external memory, all of the speed gained will be lost to the limited off-chip RAM bandwidth (especially with a single bank). So why is this architecture so widely used for matrix algorithm implementations? The excessive I/O bandwidth requirement will surely blunt the benefit of the high clock frequency.
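
To put rough numbers on the concern (all values below are assumed, just for illustration): even a modest array of multiply-accumulate elements that reads its operands from and writes its results to a single external bank every cycle seems to need far more bandwidth than that bank can supply.

# Back-of-envelope sketch with assumed numbers: a systolic array whose
# processing elements all read/write off-chip RAM every cycle, versus the
# bandwidth of a single external memory bank.

PES = 64                  # assumed number of processing elements (MACs)
CLK_HZ = 150e6            # assumed achievable clock for the array
BYTES_PER_WORD = 4        # assumed 32-bit data

# Worst case: each PE needs two operands in and one result out per cycle.
demand_bytes_per_s = PES * 3 * BYTES_PER_WORD * CLK_HZ

# Assumed single-bank off-chip RAM: a 32-bit DDR interface at 200 MHz.
bank_bytes_per_s = 4 * 2 * 200e6

print(f"array demand : {demand_bytes_per_s / 1e9:.1f} GB/s")
print(f"one bank     : {bank_bytes_per_s / 1e9:.1f} GB/s")
print(f"shortfall    : {demand_bytes_per_s / bank_bytes_per_s:.0f}x")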

Am I missing something?

Thanks

Reply to
tlenomade

I guess it depends on the memory requirements of your algorithm. Most modern FPGAs have embedded memory. If your storage requirements per node are small (from a few bytes to a few kilobytes, depending on the type of embedded memory), you don't need to go off-chip. In a matrix algorithm there is usually a "core" size required for the local operation. If that core is larger than what fits inside one FPGA, you will run into the I/O bandwidth problems you mentioned.
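
A toy model of that on-chip reuse (sizes and names are assumed, just for illustration): if a tile of each operand fits in block RAM, every word fetched from off-chip memory is reused by many processing elements before it is thrown away, so the off-chip traffic per multiply-accumulate shrinks roughly in proportion to the tile size.

# Toy model with assumed sizes: a blocked matrix multiply where one tile of
# each operand is held in on-chip block RAM and reused, so the number of
# off-chip words moved per multiply-accumulate drops with the tile size.

def traffic(n, tile):
    """Return (off-chip words moved, multiply-accumulates) for an n x n
    matrix multiply done in tile x tile blocks held on chip."""
    blocks = (n // tile) ** 3                 # block-level triple loop
    words = blocks * 2 * tile * tile          # fetch one A tile and one B tile
    words += (n // tile) ** 2 * tile * tile   # write each C tile out once
    macs = n ** 3
    return words, macs

N = 1024
for tile in (1, 16, 64, 256):                 # tile = 1 is the unblocked case
    words, macs = traffic(N, tile)
    print(f"tile {tile:4d}: {words / macs:.3f} off-chip words per MAC")

With tile = 1 (no on-chip buffering) every MAC costs about two off-chip words; with a 256 x 256 tile it is closer to one word per hundred MACs, which is the kind of ratio that lets the array run at its full clock rate from a single bank.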

Reply to
Gabor

You are entirely correct. There are all kinds of designs where the memory and I/O bandwidths are not a problem, but there are other cases where those limitations can reduce the FPGA to a relatively small amount of logic that spends most of its time starved for data.

Those are the times when you try to keep a straight face and explain to Xilinx that you really do need 10,000 I/Os to keep the device busy (a true story from Los Alamos), or when you start considering completely different approaches, preferably with the help of one of the consultants who frequent this forum.

Reply to
Neil Steiner

Reply to
Peter Alfke

Sad but true. ;) But then isn't our profession in the business of trying to change that world?

Hence the implied reference to Mr. Andraka and others on this forum.

The context was a sparse matrix multiplication core (double-precision IEEE 754) that we wanted to run on the Cray XD1. The original software code had been finely optimized for regular processors. A pipelined FPGA design could of course run circles around the software, but even though the FPGA board had good memory and communication bandwidth (3.2 GB/s each, I believe) and connections to adjacent blades (four each at 2.0 GB/s, back in 2005), it still spent five out of every seven cycles waiting for data.

The functionality was both memory-intensive and computation-intensive, so there was no simple trade-off to be had, but because this code was used so extensively on supercomputer-sized problems, a significant performance increase could have made a big difference. (Allowing for Amdahl's law, I believe it might still have shaved days or weeks off some of the really big runs.) They probably shouldn't have stuck an intern on the problem. ;)
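
For what it is worth, the arithmetic intensity of a double-precision sparse matrix-vector product makes it easy to see where those stall cycles come from. A rough sketch, assuming a CSR-style layout with 32-bit column indices (just for illustration, not necessarily what the actual core used):

# Rough arithmetic-intensity sketch with assumed numbers: double-precision
# sparse matrix-vector multiply in a CSR-like layout, fed from a memory
# interface of roughly 3.2 GB/s.

VALUE_BYTES = 8        # double-precision nonzero value
INDEX_BYTES = 4        # assumed 32-bit column index
FLOPS_PER_NNZ = 2      # one multiply plus one add per nonzero

bytes_per_nnz = VALUE_BYTES + INDEX_BYTES     # vector reuse ignored here
mem_bw = 3.2e9                                # bytes per second

peak_from_bw = mem_bw / bytes_per_nnz * FLOPS_PER_NNZ
print(f"bandwidth-limited rate: {peak_from_bw / 1e9:.2f} GFLOP/s")

That works out to roughly half a GFLOP/s sustained, no matter how many MAC pipelines fit on the device, which is consistent with a datapath that sits idle most of the time.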

Reply to
Neil Steiner
