Systolic array architectures

Systolic array architectures are commonly used for image/video compression hardware blocks (e.g. convolution filters, motion estimation, etc). I loosely have an idea that this is because they are efficient at reusing data, and thus reduce memory accesses in comparison to, say, a custom-designed high-throughput single processing element. Would this generally be considered the principal benefit, and are there other benefits?

I have read that they are considered "I/O bandwidth efficient" - I guess that's just another way of saying what I've just outlined above?

Are there ever scenarios where the area and switching overhead of a systolic array would warrant a less bandwidth-efficient, more serial approach - or is that just plain ridiculous to consider? For example, could you hope to trade less switching in the datapath for increased switching in the memory accesses and still make an overall reduction in switching?

Looking forward to any comments/flames/rants :-)

timotoole

snipped-for-privacy@gmail.com wrote:

In fact, only the serial solution has area and switching overhead on top of each computation. The systolic implementation only computes, without doing anything else. See below.

If your algorithm dictates that you add two numbers, you cannot save the switching of the adder. But you might be able to save the memory access.

Whenever you have an algorithm that can be implemented systolically without an increase in the net operations performed (filters, Smith-Waterman, etc.), the systolic implementation must be a lot more efficient.

If you have to perform the operations anyway, it is always more efficient to perform them right away, when the data is available, rather than sending intermediate results a long distance across the chip or even off chip. The latter slows down the process and consumes a lot of power.

However, this is the overall area/delay efficiency (computations per time per area). If you have a fixed bandwidth goal, perfect efficiency doesn't help you much if you only have enough data available to utilize the hardware 1% of the time. In that case you go to a more serial solution to save hardware, but the area/time/energy per computation increases.

More complicated are problems where the systolic algorithm requires more operations than the serial algorithm. In that case you need to trade off the gains per computation of the systolic implementation against the smaller number of computations of the serial algorithm.

Kolja Sulimma


Firstly thank you very much for your detailed answer.

I'm not too sure what you mean by this? Surely there is an area overhead for the systolic array versus a "serial"/single-PE implementation?

Using your example of an adder: I have an algorithm where it is possible to avoid the addition completely if a certain condition occurs. With a systolic array the addition has to be completed regardless. Thus my question of whether a non-systolic implementation has major drawbacks.

[snip]

I would totally agree with this point.

That's a very helpful comparison.

I think this is my situation, so I'm going to try to model the problem in C and evaluate the operations and cycles for both approaches.

Thank you again for your very enlightening response.

timotoole

That's not efficiency, that's cost. Cost increases.

But for twice the area I get more than twice the speed for less than twice the power. So the computations per area and the computations per energy increase. That's what's called efficiency.

This is the great thing about systolic algorithms. There are many parallel algorithms that are not systolic that get faster when you add more hardware, but at diminishing efficiency. For some problems only logarithmic speedup is possible.

Do not forget the cost of checking the condition. Even in serial CPUs a branch is so expensive that doing a few extra operations can be more desirable than testing whether they can be skipped. [..]

Also check what your performance requirement is. Expect the gain of a systolic algorithm compared to operating serially on external memory to be huge. So even if you waste 90% of the operations, you are probably still faster.

*Deliberately oversimplifying here:* but if a serial implementation meets your throughput target, there is not much point in adding hardware for more speed and efficiency.

You can try to run your C model on a MicroBlaze. This will be VERY inefficient, but maybe it is fast enough ;-)

Kolja Sulimma

