Performance evaluation of Distributed Arithmetic architectures for FIR filters

Hi,

I am trying to figure out how to evaluate the speed of Distributed Arithmetic architectures for FIR filter design. I was referring to the paper "A Guide to Field Programmable Gate Arrays for Application Specific Digital Signal Processing Performance" by G. R. Goslin at Xilinx. The paper says that by using a parallel distributed arithmetic architecture (where more bits of the inputs are processed at the same time), a greater sampling speed (number of samples per second) can be achieved than with serial distributed arithmetic.

I am confused about this. As I understand it, if you pipeline the design, you can achieve whatever sampling speed you want. I can see how, by using more resources, you can achieve shorter latency, but the sampling speed should not be affected. There is probably something I am missing here. If any of you are familiar with this field and know the answer, please let me know.

Thank you, Anup

Reply to
anup

Adding pipelining is usually done to reduce the cycle time (increasing the clock rate) while also increasing the latency.

Suppose you wanted to build a floating-point multiply-and-add unit. Perhaps if you make it fully combinatorial, it takes 100 ns for each cycle. You can process ten million samples per second, and the latency is 100 ns.

Suppose instead you break it up into a pipeline with a combinatorial multiplier, a pipeline register, and a combinatorial adder. Suppose the multiplier and adder each take 60 ns separately, and the pipeline register setup and clock-to-output-valid time adds 10 ns. Now your latency (full operation time) is 130 ns, which is longer. But your sample clock can be as fast as 70 ns, so you can process over fourteen million samples per second.

Now suppose you internally pipeline the actual adder and the multiplier. Perhaps each has three stages that take 20 ns each, and you still have 10 ns of delays for combined setup and clock-to-output-valid time of the pipeline registers. Now you have a latency of 170 ns, which is longer yet. But your sample clock can now be 30 ns, so you can process over thirty million samples per second.

Note that the times used in this example are probably not representative of real times for any actual system.
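In C, the same cycle-time/latency arithmetic looks roughly like this (a sketch only; the 100/60/20 ns stage delays and the 10 ns register overhead are just the example values above, not figures for any real device):

/* Sketch: latency vs. throughput for the three hypothetical
   multiply-add configurations described above. */
#include <stdio.h>

int main(void)
{
    /* { pipeline stages, combinational delay per stage in ns } */
    struct { int stages; double stage_ns; } cfg[] = {
        { 1, 100.0 },   /* fully combinatorial multiply-add           */
        { 2,  60.0 },   /* multiplier | register | adder              */
        { 6,  20.0 },   /* multiplier and adder each split into three */
    };
    const double reg_ns = 10.0;   /* register setup + clock-to-output */

    for (int i = 0; i < 3; i++) {
        double cycle   = cfg[i].stage_ns + (cfg[i].stages > 1 ? reg_ns : 0.0);
        double latency = cfg[i].stages * cfg[i].stage_ns
                       + (cfg[i].stages - 1) * reg_ns;
        printf("%d stage(s): cycle %.0f ns, latency %.0f ns, %.1f Msamples/s\n",
               cfg[i].stages, cycle, latency, 1000.0 / cycle);
    }
    return 0;
}

It reproduces the figures above: 10.0, 14.3 and 33.3 Msamples/s for latencies of 100, 130 and 170 ns respectively.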

Eric

Reply to
Eric Smith

Hi Eric,

Thank you for your response. I guess I did not frame my question properly; your answer actually reinforces my doubt. According to what you said, by using more pipeline stages, you can increase the sampling speed.

But in the article I read, they use more resources to exploit the parallelism in the application (FIR filters) and reduce the latency, and they claim that the sampling speed is also increased. My doubt is that even with the original design (with fewer resources), you can achieve a higher sampling speed by pipelining it.

One relation I see between "more resources" and "sampling speed" is that with more resources you need fewer pipeline stages to achieve the same sampling speed. For example, consider this computation:

Y = Y + A[i]*X[i];

Assume that I have only 2 adders. I can implement this as

Y1 = A[i]*X[i] + A[i+1]*X[i+1];
Y  = Y + Y1;

Assume that it takes 10 ns to compute (A[i]*X[i] + A[i+1]*X[i+1]). Then, without pipelining the adder, I can achieve a sampling rate of 200 million samples per second.

Now suppose, I have 4 adders, I can do

Y1 = A[i]*X[i]     + A[i+1]*X[i+1];
Y2 = A[i+2]*X[i+2] + A[i+3]*X[i+3];
Y' = Y1 + Y2;
Y  = Y + Y';

Now I can achieve a sampling rate of 400 million samples per second.

The original system (with 2 adders) can also achieve the same sampling speed, if the adder is pipelined. I guess that if there is a limit on the amount of pipelining that you can do, adding more resources is the way to increase sampling speed. (I am trying to answer my own question here)
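As a rough C sketch of the two arrangements (my own illustration; N, A and X are placeholders, and in hardware each loop body below would be one clock cycle of the 2-adder or 4-adder datapath):

/* Sketch of the two unrollings; in software both run at the same speed,
   the point is only how the multiply-adds group into hardware cycles. */
#include <stddef.h>

#define N 64   /* number of filter taps (arbitrary) */

/* two multiply-adds per "cycle": N/2 iterations per output sample */
double mac_2wide(const double A[N], const double X[N])
{
    double y = 0.0;
    for (size_t i = 0; i < N; i += 2) {
        double y1 = A[i] * X[i] + A[i + 1] * X[i + 1];
        y += y1;
    }
    return y;
}

/* four multiply-adds per "cycle": N/4 iterations per output sample,
   i.e. roughly twice the hardware for twice the throughput */
double mac_4wide(const double A[N], const double X[N])
{
    double y = 0.0;
    for (size_t i = 0; i < N; i += 4) {
        double y1 = A[i]     * X[i]     + A[i + 1] * X[i + 1];
        double y2 = A[i + 2] * X[i + 2] + A[i + 3] * X[i + 3];
        y += y1 + y2;
    }
    return y;
}

Both versions compute the same Y; the only difference is how many multiply-adds happen side by side in one cycle.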

Anyways, thanks for your help.

-Anup


Reply to
anup

I think the basic idea is that if you have a 51-tap FIR filter, for example, you need to do 51 multiplies per sample. The two extremes of the size/speed options are:

  1. Use one multiplier 51 times sequentially.
  2. Use 51 multipliers in parallel.

#1 is ~51 times smaller than #2, but #2 runs roughly 51 times faster than #1 (see the sketch below). I didn't read the paper, but that's the classic tradeoff.
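To put rough numbers on it (my own illustration, assuming a 100 MHz clock; the paper may use different figures): with M multipliers, a T-tap filter needs about T/M clock cycles per output sample, so area and sample rate trade off almost linearly.

#include <stdio.h>

int main(void)
{
    const int    taps      = 51;
    const double clock_mhz = 100.0;            /* assumed clock rate */
    const int    mults[]   = { 1, 3, 17, 51 }; /* divisors of 51 for clean numbers */

    for (int i = 0; i < 4; i++) {
        int cycles = (taps + mults[i] - 1) / mults[i];   /* ceil(taps / mults) */
        printf("%2d multiplier(s): %2d cycles/sample -> %5.1f Msamples/s\n",
               mults[i], cycles, clock_mhz / cycles);
    }
    return 0;
}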

Reply to
RobJ

An author named Ray Andraka wrote a paper on this topic. His company is called Andraka Consulting.

Reply to
Sea Squid

What you are missing is that serial distributed arithmetic has the signal input presented serially. For an N-bit input, it takes N clock cycles per sample to compute the filter output. By using two similar serial filters, you can process two bits per clock cycle, and therefore achieve twice the sample rate for a given clock period. Therefore, by roughly doubling the hardware (not quite doubled, but that is another post), you get double the performance. This is the classic sample rate vs. area tradeoff. Note that the latency, measured in sample delays, is not appreciably changed. There is a tutorial on distributed arithmetic on my website at

formatting link
which you may find helpful for understanding what distributed arithmetic is and why this is true.
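As a minimal bit-serial DA sketch (my own illustration in C, not code from the tutorial; K, B and the coefficient values are arbitrary): one LUT lookup and one shift-add per input bit, so B clock cycles per output sample, and a second LUT-and-adder working on a second bit per pass is what buys the doubled sample rate described above.

/* Bit-serial distributed arithmetic for a K-tap FIR, B-bit two's-complement
   inputs: the LUT holds every possible sum of coefficients, selected by one
   bit from each of the K current input samples. */
#include <stdint.h>
#include <stdio.h>

#define K 4   /* number of taps (kept small so the LUT is tiny) */
#define B 8   /* input word length in bits                      */

static int32_t lut[1 << K];   /* all 2^K possible sums of the coefficients */

static void build_lut(const int32_t c[K])
{
    for (int a = 0; a < (1 << K); a++) {
        lut[a] = 0;
        for (int k = 0; k < K; k++)
            if (a & (1 << k))
                lut[a] += c[k];
    }
}

/* x[0..K-1] holds the K most recent input samples */
static int32_t da_fir(const int8_t x[K])
{
    int32_t acc = 0;
    for (int b = 0; b < B; b++) {                 /* one bit per "clock"      */
        int addr = 0;
        for (int k = 0; k < K; k++)               /* bit b of each sample     */
            addr |= (((uint8_t)x[k] >> b) & 1) << k;
        int32_t partial = lut[addr];
        if (b == B - 1)
            acc -= partial << b;                  /* sign bit has negative weight */
        else
            acc += partial << b;
    }
    return acc;
}

int main(void)
{
    int32_t c[K] = { 1, 2, 3, 4 };
    int8_t  x[K] = { 10, -5, 7, 3 };
    build_lut(c);

    int32_t ref = 0;                              /* direct sum of products   */
    for (int k = 0; k < K; k++) ref += c[k] * x[k];

    printf("DA result %d, direct result %d\n", (int)da_fir(x), (int)ref);
    return 0;
}

In a real filter the x[] array would be a shift register of the K most recent samples; the sketch only shows why the cycle count per output is set by the number of input bits processed per clock.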

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com  
http://www.andraka.com  

 "They that give up essential liberty to obtain a little 
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759
Reply to
Ray Andraka

Anup, in FPGAs, without going to extraordinary lengths, you get one pipeline stage per add (those extraordinary lengths cost more area than simply duplicating the adder and processing two samples at once). At the time Greg's article was written, a highly optimized FPGA design done by someone very skilled in the craft had a maximum clock rate quite a bit less than 100 MHz, and that was for an extensively pipelined design.

Reply to
Ray Andraka

Rob, DA splits the problem up differently. With DA, you compute the sum of all the 1-bit x N-bit partial products for the first input bit, and on the next clock the sum of all the 1xN partials for the second bit, and so on. The partials are then scaled and accumulated. The throughput is increased by computing more than one 1xN partial product at a time.
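Written out as a worked equation (the standard DA rearrangement, in my own notation): for $K$ taps with coefficients $c_k$ and $B$-bit two's-complement inputs $x_k = -x_{k,B-1}2^{B-1} + \sum_{b=0}^{B-2} x_{k,b}2^{b}$, the filter output is

$$
y = \sum_{k=0}^{K-1} c_k x_k
  = -2^{B-1}\sum_{k=0}^{K-1} c_k\,x_{k,B-1}
  + \sum_{b=0}^{B-2} 2^{b}\Big(\sum_{k=0}^{K-1} c_k\,x_{k,b}\Big),
$$

where each inner sum over $k$ is one set of 1xN partial products. Serial DA computes one such sum per clock; parallel DA computes several at once.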

Reply to
Ray Andraka
