# Compact FIR filters with multiplier blocks?

• posted

=_NextPart_000_01D7_01C3702A.3B784160 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable

Hello,

Has anyone compared FPGA implementations of full-rate digital FIR = filters based on the use of Multiplier Blocks vs. traditional FIRs with = constant coefficient multipliers? By full rate, I mean: one output = result per clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network of shifters and = adders that performs multiplications by several coefficients efficiently = by exploiting common sub-expressions. The multiplier block can be = exploited in FIR filters by transposing the standard filter so that the = products of all the coefficients with the current input-sample are = required simultaneously.

Also, by representing the coefficients in the Canonical-Signed-Digit = number system (a small number of +1 and -1's) along common = sub-expression sharing the multiplier block can get even smaller.

For example, the multiplier block for a 100 tap FIR filter (fp=3D0.10 = and fs=3D0.12) can be realized with only 61 adds (zero explicit = multiplications). See filter example #4 in "FIR Filter Synthesis = Algorithms for Minimizing the Delay and the Number of Adders," =

If the adder depth is constrained to a maximum of four, then the = authors' algorithm can do the multiplier block in 69 additions.

It would seem that this approach would be very efficient in a target = such as the Xilinx Spartan-IIE (with no dedicated multipliers).

Another question: If we only need one result per K clock periods (K ~=3D =

1000 for audio applications), could a multiplier block approach realized = with, say, bit-serial addition be more efficient than some other = approach such as distributed arithmetic?

-Michael

______________________ Michael E. Spencer, Ph.D. President Signal Processing Solutions, Inc. Web:

• posted

--

--Ray Andraka, P.E. President, the Andraka Consulting Group, Inc.

401/884-7930 Fax 401/884-7950 email snipped-for-privacy@andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759

• posted

Ray,

I sent this to Michael via email and he suggested the group would be interested also...

My PhD (now drawing to the end) has been on implementing full-parallel Transpose FIR filters using multiplier blocks that you mention (I use techniques/algorithms that exceed the efficiency of CSD in terms of FPGA area).

The upshot of my work is that I have written a C++ program that will generate RTL VHDL given the quantised filter coefficients, the type of filter required (singlerate, interpolation, decimation etc.) and the appropriate parameters (input width, signed/unsigned input, number of channels, rate-change factor etc.)

The VHDL my program generates exceeds the functionality (at a lower cost) of that provided by Xilinx's Distributed Arithmetic core and Altera's FIR Compiler (also DA). In fact, my program allows interpolation and decimation factors up to the number of filter coefficients and any number of data channels (for interpolation/decimation filters also).

The main point is that, once synthesised and mapped to a specific FPGA, the filters my program generates require far less FPGA area (slices/logic cells) than those generated using Distributed Arithmetic. The critical path in my filters is just the longest adder carry chain so very high speeds are possible. E.g. 154MHz for a singlerate filter (25 bit output) in a Xilinx xc2v3000-fg676-5 - obviously the speed will depend on the device family/speed grade and the longest carry chain. The facility for multiple channels in interpolation/decimation filters (not supported by Xilinx) allows lower than full-parallel sampling rates to be efficiently processed in one filter.

As Michael points out in his post, this technique would be very suitable for a Xilinx Spartan-IIE and indeed any FPGA - there are many cases where these filters would be useful even on devices with dedicated multipliers (when they are all in use for example! ;-) ).

You can find out more at

- there are also datasheets here that provide comparisons with Xilinx and Altera and demonstrate the output of another application (written in java) that generates schematic representations of the filters for use in reports, meetings and thesises! :-)

I hope this information is of use to you - please contact me if you have any questions,

Ken

```--

"Ray Andraka"  wrote in message```
• posted

I agree the multiplier block style filters are more efficient area-wise. It sounds like you have addressed the irregularity issues by using a program to do the generation, which I think is pretty much a necessity. As I thought I alluded to, the biggest problem with multiplier block filters is that the layout/size is not a constant if you change the coefficients. This means that the fiter coefficients have to be constant and known earlier in the design cycle, and necessitates a rerun of synthesis, place and route for any filter changes. Depending on the implementation, it may also mean a change in the filter's pipeline latency. These factors can make them difficult to use on some projects. The filters typically used in my projects often need to be adjusted by the customer or late in the project to accommodate minor requirements changes. I prefer to use a filter with reloadable coefficients for that reason.

Ken wrote:

--

--Ray Andraka, P.E. President, the Andraka Consulting Group, Inc.

401/884-7930 Fax 401/884-7950 email snipped-for-privacy@andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759

• posted

Ray,

You raise some excellent points clearly derived from experience.

Hopefully my program will still be useful when filters can be fixed early on in the project or when a synthesis/place and route run is an acceptable cost for the area efficiency provided.

Also, for devices being used in consumer applications, perhaps the area saved using a multiplier block filter would allow a smaller and cheaper device within the family to be used - reducing production costs depending on the number of units being produced.

Cheers,

Ken

It

thought

the

that

filter

the

on

coefficients

Altera's

number of

the

cells)

my

Xilinx

multiple

processed

for

these

any

• posted

I went around the irregularity issue by having sub-multiplier block architecture that has have fixed interface to the routing and have fixed (yet reasonable) area. Therefore, when the coefficients are changed, no place and route is required and the latency remains the same (unless you change the number taps). The generation of coefficients can done at reconfiguration time thanks to symmetry in the FPGA used (Atmel 40K40). Naturally, there is the problem of hassling with run-time reconfiguration and everything that comes with that...

As part of this work we looked also into common subexpression sharing in that particular FPGA family and found it very unlikely that benefits could be obtained with similar multiplier-block architecture. This is mainly due the fact that it is different story to be able to generate the most useful common subexpressions that it is to really use them before the routing becomes congested.

T.Rissa

Ray Andraka wrote:

• posted

Ken, While the RSG solution may yield smaller designs for specific cases, the Altera FIR Compiler gives you more flexibility in terms of optimizing area vs.speed.

For instance, the numbers presented in the RSG datasheet is based on a pipeline=2 setting for the Altera FIR Compiler. Using the FIR Compiler, the design yields an fmax of 322MHz (single rate, single channel). This is much higher than the 154MHz cited for the filter using the RSG approach. This is the classic speed/area trade-off scenario.

If indeed area is the critical factor, it is possible to reduce the pipeline to 1 in the FIR Compiler. In the single rate cases, the logic cell count comparison would show that the RSG approach would be beneficial for the single and 2 channel FIR designs (58% and 80% respectively). In the 4 and 8 channel FIR designs, the distributed arithmetic approach employed by the Altera FIR Compiler yields better area compared to the RSG generated filter (106% and 133% respectively). Reducing the number of pipeline stage to 1 yields fmax of 237MHz (single rate, single channel), still well beyond the performance requirement in most cases. This, I believe, is a more accurate comparison for the RSG datasheet.

Regards, HS

that

• posted

Hong,

Firstly, my apologies for the delay in replying.

I have inserted my responses to your points below:

Agreed - RSG only produces full-parallel filters. If multiple clocks per output can be used (depending on data rates and avaible clock/power consumption requirements of course) then I would be the first to say use Distributed Arithmetic (DA) in multi-cycle mode.

The figure of 154MHz is not in the datasheet you are referring to (available from

- I do not know where you got this number from (I seem to remember it being in an old version [which did not include Altera results] but I removed it because the 154MHz was Xilinx specific for a particular filter on a particular device and has no bearing at all to Altera devices/filter implementations).

The test filters I did came in at various speeds between [244-283MHz] for RSG and [217-293MHZ] for Altera FIR Compiler (on exactly the same device and with the same constraints). This is why I used pipeline level 2 for fir compiler because pipeline level 1 made the fir compiler filters too slow and pipeline level 3 made the fir compiler filters ridiculously large and not that much faster.

Hence, pipeline level 2 does give a fair comparison with RSG.

For pipeline level 1, the FIR compiler clock rates were in most cases slower than for RSG and the FIR compiler implementations were also larger. The

8-channel singlerate FIR was the only one that was the same size but the RSG filter ran at 251MHz and FIR Comp at 206MHz.

Cheers,

Ken

area-wise. It

program

thought

that the

means that

design

filter

in the

use on

to be

coefficients

full-parallel

FPGA

of

Altera's

number of

FPGA, the

cells)

in my

Xilinx

multiple

Xilinx)

processed

suitable for

these

(when

also

reports,

have any

• posted

I went around the irregularity issue by having sub-multiplier block architecture that has have fixed interface to the routing and have fixed (yet reasonable) area. Therefore, when the coefficients are changed, no place and route is required and the latency remains the same (unless you change the number taps). The generation of coefficients can done at reconfiguration time thanks to symmetry in the FPGA used (Atmel 40K40). Naturally, there is the problem of hassling with run-time reconfiguration and everything that comes with that...

As part of this work we looked also into common subexpression sharing in that particular FPGA family and found it very unlikely that benefits could be obtained with similar multiplier-block architecture. This is mainly due the fact that it is different story to be able to generate the most useful common subexpressions that it is to really use them before the routing becomes congested.