#### Do you have a question? Post it now! No Registration Necessary

**posted on**

- Michael Spencer

September 1, 2003, 5:41 am

Hello,

Has anyone compared FPGA implementations of full-rate digital FIR =

filters based on the use of Multiplier Blocks vs. traditional FIRs with =

constant coefficient multipliers? By full rate, I mean: one output =

result per clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network of shifters and =

adders that performs multiplications by several coefficients efficiently =

by exploiting common sub-expressions. The multiplier block can be =

exploited in FIR filters by transposing the standard filter so that the =

products of all the coefficients with the current input-sample are =

required simultaneously.

Also, by representing the coefficients in the Canonical-Signed-Digit =

number system (a small number of +1 and -1's) along common =

sub-expression sharing the multiplier block can get even smaller.

For example, the multiplier block for a 100 tap FIR filter (fp3D%0.10 =

and fs3D%0.12) can be realized with only 61 adds (zero explicit =

multiplications). See filter example #4 in "FIR Filter Synthesis =

Algorithms for Minimizing the Delay and the Number of Adders," =

http://ics.kaist.ac.kr/~dk/papers/TCAD2001.pdf

If the adder depth is constrained to a maximum of four, then the =

authors' algorithm can do the multiplier block in 69 additions.

It would seem that this approach would be very efficient in a target =

such as the Xilinx Spartan-IIE (with no dedicated multipliers).

Another question: If we only need one result per K clock periods (K ~3D% =

1000 for audio applications), could a multiplier block approach realized =

with, say, bit-serial addition be more efficient than some other =

approach such as distributed arithmetic?

Comments welcome. Thanks.

-Michael

______

______

______

______

______

______

______

______

______

______

______

Michael E. Spencer, Ph.D.

President

Signal Processing Solutions, Inc.

Web: http://www.spsolutions.com

Re: Compact FIR filters with multiplier blocks?

construction is predicated on the specific coefficients. As

a result it is considerably harder to use for an arbitrary

set of coefficients. It may reduce area over a straight FIR

filter running at the same clocks per sample, but at a

considerable cost in design time and flexibility. You also

give up regularity in the structure, which may reduce the

overall performance. Essentially what the block multiplier

and distributed arithmetic approaches are is a rearrangement

of the bitwise product terms. The mutliplier block takes

advantage of duplicate terms by adding the inputs before

they are multiplied by the term.

Michael Spencer wrote:

--

--Ray Andraka, P.E.

President, the Andraka Consulting Group, Inc.

--Ray Andraka, P.E.

President, the Andraka Consulting Group, Inc.

We've slightly trimmed the long signature. Click to see the full one.

Re: Compact FIR filters with multiplier blocks?

Ray,

I sent this to Michael via email and he suggested the group would be

interested also...

My PhD (now drawing to the end) has been on implementing full-parallel

Transpose FIR filters using multiplier blocks that you mention (I use

techniques/algorithms that exceed the efficiency of CSD in terms of FPGA

area).

The upshot of my work is that I have written a C++ program that will

generate RTL VHDL given the quantised filter coefficients, the type of

filter required (singlerate, interpolation, decimation etc.) and the

appropriate parameters (input width, signed/unsigned input, number of

channels, rate-change factor etc.)

The VHDL my program generates exceeds the functionality (at a lower

cost) of that provided by Xilinx's Distributed Arithmetic core and Altera's

FIR Compiler (also DA). In fact, my program allows interpolation and

decimation factors up to the number of filter coefficients and any number of

data channels (for interpolation/decimation filters also).

The main point is that, once synthesised and mapped to a specific FPGA, the

filters my program generates require far less FPGA area (slices/logic cells)

than those generated using Distributed Arithmetic. The critical path in my

filters is just the longest adder carry chain so very high speeds are

possible. E.g. 154MHz for a singlerate filter (25 bit output) in a Xilinx

xc2v3000-fg676-5 - obviously the speed will depend on the device

family/speed grade and the longest carry chain. The facility for multiple

channels in interpolation/decimation filters (not supported by Xilinx)

allows lower than full-parallel sampling rates to be efficiently processed

in one filter.

As Michael points out in his post, this technique would be very suitable for

a

Xilinx Spartan-IIE and indeed any FPGA - there are many cases where these

filters would be useful even on devices with dedicated multipliers (when

they are all in use for example! ;-) ).

You can find out more at http://www.dspec.org/rsg.asp - there are also

datasheets here that provide comparisons with Xilinx and Altera and

demonstrate the output of another application (written in java) that

generates schematic representations of the filters for use in reports,

meetings and thesises! :-)

I hope this information is of use to you - please contact me if you have any

questions,

Thanks for your time,

Ken

I sent this to Michael via email and he suggested the group would be

interested also...

My PhD (now drawing to the end) has been on implementing full-parallel

Transpose FIR filters using multiplier blocks that you mention (I use

techniques/algorithms that exceed the efficiency of CSD in terms of FPGA

area).

The upshot of my work is that I have written a C++ program that will

generate RTL VHDL given the quantised filter coefficients, the type of

filter required (singlerate, interpolation, decimation etc.) and the

appropriate parameters (input width, signed/unsigned input, number of

channels, rate-change factor etc.)

The VHDL my program generates exceeds the functionality (at a lower

cost) of that provided by Xilinx's Distributed Arithmetic core and Altera's

FIR Compiler (also DA). In fact, my program allows interpolation and

decimation factors up to the number of filter coefficients and any number of

data channels (for interpolation/decimation filters also).

The main point is that, once synthesised and mapped to a specific FPGA, the

filters my program generates require far less FPGA area (slices/logic cells)

than those generated using Distributed Arithmetic. The critical path in my

filters is just the longest adder carry chain so very high speeds are

possible. E.g. 154MHz for a singlerate filter (25 bit output) in a Xilinx

xc2v3000-fg676-5 - obviously the speed will depend on the device

family/speed grade and the longest carry chain. The facility for multiple

channels in interpolation/decimation filters (not supported by Xilinx)

allows lower than full-parallel sampling rates to be efficiently processed

in one filter.

As Michael points out in his post, this technique would be very suitable for

a

Xilinx Spartan-IIE and indeed any FPGA - there are many cases where these

filters would be useful even on devices with dedicated multipliers (when

they are all in use for example! ;-) ).

You can find out more at http://www.dspec.org/rsg.asp - there are also

datasheets here that provide comparisons with Xilinx and Altera and

demonstrate the output of another application (written in java) that

generates schematic representations of the filters for use in reports,

meetings and thesises! :-)

I hope this information is of use to you - please contact me if you have any

questions,

Thanks for your time,

Ken

--

To reply by email, please remove the _MENOWANTSPAM from my email address.

To reply by email, please remove the _MENOWANTSPAM from my email address.

We've slightly trimmed the long signature. Click to see the full one.

Re: Compact FIR filters with multiplier blocks?

sounds like you have addressed the irregularity issues by using a program

to do the generation, which I think is pretty much a necessity. As I thought

I alluded to, the biggest problem with multiplier block filters is that the

layout/size is not a constant if you change the coefficients. This means that

the fiter coefficients have to be constant and known earlier in the design

cycle, and necessitates a rerun of synthesis, place and route for any filter

changes. Depending on the implementation, it may also mean a change in the

filter's pipeline latency. These factors can make them difficult to use on

some projects. The filters typically used in my projects often need to be

adjusted by the customer or late in the project to accommodate minor

requirements changes. I prefer to use a filter with reloadable coefficients

for that reason.

Ken wrote:

--

--Ray Andraka, P.E.

President, the Andraka Consulting Group, Inc.

--Ray Andraka, P.E.

President, the Andraka Consulting Group, Inc.

We've slightly trimmed the long signature. Click to see the full one.

Re: Compact FIR filters with multiplier blocks?

Ray,

Thanks for your quick response.

You raise some excellent points clearly derived from experience.

Hopefully my program will still be useful when filters can be fixed early on

in the project or when a synthesis/place and route run is an acceptable cost

for the area efficiency provided.

Also, for devices being used in consumer applications, perhaps the area

saved using a multiplier block filter would allow a smaller and cheaper

device within the family to be used - reducing production costs depending on

the number of units being produced.

Cheers,

Ken

It

thought

the

that

filter

the

on

coefficients

Altera's

number of

the

cells)

my

Xilinx

multiple

processed

for

these

any

address.

Thanks for your quick response.

You raise some excellent points clearly derived from experience.

Hopefully my program will still be useful when filters can be fixed early on

in the project or when a synthesis/place and route run is an acceptable cost

for the area efficiency provided.

Also, for devices being used in consumer applications, perhaps the area

saved using a multiplier block filter would allow a smaller and cheaper

device within the family to be used - reducing production costs depending on

the number of units being produced.

Cheers,

Ken

It

thought

the

that

filter

the

on

coefficients

Altera's

number of

the

cells)

my

Xilinx

multiple

processed

for

these

any

address.

Re: Compact FIR filters with multiplier blocks?

I went around the irregularity issue by having sub-multiplier

block architecture that has have fixed interface to the routing

and have fixed (yet reasonable) area. Therefore, when the

coefficients are changed, no place and route is required and

the latency remains the same (unless you change the number taps).

The generation of coefficients can done at reconfiguration time

thanks to symmetry in the FPGA used (Atmel 40K40). Naturally,

there is the problem of hassling with run-time reconfiguration

and everything that comes with that...

As part of this work we looked also into common subexpression

sharing in that particular FPGA family and found it very unlikely

that benefits could be obtained with similar multiplier-block

architecture. This is mainly due the fact that it is different

story to be able to generate the most useful common subexpressions

that it is to really use them before the routing becomes congested.

http://www.doc.ic.ac.uk/~tpr/papers/rissa_FPT02.pdf

T.Rissa

Re: Compact FIR filters with multiplier blocks?

While the RSG solution may yield smaller designs for specific cases,

the Altera FIR Compiler gives you more flexibility in terms of

optimizing area vs.speed.

For instance, the numbers presented in the RSG datasheet is based on a

pipeline=2 setting for the Altera FIR Compiler. Using the FIR

Compiler, the design yields an fmax of 322MHz (single rate, single

channel). This is much higher than the 154MHz cited for the filter

using the RSG approach. This is the classic speed/area trade-off

scenario.

If indeed area is the critical factor, it is possible to reduce the

pipeline to 1 in the FIR Compiler. In the single rate cases, the

logic cell count comparison would show that the RSG approach would be

beneficial for the single and 2 channel FIR designs (58% and 80%

respectively). In the 4 and 8 channel FIR designs, the distributed

arithmetic approach employed by the Altera FIR Compiler yields better

area compared to the RSG generated filter (106% and 133%

respectively). Reducing the number of pipeline stage to 1 yields fmax

of 237MHz (single rate, single channel), still well beyond the

performance requirement in most cases. This, I believe, is a more

accurate

comparison for the RSG datasheet.

Regards,

HS

that

Re: Compact FIR filters with multiplier blocks?

Hong,

Firstly, my apologies for the delay in replying.

I have inserted my responses to your points below:

Agreed - RSG only produces full-parallel filters. If multiple clocks per

output can be used (depending on data rates and avaible clock/power

consumption requirements of course) then I would be the first to say use

Distributed Arithmetic (DA) in multi-cycle mode.

The figure of 154MHz is not in the datasheet you are referring to (available

from http://www.dspec.org/rsg/downloads/datasheets/RSG_Overview.pdf ) - I do

not know where you got this number from (I seem to remember it being in an

old version [which did not include Altera results] but I removed it because

the 154MHz was Xilinx specific for a particular filter on a particular

device and has no bearing at all to Altera devices/filter implementations).

The test filters I did came in at various speeds between [244-283MHz] for

RSG and [217-293MHZ] for Altera FIR Compiler (on exactly the same device and

with the same constraints). This is why I used pipeline level 2 for fir

compiler because pipeline level 1 made the fir compiler filters too slow and

pipeline level 3 made the fir compiler filters ridiculously large and not

that much faster.

Hence, pipeline level 2 does give a fair comparison with RSG.

For pipeline level 1, the FIR compiler clock rates were in most cases slower

than for RSG and the FIR compiler implementations were also larger. The

8-channel singlerate FIR was the only one that was the same size but the RSG

filter ran at 251MHz and FIR Comp at 206MHz.

Cheers,

Ken

area-wise. It

program

thought

that the

means that

design

filter

in the

use on

to be

coefficients

full-parallel

FPGA

of

Altera's

number of

FPGA, the

cells)

in my

Xilinx

multiple

Xilinx)

processed

suitable for

these

(when

also

reports,

have any

address.

Re: Compact FIR filters with multiplier blocks?

I went around the irregularity issue by having sub-multiplier

block architecture that has have fixed interface to the routing

and have fixed (yet reasonable) area. Therefore, when the

coefficients are changed, no place and route is required and

the latency remains the same (unless you change the number taps).

The generation of coefficients can done at reconfiguration time

thanks to symmetry in the FPGA used (Atmel 40K40). Naturally,

there is the problem of hassling with run-time reconfiguration

and everything that comes with that...

As part of this work we looked also into common subexpression

sharing in that particular FPGA family and found it very unlikely

that benefits could be obtained with similar multiplier-block

architecture. This is mainly due the fact that it is different

story to be able to generate the most useful common subexpressions

that it is to really use them before the routing becomes congested.

http://www.doc.ic.ac.uk/~tpr/papers/rissa_FPT02.pdf

T.Rissa

tpr at doc ic ac uk

#### Site Timeline

- » What does + synthesize to?
- — Next thread in » Field-Programmable Gate Arrays

- » Re: How to use Modelsim-Altera to do the timing simulation?
- — Previous thread in » Field-Programmable Gate Arrays

- » Fully Comitted to LVDS as Comparitors
- — Newest thread in » Field-Programmable Gate Arrays

- » pemohon
- — The site's Newest Thread. Posted in » Electronics Computer-Aided Design