Compact FIR filters with multiplier blocks?

Do you have a question? Post it now! No Registration Necessary

Translate This Thread From English to

Threaded View


Hello,

Has anyone compared FPGA implementations of full-rate digital FIR =
filters based on the use of Multiplier Blocks vs. traditional FIRs with =
constant coefficient multipliers? By full rate, I mean: one output =
result per clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network of shifters and =
adders that performs multiplications by several coefficients efficiently =
by exploiting common sub-expressions. The multiplier block can be =
exploited in FIR filters by transposing the standard filter so that the =
products of all the coefficients with the current input-sample are =
required simultaneously.

Also, by representing the coefficients in the Canonical-Signed-Digit =
number system (a small number of  +1 and -1's) along common =
sub-expression sharing the multiplier block can get even smaller.

For example, the multiplier block for a 100 tap FIR filter (fp3D%0.10 =
and fs3D%0.12) can be realized with only 61 adds (zero explicit =
multiplications). See filter example #4 in "FIR Filter Synthesis =
Algorithms for Minimizing the Delay and the Number of Adders," =
http://ics.kaist.ac.kr/~dk/papers/TCAD2001.pdf

If the adder depth is constrained to a maximum of four, then the =
authors' algorithm can do the multiplier block in 69 additions.

It would seem that this approach would be very efficient in a target =
such as the Xilinx Spartan-IIE (with no dedicated multipliers).

Another question: If we only need one result per K clock periods (K ~3D% =
1000 for audio applications), could a multiplier block approach realized =
with, say, bit-serial addition be more efficient than some other =
approach such as distributed arithmetic?

Comments welcome. Thanks.

-Michael

______________________
Michael E. Spencer, Ph.D.
President
Signal Processing Solutions, Inc.
Web: http://www.spsolutions.com

Re: Compact FIR filters with multiplier blocks?
The problem with the multiplier block approach is that the
construction is predicated on the specific coefficients.  As
a result it is considerably harder to use for an arbitrary
set of coefficients.  It may reduce area over a straight FIR
filter running at the same clocks per sample, but at a
considerable cost in design time and flexibility.  You also
give up regularity in the structure, which may reduce the
overall performance.   Essentially what the block multiplier
and distributed arithmetic approaches are is a rearrangement
of the bitwise product terms.  The mutliplier block takes
advantage of duplicate terms by adding the inputs before
they are multiplied by the term.

Michael Spencer wrote:

Quoted text here. Click to load it

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
We've slightly trimmed the long signature. Click to see the full one.
Re: Compact FIR filters with multiplier blocks?
Ray,

I sent this to Michael via email and he suggested the group would be
interested also...

My PhD (now drawing to the end) has been on implementing full-parallel
Transpose FIR filters using multiplier blocks that you mention (I use
techniques/algorithms that exceed the efficiency of CSD in terms of FPGA
area).

The upshot of my work is that I have written a C++ program that will
generate RTL VHDL given the quantised filter coefficients, the type of
filter required (singlerate, interpolation, decimation etc.) and the
appropriate parameters (input width, signed/unsigned input, number of
channels, rate-change factor etc.)

The VHDL my program generates exceeds the functionality (at a lower
cost) of that provided by Xilinx's Distributed Arithmetic core and Altera's
FIR Compiler (also DA).  In fact, my program allows interpolation and
decimation factors up to the number of filter coefficients and any number of
data channels (for interpolation/decimation filters also).

The main point is that, once synthesised and mapped to a specific FPGA, the
filters my program generates require far less FPGA area (slices/logic cells)
than those generated using Distributed Arithmetic.  The critical path in my
filters is just the longest adder carry chain so very high speeds are
possible.  E.g. 154MHz for a singlerate filter (25 bit output) in a Xilinx
xc2v3000-fg676-5 - obviously the speed will depend on the device
family/speed grade and the longest carry chain.  The facility for multiple
channels in interpolation/decimation filters (not supported by Xilinx)
allows lower than full-parallel sampling rates to be efficiently processed
in one filter.

As Michael points out in his post, this technique would be very suitable for
a
Xilinx Spartan-IIE and indeed any FPGA - there are many cases where these
filters would be useful even on devices with dedicated multipliers (when
they are all in use for example!  ;-)   ).

You can find out more at http://www.dspec.org/rsg.asp - there are also
datasheets here that provide comparisons with Xilinx and Altera and
demonstrate the output of another application (written in java) that
generates schematic representations of the filters for use in reports,
meetings and thesises!  :-)

I hope this information is of use to you - please contact me if you have any
questions,

Thanks for your time,

Ken


--
To reply by email, please remove the _MENOWANTSPAM from my email address.


We've slightly trimmed the long signature. Click to see the full one.
Re: Compact FIR filters with multiplier blocks?
I agree the multiplier block style filters are more efficient area-wise.  It
sounds like you have addressed the irregularity issues by using a program
to do the generation, which I think is pretty much a necessity.  As I thought
I alluded to, the biggest problem with multiplier block filters is that the
layout/size is not a constant if you change the coefficients.  This means that
the fiter coefficients have to be constant and known earlier in the design
cycle, and necessitates a rerun of synthesis, place and route for any filter
changes.  Depending on the implementation, it may also mean a change in the
filter's pipeline latency.  These factors can make them difficult to use on
some projects.  The filters typically used in my projects often need to be
adjusted by the customer or late in the project to accommodate minor
requirements changes.  I prefer to use a filter with reloadable coefficients
for that reason.



Ken wrote:

Quoted text here. Click to load it

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
We've slightly trimmed the long signature. Click to see the full one.
Re: Compact FIR filters with multiplier blocks?
Ray,

Thanks for your quick response.

You raise some excellent points clearly derived from experience.

Hopefully my program will still be useful when filters can be fixed early on
in the project or when a synthesis/place and route run is an acceptable cost
for the area efficiency provided.

Also, for devices being used in consumer applications, perhaps the area
saved using a multiplier block filter would allow a smaller and cheaper
device within the family to be used - reducing production costs depending on
the number of units being produced.

Cheers,

Ken



Quoted text here. Click to load it
It
thought
the
that
filter
the
on
coefficients
Altera's
number of
the
cells)
Quoted text here. Click to load it
my
Xilinx
multiple
processed
for
these
any
address.



Re: Compact FIR filters with multiplier blocks?

I went around the irregularity issue by having sub-multiplier
block architecture that has have fixed interface to the routing
and have fixed (yet reasonable) area. Therefore, when the
coefficients are changed, no place and route is required and
the latency remains the same (unless you change the number taps).
The generation of coefficients can done at reconfiguration time
thanks to symmetry in the FPGA used (Atmel 40K40). Naturally,
there is the problem of hassling with run-time reconfiguration
and everything that comes with that...

As part of this work we looked also into common subexpression
sharing in that particular FPGA family and found it very unlikely
that benefits could be obtained with similar multiplier-block
architecture. This is mainly due the fact that it is different
story to be able to generate the most useful common subexpressions
that it is to really use them before the routing becomes congested.

http://www.doc.ic.ac.uk/~tpr/papers/rissa_FPT02.pdf

T.Rissa


Quoted text here. Click to load it









Re: Compact FIR filters with multiplier blocks?
Ken,
While the RSG solution may yield smaller designs for specific cases,
the Altera FIR Compiler gives you more flexibility in terms of
optimizing area vs.speed.

For instance, the numbers presented in the RSG datasheet is based on a
pipeline=2 setting for the Altera FIR Compiler.  Using the FIR
Compiler, the design yields an fmax of 322MHz (single rate, single
channel). This is much higher than the 154MHz cited for the filter
using the RSG approach.  This is the classic speed/area trade-off
scenario.

If indeed area is the critical factor, it is possible to reduce the
pipeline to 1 in the FIR Compiler.  In the single rate cases, the
logic cell count comparison would show that the RSG approach would be
beneficial for the single and 2 channel FIR designs (58% and 80%
respectively).  In the 4 and 8 channel FIR designs, the distributed
arithmetic approach employed by the Altera FIR Compiler yields better
area compared to the RSG generated filter (106% and 133%
respectively).  Reducing the number of pipeline stage to 1 yields fmax
of 237MHz (single rate, single channel), still well beyond the
performance requirement in most cases.  This, I believe, is a more
accurate
comparison for the RSG datasheet.

Regards,
HS

Quoted text here. Click to load it
that
Quoted text here. Click to load it

Re: Compact FIR filters with multiplier blocks?

Hong,

Firstly, my apologies for the delay in replying.

I have inserted my responses to your points below:


Quoted text here. Click to load it

Agreed - RSG only produces full-parallel filters.  If multiple clocks per
output can be used (depending on data rates and avaible clock/power
consumption requirements of course) then I would be the first to say use
Distributed Arithmetic (DA) in multi-cycle mode.

Quoted text here. Click to load it

The figure of 154MHz is not in the datasheet you are referring to (available
from http://www.dspec.org/rsg/downloads/datasheets/RSG_Overview.pdf ) - I do
not know where you got this number from (I seem to remember it being in an
old version [which did not include Altera results] but I removed it because
the 154MHz was Xilinx specific for a particular filter on a particular
device and has no bearing at all to Altera devices/filter implementations).

The test filters I did came in at various speeds between [244-283MHz] for
RSG and [217-293MHZ] for Altera FIR Compiler (on exactly the same device and
with the same constraints). This is why I used pipeline level 2 for fir
compiler because pipeline level 1 made the fir compiler filters too slow and
pipeline level 3 made the fir compiler filters ridiculously large and not
that much faster.

Hence, pipeline level 2 does give a fair comparison with RSG.

For pipeline level 1, the FIR compiler clock rates were in most cases slower
than for RSG and the FIR compiler implementations were also larger.  The
8-channel singlerate FIR was the only one that was the same size but the RSG
filter ran at 251MHz and FIR Comp at 206MHz.

Cheers,

Ken


Quoted text here. Click to load it
area-wise.  It
Quoted text here. Click to load it
program
thought
that the
means that
design
filter
in the
use on
to be
coefficients
full-parallel
Quoted text here. Click to load it
FPGA
of
Altera's
number of
FPGA, the
cells)
Quoted text here. Click to load it
in my
Xilinx
multiple
Xilinx)
Quoted text here. Click to load it
processed
suitable for
these
(when
Quoted text here. Click to load it
also
reports,
have any
address.




Re: Compact FIR filters with multiplier blocks?

I went around the irregularity issue by having sub-multiplier
block architecture that has have fixed interface to the routing
and have fixed (yet reasonable) area. Therefore, when the
coefficients are changed, no place and route is required and
the latency remains the same (unless you change the number taps).
The generation of coefficients can done at reconfiguration time
thanks to symmetry in the FPGA used (Atmel 40K40). Naturally,
there is the problem of hassling with run-time reconfiguration
and everything that comes with that...

As part of this work we looked also into common subexpression
sharing in that particular FPGA family and found it very unlikely
that benefits could be obtained with similar multiplier-block
architecture. This is mainly due the fact that it is different
story to be able to generate the most useful common subexpressions
that it is to really use them before the routing becomes congested.

http://www.doc.ic.ac.uk/~tpr/papers/rissa_FPT02.pdf

T.Rissa

tpr at doc ic ac uk

Quoted text here. Click to load it









Site Timeline