Compact FIR filters with multiplier blocks?

Question

Hello,

Has anyone compared FPGA implementations of full-rate digital FIR filters based on multiplier blocks vs. traditional FIRs with constant-coefficient multipliers? By full rate, I mean one output result per clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network of shifters and adders that performs multiplications by several coefficients efficiently by exploiting common sub-expressions. The multiplier block can be exploited in FIR filters by transposing the standard filter so that the products of all the coefficients with the current input sample are required simultaneously.

Also, by representing the coefficients in the Canonical Signed Digit (CSD) number system (a small number of +1's and -1's) along with common sub-expression sharing, the multiplier block can get even smaller.

For example, the multiplier block for a 100-tap FIR filter (fp=0.10 and fs=0.12) can be realized with only 61 adds (zero explicit multiplications). See filter example #4 in "FIR Filter Synthesis Algorithms for Minimizing the Delay and the Number of Adders." If the adder depth is constrained to a maximum of four, then the authors' algorithm can do the multiplier block in 69 additions.

It would seem that this approach would be very efficient in a target such as the Xilinx Spartan-IIE (with no dedicated multipliers).

Another question: if we only need one result per K clock periods (K ~= 1000 for audio applications), could a multiplier block approach realized with, say, bit-serial addition be more efficient than some other approach such as distributed arithmetic?

Comments welcome. Thanks.

-Michael

______________________
Michael E. Spencer, Ph.D.
President
Signal Processing Solutions, Inc.
Web:
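To make the shift-and-add idea in the question concrete, here is a minimal Python sketch (not taken from the cited paper). It converts integer coefficients to CSD digits, forms each product from shifts and adds only, and feeds the products into a transposed-form FIR. The function names and the sample coefficients are assumptions for illustration, and no sub-expression sharing between coefficients is attempted, so this shows the unoptimized starting point for a multiplier block rather than an optimized one.

```python
def to_csd(n):
    """CSD digits of an integer, least significant first (each digit is -1, 0 or +1)."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def shift_add_multiply(x, digits):
    """Multiply x by the constant encoded in CSD digits using shifts and adds only."""
    return sum(d * (x << k) for k, d in enumerate(digits) if d != 0)


def transposed_fir(samples, coeffs):
    """Transposed-form FIR: every coefficient multiplies the current input sample."""
    csd = [to_csd(c) for c in coeffs]
    n_taps = len(coeffs)
    state = [0] * n_taps  # state[i] is the register feeding the adder of tap i
    out = []
    for x in samples:
        # All products of the current sample are needed at once; this is
        # the part a multiplier block would implement with shared adders.
        products = [shift_add_multiply(x, d) for d in csd]
        out.append(products[0] + state[0])
        state = [products[i + 1] + state[i + 1] for i in range(n_taps - 1)] + [0]
    return out


if __name__ == "__main__":
    print(to_csd(7))                                  # [-1, 0, 0, 1], i.e. 8 - 1
    print(transposed_fir([1, 2, 3, 4], [7, -3, 5]))   # [7, 11, 20, 29]
```

A coefficient with m nonzero CSD digits costs m - 1 adders when built on its own; a multiplier block reduces the total by reusing intermediate sums across coefficients.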

Ray Andraka · Accepted Answer

The problem with the multiplier block approach is that the construction is predicated on the specific coefficients. As a result, it is considerably harder to use for an arbitrary set of coefficients. It may reduce area over a straight FIR filter running at the same clocks per sample, but at a considerable cost in design time and flexibility. You also give up regularity in the structure, which may reduce the overall performance.

Essentially, both the multiplier block and distributed arithmetic approaches are a rearrangement of the bitwise product terms. The multiplier block takes advantage of duplicate terms by adding the inputs before they are multiplied by the term.

--
Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
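As a rough illustration of the "rearrangement of the bitwise product terms" on the distributed arithmetic side, the following Python sketch computes a sum of products one input bit-plane at a time using a lookup table of coefficient sums. It is a behavioral model under stated assumptions (two's complement inputs that fit in `bits` bits, hypothetical function names), not a description of any vendor core.

```python
def build_da_lut(coeffs):
    """LUT entry for address a is the sum of the coefficients whose bit is set in a."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_dot(coeffs, xs, bits):
    """Sum of coeffs[k] * xs[k], computed bit-serially over the inputs.

    Each x must fit in `bits`-bit two's complement (assumption).
    """
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every input into one LUT address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        # The most significant bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc


if __name__ == "__main__":
    print(da_dot([7, -3, 5], [3, -4, 2], bits=4))  # 43
```

Here the same partial products are grouped by input bit position instead of by coefficient digit, which is why the table works unchanged for any coefficient values of the same count.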

Ken · Answer

Ray,

I sent this to Michael via email and he suggested the group would be interested also...

My PhD (now drawing to the end) has been on implementing full-parallel transpose FIR filters using the multiplier blocks that you mention (I use techniques/algorithms that exceed the efficiency of CSD in terms of FPGA area).

The upshot of my work is that I have written a C++ program that will generate RTL VHDL given the quantised filter coefficients, the type of filter required (single rate, interpolation, decimation etc.) and the appropriate parameters (input width, signed/unsigned input, number of channels, rate-change factor etc.).

The VHDL my program generates exceeds the functionality (at a lower cost) of that provided by Xilinx's Distributed Arithmetic core and Altera's FIR Compiler (also DA). In fact, my program allows interpolation and decimation factors up to the number of filter coefficients and any number of data channels (for interpolation/decimation filters also).

The main point is that, once...

Ray Andraka · Answer

I agree the multiplier block style filters are more efficient area-wise. It sounds like you have addressed the irregularity issues by using a program to do the generation, which I think is pretty much a necessity. As I thought I alluded to, the biggest problem with multiplier block filters is that the layout and size are not constant if you change the coefficients. This means that the filter coefficients have to be constant and known earlier in the design cycle, and any filter change requires a rerun of synthesis, place and route. Depending on the implementation, it may also mean a change in the filter's pipeline latency. These factors can make them difficult to use on some projects. The filters typically used in my projects often need to be adjusted by the customer or late in the project to accommodate minor requirements changes. I prefer to use a filter with reloadable coefficients for that reason.
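The dependence of size and latency on the coefficient values can be seen with a small Python estimate. This is only a sketch under simplifying assumptions: each coefficient is built separately from its CSD digits with a balanced adder tree, there is no sub-expression sharing, and one pipeline stage per adder level is assumed. The function names and the two coefficient sets are hypothetical.

```python
def csd_nonzero_count(n):
    """Number of nonzero digits in the CSD form of |n|."""
    n = abs(n)
    count = 0
    while n != 0:
        if n & 1:
            n -= 2 - (n & 3)  # subtract the CSD digit (+1 or -1)
            count += 1
        n >>= 1
    return count


def adders_without_sharing(coeffs):
    """Adders needed when every coefficient is built on its own."""
    return sum(max(csd_nonzero_count(c) - 1, 0) for c in coeffs)


def max_adder_depth(coeffs):
    """Deepest balanced adder tree over all coefficients: ceil(log2(nonzero digits))."""
    return max(max(csd_nonzero_count(c) - 1, 0).bit_length() for c in coeffs)


if __name__ == "__main__":
    for coeffs in ([7, -3, 5], [93, -45, 51]):
        print(coeffs, adders_without_sharing(coeffs), max_adder_depth(coeffs))
    # [7, -3, 5] 3 1
    # [93, -45, 51] 9 2
```

Both sets have three taps, yet the second needs three times the adders and one more adder level, which is the kind of change that forces a new place and route run and can shift the pipeline latency.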

--Ray Andraka, P.E. President, the Andraka Consulting Group, Inc.

401/884-7930 Fax 401/884-7950 email snipped-for-privacy@andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759

Ken · Answer

Ray,

Thanks for your quick response.

You raise some excellent points clearly derived from experience.

Hopefully my program will still be useful when filters can be fixed early on in the project or when a synthesis/place and route run is an acceptable cost for the area efficiency provided.

Also, for devices being used in consumer applications, perhaps the area saved using a multiplier block filter would allow a smaller and cheaper device within the family to be used, reducing production costs depending on the number of units being produced.

Cheers,

Ken

Tero Rissa · Answer

I went around the irregularity issue by having a sub-multiplier block architecture that has a fixed interface to the routing and a fixed (yet reasonable) area. Therefore, when the coefficients are changed, no place and route is required and the latency remains the same (unless you change the number of taps). The generation of coefficients can be done at reconfiguration time thanks to symmetry in the FPGA used (Atmel 40K40). Naturally, there is the problem of hassling with run-time reconfiguration and everything that comes with that...

As part of this work we also looked into common subexpression sharing in that particular FPGA family and found it very unlikely that benefits could be obtained with a similar multiplier-block architecture. This is mainly due to the fact that it is one thing to generate the most useful common subexpressions and a different story to actually use them before the routing becomes congested.

T.Rissa

Hong Shan Neoh · Answer

Ken,

While the RSG solution may yield smaller designs for specific cases, the Altera FIR Compiler gives you more flexibility in terms of optimizing area vs. speed.

For instance, the numbers presented in the RSG datasheet are based on a pipeline=2 setting for the Altera FIR Compiler. Using the FIR Compiler, the design yields an fmax of 322MHz (single rate, single channel). This is much higher than the 154MHz cited for the filter using the RSG approach. This is the classic speed/area trade-off scenario.

If indeed area is the critical factor, it is possible to reduce the pipeline to 1 in the FIR Compiler. In the single rate cases, the logic cell count comparison would show that the RSG approach would be beneficial for the single and 2 channel FIR designs (58% and 80% respectively). In the 4 and 8 channel FIR designs, the distributed arithmetic approach employed by the Altera FIR Compiler yields better area compared to the RSG generated filter (106% and 133% respectively). Reducing the number of...

Ken · Answer

Hong,

Firstly, my apologies for the delay in replying.

I have inserted my responses to your points below:

Agreed - RSG only produces full-parallel filters. If multiple clocks per output can be used (depending on data rates and available clock/power consumption requirements of course) then I would be the first to say use Distributed Arithmetic (DA) in multi-cycle mode.

The figure of 154MHz is not in the datasheet you are referring to (available from ) - I do not know where you got this number from (I seem to remember it being in an old version [which did not include Altera results] but I removed it because the 154MHz was Xilinx specific for a particular filter on a particular device and has no bearing at all on Altera devices/filter implementations).

The test filters I did came in at various speeds between [244-283MHz] for RSG and [217-293MHz] for Altera FIR Compiler (on exactly the same device and with the same constraints). This is why I used pipeline level 2 for FIR Compiler because pipeline level 1...
