coregen CIC = terrible efficiency?

- C
- cpope
  

Sat, Jun 23, 2007 5:30 PM

I'm working with the Xilinx coregen CIC v3.0. I'm finding that to get decent rejection of the images (60 dB) I need about 4 stages. My input is only 10 bits and I still end up with a 66 bit output, 50 bits of which are thrown away. As a result my design won't fit in my device. That seems horribly inefficient to me, so I have some questions:

My coregen says it doesn't support V4 for the cic so I've been compiling for V2. Seems like the DSP48 with the large accumulator is ideal for CICs?
Looks like the exponential bit growth is from the number of stages. Since no one uses more than 16 bits at the output, why can't the output of the first integrator be trimmed back to 16 bits before feeding the next, and so on?
If the cic is just a box car filter wouldn't it be easier to implement as a single subtractor/accumulator whose inputs are the current sample and the sample delayed by R? At least for reasonable R (< 8192) seems like it should fit in block ram okay.

Thanks for any help, Clark

- J
- Jon Beniston
  

Mon, Jun 25, 2007 12:11 PM

If they are truly thrown away, they shouldn't be the reason your design doesn't fit into the device.

Why do you need DSP48s? Isn't the whole point of a CIC that it doesn't use multiplies?

Aren't all the integrators cascaded together, then followed by all the combs?

Cheers, Jon

- C
- comp.arch.fpga
  

Mon, Jun 25, 2007 3:04 PM

It's all here:

formatting link

If you sum up R values, you have a gain of R, independently of your implementation. If you do that k times, you have a gain of R^k.

The CIC implementation has exactly the same cost as the boxcar minus the RAM. So indeed, if you can afford the RAM you can use the boxcar.

Kolja Sulimma

- C
- cpope
  

Mon, Jun 25, 2007 4:31 PM


Thanks, I had a colleague forward me a similar article that included information on how to trim the bits between the integrator sections. I guess my point with the RAM is that in V4 the BRAM and DSP48 are designed to be efficiently integrated, so it might be possible to implement a cascade of N boxcar filters using just N DSP48/BRAM pairs rather than doing the very high bit-width integrators in the fabric. That should be lower power and run faster.

-Clark

- C
- cpope
  

Mon, Jun 25, 2007 4:37 PM


I throw them away at the output, but I'm not convinced that the compiler trims them all the way back through the integrators. In fact I'm pretty sure it doesn't, because I found a colleague who had to implement his own CIC that uses significantly fewer resources than the coregen block because he was able to trim the widths of the integrator sections.


I have them available. They should use less power and run faster than a 48 bit accumulator implemented in slices, right?


Yes. My point is that the widths of the integrators and combs don't seem to be optimized at all. For example, if I'm only using 16 bits at the output, why would the combs need to be more than, say, 16+N*2 wide? My coregen sets them at 66. Similarly, the first integrator should only need input width plus log2(R) bits, the second needs input width + 2*log2(R), and so on. And that's only if you really need full precision, which I suspect you don't. At any rate the coregen doesn't seem to employ any of these optimizations?

- R
- Ray Andraka
  

Wed, Jun 27, 2007 3:52 PM

The coregen doesn't support V4 simply because no one has re-written it to instantiate the DSP48s instead of LUTs. You should be able to use a V2 LUT version in V4 though, although it will not be all that fast because the carry chains in V4 are fairly slow compared to the DSP48's. The DSP48 limits the CIC width to 48 bits without going through some design gymnastics to cascade DSP48's. For a CIC, 48 bits isn't very much.

The CIC's response is a sinc function, which provides only about 13 dB attenuation of the first sidelobe for a first order filter. In order to increase the effectiveness of the filter, several sections are cascaded to increase the attenuation of the sidelobes. You typically need 4th or 5th order filters for practical applications.

The gain of the CIC filter is (N*R)^M where N is the delay in the comb section (usually 1), R is the decimation ratio, and M is the order of the filter. The width of the integrators must accommodate the input signal times the gain without overflow. Unfortunately, because the filter relies on differences of an integration, the integration has to be full precision so that rounding errors do not get accumulated in the integrators. The result is that the integrators cannot be truncated, and they have to have enough bits to represent the maximum input times the gain of the filter.

You can reduce the word width between the integrator and comb sections by simple truncation so that the comb section only has to be M bits wider than the output (and can taper down one bit per stage as it progresses to the output). Rounding isn't necessary if the width is reduced before the comb sections because the difference operation of the comb removes the bias anyway.
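As a rough check of the width arithmetic above, here is a minimal Python sketch. The function name `cic_register_width` and its parameter names are made up for illustration; it only assumes the full-precision rule stated above, i.e. input width plus the number of bits needed for the gain (N*R)^M.

```python
def cic_register_width(input_bits: int, R: int, M: int, N: int = 1) -> int:
    """Bits needed to hold (max input) * (N*R)**M at full precision.

    Illustrative helper, not part of any Xilinx tool.
    """
    gain = (N * R) ** M
    # ceil(log2(gain)) computed with integers to avoid float rounding
    growth_bits = (gain - 1).bit_length()
    return input_bits + growth_bits

# 10 bit input, 4th order, maximum decimation 2**14, comb delay 1
print(cic_register_width(10, 2**14, 4))  # 66
# Same filter limited to a maximum decimation of 64
print(cic_register_width(10, 64, 4))     # 34
```

The first call reproduces the 66 bit figure from the original question; the second shows how much the registers shrink if the maximum decimation ratio can be limited.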

A first order CIC is equivalent to an N*R stage boxcar filter. To get the response of a 5th order CIC, you'd have to cascade 5 such filters together, and you can't decimate until after the last stage. The adder and subtractor still have to be wide enough to accommodate the gain of the filter (N*R) at each stage. Like the CIC, you can reduce the width of the subtractor, but the integrator has to be wide enough to accommodate the max signal times N without overflow. You wind up with adders that are only log2(N) bits wider than the input if you truncate between stages instead of M*log2(N), but you also wind up with a memory for each stage as well.
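The equivalence between a first order CIC and a boxcar can be checked with a small Python model. This is only a behavioral sketch at the input rate (no decimation), with `L` standing for N*R; the function names and the list used as a delay line are illustrative stand-ins, not a hardware description.

```python
def boxcar(x, L):
    """Running sum of the last L samples: one accumulator plus an L-deep delay line."""
    acc = 0
    delay = [0] * L          # stands in for the RAM holding the last L samples
    out = []
    for i, s in enumerate(x):
        acc += s - delay[i % L]   # add the new sample, subtract the one L samples old
        delay[i % L] = s
        out.append(acc)
    return out

def cic_first_order(x, L):
    """Integrator followed by a comb with delay L, both at the input rate."""
    integ, acc = [], 0
    for s in x:
        acc += s
        integ.append(acc)
    return [v - (integ[i - L] if i >= L else 0) for i, v in enumerate(integ)]

x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]
print(boxcar(x, 4) == cic_first_order(x, 4))  # True
```

The difference is where the delay sits: the boxcar stores L input samples, while a decimating CIC moves the comb after the rate change so its delay shrinks to N samples.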

If your CIC is decimating, you can take advantage of the lower rate through the comb section by using bit or digit serial arithmetic, or by sharing the comb section among multiple channels.

- R
- Ray Andraka
  

Wed, Jun 27, 2007 4:13 PM

No, the integrators all have to have enough bits to hold the input times the gain. Keep in mind each one is keeping a running accumulation of its input, and the modular overflow has to be above the bits you are taking out of the bank. You cannot use fewer MSBs on the first integrators because the CIC takes advantage of the modular property of the 2's complement system and doing so puts the modular overflow in the bit field you are taking out.
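The modular behavior can be illustrated with a small Python model. This is a sketch under assumptions, not the coregen implementation: comb delay N = 1, signed 10 bit input, and every register wrapped to `width` bits to mimic two's complement hardware. The names `cic_decimate` and `to_signed` are invented for the example.

```python
def to_signed(v, width):
    """Interpret a width-bit pattern as a two's complement value."""
    return v - (1 << width) if v >= 1 << (width - 1) else v

def cic_decimate(x, R, M, width=None):
    """M-stage decimating CIC, comb delay 1.

    width=None uses unbounded integers as the reference; otherwise
    every register wraps modulo 2**width.
    """
    wrap = (lambda v: v & ((1 << width) - 1)) if width else (lambda v: v)
    integ = [0] * M      # integrator registers
    delayed = [0] * M    # comb delay registers
    out = []
    for n, sample in enumerate(x):
        v = sample
        for k in range(M):               # integrators run at the input rate
            integ[k] = wrap(integ[k] + v)
            v = integ[k]
        if n % R == R - 1:               # combs run at the decimated rate
            for k in range(M):
                v, delayed[k] = wrap(v - delayed[k]), v
            out.append(to_signed(v, width) if width else v)
    return out

x = [511] * 40 + [-512] * 40             # worst-case DC levels for 10 bit input
exact = cic_decimate(x, R=8, M=3)
print(cic_decimate(x, R=8, M=3, width=19) == exact)  # True
print(cic_decimate(x, R=8, M=3, width=16) == exact)  # False
```

With 10 + 3*log2(8) = 19 bits the integrators overflow repeatedly but the output is still exact, because the wraparound cancels in the combs. Dropping MSBs to 16 bits puts the wraparound inside the output field and the result is wrong.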

The 66 bit width that coregen is using should be a result of the gain calculated for the maximum decimation ratio and order of your filter. For a 4th order filter with 10 bits input, that means it is set up for decimation ratios up to 2^14. If you can reduce the maximum decimation ratio, it should reduce the width of the integrators.

Yes, implementing using the DSP48s is faster than using straight fabric adders. Most CIC's need more than 48 bits, so you have to cascade them. The comb section isn't as bad because you can reduce the width before the comb.

- C
- cpope
  

Wed, Jun 27, 2007 7:11 PM

I'm not sure that's true. At least this reference (and a colleague's optimized design) implies they can be trimmed:

formatting link

see 11.4.2

-Clark

- C
- cpope
  

Wed, Jun 27, 2007 7:19 PM


It is for a boxcar filter though, right? If I just implement a straight boxcar I need log2(R)+B+1 bits in the accumulator. The max log2(R) in coregen is 14, so that leaves data sizes up to 34 bits.

I think the integrators can be trimmed (see other response). Agree about the comb, but it doesn't seem to me that Xilinx is doing that optimization. In fact I don't know that it could for a CIC that supports multiple decimation rates.

Yes, but I have plenty of BRAMs and DSP48s; I'm short on slices. Plus I want to minimize power consumption.

I'm not convinced Xilinx even does this optimization.

- R
- Ray Andraka
  

Wed, Jun 27, 2007 7:31 PM

I assumed the OP is asking about a decimating CIC, in which case my assertion is absolutely true. The article you cited also says the same thing: "In a down-sampling filter the growth appears immediately in the first integrator stage, and all subsequent integrators and comb filters must honor the most significant bit of the first integrator stage."

It is possible he was asking about an interpolating (up-sampling) CIC, in which case the math is a little bit different, and as a result there are differences in how pruning can be applied.

- C
- cpope
  

Wed, Jun 27, 2007 7:54 PM

I'm the OP. That article says the LSBs can be trimmed in the decimating case, right?

- R
- Ray Andraka
  

Wed, Jun 27, 2007 10:13 PM

In the comb section, and in certain cases on the last integrator sections at the price of increased noise.

The variable R CIC can still drop LSBs between the integrator and comb sections by adding a shifter controlled by the decimation ratio between the stages so that you always keep the N most significant bits for that decimation ratio.
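One way to picture the shifter setting is the following Python sketch. It assumes the full-precision width rule discussed earlier in the thread (input width plus the bits needed for the gain (N*R)^M); the function `output_shift` and the parameter `out_bits` (the number of most significant bits kept) are hypothetical names for illustration.

```python
def output_shift(input_bits: int, R: int, M: int, out_bits: int, N: int = 1) -> int:
    """Right-shift that keeps the out_bits most significant bits for this R.

    Illustrative only: the shift is counted from the LSB of the
    integrator output and would be looked up from the programmed R.
    """
    growth_bits = ((N * R) ** M - 1).bit_length()   # ceil(log2 of the gain)
    used_bits = input_bits + growth_bits            # bits that can carry signal
    return max(0, used_bits - out_bits)

# 10 bit input, 4th order, 16 bit output
print(output_shift(10, 2**14, 4, 16))  # 50
print(output_shift(10, 16, 4, 16))     # 10
```

At the maximum ratio of 2^14 the shift is 50, matching the 50 discarded bits mentioned in the original question; at R = 16 only 10 LSBs are dropped, so the comb section can stay at a fixed, narrow width across decimation ratios.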