Bit error counter - how to make it faster

- G
- gamer
  
  Contact options for registered users
posted
16 years ago

Wed, Jun 27, 2007 4:34 PM

My goal is to implement a bit-error counter targeting 1GHz. The datawidth is parametrizable. I started off this way,

Verilog code:

---------- assign mismatch[datawidth-1:0] = input_data ^ expected_data; assign matched = ~( | mismatch); // reduction NOR

always @(posedge clk or posedge reset) begin if (reset) error_count = 0; else if (~matched) for (i=0; i

- G
- Günther Jehle
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Wed, Jun 27, 2007 4:46 PM

Spread your algorithm to work over more than one clock cycle. Use a pipeline stage over eigth clocks. The first stage calculates bit 0 to 7, the second calculates 8 to 15 and so on. In an additional stage you sum up all results. The result is always 9 clocks behind.

See

formatting link

for a general explanation of pipelines.

hth Günther Jehle

- P
- Paul A. Clayton
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Wed, Jun 27, 2007 10:48 PM

I know no verilog and am not even a C programmer, but ISTM that the summation could be highly parallel by using a tree of adders (perhaps using carry save to further accelerate the summation). Consider it like computing the population count followed by an ordinary addition: one can add together the even and odd mismatch bits in parallel, then add pairs of these sums, etc., finally adding the sum to error_count.

(BTW, if I understand correctly one should not need the "reduction NOR" and the "if (~matched)" since if the popcount is zero, adding it to error_count would effectively be a no-op, assuming worst-case speed is the only consideration. As Gunther Jehle pointed out pipelining could provide higher throughput.)

Paul A. Clayton just a technophile

- B
- Ben Jackson
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Thu, Jun 28, 2007 7:21 AM

Although from a software (and possibly a simulation) perspective, that looks like it should reduce the workload by computing 'matched' and "avoiding" the loop if it is not needed, in hardware your speed is going to be limited by the worst case (and the additions are happening all the time, only the enables on the error_count register are affected by your 'if').

If you must have your answer on the very next clock, your options are severely limited. If you only care about the answer occasionally (or with a fixed latency that's greater than one clock), think about how you can take advantage of that.

To follow along your original lines of thinking: If you can really find 'matched' at your desired clockrate, and it is infrequent, then you could stuff 'mismatch' into a fifo whenever it's nonzero and add it in a separate loop which only has to keep up with the worst case bit error rate.

Or, if, as you said, the process is fast enough for 1 bit, make yourself 'datawidth' 1-bit error counters and add their results whenever you want the total number of errors.

--
Ben Jackson AD7GD

http://www.ben.com/

- T
- Terje Mathisen
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Thu, Jun 28, 2007 1:40 PM

Remember this, it is a key to a possible solution!

You need to parallelize your error counter, particularly for very wide (256-bit?) paths:

A few layers of full adders to reduce the 256 input lines into a more reasonable number, then one or more carry-save accumulators to gather these into something that can be handled very quickly.

Only when reading out the number of errors do you propagate the carries to generate the final/real counts.

Terje

--
- 
"almost all programming can be viewed as an exercise in caching"

- C
- comp.arch.fpga
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Thu, Jun 28, 2007 3:12 PM

In Virtex-5 the 6-LUTs point to a 6-to-3 reduction instead of full adders. You can also use BRAMs for an 11 to 4 reduction.

512 bits at 500MHz might be easier than 256 bits at 1GHz.

Kolja Sulimma

- P
- Petter Gustad
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Thu, Jun 28, 2007 7:51 PM

How much resources do you have available and how often do you want to process the counters?

If you have lots of registers available and just want to sample the counters at some point in time (e.g. after one hour of operation) and present the result on a PC or whatever you could:

Use datawidth X linear feedback shift registers (of sequence length error count, or some more optimal length for the LFSR sequence) as counters. This is only a single XOR delay between two registers. Then sample all the counters. Transfer the results to your PC. Convert the LFSR sequences and add the values before you present the result.

However, you can't do this if you have to continuously present the result. You haven't provided enough details about this in your specification.

Petter

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

- A
- Andy Freeman
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Thu, Jun 28, 2007 9:31 PM

How is the input data created? If the register values come from bit set/clear operations, you can keep track of the number of ones with a separate counter that is updated appropriately when the register is updated.

For example, suppose that you'd like to know the number of ones in a shift register. If the input bit and the bit shifted off (and thus removed from the register) are the same, the separate counter is unchanged. If the input is 1 and the bit shifted off is 0, the counter is incremented. If the input is 0 and the bit shifted off is

1, the counter is decremented. When the register as a whole is cleared, the counter is also cleared.

It's easy enough to extend this to any update scheme where the number of bits that can change each cycle is small, even if those can be in arbitrary positions.

- Y
- YUAN, Nan
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sat, Jun 30, 2007 4:00 AM

I guess the critical path of timming is summation chain. There're serveral binary tricks to get number of bit "1" in a word, take C code of 32-bit word for example, there's only 5 summations instead of 32.

unsigned int bitcount32(unsigned int x) { x = (x & 0x55555555UL) + ((x >> 1) & 0x55555555UL); // 0-2 in 2 bits x = (x & 0x33333333UL) + ((x >> 2) & 0x33333333UL); // 0-4 in 4 bits #if 1 // Version 1 x = (x & 0x0f0f0f0fUL) + ((x >> 4) & 0x0f0f0f0fUL); // 0-8 in 8 bits x = (x & 0x00ff00ffUL) + ((x >> 8) & 0x00ff00ffUL); // 0-16 in 16 bits x = (x & 0x0000ffffUL) + ((x >> 16) & 0x0000ffffUL); // 0-31 in 32 bits return x; #else // Version 2 x = (x + (x >> 4)) & 0x0f0f0f0fUL; // 0-8 in 4 bits x += x >> 8; // 0-16 in 8 bits x += x >> 16; // 0-32 in 8 bits return x & 0xff; #endif }

- G
- glen herrmannsfeldt
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Mon, Jul 9, 2007 7:13 PM

(snip)

But note that you do a lot more arithmetic than is needed. This takes advantage of the availability of a 32 bit ALU. You do five 32 bit additions with carry, and four 32 bit AND operations.

The hardware implementation using carry save adders is much more efficient, in terms of both logic and speed.

11 full adders for the first level, eight for the second level, five for the third level, four for the fourth level, three for the fifth level, three for the sixth level, two each for the seventh and eighth level, and one final OR gate. If I counted right. No carry logic is needed.

Some can be half adders for ASIC, and there might be some other changes to make optimal use of FPGA LUTs.

-- glen

- G
- glen herrmannsfeldt
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Mon, Jul 9, 2007 7:28 PM

That is an interesting idea...

Last time I did it was in the days of 4-LUTs. Also, I didn't need all the high bits, as I only needed 0, 1, 2, 3, 4 or more as output values.

Or pipeline the reduction stages. You have to be able to go through at least one LUT in a clock cycle, anyway, and the FFs are conveniently positioned after the LUTs.

-- glen