Sometimes email/nntp/forums make these technical discussions difficult. As some background - I've written many CRC designs in hardware, for both ASICs and FPGAs. Every one of them uses the "plain-vanilla" method described in many of the CRC papers. The only twist, as I've tried to describe, is putting the single-shift implementation in a for loop, to allow N bits to be calculated in a single clock cycle.
This "twist" is shockingly simple to write in HDLs. The core kernel of a (generic) CRC - shifting through N bits at a time - is really just about 10 (simple) lines of code. No online generators needed that spit out gobs of (unreadable) XOR gates.

The "table" methods we've been discussing are all solutions aimed at software - software where it's difficult to do (many) parallel XORs of arbitrary bits. But that's something hardware excels at.
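For anyone who wants to play along in software, here's a minimal C model of that kernel. (This is my own illustration, not the HDL itself; the names `crc32_serial` and `POLY` are mine.) The inner for loop is the single-shift kernel; in HDL, synthesis unrolls that loop into N levels of XOR logic, so all N bits are consumed in one clock cycle.

```c
#include <stdint.h>
#include <stddef.h>

#define POLY 0x04C11DB7u  /* the CRC-32 polynomial used in the MPEG-2 spec */

/* Software model of the "plain-vanilla" kernel: one conditional
   shift-and-XOR per input bit, MSB first.  In hardware the inner
   loop is unrolled by the synthesis tool into pure combinational
   logic, so N bits are processed per clock cycle. */
uint32_t crc32_serial(uint32_t crc, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            uint32_t fb = (crc >> 31) ^ ((data[i] >> b) & 1u);
            crc <<= 1;
            if (fb & 1u)
                crc ^= POLY;
        }
    }
    return crc;
}
```

Starting from INIT = 0xFFFFFFFF (with no reflection and no final XOR), this reproduces the published CRC-32/MPEG-2 check value 0x0376E6E7 for the ASCII string "123456789".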
Taking those software solutions and retargeting them back at hardware is, in the best case, overcomplicating something very simple; in the worst case, it's quite costly.
I've never written a table method for calculating CRCs (because, as I've asserted, it's pointless for hardware), but I quite understand the underlying concepts and what they're doing.
Some numbers to back things up. I have a generic CRC module that's used all over the place. I synthesized various versions of a 32-bit CRC in Vivado (the one used in the MPEG-2 spec, with POLY = 0x04C11DB7).
I targeted something high-speed to make sure the tools actually attempt at least some optimization: a 2 ns clock (500 MHz) in a Virtex-7 device.
Test 1: 1 bit at a time
  CLB LUTs:         151
  CLB Registers:    64
  Worst-case slack: 0.754 ns

You'll first note that I use twice as many registers as seems necessary. This is because, the way it's been coded, it stores both the "augmented" and "non-augmented" CRC in FFs.

Test 2: 8 bits at a time
  CLB LUTs:         172
  CLB Registers:    64
  Worst-case slack: 0.270 ns

Notice, not much delta in LUT usage. One thing to note - the baseline may seem high. After all, for a single-bit shift you only need one XOR gate per high POLY bit, and the augmentation adds more complexity too. But in the end, the "housekeeping" overhead, if you will, is probably what's dominating the logic here: things like loading the INIT value, holding the current value when no new data is available, etc. Plus, a LUT6 is under-utilized doing just a 2-bit XOR.
Test 3: 32 bits at a time
  CLB LUTs:         259
  CLB Registers:    64
  Worst-case slack: 0.332 ns

A few more LUTs, but paradoxically, the timing's better. Decided to go for broke, and really up the DATA WIDTH:
Test 4: 512 bits at a time
  CLB LUTs:         1734
  CLB Registers:    64
  Worst-case slack: 0.188 ns

So even with a very large datapath, things are fine.
Adding a "software" LUT method, at its very baseline, adds 1 BRAM block to the solution - 36 Kbits - a very large resource, IMHO. I would argue that a software-style solution aimed at the Test 3 case above (32 bits at a time) would likely still consume a significant number of CLB LUTs, just to handle the "housekeeping" and the further logic around the lookup table. I'd also argue that it would very likely run slower too, as BRAMs are in general slower than logic.
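For concreteness, here's a C sketch of the byte-table method under discussion (my own illustration; the function and table names are mine). The 256 x 32-bit table is what the BRAM would have to hold for a byte-at-a-time lookup:

```c
#include <stdint.h>
#include <stddef.h>

#define POLY 0x04C11DB7u  /* the CRC-32 polynomial used in the MPEG-2 spec */

/* 256-entry lookup table: 256 x 32 bits of the BRAM would hold
   exactly this content for a byte-at-a-time CRC. */
static uint32_t crc_table[256];

void crc32_table_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t r = i << 24;
        for (int b = 0; b < 8; b++)
            r = (r & 0x80000000u) ? (r << 1) ^ POLY : (r << 1);
        crc_table[i] = r;
    }
}

/* One table read replaces eight single-bit shift-and-XOR steps. */
uint32_t crc32_table(uint32_t crc, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        crc = (crc << 8) ^ crc_table[((crc >> 24) ^ data[i]) & 0xFFu];
    return crc;
}
```

With INIT = 0xFFFFFFFF this produces the same result as the bit-serial kernel - which is the point: in hardware, the table buys you nothing that unrolling the shift loop doesn't already give you.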
Well, don't use a BRAM, you might say - use the CLB logic. I'd argue that this solution, to the tools, is the same thing as my 10-line "plain-vanilla" version, just overcomplicated and unclear, and it would likely take the tools longer to reach the same solution (if at all).
Those software implementations are quite interesting methods, and I quite enjoy reading up on the clever tricks being performed. But none of that is necessary in hardware, where you have free access to any bit and can freely perform many operations in parallel.
Regards,
Mark