Optimizing internet checksum on SH-4

Hello,

I'm working with a 450-MHz SH-4 core. The architecture reminds me of the original Pentium in several ways: it is dual-pipeline, with some restrictions on which types of instructions can be paired.

When the system is receiving a 100-Mbps TCP stream (and thus, at the same time, also sending roughly 2 Mbps of ACKs), approximately 13-15% of the CPU is busy computing the internet checksum (C code compiled with gcc-4.3.4 -Os).

This algorithm has several nice properties: e.g. it can be computed ignoring endianness, and if one is careful with the carry, it can be computed 32 bits (or 64 bits on a 64-bitter) at a time.
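Roughly, the idea in portable C (a sketch, not the exact code I'm running; the function name is just for illustration):

#include <stdint.h>
#include <stddef.h>

/* Rough sketch: add the data 32 bits at a time into a 64-bit
 * accumulator, then fold the carries back down to 16 bits.
 * Assumes buf is 32-bit aligned and len is a multiple of 4. */
static uint16_t cksum32(const void *buf, size_t len)
{
        const uint32_t *p = buf;
        uint64_t sum = 0;

        for (size_t i = 0; i < len / 4; i++)
                sum += p[i];

        while (sum >> 16)                       /* end-around carry fold */
                sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;
}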

So, I was trying to write some Q&D assembly to see if I could do better than gcc, using addc (add with carry).

The gist of my pseudo-code was

r4 = address of the data to checksum
r5 = count of words in the data (assume > 0)
t  = the carry bit (don't know why it's called "t")

clrt            ; clear T

Reply to
Noob

Why not? Unrolling this loop will give the best results. Even unrolling by a factor of 8 should make the carry overhead insignificant, while only increasing code size by a small amount.
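In C, with a 32-bit accumulator standing in for the carry chain, an unroll-by-8 inner loop could look roughly like this (a sketch, not tested on SH-4; the name is made up):

#include <stdint.h>
#include <stddef.h>

/* Sketch: eight 16-bit words per iteration; a 32-bit accumulator is
 * wide enough that no carry can be lost for ordinary packet sizes,
 * so the fold to 16 bits only has to happen once, at the end. */
static uint32_t sum_words_x8(const uint16_t *p, size_t nwords)
{
        uint32_t sum = 0;

        while (nwords >= 8) {
                sum += p[0]; sum += p[1]; sum += p[2]; sum += p[3];
                sum += p[4]; sum += p[5]; sum += p[6]; sum += p[7];
                p += 8;
                nwords -= 8;
        }
        while (nwords--)                /* leftover words */
                sum += *p++;

        return sum;                     /* caller folds this to 16 bits */
}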

Reply to
Arlet Ottens

Arlet, you're absolutely correct here.

This particular algorithm matches up perfectly with the original x86 instruction set, where INC and DEC avoid updating the carry flag, so you can use DEC / JNZ (or even LOOP) to do the loop control while the carry flag survives from round to round.

On any architecture without this particular bug/feature/glitch, the best you can do is almost certainly to unroll by a small amount and then keep a separate carry accumulator, like your r0 above, for use at the end.
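In C, that separate-accumulator trick amounts to counting the wraparounds explicitly, something like this sketch (32-bit accesses assumed; the function name is illustrative):

#include <stdint.h>
#include <stddef.h>

/* Sketch: add 32-bit words, count every carry-out in a separate
 * accumulator, and add the carries back in at the end (2^32 == 1
 * modulo 65535, so each lost carry is worth exactly 1). */
static uint16_t cksum_carry_acc(const uint32_t *p, size_t ndwords)
{
        uint32_t sum = 0, carries = 0;

        for (size_t i = 0; i < ndwords; i++) {
                sum += p[i];
                carries += (sum < p[i]);        /* 1 iff the add wrapped */
        }

        uint64_t total = (uint64_t)sum + carries;
        while (total >> 16)
                total = (total & 0xffff) + (total >> 16);
        return (uint16_t)~total;
}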

On machines without carry flags, or when working with SIMD registers, you can do the same using explicit parallel compare instructions, then aggregating the flag values resulting from those compares.

I.e. the following loop should handle 32 bytes per iteration, using 10 SSE instructions, so 3+ bytes/cycle without having to dual-issue the SIMD opcodes:

(The main tweak is the need to bias all input values, from [0..64K> to [-32K..32K> so that signed compares will work.)

next32:
        pxor    xmm7,[esi]      ; Grab next 8 words, invert sign bits
        pxor    xmm6,[esi+16]   ; and another 16
        lea     esi,[esi+32]
        paddw   xmm0,xmm7
        pcmpgtw xmm7,xmm0       ; Set flag if src > sum
        paddw   xmm0,xmm6
        pcmpgtw xmm6,xmm0       ; ditto
        psubw   xmm1,xmm7       ; Subtract -1 flag -> increment
        movdqa  xmm7,xmm2       ; Load sign bits
        psubw   xmm1,xmm6
        movdqa  xmm6,xmm2
        sub     ecx,1
        jnz     next32

One alternative approach is to use shift & mask to split each 16-byte block into two halves and aggregate them separately; this works for packet sizes up to 128 KB, which is well above what TCP will allow:

next16:
        movdqa  xmm7,[esi]
        lea     esi,[esi+16]
        movdqa  xmm6,xmm2       ; 4 x 0x0000ffff masks
        pand    xmm6,xmm7
        psrld   xmm7,16
        paddd   xmm0,xmm6
        paddd   xmm0,xmm7
        sub     ecx,1
        jnz     next16

Notice that this approach needs one additional SIMD instruction per 16-byte block.
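The same split is easy to write in plain C, which may make the idea clearer (a sketch; the loop needs no carry handling at all for any size TCP will hand you):

#include <stdint.h>
#include <stddef.h>

/* Sketch of the shift & mask idea: accumulate the low and high 16-bit
 * halves of each 32-bit word separately, so neither 32-bit accumulator
 * can overflow for any realistic packet size. */
static uint16_t cksum_split(const uint32_t *p, size_t ndwords)
{
        uint32_t lo = 0, hi = 0;

        for (size_t i = 0; i < ndwords; i++) {
                lo += p[i] & 0xffff;
                hi += p[i] >> 16;
        }

        uint64_t sum = (uint64_t)lo + hi;       /* 2^16 == 1 mod 65535, so the
                                                   high halves fold in unshifted */
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
}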

Terje

--
"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

Here is how I did it on Power 6-7 years ago:

formatting link
(or cksm3456.txt for the native code list).

Here is the critical part:

* deal with the bulk of the data using .l accesses
        lsr.l+   #2,d2,d3                   number of .l words
        move.l-  d3,ctr                     setup counter
        rlwinm-  d2,#0,#32-2,#31,d2         number of bytes after last .l
        neg.l-   d2,d1                      residue in next segment
        rlwinm-  d1,#0,#32-2,#31,d1         must be 0-3
        beq      nulllngc                   if none .l to process
        lea      (-4,a0),a0                 to use preincrement
        set      loop,*
        adde.l-  =(4,a0),d0                 collect checksum like a champion
        brc      ?unc.dbnz,loop             all of the .l
        lea      (4,a0),a0                  will use postincrement now
nulllngc

Never thought of this as something consuming many resources, and I don't think it really does. I have never measured it (not seeing an issue there); in a loop that tight the only stalls can come from data dependencies. They probably do kick in, but wrestling with them would be too major an effort, with doubtful results for 1500 bytes (setup, register save/restore, etc.).

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

formatting link

------------------------------------------------------

formatting link

Reply to
dp

Another way is to accumulate both the data, and separately the data shifted right by some bits (say 16 in your case). Since you add at most 65536 items, that second accumulator differs from the top 32 bits of a true 48-bit accumulator by less than 65536, so you can reconstruct it. In pseudo-C:

u32 acc_hi = 0;
u32 acc_lo = 0;
for (j = 0; j < N; j++) {
        acc_lo += data[j];
        acc_hi += data[j] >> 16;
}
acc_hi += ((acc_lo >> 16) - acc_hi) & 0xffff;

// now acc_hi is the top 32 bits of the 48-bit accumulator, and acc_lo
// is the low 32 bits

This saves two instructions from your SSE loop, but it's much better on most RISC.
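To finish it off, the reconstructed accumulator folds down to the 16-bit result in the usual way; a sketch continuing the pseudo-C above (assuming u16/u64 typedefs in the same style; acc_hi now holds bits 16..47 of the 48-bit sum, the low half of acc_lo holds bits 0..15):

u64 total = ((u64)acc_hi << 16) | (acc_lo & 0xffff);
while (total >> 16)
        total = (total & 0xffff) + (total >> 16);   // end-around carry fold
u16 cksum = ~total;                                 // final internet checksum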

Reply to
segher

... assuming, of course, that you are operating over a buffer having a size that is an *exact* multiple of 4.

Are you sure the decrement operation touches the carry? Often, CY is unaffected while Z/NZ is signaled instead (by contrast, you SUBTRACT 1 when you *really* want CY affected).

Why? Are you that pressed for space? Checksums are a *huge* bit of overhead *if* (!) you need to support them! Think hard on whether you want to skimp on space here instead of someplace less critical (i.e., code that isn't executed as frequently).

Depending on the cache available in the processor, I typically unroll the loop into different sized pieces and process N, 2N, 4N, ... "units" at a time (this also takes care of the *actual* size of the dataset instead of requiring it to be an exact multiple of 32 bits)

E.g., (incredibly bogus pseudo-code):

while (remaining >= 4N) {
    ADD *ptr++    // 1
    ADC *ptr++    // 2
    ...
    ADC *ptr++    // 4N
    ADC 0         // fold in final CY
    remaining -= 4N
}
while (remaining >= 2N) {
    ADD *ptr++    // 1
    ADC *ptr++    // 2
    ...
    ADC *ptr++    // 2N
    ADC 0         // fold in final CY
    remaining -= 2N
}
while (remaining > N) {
    ADD *ptr++    // 1
    ADC *ptr++    // 2
    ...
    ADC *ptr++    // N
    ADC 0         // fold in final CY
    remaining -= N
}

If the CPU has alignment restrictions, you need to handle those up front to avoid faults, etc. If you exercise some care in how you fabricate your mbufs, you can sometimes skip this.

If you really want to cut down on TEXT, you can represent "N" as "chunksize" and halve it each time a "while" fails (then use a switch to skip over a certain number of ADC's based on the value of chunksize). Something like:

chunksize = 4N
while (chunksize > 1) {
    ADD *ptr
    while (remaining > chunksize) {
        switch (chunksize) {
        case 4N:
            ADC *ptr++    // 4N
            ...
            ADC *ptr++    // 2N+1
            // fall thru
        case 2N:
            ADC *ptr++    // 2N
            ...
            ADC *ptr++    // N+1
            // fall thru
        case N:
            ADC *ptr++    // N
            ...
            ADC *ptr++    // 1
        }
        ADC 0             // fold in final CY
    }
    remaining -= chunksize
    chunksize /= 2
}

[ignore the discrepancies in my counting, I'm just trying to illustrate a general approach :> ]
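In actual C, a rough sketch of the same idea (with a 64-bit accumulator standing in for the ADC carry chain, and 16/8/4/2/1-word chunk sizes; the function name is made up):

#include <stdint.h>
#include <stddef.h>

/* Sketch of the chunk-halving approach: one fall-through switch does
 * the unrolled adds for whichever chunk size still fits, and the chunk
 * size is halved whenever the remaining data gets smaller than it. */
static uint16_t cksum_chunked(const uint16_t *p, size_t nwords)
{
        uint64_t sum = 0;
        size_t chunk = 16;                      /* largest unroll factor */

        while (chunk > 0) {
                while (nwords >= chunk) {
                        switch (chunk) {
                        case 16:
                                sum += p[15]; sum += p[14]; sum += p[13]; sum += p[12];
                                sum += p[11]; sum += p[10]; sum += p[9];  sum += p[8];
                                /* fall through */
                        case 8:
                                sum += p[7];  sum += p[6];  sum += p[5];  sum += p[4];
                                /* fall through */
                        case 4:
                                sum += p[3];  sum += p[2];
                                /* fall through */
                        case 2:
                                sum += p[1];
                                /* fall through */
                        case 1:
                                sum += p[0];
                        }
                        p += chunk;
                        nwords -= chunk;
                }
                chunk /= 2;
        }

        while (sum >> 16)                       /* fold the carries back in */
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
}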

Make sure you consider when and why you are computing the checksums! Depending on your application and the facilities that your stack provides, you might consider a more "undisciplined" API for all of this. I.e., if you can *omit* the computation, you save even bigger!

[Layering can be A Bad Thing]

E.g., you might want to restrict the portion of the packet/buffer on which you perform this computation. I.e., you may opt NOT to check some of the payload (depends on your application) -- this is becoming common in UPnP type applications, for example.

You *also* might want to make some of this data *visible* to your application! E.g., if your application is going to checksum some larger "transmission unit", then it could possibly benefit from the computations that you are performing here. This is the flipside of letting each encapsulation layer IN THE STACK benefit from computations made in other layers (i.e., incrementally updating the sums).

Alternatively, if your application is going to make some particular change to the packet and then return/forward it, you may be able to benefit from these previous computations.
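For example, when only a single 16-bit field of an already-checksummed packet changes, the sum can be adjusted incrementally rather than recomputed; a sketch along the lines of RFC 1624 (HC' = ~(~HC + ~m + m')):

#include <stdint.h>

/* Sketch: incrementally update an internet checksum when one 16-bit
 * field covered by it changes from old_val to new_val, all in
 * ones-complement arithmetic. */
static uint16_t cksum_update16(uint16_t cksum, uint16_t old_val, uint16_t new_val)
{
        uint32_t sum = (uint16_t)~cksum;        /* back to the ones-complement sum */
        sum += (uint16_t)~old_val;              /* subtract the old value */
        sum += new_val;                         /* add the new value */
        while (sum >> 16)                       /* fold carries */
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
}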

Reconsider whether a strict TCP implementation is necessary or if you can get what you want from other protocols -- even if you have to add some bit of "acknowledgement" in the application layer.

Lastly, look at what you are going to *do* in the event of a failed checksum -- if *all* you are EFFECTIVELY doing is updating some counters/statistics, then REconsider whether you actually need to invest this effort.

Reply to
Don Y

Grrrr... s.b. "4 octets"! :<

Reply to
Don Y
