Fast bit-reverse on an x86?

Does the incoming data have to be available on an immediate byte-for-byte basis? If not, would it be better to accumulate some number of bytes (16 to 512, say), then bit-reverse the buffer with a tight loop and a lookup table? This might reduce cache-miss problems during the bit reversal.

If the bytes are arriving with strict timing intervals, you could set things up so that input bytes are stuffed into one buffer, while bytes are fetched from another for processing. That would result in a timing delay, but no jitter.

If you're recording the data to a file, just bit-reverse the buffer prior to the fwrite().
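
A minimal C sketch of that buffered approach (untested; the table and function names here are mine, not anything from the original code):

#include <stdio.h>
#include <stddef.h>

static unsigned char rev_table[256];             /* rev_table[i] = i with its bits reversed */

static void init_rev_table(void)
{
    for (int i = 0; i < 256; i++) {
        unsigned char r = 0;
        for (int b = 0; b < 8; b++)
            if (i & (1 << b))
                r |= (unsigned char)(0x80 >> b);
        rev_table[i] = r;
    }
}

/* Bit-reverse every byte of the accumulated buffer in place, then write it out. */
static size_t write_reversed(unsigned char *buf, size_t n, FILE *fp)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = rev_table[buf[i]];
    return fwrite(buf, 1, n, fp);
}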

I must say that worrying about "fast" on systems with large caches and an unknown number of interrupts to interfere with cache management is a matter of theoretical curiosity for me. I have no real experience with assembly-language programming on x86 CPUs with on-chip or off-chip cache.

Mark Borgerson

Reply to
Mark Borgerson

No. It would be interesting to see a solution based on pshufb - if you are offering.

I was thinking of things like

  A 16-entry table (instead of a 256-entry one) used twice
  Paired shift lefts and shift rights
  Swapping half-bytes then bit pairs then bits
  Methods in
formatting link
  MMX/SSE instructions

The simple table lookup still seems best - in so many ways.
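
For concreteness, the 16-entry-table-used-twice idea from the list above might look like this in C (a sketch only, not tested code):

/* rev_nibble[n] is n with its four bits reversed. */
static const unsigned char rev_nibble[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

static unsigned char reverse_byte(unsigned char b)
{
    /* Reverse each nibble via the table, then swap the two nibbles. */
    return (unsigned char)((rev_nibble[b & 0x0F] << 4) | rev_nibble[b >> 4]);
}

Each byte costs two table reads, a shift and an OR; the saving is that the table is only 16 bytes.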

Not sure I understand the question.

Just a word on caches that others have objected to:

  1. Size - Even the original Pentium processor had an 8k data cache. Modern machines have much more caching. A 256-byte lookup table would not normally be a problem.
  2. Populating the cache - Also not a problem. If a cache-line's worth of the lookup table is used only once it adds less than a microsecond to the run time of the program. If it's used repeatedly then subsequent reads come from cache.

James

Reply to
James Harris

|No. It would be interesting to see a solution based on pshufb - if you
|are offering.
|
|I was thinking of things like
|
| A 16-entry table (instead of a 256-entry one) used twice
| Paired shift lefts and shift rights
| Swapping half-bytes then bit pairs then bits
| Methods in

formatting link
| MMX/SSE instructions
|
|The simple table lookup still seems best - in so many ways.

Yes, I can confirm this from practice. A faster method would have to be a hardware solution.

|Not sure I understand the question.

Yeah, what's "regular" today? Perhaps 'not SSE3/4a/5b/6c..'? :) I'll interpret it as 'without SSE/3DNow! at all'. A regular loop ("SHL-SHR per bit") variant would be at least four times slower for single-byte conversion and about 12 times slower for 32-bit.
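
For reference, the shift-one-bit-per-iteration variant being described looks roughly like this in C (a sketch, just to show why it is slow - eight dependent steps per byte):

static unsigned char reverse_byte_loop(unsigned char b)
{
    unsigned char r = 0;
    for (int i = 0; i < 8; i++) {
        r = (unsigned char)((r << 1) | (b & 1));   /* shift result left, pull in low bit */
        b >>= 1;                                   /* consume one source bit             */
    }
    return r;
}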

|Just a word on caches that others have objected to:
|
|1. Size - Even the original Pentium processor had an 8k data cache.
|Modern machines have much more caching. A 256-byte lookup table would
|not normally be a problem.
|
|2. Populating the cache - Also not a problem. If a cache-line's worth
|of the lookup table is used only once it adds less than a microsecond
|to the run time of the program. If it's used repeatedly then
|subsequent reads come from cache.

Exact, except (you're right: it's less than a microsecond) I see just a few "nano"-seconds for four cache-burst reads (64-byte chunks) on a 3 GHz CPU with 1600 MHz RAM. Each of my four cores got 6 MB L3 cache, 512 KB L2 and 2*64K data/code on L1, so even a 64K LUT would fit in well.

__ wolfgang

Reply to
wolfgang kern

...

Yes, I was being a bit vague on uncached DRAM timing due to ignorance. I was thinking of a 60 ns value but, being unsure of what all the detailed DRAM timings were for, I figured "less than a microsecond" would cover it. With a 32-byte data cache line size I guess the entire 256-byte table (eight cache lines) could be loaded in about a microsecond - which makes the point even more forcefully.

James

Reply to
James Harris

Yes, like (Warning! untested code follows):

movdqa xmm1, magic1    ; Assume 16 bytes to be reversed are in xmm0
pand xmm1, xmm0        ; xmm1 has high nibbles
pxor xmm0, xmm1        ; xmm0 has low nibbles
psrld xmm1, 4          ; Shift high nibbles to low position
movdqa xmm2, magic2    ; High reversed LUT
pshufb xmm2, xmm0      ; xmm2 has high nibbles of reversed bytes
movdqa xmm0, magic3    ; Low reversed LUT
pshufb xmm0, xmm1      ; xmm0 has low nibbles of reversed bytes
por xmm0, xmm2         ; Now xmm0 has the 16 reversed bytes

.data
.align 16
magic1 db 0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h
       db 0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h
magic2 db  0h, 80h, 40h,0c0h, 20h,0a0h, 60h,0e0h
       db 10h, 90h, 50h,0d0h, 30h,0b0h, 70h,0f0h

magic3 db  0h,  8h,  4h, 0ch,  2h, 0ah,  6h, 0eh
       db  1h,  9h,  5h, 0dh,  3h, 0bh,  7h, 0fh
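
For anyone who would rather read intrinsics than raw mnemonics, here is a C rendering of the same pshufb idea (equally untested; the function name and the SSSE3 compile flag are assumptions on my part):

#include <tmmintrin.h>   /* SSSE3 intrinsics; build with -mssse3 or similar */

/* Reverse the bits of each of the 16 bytes in v, same scheme as the asm above. */
static __m128i reverse_bytes_ssse3(__m128i v)
{
    const __m128i low_mask = _mm_set1_epi8(0x0F);
    const __m128i rev_hi = _mm_setr_epi8(          /* reversed nibble, placed in high half */
        0x00, (char)0x80, 0x40, (char)0xC0, 0x20, (char)0xA0, 0x60, (char)0xE0,
        0x10, (char)0x90, 0x50, (char)0xD0, 0x30, (char)0xB0, 0x70, (char)0xF0);
    const __m128i rev_lo = _mm_setr_epi8(          /* reversed nibble, placed in low half  */
        0x00, 0x08, 0x04, 0x0C, 0x02, 0x0A, 0x06, 0x0E,
        0x01, 0x09, 0x05, 0x0D, 0x03, 0x0B, 0x07, 0x0F);

    __m128i lo = _mm_and_si128(v, low_mask);                      /* low nibbles of input  */
    __m128i hi = _mm_and_si128(_mm_srli_epi32(v, 4), low_mask);   /* high nibbles of input */
    return _mm_or_si128(_mm_shuffle_epi8(rev_hi, lo),   /* reversed low nibble -> high half */
                        _mm_shuffle_epi8(rev_lo, hi));  /* reversed high nibble -> low half */
}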

So that seems to me to be the fastest possibility. Shifting and masking is about all you can do if you don't have the mega-lookup instruction PSHUFB available:

movdqa xmm1, magic1    ; Assume 16 bytes to be reversed are in xmm0
pand xmm1, xmm0
pxor xmm0, xmm1
psrld xmm1, 4
pslld xmm0, 4
por xmm0, xmm1         ; Groups of 4 bits have been swapped
movdqa xmm1, magic2
pand xmm1, xmm0
pxor xmm0, xmm1
psrld xmm1, 2
pslld xmm0, 2
por xmm0, xmm1         ; Groups of 2 bits have been swapped
movdqa xmm1, magic3
pand xmm1, xmm0
pxor xmm0, xmm1
psrld xmm1, 1
paddd xmm0, xmm0
por xmm0, xmm1         ; Now xmm0 has the 16 reversed bytes

.data
.align 16
magic1 db 16 dup(0f0h)
magic2 db 16 dup(0cch)
magic3 db 16 dup(0aah)
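
The same divide-and-conquer idea can be written for an ordinary 32-bit register in C, for comparison (a sketch; it reverses the bits within each byte while leaving the byte order alone):

#include <stdint.h>

/* Swap nibbles, then bit pairs, then single bits. */
static uint32_t reverse_bits_per_byte(uint32_t x)
{
    x = ((x & 0xF0F0F0F0u) >> 4) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x & 0xCCCCCCCCu) >> 2) | ((x & 0x33333333u) << 2);
    x = ((x & 0xAAAAAAAAu) >> 1) | ((x & 0x55555555u) << 1);
    return x;
}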

If you want to be competitive with the above algorithms you are going to need an LUT capable of converting two bytes simultaneously. That would be 2**16*2 = 131072 bytes and may start to grow outside of L1 cache. Also, if the data starts out in noncacheable memory, you would want to load it 16 bytes at a time if possible and then unpack it into byte or word registers to perform the lookups. It all depends on how the data gets into the processor. For example, there is a really cool algorithm for bit reversal of indices for FFTs, but that isn't reversing data at all, just generating the bit-reversals for a set of numbers in sequence, producing each pair exactly once.
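
Coming back to the two-bytes-at-a-time table mentioned above, a sketch of how such a 65536-entry table might be built (names are mine; whether it ever pays for its 131072 bytes depends entirely on cache behaviour, as discussed):

#include <stdint.h>

static uint16_t rev16_lut[65536];   /* 2**16 entries * 2 bytes = 131072 bytes */

static void init_rev16_lut(void)
{
    for (uint32_t w = 0; w < 65536; w++) {
        uint32_t r = 0;
        for (int i = 0; i < 16; i++)             /* reverse bits within each byte, */
            if (w & (1u << i))                   /* leaving the two bytes in place */
                r |= 1u << ((i & 8) | (7 - (i & 7)));
        rev16_lut[w] = (uint16_t)r;
    }
}

Conversion is then just dst16[i] = rev16_lut[src16[i]] for each 16-bit word.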

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
Reply to
James Van Buskirk

Do you mean fastest for MMX/SSE? Or, do you mean fastest x86 solution overall?

Why? I don't understand why you'd need to convert two bytes simultaneously to outperform that routine.

Let's take your routine above. You used 18 instructions. The latencies are listed as 2 for each instruction used. They're all "DirectPath Double" at 1.5/cycle. That's 2*18/1.5 = 24 cycles. If we add an xmm register load and store at the start and end of the routine, that's 24+(5/1.5) = 27.3 cycles for 16 bytes, or 1.7 cycles per byte.

Now, let's take a routine loop for 32-bits using bswap, xchg, and mov's and unroll it four times:

(untested pseudo-code x86) (repeat this four times for 16 bytes)

mov dword   (load eax)
mov byte    (ah)
mov byte    (al)
bswap dword (eax)
mov byte    (ah)
mov byte    (al)
mov dword   (store eax)

They're all DirectPath Single at 3/cycle. So, that's ((7*1)/3)*4=9.3 cycles for 16 bytes, or 0.58 cycles per byte.

ISTM, that standard x86 should be three times faster than your SSE2 code...

Have I made a mistake somewhere? Doesn't that perform without converting two bytes simultaneously?

Rod Pemberton

Reply to
Rod Pemberton

As it is, the pseudo-code reorders the bytes but it leaves the bit order inside the bytes intact. At the least, you need to add a bit-reverse table lookup for each byte.

--
Tauno Voipio
tauno voipio (at) iki fi
Reply to
Tauno Voipio

I think that the pseudocode:

mov byte(al)

is intended to be the super-slow XLAT instruction, and the:

mov byte(ah)

is intended to be the nonexistent XLATH instruction. Then:

bswap dword (eax)

is intended to cause a partial register stall via the instruction:

bswap eax

So he would need another bswap instruction (and another PRS) to get the bytes back in order. Overall, perhaps an order of magnitude slower than the SSE2 code I posted, once he gets working code. Even with the partial register stalls and other issues worked out, it would still be significantly slower.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
Reply to
James Van Buskirk

That's what the "mov byte" instructions were for: byte table lookups...

But yes, it probably needs a couple more instructions to address the table, since the x86 addressing modes can't access "memory+AL" or "memory+AH". However, the code would still need to triple in size to be slow enough to match the SSE2 routine, according to Intel's latency data. Even with a working routine, it won't grow that large.

Rod Pemberton

Reply to
Rod Pemberton

Uh, no... Yes, the "mov byte" lines do need an extra instruction or two to fix up the memory addressing.

ROFL.

...

What? One bswap suffices to reorder the bytes. The bytes then only need their bit order reversed. So, why are two bswap's required?

That's making the assumption that XLAT was used. It wasn't. Even with register stalls, and a few extra instructions to fix up the "mov byte" for memory addressing issues, the Intel manual's latency data indicates it should be much faster... If they had indicated DirectPath Single at 1/cycle or 2/cycle, then I could see the claim that SSE2 was faster. But, at 3/cycle? No way! As the number of latencies per cycle increases, it becomes even more unlikely that DirectPath Double instructions outperform DirectPath Single. E.g., 2-1=1, 3-1.5=1.5, 5-2.5=2.5. I can accept that the SSE2 routine is faster on certain Intel processors due to different latencies for them, or faster on AMDs because of different latencies. But, from the calculations I made from Intel's data, it isn't. Timing the routines on a CPU with 3/cycle may prove otherwise. As you've pointed out, there are register stalls and cache issues. I'm of the belief that the SSE2 code is slower, at least for Intels with the 3/cycle... I don't have an Intel to test, but I've no intent to time this anyway. You, or others, can if you wish. It's not hard to flesh out my routine or the other posted routines.

Rod Pemberton

Reply to
Rod Pemberton

Well, that's very good. Without checking I can see it has the potential to be a lot faster - at least given certain conditions: the bytes to be reversed being stored next to each other, having to wait for 16 values, possible end cases if the count is not a multiple of 16, and the pshufb instruction requiring the far-from-universal SSSE3 according to

formatting link

Given those issues and the OP's problem description, his own suggestion of a 256-byte lookup table (or a 16-byte one used twice) still seems best, but he may want to consider the above, so I am copying the reply back to the original groups.

BTW, I'm not sure if you are aware but both your replies have dropped the original groups. Why choose to do that? Or is it a newsreader setting or restriction?

James

Reply to
James Harris

"James Van Buskirk" wrote in message news:ib18lc$fm4$ snipped-for-privacy@news.eternal-september.org...

I'm kind of rusty at FASM so I figured I might as well time a couple of algorithms as an exercise. In the following, reverse_SSSE3 is the code using PSHUFB (my version, not Harold's), reverse_SSE2 is the divide-and-conquer code using shifting and masking, reverse_xlatb is the code using xlatb as proposed by Rod Pemberton, and reverse_bytes and reverse_store are two codes intended to be similar to the code of aleksa. reverse_bytes accumulates results in a register then stores them en masse, whereas reverse_store stores each reversed byte as it is generated. OS, assembler and compiler:

C:\gfortran\clf\reverse>ver

Microsoft Windows [Version 5.2.3790]

C:\gfortran\clf\reverse>fasm
flat assembler version 1.67.18
usage: fasm <source> [output]
optional settings:
 -m <limit>        set the limit in kilobytes for the memory available to assembler
 -p <limit>        set the maximum allowed number of passes

C:\gfortran\clf\reverse>gfortran -v
Built by Equation Solution.
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=c:/gcc_equation/bin/../libexec/gcc/x86_64-pc-mingw32/4.6.0/lto-wrapper.exe
Target: x86_64-pc-mingw32
Configured with: ../gcc-4.6-20100626-mingw/configure CFLAGS=-O0 --host=x86_64-pc-mingw32 --build=x86_64-unknown-linux-gnu --target=x86_64-pc-mingw32 --prefix=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/gcc/4.6-20100626 --with-gmp=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/gmp --with-mpfr=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/mpfr --with-mpc=/home/gfortran/gcc-home/binary/mingw32/native/x86_64/mpc --with-sysroot=/home/gfortran/gcc-home/binary/mingw32/cross/x86_64/gcc/4.6-20100626 --with-gcc --with-gnu-ld --with-gnu-as --disable-shared --disable-nls --disable-tls --enable-libgomp --enable-languages=c,fortran,c++ --enable-threads=win32 --disable-win32-registry
Thread model: win32
gcc version 4.6.0 20100626 (experimental) (GCC)

C:\gfortran\clf\reverse>type reverse.asm
format MS64 coff
; File: reverse.asm
; Assembled with: fasm reverse.asm

section 'CODE' code readable executable align 16
align 16
public reverse_SSSE3
reverse_SSSE3:
   mov r10, rdx
   rdtsc
   shl rdx, 32
   lea r9, [rdx+rax]       ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rax, [r10+rdx]
   neg rdx
   jns loop1_end
loop1:
   movdqa xmm0, [rcx+rdx]
   movdqa xmm1, [magic1]   ; Assume 16 bytes to be reversed are in xmm0
   pand xmm1, xmm0         ; xmm1 has high nibbles
   pxor xmm0, xmm1         ; xmm0 has low nibbles
   psrld xmm1, 4           ; Shift high nibbles to low position
   movdqa xmm2, [magic2]   ; High reversed LUT
   pshufb xmm2, xmm0       ; xmm2 has high nibbles of reversed bytes
   movdqa xmm0, [magic3]   ; Low reversed LUT
   pshufb xmm0, xmm1       ; xmm0 has low nibbles of reversed bytes
   por xmm0, xmm2          ; Now xmm0 has the 16 reversed bytes
   movdqa [rax+rdx], xmm0
   add rdx, 16
   js loop1
loop1_end:
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_SSE2
reverse_SSE2:
   mov r10, rdx
   rdtsc
   shl rdx, 32
   lea r9, [rdx+rax]       ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rax, [r10+rdx]
   neg rdx
   jns loop2_end
loop2:
   movdqa xmm0, [rcx+rdx]
   movdqa xmm1, [magic1]   ; Assume 16 bytes to be reversed are in xmm0
   pand xmm1, xmm0
   pxor xmm0, xmm1
   psrld xmm1, 4
   pslld xmm0, 4
   por xmm0, xmm1          ; Groups of 4 bits have been swapped
   movdqa xmm1, [magic4]
   pand xmm1, xmm0
   pxor xmm0, xmm1
   psrld xmm1, 2
   pslld xmm0, 2
   por xmm0, xmm1          ; Groups of 2 bits have been swapped
   movdqa xmm1, [magic5]
   pand xmm1, xmm0
   pxor xmm0, xmm1
   psrld xmm1, 1
   paddd xmm0, xmm0
   por xmm0, xmm1          ; Now xmm0 has the 16 reversed bytes
   movdqa [rax+rdx], xmm0
   add rdx, 16
   js loop2
loop2_end:
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_xlatb
reverse_xlatb:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   lea rbx, [LUT8]
   shl rdx, 32
   lea r9, [rdx+rax]       ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rsi, [r10+rdx]
   neg rdx
   jns loop3_end
loop3:
   mov eax, [rcx+rdx]
   db 48h
   xlatb
   rol eax, 8
   db 48h
   xlatb
   rol eax, 8
   db 48h
   xlatb
   rol eax, 8
   db 48h
   xlatb
   rol eax, 8
   mov [rsi+rdx], eax
   add rdx, 4
   js loop3
loop3_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_bytes
reverse_bytes:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   lea rbx, [LUT8]
   shl rdx, 32
   lea r9, [rdx+rax]       ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea r8, [r10+rdx]
   neg rdx
   jns loop4_end
loop4:
   mov rax, [rcx+rdx]
   rol rax, 16
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   shl esi, 8
   movzx r10d, al
   movzx r10d, byte [rbx+r10]
   or r10d, esi
   rol rax, 16
   shl r10d, 8
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   or r10d, esi
   shl r10d, 8
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   or r10d, esi
   rol rax, 16
   shl r10, 8
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   or r10, rsi
   shl r10, 8
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   or r10, rsi
   rol rax, 16
   shl r10, 8
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   or r10, rsi
   shl r10, 8
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   lea rax, [rsi+r10]
   mov [r8+rdx], rax
   add rdx, 8
   js loop4
loop4_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_store
reverse_store:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   lea rbx, [LUT8]
   shl rdx, 32
   lea r9, [rdx+rax]       ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea r8, [r10+rdx]
   neg rdx
   jns loop5_end
loop5:
   mov rax, [rcx+rdx]
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+1], sil
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx], sil
   shr rax, 16
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+3], sil
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+2], sil
   shr rax, 16
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+5], sil
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+4], sil
   shr rax, 16
   movzx esi, ah
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+7], sil
   movzx esi, al
   movzx esi, byte [rbx+rsi]
   mov [r8+rdx+6], sil
   add rdx, 8
   js loop5
loop5_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

section 'DATA' data readable writeable align 16
align 16
label magic1 dqword
   db 0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h
   db 0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h,0f0h
label magic2 dqword
   db  0h, 80h, 40h,0c0h, 20h,0a0h, 60h,0e0h
   db 10h, 90h, 50h,0d0h, 30h,0b0h, 70h,0f0h
label magic3 dqword
   db  0h,  8h,  4h, 0ch,  2h, 0ah,  6h, 0eh
   db  1h,  9h,  5h, 0dh,  3h, 0bh,  7h, 0fh
label magic4 dqword
   db 0cch,0cch,0cch,0cch,0cch,0cch,0cch,0cch
   db 0cch,0cch,0cch,0cch,0cch,0cch,0cch,0cch
label magic5 dqword
   db 0aah,0aah,0aah,0aah,0aah,0aah,0aah,0aah
   db 0aah,0aah,0aah,0aah,0aah,0aah,0aah,0aah
LUT8 db  0h, 80h, 40h,0C0h, 20h,0A0h, 60h,0E0h
     db 10h, 90h, 50h,0D0h, 30h,0B0h, 70h,0F0h
     db  8h, 88h, 48h,0C8h, 28h,0A8h, 68h,0E8h
     db 18h, 98h, 58h,0D8h, 38h,0B8h, 78h,0F8h
     db  4h, 84h, 44h,0C4h, 24h,0A4h, 64h,0E4h
     db 14h, 94h, 54h,0D4h, 34h,0B4h, 74h,0F4h
     db 0Ch, 8Ch, 4Ch,0CCh, 2Ch,0ACh, 6Ch,0ECh
     db 1Ch, 9Ch, 5Ch,0DCh, 3Ch,0BCh, 7Ch,0FCh
     db  2h, 82h, 42h,0C2h, 22h,0A2h, 62h,0E2h
     db 12h, 92h, 52h,0D2h, 32h,0B2h, 72h,0F2h
     db 0Ah, 8Ah, 4Ah,0CAh, 2Ah,0AAh, 6Ah,0EAh
     db 1Ah, 9Ah, 5Ah,0DAh, 3Ah,0BAh, 7Ah,0FAh
     db  6h, 86h, 46h,0C6h, 26h,0A6h, 66h,0E6h
     db 16h, 96h, 56h,0D6h, 36h,0B6h, 76h,0F6h
     db 0Eh, 8Eh, 4Eh,0CEh, 2Eh,0AEh, 6Eh,0EEh
     db 1Eh, 9Eh, 5Eh,0DEh, 3Eh,0BEh, 7Eh,0FEh
     db  1h, 81h, 41h,0C1h, 21h,0A1h, 61h,0E1h
     db 11h, 91h, 51h,0D1h, 31h,0B1h, 71h,0F1h
     db  9h, 89h, 49h,0C9h, 29h,0A9h, 69h,0E9h
     db 19h, 99h, 59h,0D9h, 39h,0B9h, 79h,0F9h
     db  5h, 85h, 45h,0C5h, 25h,0A5h, 65h,0E5h
     db 15h, 95h, 55h,0D5h, 35h,0B5h, 75h,0F5h
     db 0Dh, 8Dh, 4Dh,0CDh, 2Dh,0ADh, 6Dh,0EDh
     db 1Dh, 9Dh, 5Dh,0DDh, 3Dh,0BDh, 7Dh,0FDh
     db  3h, 83h, 43h,0C3h, 23h,0A3h, 63h,0E3h
     db 13h, 93h, 53h,0D3h, 33h,0B3h, 73h,0F3h
     db 0Bh, 8Bh, 4Bh,0CBh, 2Bh,0ABh, 6Bh,0EBh
     db 1Bh, 9Bh, 5Bh,0DBh, 3Bh,0BBh, 7Bh,0FBh
     db  7h, 87h, 47h,0C7h, 27h,0A7h, 67h,0E7h
     db 17h, 97h, 57h,0D7h, 37h,0B7h, 77h,0F7h
     db 0Fh, 8Fh, 4Fh,0CFh, 2Fh,0AFh, 6Fh,0EFh
     db 1Fh, 9Fh, 5Fh,0DFh, 3Fh,0BFh, 7Fh,0FFh

C:\gfortran\clf\reverse>type driver.f90 program main use ISO_C_BINDING implicit none interface function reverse_SSSE3(source, dest, size) & bind(C,name='reverse_SSSE3') use ISO_C_BINDING implicit none integer(C_INT64_T), value :: size integer(C_INT8_T), intent(in) :: source(16*size) integer(C_INT8_T), intent(out) :: dest(16*size) integer(C_INT64_T) reverse_SSSE3 end function reverse_SSSE3

function reverse_SSE2(source, dest, size) & bind(C,name='reverse_SSE2') use ISO_C_BINDING implicit none integer(C_INT64_T), value :: size integer(C_INT8_T), intent(in) :: source(16*size) integer(C_INT8_T), intent(out) :: dest(16*size) integer(C_INT64_T) reverse_SSE2 end function reverse_SSE2

function reverse_xlatb(source, dest, size) & bind(C,name='reverse_xlatb') use ISO_C_BINDING implicit none integer(C_INT64_T), value :: size integer(C_INT8_T), intent(in) :: source(16*size) integer(C_INT8_T), intent(out) :: dest(16*size) integer(C_INT64_T) reverse_xlatb end function reverse_xlatb

function reverse_bytes(source, dest, size) & bind(C,name='reverse_bytes') use ISO_C_BINDING implicit none integer(C_INT64_T), value :: size integer(C_INT8_T), intent(in) :: source(16*size) integer(C_INT8_T), intent(out) :: dest(16*size) integer(C_INT64_T) reverse_bytes end function reverse_bytes

function reverse_store(source, dest, size) & bind(C,name='reverse_store') use ISO_C_BINDING implicit none integer(C_INT64_T), value :: size integer(C_INT8_T), intent(in) :: source(16*size) integer(C_INT8_T), intent(out) :: dest(16*size) integer(C_INT64_T) reverse_store end function reverse_store

function rev(x) use ISO_C_BINDING implicit none integer(C_INT8_T), value :: x integer(C_INT8_T) rev end function rev end interface integer(C_INT64_T), parameter :: blocksize = 4096 integer(C_INT8_T) buff(blocksize,4) integer i integer(C_INT64_T) time

do i = 1, blocksize buff(i,1) = i-1 buff(i,2) = rev(buff(i,1)) end do write(*,'(a,i0,a)') 'Testing SSSE3, 2*',blocksize,' bytes' do i = 1, 10 buff(:,3) = buff(:,1) time = reverse_SSSE3(buff(1,3), buff(1,4), blocksize/16) if(any(buff(:,4) /= buff(:,2))) then write(*,'(a)') 'Algorithm failed' else buff(:,3) = buff(:,2) time = time+reverse_SSSE3(buff(1,3), buff(1,4), blocksize/16) if(any(buff(:,4) /= buff(:,1))) then write(*,'(a)') 'Algorithm failed' else write(*,'(a,i0)') 'time = ', time end if end if end do write(*,'(/a,i0,a)') 'Testing SSE2, 2*',blocksize,' bytes' do i = 1, 10 buff(:,3) = buff(:,1) time = reverse_SSE2(buff(1,3), buff(1,4), blocksize/16) if(any(buff(:,4) /= buff(:,2))) then write(*,'(a)') 'Algorithm failed' else buff(:,3) = buff(:,2) time = time+reverse_SSE2(buff(1,3), buff(1,4), blocksize/16) if(any(buff(:,4) /= buff(:,1))) then write(*,'(a)') 'Algorithm failed' else write(*,'(a,i0)') 'time = ', time end if end if end do write(*,'(/a,i0,a)') 'Testing xlatb, 2*',blocksize,' bytes' do i = 1, 10 buff(:,3) = buff(:,1) time = reverse_xlatb(buff(1,3), buff(1,4), blocksize/16) if(any(buff(:,4) /= buff(:,2))) then else buff(:,3) = buff(:,2) time = time+reverse_xlatb(buff(1,3), buff(1,4), blocksize/16) if(any(buff(:,4) /= buff(:,1))) then write(*,'(a)') 'Algorithm failed' else write(*,'(a,i0)') 'time = ', time end if end if end do

write(*,'(/a,i0,a)') 'Testing bytes, 2*',blocksize,' bytes' do i = 1, 10 buff(:,3) = buff(:,1) time = reverse_bytes(buff(1,3), buff(1,4), blocksize/16) if(any(buff(:,4) /= buff(:,2))) then write(*,'(a)') 'Algorithm failed' else buff(:,3) = buff(:,2) time = time+reverse_bytes(buff(1,3), buff(1,4), blocksize/16) if(any(buff(:,4) /= buff(:,1))) then write(*,'(a)') 'Algorithm failed' else write(*,'(a,i0)') 'time = ', time end if end if end do

write(*,'(/a,i0,a)') 'Testing store, 2*',blocksize,' bytes' do i = 1, 10 buff(:,3) = buff(:,1) time = reverse_store(buff(1,3), buff(1,4), blocksize/16) if(any(buff(:,4) /= buff(:,2))) then write(*,'(a)') 'Algorithm failed' else buff(:,3) = buff(:,2) time = time+reverse_store(buff(1,3), buff(1,4), blocksize/16) if(any(buff(:,4) /= buff(:,1))) then write(*,'(a)') 'Algorithm failed' else write(*,'(a,i0)') 'time = ', time end if end if end do end program main

function rev(x)
   use ISO_C_BINDING
   implicit none
   integer(C_INT8_T), value :: x
   integer(C_INT8_T) rev
   integer i

   rev = 0
   do i = 0, bit_size(x)-1
      rev = ishft(rev, 1)
      rev = IOR(rev, IAND(x, 1_C_INT8_T))
      x = ishft(x, -1)
   end do
end function rev

C:\gfortran\clf\reverse>fasm reverse.asm flat assembler version 1.67.18 (1297254 kilobytes memory)

3 passes, 1449 bytes.

C:\gfortran\clf\reverse>gfortran driver.f90 reverse.obj -oreverse

C:\gfortran\clf\reverse>reverse Testing SSSE3, 2*4096 bytes time = 6680 time = 3350 time = 3340 time = 3400 time = 3390 time = 3390 time = 3390 time = 3390 time = 3390 time = 3400

Testing SSE2, 2*4096 bytes time = 4530 time = 4770 time = 4770 time = 4760 time = 4760 time = 4760 time = 4770 time = 4760 time = 4760 time = 4760

Testing xlatb, 2*4096 bytes time = 33100 time = 32950 time = 32950 time = 32970 time = 32970 time = 32970 time = 32960 time = 32970 time = 32960 time = 32970

Testing bytes, 2*4096 bytes time = 15680 time = 15620 time = 15610 time = 15610 time = 15610 time = 15610 time = 15620 time = 15620 time = 15620 time = 15620

Testing store, 2*4096 bytes time = 13180 time = 12490 time = 12510 time = 12490 time = 12510 time = 12510 time = 12510 time = 12510 time = 12510 time = 12510

So reverse_SSSE3 takes 3390/(2*4096) = 0.41 clocks/byte,
reverse_SSE2 takes 4760/(2*4096) = 0.58 clocks/byte,
reverse_xlatb takes 32960/(2*4096) = 4.02 clocks/byte,
reverse_bytes takes 15620/(2*4096) = 1.91 clocks/byte, and
reverse_store takes 12510/(2*4096) = 1.53 clocks/byte.

Even if the speed of the LUT method could be doubled by going to a 131072-byte table, which it can't, it would still be slower than the SSE2 divide-and-conquer method on my Core 2 Duo. Rod Pemberton's xlatb method is about an order of magnitude slower than the SSSE3 PSHUFB method.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
Reply to
James Van Buskirk

Albert van der Horst answered James Harris: ...

You're right in terms of code-size. But the OP asked for speed ...

XLAT is an awfully slow relic from 8080 days (vector path, latency = 5).

__ wolfgang

Reply to
wolfgang kern

Come on!

MOV BX, bit_reverse_table
XLAT

Intel didn't spend a whole single byte instruction for nothing. (That is .4% of the instruction space!)

Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
Reply to
Albert van der Horst

...

This is not comparing like with like if the conditions of the tests are set up to favour SSE. I agree there are speed advantages to SSE code but for a real comparison, while the translation tables can be pre-cached (to simulate prior iterations of the loop), values to be converted should be picked up from memory. The OP didn't specify that bytes to be converted were neatly adjacent and in groups of 16. A better test would be to pick them up individually from, say, 53 bytes apart.
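
Something like the following is the kind of strided test being suggested (a rough sketch; the table parameter and the 53-byte stride are only illustrative):

#include <stddef.h>

static void reverse_strided(const unsigned char *src, unsigned char *dst,
                            size_t count, size_t stride,
                            const unsigned char table[256])
{
    for (size_t i = 0; i < count; i++)       /* e.g. stride = 53, so the source bytes */
        dst[i] = table[src[i * stride]];     /* are never adjacent in memory          */
}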

Are you sure Rod specified xlat? AFAICS he suggested a 256-byte translation table and move instructions. He used bswap so he could effectively operate on the upper half of EAX. He could have used ror eax, 16 instead.

I ran a few tests of xlat on a Pentium M. It was slowish at 4 cycles per use when the CPU had to wait for the result of the translation (which goes back into AL) before it could initiate the next xlat. Timing of xlat was similar to running the operation explicitly as in:

mov al, [table + eax] ;Avg of 4 cycles per use

However, it was faster to repeat the following.

mov bl, [table + eax] ;Fairly consistent 1 cycle per use

In other words, as long as xlat didn't have a dependency chain it was fast. To confirm I tried

mov eax,
xlat

These ran repeatedly at 1 cycle for each pair. So xlat is *not* inherently slow on the Pentium M. This is possibly true for any P6 machine. I would normally still use the explicit instruction in code, though, because xlat may be slower on older machines such as the Pentium 1, the explicit form allows the dependency to be broken, and the explicit form makes the code clearer.

James

Reply to
James Harris

In comp.dsp James Harris wrote: (snip)

(snip)

I remember stories from when the PPro came out that it was slower on code mixing register sizes. That was most noticeable at the time running 16-bit MS-DOS code.

I don't remember the rules now for what is fast and what isn't, but much of the P6 core was kept in later processors. It seems that Intel was expecting full conversion to 32-bit code by the time the PPro became popular.

-- glen

Reply to
glen herrmannsfeldt

Go back and look at his opcounts. He had pseudocode that addressed off of a byte register, stored the result back in the same byte register, and did so in a single instruction. If he wants to prove otherwise he should man up and rig his own benchmarks to do so. Until then as author of the test bench I get to set the conditions of the problem.

Thinking about it since then it seems to me that it's more realistic, since the problem arose in comp.arch.embedded, to assume that the data is coming in maybe one byte at a time and being acquired in interrupt-driven code, perhaps via:

in al, dx

so that you only have the opportunity to process one byte at a time and none of the code or data that the code will use is in cache. In that case any LUT method will require loading an extra cache line, making it slow pretty much no matter what.

In the following, reverse_small is the divide-and-conquer method applied to a single byte, reverse_mul is the multiplication method from:

formatting link

where the scatter mask and bit select mask have been shifted right by one to avoid a 64-bit data load. reverse_loop is the very low code length method that shifts one bit at a time, reverse_loop1 is the same method but shifting the bit through a register instead of CF, reverse_SSE2a is the same method again but processing 16 bytes at once (I know...), and reverse_mul1 is the multiplication method revisited with the gather mask also shifted left by two to avoid another 64-bit load.
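
For reference, the multiplication method in its widely published C form looks like this (the asm uses shifted variants of these constants, as described above, so the values there differ):

#include <stdint.h>

static uint8_t reverse_byte_mul(uint8_t b)
{
    /* Scatter the 8 bits with one multiply, select them with a mask,
       then gather them into the top byte with a second multiply. */
    return (uint8_t)((((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL) >> 32);
}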

New code (see

formatting link
for old code and driver program):

public reverse_small ; 29 bytes to bit-reverse al
reverse_small:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   shl rdx, 32
   lea r9, [rdx+rax]       ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rbx, [r10+rdx]
   neg rdx
   jns loop6_end
loop6:
   mov al, [rcx+rdx]
   movzx esi, al
   and eax, 55h
   xor esi, eax
   shr esi, 1
   lea eax, [2*rax+rsi]
   mov esi, eax
   and eax, 0ffffffcch
   xor esi, eax
   shr eax, 2
   lea eax, [4*rsi+rax]
   ror al, 4
   mov [rbx+rdx], al
   add rdx, 1
   js loop6
loop6_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_mul ; 38 bytes to bit-reverse al (high bits of rax precleared)
reverse_mul:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   shl rdx, 32
   lea r9, [rdx+rax]       ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rbx, [r10+rdx]
   neg rdx
   jns loop7_end
loop7:
   movzx eax, byte [rcx+rdx]
   imul rax, 40100401h
   mov rsi, 442211088h
   and rax, rsi
   mov rsi, 202020202h
   imul rax, rsi
   shr rax, 32
   mov [rbx+rdx], al
   add rdx, 1
   js loop7
loop7_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_loop ; 11 bytes to bit-reverse al
reverse_loop:
   mov r10, rdx
   rdtsc
   push rbx
   shl rdx, 32
   lea r9, [rdx+rax]       ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   lea r10, [rcx+rdx]
   lea r8, [r10+rdx]
   neg rdx
   jns loop8_end
loop8:
   mov al, [r10+rdx]
   mov cl, 8
loop8_inner:
   shr al, 1
   adc bl, bl
   sub cl, 1
   jnz loop8_inner
   mov [r8+rdx], bl
   add rdx, 1
   js loop8
loop8_end:
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_loop1 ; 17 bytes to bit-reverse al to bl (high bits of rax precleared)
reverse_loop1:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   shl rdx, 32
   lea r9, [rdx+rax]       ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   lea r10, [rcx+rdx]
   lea r8, [r10+rdx]
   neg rdx
   jns loop9_end
loop9:
   movzx eax, byte [r10+rdx]
   mov cl, 8
loop9_inner:
   mov esi, eax
   shr eax, 1
   and esi, 1h
   lea ebx, [2*rbx+rsi]
   sub cl, 1
   jnz loop9_inner
   mov [r8+rdx], bl
   add rdx, 1
   js loop9
loop9_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_SSE2a
reverse_SSE2a:
   mov r10, rdx
   rdtsc
   shl rdx, 32
   lea r9, [rdx+rax]       ; r9 has initial time stamp
   push rbx
   mov rdx, r8
   shl rdx, 4
   lea rbx, [rcx+rdx]
   lea rax, [r10+rdx]
   neg rdx
   jns loop10_end
loop10:
   movdqa xmm0, [rbx+rdx]
   mov cl, 8
loop10_inner:
   movdqa xmm2, [magic6]
   pand xmm2, xmm0
   psrld xmm0, 1
   paddb xmm1, xmm1
   por xmm1, xmm2
   sub cl, 1
   jnz loop10_inner
   movdqa [rax+rdx], xmm1
   add rdx, 16
   js loop10
loop10_end:
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

public reverse_mul1 ; 34 bytes to bit-reverse al (high bits of rax precleared)
reverse_mul1:
   mov r10, rdx
   rdtsc
   push rbx
   push rsi
   shl rdx, 32
   lea r9, [rdx+rax]       ; r9 has initial time stamp
   mov rdx, r8
   shl rdx, 4
   add rcx, rdx
   lea rbx, [r10+rdx]
   neg rdx
   jns loop11_end
loop11:
   movzx eax, byte [rcx+rdx]
   imul rsi, rax, 40100401h
   mov rax, 442211088h
   and rsi, rax
   sub rax, 3e1d0c84h
   imul rax, rsi
   shr rax, 33
   mov [rbx+rdx], al
   add rdx, 1
   js loop11
loop11_end:
   pop rsi
   pop rbx
   rdtsc
   shl rdx, 32
   or rax, rdx
   sub rax, r9
   ret

section 'DATA' data readable writeable align 16
align 16
label magic6 dqword
   db 1h, 1h, 1h, 1h, 1h, 1h, 1h, 1h
   db 1h, 1h, 1h, 1h, 1h, 1h, 1h, 1h

with results:

C:\gfortran\clf\reverse>reverse Testing SSSE3, 2*4096 bytes time = 6800 time = 3390 time = 3380 time = 3360 time = 3340 time = 3340 time = 3340 time = 3360 time = 3390 time = 3360

Testing SSE2, 2*4096 bytes time = 4540 time = 4610 time = 4780 time = 4800 time = 4790 time = 4770 time = 4790 time = 4770 time = 4780 time = 4760

Testing xlatb, 2*4096 bytes time = 33020 time = 33020 time = 32980 time = 32980 time = 33010 time = 32990 time = 32990 time = 32980 time = 32980 time = 32980

Testing bytes, 2*4096 bytes time = 15640 time = 15690 time = 15690 time = 15650 time = 15680 time = 15660 time = 15700 time = 15670 time = 15660 time = 15710

Testing store, 2*4096 bytes time = 20990 time = 21280 time = 12650 time = 12640 time = 12620 time = 12680 time = 12590 time = 12660 time = 12770 time = 12640

Testing small, 2*4096 bytes time = 78320 time = 78330 time = 78290 time = 78290 time = 78320 time = 78270 time = 78230 time = 78310 time = 78230 time = 78290

Testing mul, 2*4096 bytes time = 55650 time = 33360 time = 33310 time = 33320 time = 33340 time = 33350 time = 33290 time = 33290 time = 33330 time = 33390

Testing loop, 2*4096 bytes time = 196910 time = 196810 time = 196850 time = 196820 time = 196820 time = 209150 time = 196910 time = 196850 time = 196840 time = 196800

Testing loop1, 2*4096 bytes time = 219790 time = 219790 time = 219580 time = 219770 time = 366270 time = 219620 time = 219760 time = 219790 time = 219580 time = 219720

Testing SSE2a, 2*4096 bytes time = 13140 time = 13130 time = 13130 time = 13080 time = 13020 time = 13020 time = 13020 time = 13020 time = 13050 time = 13130

Testing mul1, 2*4096 bytes time = 33210 time = 33200 time = 33070 time = 33050 time = 33050 time = 33080 time = 33060 time = 33030 time = 33060 time = 33060

So reverse_small uses 29 bytes to process bytes in 78300/8192 = 9.56 clocks/byte,
reverse_mul uses 38 bytes for 33300/8192 = 4.06 clocks/byte,
reverse_loop uses 11 bytes for 197000/8192 = 24.04 clocks/byte,
reverse_loop1 uses 17 bytes for 220000/8192 = 26.86 clocks/byte, and
reverse_mul1 uses 34 bytes for 33100/8192 = 4.04 clocks/byte.

reverse_SSE2a really doesn't belong with this group because it processes multiple bytes and uses memory, but it takes 13100/8192 = 1.60 clocks/byte.

That reverse_mul1 algorithm looks pretty hot, using only about a half cache line of code and no data and crunching out the reversal in 4 clocks (although this is reciprocal throughput, not latency) on my Core 2 Duo.

You can also look in

formatting link
to see what timings Agner Fog has measured for the various processors.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
Reply to
James Van Buskirk

James, wait wait..

Intel invented that instruction back in the 8086 days. That's a loooong time ago.

It was a *killer* instruction back then. We didn't have much (if any) instruction cache, so instruction size had a direct effect on performance.

Nowadays the instruction still exists, but no modern compiler that I'm aware of uses it. It exists for backward compatibility. That instruction is a candidate for slow microcode.

Cheers, Nils

Reply to
Nils

I don't understand something about this thread: why not use FFTW's implementation and be done with it? It may be fun to reinvent the wheel, but it's not profitable.

--Randy

Nils writes:

--
Randy Yates                      % "How's life on earth? 
Digital Signal Labs              %  ... What is it worth?" 
Reply to
Randy Yates

What for? Did you intend to address the poster, Albert, who suggested xlat?

Albert's reference to BX suggests he may be still there :-)

Based on tests, I made some relevant comments on this in my prior post in this thread (q.v.).

James

Reply to
James Harris
