Fast bit-reverse on an x86? - Page 3



Re: Fast bit-reverse on an x86?


Makes sense in the presence of register renaming (Tomasulo-type).
Writing just part of the register requires the hardware to merge with
the rest of the real register whereas writing the whole thing allows
the value to propagate without reference to the original value.
Interestingly, according to "Inner Loops" the Pentium had little
problem with partial writes but the original Pentium Pro had a big
seven-cycle problem with them (page 136). Helpfully, the book also
says that the PPro is OK with

  xor eax, eax
  mov al, 53

Not sure when Intel improved things or when/if AMD did. At any rate,
as mentioned, partial writes are good to avoid where practicable.


I don't follow your logic here. You seem to be saying that an AMD (any
AMD? surely not) can make two read requests per cycle but can still
only achieve one of the above instructions per cycle rather than two.


You don't like AMD for some reason? :-(

Don't forget these are timings from a specific CPU with specific
pieces of code. Timings in practice depend on the CPU, the specific
instruction mix, what comes before and after and even alignment. I
should have made that clear.

Nevertheless the key in many CPUs is to break dependency chains. IIRC
Fog makes that point. Taking your specific example, doing something
useful with bl doesn't preclude loading another byte register or
carrying out other work.
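To illustrate the point about overlapping other work with a byte-register result, here is a hedged C sketch (my own, not anyone's posted code; the name translate2 and the table parameter are made up for illustration). Unrolling a table-translate loop two ways gives the core two independent lookup chains instead of one serial chain through a single byte register:

```c
#include <stddef.h>
#include <stdint.h>

/* Two-way unrolled table translate: the two lookups per iteration
   are independent, so an out-of-order core can overlap them instead
   of serializing every byte through one register. */
void translate2(uint8_t *dst, const uint8_t *src, size_t n,
                const uint8_t table[256])
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint8_t a = table[src[i]];      /* chain 1 */
        uint8_t b = table[src[i + 1]];  /* chain 2, independent */
        dst[i]     = a;
        dst[i + 1] = b;
    }
    if (i < n)                          /* odd tail byte */
        dst[i] = table[src[i]];
}
```

Whether the compiler or the CPU does the overlapping for you, the point stands: nothing forces the second lookup to wait for the first.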


If it's the zero you are concerned about, as stated each instance of
the instruction took zero or one cycles. All the zero means is that
some instances were paired, adding nothing to the overall cycle
count.

James

Re: Fast bit-reverse on an x86?


If your requirements are no worse than that and assuming a 12-16 bit
ADC, just use a 4-64 KiW lookup table. Prefetching the LUT might help
in some cases, but in general I do not expect any significant
improvement.

With 16 bit ADCs, you could experiment with using a single 256 byte
LUT and two translations per sample; however, it is hard to predict
whether this is useful for a particular cache architecture.
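As a sketch of the two-translations-per-sample idea (the names rev8, init_rev8 and rev16 are mine, just for illustration): reverse each byte of the 16-bit sample through a 256-byte table, then swap the two bytes.

```c
#include <stdint.h>

/* Byte-wise bit-reverse LUT; only 256 bytes, so it stays cached. */
static uint8_t rev8[256];

static void init_rev8(void)
{
    for (int i = 0; i < 256; i++) {
        uint8_t r = 0;
        for (int b = 0; b < 8; b++)
            if (i & (1 << b))
                r |= (uint8_t)(0x80 >> b);
        rev8[i] = r;
    }
}

/* Two table translations per 16-bit sample: reverse each byte,
   then swap the byte halves. */
static uint16_t rev16(uint16_t x)
{
    return (uint16_t)((rev8[x & 0xff] << 8) | rev8[x >> 8]);
}
```

For example, rev16(0x0001) gives 0x8000.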


Re: Fast bit-reverse on an x86?


Is this a long bit string, stored across multiple bytes, that needs
to be reversed?

Is the bit count a multiple of 8 or perhaps a multiple of 32 ?

Swap the bytes starting from opposite ends of the byte string with
byte moves and then bit swap each byte using table look up.

To reduce the number of memory accesses, perform the table look up
during each byte move.

For a bit string that is a multiple of 32 bits and properly aligned,
load a register with 4 bytes from one end of the string, swap the
bytes within the register, use each byte separately to perform a
table lookup and store the 4 bytes with a single dword move into the
opposite end of the string.
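The swap-from-both-ends pass can be sketched in C (a hedged sketch; bitrev_buffer is a made-up name, and the 256-entry byte bit-reverse table is passed in by the caller):

```c
#include <stddef.h>
#include <stdint.h>

/* One pass over the byte string: swap bytes from opposite ends and
   bit-reverse each byte through the table during the same move, so
   every byte is read and written exactly once. */
void bitrev_buffer(uint8_t *buf, size_t n, const uint8_t rev8[256])
{
    if (n == 0)
        return;
    size_t lo = 0, hi = n - 1;
    while (lo < hi) {
        uint8_t a = rev8[buf[lo]];
        uint8_t b = rev8[buf[hi]];
        buf[lo++] = b;
        buf[hi--] = a;
    }
    if (lo == hi)            /* middle byte of an odd-length string */
        buf[lo] = rev8[buf[lo]];
}
```

The dword variant described above is the same idea, four bytes per load and store.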

The effects of caching are hard to predict with multilevel cache
hierarchies.  However, there are several articles dealing with
optimizing memcpy() functions, e.g. by prefetching data into cache
with 64 bit dummy loads onto the double precision stack, or by
touching every 32nd byte (one byte from each cache line). This may
affect how cache write-back/write-through is performed at different
cache hierarchy levels.

Of course the data should be properly aligned relative to dwords,
32 byte cache lines and, for even larger data sets, the 4096 byte
virtual memory pages.
 

Re: Fast bit-reverse on an x86?

If you need the bit reversal for a bit-reversal shuffle loop for an
FFT, I may be able to help out with a piece of code. I wrote this two
years ago, and as far as I remember it performed better than the
lookup-table variant.


n is a power of two.
data points to an array that is shuffled in place, 8 bytes per entry
(i.e. a complex of two floats). It's easy to modify this for a
different element size by changing the movq instructions to something
else.


void BitReverse1 (void * data, int n)
{
  _asm
  {
    mov   eax, [n]
    mov   edi, [data]

    pushad

    dec   eax         ; i = n - 1
    xor   ebx, ebx    ; jp = 0;
    xor   ecx, ecx    ; jn = 0;
    bsr   esi, eax    ; s = bsr(n - 1)

theloop:
    bsf   ebp, eax    ; k = bsf (i)
    mov   edx, esi    ; x = s;
    sub   edx, ebp    ; x = s-k
    btc   ebx, ebp    ; jp ^= (1<<k)
    btc   ecx, edx    ; jn ^= (1<<x)
    cmp   ecx, ebx
    jge   skipShuffle
    movq  mm0, [edi + ebx*8]
    movq  mm1, [edi + ecx*8]
    movq  [edi + ebx*8], mm1
    movq  [edi + ecx*8], mm0
skipShuffle:
    dec   eax
    jnz   theloop
    emms
    popad
  }
}
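For readers without MMX or inline asm, here is my portable C rendering of the same permutation (a sketch, not the original author's code; the helper names bsf32 and bsr32 stand in for the bsf/bsr instructions): jp walks 1..n-1 in Gray-code order, jn tracks its bit reversal, and each pair is swapped exactly once.

```c
#include <stdint.h>

/* Portable stand-ins for the bsf/bsr instructions (v must be nonzero
   for bsf32; bsr32(0) harmlessly returns 0 here). */
static int bsf32(uint32_t v) { int k = 0; while (!(v & 1)) { v >>= 1; k++; } return k; }
static int bsr32(uint32_t v) { int k = 0; while (v >>= 1) k++; return k; }

/* n is a power of two; data has n entries of 8 bytes each. */
void bit_reverse1_c(uint64_t *data, int n)
{
    int s = bsr32((uint32_t)(n - 1));       /* bsr esi, eax */
    uint32_t jp = 0, jn = 0;
    for (uint32_t i = (uint32_t)(n - 1); i != 0; i--) {
        int k = bsf32(i);                   /* bsf ebp, eax */
        jp ^= 1u << k;                      /* btc ebx, ebp */
        jn ^= 1u << (s - k);                /* btc ecx, edx */
        if (jn < jp) {                      /* swap each pair once */
            uint64_t t = data[jp];
            data[jp] = data[jn];
            data[jn] = t;
        }
    }
}
```

For n = 8 this swaps elements 1 and 4, and 3 and 6, which is exactly the bit-reversal permutation of 3-bit indices.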

Feel free to use this code; I declare it public domain.

 Nils Pipenbrinck


Re: Fast bit-reverse on an x86?

If you have parallel access in hardware, perhaps a "gated" bit
reversing ROM may work for you.

Re: Fast bit-reverse on an x86?
Not knowing anything past the 586, I'd do it something like this:

mov eax,[table]  ;table pointer, aligned on 256 bytes
mov ecx,[src]
mov edx,[dst]
mov ebx,[count]
dec ebx          ;index of last byte

revloop:
mov al,[ecx+ebx] ;source byte lands in the low byte of the table pointer
mov al,[eax]     ;256-byte alignment makes eax index the table directly
mov [edx+ebx],al
dec ebx
jns revloop



