Fast bit-reverse on an x86?

Doing some work on a PC. Customer says it's not real time, but real fast is still better than real slow. I need to bit-reverse a slew of bytes, so the faster the better.

AFAIK the Pentium instruction set doesn't have a bit-reverse instruction, and there certainly isn't one available from C. My knee-jerk reaction to this problem is to make up a table of bit-reversed bytes, and do a lookup to retrieve them.

Anyone know a faster way?

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" was written for you.
See details at http://www.wescottdesign.com/actfes/actfes.html
Reply to
Tim Wescott

Don't know the x86 really but generally this should be doable the "usual" way: shift source once right (so LSB goes into carry, X or whatever), shift destination once left (so carry goes into LSB). Repeat 8 times. No memory accesses involved so perhaps not slower than table lookup at all, and if slower not that bad anyway. I would measure which is faster on the particular hardware if speed is really important.
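In C, where there is no direct access to the carry flag, the same idea looks roughly like this (untested sketch, names are made up):

    #include <stdint.h>

    /* Shift the source out LSB-first and into the result MSB-first --
       the C equivalent of the shift-through-carry loop described above. */
    static uint8_t reverse_byte_shift(uint8_t src)
    {
        uint8_t dst = 0;
        for (int i = 0; i < 8; i++) {
            dst = (uint8_t)((dst << 1) | (src & 1));  /* bit out of src, into dst */
            src >>= 1;
        }
        return dst;
    }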

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
------------------------------------------------------

Reply to
Didi

faster than table lookup???

how big are the words? did you say bytes?

if the words aren't that big, the table ain't that big, memory is cheap, why wouldn't LUT be the fast and simple way to do it?

r b-j

Reply to
robert bristow-johnson

That's how I've always done it.

If there's no bit-reverse instruction, I can't really imagine anything faster than doing a lookup in a 256-byte table.
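For reference, a minimal C sketch of that 256-entry table approach (table filled once at startup; all names are made up):

    #include <stdint.h>
    #include <stddef.h>

    static uint8_t rev_table[256];

    /* Build the 256-entry bit-reverse table once, e.g. at program start. */
    static void init_rev_table(void)
    {
        for (int v = 0; v < 256; v++) {
            uint8_t r = 0;
            for (int b = 0; b < 8; b++)
                if (v & (1 << b))
                    r |= (uint8_t)(1 << (7 - b));
            rev_table[v] = r;
        }
    }

    /* Reverse every byte in a buffer with one table lookup per byte. */
    static void reverse_buffer(uint8_t *buf, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            buf[i] = rev_table[buf[i]];
    }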

I do know somebody (a HW engineer) who once wired up two I/O ports so that he could do a bit-reversal by writing to one port and reading from the other. However, on most machines that's still not any faster than a lookup table (and probably not something you can easily do on a PC).

--
Grant Edwards               grant.b.edwards        Yow! !  The land of the
                                  at               rising SONY!!
                              gmail.com
Reply to
Grant Edwards

On a modern Intel CPU (if that's what he means, they haven't been x86 for a long time), it can make a big difference if the LUT isn't in the cache memory. A programmed solution (like the suggestion to shift in and out of the carry bit) may well be faster if the code stays in cache and the LUT accesses don't.

I don't know how hard it is to make sure the LUT is in the cache if it's done that way. Maybe that's not hard with the right tools, but it would be an important point to verify if the difference matters.
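One way to tilt the odds, assuming an SSE-capable compiler: touch the table's cache lines with the _mm_prefetch intrinsic before the hot loop. A sketch only -- whether it actually helps has to be measured, and rev_table here stands for whatever 256-byte table you use:

    #include <xmmintrin.h>   /* _mm_prefetch */

    extern unsigned char rev_table[256];

    /* Touch the table so it is likely resident in L1 before the
       conversion loop starts; 64-byte cache lines assumed. */
    static void warm_rev_table(void)
    {
        for (int off = 0; off < 256; off += 64)
            _mm_prefetch((const char *)&rev_table[off], _MM_HINT_T0);
    }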

Eric Jacobsen
Minister of Algorithms
Abineau Communications

Reply to
Eric Jacobsen


And if you're doing this in x86 assembler, the last two lines can be replaced with a bswap. Plus in assembler you can eliminate roughly half the shifts by using rotates, and then compensating in the next step for the overall rotation of the partial result. Not tested, but roughly:

    ; input/output in eax
    mov   ebx,eax
    and   eax,0xaaaaaaaa
    ror   eax,2
    and   ebx,0x55555555
    or    eax,ebx

    mov   ebx,eax
    and   eax,0x66666666
    ror   eax,4
    and   ebx,0x99999999
    or    eax,ebx

    mov   ebx,eax
    and   eax,0x1e1e1e1e
    ror   eax,8
    and   ebx,0xe1e1e1e1
    or    eax,ebx

    rol   eax,7
    bswap eax

That should work for 64-bit words on x86-64 by just changing the E?Xs to R?Xs.

Reply to
robertwessel2

Table lookup is very fast, but this is quicker than doing it the hard way:

    unsigned int reverse(register unsigned int x)
    {
        x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
        x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
        x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
        x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
        return ((x >> 16) | (x << 16));
    }

Reply to
Walter Banks

Are the bytes contiguous? (i.e., can you treat them as words, etc.) Or, are they discrete bytes that you must process independent of each other?

Well, not an *instruction*, per se, but you can do this algorithmically in C. In ASM you can exploit the carry flag...

Depending on how "modern" the PC is (legacy vs. IA64, etc.) it is hard to state which way would ALWAYS be fastest. E.g., you can shift left out, shift right in, count, repeat in almost the same number of instructions as doing a table index operation. So, the "code" is comparable in size (the point being, either approach is equally likely to be present in a particular cache line).

OTOH, unless you know for a fact that the table will be present in the cache (in its entirety), you might take a "swing and a miss" trying to access the appropriate entry in that table (256 bytes being a large-ish object).

I would, instead, ask if the reason for the bit reversal can be eliminated. I.e., if it is going to be used in some subsequent processing/calculation, can you modify *that* logic to deal with the bits in their "unreversed" order?

Equivalently, can you *acquire* the bits in a reversed order (instead of however they are currently acquired)?

Reply to
D Yuniskis

What do you do? Mirroring some bitmaps?

  1. Take FFT.
  2. Invert the imaginary part
  3. Take inverse FFT.

Show them that you are the DSP engineer, not some code monkey :-)

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant

Reply to
Vladimir Vassilevsky

It's for a bunch of data that's recorded off of a 1-bit ADC -- they're changing their frequency plan and need things converted, and by the way they're changing from lsb first to msb first.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" was written for you.
See details at http://www.wescottdesign.com/actfes/actfes.html
Reply to
Tim Wescott

Is this a long bit string, stored across multiple bytes, that needs to be reversed?

Is the bit count a multiple of 8, or perhaps a multiple of 32?

Swap the bytes starting from opposite ends of the byte string with byte moves, and then bit-reverse each byte using a table look-up.

To reduce the number of memory accesses, perform the table look up during each byte move.

For a bit string that is a multiple of 32 bits and properly aligned, load a register with 4 bytes from one end of the string, perform byte swaps within the register, use each byte separately to perform a table lookup, and store the 4 bytes with a single dword move into the opposite end of the string.
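A rough C sketch of the byte-at-a-time variant of this (table and names are placeholders; the 32-bit register version would follow the same pattern with a byte swap):

    #include <stdint.h>
    #include <stddef.h>

    extern uint8_t rev_table[256];   /* per-byte bit-reverse table */

    /* Reverse a whole bit string: swap bytes working in from both ends,
       bit-reversing each byte through the table as it is moved. */
    static void reverse_bit_string(uint8_t *buf, size_t nbytes)
    {
        if (nbytes == 0)
            return;
        size_t lo = 0, hi = nbytes - 1;
        while (lo < hi) {
            uint8_t a = rev_table[buf[lo]];
            uint8_t b = rev_table[buf[hi]];
            buf[lo++] = b;
            buf[hi--] = a;
        }
        if (lo == hi)                 /* odd byte count: middle byte in place */
            buf[lo] = rev_table[buf[lo]];
    }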

The effects of cache are hard to predict with multilevel cache hierarchies. However, there are several articles dealing with optimizing memcpy() functions, e.g. by prefetching data into cache with 64-bit dummy loads onto the double precision stack, or by prefetching by accessing every 32nd byte (one byte from each cache line). This may have an effect on how cache write-back/write-through is performed at different cache hierarchy levels.

Of course the data should be properly aligned relative to dwords, 32-byte cache lines, and for even larger data types the 4096-byte virtual memory pages.
Reply to
Paul Keinanen

If you need the bit reversal for a bit-reversal shuffle loop for FFT I may be able to help out with a piece of code. I wrote this two years ago, and as far as I remember it performed better than the lookup-table variant.

n is a power of two. data points to an array that is shuffled in place, 8 bytes per entry (aka a float complex). It's easy to modify this for a different element size by changing the movq instructions to something else.

    void BitReverse1 (void * data, int n)
    {
      _asm {
        mov   eax, [n]
        mov   edi, [data]

        pushad

        dec   eax           ; l--
        xor   ebx, ebx      ; jp = 0
        xor   ecx, ecx      ; jn = 0
        bsr   esi, eax      ; s = bsr(l-1)

    theloop:
        bsf   ebp, eax      ; k = bsf(i)
        mov   edx, esi      ; x = s
        sub   edx, ebp      ; x = s-k
        btc   ebx, ebp      ; jp ^= (1
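For comparison, the same permutation in plain C -- not the bsf/btc incremental-index trick above, just the straightforward form (untested sketch; assumes n is a power of two):

    #include <stddef.h>
    #include <complex.h>

    /* In-place bit-reversal shuffle of n complex samples (n a power of two). */
    static void bit_reverse_shuffle(float complex *data, size_t n)
    {
        if (n < 2)
            return;
        size_t j = 0;
        for (size_t i = 0; i < n - 1; i++) {
            if (i < j) {                    /* swap each pair only once */
                float complex t = data[i];
                data[i] = data[j];
                data[j] = t;
            }
            size_t bit = n >> 1;            /* add 1 to j in bit-reversed order */
            while (j & bit) {
                j ^= bit;
                bit >>= 1;
            }
            j |= bit;
        }
    }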

Reply to
Nils

There are two things that cause unexpected slowdowns on a processor like this. One is cache - as has been mentioned, table lookup may be very slow if the table is not in cache (and if it /is/ in cache, it is using up valuable cache space that other code could use).

The second thing is pipelining. Code like the function above is full of pipeline stalls, since the operations are highly dependent on the results of the previous lines. It may be possible to re-arrange the code in some way to make it more pipeline-friendly.
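One simple illustration of that, assuming the table approach (names are placeholders): handle several bytes per loop iteration so the lookups form independent dependency chains the CPU can overlap, instead of one long serial chain. Whether it actually wins has to be measured.

    #include <stdint.h>
    #include <stddef.h>

    extern uint8_t rev_table[256];

    /* Four independent lookups per iteration; leftover bytes done singly. */
    static void reverse_buffer_unrolled(uint8_t *buf, size_t n)
    {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            uint8_t b0 = rev_table[buf[i + 0]];
            uint8_t b1 = rev_table[buf[i + 1]];
            uint8_t b2 = rev_table[buf[i + 2]];
            uint8_t b3 = rev_table[buf[i + 3]];
            buf[i + 0] = b0;
            buf[i + 1] = b1;
            buf[i + 2] = b2;
            buf[i + 3] = b3;
        }
        for (; i < n; i++)
            buf[i] = rev_table[buf[i]];
    }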

Reply to
David Brown

Maybe it can be done in hardware. How is the data getting from the ADC into the PC in the first place? Do you have a microcontroller in the data path? If the ADC has an SPI bus, you could use an FTDI chip connected to USB, and use that to read the SPI MSB first.

Reply to
David Brown

If you have parallel access in hardware, perhaps a "gated" bit reversing ROM may work for you.

Reply to
Clay

On x86-32 your first idea - the table lookup - should compile to something like

    and   eax, 0xff
    mov   al, [bit_reverse_table + eax]

That's it. These are two simple (i.e. single cycle or better) instructions. The table for this is only 256 bytes (which you should align on the line boundary of your data cache if possible). The table or the relevant parts of it will be cached and you won't see the code for dust....

I can think of some more complex schemes but none of them would be as fast even where they handle 4 or 8 bytes at a time.

cc. alt.lang.asm - an x86 forum

James

Reply to
James Harris

If your requirements are not worse than that, and assuming a 12-16 bit ADC, just use a 4-64 KiW look-up table. Prefetching the LUT might help in some cases, but in general I do not expect any significant improvement.

With 16 bit ADCs, you could experiment with using a single 256-byte LUT and two translations per sample; however, it is hard to predict whether this is useful or not for a particular cache architecture.
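A sketch of the two-translations-per-sample idea in C (table name is a placeholder):

    #include <stdint.h>

    extern uint8_t rev_table[256];   /* per-byte bit-reverse table */

    /* Bit-reverse one 16-bit sample with the 256-byte table used twice:
       reverse each byte, then swap the two halves. */
    static uint16_t reverse16(uint16_t x)
    {
        return (uint16_t)((rev_table[x & 0xff] << 8) | rev_table[x >> 8]);
    }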

Reply to
Paul Keinanen

Not knowing anything past 586, I'd do it something like:

        mov   eax,[table]     ; table, aligned on a 256-byte boundary
        mov   ecx,[src]
        mov   edx,[dst]
        dec   ebx             ; ebx holds the byte count; make it the last index

    loop:
        mov   al,[ecx+ebx]    ; source byte becomes the low byte of eax...
        mov   al,[eax]        ; ...so [eax] is the table entry for that byte
        mov   [edx+ebx],al
        dec   ebx
        jns   loop

Reply to
aleksa

Your first idea should be to use assembly. Implementing the routine in C will likely be worse than assembly, whether it's shift based or table lookup. Most C compilers don't effectively use the x86 instruction set for such situations, and/or the C routine uses more shifts etc. than assembly would.

There's an x86 instruction for the byte table lookup. It's called XLAT. Although, MOV is probably faster on modern x86's, due to x86 instruction pipelining.

For 4 or 8 bytes, using the x86 BSWAP instruction will reverse the byte order, so you can then use a table lookup for the bytes. Of course, one issue with the XLAT instruction is moving each byte into AL to do the lookup. You'll probably need a few intermediate registers and additional reordering instructions to place each byte into AL, e.g., XCHG AH,AL and SHR EAX,16, etc. However, this should be as simple as a hex display routine. Even without using XLAT, say using MOV, you'd still need to access a 32-bit or 64-bit value as bytes to use a lookup table. x86 only allows the lower 16 bits of a register, as two byte registers, to be accessed directly as bytes. This is only for some of the original registers, and not the new registers. So, some shifts, rotates, BSWAPs or XCHGs are needed. Of course, you can do it entirely with shifts (SHR/SHL/SAL), as you would in C. I think that's probably slower, however. Rotates (ROR/ROL/RCL/RCR) could be used also, with similar speed. If you decide to use shifts, shifts and masks with a couple of registers can be optimized to handle multiple bytes at once, like this hex routine handles multiple nybbles at once using two registers:
formatting link

Of course, whatever you come up with, you may want to test for speed against other solutions. It seems Nils posted an MMX solution in the original c.a.e. thread. An MMX solution could be faster.
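Since several people have said "measure it": a crude timing harness along these lines is enough to compare candidates (sketch only; it assumes the table-lookup and shift-loop routines sketched earlier in the thread are defined in the same file, and a realistic data set should be used for real numbers):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <stdint.h>

    void reverse_buffer(uint8_t *buf, size_t n);    /* table version */
    uint8_t reverse_byte_shift(uint8_t b);          /* shift version */

    int main(void)
    {
        enum { N = 1 << 20, REPS = 100 };
        uint8_t *buf = malloc(N);
        for (size_t i = 0; i < N; i++)
            buf[i] = (uint8_t)rand();

        clock_t t0 = clock();
        for (int r = 0; r < REPS; r++)
            reverse_buffer(buf, N);

        clock_t t1 = clock();
        for (int r = 0; r < REPS; r++)
            for (size_t i = 0; i < N; i++)
                buf[i] = reverse_byte_shift(buf[i]);

        clock_t t2 = clock();
        printf("table: %.3fs  shift: %.3fs\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(buf);
        return 0;
    }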

Rod Pemberton

Reply to
Rod Pemberton

James Harris copied this to ALA; I have no access to comp.dsp or comp.arch.embedded ...

My first thought about a 1-bit ADC was: "must be a digital device" :)

OK, serial-input ADCs are a bit cheaper than parallel ones, especially if unbuffered. MSB first does make sense to overcome settling issues. Fast buffered flash ADCs ask for more U$s, so I think we see one more time a merchant's decision for a cost-saving change, which usually ends up with workarounds and unsatisfied customers.

Yes, if it's an 8-bit ADC here; but when I look at audio recorders, voltage/temperature sensors and graphics-acquisition ADCs, there are 12...24 bits to convert. It would work with the byte table as well, even if some more lines are required then.

Right, unfortunately we haven't got a bit-reversal instruction on x86, only SHL/SHR for every single bit, which is somewhat slow; but I know that LIFO hardware and wired converters have existed since digital electronics came along. Modern CPU hardware could do a parallel bit reversal within one cycle by adding a gate and a few wires.

And if I had a problem like the one reported here, I'd add/modify/replace a few hardware parts :)

__ wolfgang (ALA)

Reply to
wolfgang kern
