Fast bit-reverse on an x86? - Page 3



Re: Fast bit-reverse on an x86?


Makes sense in the presence of register renaming (Tomasulo-type).
Writing just part of the register requires the hardware to merge with
the rest of the real register whereas writing the whole thing allows
the value to propagate without reference to the original value.
Interestingly, according to "Inner Loops" the Pentium had little
problem with partial writes but the original Pentium Pro had a big
seven-cycle problem with them (page 136). Helpfully, the book also
says that the PPro is OK with

  xor eax, eax
  mov al, 53

Not sure when Intel improved things or when/if AMD did. At any rate,
as mentioned, partial writes are good to avoid where practicable.


I don't follow your logic here. You seem to be saying that an AMD (any
AMD? surely not) can make two read requests per cycle but can still
only achieve one of the above instructions per cycle rather than two.


You don't like AMD for some reason? :-(

Don't forget these are timings from a specific CPU with specific
pieces of code. Timings in practice depend on the CPU, the specific
instruction mix, what comes before and after and even alignment. I
should have made that clear.

Nevertheless the key in many CPUs is to break dependency chains. IIRC
Fog makes that point. Taking your specific example, doing something
useful with bl doesn't preclude loading another byte register or
carrying out other work.
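To illustrate the point about overlapping other work with a byte-register result, here is a hedged C sketch (my own, not anyone's posted code; the name translate2 and the table parameter are made up for illustration). Unrolling a table-translate loop two ways gives the core two independent lookup chains instead of one serial chain through a single byte register:

```c
#include <stddef.h>
#include <stdint.h>

/* Two-way unrolled table translate: the two lookups per iteration
   are independent, so an out-of-order core can overlap them instead
   of serializing every byte through one register. */
void translate2(uint8_t *dst, const uint8_t *src, size_t n,
                const uint8_t table[256])
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint8_t a = table[src[i]];      /* chain 1 */
        uint8_t b = table[src[i + 1]];  /* chain 2, independent */
        dst[i]     = a;
        dst[i + 1] = b;
    }
    if (i < n)                          /* odd tail byte */
        dst[i] = table[src[i]];
}
```

Whether the compiler or the CPU does the overlapping for you, the point stands: nothing forces the second lookup to wait for the first.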


If it's the zero you are concerned about, as stated each instance of
the instruction took zero or one cycles. All the zero means is that
some instances were paired, adding nothing to the overall cycle
count.

James

Re: Fast bit-reverse on an x86?


If your requirements are no worse than that and assuming a 12-16 bit
ADC, just use a 4-64 KiW lookup table. Prefetching the LUT might help
in some cases, but in general I do not expect any significant
improvement.

With 16 bit ADCs, you could experiment with using a single 256 byte
LUT and two translations per sample; however, it is hard to predict
whether this is useful for a particular cache architecture.
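As a sketch of the two-translations-per-sample idea (the names rev8, init_rev8 and rev16 are mine, just for illustration): reverse each byte of the 16-bit sample through a 256-byte table, then swap the two bytes.

```c
#include <stdint.h>

/* Byte-wise bit-reverse LUT; only 256 bytes, so it stays cached. */
static uint8_t rev8[256];

static void init_rev8(void)
{
    for (int i = 0; i < 256; i++) {
        uint8_t r = 0;
        for (int b = 0; b < 8; b++)
            if (i & (1 << b))
                r |= (uint8_t)(0x80 >> b);
        rev8[i] = r;
    }
}

/* Two table translations per 16-bit sample: reverse each byte,
   then swap the byte halves. */
static uint16_t rev16(uint16_t x)
{
    return (uint16_t)((rev8[x & 0xff] << 8) | rev8[x >> 8]);
}
```

For example, rev16(0x0001) gives 0x8000.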


Re: Fast bit-reverse on an x86?


Is this a long bit string, stored across multiple bytes, that needs
to be reversed?

Is the bit count a multiple of 8 or perhaps a multiple of 32 ?

Swap the bytes starting from opposite ends of the byte string with
byte moves and then bit swap each byte using table look up.

To reduce the number of memory accesses, perform the table look up
during each byte move.

For a bit string that is a multiple of 32 bits and properly aligned,
load a register with 4 bytes from one end of the string, swap the
bytes within the register, use each byte separately to perform a
table lookup and store the 4 bytes with a single dword move into the
opposite end of the string.
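The swap-from-both-ends pass can be sketched in C (a hedged sketch; bitrev_buffer is a made-up name, and the 256-entry byte bit-reverse table is passed in by the caller):

```c
#include <stddef.h>
#include <stdint.h>

/* One pass over the byte string: swap bytes from opposite ends and
   bit-reverse each byte through the table during the same move, so
   every byte is read and written exactly once. */
void bitrev_buffer(uint8_t *buf, size_t n, const uint8_t rev8[256])
{
    if (n == 0)
        return;
    size_t lo = 0, hi = n - 1;
    while (lo < hi) {
        uint8_t a = rev8[buf[lo]];
        uint8_t b = rev8[buf[hi]];
        buf[lo++] = b;
        buf[hi--] = a;
    }
    if (lo == hi)            /* middle byte of an odd-length string */
        buf[lo] = rev8[buf[lo]];
}
```

The dword variant described above is the same idea, four bytes per load and store.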

The effects of caching are hard to predict with multilevel cache
hierarchies.  However, there are several articles dealing with
optimizing memcpy() functions, e.g. by prefetching data into cache
with 64 bit dummy loads onto the double precision stack, or by
touching every 32nd byte (one byte from each cache line). This may
affect how cache write-back/write-through is performed at different
cache hierarchy levels.

Of course the data should be properly aligned relative to dwords,
32 byte cache lines and, for even larger data sets, the 4096 byte
virtual memory pages.
 

Re: Fast bit-reverse on an x86?

If you need the bit reversal for a bit-reversal shuffle loop for an
FFT, I may be able to help out with a piece of code. I wrote this two
years ago, and as far as I remember it performed better than the
lookup-table variant.


n is a power of two.
data points to an array that is shuffled in place, 8 bytes per entry
(i.e. a complex of two floats). It's easy to modify this for a
different element size by changing the movq instructions to something
else.


void BitReverse1 (void * data, int n)
{
  _asm
  {
    mov   eax, [n]
    mov   edi, [data]

    pushad

    dec   eax         ; i = n - 1
    xor   ebx, ebx    ; jp = 0;
    xor   ecx, ecx    ; jn = 0;
    bsr   esi, eax    ; s = bsr(n - 1)

theloop:
    bsf   ebp, eax    ; k = bsf (i)
    mov   edx, esi    ; x = s;
    sub   edx, ebp    ; x = s-k
    btc   ebx, ebp    ; jp ^= (1<<k)
    btc   ecx, edx    ; jn ^= (1<<x)
    cmp   ecx, ebx
    jge   skipShuffle
    movq  mm0, [edi + ebx*8]
    movq  mm1, [edi + ecx*8]
    movq  [edi + ebx*8], mm1
    movq  [edi + ecx*8], mm0
skipShuffle:
    dec   eax
    jnz   theloop
    emms
    popad
  }
}
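For readers without MMX or inline asm, here is my portable C rendering of the same permutation (a sketch, not the original author's code; the helper names bsf32 and bsr32 stand in for the bsf/bsr instructions): jp walks 1..n-1 in Gray-code order, jn tracks its bit reversal, and each pair is swapped exactly once.

```c
#include <stdint.h>

/* Portable stand-ins for the bsf/bsr instructions (v must be nonzero
   for bsf32; bsr32(0) harmlessly returns 0 here). */
static int bsf32(uint32_t v) { int k = 0; while (!(v & 1)) { v >>= 1; k++; } return k; }
static int bsr32(uint32_t v) { int k = 0; while (v >>= 1) k++; return k; }

/* n is a power of two; data has n entries of 8 bytes each. */
void bit_reverse1_c(uint64_t *data, int n)
{
    int s = bsr32((uint32_t)(n - 1));       /* bsr esi, eax */
    uint32_t jp = 0, jn = 0;
    for (uint32_t i = (uint32_t)(n - 1); i != 0; i--) {
        int k = bsf32(i);                   /* bsf ebp, eax */
        jp ^= 1u << k;                      /* btc ebx, ebp */
        jn ^= 1u << (s - k);                /* btc ecx, edx */
        if (jn < jp) {                      /* swap each pair once */
            uint64_t t = data[jp];
            data[jp] = data[jn];
            data[jn] = t;
        }
    }
}
```

For n = 8 this swaps elements 1 and 4, and 3 and 6, which is exactly the bit-reversal permutation of 3-bit indices.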

Feel free to use this code; I declare it public domain.

 Nils Pipenbrinck


Re: Fast bit-reverse on an x86?

If you have parallel access in hardware, perhaps a "gated" bit
reversing ROM may work for you.

Re: Fast bit-reverse on an x86?
Not knowing anything past the 586, I'd do it something like this:

mov eax,[table]  ;table pointer, aligned on 256 bytes
mov ecx,[src]
mov edx,[dst]
mov ebx,[count]
dec ebx          ;index of last byte

revloop:
mov al,[ecx+ebx] ;source byte lands in the low byte of the table pointer
mov al,[eax]     ;256-byte alignment makes eax index the table directly
mov [edx+ebx],al
dec ebx
jns revloop



