ARM11/ARMv6: right shift of signed packed values

I'm attempting to pack some numbers for output after doing some work. They're currently r7 = v0|v4 and r10 = v1|v5. They all need to be >>3, before or after repacking. Output will be v1|v0 and v5|v4 (little endian architecture). I managed to get the v1|v0 written reasonably efficiently:

    mov   r8, r10, asr #3       ; 1>>3|xxx
    pkhtb r8, r8, r7, asr #19   ; 1>>3|0>>3
    str   r8, [r0], r2          ; o1|o0, post inc
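To make the register contents easier to follow, here is a minimal C sketch that models the two data-processing instructions above on plain 32-bit integers. The helper names (asr32, pkhtb_asr, pack16, pack_top_halves) are made up for this illustration, and "hi|lo" is taken to mean the upper and lower halfword of a register, as in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right by n bits (0 <= n < 32), written so that it does
   not depend on how the compiler treats >> on negative signed values. */
static uint32_t asr32(uint32_t x, unsigned n)
{
    assert(n < 32);
    uint32_t sign_fill = (x & UINT32_C(0x80000000))
                             ? (uint32_t)~(UINT32_C(0xFFFFFFFF) >> n)
                             : 0u;
    return (x >> n) | sign_fill;
}

/* Model of PKHTB Rd, Rn, Rm, ASR #n:
   top halfword from Rn, bottom halfword from (Rm ASR n). */
static uint32_t pkhtb_asr(uint32_t rn, uint32_t rm, unsigned n)
{
    return (rn & UINT32_C(0xFFFF0000)) | (asr32(rm, n) & UINT32_C(0x0000FFFF));
}

/* Hypothetical helper: build a register holding hi|lo. */
static uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Models: mov r8, r10, asr #3 ; pkhtb r8, r8, r7, asr #19
   with r7 = v0|v4 and r10 = v1|v5. Returns (v1>>3)|(v0>>3). */
static uint32_t pack_top_halves(uint32_t r7, uint32_t r10)
{
    uint32_t r8 = asr32(r10, 3);   /* top halfword is now v1>>3 */
    return pkhtb_asr(r8, r7, 19);  /* bottom halfword becomes v0>>3 */
}
```

For example, with v0 = -20 and v1 = 100, pack_top_halves(pack16(-20, 7), pack16(100, -1)) yields 0x000CFFFD, that is 12 in the upper halfword and -3 in the lower one, whatever the lower input halfwords contain.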

But v5|v4 is a little ugly because I'm starting with the least significant bits, so right shifting is going to drag in the bottom of the upper word (right?). Right now I'm sign extending, then writing individual shorts.

    mov  r8, r10, asr #3    ; 5 >> 3
    strh r8, [r0, #2]       ; o5
    sxth r1, r1             ;
    sxth r7, r7             ;
    mov  r8, r7, asr #3     ; 4 >> 3
    strh r8, [r0], r2       ; o4, post inc

I found this example:

    PKHBT R3, R1, R2, LSL #15   ; R3 = [R2>>1, R1]
    PKHTB R3, R3, R1, ASR #1    ; R3 = [R2>>1, R1>>1]

However, that seems to rely on the input being full words.

Is there a better way to do this?

johann.koenig


Bit lazy with the copy/paste. Should be:

    sxth r10, r10           ; sign extend 5
    sxth r7, r7             ; sign extend 4
    mov  r8, r10, asr #3    ; 5 >> 3
    strh r8, [r0, #2]       ; o5
    mov  r8, r7, asr #3     ; 4 >> 3
    strh r8, [r0], r2       ; o4, post inc
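The worry about the shift dragging in bits from the upper halfword can be checked with a small C model. This is only a sketch: asr32, sxth, strh_value, low_shift_naive and low_shift_sxth are names invented here, and strh is modeled simply as "keep the low 16 bits of the register".

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right by n bits (0 <= n < 32). */
static uint32_t asr32(uint32_t x, unsigned n)
{
    assert(n < 32);
    uint32_t sign_fill = (x & UINT32_C(0x80000000))
                             ? (uint32_t)~(UINT32_C(0xFFFFFFFF) >> n)
                             : 0u;
    return (x >> n) | sign_fill;
}

/* Model of SXTH: sign-extend the bottom halfword to 32 bits. */
static uint32_t sxth(uint32_t x)
{
    return (x & 0x8000u) ? (x | UINT32_C(0xFFFF0000))
                         : (x & UINT32_C(0x0000FFFF));
}

/* The halfword that strh would write: the low 16 bits of the register. */
static uint16_t strh_value(uint32_t r)
{
    return (uint16_t)(r & 0xFFFFu);
}

/* Shifting the packed register directly: the low 3 bits of the upper
   halfword end up in bits 15..13 of the stored value. */
static uint16_t low_shift_naive(uint32_t packed)
{
    return strh_value(asr32(packed, 3));
}

/* Models: sxth rX, rX ; mov r8, rX, asr #3 ; strh r8, ... */
static uint16_t low_shift_sxth(uint32_t packed)
{
    return strh_value(asr32(sxth(packed), 3));
}
```

With the register 0x00010008 (upper halfword 1, lower halfword 8), the naive shift stores 0x2001 while the sign-extended version stores 0x0001, which matches the concern raised in the question.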

Johann

An easy alternative would be to shift r10 and r7 left by 16 and then apply your first sequence. This way you save an instruction and use a single str.

However the best option would be to avoid shifting at this stage. Unless it is the final result, delaying the shift until the next processing step might be cheaper. Another possibility is to use halving additions if you do any, so that the result is already shifted.
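The post does not name a specific instruction. Assuming "halving additions" refers to the ARMv6 parallel halving add on halfwords (SHADD16), the following C sketch models what such an instruction computes: each signed 16-bit lane is added at full precision and the sum is then shifted right by one, so the intermediate result cannot overflow. The function names sext16, floor_half and shadd16 are chosen for this illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a 32-bit signed value. */
static int32_t sext16(uint32_t x)
{
    return (int32_t)((x & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Floor division by 2, equivalent to an arithmetic shift right by one. */
static int32_t floor_half(int32_t s)
{
    return (s >= 0) ? s / 2 : -((1 - s) / 2);
}

/* Model of a halving add on two packed signed halfwords:
   each lane becomes (a_lane + b_lane) >> 1, computed without overflow. */
static uint32_t shadd16(uint32_t a, uint32_t b)
{
    int32_t lo = floor_half(sext16(a) + sext16(b));
    int32_t hi = floor_half(sext16(a >> 16) + sext16(b >> 16));
    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}
```

For instance, shadd16(0x0064FFFD, 0x0032FFFC) adds 100 + 50 and -3 + -4 lane by lane and returns 0x004BFFFC, that is 75 and -4. One of the three shift positions is absorbed by the addition itself, which is the saving the suggestion hints at.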

Wilco

Wilco Dijkstra

Thanks for the tip. At first I thought it would use extra instructions to do the shift, but then I realized that would just replace the sign extends. Unfortunately, this is the only way I can do the operation. The shift has to be the last thing, and can't be pre-processed at the receiving end. New code saves 1 store per loop:

    mov   r10, r10, lsl #16         ; 5|x
    mov   r7, r7, lsl #16           ; 4|x
    mov   r10, r10, asr #3          ; 5>>3|xxx
    pkhtb r10, r10, r7, asr #19     ; 5>>3|4>>3
    str   r10, [r0], r2             ; o5|o4, post inc
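As a sanity check on the new sequence, here is a minimal C sketch that models it on 32-bit integers, with r7 = v0|v4 and r10 = v1|v5 as in the original question. The helper names (asr32, pkhtb_asr, pack16, pack_low_halves) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right by n bits (0 <= n < 32). */
static uint32_t asr32(uint32_t x, unsigned n)
{
    assert(n < 32);
    uint32_t sign_fill = (x & UINT32_C(0x80000000))
                             ? (uint32_t)~(UINT32_C(0xFFFFFFFF) >> n)
                             : 0u;
    return (x >> n) | sign_fill;
}

/* Model of PKHTB Rd, Rn, Rm, ASR #n:
   top halfword from Rn, bottom halfword from (Rm ASR n). */
static uint32_t pkhtb_asr(uint32_t rn, uint32_t rm, unsigned n)
{
    return (rn & UINT32_C(0xFFFF0000)) | (asr32(rm, n) & UINT32_C(0x0000FFFF));
}

/* Hypothetical helper: build a register holding hi|lo. */
static uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Models the four data-processing instructions of the new sequence.
   Returns (v5>>3)|(v4>>3); the upper input halfwords are discarded. */
static uint32_t pack_low_halves(uint32_t r7, uint32_t r10)
{
    r10 <<= 16;                     /* mov r10, r10, lsl #16 : 5|0    */
    r7  <<= 16;                     /* mov r7, r7, lsl #16   : 4|0    */
    r10 = asr32(r10, 3);            /* mov r10, r10, asr #3  : 5>>3|x */
    return pkhtb_asr(r10, r7, 19);  /* pkhtb r10, r10, r7, asr #19    */
}
```

With v4 = -20 and v5 = 100, pack_low_halves(pack16(0x7FFF, -20), pack16(-1, 100)) gives 0x000CFFFD, so the word stored by str holds 12 in the upper halfword and -3 in the lower one regardless of what v0 and v1 were.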

You mentioned halving addition, but I can't find anything about that. It probably wouldn't help in this case, since the math goes like (x+y+4)>>3 or (x-y+4)>>3. I can't add 4 first because the subtraction isn't associative.

--

-Johann
