single ARM instruction to copy C into r0 ?

- Francois Grieu

Wed, Feb 14, 2007 5:26 PM

Hello,

I'm trying to replace the ARM sequence

    MOV R0, #0
    ADC R0, R0, #0

with a single ARM instruction that copies C into R0, clearing bits 1..31. I do not care about the status bits afterwards, and I have no register with a known value.

Any idea ?

Francois Grieu

- tum_

Wed, Feb 14, 2007 10:07 PM

Hi Francois,

I gave myself about an hour to think about your riddle and I don't think it's possible. I'd be very glad to hear the opposite. (Still thinking about it, although it's really time to have a nap)

PS. Why comp.arch.embedded? Try comp.sys.arm....

- Boudewijn Dijkstra

Thu, Feb 15, 2007 9:11 AM

On Wed, 14 Feb 2007 18:26:08 +0100, Francois Grieu wrote:

Only one of these instructions will be executed:

    MOVCC R0, #0
    MOVCS R0, #1

Does that count?

--
Made with Opera's revolutionary e-mail program:
http://www.opera.com/mail/

- Peter Dickerson

Thu, Feb 15, 2007 9:53 AM

Surely that depends on your view of what 'executed' means in this context. Is there an ARM processor where the unexecuted instruction takes no (extra) time? I tend to think of these conditional instructions as going down the pipeline, having (some of) their results calculated, but then having the writeback inhibited. To be not executed at all, taking up no execution resources (saving power or time), would require knowing the carry flag setting during decode, and stalling decode if there is a carry-changing instruction ahead in the pipeline (or carry prediction).

Of course, if 'not executed' means having no architectural side effects, then presumably NOP is also not executed. OK, I'm being a pedant.

Anyway, the best I can do in one instruction is SBC R0,Rn,Rn, but this sets R0 to -1 if there is no carry and to 0 if there is a carry: one too low in either case. Perhaps this can be compensated for later in the instruction stream, but we have no info on how the result is to be used.

Peter
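To make the arithmetic behind the SBC suggestion concrete, here is a small C model. It is an editorial sketch, not part of the original post: the function name `sbc_same_reg` is made up, and it assumes the usual ARM definition SBC Rd, Rn, Op2 = Rn - Op2 - NOT(C), computed modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper modelling "SBC R0, Rn, Rn" on ARM.
 * Assumed semantics: Rd = Rn - Op2 - NOT(C), modulo 2^32. */
uint32_t sbc_same_reg(uint32_t rn, unsigned carry_in)
{
    assert(carry_in <= 1u);          /* the C flag is a single bit */
    uint32_t not_c = 1u - carry_in;  /* NOT(C) as 0 or 1 */
    return rn - rn - not_c;          /* C - 1: 0xFFFFFFFF if C=0, 0 if C=1 */
}
```

Whatever Rn holds, the result is C - 1, so adding 1 afterwards would recover C, at the cost of the second instruction the question is trying to avoid.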

- tum_

Thu, Feb 15, 2007 10:28 AM

;) I'm quite sure it doesn't.

This is too obvious
Francois specified 'single instruction'.

In a few datasheets that I've read it was explicitly stated that 'unexecuted' conditional instructions take 1 cycle.

AFAIK, there's no NOP in ARM, "mov r0,r0" is used instead (being pedantic too).

I've been racking my brains for a way to use the RRX shift somehow, but so far no luck. I agree that if the OP gave us a slightly larger picture we could be more productive with proposals, but I guess he doesn't want to.

Disclaimer: I'm not familiar with all ARM architectures & variants, so some of my statements may be wrong.

- Francois Grieu

Thu, Feb 15, 2007 10:58 AM

I can tell without needing legal advice that

- CPU core is ARM922T

- context is this routine (not tested), performing addition of two m-word integers in radix-2^32 representation

- caller will immediately store the returned r0, and I do not want to change calling convention

    ; perform result = X+Y (expressed as little-endian radix 2^32)
    ; on entry:
    ;  r0 points to result
    ;  r1 and r2 point to sources X and Y
    ;  r3 length in byte of X, Y and result, a non-negative multiple of 4
    ; on exit:
    ;  r0 is 1 or 0 depending on if result overflows or not
            STMFD   SP!,{r4-r5}     ; save temp registers used
            ADDS    r3, r3, #0      ; Z = (r3==0), C=0
            ADD     r3, r3, r1      ; r3 = r3 + r1, r3 points after end of X
            BEQ     adddone         ; -> early abort if Z is set (zero length)
    addloop LDR     r4, [r1], #4    ; get 32-bit from X, advance pointer
            LDR     r5, [r2], #4    ; get 32-bit from Y, advance pointer
            ADCS    r4, r4, r5      ; C:r4 = r4+r5+C (the actual arithmetic)
            STR     r4, [r0], #4    ; store 32-bit into result, advance pointer
            TEQ     r1, r3          ; Z = (r1==r3)
            BNE     addloop         ; -> loop until r1 reaches r3
    adddone MOV     R0, #0
            ADC     R0, R0, #0      ; r0 = C (could we save one instruction ?)
            LDMIA   SP!,{r4-r5}     ; restore temp registers used
            BX      LR              ; return to caller

Optimizing this is actually not critical, but I'm compacting the code to the max as an intellectual exercise to deeply familiarize myself with ARM.

Francois Grieu
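For readers who want something to check the assembly against, the routine's contract can be sketched in C. This is an editorial reference model, not code from the thread; the name `mp_add` and the use of a 64-bit intermediate are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference for the routine above:
 * result = X + Y, little-endian radix 2^32, len_bytes a multiple of 4.
 * Returns 1 if the addition overflows the result, else 0. */
uint32_t mp_add(uint32_t *result, const uint32_t *x, const uint32_t *y,
                size_t len_bytes)
{
    assert(len_bytes % 4 == 0);
    uint32_t carry = 0;                       /* plays the role of the C flag */
    for (size_t i = 0; i < len_bytes / 4; i++) {
        uint64_t sum = (uint64_t)x[i] + y[i] + carry;  /* like ADCS */
        result[i] = (uint32_t)sum;            /* like STR: low 32 bits */
        carry = (uint32_t)(sum >> 32);        /* carry out of this word */
    }
    return carry;                             /* the value returned in r0 */
}
```

With a zero length the loop body never runs and the function returns 0, matching the early exit through `adddone` with C cleared.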

- Boudewijn Dijkstra

Thu, Feb 15, 2007 11:01 AM

On Thu, 15 Feb 2007 11:28:16 +0100, tum_ wrote:

Often the obvious solution is accompanied by: "Why didn't I think of this before?"

He didn't specify whether it was supposed to be a stored instruction or an executed instruction.

Yes. The execution stage of the pipeline just waits for the next instruction to ripple through.


- tum_

Thu, Feb 15, 2007 11:46 AM

After 20 minutes of thinking: I can't squeeze it any further, let's see what others say. All that I can propose is:

1) Use r12 instead of r5. r12 doesn't have to be preserved. This will improve speed & stack usage.
2) Swap the BEQ and ADD instructions; this will improve speed in case of zero length ;-).
3) When size is the issue, consider using Thumb (I understand that your goal is an exercise with ARM, not Thumb).

ps. my previous post still didn't appear in the thread (I use Google Groups), hopefully it will appear later but I'll paste the link here just in case:

formatting link

- Francois Grieu

Thu, Feb 15, 2007 1:09 PM

Thanks, had missed that one, although it is implied by

formatting link

Seems like, in a piece of code with only self-references (no linker veneer), and calling no external code, there is a carved-in-a-next-as-strong-as-hardware-stone rule that register r12 belongs to me.

After this optimization, is it worthwhile, neutral, counterproductive or impossible to reformulate STMFD SP!,{r4} into something like STR r4, [r13,#-4] (did I get this right?); that kind of thing would be wise on a 680x0 (assuming condition codes do not matter).

Yes, thanks. Also, in the context, since there is no TEQ in Thumb, I found no way to loop without interfering with the C bit, this in turn made some extra instructions necessary; but indeed, probably still a bit more compact.

Francois Grieu

- Laurent

Thu, Feb 15, 2007 2:16 PM

Doesn't EOR fit your needs?

Laurent

- Laurent

Thu, Feb 15, 2007 2:18 PM

No it doesn't, sorry :)

Laurent

- tum_

Thu, Feb 15, 2007 2:43 PM

Yes. If you're dealing with software that is conformant with Procedure Call Standards proposed by ARM. There may be software out there that is not conformant.

;) you forgot the '!' at the end (but I had to peep into the manual to correct you, I don't know this stuff by heart). To the best of my (limited) knowledge, the two instructions are identical in effect/speed/size for ARM7 & 9 cores.

Why would it be wise? (not familiar with 68k)

- Peter Dickerson

Thu, Feb 15, 2007 3:13 PM

I can make it more compact by removing two instructions from outside the loop and adding one inside and changing one slightly. Leave R3 as a count, counting down by 4. Then after the loop R3 is known to be zero so use ADC R0,R3,#0.

Peter

- Wilco Dijkstra

Thu, Feb 15, 2007 3:19 PM

How about (using new UAL syntax):

            PUSH    {r14}
            ADDS    r3, r3, r1      ; r3 = r3 + r1, r3 points after end of X, C = 0
            B       loopstart
    addloop
            LDR     r14, [r1], #4   ; get 32-bit from X, advance pointer
            LDR     r12, [r2], #4   ; get 32-bit from Y, advance pointer
            ADCS    r14, r14, r12   ; C:r14 = r14+r12+C (the actual arithmetic)
            STR     r14, [r0], #4   ; store 32-bit into result, advance pointer
    loopstart
            EORS    r14, r1, r3     ; Z = (r1==r3), r14 = 0
            BNE     addloop         ; -> loop until r1 reaches r3
            ADC     r0, r14, #0     ; r0 = C
            POP     {pc}

Note this assumes r1 + r3 doesn't overflow, ie. the array pointed to by r3 doesn't wrap around at the end of memory.

Wilco

- tum_

Thu, Feb 15, 2007 3:22 PM

SUBS r3,r3,#4 ?

But this will kill the carry... or am I missing something? Sorry, I'm in a bit of a hurry at the moment.

- tum_

Thu, Feb 15, 2007 3:31 PM

Nice. Just another example of a solution that appears obvious after someone has shown it to you ;))) Nice. EORS doesn't touch the C if there are no shifts involved.
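The flag behaviour relied on here can be written down as a small C model. This is an editorial sketch under stated assumptions, not code from the thread: the `struct flags` type and the `eors` function are invented for illustration, and the model covers only the case of an unshifted register operand, where the shifter carry-out is simply the old C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag record; V is omitted because EORS does not change it. */
struct flags { unsigned n, z, c; };

/* Hypothetical model of "EORS rd, rn, rm" with no shift applied to rm:
 * N and Z follow the result, C keeps its previous value. */
uint32_t eors(uint32_t rn, uint32_t rm, struct flags *f)
{
    assert(f->c <= 1u);        /* C is a single bit */
    uint32_t rd = rn ^ rm;
    f->n = rd >> 31;           /* N = bit 31 of the result */
    f->z = (rd == 0u);         /* Z = 1 exactly when rn == rm */
    /* f->c is deliberately left untouched */
    return rd;
}
```

When the two pointers are equal, Z is set and the result is 0, so the destination register holds a known zero while the carry from the last ADCS survives.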

- Wilco Dijkstra

Thu, Feb 15, 2007 3:35 PM

No, LDM/STM of one register takes 2 cycles on ARM9 while an LDR takes just 1, so it is best to avoid single-register LDMs on ARM9. Thumb-2 doesn't support single-register LDM/STM, although Thumb-1 supports single-register PUSH/POP. They are useful for code size.

Wilco

- tum_

Thu, Feb 15, 2007 3:46 PM

Thanks. ARM9 is still new to me.

- Francois Grieu

Thu, Feb 15, 2007 7:36 PM

"Wilco Dijkstra" proposed a version that assumes r1 + r3 doesn't overflow.

Actually the assumption is stronger, and quite a bit less safe: it is that the array pointed to by r3 does not REACH 0xFFFFFFFF. ADDS r3, r3, r1 is still a nice trick, if not one that I would dare to promote heavily.
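A worked example of this stronger assumption, as an editorial C sketch (the function name `adds_end_pointer` and the sample addresses are made up for illustration): the model computes what ADDS r3, r3, r1 leaves in r3 and in C for a given base address and length.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of "ADDS r3, r3, r1": stores the 32-bit end pointer
 * in *end and returns the resulting carry flag (0 or 1). */
unsigned adds_end_pointer(uint32_t base, uint32_t len_bytes, uint32_t *end)
{
    assert(len_bytes % 4 == 0);               /* length is a multiple of 4 */
    uint64_t wide = (uint64_t)base + len_bytes;
    *end = (uint32_t)wide;                    /* value left in r3 */
    return (unsigned)(wide >> 32);            /* carry out of bit 31 */
}
```

For a 16-byte array at 0xFFFFFFE0 the end pointer is 0xFFFFFFF0 and C is 0, as the loop needs. For a 16-byte array at 0xFFFFFFF0, whose last byte sits at 0xFFFFFFFF, the end pointer wraps to 0 and C comes out as 1, so the first ADCS would add a spurious carry.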

The real gem is EORS r14, r1, r3 and how it leaves R14 zeroed. I had wrongly concluded that "C Flag = shifter_carry_out" meant that C was destroyed by EORS, and now realize it is not, which opens a whole new universe of possibilities. Thanks a lot.

Also I like PUSH {r14} / POP {pc}. After a considerable hunt in ARM DDI 0100E (2000-06-23), I conclude that "On architecture version 5 and above" (my target), it is a perfectly legitimate idiom to preserve a working register and return, including switching back to Thumb mode as needed. This can be put to excellent use in a lot of code; it looks like, if a terminal subroutine needs to preserve some registers for temporary usage, it pays to make r14 part of the temporary register pool, and to return by restoring the saved r14/LR into r15/PC, leaving r14 indeterminate, which is allowed by the usual calling conventions.

Thanks a lot, Wilco Dijkstra. BTW, it was fun to see someone with your name use B loopstart ;-) Is it a FAQ to ask about the relationship with Edsger W. Dijkstra?

Francois Grieu

- Francois Grieu

Thu, Feb 15, 2007 8:10 PM

I did mean pre-indexed, but failed to read the ! on the PDF.

"Wilco Dijkstra" wrote:
] LDM/STM of one register takes 2 cycles on ARM9 while a
] LDR takes just 1, so it is best to avoid single register
] LDMs on ARM9.

Thanks! I take good note that STR r4, [r13,#-4]! works faster than STMFD SP!,{r4} on ARM922T, and that apparently the modern idiom is POP {r4}.

The 68k also has multiple move instructions to save and restore several registers; and similar to ARM9, when dealing with a single register, multiple move is slower (also: less dense) than a standard move; an additional twist is that the effect on status bits might not be the same.

The ARM922T looks a bit like a 68030 gone RISC, with lots of nice additional twists (Thumb, UMLAL)

Francois Grieu