I'm trying to replace the ARM sequence
        MOV     R0, #0
        ADC     R0, R0, #0
with a single ARM instruction that copies C into R0, clearing bits 1..31. I do not care about the status bits afterwards, and I have no register with a known value.
I gave myself about an hour to think about your riddle and I don't think it's possible. I'd be very glad to hear the opposite. (Still thinking about it, although it's really time to have a nap)
Surely that depends on your view of what 'executed' means in this context. Is there an ARM processor where the unexecuted instruction takes no (extra) time? I tend to think of these conditional instructions as going down the pipeline, having (some of) their results calculated, but then having the writeback inhibited. To be not executed at all, taking up no execution resources (saving power or time), would require knowing the carry flag setting during decode, stalling decode if there is a carry-changing instruction ahead in the pipeline (or carry prediction).
Of course, if 'not executed' means having no architectural side effects, then presumably NOP is also not executed. OK, I'm being a pedant.
Anyway, the best I can do in one instruction is SBC R0,Rn,Rn, but this sets R0 to -1 if no carry and 0 if carry: one low in either case. Perhaps this can be compensated for later in the instruction stream, but we have no info on how the result is to be used.
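To make the "one low" remark concrete, here is a small C model of the two candidates. It is only a sketch: the function names arm_sbc and arm_mov0_adc are made up for illustration, and the model assumes the usual ARM definition SBC Rd = Rn - Op2 - NOT(C), computed modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* Model of ARM "SBC Rd, Rn, Op2": Rd = Rn - Op2 - NOT(C), modulo 2^32.
   (Hypothetical helper; carry must be 0 or 1.) */
static uint32_t arm_sbc(uint32_t rn, uint32_t op2, unsigned carry)
{
    assert(carry <= 1);
    return (uint32_t)(rn - op2 - (1u - carry));
}

/* Model of the original pair: MOV R0, #0 then ADC R0, R0, #0. */
static uint32_t arm_mov0_adc(unsigned carry)
{
    assert(carry <= 1);
    uint32_t r0 = 0;                    /* MOV R0, #0       */
    return (uint32_t)(r0 + 0u + carry); /* ADC R0, R0, #0   */
}
```

With Rn equal to itself, arm_sbc returns 0xFFFFFFFF for C=0 and 0 for C=1, whatever Rn holds, so adding 1 to that result gives the same value as the two-instruction sequence.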
In a few datasheets that I've read it was explicitly stated that 'unexecuted' conditional instructions take 1 cycle.
AFAIK, there's no NOP in ARM, "mov r0,r0" is used instead (being pedantic too).
I've been racking my brains for a way to use the RRX shift somehow, but so far no luck. I agree that if the OP gave us a slightly larger picture we could be more productive with proposals, but I guess he doesn't want to.
Disclaimer: I'm not familiar with all ARM architectures & variants, so some of my statements may be wrong.
- context is this routine (not tested), performing addition of two m-word integers in radix-2^32 representation
- caller will immediately store the returned r0, and I do not want to change calling convention
; perform result = X+Y (expressed as little-endian radix 2^32)
; on entry:
;   r0 points to result
;   r1 and r2 point to sources X and Y
;   r3 length in byte of X, Y and result, a non-negative multiple of 4
; on exit:
;   r0 is 1 or 0 depending on if result overflows or not
        STMFD   SP!,{r4-r5}     ; save temp registers used
        ADDS    r3, r3, #0      ; Z = (r3==0), C=0
        ADD     r3, r3, r1      ; r3 = r3 + r1, r3 points after end of X
        BEQ     adddone         ; -> early abort if Z is set (zero length)
addloop LDR     r4, [r1], #4    ; get 32-bit from X, advance pointer
        LDR     r5, [r2], #4    ; get 32-bit from Y, advance pointer
        ADCS    r4, r4, r5      ; C:r4 = r4+r5+C (the actual arithmetic)
        STR     r4, [r0], #4    ; store 32-bit into result, advance pointer
        TEQ     r1, r3          ; Z = (r1==r3)
        BNE     addloop         ; -> loop until r1 reaches r3
adddone MOV     R0, #0
        ADC     R0, R0, #0      ; r0 = C (could we save one instruction ?)
        LDMIA   SP!,{r4-r5}     ; restore temp registers used
        BX      LR              ; return to caller
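For checking the routine above against known answers, a plain C reference of the same contract may help. This is a sketch, not a translation of the assembly: the name mp_add is made up, and the 64-bit intermediate simply stands in for the C flag that ADCS carries from word to word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* result = x + y, each a little-endian radix-2^32 integer of nbytes bytes.
   nbytes must be a non-negative multiple of 4 (same contract as r3).
   Returns 1 if the sum overflows the result, else 0 (same as r0 on exit). */
static uint32_t mp_add(uint32_t *result, const uint32_t *x,
                       const uint32_t *y, size_t nbytes)
{
    assert(nbytes % 4 == 0);
    uint32_t carry = 0;                     /* plays the role of C */
    for (size_t i = 0; i < nbytes / 4; i++) {
        uint64_t sum = (uint64_t)x[i] + y[i] + carry;  /* like ADCS */
        result[i] = (uint32_t)sum;          /* low 32 bits are stored */
        carry = (uint32_t)(sum >> 32);      /* bit 32 is the next carry */
    }
    return carry;
}
```

A zero length returns 0 without touching memory, matching the early exit through BEQ adddone.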
Optimizing this is actually not critical, but I'm compacting the code to the max as an intellectual exercise to deeply familiarize myself with ARM.
After 20 minutes of thinking: I can't squeeze it any further, let's see what others say. All that I can propose is:
1) use r12 instead of r5. r12 doesn't have to be preserved. This will improve speed & stack usage.
2) swap BEQ and ADD instructions, this will improve speed in case of zero length ;-).
3) When size is the issue consider using Thumb (I understand that your goal is an exercise with ARM, not Thumb).
ps. my previous post still didn't appear in the thread (I use Google Groups), hopefully it will appear later but I'll paste the link here just in case:
Thanks, had missed that one, although it is implied by [link]
Seems like, in a piece of code with only self-references (no linker veneer), and calling no external code, there is a rule, carved in stone nearly as strong as hardware, that register r12 belongs to me.
After this optimization, is it worthwhile, neutral, counterproductive or impossible to reformulate STMFD SP!,{r4} into something like STR r4, [r13,#-4] (did I get this right?); that kind of thing would be wise on a 680x0 (assuming condition codes do not matter).
Yes, thanks. Also, in the context, since there is no TEQ in Thumb, I found no way to loop without interfering with the C bit, this in turn made some extra instructions necessary; but indeed, probably still a bit more compact.
Yes. If you're dealing with software that is conformant with Procedure Call Standards proposed by ARM. There may be software out there that is not conformant.
;) you forgot the '!' at the end (but I had to peep into the manual to correct you, I don't know this stuff by heart). To the best of my (limited) knowledge, the two instructions are identical in effect/speed/size for ARM7 & 9 cores.
I can make it more compact by removing two instructions from outside the loop and adding one inside and changing one slightly. Leave R3 as a count, counting down by 4. Then after the loop R3 is known to be zero so use ADC R0,R3,#0.
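The count-down idea can be modelled in C as below. This is a guess at what is meant, not the poster's code: the instructions named in the comments (a SUB without the S suffix, and TEQ r3, #0 as the loop test) are assumptions chosen because neither disturbs C, and mp_add_countdown is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the variant where r3 stays a byte count and counts down by 4.
   Comments name assumed ARM instructions; c models the C flag. */
static uint32_t mp_add_countdown(uint32_t *result, const uint32_t *x,
                                 const uint32_t *y, uint32_t nbytes)
{
    assert(nbytes % 4 == 0);
    uint32_t r3 = nbytes;       /* remaining byte count */
    uint32_t c = 0;             /* ADDS r3, r3, #0 leaves C = 0 */
    size_t i = 0;

    if (r3 != 0) {              /* BEQ adddone on zero length */
        do {
            uint64_t sum = (uint64_t)x[i] + y[i] + c;   /* ADCS */
            result[i] = (uint32_t)sum;
            c = (uint32_t)(sum >> 32);
            i++;
            r3 -= 4;            /* assumed: SUB r3, r3, #4 (flags untouched) */
        } while (r3 != 0);      /* assumed: TEQ r3, #0 then BNE addloop */
    }
    /* On both paths r3 is now 0, so it can replace the MOV R0, #0. */
    return (uint32_t)(r3 + 0u + c);   /* ADC R0, R3, #0 */
}
```

The point of the model is the last line: because r3 is zero whether the loop ran or not, it serves as the known-zero register that the original question lacked.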
Nice. Just another example of a solution that appears obvious after someone has shown it to you ;))) Nice. EORS doesn't touch the C if there are no shifts involved.
No, LDM/STM of one register takes 2 cycles on ARM9 while a LDR takes just 1, so it is best to avoid single register LDMs on ARM9. Thumb-2 doesn't support single register LDM/STM, although Thumb-1 supports single register PUSH/POP. They are useful for codesize.
Actually the assumption is stronger, and quite a bit less safe: it is that the array pointed to by r3 does not REACH 0xFFFFFFFF. ADDS r3, r3, r1 is still a nice trick, if not one that I would dare to promote heavily.
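The boundary case can be checked with a short C model of ADDS. It is a sketch under the assumption that the trick relies on ADDS r3, r3, r1 producing the end pointer and a clear C flag in one instruction; arm_adds is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Model of ARM "ADDS Rd, Rn, Op2": returns the 32-bit sum and writes the
   carry out of bit 31 (the C flag) to *carry_out. */
static uint32_t arm_adds(uint32_t rn, uint32_t op2, unsigned *carry_out)
{
    assert(carry_out != 0);
    uint64_t wide = (uint64_t)rn + op2;     /* keep bit 32 visible */
    *carry_out = (unsigned)(wide >> 32);
    return (uint32_t)wide;
}
```

For example, a 4-byte array at 0xFFFFFFF8 ends before the top of memory and gives C=0, while an 8-byte array at the same address occupies 0xFFFFFFFF, makes the end pointer wrap to 0, and gives C=1, which would feed a spurious carry into the first ADCS.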
The real gem is EORS r14, r1, r3 and how it leaves R14 zeroed. I had wrongly concluded that "C Flag = shifter_carry_out" meant that C was destroyed by EORS, and now realize it is not, which opens a whole new universe of possibilities. Thanks a lot.
Also I like PUSH {r14} / POP {pc}. After a considerable hunt in ARM DDI 0100E (2000-06-23), I conclude that "On architecture version 5 and above" (my target), it is a perfectly legitimate idiom to preserve a working register and return, including switching back to Thumb mode as needed. This can be put to excellent use in a lot of code; it looks like if a terminal subroutine needs to preserve some registers for temp usage, it pays to make r14 part of the temporary registers pool, and return by restoring the saved r14/LR into r15/PC, leaving r14 indeterminate, which is allowed by the usual calling conventions.
Thanks a lot, Wilco Dijkstra. BTW that was fun to see someone with your name use B loopstart ;-) Is it a FAQ to ask the relationship with Edsger W. Dijkstra?
I did mean pre-indexed, but failed to read the ! on the PDF.
"Wilco Dijkstra" wrote:
] LDM/STM of one register takes 2 cycles on ARM9 while a
] LDR takes just 1, so it is best to avoid single register
] LDMs on ARM9.
Thanks! I take good notice that STR r4, [r13,#-4]! works faster than STMFD SP!,{r4} on ARM922T; and that apparently the modern idiom is POP {r4}
The 68k also has multiple move instructions to save and restore several registers; and similar to ARM9, when dealing with a single register, multiple move is slower (also: less dense) than a standard move; an additional twist is that the effect on status bits might not be the same.
The ARM922T looks a bit like a 68030 gone RISC, with lots of nice additional twists (Thumb, UMLAL)