Question about cycle count in ARM Cortex-A8?

Question

Hi,

I am learning the ARM Cortex-A8 CPU. In order to write optimized assembly code, I want to understand the instruction scheduling. The A8 TRM gives the following Table 16-4. I don't know how to use the cycle count, or its relationship with the source and destination registers.

If cycle count is independent, the question is how to use it in scheduling.

If the cycle count is related to the source and destination registers, I cannot derive the cycle number from the pipeline stages listed below for the source and destination registers.

Could you explain it to me, experts?

Thanks,

DDI0344I_cortex-a8_r3p1_trm.pdf: Table 16-4 Multiply instructions

Multiply type        Cycles  Source1  Source2  Source3      Source4      Result1  Result2
Normal: MUL          2       Rm:E1    Rs:E1    [Rd:E3]      {Rn:E4}a     Rd:E5    -
Long: SMULL, UMULL   3       Rm:E1    Rs:E1    {[RdLo:E3]}  {[RdHi:E3]}  RdLo:E5  RdHi:E5
Long: SMLAL, UMLAL   3       Rm:E1    Rs:E1    {[RdLo:E2]}  {[RdHi:E1]}  RdLo:E5  RdHi:E5
Halfword: SMLAxy,    2       Rm:E1    Rs:E1    [Rd:E2]      {Rn:E4}a     Rd:E5    -

rxjwg98 · Accepted Answer

Excuse me. I forgot to add the relevant information in my last post. The cycle count has this definition:

the minimum number of cycles required for each instruction

It looks to me as if this is the time the execution unit needs. Then why can't I derive the cycle number from the pipeline stages of the source and destination registers? What is your opinion?

Thanks,

The tables in this section provide information to determine the best-case instruction scheduling for a sequence of instructions. The information includes:

- when source registers are required
- when destination registers are available
- which register, such as Rn or Rm, is meant for each source or destination...
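One way to read the stage annotations in Table 16-4 is as a spacing rule between a producer and a dependent consumer. The C sketch below is only an illustration under a stated assumption, which is not quoted from the TRM: a result listed at stage En is available at the end of that stage, and a source listed at stage Em is needed at the start of that stage. The helper name issue_distance is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper, not taken from the TRM.
 * Assumption: a result listed as "E<r>" is available at the end of
 * stage r, and a source listed as "E<s>" is needed at the start of
 * stage s. Under that model a dependent instruction must issue at
 * least (r - s + 1) cycles after its producer to avoid a stall. */
int issue_distance(int result_stage, int source_stage)
{
    assert(result_stage >= 1 && source_stage >= 1);
    int d = result_stage - source_stage + 1;
    /* Simplification: a dependent instruction is never modelled as
     * issuing in the same cycle as its producer. */
    return d < 1 ? 1 : d;
}
```

Under this model, a MUL result (Rd:E5) feeding the Rm:E1 source of another multiply gives issue_distance(5, 1) == 5, while feeding an accumulate source listed at E4 gives issue_distance(5, 4) == 2. The Cycles column is a separate figure: it is the minimum issue cost of the instruction itself, which is why it cannot be derived from the stage numbers alone.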

David Brown · Answer

My first question here is why are you doing this? There is a great deal more involved in performance than cycle counts for instructions - pipelines, instruction scheduling, caches, prefetches, write buffers, etc. Your most important tools here are not the manual, but your system itself - measure the real-world speed for the particular algorithm you want to use.

And why are you writing assembly here? Have you tried using a reasonable compiler, with different flag settings and different details in the source code, and found that the code is too slow for your needs?

Of course, if you are just doing this for learning or for fun, it's a different matter - but if you are working on a real application then you are starting from the wrong end.

dp · Answer

Cycle counting on pipelined processors is not very practical. I don't know ARM; I use Power processors, and they specify "latencies". But you will find that things depend on more than such latencies, e.g. data dependencies, where you need the result of one operation to initiate the next, as in multiply-add. Even if the achievable throughput is 1 instruction/cycle, accumulating into the same register on a 6-stage pipeline will cost 6 cycles per multiply-add, and you will have to figure out how to program around that.

Basically if you write using assembler (with the crippled RISC mnemonics this is not such a good idea but you don't have many choices, not for ARM at least) you will have to live with the cycle count as it is, you can't influence that; what you can influence is the order of the opcodes, like spreading opcodes such that needed results of previous operations are used as late as practical etc.
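The point about spreading dependent operations can be sketched in C rather than assembly. The example below is a minimal illustration, not anyone's production code: dot_single chains every multiply-add through one accumulator, while dot_split4 uses four independent accumulators so that consecutive multiply-adds do not wait on each other. Both function names are hypothetical, and whether the second version is actually faster on a given core has to be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each multiply-add depends on the previous result. */
int32_t dot_single(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Four independent accumulators: the result of one multiply-add is not
 * needed again until four operations later. Assumes the sums stay
 * within int32_t range. */
int32_t dot_split4(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
        acc2 += (int32_t)a[i + 2] * b[i + 2];
        acc3 += (int32_t)a[i + 3] * b[i + 3];
    }
    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; i++)
        acc0 += (int32_t)a[i] * b[i];

    return acc0 + acc1 + acc2 + acc3;
}
```

A compiler may or may not do this split on its own, so comparing the generated assembly and timing both versions on the target is the only reliable check.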

Dimiter

Dimiter Popoff, TGI

Tim Wescott · Answer

Just to add to what David is saying: hand-writing assembly code for high speed used to make a lot of sense, in the days when compilers did not optimize very well.

That's not the case any more. Unless you have some oddball corner-case, such as a compiler that does not know how to efficiently use some instruction or set of instructions on the processor, there's no point in doing things in assembly.

The last time I wrote assembly and got significant code speed improvement was for a TMS320F2812 DSP processor, because Code Composter couldn't seem to cough up a one-cycle-per-multiply loop using hardware looping and the MAC instruction. That was well over 10 years ago, and the only reason I did the code writing was because I wanted to do something that didn't quite fit with the library code that TI provided.

I may do it again, if I find that the GNU compiler can't figure out how to efficiently use the MAC instructions in the Cortex-M4 core -- time will tell, but I'm figuring I have even odds of giving up on the compiler and doing things in assembly (and you can bet that I'll ask here to see if there's some magic, if it doesn't work for me right off).

Tauno Voipio · Answer

Rest assured, at least when optimizing for size (-Os), the GNU compiler (v 4.7.x) does use the MAC instructions. There is little to be gained (and plenty to lose) with hand coding for an ARM.

The current RISC-based cores are a PITA to program in assembly language, and they are not intended for it, either. It is up to the compiler writers to handle the intricacies of the instruction set.

Despite nearly 50 years of assembler programming, I have left the assembly code to GCC, with a few exceptions which can be handled with the embedded assembly code handling of GCC.
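As an illustration of that embedded assembly handling, here is a minimal sketch of a single-instruction wrapper in GCC extended asm. The wrapper name mac32 is hypothetical. It assumes a GCC-compatible compiler and a 32-bit ARM target where MLA is available (ARM or Thumb-2 state), and it falls back to plain C everywhere else so that the same source still builds on a host machine.

```c
#include <stdint.h>

/* Hypothetical wrapper: returns acc + a * b (wrapping 32-bit arithmetic). */
static inline int32_t mac32(int32_t acc, int32_t a, int32_t b)
{
#if defined(__GNUC__) && defined(__arm__) && \
    (!defined(__thumb__) || defined(__thumb2__))
    int32_t result;
    /* MLA Rd, Rn, Rm, Ra computes Rd = Rn * Rm + Ra.
     * The asm is not volatile and names only register operands, so the
     * compiler stays free to allocate registers and to schedule or
     * remove it like any other pure computation. */
    __asm__("mla %0, %1, %2, %3"
            : "=r"(result)
            : "r"(a), "r"(b), "r"(acc));
    return result;
#else
    /* Portable fallback with the same wrapping behaviour. */
    return (int32_t)((uint32_t)acc + (uint32_t)a * (uint32_t)b);
#endif
}
```

For an instruction as common as MLA the compiler will normally pick it without help, so a wrapper like this only earns its place for instructions the compiler does not generate on its own.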

upsidedown · Answer

I once encountered a web page about implementing the memcpy() with Pentium processors (apparently assuming virtual memory page and/or cache line alignment). Apparently quite high speeds could be achieved by first loading as much as possible into the floating point/MMS registers available, before storing the data to the destination.

One other trick was "touching" every 32-byte cache line, hence loading the source data into cache, and then performing the actual fast copy.

Unfortunately, I do not remember the link to that page.

Anyway, for fast data transfers you really have to consider data alignment, dynamic memory, cache lines, processor pipelines, instruction reordering etc.

This is far more demanding than trying to optimize how many PDP-11 integer instructions you could squeeze between PDP-11 floating point instructions :-)
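The two tricks described above (load a whole block before storing any of it, and touch the next cache line early) can be sketched in portable C. This is only an illustration of the idea, not the routine from the web page mentioned: the function name copy_block_first is hypothetical, the 32-byte line size is taken from the figure quoted above and will differ between cores, and the prefetch uses the GCC/Clang builtin __builtin_prefetch, which is a hint only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u  /* assumed cache-line size; core specific */

/* Hypothetical routine: copies n bytes between non-overlapping buffers. */
void copy_block_first(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= LINE_BYTES) {
        uint64_t r0, r1, r2, r3;
#if defined(__GNUC__)
        /* "Touch" the following line so it is on its way into the
         * cache while this one is being copied. */
        __builtin_prefetch(s + LINE_BYTES);
#endif
        /* Load the whole line into locals first... */
        memcpy(&r0, s,      8);
        memcpy(&r1, s + 8,  8);
        memcpy(&r2, s + 16, 8);
        memcpy(&r3, s + 24, 8);
        /* ...then store it, so reads and writes are not interleaved. */
        memcpy(d,      &r0, 8);
        memcpy(d + 8,  &r1, 8);
        memcpy(d + 16, &r2, 8);
        memcpy(d + 24, &r3, 8);

        s += LINE_BYTES;
        d += LINE_BYTES;
        n -= LINE_BYTES;
    }
    memcpy(d, s, n);  /* remaining bytes, fewer than one line */
}
```

The fixed-size memcpy calls are there to keep the code free of alignment and aliasing problems; a compiler typically turns them into plain loads and stores. Whether this beats the library memcpy on a given system is something only a measurement can tell.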

David Brown · Answer

I have only 30 years of assembly experience, but I have the same attitude. The gcc inline assembly is so well integrated with the compiler that if you need to use it for a particular odd instruction, it can happily optimise the rest of the code around it.

I still think it is very important to be able to /understand/ the assembly generated by the compiler, although it can be hard with complicated RISC cpus with lots of registers. But sometimes for critical code it is good to look closely at the assembly to see what is happening, and it can affect the way you write the C code (especially for less powerful processors).

dp · Answer

Don't know about MMS (is that x86?) but on power (603e based flavour at least) this is definitely the case. Some years (10?) ago when I was optimizing the window scroll code for DPS this was the fastest of all (and I did try them all I think). Read 32 64-bit FP registers, then write them.

That did help, too; I am not sure whether I left it in because the help was not that huge and the "touch" buffer is core specific so I may have opted out of using it but it did help all right. I have used that touch elsewhere on that core since though, was pretty useful for DSP-ing - and it mattered there as the code could end up using 75+% of the cpu resources so every relief was welcome.

Alignment matters a lot indeed, one has to do the bulk of the transfer as aligned as it can get and start/finish it up by handling the few misaligned bytes.
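The head/bulk/tail structure described here can be sketched in C as follows. It is a minimal illustration with a hypothetical function name, copy_aligned_bulk, and it aligns only the destination to 8 bytes; the source may still be misaligned, which is why the word load goes through memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical routine: copies n bytes between non-overlapping buffers. */
void copy_aligned_bulk(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: single bytes until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d % sizeof(uint64_t)) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Bulk: 8 bytes at a time into the aligned destination. */
    while (n >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);  /* source may still be misaligned */
        memcpy(d, &w, sizeof w);
        s += sizeof w;
        d += sizeof w;
        n -= sizeof w;
    }

    /* Tail: the few bytes left over. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

When source and destination share the same misalignment, the bulk loop ends up aligned on both sides, which is the best case; otherwise one side stays misaligned and the cost depends on how the core handles unaligned accesses.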

Dimiter

Dimiter Popoff, TGI

Tauno Voipio · Answer

That is why I always make the compiler (or actually the toolkit) generate an assembly listing of a compilation. Just add -Wa, to the GCC command line.

-TV

David Brown · Answer

I have the same thing in every Makefile (except that I put the lst files in a different directory).  It is highly recommended.

Tim Wescott · Answer

I do that too. It regularly saves my ass, sometimes by forcing me to realize that yes, the compiler just did exactly what I told it to.

Tim Wescott
Wescott Design Services
