Reordering of functions

Hi,

I have a question about the influence of compiler optimizations that reorder functions on system performance.

Assume a modern DSP with all state-of-the-art features like prefetching, branch prediction and a superscalar pipeline. Further assume that all caches are disabled. Will the program runtime change when just the order of functions is changed (without any other code transformation)?

I'm of the opinion that reordering functions should have little influence on program execution, apart perhaps from some prefetch effects, but these should be marginal. Of course, with caches the situation would look different.

How do you see that? Do you see any other side-effects that might influence the program?

Regards, Tim

Reply to
Tim Frink

There can be many small caches or buffers between the program memory and the CPU core - it all depends on the type of memory and the type of CPU you are using. For example, if you have 32-bit wide memory and 16-bit wide instructions, the alignment of code to 32-bit boundaries may have an effect. If you are using dynamic memory, you'll have paging effects which will slow down access if the memory controller has to switch pages. The memory controller might also fetch 16-byte lines (for example) even if you are not caching the results.
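
To make the alignment point concrete, here is a minimal sketch in C (GCC-style attribute syntax assumed, function name made up) that pins a function's entry to a 32-bit boundary so a 16-bit-instruction core never starts fetching it from the upper half of a memory word:

    /* Force the entry point onto a 4-byte boundary; without this the
       linker may start the function mid-word on a 16-bit ISA. */
    void __attribute__((aligned(4))) scale_buffer(short *buf, int n)
    {
        for (int i = 0; i < n; i++)
            buf[i] >>= 1;   /* dummy work - only the entry alignment matters */
    }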

You will also find that some CPUs have smaller instructions for local jumps than for far jumps - if the compiler/linker is smart enough, the jumps may be faster for local code than for far code.

But for most CPUs, if you have disabled caches and are using internal memory (or at least memory with consistent access times), there should be no difference in speed when you reorder the functions.
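
If you want to experiment with reordering deliberately rather than leaving it to the toolchain, the order can be pinned in the linker. A sketch of a GNU ld fragment (this assumes compiling with -ffunction-sections; the section and region names are made up):

    SECTIONS
    {
      .text :
      {
        *(.text.hot_loop)       /* placed first in memory */
        *(.text.isr_handler)    /* then this one */
        *(.text .text.*)        /* everything else, in default order */
      } > ROM                   /* ROM is an assumed MEMORY region */
    }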

Reply to
David Brown

Of course! All of the microarchitectural features you mention critically rely on code layout. Instruction fetch usually reads more than one instruction, so branch target alignment matters. A branch predictor works like a cache, so it has the usual conflicts due to branch alignment and aliasing. Even with caches disabled, instruction and data fetches are done using wide bursts. And unless you disable the TLB as well, you get different TLB misses.

I've seen a factor of 2 difference in execution time of small loops on a superscalar CPU while running from tightly coupled SRAM (not cached, nor translated). Using a higher associativity (cache and branch predictor), wider fetch and larger instruction buffers (between fetch and decode) tends to reduce these effects significantly.
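
If you want to reproduce this kind of measurement, something along these lines is enough. read_ccount() is a placeholder for whatever cycle counter your part exposes (hypothetical name); rebuild with, say, -falign-loops=4 versus -falign-loops=16 between runs and compare:

    extern unsigned read_ccount(void);   /* platform cycle counter (assumed) */

    static int sum(const volatile int *p, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)      /* the loop whose alignment varies */
            s += p[i];
        return s;
    }

    unsigned time_sum(const volatile int *p, int n, int *out)
    {
        unsigned t0 = read_ccount();
        *out = sum(p, n);                /* keep the result live */
        unsigned t1 = read_ccount();
        return t1 - t0;                  /* cycles, ignoring counter wrap */
    }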

The really interesting question is whether it is feasible to optimise code alignment in a compiler. The problem is that you often don't know the final alignment of the code. But even if you do, you can't easily deduce which branches/cache lines will actually conflict at runtime. Even profile feedback does not help here...

Wilco

Reply to
Wilco Dijkstra


The performance may change in a positive or negative way if the compiler generates different code when you reorder the functions. So you never know what is going to happen.

Reply to
badal_akr

Let's assume that all functions do the same kind of work in the same style; then there should be no difference from reordering. But putting functions with different scopes together SHOULD affect the application's throughput. TLB misses, far jumps and fetches of large arrays are noticeable examples.

ali

Reply to
Ali

I agree; that's what I meant by the small impact of prefetching, i.e. it might happen that the number of fetched instructions at a branch target differs.

OK, that's true. I assume you mean the branch target buffer, which stores computed branch targets to avoid redundant recomputation.


What do you mean by wide bursts?

But how do compilers handle that? There are alignment optimizations. Do they statically align all frequently accessed branch targets to an appropriate address?

Reply to
Tim Frink

This can be answered by the programmer - in this case, the author of the compiler and/or of the library functions you will have linked after writing your lines of HLL code. Or you can do some reverse engineering on the compiler you use and get the answer yourself. I suspect you will get as many different answers as there are compiler writers out there. A much easier way out is, of course, to program the section of interest yourself.

Dimiter

------------------------------------------------------
Dimiter Popoff
Transgalactic Instruments
------------------------------------------------------



Reply to
Didi


I wouldn't call it a small impact - as badal pointed out, it also affects instruction pairing. This can have a major effect if the CPU pairs instructions based on their position in a fetch slot (a bad idea, but common in older in-order cores).

Correct. A BTB is typically indexed by the fetch address, so the number of branches and their position in the fetch block matter. BTBs usually have low associativity and are relatively small, so the aliasing effects can be large (I've seen a 5-10% performance difference on Dhrystone).
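
To make the aliasing concrete, here is a toy model (all parameters invented: a direct-mapped 64-entry BTB indexed by the 16-byte fetch-block address) in which two branches 1 KB apart land on the same entry:

    #include <stdio.h>

    enum { BTB_ENTRIES = 64, FETCH_BLOCK = 16 };

    static unsigned btb_index(unsigned pc)
    {
        return (pc / FETCH_BLOCK) % BTB_ENTRIES;  /* low bits of block number */
    }

    int main(void)
    {
        /* 64 entries * 16 bytes = 1024, so branches 1 KB apart collide: */
        printf("%u %u\n", btb_index(0x1004), btb_index(0x1404));
        /* Both indices are 0 - the branches evict each other's prediction.
           Moving either function by one fetch block removes the conflict. */
        return 0;
    }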


Modern DDR DRAMs are always accessed in burst mode, so even with the cache disabled, a memory access will read 32 or 64 bytes. Usually the fill buffers act as a mini cache, so further accesses are fast. Executing a loop inside a single fill buffer rather than spread over two bursts gives a huge speedup.

Compilers can align code to fetch boundaries and do instruction scheduling/pairing based on that. Some compilers align function entries and critical loops to a cache line. However, over-aligning code slows things down because the code gets larger and large numbers of NOPs have to be executed. So my advice to hardware designers has always been to design microarchitectures that minimise the effects of fetch and cache alignment.
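
For the concrete knobs: in GCC-style toolchains the per-function form looks like the sketch below (the 16-byte value is illustrative, matched to an assumed fetch width; the whole-program equivalents are -falign-functions, -falign-loops and -falign-jumps):

    /* Align this function's entry to an assumed 16-byte fetch boundary.
       C has no standard per-loop alignment attribute; loop alignment is
       normally left to -falign-loops at build time. */
    __attribute__((aligned(16)))
    void filter_kernel(short *x, short *y, int n)
    {
        for (int i = 1; i < n; i++)
            y[i] = (short)((x[i] + x[i - 1]) >> 1);   /* hot loop */
    }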

Wilco

Reply to
Wilco Dijkstra

...and the errata sheet of 100 pages, with half of the bugs due to incorrect operation of the pipeline in certain cases.

Not much, unless the functions are so short that their length is comparable to the depth of the pipeline and their execution time is comparable to T_RAS.

It can significantly affect the interrupt latency though. Interrupt handlers should be aligned to the prefetch size.
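
As a sketch (attribute spellings vary by target; GCC-style syntax shown, and the handler name and 32-byte prefetch size are assumptions):

    /* Pin the handler to an assumed 32-byte prefetch block so the first
       fetch after interrupt entry brings in nothing but handler code. */
    void __attribute__((interrupt, aligned(32))) timer_isr(void)
    {
        /* acknowledge the timer and do the minimal work here */
    }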

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant


Reply to
Vladimir Vassilevsky
