ARM926 caching question

Stargazer · 2010-09-05T11:44:55+00:00

Greetings,this question is for ARM experts, in particular it's about ARM926 core(which is used in TI's DM6467 DaVinci processor).I want to use cache for speeding up processing on video buffer of sizeYCbCr 4:2:0 1080P (1920x1088x1.5). Normally the buffer is not cached,since it is shared between ARM code, C64 DSP core and with anadditional PCI master. Data flow is the follows: external PCI masterfills in raw uncompressed frame -> we add several processings (layoutbuilding, background, some graphics and OSD test belnding), then thewhole resulting frame is passed to DSP for compression.ARM core runs MontaVista Linux 4.0.1 with kernel 2.16.18 (MV-patchedfrom MontaVista 5.0 distribution).I'd like to enable caching on ARM for this buffer, process it inchunks of 4K (D-cache on DM6467's ARM core is 8K 4-way associative, soI want to leave at least two ways for caching of other program's dataand stack) and then call a kernel module, which will write-back andinvalidate each 4K chunk. So by the end of wbinvd'ing the last chunkthe whole buffer will be consistent in external RAM ready for DSPprocessing (obviously, before starting such a processing, the whole D-cache will have to be invalidated without write-back).Sounds good, but I see problems with doing so, according to ARM926 TRM(or may be I just misunderstand).ARM caches data in 32-byte lines tagged with Modified Virtual Address.MVA is made by appending a special field FCSE PID in CP15 reg. c13 toprogram's virtual address, if that address is below 32M; if theaddress is above 32M, no appending takes place and VA = MVA (that'swhat happens in kernel mode). User-mode programs are mapped to lower32M VA and hence use that FCSE PID. I tried to use user-mode pointersin kernel mode, and got inconsistent data in user-mode buffers;apparently, the kernel changes FCSE ID on system call entry or justdisables it.Now the TRM says: "FCSE translation is not applied for addresses usedfor entry based cache or TLBmaintenance operations. For these...

S

Stargazer 15 years ago

Hi Marcus,

PID replaces upper sever bit for addresses that by definition have them 0s (addresses >= 32M do not pass the translation by TRM's definition, again). I believe that this can be called "7 bits of PID are appended to 25 bits of VA" without losing much of a sense.

Are you sure? That was my first thought when I considered how I would solve it, should I design the caching structure. But the quote that I posted above doesn't say that this is a MVA suitable for the specific PID which was cached. The problem, and the reason for me posting this question here is that regarding cache not many experiments can have a visible showcase.

Thanks, Daniel

Vote

M

Marcus Harnisch 15 years ago

I just wanted to make sure that we're on the same page. "Appending" could be interpreted as creating a virtual address space which is larger than 32 bit.

An MVA is always associated to a specific PID as reflected by its upper bits. This section describes the parameter format for cache ops:

formatting link

Regards

-- Marcus Harnisch Senior Consultant

DOULOS - Developing Design Know-how VHDL * SystemC * Verilog * SystemVerilog * e * PSL * Perl * Tcl/Tk ARM Approved Training Centre (ATC)

Doulos Ltd., Central European Office, Garbsener Landstr. 10, 30419 Hannover Tel: +49 (0)511 2771340 mailto: snipped-for-privacy@doulos.com Fax: +49 (0)511 2771349 Web:

formatting link

This e-mail and any attachments are confidential and Doulos Ltd. reserves all rights of privilege in respect thereof. It is intended for the use of the addressee only. If you are not the intended recipient please delete it from your system, any use, disclosure, or copying of this document is unauthorised. The contents of this message may contain personal views which are not the views of Doulos Ltd., unless specifically stated.

Vote

S

Stargazer 15 years ago

"Faster"/"slower" is a misnomer here, the need to look up TLB results in longer CPU pipeline and more latency; but that would only be felt during cache miss handling (which will incur 1-2 sycles longer a penalty). During "normal" (cache-hit) operation, which is wanted (I'm afraid to write "expected" here) to occur about 90% of the time, there will be no difference. PIPT-caching x86s do memory access instructions in 1 cycle (cache hit assumed) since 486, when caching was first introduced to the architecture.

This latency advantage is approximately the only one of VA-based caching (of any sort - be it virtually indexed or tagged). Now begin the disadvantages:

1) need to flush caches on context switch. This may be solved with VSIDs (PIDs), like the ARM926 in discussion does, but it limits both number of address spaces that may simultaneously exist and the address range which constitutes address space (limits for ARM926 are 128 address spaces and 32M of range per address space) 2) multiple cache entries for physical memory that has multiple virtual mappings, without any consistency between them. I can think of several awkward methos of ensuring consistency, but nothing I could recommend for a performing processor design. ARM926 TRM doesn't mention that this issue is solved at all 3) shared memory (same physical address mapped at possibly the same virtual address in different/all address spaces) is not already shared: it gets a separate cache entry for each address space. This is actually a special case of problem (2), but it kills ideas of both caching and shared memory on its own 4) when doing cache write-back you would need to look-up TLB for physical address, losing the only advantage over physical address- based caching. This may be solved by double-tagging (keeping translated physical along with VA in tag memory), which is indeed what ARM926 does. Resulting are additional memory cells (could be used for implementing just more cache)

note that ASIDs requre an additional read from register before you can index cache, thrashing the only advantage of virtual caches over physical caches.

Unfortunately, this is especially wrong for my OP. I have a task which has some very specific requirements:

1) it must be done on DM6467 (ARM926 implementation by TI), within its architecture and caching, regardless of what I personally think about ARM926's caching 2) it must be implemented within MontaVista Linux kernel 2.6.18, and no newer kernel (TI simply don't have complete BSP+drivers set for DM6467 for any newer kernel) 3) it must fit within the current design, in particular it may not use EDMA channels, because all of those are taken (and hence I can't "ping- pong" to any regularly-cached memory or DTCM) 4) it is a performance optimization, performance killers like copy_from_user() / copy_to_user() are strictly prohibited! :-)

Daniel

Vote

S

Stargazer 15 years ago

The ARM926 uses virtual indexing and virtual tagging, based on modified virtual addresses. If VA < 32M, it is prepended with 7-bit PID, and the combined 32-bit value is called MVA and used for both cache indexing and tagging. If VA >=3D 32M, then VA =3D MVA. Additionally, it stores translated physical addresses of each cached line in order to perform write-backs without additional TLB look-up.

[...]

Correct.

r

Here you are misleading: latency !=3D speed. Higher latency is slower only when it is faced - and that's only on cache misses. During regular "happy" cache-hit operation I think that all caching devices are designed to withstand overall cache-accessing instruction execution of 1 cycle.

Don't know what you mean here, all caching architectures that I dealt with (granted, that's not all architectures existing) use special bits to specify validity/dirtiness of a cache line.

Right, but again, such latency results in longer CPU instruction pipeline. It will be felt only during cache miss.

I-caches are generally less interesting for performance optimizations

- they are tightly coupled with instruction pre-fetching and decoding and there's usually little that can be done to improve it. The only cases I would remember of them are code modifications (like copying code to fixed exception handler addresses), address space replacements (like exec()), software breakpoints and different self-modifying code tricks. Nothing that's really relevant to running performance.

My OP question was about D-caches.

I put my own thoughts about P/V caching advantages/disadvantages in another post in this thread. I identify that latency issue is the only advantage of V-cache. Summary of disadvantages:

1) Need to flush cache on address space change. This may be solved with VSID/PID appended to every address space. I think that following my impression from ARM926 caching structure I incorrectly indentified the V-cache limitation there - nothing prevents an architecture from defining 32-bit address space appending to 32-bit VSID, resulting in just 64-bit MVA tag and no such limitations

However, it presents a problem for deliberate flushing the cache cased on virtual addresses: you may not know at a given time which PIDs were cached, so you would either need to loop via all possible PIDs to flush by address or just flush all cache contents addressing with index/way.

2) Alias caching and as a special case shared memory (IPC) caching. I believe that drawbacks of possible solutions to this would leave any kind of mmap()ed memory non-cached 3) Need to look-up TLB on cache write-backs, or hold PA per cached line (instead of using the same memory for implementing more cache) 4) If PIDs are used, PID register read adds to V-cache latency, decreasing/removing the difference in latencies between V-cache and P- cache.

Daniel

Vote

D

David Brown 15 years ago

You are correct that latency is not the only factor affecting speed (very often there is a tradeoff between latency and throughput), and you are certainly correct that delays are not a problem unless they are actually in the path of the data. For example, extra latency in cache writeback to main ram is seldom visible.

Again, I don't know about the ARM but it is far from correct to say that caching devices are single cycle. On GHz+ processors a L1 cache hit will typically take several cycles, and L2/L3 accesses can take dozens of cycles on some devices.

The issue here, however, is with latency - translating a virtual address into a physical address takes time. Even if all the required TLB entries, page maps, etc., are in registers or fast ram buffers, it still takes time. Whether it can be done in pure combinational logic or takes clock cycles depends on the speed of the device and the complexity of the translation. Typically on a very fast clock device it will take a number of cycles, making it a very significant cause of delays. For slower clock devices it may be possible to do it in combinational logic and avoid any clock cycle delays, but the cost is paid in silicon complexity, size, and power. If you can do the cache indexing with the virtual memory addresses, you don't need the physical address until later during tag matching, and thus have a lot more leeway.

Perhaps using the word "valid" here was misleading, as it is also used for status bits on the cache line. I meant it is easy to see whether you get a hit or not on the cache entry by comparing the physical address.

No, it's not just a longer pipeline - it's a critical part of the pipeline and may cause stalls. The latency occurs on every cache access, hit or not. The biggest cause of pipeline stalls is waiting for data (or instructions) from memory - unless you can fill the processor with calculations, that extra cycle is a cycle lost on a very regular basis. And longer pipelines means more overhead for jumps in the instruction stream and pipeline flushes.

And for smaller/slower devices, the extra pipeline step is extra complexity and costs.

As you can see, there is much to gain by indexing with the virtual address. That's why that is the caching scheme used on /many/ processors, especially on smaller devices, embedded devices with simple virtual memory systems, and for the L0 or L1 caches on faster devices.

The overheads and complexities of virtual address indexing become more of an issue as the cache gets bigger. That's why physical address indexing is the normal choice for larger caches.

If it were as clear-cut as you (and a couple of others in this thread) are suggesting, then you simply would not see virtual address indexed caches in practice. While we might not agree with all the choices and tradeoffs made by the designers of any given CPU, especially as they may not be optimal for /our/ use, it's a fair assumption that the designers know more about cache design than you or I, and they have thought about the costs and benefits of different designs before choosing the compromise that best fits their processor's target audience. I find it hard to accept that virtual address indexing is a "flawed design" and that the costs of virtual to physical translation simply "disappears in the pipeline" - the theory as I understand it, along with the practical reality of implementations in real-world processors argues against it.

Instructions are extremely important for performance, and thus need to be optimised. One common optimisation is to use virtual address indexing, especially for the parts nearest the processor, to avoid extra latency. By the time the instructions reach the decoded pre-fetch buffers or branch buffers, you are certainly in the virtual address domain.

OK. I must be honest in that I haven't paid too much attention to the original question, since it is specific to the ARM chip and I'm not familiar with it.

Cache aliasing is definitely an issue with virtually indexed caches, and must be solved with either clever hardware or clever software.

I am by no means arguing that virtual address indexing is a /better/ caching scheme than physical address indexing, just that it is better in /some/ ways. There is always a choice to be made when designing a cache, and pros and cons of both methods.

Extra latency during write-backs is seldom of concern - the processor doesn't have to wait for writes to finish before continuing. What /is/ a concern is that the write-backs go to the correct place even if the virtual memory mapping has been changed, so caching the physical address is often a good idea. In practice, most virtually indexed caches use physical addresses for tags anyway.

No, reads of a register like that are basically free. There is no need for any sort of lookup to access it - it's handled purely in combinatorial logic and costs no more than a couple of simple buffers at most.

Vote

D

Didi 15 years ago

Yeah, it is not flawed. Like a seven wheeled car is not flawed compared to a four-wheeled one - it does move, you know.

Dimiter

Vote

D

Didi 15 years ago

Ah, so they have been working to fix it. I remember reading something years ago about ARM not being pipelined or sort of, how does this add up with the above?

Dimiter

Vote

V

Vladimir Vassilevsky 15 years ago

ARM derivatives made by Intel always had PIPT cache. This makes me think that the absurd organization of the native ARM caches is caused by the intellectual property obstacles set by Intel.

ARM cores are pipelined; the pipeline structure is quite different from core to core.

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant

formatting link

Vote

D

Didi 15 years ago

But they are doing it now the correct way, so it sounds more like this has been just a design blunder.

So they have just being learning the trade and are getting better. Well, it is always good to have some usable diversity. Although they will remain stuck with their 16- GP registers, which are fairly scarce for a RISC architecture (but one can live with that).

Dimiter

Vote

D

David Brown 15 years ago

No, it's more like a two-wheeled vehicle. It's flawed if you are looking for a car, but great if a motorbike fits the job.

Virtual address indexed caches are /used/ in practice, because in certain cases they do a better job than the more common physical address indexed caches.

I'm currently working with a processor using the e200z6 PowerPC core. It has a virtual address indexed cache. It's a fairly new device, evolved from through many previous devices in the family - if the core would have been improved by using a physically indexed cache, I'm fairly sure Freescale would have figured that out with earlier generation parts.

Once you can give me a concrete example of a seven wheeled car, I'll grant that you have a valid comparison.

Vote

S

Stargazer 15 years ago

[...]

AFAIK, L1 caches were almost always designed to work on CPU speed. That is, they are intended to support instructions throughput of 1 load per cycle (cache hit assumed), even if all instructions were loads. Note that this is again latency VS speed discussion, it's not that cache access is 1 cycle, but rather that cache access is intended to fit within the CPU's pipeline so that overall throughput is 1 instruction per cycle.

[one more latency VS speed occasion snipped]

r

If CPU waits for data on cache hits then it's designed poorly. In such a design it would already be better to make cache directly addressable fast RAM. Intel's x86 achieved 1-cycle throughput on cache hits since they first introduced cache with 486 (1991). And everybody who is competitive today must do the same... the key point is that modern CPUs have long pipelines - I think 10-12 cycles wouldn't be too much, and that loaded data are not needed until they can be read from cache for sure. When a cache miss occurs, it breaks the whole sequence, and longer pipelines will incur higher cache miss penalties.

Well, I initially didn't want this thread to take a directlion of V- cache VS P-cache discussion, although I suspected it would inevitably get there in this or that form :-) I believe that in real world if things exist, there is reason for that. V-caches introduce many problematic use cases, but they allow decently performing CPU designs simpler (in logic and silicon), cheaper and "colder". So as long as they can be used, we will see both caching approaches.

Vote

D

Didi 15 years ago

ch

-

t

Uhm, so you have yet to understand how the cache on your device works. They call it "virtually indexed physically tagged" and in the manual explain explicitly that there is no way to get synonyms, aliases and other sort of nonsense we were talking about here. Not a valid example.

Dimiter

Vote

D

David Brown 15 years ago

The manual does say that "The cache is physically addressed, thus eliminating any problems associated with potential cache synonyms due to effective address aliasing". Actually, this is only part of the reason you don't get synonyms - the same physical cache line in two different entries of the cache indexed by two different virtual addresses. The main reason is that the smallest unit of virtual address mapping on the device is 4K, and the cache is indexed by 4K of address lines (with 8 way associativity to give 32K total). Thus any two virtual addresses that map to the same physical address will have the same lowest 12 bits of the address, and map to the same set in the cache.

Synonyms, homonyms and aliasing are very real issues with virtually indexed caches. They have to be avoided - either through hardware, or through software restrictions (on some processors the OS is responsible for avoiding synonyms). The e200z6 code uses a very common method to avoid synonyms - match the cache size per way to the virtual memory minimum page size. As I wrote earlier, virtually indexed caches are generally small - this is main reason.

So as far as I can see, this is a perfectly good example of a virtually indexed cache used in practice. You can look up the ColdFire v4 core for another very similar example if you want.

You can also find other examples in "classic risc" architectures such as MIPS, SPARC and Alpha. These usually have physically indexed caches for the large caches, but can have virtually indexed caches for small caches (L0/L1 caches, or for earlier devices which only had small caches).

No matter how you implement caches, there are always potential complications. While using physical address indexing avoids some of these, it certainly doesn't remove them all. In particular, data coherency is an issue whenever there is more than one master (multiple CPUs, DMA, etc.). And whenever there are complications, there are clever techniques to work around them - there is a lot more involved in real-world virtually indexed caches for fast processors than has been discussed here, so that the processor can get the benefits of the low latency without imposing too many requirements on the software.

Vote

D

David Brown 15 years ago

I agree with this. There are two ways to make sure your cached data gets there on time - start fetching it at an earlier clock cycle (i.e., earlier in the pipeline), or use a faster cache design. Virtually indexed caches access their data faster (lower latency - throughput or bandwidth is the same), so you can have a shorter pipeline. With physically indexed caches you need a longer pipeline to hide the latency.

That's correct. The only thing you are missing is to note that shorter pipelines are faster than longer pipelines for a given clock speed and stall rate. Long pipelines have larger costs on flushes due to branches, exceptions, etc. They have more issues with hazards and data dependencies, leading to stalls. They are also more complex to design, involve much more logic to track, and take more power.

You can see this in the history of Intel's x86 designs. Their pipeline got longer and longer as they aimed for faster clock speeds, just so that they could claim higher MHz/GHz numbers in marketing. This cumulated (IIRC) in the P4 with about 30+ pipeline stages. Then they realised this was not the way to go - the Pentium M was almost as fast as the P4 in real-life applications despite having under half the clock rate. The main reason is that P4 seldom reached its theoretical throughput - pipeline flushes were so common and so costly.

You are right that virtual indexing gives more value for the power and cost for "medium range" CPUs (say, 100 - 400 MHz). These typically only have a small cache, and thus avoid the biggest issues of a virtually indexed cache, and the virtual indexing gives lower latency, shorter pipelines and easier logic than a physical indexed cache. For bigger caches, physical indexing is the only sensible choice. And for very fast processors, there are often small virtually indexed caches close to the processor, but these also have many other features (such as integrating instruction decode).

I think it has been an interesting and thought-provoking discussion. I hope you also got some useful help for your original question!

Vote

D

Didi 15 years ago

rote:

ks

s.

such

t

ay -

out

u

ss

ly

ts.

Precisely, around this has been the whole point of the talk against the flawed cache design of the ARM part in question.

Like designing your car with 4 rather than with 7 wheels, yes.

Dimiter

Vote

D

David Brown 15 years ago

Have I misunderstood your point all along? I was under the impression that you were arguing against virtually indexed caches in general. Does your "flawed cache design" comment apply only to this particular processor because the index is larger than the minimum virtual page size? If so, then I understand what you are saying, and agree to a large extent. It is /possible/ to have a virtually indexed cache with that is bigger (per associative way) than the minimum page size, but it certainly makes many things harder - you have to have either extra hardware or specific software restrictions to avoid aliasing.

Vote

D

Didi 15 years ago

asks

nes.

y such

ght

ng

e

way -

about

you

ress

.

re

irly

arts.

.

to

s

Not so sure about "this particular processor", but against the design allowing its cache behaviour, so yes, this is what this was about.

I am not sure what you imply by "larger than virtual page size" in that context, if you mean using the lowest 12 bits of the logical address os OK I suppose I don't have to tell you this does not differ from using the lowest 12 bits of the physical address at the usual 4k page size (in fact these are the same wires).

Anyway, apparently we agree on the cache design allowing synonyms and needing to be totally updated & flushed on task switch being generally "flawed".

Dimiter

Vote

D

David Brown 15 years ago

Yes, that's what I mean. Different processors have different sizes for the minimum page size of their virtual memory system. If it is 4K, then as you say the lowest 12 bits of the address are the same for the virtual address and the physical address. If only those bits are used in the index into the cache, then it's easy to avoid aliases - any aliased physical addresses would have the same lowest 12 bits in their virtual address, and therefore index the same cache set. In the case of the e200z6 cache, this is how it is done. Each cache set is 8 way, so it gets 32 KB cache. If you want more bits in the index than in the minimum page size, you need to have extra checks somewhere (hardware or software) to avoid aliasing.

If the ARM926 uses more bits of the virtual address to index the cache than the minimum page size, then a possible software solution is simply to use larger pages in the address mappings.

Certainly if a full cache flush is needed on task switches (or other changes to the virtual memory mapping), then it is certainly very inefficient.

I have heard that sometimes ucLinux is used on some ARMs even though they have an MMU, because it is more efficient. Perhaps this is the reason? There may be other reasons - perhaps the ARMs in question have only a limited MMU which is awkward to use.

Vote

ARM926 caching question

Join the Discussion

Didn't find your answer?