ARM926 caching question

Greetings,

This question is for ARM experts; in particular, it is about the ARM926 core (which is used in TI's DM6467 DaVinci processor).

I want to use the cache to speed up processing of a YCbCr 4:2:0 1080p video buffer (1920 x 1088 x 1.5 bytes). Normally the buffer is not cached, since it is shared between the ARM code, the C64 DSP core and an additional PCI master. The data flow is as follows: the external PCI master fills in a raw uncompressed frame -> we apply several processing steps (layout building, background, some graphics and OSD text blending) -> the whole resulting frame is passed to the DSP for compression.

The ARM core runs MontaVista Linux 4.0.1 with kernel 2.6.18 (MV-patched from the MontaVista 5.0 distribution).

I'd like to enable caching on the ARM for this buffer, process it in chunks of 4K (the D-cache on the DM6467's ARM core is 8K, 4-way associative, so I want to leave at least two ways for caching other program data and the stack) and then call a kernel module which will write back and invalidate each 4K chunk. By the time the last chunk has been cleaned and invalidated, the whole buffer will be consistent in external RAM and ready for DSP processing (obviously, before starting such processing, the whole D-cache will have to be invalidated without write-back).
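Roughly, the kernel-module side I have in mind would look something like the sketch below (illustrative only: the function name is made up, I'm assuming VA == MVA for the mapping I use, 32-byte lines and a chunk nobody else touches while it is cached; the kernel's own DMA-mapping helpers may well be the safer route):

/*
 * Minimal sketch: clean + invalidate one chunk on an ARM926EJ-S,
 * assuming VA == MVA (no FCSE in play) and 32-byte cache lines.
 */
#include <linux/types.h>

#define CACHE_LINE 32

static void clean_inval_chunk(void *vaddr, size_t len)
{
    unsigned long mva = (unsigned long)vaddr & ~(CACHE_LINE - 1UL);
    unsigned long end = (unsigned long)vaddr + len;

    for (; mva < end; mva += CACHE_LINE)
        /* Clean and invalidate D-cache entry by MVA: c7, c14, 1 */
        asm volatile("mcr p15, 0, %0, c7, c14, 1" : : "r" (mva));

    /* Drain the write buffer so the data really reaches external RAM */
    asm volatile("mcr p15, 0, %0, c7, c10, 4" : : "r" (0) : "memory");
}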

Sounds good, but I see problems with doing so according to the ARM926 TRM (or maybe I just misunderstand).

ARM caches data in 32-byte lines tagged with the Modified Virtual Address (MVA). The MVA is made by appending a special field, the FCSE PID in CP15 reg. c13, to the program's virtual address if that address is below 32M; if the address is above 32M, no appending takes place and VA = MVA (that's what happens in kernel mode). User-mode programs are mapped to the lower 32M of VA space and hence use that FCSE PID. I tried to use user-mode pointers in kernel mode and got inconsistent data in the user-mode buffers; apparently, the kernel changes the FCSE PID on system call entry, or just disables it.

Now the TRM says: "FCSE translation is not applied for addresses used for entry based cache or TLB maintenance operations. For these operations VA = MVA." That is, I can use VA-based cache manipulation CP15 instructions in kernel mode without caring about the FCSE PID. But if that is true, suppose there are currently data from 3 different processes cached for the same VA (just with different PIDs) and we invalidate the cache entry for that VA via CP15 reg. c7, for which "translation is not applied". Which of the 3 entries above will get invalidated? All of them?

Thanks, Daniel

Reply to
Stargazer

Not having any ARM experience - I live in Power (PPC) land - I would still question your understanding that they cache based on anything but the physical address (i.e. I would expect caching to be done after all translation has been done).

But this is just my speculation, again, I don't know ARM.

Dimiter

--
Dimiter Popoff
Transgalactic Instruments

Reply to
Didi

OFFTOPIC

[...]

Unfortunately, that's how it is designed. I know that on x86 and PowerPC the cache is tagged with the physical address, and IMO that's the correct way to do it. The ARM926 spec says:

"The caches are virtual index, virtual tag, addressed using the Modified Virtual Address (MVA). This enables the avoidance of cache cleaning and/or invalidating on context switch." (ARM926EJ-S TRM -- 4.1 About the caches and write buffer)

Again, IMO that's not a real argument: physical address tagging is irrelevant to cache cleaning/invalidation; at most it is an argument against tagging with the "unmodified" VA, which would just be a "more wrong" way to do caching. MVA-tagged (as any VA-tagged) caching also faces the problem of multiple virtual mappings of the same physical address, a common case for any OS that does mmap(): the same physical memory, when accessed through different mappings, will get different cache entries without any coherency or consistency between them.
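Just to illustrate the situation I mean, something along these lines sets up two virtual mappings of the same physical page (purely illustrative; whether stale data is ever actually observed depends on how much cache maintenance the kernel does behind the scenes, so this shows the setup, not a guaranteed failure):

/*
 * Two independent virtual mappings of the same file page. On a VIVT
 * cache each mapping gets its own cache lines unless the OS intervenes.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/alias-demo", O_RDWR | O_CREAT, 0600);
    ftruncate(fd, 4096);

    char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    a[0] = 42;    /* write through the first mapping...         */
    printf("a=%p b=%p b[0]=%d\n", (void *)a, (void *)b, b[0]);
    return 0;     /* ...and read it back through the second one */
}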

I deliberately didn't want to get into the issues with ARM's caching design, because I simply have to get the work done on an ARM926 processor and I can't change how it caches data :-)

Daniel

Reply to
Stargazer

Ouch, quite a mess the way they have designed it then. I can't be of any help with ARM anyway; my previous post was just a "can't believe it" sort of thing, not very useful to you - although your reply was useful to me, thanks.

Dimiter

Reply to
Didi

'Have' being the operative word; contemporary ARMs are VIPT. Use of the FCSE was deprecated in ARMv6. As far as I am aware, current Linux kernels (unless they have third-party patches applied) don't use the FCSE, therefore MVA == VA.

-p

--
Paul Gotch
--------------------------------------------------------------------
Reply to
Paul Gotch

Well, if they cache based on the logical address (if that is what VA means) it is still no good - and it is still beyond my belief that such a clunky design could go into production... not to speak of its popularity nowadays.

Dimiter


Reply to
Didi

At the risk of arguing how many angels can dance on the head of a pin - can you elucidate why this is so evil?

The "only" downside that's obvious to me is that you might have the same physical address being tracked by multiple cache lines. Not being a computer science guru, I can't immediately see that this is worse for all cases than the physical address cache. (Though, I agree it does seem more logical to me to cache after translation).

Reply to
larwe

It is not that it can't be made to work, just as a car could have seven wheels instead of the usual four. You have to flush the entire cache on every task switch - different tasks may have the same logical addresses mapping to different physical ones. Now think of flushing 32k to DDRAM on every switch; this makes task switching many times slower than it has to be. There are probably a lot more implications than I can think of right away, and I can imagine they have been building on top of a mess to minimize its effects, only to create a messier mess. Frankly, I would not bother looking into that line once I am aware of such an obvious, fundamental design flaw.

Dimiter


Reply to
Didi

I had to deal with ARM9 cache management at a low level, and it looks like the organization of the program/data caches on ARM doesn't make much sense. Perhaps the reason it was done in such a cumbersome and inefficient way was to avoid intellectual property conflicts with Intel, etc.

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant


Reply to
Vladimir Vassilevsky

VIPT means that you can do the index lookup in parallel with the TLB lookup or, if you miss in the TLB, with the page table walk.

VIVT vs VIPT vs PIPT (1) is a frequency-of-flushing vs speed-of-lookup problem. VIVT is fastest; however, it means you have to flush on context switches. PIPT is slowest, because you can't do anything until you've translated the address, but you don't have to flush. VIPT gives you the best of both worlds at the expense of needing more bits for the tag and therefore more area.
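To put some numbers on that with the geometry mentioned at the top of the thread (8K, 4-way, 32-byte lines; the figures are the OP's, I'm only doing the arithmetic): the way size comes out at 2K, which is no bigger than a 4K page, so the set index is built entirely from untranslated address bits. The ARM926 itself is VIVT, but with this geometry even a VIPT arrangement could pick the set before translation completes. A quick sketch of the arithmetic:

#include <stdio.h>

int main(void)
{
    unsigned cache_size = 8 * 1024, ways = 4, line = 32, page = 4096;
    unsigned way_size   = cache_size / ways;             /* 2048 bytes */
    unsigned sets       = way_size / line;               /* 64 sets    */
    unsigned lo         = __builtin_ctz(line);           /* bit 5      */
    unsigned hi         = lo + __builtin_ctz(sets) - 1;  /* bit 10     */

    printf("sets = %u, index = VA[%u:%u]\n", sets, hi, lo);
    printf("way size %u <= page size %u: the index bits sit inside the\n"
           "page offset, so VA and PA agree on them\n", way_size, page);
    return 0;
}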

Additionally ASIDs (Address Space Identifiers) are used to tag TLB entries such that TLB flushes are not needed on context switches.

However, this is all rather irrelevant, as operating systems abstract such things away and provide kernel-level APIs that are guaranteed to Do the Right Thing (tm).

Linux, for example, has a specific API for accessing user-space memory from kernel space:

formatting link
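The link is elided above, but the kind of thing I mean is copy_from_user() and friends; a rough sketch (the wrapper name grab_user_buffer() is just illustrative):

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/uaccess.h>

/* Copy a user buffer into kernel memory without touching user pointers
 * or the cache directly; copy_from_user() handles the dirty details. */
static long grab_user_buffer(const void __user *uptr, size_t len, void **out)
{
    void *kbuf = kmalloc(len, GFP_KERNEL);

    if (!kbuf)
        return -ENOMEM;

    if (copy_from_user(kbuf, uptr, len)) {  /* returns bytes NOT copied */
        kfree(kbuf);
        return -EFAULT;
    }

    *out = kbuf;
    return 0;
}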

(1) PIVT is theoretically possible but useless in practice.

-p

--
Paul Gotch
--------------------------------------------------------------------
Reply to
Paul Gotch

Yeah, like moving boxes. But design and programming are still being done, you know. Even if you are not writing the OS, you still have to pay the price of flushing huge amounts of data on each task switch because your architecture is flawed. Make no mistake, no matter how many acronyms get into circulation for it, this remains a design flaw. While you may add some latency if you wait for the TLB lookup to complete, this will still have a negligible (if any) impact on throughput compared to updating the whole cache on each task switch. I have just about completed a new device (400 MHz Power core plus lots of DMA, really pushed to the limit - no C, written in VPA); it would stand no chance if it had to do all that cache movement all the time, not by a great margin.

Dimiter


Reply to
Didi

In surprisingly few projects, really. One of my electronics professors was baffled that I would find anything to learn in a BSEE degree, because "all this stuff is done in chips now - nobody needs to know how discrete designs work". Um.... yeah.

Most large companies making high-volume products do not design; they package. I must say there is very little difference between companies that have no engineering department and OEM 100% of their products, and companies that have domestic engineering departments. Neither one creates anything new; they just package application notes.

Reply to
larwe

Yep, that's how it is indeed...

Dimiter

Reply to
Didi

Well, I don't know about the reasons, but the Power Architecture (PPC) designers must have had no such issues; I rarely, if ever, have to think about caches. The one thing that got me a few times was forgetting to flush the instruction cache - this was while I was porting DPS from CPU32 to PPC about a decade ago. It still came back to bite me on a new platform: I had to chase a forgotten cache line because of some alignment calculation error or the like, but in general the whole thing is quite well behaved. And it does deliver its specified performance, although it took me 2-3 days to make it do a dual FP MAC approximately every 6 ns (at 400 MHz, 2 cycles per MAC specified, not including memory accesses).

Dimiter


Reply to
Didi

The instruction cache on ARM can run either in physical or in virtual space; the data cache is virtual only. For the reason you mentioned, logical addresses must not overlap. Another piece of nonsense is that the ARM associative cache replacement policy is not LRU but sequential (round-robin) or random.

But why do you need to flush the instruction cache? DMAing when loading applications from disk?

Mistakes in cache management are very difficult to isolate and fix; the glitches caused by the cache can be very peculiar.

I work mainly with DSPs; one MAC per cycle is a standard thing if you do the programming in assembly. C adds tons of overhead, but there are hundreds of MHz to spare.

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant


Reply to
Vladimir Vassilevsky

This is probably PPC-specific, and probably not on all cores. The I-cache is not snooped when DMA-ing, nor is it kept coherent with the data cache by hardware; it must be flushed by software. There is no performance issue at all, since this rarely has to be done, but forget to do it once and you'll be executing the code that used to be in that area a while ago... :-). Actually DPS does that in two places: when loading a PSCT module from a file - this got me 10 or so years ago - and when loading a runtime object (different procedure/destination), which got me a few times more; I don't remember why, but I kept forgetting the last cache line (and because of that the error would manifest itself only rarely :-) ).

Well apart from that - which was pretty obvious and specifically indicated in the manual as "programming error" - I don't remember having any other issues.
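For what it's worth, the sequence I mean is the classic PPC dcbst / sync / icbi / isync walk over the freshly written range; a sketch in C for illustration (DPS itself is in VPA, and the 32-byte block size and the function name here are just assumptions):

#define CACHE_BLOCK 32

/* Make freshly loaded/written code visible to the instruction fetcher. */
static void make_code_visible(void *start, unsigned long len)
{
    char *p   = (char *)((unsigned long)start & ~(CACHE_BLOCK - 1UL));
    char *end = (char *)start + len;
    char *q;

    for (q = p; q < end; q += CACHE_BLOCK)
        __asm__ volatile("dcbst 0,%0" : : "r"(q));   /* push data out    */
    __asm__ volatile("sync" ::: "memory");           /* wait for stores  */

    for (q = p; q < end; q += CACHE_BLOCK)
        __asm__ volatile("icbi 0,%0" : : "r"(q));    /* drop stale lines */
    __asm__ volatile("sync; isync" ::: "memory");    /* refetch from now */
}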

I too have worked with a DSP that would do 1 MAC/cycle in a straightforward manner (just in a loop), the 5420. But on the PPC FPU I have (a 603e-derivative core), when I tried that I got 20+ ns/MAC instead of 5 (2 cycles for double precision; it can do 1 at single precision, which I did not use for that). It took careful loop unrolling, eliminating data dependencies and taking advantage of having 32 FPU registers (no chance of doing it with 8, almost no chance with 16; 24 was fine).
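The unrolling itself is nothing exotic; in C it would look roughly like this (the factor of 4 and the name are illustrative - the real code is VPA and unrolls further to cover the FPU latency):

/* Four independent accumulators break the dependency chain between
 * consecutive multiply-adds, so the FPU pipeline can stay busy. */
static double dot_unrolled(const double *x, const double *y, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;

    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i + 0] * y[i + 0];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)            /* leftover elements */
        s0 += x[i] * y[i];

    return (s0 + s1) + (s2 + s3);
}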

Dimiter


Reply to
Didi

Time flies. Cortex-A9 has a PIPT dcache and VIPT icache.

Regards

--
Marcus Harnisch
Senior Consultant, Doulos Ltd.


Reply to
Marcus Harnisch

Note that I don't know the caching structure of this particular chip at all; I'm only giving general information here.

If you are used to physical address caches, it is easy to misunderstand how virtual address caches work and therefore dismiss them as useless because you think they must be flushed for every task switch (or more precisely, every time the virtual-to-physical address mapping changes). This is not true unless you have a very simplistic implementation of the cache.

To understand this you need to know the difference between "indexing" and "tagging". "Indexing" is about how you look up the right line in the cache, while "tagging" is about how the cache line is identified.

For a physical address cache, these are approximately the same - some part of the physical address is used to pick the line to look at, and the full physical address is stored in that line's tag to see if it is valid (I'm skipping over associativity here for simplicity).

The advantage of this is that your cache line's validity is independent of the virtual address mapping, and thus you don't need to take any special action if that changes (such as during a context switch). The disadvantage is latency - you have to wait until the virtual address is fully translated to a physical address before you can check the cache.

For a virtual address indexed cache, you can use either virtual address tagging or physical address tagging. With virtual address tagging, your cache is invalid (and must be flushed) whenever the virtual to physical mapping changes. However, it is the fastest scheme - there is no need to wait for any address translations on a cache hit. On devices with very fast clock rates this can make a big difference. The disadvantage is that you need to flush the cache often, although this can be avoided to some extent by using some sort of task identifier as part of the tag.

With physical address tagging, it's easy to see whether a cache line is valid or not by comparing the physical addresses just like for a physically addressed cache. However, since the physical address is needed only after the initial lookup, the address translation and the cache lookup can be done in parallel, thus reducing latency compared to the pure physical cache case.
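To make that split concrete, here is a toy model of a VIPT lookup (the sizes are arbitrary illustrative values, and a real design would constrain the index bits to the page offset or handle aliases, which I'm ignoring here):

#include <stdbool.h>
#include <stdint.h>

#define LINE_SHIFT 5     /* 32-byte lines   */
#define SETS       64    /* 64 sets per way */
#define WAYS       4

struct line { bool valid; uint32_t pa_tag; };
static struct line cache[SETS][WAYS];

/* The set is chosen from the virtual address alone, so the lookup can
 * start in parallel with the TLB... */
static unsigned set_index(uint32_t va)
{
    return (va >> LINE_SHIFT) % SETS;
}

/* ...and only the final tag compare needs the translated physical address. */
static bool vipt_hit(uint32_t va, uint32_t pa)
{
    unsigned set = set_index(va);
    uint32_t tag = pa >> LINE_SHIFT;
    unsigned w;

    for (w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].pa_tag == tag)
            return true;
    return false;
}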

With virtually indexed caches, you always have issues with synonyms - different virtual addresses that point to the same physical address. There are various schemes to avoid or detect these. It is much less of an issue for instruction caches than for read-write data, and also if you have a write-through cache (which never holds dirty data).

Trying to attach numbers to the tradeoffs for various cache characteristics is notoriously difficult, and highly dependent on the workload. But generally speaking the cost of the virtual address indexing gets higher with cache size, as cache flushes get more expensive, and it is particularly expensive when you have dirty data in the cache.

Thus virtual address indexing is mostly restricted to small caches, with virtual indexing and tagging being ideal for small closely-coupled caches aimed at speeding up loops, stack access, etc. Virtual indexed, physically tagged caches also work well for smaller caches (the PPC-based device I am using now has 32K of such cache). They are also good when you don't make many (or any) changes to the virtual-to-physical mapping, as is common on embedded systems. In such systems, virtual indexing is faster and simpler.

Bigger caches are almost always physically indexed (and tagged).

More complex processors use more advanced schemes to allow the use of virtual indexing on the smaller caches (L0 or L1) with less risk of flushes or coherency problems.

Reply to
David Brown

The PID is not appended, but *replaces* the upper seven bits of the VA.

The address-based cache ops already take the MVA as their parameter; you have to construct it manually. Applying the FCSE translation on top of that wouldn't make sense.
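A minimal sketch of what "construct it manually" amounts to on the ARM926 (illustrative only; the helper names are mine, and on a stock kernel the PID is zero so MVA == VA anyway):

/* Read the FCSE PID (CP15 c13, bits [31:25]). */
static inline unsigned long read_fcse_pid(void)
{
    unsigned long pid;

    asm volatile("mrc p15, 0, %0, c13, c0, 0" : "=r" (pid));
    return pid & 0xfe000000UL;
}

/* Below 32MB the top seven bits of the VA are replaced by the PID;
 * at or above 32MB, MVA == VA. */
static inline unsigned long va_to_mva(unsigned long va)
{
    if (va < 0x02000000UL)
        return read_fcse_pid() | va;   /* top 7 bits of va are zero here */
    return va;
}

/* Example use: clean one D-cache line by MVA (ARM926EJ-S: c7, c10, 1). */
static inline void clean_dcache_line(unsigned long va)
{
    asm volatile("mcr p15, 0, %0, c7, c10, 1" : : "r" (va_to_mva(va)));
}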

Regards

--
Marcus Harnisch
Senior Consultant

Reply to
Marcus Harnisch

It may be that the implementation in this device is flawed in some way - I don't know it and therefore can't comment. But you are wrong about virtually indexed caches necessarily needing lots of flushes, and you are wrong about it being "flawed". It's a different tradeoff of costs and benefits.

Again, you are wrong here - the virtual-to-physical translation is not negligible if you are talking about fast, closely-coupled caches, and you are also making the incorrect assumption that you will always have to do a lot of cache flushes.

It's a tradeoff, and in some cases one cache architecture will outperform the other - but there is no "best" choice that is always applicable.

What's best - unified or split caches? Two-way associative or 8-way associative? 16-byte cache lines or 128-byte cache lines? One big cache, or multi-level caches? There is no single correct answer, there are only tradeoffs and choices.

Perhaps you are doing a great deal of manipulation of the virtual to physical address mapping, in which case a physically mapped cache is a better choice. But that's the choice for /your/ application, not everyone else's.

I am using a PPC core (at 128 MHz, later versions will be 264 MHz), with a virtually indexed physically tagged cache. I think it is fair to say that the manufacturer (Freescale) chose that cache architecture because it was the best choice for the chip and the applications used on it - not because they like to use "flawed architectures".

Reply to
David Brown
