Just to add some extra information: ARM processors starting with the v6 architecture (ARM1136 and later cores) have tagged TLBs and VIPT (Virtually Indexed, Physically Tagged) caches, so no flushing is required at a context switch. The drawback of physical tagging is that a TLB look-up is needed to obtain the physical address before the tag comparison can complete, although the virtual index does let the cache set look-up start in parallel with the TLB look-up. This might not make any difference on modern, pipelined processors though.
There is another thing to consider for OSes like Linux (not that Linux can be used for hard real-time): the application code and read-only data pages are loaded from the filesystem on demand. That is, the application initially starts with only a few pages loaded/mapped, and when execution branches to a location in a page that is not yet mapped, the kernel traps the prefetch abort and loads the new page into memory, mapping it into the task's address space (on some architectures, this requires flushing the whole TLB). This can cause significant delays. Another case is malloc'ed memory, which Linux doesn't actually allocate until it is first accessed (you can use calloc instead, though for large allocations the reliable way is to write to every page yourself). Even if you don't have swap enabled, Linux on MMU systems can evict read-only pages from RAM if it runs short of available memory.
For some ARM cores (pre-v6 architecture), the MMU or MPU (Memory Protection Unit) must be enabled before the caches can be used at all.