I'm currently working on a "typical" set-top box project (digital TV).
The system can be considered a "heterogeneous computing" system, with various "processing elements" for different tasks:
- an SH4 (ST40) "system" CPU, where the app runs on top of a mini OS
- a micro-controller for watchdog and low-power/stand-by functions
- a co-processor for audio decoding
- another co-processor and/or ASIC for video decoding (the media decoders are not well documented)
- a few DMA engines
- a blitter gizmo for UI whiz-bang
- a crypto co-processor
- stuff I don't even know about
All of these access a shared resource: RAM (presumably through a shared bus?).
The ODM provides minimal profiling tools (instruction-pointer sampling, plus a post-processing script to parse the symbol table and match each sampled IP with the corresponding function).
Problem is, these tools only profile the "system" CPU. The rest of the system is a giant black-box to me.
Profiling shows that merely decoding one HD channel (audio and video) pegs the system CPU at 50%, which is unexpected, because all the heavy lifting is done elsewhere.
If I disable the audio, the load drops to 25%... even though the audio tasks were nowhere near 25% of the profile. With audio disabled, the system CPU spends less time in ALL other parts of the software.
This would seem to incriminate some kind of bus contention for a shared resource, and I'm thinking main memory.
Drop audio decoding => bus contention drops => everything runs smoother.
Does this theory make sense/hold water?
More importantly, how would I validate/invalidate it?
For this theory to be credible, the system CPU would have to spin (busy-wait) when it needs RAM and the bus is locked by another entity, instead of switching to a different task.
I'm thinking maybe I can use the perf counters to highlight the CPU twiddling its thumbs while waiting for RAM access?
Hmmm, there is a "ram" event, but it's only a counting event, so no cigar. Perhaps cache misses? There are pfi and pfo events (Pipeline Freeze due to cache miss on Instruction/Operand fetch).
But the problem is not really SHARING the memory, merely accessing it: in the limit, each processor could have its own little private slice of RAM, yet only one processor could access RAM at a time... and that contention would still inflate the latency of a pipeline freeze on a miss. (Sorry for thinking out loud, I'm really in the dark here.)
Anyway, I'm open to suggestions / advice / warnings / etc.