PPC405 Performance Monitoring

Anthony Mahar 21 years ago

Hello,

Is there a way to do performance monitoring on the PPC405 in the Virtex II Pro? I am specifically interested in cache hits.

I have wedged my own device between the CPU's instruction and data PLB interfaces and can currently get cache misses. But I need to find a way to determine cache hits of an application running under an operating system.

If it were standalone, I could derive that information from the number of load and store instructions, but this is an operating system with context switches, interrupt handlers, etc.

Is there a way to gather this information? There did not seem to be any performance monitoring registers as seen with newer PowerPC and x86 systems. Can the trace port be used to passively monitor execution for load/store instructions?

Thank you, Tony

Nju Njoroge 21 years ago

Unfortunately, I have few answers to your questions. However, I know of a research group at Georgia Tech that is designing (or has designed) a memory access monitor, which sounds similar to yours. You may want to correspond with them to exchange notes. I learned of their monitor at the HPCA 2005 FPGA workshop. Here is a link to the workshop: http://cag.csail.mit.edu/warfp2005/. The workshop presentations are at http://cag.csail.mit.edu/warfp2005/program.html. Their presentation was titled "Evaluating System wide Monitoring Capsule Design using Xilinx Virtex II Pro FPGA". Their paper has their contact information.

As for the trace port, I have used it with an IBM/Agilent RISCWatch (RW) box, which collects a dynamic trace of the instructions over 8 million CPU cycles. The main limitation is that it only works for standalone apps. When you have virtual memory enabled (while running Linux, for instance), RW uses the TLB to perform the virtual-to-physical address translations. This is fine for regular code. However, when an interrupt is taken, the CPU switches to physical addresses for the interrupt handler. RW keeps using the TLB, so it tries to translate physical addresses for which no "translations" exist, and it is unable to resolve the interrupt handler instructions. From that point on, the trace is corrupted. In any case, if you are interested in learning more about RW, you can refer to this appnote: http://direct.xilinx.com/bvdocs/appnotes/xapp545.pdf. It has links to all the manuals for the RW box and its tools.

Lastly, for my own curiosity, how difficult was it to design and debug your monitor? The guy I spoke to from Georgia Tech at the workshop said they used Chipscope to learn the protocol (along with IBM's PLB spec). He claims that this was a painstaking process.

NN

Paul Hartke 21 years ago

I looked into doing this a while back.

From the sounds of it, you have already created a data-side cache miss collection engine; now you need the total number of loads and stores. As you surmised, this info can be collected through the debug interface (note that the debug interface is different from the trace interface: [formatting link]) and counted in a similar fashion to what you currently do for the misses. The difference is that here you need to distinguish the loads and stores from the other instructions, but the decode is pretty straightforward.
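As a rough illustration of that decode step, here is a minimal Python sketch that classifies 32-bit PowerPC instruction words as loads, stores, or other. It assumes the integer D-form load/store primary opcodes (32 through 47) plus a small, deliberately incomplete subset of the indexed X-form variants under primary opcode 31. The names `classify` and `count_loads_stores` are made up for this example, and a real counter would do the same thing in hardware on the observed instruction stream.

```python
# The primary opcode is the top 6 bits of the 32-bit instruction word.
# D-form integer loads and stores occupy primary opcodes 32..47.
D_FORM_LOADS = {32, 33, 34, 35, 40, 41, 42, 43, 46}   # lwz lwzu lbz lbzu lhz lhzu lha lhau lmw
D_FORM_STORES = {36, 37, 38, 39, 44, 45, 47}          # stw stwu stb stbu sth sthu stmw

# Indexed (X-form) variants share primary opcode 31 and are told apart by a
# 10-bit extended opcode. Illustrative subset only, not the full list.
X_FORM_LOADS = {23, 87, 279}     # lwzx lbzx lhzx
X_FORM_STORES = {151, 215, 407}  # stwx stbx sthx


def classify(word: int) -> str:
    """Return 'load', 'store' or 'other' for one instruction word."""
    primary = (word >> 26) & 0x3F
    if primary in D_FORM_LOADS:
        return "load"
    if primary in D_FORM_STORES:
        return "store"
    if primary == 31:
        extended = (word >> 1) & 0x3FF
        if extended in X_FORM_LOADS:
            return "load"
        if extended in X_FORM_STORES:
            return "store"
    return "other"


def count_loads_stores(words):
    """Tally instruction classes over an iterable of instruction words."""
    counts = {"load": 0, "store": 0, "other": 0}
    for word in words:
        counts[classify(word)] += 1
    return counts


# lwz r3,0(r4); stw r3,0(r4); lwzx r3,r4,r5; add r3,r4,r5
sample = [0x80640000, 0x90640000, 0x7C64282E, 0x7C642A14]
totals = count_loads_stores(sample)
```

Note that this counts instructions, not memory accesses: `lmw` and `stmw` each move several words, so a count meant to line up with cache accesses would need extra handling for them.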

For CPI and instruction cache miss rate measurements, the same general technique can be used.
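The arithmetic behind these measurements is simple once the raw counts exist. The sketch below assumes five hypothetical counter values (loads, stores, data-side misses, instructions completed, cycles) and assumes every load or store makes exactly one cacheable data access, which is a simplification.

```python
def cache_stats(loads, stores, data_misses, instructions, cycles):
    """Derive hit count, hit rate and CPI from raw event counts."""
    accesses = loads + stores
    hits = accesses - data_misses          # every access is either a hit or a miss
    return {
        "hits": hits,
        "hit_rate": hits / accesses if accesses else 0.0,
        "cpi": cycles / instructions if instructions else 0.0,
    }


stats = cache_stats(loads=600, stores=400, data_misses=50,
                    instructions=2000, cycles=3000)
```

With these made-up numbers, 1000 accesses and 50 misses give 950 hits, and 3000 cycles over 2000 instructions give a CPI of 1.5.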

You should check out Nju's xapp545 appnote for another method of collecting the trace data. You can learn a lot about what the code is actually doing by looking at 8-million-cycle dumps of instruction execution.

The issue of OS context switches and interrupts is really orthogonal. You don't mention your OS, but OProfile ([formatting link]) for Linux handles this by adding code to every event that causes a context switch. That code collects the values of the counters (in this case the ones you've inserted between the PPC405 and the PLB) and assigns them to the currently running code.
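The bookkeeping involved can be sketched in a few lines. This is not OProfile's code, only an illustration of the idea: at every switch, read the free-running counters and charge the change since the previous switch to whatever was running. The class name, the `read_counters` callable, and the task labels are all hypothetical, and counter wraparound is ignored.

```python
from collections import defaultdict


class CounterAttributor:
    """Charge counter deltas to the task that was running when they accrued."""

    def __init__(self, read_counters):
        self._read = read_counters          # callable returning {name: count}
        self._last = read_counters()        # snapshot at the previous switch
        self.per_task = defaultdict(lambda: defaultdict(int))

    def on_context_switch(self, outgoing_task):
        """Call from each switch point, naming the task being switched out."""
        now = self._read()
        for name, value in now.items():
            self.per_task[outgoing_task][name] += value - self._last[name]
        self._last = now


# Stand-in for the hardware counters sitting between the CPU and the PLB.
hw = {"misses": 0, "loads": 0}
attributor = CounterAttributor(lambda: dict(hw))

hw.update(misses=5, loads=100)      # task A runs
attributor.on_context_switch("A")
hw.update(misses=7, loads=130)      # interrupt handler runs
attributor.on_context_switch("irq")
hw.update(misses=12, loads=200)     # task A runs again
attributor.on_context_switch("A")
```

After this sequence, task A has been charged 10 misses and 170 loads, and the interrupt handler 2 misses and 30 loads.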

A similar approach is valid for other OSs but leveraging Oprofile is a good starting point since they've already figured out the relevant hooks into the kernel.

Paul

Anthony Mahar 21 years ago

Thank you Nju,

I am going to dig into those docs right now.

My design was not intended to be a monitor, but an active bus transaction modifier. On certain transactions, I have to perform certain operations on the data going to the PPC405. This means I selectively pass data through, or perform some higher latency operations.

Since I am currently interested in cache-miss performance, I only count the number of transaction requests from L1 cache. Because it is an individual word that caused the instruction miss, all other words retrieved in the transaction are, of course, not considered as a miss. This makes it extremely easy to monitor the number of transaction requests.
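Put differently, the counter increments once per line-fill request rather than once per word transferred. A minimal sketch, assuming a hypothetical list of observed bus transactions in which each record carries only a kind and a word count (the record layout and names are invented for illustration, and 8 words per fill corresponds to a 32-byte cache line):

```python
from typing import NamedTuple


class Transaction(NamedTuple):
    kind: str    # "ifill" (instruction line fill), "dfill" (data line fill), or anything else
    words: int   # words transferred; deliberately ignored when counting misses


def count_misses(transactions):
    """Count one miss per line-fill request, regardless of words moved."""
    misses = {"ifill": 0, "dfill": 0}
    for t in transactions:
        if t.kind in misses:
            misses[t.kind] += 1
    return misses


trace = [
    Transaction("ifill", 8),
    Transaction("dfill", 8),
    Transaction("write", 1),   # not a line fill, so not counted as a miss
    Transaction("ifill", 8),
]
miss_counts = count_misses(trace)
```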

While the module is an active component between the CPU and PLB, it is very easy to add a passive monitor once you have a way to have the EDK inject the monitor in the middle. For me, it required some time to understand the EDK .mpd format and effectively create a PLB-PLB bridge (no logic, pure pass-through), and there may be better ways with the "transparent" bus format that I haven't had time to look into. But at the time it was also my first EDK peripheral.

As for 'learning' the PLB system, I found the IBM CoreConnect Bus Functional Model (BFM) for the PLB, with the PLB doc, to be instrumental in observing every kind of transaction I had to handle. I think the BFM would be far easier than using ChipScope/Docs alone. The BFM allows the generation of almost any kind of cycle-accurate PLB transaction a master and slave can use.

One other model I would like to begin using is the Xilinx-provided PPC405 swift model, which will allow the same code used by the real processor to run in simulation. This will cause PLB transactions to occur in the same way they will on the real system, i.e. cache line fills based on the PPC405 MMU's state, etc.

Regards, Tony

Anthony Mahar 21 years ago

Interesting question for the "Monitoring Capsule Design" paper... they state they monitor behavior "between the CPU and L1 Dcache." Did they explain how they were able to do this, since the PPC405 and L1 are part of the same hard core?

There would be interesting (positive) implications for my research if I could also inject myself between CPU and L1, instead of only between L1 and some instantiated L2 cache or memory bus.

Thank you, Anthony

Nju Njoroge 21 years ago

You are right: the CPU and the L1 cache are in the same hard core, so we don't have access to the interface between the CPU core and the cache. As I described in my previous post, they placed their monitor at the L1 cache ports that are usually connected to the PLB. Thus, instead of connecting their CPU to the PLB, they connected the PPC core to their monitor, which is then connected to the PLB.

NN

Nju Njoroge 21 years ago

If I understand correctly, you are saying that your transaction modifier acts as a PLB Bus to PLB Bus bridge. So, in the EDK project, you connected the CPU to a PLB bus, then connected your module to that PLB bus and then connected another PLB bus on the other side of your pcore?

CPU <-> PLB Bus <-> your pcore <-> PLB Bus <-> Memory (Cache/BRAM)

If my understanding is correct, you in essence designed a PLB-PLB bridge, like the PLB-OPB bridge, right?

In our research, we also designed a PLB to PLB bridge. Our pcore was initially a pass-through in between the two buses, then we placed our real module when we got the pass-through running.

The guys from Georgia Tech, however, interfaced their monitor module directly with PPC's PLB ports, so they couldn't use EDK's abstraction of the bus protocol through the PLB IPIF module. In fact, they had to synthesize their project in ISE since EDK wouldn't support what they were trying to do. That's why they had to use ChipScope to really see what the processor does.

In designing our pass-through, we used the swift models. I definitely recommend learning how to use them. The swift models allow you to conduct full-system simulations. As for the BFMs, we weren't able to use them for our pcore since the EDK 6.3i IPIF Create/Import wizard didn't support the use of Verilog modules (7.1 now supports this). We could have hacked this by using a netlist, but you cannot pass parameters/generics into a netlist, which is a feature that is required for our pcore. I have used the BFMs for a VHDL module I worked on in the past and I agree that they too were helpful.

NN

Nju Njoroge 21 years ago

If I understand correctly, you are saying that your transaction modifier acts as a PLB Bus to PLB Bus bridge. So, in your XPS project, you connected the CPU to a PLB bus, then connected your module to that PLB bus and then connected another PLB bus on the other side of your pcore? I assume you also used the Create/Import IPIF Wizard, right?

CPU <-> PLB Bus <-> your pcore <-> PLB Bus <-> Memory (Cache/BRAM)

If my understanding is correct, you in essence designed a PLB-PLB bridge, as in the diagram above.

In our research, we also designed a PLB to PLB bridge. Our pcore was initially a pass-through in between the two buses, then we placed our real RTL when we got the pass-through working.

The guys from Georgia Tech, however, interfaced their monitor module directly with PPC's PLB ports, so they couldn't use EDK's abstraction of the bus protocol through the PLB IPIF module. In fact, they had to synthesize their project in ISE since EDK wouldn't support what they were trying to do. That's why they had to use ChipScope to really see what the processor does.

In designing our pass-through, we used the swift models. I definitely recommend learning how to use them. The swift models allow you to conduct full-system simulations. As for the BFMs, we weren't able to use them for our pcore since the EDK 6.3i IPIF Create/Import wizard didn't support the use of Verilog modules (7.1 supports this now). We could have hacked this by using a netlist, but you cannot pass parameters/generics into a netlist, which is a feature we require for our pcore. I have used the BFMs for a VHDL module I worked on in the past and I agree that they too were helpful.

NN

Anthony Mahar 21 years ago

As they state in their paper [formatting link]:

"In our initial study, we deploy a monitoring capsule in Dcaches to monitor the memory behavior between a CPU and L1 Dcache."

It is not possible to monitor signals between the CPU and L1 cache (I or D). Was the monitoring of CPU/L1 inferred by the cache misses seen coming from L1? Even so, a lot of memory behavior is missed when only observing cache misses.

Regards, Tony

Nju Njoroge 21 years ago

They had two versions of their monitor: one for the MicroBlaze core and one for the PPC. For the PPC, they inferred the CPU/L1 behavior from the cache misses seen coming out of the L1. With the MicroBlaze, since they have access to the L1 cache signals, they could wedge their monitor between the CPU and the cache.
