PPC cache errata

Hi folks,

I wonder how you work around (bypass) the caching problems of the PPC405 (Xilinx Virtex-4 FX). The silicon errata are available at ftp://ftp.xilinx.com/pub/documentation/misc/ppc405f6v5_2_0.pdf.

I tried disabling cache, but performance drops significantly.

I also found this code snippet in the GSRD2 design:

val = mfspr(XREG_SPR_CCR0);
val |= 0x50000000;                  /* CCR0 bits 1 and 3 (PowerPC bit 0 is the MSB) */
XCache_WriteCCR0(val);

XCache_EnableICache(0x80000000);    /* bit 0 set */
XCache_EnableDCache(0x80000000);

Actually I need caching for only some 10MB of DDR memory. How do I specify this in XCache_EnableXCache()?

Cheers,

Guru

Reply to
Guru

Hi,

currently I think they should have left the caches out. In my design (something with MPMC, some LocalLink DMAs and Gigabit Ethernet), every time I activate the instruction cache I get program exceptions (a jump to offset 0x700 in the vector table), sporadic resets and so on. I suspect that in my case the I-side PLB (also taken from the MPMC example) is too aggressive for the cache. Or whatever. I tried everything mentioned in the errata... Sometimes it gets better, sometimes not... I also played around with the priorities in the MPMC...

To address your question: in the PowerPC Processor Reference Guide there is a chapter about how to use the different cache registers. The address space is divided into 128MB regions. Each bit in the value used to enable or disable caching represents one of these regions. Bit 0 (as in your example) is the memory space from 0x0 to 0x7ffffff. So enabling the cache for only 10MB is not possible; you have to enable the whole 128MB.
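The region-to-bit mapping described above can be sketched as a small helper. This is only an illustration: cache_region_mask() is a hypothetical name, not part of the Xilinx library, and it assumes the range does not wrap past the top of the 32-bit address space.

```c
#include <assert.h>
#include <stdint.h>

#define REGION_SHIFT 27u   /* 2^27 bytes = 128MB per region */

/* Returns a mask with one bit set per 128MB region touched by
 * [base, base + len). PowerPC bit 0 is the most significant bit,
 * so region 0 maps to 0x80000000.
 * Assumes base + len does not wrap around 32 bits. */
static uint32_t cache_region_mask(uint32_t base, uint32_t len)
{
    uint32_t mask = 0;

    if (len == 0)
        return 0;
    assert(len - 1u <= UINT32_MAX - base);   /* no wrap-around */

    uint32_t first = base >> REGION_SHIFT;
    uint32_t last  = (base + (len - 1u)) >> REGION_SHIFT;

    for (uint32_t r = first; r <= last; r++)
        mask |= 0x80000000u >> r;
    return mask;
}
```

A 10MB buffer at address 0 gives 0x80000000, the value used in the snippet from the first post. The same buffer placed so that it straddles a 128MB boundary needs two bits, i.e. 256MB of cacheable space.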

Greetings, Christian.

Reply to
cpmetz

Are you clearing the caches first? The tags seem to persist across resets, so if you enable the caches without clearing them they will immediately start returning (stale) data.

--
Ben Jackson AD7GD

http://www.ben.com/
Reply to
Ben Jackson

There is no problem with the PPC405 caches in Virtex-4, i.e. they work as expected. CPU errata 213 does not apply to the PPC405 in V4, as you can see in Table 1 (pg. 2) of the errata document you are referring to.

Setting bits 1 and 3 in CCR0 was needed in early silicon (PVR 0x20011430) to work around a problem that was fixed in production silicon (PVR 0x20011470). The workaround is enabled automatically by the boot code used by XPS and is transparent to the user.
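For anyone wondering where 0x50000000 comes from, here is a minimal sketch of such a PVR check. It is not the actual XPS boot code; the function and macro names are made up for illustration, and only the two PVR values and the bit numbers come from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* PowerPC numbers bits from the most significant end: bit 0 = 0x80000000. */
#define PPC_BIT(n) (0x80000000u >> (n))

#define PVR_EARLY_SILICON      0x20011430u   /* needs the workaround */
#define PVR_PRODUCTION_SILICON 0x20011470u   /* fixed in silicon */

/* Bits 1 and 3 together give the constant seen in the GSRD2 snippet. */
static_assert((PPC_BIT(1) | PPC_BIT(3)) == 0x50000000u,
              "CCR0 bits 1 and 3 should equal 0x50000000");

/* Returns the CCR0 value to write back for the given processor version.
 * Hypothetical helper: on the target, pvr would be read from the PVR
 * register and the result written with XCache_WriteCCR0(). */
static uint32_t ccr0_with_workaround(uint32_t pvr, uint32_t ccr0)
{
    if (pvr == PVR_EARLY_SILICON)
        ccr0 |= PPC_BIT(1) | PPC_BIT(3);
    return ccr0;
}
```

On production silicon the value passes through unchanged, which matches the statement that the workaround is only needed for the early parts.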

XCache_EnableICache(0x80000000) and XCache_EnableDCache(0x80000000) enable the instruction and data caches for the first 128MB [0..0x07ffffff] of memory.

Cheers,

- Peter

Reply to
Peter Ryser

This would have been a really bad idea. Everything would go so slowly, and the caches take up a tiny amount of area compared to an equivalent number of blockrams.

If you disable the caches you will also change how the processor interacts with the PLB bus. With caches disabled there will be no bursting on the bus. With caches enabled some (or in some modes all) transactions will result in bursting an entire cacheline.

I would suspect the problem has much more to do with the memory controller: possibly it won't do a burst read correctly in all circumstances (i.e. a design flaw), or the timings may not be set correctly. The next possibility is a power distribution or signal integrity problem. With the cache enabled you will be operating the DRAM in a burst mode, which will increase both noise and power draw. Maybe there isn't enough margin in the design.

If you want to test the memory controller, disable the Icache, enable the Dcache in an appropriate mode, and write a memory test that is guaranteed to force burst reads at addresses that will result in continually reloading the same cacheline with different data. You will have to look at the data sheet for the processor to determine the cacheline to physical address mapping.
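A minimal sketch of such a test follows. The cache geometry (16KB, 2-way, 32-byte lines) is an assumption that should be checked against the processor reference guide, and the function names are made up. The idea is to touch more lines that map to the same cache set than there are ways, so that every read-back misses and forces a fresh cacheline fill from memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed D-cache geometry; verify against the reference guide. */
#define LINE_BYTES     32u
#define WAYS           2u
#define CACHE_BYTES    (16u * 1024u)
#define SET_STRIDE     (CACHE_BYTES / WAYS)  /* addresses this far apart share a set */
#define ALIASES        4u                    /* more than WAYS, so lines keep getting evicted */
#define WORDS_PER_LINE (LINE_BYTES / 4u)
#define STRIDE_WORDS   (SET_STRIDE / 4u)

/* Walking-one pattern, inverted on odd words, so that a failing bit
 * position points at a particular data line. */
static uint32_t pattern(uint32_t alias, uint32_t word, uint32_t pass)
{
    uint32_t v = 1u << ((alias * WORDS_PER_LINE + word + pass) % 32u);
    return (word & 1u) ? ~v : v;
}

/* base must point to at least ALIASES * SET_STRIDE bytes of cacheable
 * memory, aligned to a cacheline. Returns the number of bad words. */
static unsigned thrash_test(volatile uint32_t *base, unsigned passes)
{
    unsigned errors = 0;

    assert(base != NULL);
    for (uint32_t p = 0; p < passes; p++) {
        /* Write one line per alias; all of them compete for one set. */
        for (uint32_t a = 0; a < ALIASES; a++)
            for (uint32_t w = 0; w < WORDS_PER_LINE; w++)
                base[a * STRIDE_WORDS + w] = pattern(a, w, p);

        /* Read back in the same order: by now each line has been
         * evicted by later aliases, so each read reloads it. */
        for (uint32_t a = 0; a < ALIASES; a++)
            for (uint32_t w = 0; w < WORDS_PER_LINE; w++)
                if (base[a * STRIDE_WORDS + w] != pattern(a, w, p))
                    errors++;
    }
    return errors;
}
```

On a host PC this simply passes; it only becomes a burst test when run on the target with the D-cache enabled for the region under test. Whether the stores themselves allocate lines depends on the cache mode, but the read-back misses either way.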

What you pick for the data will allow you to determine where/what the problem is. You should be able to pick patterns that will allow you to determine if the memory controller won't burst on the PLB side, or the external port, and whether the timings are marginal or not. These problems aren't fun to track down.

Before getting too carried away I would suggest that you look at the setup of the clock tree and phases of the clocks/data. If the controller works reliably in single transfers, but not bursts, it could be an insufficient hold time, as the bus will capacitively hold the data if it is not being overwritten. Looking at the address/data patterns that make the memory tests fail can give good insight.

I can second the voice of Peter Ryser from Xilinx, the caches do work in V4, even with ES silicon.

Regards, Erik.

--
Erik Widding
President
Reply to
Erik Widding

I use the MPMC2 memory controller with my peripheral connected to an NPI port (DMA write capable). When I enable the caches the PPC does not read address 0x00000000 properly, but when caching is disabled it works OK. I am not clearing the caches before using them. I have two Avnet Virtex-4 Mini Modules with ES FPGAs.

Is it possible to put the memory at the end (or the beginning) of the cached area, e.g. caching the first 128MB with the DDR memory starting at 0x07800000? Can this work?

Why are there no specs for bits 1-8 of CCR0? What do bits 1 and 3 do?

Cheers,

Guru

Erik Widding wrote:

> > currently I think they should have left the caches out.

Reply to
Guru

Well, we seem to have the same problem. I did some cache-tests as proposed, and everything works fine in my design; at least when no DMA-transfers of the LocalLink-Interface are issued. So I apologize for the PPC-bashing, it was a sign of my frustration :-)

Anyone had issues with the MPMC2 under heavy load from different ports?

Cheers, Christian.

Guru wrote:

Reply to
cpmetz

Frustration is what you get when you use the PPC. If you want a little more performance than the MB then this route is inevitable.

I also have problems with delays in my NPI DMA engine. For some reason short transactions are delayed. I am using 64-word transactions from an asynchronous source. I fill all 64 words into the NPI FIFO, then I fire the address request (which initiates the transfer). For shorter transactions I set BE to 0x0 for the doublewords that should not be written. Maybe this BE is the reason for the delays.

PS: If anyone wants a partially working GSRD2 design for Virtex-4 Mini Module I can send it.

Cheers,

Guru

snipped-for-privacy@googlemail.com wrote:

Reply to
Guru

I think not clearing the caches is your problem. If memory serves, you need to invalidate every single tag in the cache after reset. You will also need to invalidate the cache tags for any data that MAY have been changed outside the context of the processor.
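A rough sketch of the post-reset invalidation, under stated assumptions: a 16KB, 2-way D-cache with 32-byte lines (256 congruence classes), an iccci that invalidates the whole instruction cache on the 405, and a dccci that invalidates one congruence class at a time. Please check all three against the reference guide. The function names are made up, and the host stubs exist only so the loop can be exercised off-target.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES   32u    /* assumed cacheline size */
#define DCACHE_SETS  256u   /* assumed: 16KB / (2 ways * 32-byte lines) */

static_assert(DCACHE_SETS * LINE_BYTES * 2u == 16u * 1024u,
              "geometry should add up to a 16KB, 2-way cache");

#if defined(__powerpc__) || defined(__PPC__)
/* Privileged cache-control instructions, usable only in supervisor mode. */
static void dccci(uint32_t ea) { __asm__ volatile ("dccci 0,%0" : : "r"(ea) : "memory"); }
static void iccci(void)        { __asm__ volatile ("iccci 0,%0" : : "r"(0)  : "memory"); }
static void sync_all(void)     { __asm__ volatile ("sync; isync" : : : "memory"); }
#else
/* Host stubs: count the calls instead of touching any cache. */
static unsigned dccci_calls, iccci_calls;
static void dccci(uint32_t ea) { (void)ea; dccci_calls++; }
static void iccci(void)        { iccci_calls++; }
static void sync_all(void)     { }
#endif

/* Call once after reset, BEFORE enabling either cache. dccci throws
 * lines away without writing them back, so it is not suitable once
 * the D-cache holds live data. */
static void invalidate_all_cache_tags(void)
{
    iccci();
    for (uint32_t set = 0; set < DCACHE_SETS; set++)
        dccci(set * LINE_BYTES);
    sync_all();
}
```

For data changed behind the processor's back (e.g. by a DMA write) the whole cache does not need to go; invalidating just the affected lines, for instance with dcbi on each 32-byte line of the buffer, should be enough.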

There are at least two ways to do this. The first is to use the TLB to do a "virtual" double map of the memory into a cacheable and non-cacheable region. This is overkill if you are not using the TLB for any other reason.

The second, easier way, is to do a "physical" double map of the memory bank so it is visible from two 128MB regions. You would set the PLB address space for the memory controller to 256MB even though the SDRAM might be

There is no need for you to know this. This is an "oh shit" port for the IP owner (IBM) to fix problems with his IP.

Cache coherency is the responsibility of the user with the PPC405. The cache takes a tiny amount of silicon area, and drastically improves performance. But it does have some limitations as a result. There is also a great deal of flexibility in the modes that it can be employed. It is the responsibility of the user to understand this.

Respectfully, I suggest you read all of the documentation regarding the cache, specifically the initial conditions (i.e. power up, and post reset state), and the setup and initialization procedures. The PPC is remarkably easy to use. We have come across a few gotchas along the way, but no more than usual. The documentation from IBM for this piece of IP is extremely complete. And the performance of the PPC405 is not a little more than that of the MB; it is many multiples for compute-intensive (as opposed to IO-intensive) applications, so long as the caches are turned on.

The last thing I wonder about is this: Why the fascination with the GSRD design as a starting point? The power of this design comes from the fact that it allows a system with a memory bank that is capable of providing multiples of the easy to get to 800MB/sec performance of a PLB bus, to multiple such sources. With a x16 DDR configuration, and 400Mb/pin, you would only be capable of filling a single PLB bus. PLB bus masters can be extremely compact and easy to implement. And 70%+ utilization of the PLB bus is quite easy to achieve with quad-doubleword transfers. Double the burst length and your idle time on the bus will drop 2x.
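The arithmetic behind these numbers can be written out as follows. This is a simplified model, not a description of the actual PLB protocol: it assumes a 64-bit PLB at 100MHz (which is one way to arrive at 800MB/sec) and a fixed number of idle cycles per transfer regardless of burst length. The function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Peak bandwidth in MB/s for `data_bits` data lines, each carrying
 * `mbit_per_line` Mb/s. */
static uint32_t peak_mbytes_per_s(uint32_t data_bits, uint32_t mbit_per_line)
{
    assert(data_bits % 8u == 0u);
    return (data_bits / 8u) * mbit_per_line;
}

/* Fraction of bus cycles that carry data, for bursts of `beats` data
 * cycles with a fixed `overhead_cycles` of idle time per burst. */
static double utilization(double beats, double overhead_cycles)
{
    return beats / (beats + overhead_cycles);
}

/* Idle cycles paid per doubleword moved. */
static double idle_cycles_per_beat(double beats, double overhead_cycles)
{
    return overhead_cycles / beats;
}
```

A x16 DDR bank at 400Mb/pin gives 16 / 8 * 400 = 800MB/s, the same as the assumed 64-bit, 100MHz PLB. In this model, 70% utilization with 4-beat (quad-doubleword) bursts corresponds to about 1.7 idle cycles per burst; going to 8-beat bursts halves the idle cycles per doubleword and raises utilization to about 82%.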

The topic of the IBM CoreConnect buses seems to constantly come up in this newsgroup. The way in which Xilinx chose to implement the interfaces with the IPIF packages is not necessarily the simplest or most efficient. The CoreConnect buses just aren't that complicated.

If we were to release an application note (or XCell Journal article) that described how to architect an efficient CoreConnect-based system with very high throughput, and that also included basic VHDL behavioral code implementing the following most basic cores:

OPB Slave that can be 32-bit read/written
OPB Slave (for setup) / PLB Master that does quad-doubleword reads

would that help with all of this silliness of new users being drawn to these overly complicated reference designs? Our time is a little short these days, so I make no promises; it just seems that the same sort of issues come up repeatedly.

Regards, Erik.

--
Erik Widding
President
Reply to
Erik Widding

Well, I can only speak for myself and I'm no hardware engineer. Most of what I do is software, and in my job we had the question: are FPGAs a way to easily build and evaluate different SoC designs and to get some speedup (e.g. TCP offloading, pattern recognition on the fly etc.), and is there a stable process to get these designs? Our first guess was the EDK/SOPC/whatever way of doing things. We found out very quickly that the included IP doesn't cover all of our needs. Then we wanted to know how complex it is to build our own design, something which stresses some hardware and is easily explained to the higher ranks. The easiest one is (from my perspective) an adapted GSRD example: most boards have Ethernet on board, stacks are freely available etc. We added some RGMII support for our ML410 and got it running (using lwIP, TCP offloading etc., except when using the ICache...).

I'm sure it is possible to build highly efficient designs with respect to latency and throughput on the buses, but these designs need some major experience (like you seem to have :-) ) and don't allow using common off-the-shelf IP. I don't want to care about slow IPIF implementations or special access patterns/scheduling of functions to get the most out of some hardware, but I know this won't happen for some time... Nevertheless I think EDK is going in the right direction, allowing non-EEs like me to use FPGAs for system evaluation.

Reply to
cpmetz

...

If you're suggesting the OP do this, it would probably be helpful (it would certainly be helpful to me!) to name or even link the most useful document(s) or chapter(s). It is all too easy to get lost in the documentation - either not even knowing which one to start with, or thinking "I saw that somewhere, three months ago ... but where?"

[...]

I would add to this list, the most basic OPB master capable of bursting data to/from another slave. (Along the lines of the "initiator" in the PCI ping example)

VERY VERY much so!

I spent several unproductive weeks trying to figure out where to start. (It didn't help that I started from a vendor-specific example design which didn't port to the EDK version I was using, but that's another story)

I just assumed from the sheer volume of ready-made interface designs that there must be some non-obvious difficulty with interfacing directly to the buses, so the best place to start was one of the ready-made designs (OPB_IPIF in my case). A bare-metal example proving otherwise would have been a goldmine.

Even then, deciding which version to use, I picked the "wrong" one. Newer = better, right? So choose version 3.01 over the (obsolescent?) version 2.xx. But 2.xx supports bus mastering (but not slave burst transfers) and 3.01 supports burst transfers but not bus mastering.

As a new user, it was a maze, and there really wasn't the time to get to grips with it all.

A year later, I will have to re-visit the design sometime, to clear up the mess (it works, but performance suffers) so even now I'd find this appnote extremely useful.

I'm going to guess, approx half the difficulty will be in creating the .mpd, .pao etc hooks to enable its use in EDK?

One vote for the article...

- Brian

Reply to
Brian Drummond
