Help! Processing power needed...

Hi Everyone

I'm reposting some of this because the original subject title didn't cover some of the questions.

I have an application which currently runs on four ADSP-21060 SHARCs. I want to replace these with a single processor; I have some PPC experience.

The app is large-ish (1 Mbyte) and very cache-unfriendly. When running on an MPC8245, a typical fragment executes in 25 us (twice as fast as a single SHARC) after the cache is flushed and invalidated, or 4 us if allowed to loop (it all fits in cache). This indicates a rather low hit rate, which is not surprising since the app contains very few loops and few repeated data references: it's like a long, ragged piece of string which seldom visits the same place twice. I need something that will run this code twice as quickly.

I need to escape from the slow random-access SDRAM problem. One solution may be to use something like an MPC8540, the 256k L2 cache of which could be configured to hold 1/8 of the critical code and data in the app. Anyone know how much faster than SDRAM this is? The processor together with its L1 cache would then give me the performance of two SHARCs for less critical code. This approach seems a bit close to the limit for comfort.

Another possibility would be an MPC7448 (can you buy these yet?) with 1M of L2 cache configured as private RAM. A 7448 looks attractive for its low power (could I run it slow and still benefit from fast internal RAM?). Again, anyone know how fast the 7448 L2 cache would be as compared to SDRAM? Would it be easy to put a few meg of fast SRAM on the MPX bus?

TIA, cheers Geoff

Reply to
Geoffrey Mortimer

[...]

So what you need is not more processing power, as your Subject line says, but rather more memory power (in particular, lower latency).

From that point of view, it may be worth noting that just because a piece of code rarely visits the same place more than once by no means implies this piece of code must be seriously cache-unfriendly. It's all in the sequence of those visits to various places, and how that sequence fits in with the expectations that went into designing the caching strategies of the memory subsystem. The trick is to make your code behave in a way the cache designers accounted for.

In short, it seems like a solid dose of memory access pattern optimization (i.e. straightening out your 'ragged string' somewhat) might be able to help. It may even be cheaper than adding a high-performance memory subsystem.

--
Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.
Reply to
Hans-Bernhard Broeker


Thanks for the reply, Hans. Yes, you're absolutely right, it's all about memory.

The app is a control system in which, every millisecond, a thread executes about 80000 instructions (this is done in 4 processors at present), then goes to sleep until the next millisecond tick, allowing a non-time-critical foreground thread to run. This code contains few loops, and the data it accesses are seldom accessed more than once. In this case, is it not true that even if the code and data are all allocated optimally, the only cache hits will be on the code and data contained in the cache lines most recently fetched?

There's a lot of critical legacy code in this app. I don't know if we'll be able to achieve a 2x improvement by rewriting it (which would carry a degree of risk that might well be unacceptable to management).

Reply to
Geoffrey Mortimer

Of course. Thus the trick is to know which those are, and to take advantage of them, i.e. write the software such that it uses the data the hardware already put into a prefetched cache line by itself, whenever it can. Maybe fine-tune the caching strategy, particularly the pre-fetches, if your CPU allows that. This is tricky business, sure, but if you can pull it off, it's surely worth a try.

Your preliminary measurements suggest you have a headroom of about a factor of 6. But yes, predictions are always risky, especially those concerning the future, and even more so without seeing the source code in question.

I don't quite see how a port to a different architecture would incur a significantly smaller risk than a rewrite of the software. But that's assuming you're not already betting the farm on software that you don't actually understand well enough to be able to re-write it from scratch, if needed.

--
Hans-Bernhard Broeker (broeker@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.
Reply to
Hans-Bernhard Broeker

We know all about this code. We've ported this code (now 13M of source) pretty much once every 2 years for the last 15 years to all sorts of weird and wonderful architectures. The time has come to do it again. So the choice is between porting, and porting _and_ rewriting :-)

The app controls a well-known F1 car. Gets modified a lot too - about 70 versions per season.

Have a good weekend, and thanks for your interest!

Cheers Geoff

Reply to
Geoffrey Mortimer

Consider using a link script to fit all of your main loops into L1 cache.
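For example, with GNU ld something like the fragment below gathers the hot routines into one contiguous region so they map into L1 without conflicting with each other (a sketch: the section name, the address, and tagging functions with `__attribute__((section(".text.hot")))` are all placeholders for whatever your toolchain and memory map actually use):

```
SECTIONS
{
  /* Hypothetical region for the time-critical code, placed so its
     lines do not alias the rest of .text in the cache. */
  .text.hot 0x00100000 :
  {
    *(.text.hot)       /* functions tagged into the .text.hot section */
  }

  .text :
  {
    *(.text .text.*)   /* everything else */
  }
}
```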

Try some dcbt (or dcbtst, for data you're about to store) instructions. The key is to avoid stalling on memory references. If you can identify them a few iterations early, then you can keep executing instructions while the data is fetched.

More importantly, the MPC8540 has a DDR controller, so you could use (up to) 333 MHz DDR memory.

--
Ben Jackson

http://www.ben.com/
Reply to
Ben Jackson

"Geoffrey Mortimer" schrieb im Newsbeitrag news: snipped-for-privacy@individual.net...

Geoff,

You have not said how much data memory your application uses. If it is not too much, why not use SRAM instead of SDRAM? Normally SDRAM is chosen because it is much cheaper, but I guess that cost is not the most critical issue in your application.

I quickly checked Samsung's website: there are SRAMs with capacities up to 72 Mbit and access times down to 2.3 ns. I guess your PPC should be able to connect to SRAM (ROM/Flash mode?).

If there is a lot of random data access, just disable the cache for those regions, to avoid fetching a whole cache line for each access.

Yours

- Rene

Reply to
Rene

"Ben Jackson" schrieb im Newsbeitrag news:a87sd.129856$V41.124594@attbi_s52...

DDR will not help in Geoff's case, as the latency is the same as for SDR.

- René

Reply to
Rene

I presume it's running on an AD14060, which is 4x 21060s in one package sharing 16 Mbit of SRAM? If not, then that's a possibility to look at.

Staying with single-core Analog Devices parts, the ADSP-TS201 TigerSHARC runs at 500/600 MHz and has 24 Mbit of on-chip DRAM in 6 banks. On the down side it is $186, but it should have plenty of spare horsepower. The data sheet is a bit short on on-chip latencies, unfortunately. I know from racing cars I have worked on that heat soak can be a problem, so would having an industrial/mil version be a distinct advantage?

As an aside, eight seems a lot of hardware generations to go through for the same code. I suppose if it ain't broke, don't fix it; but perhaps if it ain't scarlet it just might be broke, and time to fix it.

Mike

Reply to
MSC
