Help! Processing power needed...

Hi Everyone

I'm reposting some of this because the original subject title didn't cover
some of the questions.

I have an application which is currently running on four AD 21060 SHARCs. I
want to replace these with a single processor, and have some PPC experience.

The app is large-ish (1Mbyte) and very cache-unfriendly. When running on an
MPC8245, a typical fragment executes in 25us (twice as fast as a single
SHARC) after cache is flushed and invalidated, or 4us if allowed to loop (it
all fits in cache). This indicates a rather low hit rate, not surprising since
the app contains very few loops and few multiple data references - it's like
a long, ragged piece of string which seldom visits the same place twice. I
need something that will run this code twice as quickly.

I need to escape from the slow random-access SDRAM problem. One solution may
be to use something like an MPC8540, the 256k L2 cache of which could be
configured to hold 1/8 of the critical code and data in the app. Anyone know
how much faster than SDRAM this is? The processor together with its L1 cache
would then give me the performance of two SHARCs for less critical code.
This approach seems a bit close to the limit for comfort.

Another possibility would be an MPC7448 (can you buy these yet?) with 1M of L2
cache configured as private RAM. A 7448 looks attractive for its low power
(could I run it slow and still benefit from fast internal RAM?). Again,
anyone know how fast the 7448 L2 cache would be as compared to SDRAM? Would
it be easy to put a few meg of fast SRAM on the MPX bus?

TIA, cheers
Geoff





Re: Help! Processing power needed...



[...]

So what you need is not more processing power, as your Subject line
says, but rather more memory power (in particular, lower latency).

From that point of view, it may be worth noting that just because a
piece of code rarely visits the same place more than once by no means
implies this piece of code must be seriously cache-unfriendly.  It's
all in the sequence of those visits to various places, and how that
sequence fits in with the expectations that went into designing the
caching strategies of the memory subsystem.  The trick is to make your
code behave in a way the cache designers accounted for.

In short, it seems like a solid dose of memory access pattern
optimization (i.e. straightening out your 'ragged string' somewhat)
might be able to help.  It may even be cheaper than adding a
high-performance memory subsystem.

--
Hans-Bernhard Broeker ( snipped-for-privacy@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.

Re: Help! Processing power needed...


Thanks for the reply, Hans. Yes, you're absolutely right, it's all about
memory.

The app is a control system in which, every millisecond, a thread executes
about 80000 instructions (this is done in 4 processors at present), then
goes to sleep until the next millisecond tick, allowing a non-time-critical
foreground thread to run. This code contains few loops, and the data it
accesses are seldom accessed more than once. In this case, is it not true
that even if the code and data are all allocated optimally, the only cache
hits will be on the code and data contained in the cache lines most recently
fetched?

There's a lot of critical legacy code in this app. I don't know if we'll be
able to achieve a x2 improvement by rewriting it (which would carry a degree
of risk that might well be unacceptable to management).




Re: Help! Processing power needed...


Of course.  Thus, the trick is to know which those are, and to take
advantage of them, i.e. write software such that it uses the data the
hardware already put into a cache pre-fetch line by itself, whenever
it can.  Maybe fine-tune the caching strategy, particularly the
pre-fetches, if your CPU allows that.  This is tricky business, sure,
but if you can pull it off, it's sure worth the try.


Your preliminary measurements suggest you have a headroom of about a
factor of 6.  But yes, predictions are always risky, especially those
concerning the future, and even more so without seeing the source code
in question.


I don't quite see how a port to a different architecture would incur a
significantly smaller risk than a rewrite of the software.  But that's
assuming you're not already betting the farm on software that you
don't actually understand well enough to be able to re-write it from
scratch, if needed.

--
Hans-Bernhard Broeker ( snipped-for-privacy@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.

Re: Help! Processing power needed...
We know all about this code. We've ported this code (now 13M of source)
pretty much once every 2 years for the last 15 years to all sorts of weird
and wonderful architectures. The time has come to do it again. So the choice
is between porting, and porting _and_ rewriting :-)

The app controls a well-known F1 car. Gets modified a lot too - about 70
versions per season.

Have a good weekend, and thanks for your interest!

Cheers
Geoff



Re: Help! Processing power needed...

Geoff,

you have not said how much data memory your application uses. If it is not
too much, why not use SRAM instead of SDRAM? Normally SDRAM is
chosen because it is much cheaper, but I guess that this is not the most
critical issue in your application.

I quickly checked Samsung's website. There are SRAMs with capacities
up to 72Mbit and speeds down to 2.3ns. I guess your PPC should be
able to connect to SRAM (ROM/Flash mode?).

If there is a lot of random data access just disable the cache for these
regions, to avoid the unnecessary cache-line (16 byte) accesses.

Yours
- Rene




Re: Help! Processing power needed...

Consider using a linker script to fit all of your main loops into L1 cache.


Try some dcbt (or dcbtst) instructions.  The key is to avoid stalling on
memory references.  If you can identify them a few loops early then you
can be executing instructions while the data is fetched.


More importantly it has a DDR controller so you could use (up to) 333MHz
memory.

--
Ben Jackson
Re: Help! Processing power needed...

DDR will not help in Geoff's case, as the latency is the same as for SDR.

- René




Re: Help! Processing power needed...
On Fri, 3 Dec 2004 16:11:58 +0100, "Geoffrey Mortimer"


I presume it's running on an AD14060 which is 4x21060s in 1 package,
sharing 16Mbit SRAM? If not, then that's a possibility to look at.

Staying with single core Analog Devices, the ADSP-TS201 TigerSharc
runs at 500/600MHz and has 24Mbit of on-chip DRAM in 6 banks. On the
down side it is $186 but should have plenty of spare horsepower. The
data sheet is a bit short on on-chip latencies unfortunately. I know
from racing cars I have worked on that heat soak can be a problem, so
would having an industrial/mil version be a distinct advantage?

As an aside, 8 seems a lot of hardware generations to go through for
the same code. I suppose if it ain't broke don't fix it, but perhaps if
it ain't scarlet it just might be broke and time to fix it.

Mike

