What applications require 500 MHz embedded processors?

- jade

Sun, Mar 5, 2006 5:52 AM

ARM11 and MIPS 24K both provide clock rates up to 500 MHz or even 550 MHz. That's an incredibly high frequency and thus very high performance. But I wonder what kind of application requires such a high frequency? I don't think handheld products will carry server-level applications. 200 MHz should be enough for most applications that run on mobile products.

Any idea?

Thanks.

Jade.

- Alex Gibson

Sun, Mar 5, 2006 7:43 AM

Mobile phones with video, an integrated PDA, etc., like the Samsung chips in some phone PDAs.

Look at some of the TI OMAP chips, the DM6446 and the like.

If you want to do video and simultaneous audio processing while still handling the rest of the usual phone and PDA type apps, you need a reasonably fast processor.

One of the new model Japanese phones has at least 4-way video conferencing using MPEG-4.

With a lot of mobile phone apps written in Java, even with hardware assist you need a reasonably fast processor. Same with the mobile phones running Windows, of whatever version.

TV / video playback is the latest feature coming with phones. Some newer phone chips have separate 2D and 3D graphics units.

One of the more powerful ones has 3D that is more powerful/better than the original 3dfx Voodoo chips.

It seems the faster the chip in a phone, the larger the software and the more bells and whistles in it.

OMAP and other phone chips are already "dual core"; it's just that the cores are different.

Alex

- Leon

Sun, Mar 5, 2006 10:51 AM

Mobile phones?

Leon

- larwe

Sun, Mar 5, 2006 11:35 AM

Try decoding DVD-resolution video in software on a 200MHz processor and see how far that gets you. Java interpreters. 3D rendering.
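To put a rough number on the video point, here is a back-of-envelope cycle budget. The 720x480 frame size and 30 frames/s are assumed figures for NTSC DVD video (the post does not state them), and the function name is made up for illustration.

```python
def cycles_per_pixel(clock_hz, width, height, fps):
    """Average CPU cycles available for each decoded pixel."""
    pixels_per_second = width * height * fps
    return clock_hz / pixels_per_second

# Assumed figures: 200 MHz core, 720x480 frames at 30 frames/s.
budget = cycles_per_pixel(200e6, 720, 480, 30)
print(f"{budget:.1f} cycles per pixel")  # 19.3 cycles per pixel
```

Under those assumptions, roughly 19 cycles per pixel have to cover everything the decoder does, before the OS or any other application gets a cycle.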

Applications chase high-performance chips, not the other way around.

- Wilco Dijkstra

Sun, Mar 5, 2006 2:43 PM

It's not that fast compared to high-end embedded (1+ GHz in networking). The ARM1176 can do 750 MHz, XScale goes up to 800. The next generation is superscalar and will do 1 GHz. Interestingly the efficiency per MHz is improving, so a 1 GHz Cortex-A8 is ~3 times as fast as a 500 MHz ARM11. If you count SIMD extensions, the factor is higher.

One of the reasons for using higher performance CPUs is removing the need for dedicated hardware or a separate DSP to do stuff like modem/audio/video processing. 3G is also a lot more complex, so needs more power. Mobiles are fast becoming PDAs, digital cameras, TVs, movie players, camcorders and games consoles all in one package, while screen sizes, quality and bandwidth are increasing quickly. Software is also becoming more complex as a result, with Windows CE being used in many phones. Java never runs fast enough. All this requires a lot more performance...

Wilco

- Everett M. Greene

Sun, Mar 5, 2006 5:21 PM

Which demonstrates that if you throw enough software inefficiencies at a processor, you can kill its performance no matter how fast it is.

- larwe

Sun, Mar 5, 2006 8:54 PM

You've seen the Birth sketch from the start of Monty Python's Meaning of Life, right? Where the hospital administrator comes in, knows nothing about anything except that they sold the "ping" machine back to the company they bought it from, then leased it back so that the cost appears on the expenses and not the capital account?

By the exact same principle, semiskilled development staff - programming day laborers - are used to develop commercial applications. The BOM cost of the product increases due to the need for faster CPUs and more RAM, but the per-hour engineering rate decreases. Thus, a savings is announced.

- jade

Mon, Mar 6, 2006 1:32 AM

Using a general-purpose processor to do DSP doesn't seem right. The general tasks that run on a processor are varied and unpredictable, but DSP tasks are quite uniform in algorithm and implementation.

So I think it's justified to offload the heavy DSP load to a separate co-processor instead of running it purely in software on the processor.

Do you have examples of current designs that use the processor to do DSP?

Thanks!

- David Brown

Mon, Mar 6, 2006 8:56 AM

Wirth's law: Software gets slower faster than hardware gets faster.

- Leon

Mon, Mar 6, 2006 10:08 AM

A separate DSP is generally not needed as most of these newer chips (like ARMs) have some DSP instructions.

Leon

- Wilco Dijkstra

Mon, Mar 6, 2006 10:28 AM

That used to be the case, but nowadays most general purpose CPUs have added DSP features which make them reasonable DSPs. Perhaps not as good as a dedicated DSP; however, the bulk of the code is still general purpose, so it makes more sense to improve the DSP capabilities of a general purpose CPU than the other way around.

One of the driving factors of adding DSP capabilities to general purpose CPUs is to remove the need for a separate DSP, which adds a lot of extra cost to a project (think of needing 2 teams to do development, 2 sets of development tools, more complex hardware interconnect, higher cost of product, higher power consumption etc).

A general purpose CPU can be relatively easily modified to add DSP capabilities. For example, an ARM11 can do 2 16-bit MACs and load/store 4 16-bit values per cycle.
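As an illustration of what a dual 16-bit MAC computes, here is a small model in Python. It assumes the operation meant is along the lines of the ARMv6 SMLAD instruction (two signed 16x16 products from packed halfwords added to an accumulator); the helper names are hypothetical, and unlike the hardware this sketch ignores overflow of the 32-bit accumulator.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a signed two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def pack16(lo, hi):
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def dual_mac(acc, a_word, b_word):
    """Return acc + a.lo*b.lo + a.hi*b.hi for two packed 32-bit words."""
    lo = to_signed16(a_word) * to_signed16(b_word)
    hi = to_signed16(a_word >> 16) * to_signed16(b_word >> 16)
    return acc + lo + hi

def dot16(xs, ys):
    """Dot product of two equal, even-length lists of 16-bit samples."""
    acc = 0
    for i in range(0, len(xs), 2):
        acc = dual_mac(acc, pack16(xs[i], xs[i + 1]), pack16(ys[i], ys[i + 1]))
    return acc

print(dot16([1, -2, 3, 4], [5, 6, -7, 8]))  # 5 - 12 - 21 + 32 = 4
```

Each call to `dual_mac` stands for one instruction, so a dot product over N samples needs N/2 of them, and one 64-bit load supplies four 16-bit operands.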

Most hard discs use ARM9E rather than DSPs nowadays (the head flying code is definitely hard realtime).

Wilco

- Jim Granville

Mon, Mar 6, 2006 6:46 PM

Another issue with a separate DSP is deciding how much resource to give it. The code RAM is likely to cost more die area than the DSP core, and gives a rather hard ceiling, so it's hard to spec a generic device this way.

However, if the market is large enough and the task well defined, you WILL find very specific co-processors doing the DSP stuff. This will always be lower power than spinning all that external memory and bus lines. For examples, look at the new MP3 and MPEG chips, often with ARMs alongside.

-jg

- Didi

Mon, Mar 6, 2006 7:06 PM

Perhaps not so easy. Today's DSPs do multiple memory accesses per cycle to do MMAC; e.g. the TI 54xx can fetch a coefficient with address increment, fetch data with address increment, and write back with address increment in a single cycle... A 500 MHz general purpose CPU (even a PPC) has no chance of matching a 100 MHz DSP when it comes to brute force MMAC, which is what DSPs are all about.

Dimiter

------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
------------------------------------------------------

- Wilco Dijkstra

Mon, Mar 6, 2006 9:22 PM

That is 1 MAC and 3 16-bit reads/writes per cycle. The ARM11 can do 2 MACs and read/write 4 16-bit values per cycle including address increment. Which is faster?

Well, if you look at the DSP scores on BDTI.com, an ARM11 is 12% faster than a C54xx running at the same frequency.

That's clearly wrong. A 100 MHz DSP like the C54xx is about 5.6 times as _slow_ as a 500 MHz ARM11... The magical brute force of DSPs is greatly exaggerated: general purpose CPUs are currently beating all but the very high-end DSPs. Even that is under threat. I wonder how a 1 GHz Cortex-A8 stacks up against a 1 GHz C64x+?
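The 5.6 figure follows from the two numbers in this post, assuming the 12% same-frequency advantage scales linearly with clock speed (a simplification that ignores memory effects). A minimal sketch, with a made-up function name:

```python
def relative_speed(per_mhz_ratio, clock_a_mhz, clock_b_mhz):
    """Speed of A relative to B, given A's per-MHz advantage and both clocks."""
    return per_mhz_ratio * clock_a_mhz / clock_b_mhz

# ARM11 at 500 MHz versus C54xx at 100 MHz, ARM11 12% faster per MHz.
ratio = relative_speed(1.12, 500, 100)
print(f"{ratio:.1f}x")  # 5.6x
```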

Wilco

- larwe

Mon, Mar 6, 2006 9:22 PM

What are the MIPS/mW figures like?

- Wilco Dijkstra

Mon, Mar 6, 2006 10:07 PM

Yes, and it is becoming a problem in the embedded world too unfortunately... I guess desktop software is moving down with its wasteful attitude. I also suspect few graduates nowadays start on a tiny 8-bit system where every byte and every single cycle matters. Equally few understand the details of programming in high level languages well enough to accurately predict resource usage.

One of the fallacies that proponents of wasteful programming repeat is that nothing is a bottleneck until proven so - however that's the wrong way around... Even the least expected part of a program can become a serious bottleneck. I've seen a command-line parser of a compiler slowing down by 3 orders of magnitude due to STL strings. Parsing the command-line was slower than running the back-end at full optimization...

Wilco

- Didi

Tue, Mar 7, 2006 12:06 AM

Wilco,

I am not intimately familiar with ARM so you might be right after all. However, I need to ask: how many address and data busses has the ARM in question so it can read/write 3 independent address areas simultaneously? What memory interface does it use so it can read/write to memory in a single cycle (to all 3 busses)?

But to make the comparison fairer, let us compare a practical case - something I have done on a 5420. The sampled signal is 14 bits, there are 9.2 MSPS continuously running and DMA buffered without missing a single sample on a circular queue in the on-chip memory. Parallel with the sampling, the DSP does some filtering to recognize events using about 90% of its theoretical MAC bandwidth; the remaining bandwidth goes on qualifying found events and programming yet another of the many DMACs to pass enough of the samples surrounding the event to the other DSP core. This takes exactly a 100 MHz clocked 54xx core, all memory being internal, in fact you can safely claim there is no external hardware of interest related to the comparison.
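For scale, the filtering load described above works out to roughly ten MACs per incoming sample. This sketch assumes one MAC per cycle for the 54xx core (the figure quoted elsewhere in the thread); the function name is hypothetical.

```python
def macs_per_sample(clock_hz, macs_per_cycle, utilisation, sample_rate_hz):
    """MAC operations available for each incoming sample."""
    return clock_hz * macs_per_cycle * utilisation / sample_rate_hz

# 100 MHz core, 1 MAC/cycle (assumed), 90% utilisation, 9.2 MSPS input.
print(round(macs_per_sample(100e6, 1, 0.90, 9.2e6), 1))  # 9.8
```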

Can you do that with a 500 MHz ARM? I know you cannot do it with a 500 MHz PPC, which is what I am familiar with (memory latencies will kill you). Remember, all this is done in real time, no missed samples, no missed events. When you add the second core's consumption (also almost 100% busy, but this one only under the toughest conditions), you get about 300 mW...

Dimiter

------------------------------------------------------
Dimiter Popoff Transgalactic Instruments
------------------------------------------------------

- Wilco Dijkstra

Tue, Mar 7, 2006 10:56 PM

According to TI numbers the lowest power C54xx uses around 0.6 mW/MHz (160 MHz max, not sure what process). It's measured as 50% NOPs, 50% MACs (I'm not convinced that is typical). ARM11 uses 0.8 mW/MHz (500 MHz at 130 nm; with IEM it becomes 0.5 mW/MHz). So the C54 has a ~16% advantage in energy per task.
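The ~16% can be reproduced as follows, assuming energy per task scales as power per MHz divided by work per MHz, and taking the ARM11 as 12% faster per MHz (the BDTI comparison mentioned earlier in the thread). The function and variable names are made up for illustration, and the sketch ignores the process and memory differences between the two parts.

```python
def energy_per_task(mw_per_mhz, perf_per_mhz):
    """Relative energy for a fixed job: power per MHz over work per MHz."""
    return mw_per_mhz / perf_per_mhz

c54 = energy_per_task(0.6, 1.00)        # C54xx as the baseline
arm11 = energy_per_task(0.8, 1.12)      # ARM11 without IEM
arm11_iem = energy_per_task(0.5, 1.12)  # ARM11 with IEM

print(f"C54xx saves {1 - c54 / arm11:.0%} versus ARM11")             # 16%
print(f"ARM11 with IEM saves {1 - arm11_iem / c54:.0%} versus C54xx")  # 26%
```

On the same assumptions, the 0.5 mW/MHz IEM figure would put the ARM11 ahead.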

So general purpose processors are competitive but not quite there yet. A superscalar CPU with a lower maximum frequency will be more power efficient than an ARM11 while achieving similar performance. It will be interesting to see how good Cortex-R4 turns out...

Wilco

- Jim Granville

Tue, Mar 7, 2006 11:33 PM

Another thing to watch is whether the values for the C54xx include memory (probably, as it is on-chip?) and whether the ARM11 ones include it (probably not, as that is off-chip?). So the cores themselves might be in the same ballpark, but what about the system figures?

-jg

- Wilco Dijkstra

Wed, Mar 8, 2006 12:58 AM

Internally the ARM11 has an L2 subsystem with 4 independent 64-bit buses which can run at full core speed, plus a 32-bit peripheral bus. The L1 system is Harvard with separate 64-bit buses connecting to the core and L2 system. All of these can work in parallel. However that wasn't what I was talking about.

An ARM11 can't execute 3 independent loads/stores per cycle, every cycle. But it can read or write 64 bits per cycle. In 4 cycles it can read 4x4 16-bit values from 4 independent addresses. So on streaming data it effectively achieves the same throughput as 4 independent 16-bit memory accesses without using XYZ memories. ARM11 can also issue a load multiple instruction and continue to do computations while the memory system fetches the data in the background, even on a cache miss.

Yes, from what you say it could do it in 100 MHz. If the ADC has a FIFO you could soft-DMA the samples directly into local memory or cache; something like 256 bytes would give an interrupt rate of 100K, taking about 10 MHz. Using the built-in DMA allows a smaller buffer and has virtually no overhead. If all else fails, the samples could be written to main memory in 32-byte blocks, as the required bandwidth of 20 MBytes/s is tiny.
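A rough check of those figures, assuming each 14-bit sample is stored in a 16-bit word and that servicing one interrupt costs about 100 cycles (the cost implied by 100K interrupts taking 10 MHz). Both are assumptions, and the function name is hypothetical.

```python
def adc_stream_budget(sample_rate_hz, bytes_per_sample, block_bytes, cycles_per_irq):
    """Return (bytes/s, interrupts/s, CPU cycles/s spent in interrupts)."""
    bytes_per_s = sample_rate_hz * bytes_per_sample
    irq_per_s = bytes_per_s / block_bytes
    return bytes_per_s, irq_per_s, irq_per_s * cycles_per_irq

# 9.2 MSPS, 2 bytes per sample, 256-byte blocks, ~100 cycles per interrupt.
bw, irqs, cycles = adc_stream_budget(9.2e6, 2, 256, 100)
print(f"{bw / 1e6:.1f} MB/s, {irqs:.0f} interrupts/s, {cycles / 1e6:.1f} MHz")
# 18.4 MB/s, 71875 interrupts/s, 7.2 MHz
```

On these assumptions the 20 MBytes/s, 100K and 10 MHz in the post read as rounded-up versions of the same numbers.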

Doing 100M MACs would take less than 100 MHz if all data is in the L1 cache (which can be locked) or local memory. Using SDRAM or L2 costs a bit more, but the data can be streamed into local memory using the DMA or using software prefetch. If everything fits in the caches/local memory, doing it in 100 MHz is achievable.
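A sketch of the clock needed for a given MAC rate, taking the 2 MACs per cycle quoted earlier in the thread as the peak. The `efficiency` parameter (the fraction of that peak actually sustained) is a hypothetical knob, not a measured value.

```python
def required_clock_mhz(macs_per_second, peak_macs_per_cycle, efficiency=1.0):
    """Clock in MHz needed to sustain a MAC rate at a fraction of peak."""
    return macs_per_second / (peak_macs_per_cycle * efficiency) / 1e6

print(required_clock_mhz(100e6, 2))       # 50.0 at full peak rate
print(required_clock_mhz(100e6, 2, 0.5))  # 100.0 at half the peak rate
```

So 100M MACs per second fit within 100 MHz as long as the loop sustains at least half of the peak MAC rate.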

I can imagine a PPC without Altivec has a problem doing MACs. Also it may not have good enough cache lockdown/local memory facilities so worst case latencies may just be too high. I don't know which PPC you meant, but ARM11 was designed for precisely this kind of realtime on-chip processing.

Wilco