What application requires 500 MHz for embedded processors?

While it is true that general purpose processors deliver more and more performance, comparing two particular architectures takes a lot more knowledge than a list of parametric specs provides. This kind of general talk, quoting numbers with no context, is quite common in itself - has been for decades - but can be misleading to beginners, so be warned :-).

Dimiter

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

formatting link

------------------------------------------------------

Reply to
Didi

Wilco, here we go into a detailed comparison... It may be tougher to do than exchanging a few postings, but having warned the beginners to beware of reading our stuff and to think on their own, let's give it a try.

This is a key issue. You do not need 64-bit accesses; you do need to access a table of coefficients - probably at a static address, but it can be lengthy (kilowords) - an area with the incoming data, and an area with the filtered results. This makes 3 *independent* accesses per cycle, plus program fetch, plus one DMA moving data over one of these buses (some of the memory allows two accesses per cycle, and this is very handy), plus another DMA doing other work.

The way you describe the ARM in question, I would say it might be able to manage the example - given the external memory interface is fast enough (DDR2 should do, I suppose). But this has to be proven - you know how it is, the devil is in the detail. Then again, I did the 5420 design 5 years ago; I might have made a different choice today. Like I said in another posting, it takes more knowledge of that ARM than I now have to be able to tell.

Dimiter


Wilco Dijkstra wrote:

Reply to
Didi

Both include on-chip memory/caches of course. Nobody includes power consumption of external memory as it depends too much on the particular core and memory configuration. If you somehow need to access external memory a lot on a cached core, there is something seriously wrong...

There is no independent company doing power consumption benchmarks, something like that is needed. Measuring energy accurately and fairly is difficult...

Wilco

Reply to
Wilco Dijkstra

Hello Wilco,

The discussion up to here seems beyond my knowledge, but I'd like to know the details. Could you please give some guidance on where I can get them?

I've got pipeline and basic cache knowledge, but not how the two mix and how they relate to performance.

Many thanks.

Reply to
jade

Sure, comparisons between different architectures are difficult, especially if they are built using different design principles (DSP vs RISC, RISC vs CISC, etc.). But that has never stopped me before!

So these are 2 different ways of achieving the same bandwidth: N narrow memory accesses in parallel and a single N-wide access are equivalent. DSPs often go for VLIW, while general purpose CPUs use SIMD. Each of these has advantages and disadvantages. High-end DSPs use both.

There is no need for DDR2; in your particular example there was no need for external storage, as all data movements are done on-chip. With several buses running in parallel at several GBytes/s each, that's fast enough. :-)

You could get 1GByte/s bandwidth using 32-bit DDR SDRAM, which is still way more than the example needs (around 40MBytes/s if the samples are stored in DRAM).

5 years ago there was no ARM11, and while the ARM10 might be able to do it, it lacks many of the ARM11 features, so it would need to run at a much higher frequency, probably 200-250MHz.

Wilco

Reply to
Wilco Dijkstra

If you're interested in (micro)architecture then you need Hennessy and Patterson's "Computer Architecture: A Quantitative Approach". Reading comp.arch and articles on sites like realworldtech.com may be useful too (these are not embedded-specific though).

There are various books about the ARM architecture, e.g. "ARM System Architecture" by Steve Furber, and the "ARM System Developer's Guide". The first is an introduction to the ARM architecture and various implementations, while the second is more software- and optimization-oriented, with lots of highly optimized code examples. There is a DSP chapter, of course, with details on how to write FIR filters and such.

Wilco

Reply to
Wilco Dijkstra

No, no - the addresses of the 3 buses just cannot be tied together; you have to make 3 data accesses per cycle, plus obviously an instruction fetch. The processor has to be able to do 4 simultaneous memory accesses every 10 ns, each being 16 bits wide and having its own address - and allow the DMAC to keep buffering into yet another address area at the same time. I do not know the ARM11, but I considered an MPC5200 - a 400 MHz PPC with DDR - and I estimated it would be far from sufficient. Can you say the ARM11 at 500 MHz has 3 times the power of the 400 MHz PPC (603e core, 32-bit DDR 266)? (That is what I thought would be about enough to warrant further evaluation.)

There is no VLIW in the 54xx series DSPs, and the issue is not opcode cycles - the PPC can do a floating point MAC in a single cycle, and probably the ARM can do the same. The issue is memory bandwidth - not just sequential burst bandwidth, but random accesses to multiple areas. This is the major difference between a DSP and a general purpose processor.

Well, the 5420 has about 200 kilowords - 400 kilobytes - of memory, and my application uses it all. I don't know if the ARM has that much, nor whether the program code will be short enough to fit - I know the PPC does not have it; it would need external memory.

The more you bring me back into it, the more details come back. I would say you might be able to convince me that the 500 MHz ARM can do the job of one 100 MHz 54xx core - provided it has all the memory, or you add the external memory it takes. But there are two 100 MHz cores in the 5420... I tend to think "no chance", and since I am too busy to do all the work it takes to estimate whether I could fit the application into such an ARM chip, I guess we'll have to wait until I design another device of the kind and pick the most suitable part at that moment; then I'll know.

Dimiter


Wilco Dijkstra wrote:

Reply to
Didi

There is no need to do 3 independent accesses per cycle. This is a very inefficient way of increasing bandwidth and that is why modern CPUs increase the width of buses instead. 64-bit is pretty normal these days in the embedded space, and 128-bit is being introduced in the high-end.

With a 64-bit bus you can read 4 16-bit values per cycle, every cycle. This is clearly faster than reading 16 bits from 3 independent addresses per cycle, right?
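To make the point concrete, here is a minimal C sketch of what "4 16-bit values per wide load" looks like at the source level. This is my illustration, not code from the thread; it assumes a little-endian target with an 8-byte-aligned source, and the function name is made up.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: one 64-bit load delivers four 16-bit samples.
   Assumes a little-endian target and an 8-byte-aligned source. */
void load4_s16(const int16_t *src, int16_t out[4])
{
    uint64_t word;
    memcpy(&word, src, sizeof word);        /* compiles to one wide load */
    out[0] = (int16_t)(word & 0xFFFF);
    out[1] = (int16_t)((word >> 16) & 0xFFFF);
    out[2] = (int16_t)((word >> 32) & 0xFFFF);
    out[3] = (int16_t)((word >> 48) & 0xFFFF);
}
```

The unpacking shifts are free on a machine with dual 16-bit MAC instructions, which operate on the packed halves directly.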

Maybe you could show a C snippet of what you do, then I'll show you an equivalent one that doesn't need 3 memory accesses per cycle.

There is no need for several independent accesses per cycle as long as you've got enough bandwidth. 4 16-bit accesses every 10ns is only 800MBytes/s. The data bandwidth between the core and L1 alone is 4GBytes/s on a 500MHz ARM11, for example.

In terms of external bus bandwidth they are the same: 32-bit DDR 266 gives 1GByte/s, like the ARM11. The 603e core is a 2-way out-of-order superscalar, similar in performance to an ARM11 (which is not superscalar but is more modern). The PPC's L1-to-integer-core bandwidth is half that of the ARM11, while its MAC performance is 8 times lower (1 16-bit MAC every 4 cycles)...

So low fixed-point DSP performance is what kills it: it needs 400MHz for 100M MACs. An ARM11 can do this at only 50MHz.

Its FP unit has a single-cycle FMAC just like the ARM11, indeed. Using floats would double the bandwidth and MHz requirements, however. Both CPUs could do 100M FMACs plus 300M 32-bit memory accesses at 200MHz. If you can reuse values in registers then you can significantly reduce the number of memory accesses; this is easy with FIR filters, for example.

Actually, if you do the sums you'll see that general purpose processors have much higher bandwidth than the 5420. I guess you're comparing the external bus bandwidth against the DSP's internal bandwidth. That's incorrect, as most (90+%) of data movement happens between the core and the L1 memory system.

It depends on what variant you use, but you can get an ARM11 with 32KB I & D caches and a 128KByte L2 cache. As long as the working set fits in the caches, the speed of external memory is irrelevant.

Wilco

Reply to
Wilco Dijkstra

This tells me you have never actually done any DSP programming. Please correct me if I am wrong (I certainly mean no offence).

Well, I have had a 64-bit wide PPC (8240/45) communicating with my DSP-based device and other things for over 5 years now...

No. Every 16 bit value has a separate address, which is - in the case of the 5420 - another 16 bits. I will not go into explanation why this is so, I guess there are sufficient books on digital signal processing around.

There is no C source to show. I used assembly - a precursor of the VPA language I use nowadays on the PPC. I do not take seriously any programming done in C at all (it could be argued I am wrong, but I shall not enter such a discussion), and I can definitely tell you that using C on a DSP is a waste of time and/or money. Also, I will not go into further details on my device, as it is still unique on the market (my competitors are still doing in analog circuitry the job of one of the two cores I have, and there are algorithms I use which I know they have been keen to guess at for 5 years now...).

Here we go again: you don't want to believe DSPs have been designed as they are out of necessity. OK, I'll try a general example. There is an area - say, 4 kilobytes - with the coefficients, there is a circular queue - say, 64k - with the incoming data, and there is another circular queue - again 64k - with the filtered results. All are 16 bits wide; you do 4k MACs per sample, each time starting one address further into the input queue, and write the result to the output queue. Can you tell me how you do this without separate addresses (especially on the ARM, where registers are so scarce)?
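For readers following along, the access pattern being described might be sketched in C roughly like this. This is purely illustrative (the real code is assembly, as noted above); the sizes follow the numbers in the example, and the scaling shift is an assumption of mine.

```c
#include <stdint.h>

/* Rough sketch of the filter described above: each output sample costs
   ntaps MACs, each MAC reading one coefficient and one queued sample
   from independent addresses. With the numbers from the post:
   ntaps = 4096, qmask = 0x7FFF (a 64KB circular queue of 16-bit
   samples). The >>16 scaling is illustrative only. */
int16_t fir_step(const int16_t *coef, int ntaps,
                 const int16_t *in, unsigned qmask, unsigned head)
{
    int64_t acc = 0;        /* wide accumulator, cf. 40-bit on the 54xx */
    for (int k = 0; k < ntaps; k++)
        acc += (int32_t)coef[k] * in[(head - k) & qmask];
    return (int16_t)(acc >> 16);
}
```

Written this naive way, every MAC needs two data reads from unrelated addresses plus the instruction fetch - which is exactly the multi-bus access pattern the 54xx serves in hardware.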

Finally, let me say this: there are many applications which need a general purpose processor anyway, and with some more power and memory bandwidth a DSP could be made unnecessary there. Perhaps (probably?) bandwidths will get high enough to lead to the extinction of DSPs as we know them. However, there is a long way to go until this happens.

Dimiter


Wilco Dijkstra wrote:

Reply to
Didi

The usual way to handle this on a general purpose processor is to unroll and pack the loads.

Can you explain why this is not applicable in your case?

Steve.

Reply to
Stephen Clarke

Are you sure you read my postings? General purpose processors are applicable, just at a different performance cost. I would estimate a decent 500 MHz RISC of today's generation could perhaps do what a 100 MHz 54xx DSP can do in terms of real-time signal processing. If you would explain (perhaps by an example) how you want to unroll and pack the trivial filtering example I gave, I might be able to explain more.

Dimiter


Stephen Clarke wrote:

Reply to
Didi

I did not intend to ask, "why are general purpose processors not applicable". Rather, I was trying to understand your assertion that you need to do a 16-bit load from three independent addresses every cycle.

The non-DSP orthodoxy is that this is not necessary: you can unroll the loop by 4, merge the loads, and load up to twelve 16-bit values from three independent addresses in three cycles. I.e., even though you can only do one 64-bit load per cycle, over three cycles the effect is equivalent.
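A cut-down sketch of the "unroll by 4, merge the loads" idea, showing two of the three streams (coefficients and input samples). This is my illustration, not thread code; it assumes a little-endian target and aligned pointers, and the names are made up.

```c
#include <stdint.h>
#include <string.h>

/* "Unroll by 4 and merge the loads": two 64-bit loads fetch eight
   16-bit values (four coefficients, four samples) in place of eight
   narrow loads; four MACs then consume them. Little-endian and
   aligned pointers assumed. */
int32_t dot4(const int16_t *coef, const int16_t *x)
{
    uint64_t c, v;
    memcpy(&c, coef, sizeof c);      /* one wide load, 4 coefficients */
    memcpy(&v, x, sizeof v);         /* one wide load, 4 samples */
    int32_t acc = 0;
    for (int i = 0; i < 4; i++)
        acc += (int16_t)(c >> (16 * i)) * (int16_t)(v >> (16 * i));
    return acc;
}
```

Over four taps this is two loads instead of eight, so the one-load-per-cycle limit stops being the bottleneck.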

I cannot see that you have provided any example code. However, Wilco has already offered to provide an explanation:

> Maybe you could show a C snippet of what you do, then I'll show you
> an equivalent one that doesn't need 3 memory accesses per cycle.

ARM expertise is not often free: someone should take him up on that offer!

Steve.

Reply to
Stephen Clarke

So is DSP and PPC (and much more, for that matter) expertise, some of which I have already given for free in this thread. Enjoy.

Dimiter


Stephen Clarke wrote:

Reply to
Didi

You're wrong. For example I've written a highly optimised JPEG (de)compressor on ARM using software SIMD techniques.

I know why low and mid-end DSPs do this, however there are major limitations with this approach. Alternatives exist which do not have these limitations, and general purpose CPUs use these to improve DSP performance without needing the traditional features of a DSP.

My point is that these alternatives allow modern general purpose CPUs to easily beat traditional DSPs.

It's not necessity, more a particular design approach (like RISC/CISC). It works fine at the low end, but it is simply not scalable. If you follow it as dogma then you'll crash and burn, just like CPUs that were too CISCy or RISCy...

The standard way of doing FIR filters is to block them. This reduces the memory bandwidth requirements by the blocking factor. Here is an example of what a 4x4 blocked filter looks like on ARM11 - since you don't like C, this is ARM assembly language :-)

fir_loop
        LDM     x!, {x45,x67}       ; load 4 16-bit input values and post inc
        SMLAD   a0, x01, c01, a0    ; do 2 16-bit MACs
        SMLAD   a1, x01, d01, a1
        LDM     c!, {c23,d23}       ; load 4 16-bit coefficients
        SMLAD   a2, x23, c01, a2
        SMLAD   a3, x23, d01, a3
        LDM     c!, {c01,d01}       ; load 4 16-bit coefficients
        SMLAD   a0, x23, c23, a0
        SMLAD   a1, x23, d23, a1
        SMLAD   a2, x45, c23, a2
        SMLAD   a3, x45, d23, a3

... repeat another time with x45x01 and x67x23 swapped

        TST     c, #mask            ; test for end of loop
        BNE     fir_loop            ; branch back - 24 instructions total

This code uses 4 accumulators a0-a3, 8 coefficients c0-c3 and d0-d3, 8 input values x0-x7, a coefficient address and an input pointer - 14 registers total (2 16-bit values fit in one 32-bit register). The coefficient array is duplicated to avoid alignment issues and interleaved to avoid the need for a second pointer. There is no need for a loop counter, as we can use the coefficient pointer. The instructions are scheduled to avoid any interlocks.

On ARM11 this computes 8 taps per iteration for 4 outputs (32 MACs) in 24 cycles. In terms of bandwidth, it only does 6 loads every 32 MACs (about 0.2 loads per MAC, or 0.25 loads per cycle). So a 100MHz ARM11 easily outperforms the 5420 at the same frequency.

FIR filters are clearly MAC-bound rather than bandwidth-bound. If we could do 4 MACs per cycle, the loop would go faster. Now why do you insist that you need at least 3 loads per MAC?
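For those who do prefer C, the blocking idea can be rendered roughly as follows. This is my illustrative sketch, not the exact register allocation of the assembly; it shows only the data reuse - matching the quoted cycle counts would still need SMLAD-class dual 16-bit MACs from the compiler.

```c
#include <stdint.h>

/* Blocked FIR sketch: four neighbouring outputs are computed together,
   so every coefficient loaded is reused four times (and most samples
   likewise), cutting the loads-per-MAC ratio by the blocking factor. */
void fir_block4(const int16_t *x, const int16_t *c, int ntaps,
                int32_t out[4])
{
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;   /* 4 accumulators */
    for (int k = 0; k < ntaps; k++) {
        int16_t ck = c[k];                    /* loaded once... */
        a0 += ck * x[k];                      /* ...used 4 times */
        a1 += ck * x[k + 1];
        a2 += ck * x[k + 2];
        a3 += ck * x[k + 3];
    }
    out[0] = a0; out[1] = a1; out[2] = a2; out[3] = a3;
}
```

The blocking factor 4 is the same as in the assembly; larger factors reduce bandwidth further until you run out of registers.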

Wilco

Reply to
Wilco Dijkstra

Depends on application constraints.

Not for some applications.

Always forcing all data through a processor can for some applications cause problems.

.......

Having done various work with real-time video, where the video must have minimal delay and NO non-deterministic delays or stops (i.e. continuous operation) - often because of other limitations of the system (broadcast effects, mixing, scaling, or equipment in loops with eye/hand co-ordination) - there are times when you have to have dedicated hardware, as every pixel of multiple video streams is simultaneously undergoing 24 multiplies and 9 adds at pixel rate. Having done standards conversion and rescaling from input to output in less than 15 input TV lines of delay, most of the delay was changing the start times of active video due to blanking differences.

Often in these types of applications, the blockiness and latency of frame delays can screw things up, as all the delays add up.

There are times when the delay does not matter - still images, or open-loop methodology (e.g. set-top boxes, DVD players, audio players) - but others where the closed-loop nature of the WHOLE system means a DSP or fast processor will not cut it.

Horses for courses, and various other reasons (often internal politics).

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
              GNU H8 & mailing list info
             For those web sites you hate
Reply to
Paul Carpenter

Not at all. I had a look at the ARM11 architecture, and the first thing I saw was that there are no 40-bit accumulators. You are going to need them if you want to compete with the 54xx series for my (and many other) DSP applications. Also, the 54xx has a FIRS instruction for symmetric filters which does two MACs per cycle. Then there are details like memory bandwidth - you can have all the coefficients cached, but not the incoming data (generally you always have a miss there), and then you probably have all the snoop issues with the DMA pushing the data to memory, etc. etc. At the bottom line, you will find that your 500 MHz ARM will likely be about the same as a 100 MHz 54xx when it comes to the complete application - if you have one which tolerates only 32-bit accumulator width. The 54xx operates on every memory address as if it were a register, and a (great) number of on-chip DMACs can access all that space without incurring any delay to the program flow at all; you cannot just neglect all that overhead.

I did not insist - but here you go: two accesses for the data and one for the opcode, as in your case (the 54xx can stop fetching in loop mode; it is indeed highly specialized - I, too, prefer to program normal processors). And yes, there are opcodes which make 3 data accesses per cycle on the 54xx.

Finally, the 54xx is almost 10 years of age now; no wonder there are newer candidates for its job. The ARM architecture is not bad - they have been learning from the right sources (68k and PPC) - and it does have the potential to compete for some DSP applications. If only it had 32 registers and could evolve into 64 bits...

And more finally, may I suggest that we include some information about ourselves whenever it is relevant. Had I known Wilco was directly associated with ARM, I would have been a lot less willing to support his agenda by contributing to this discussion. (I have no interest in TI, nor in Freescale or other PPC manufacturers.)

Dimiter


Wilco Dijkstra wrote:

Reply to
Didi

The architecture supports 32-bit and 64-bit accumulators. For many purposes (graphics, for example), 32 bits is more than enough. 64-bit accumulators need more registers and are slower in some cases. A common trick is to use 32-bit accumulators for several iterations, then do a 64-bit accumulate. This allows the inner loop to run at optimal speed without overflow (you can precompute how many iterations are possible without overflow).
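The trick might be sketched like this - my illustration, with a hypothetical fixed block length standing in for the precomputed safe iteration count.

```c
#include <stdint.h>

/* Sketch of the accumulator-widening trick: run the inner loop on a
   fast 32-bit accumulator and fold into a 64-bit total between
   blocks. The block length (4 here) is illustrative - as described
   above, the safe count must be precomputed from the actual data and
   coefficient magnitudes so the 32-bit partial sum cannot overflow. */
int64_t mac_wide(const int16_t *a, const int16_t *b, int n)
{
    int64_t total = 0;
    int i = 0;
    while (i < n) {
        int32_t acc = 0;                    /* fast 32-bit accumulator */
        int end = (i + 4 > n) ? n : i + 4;  /* block of <= 4 products */
        for (; i < end; i++)
            acc += a[i] * b[i];             /* 32-bit MACs */
        total += acc;                       /* widen to 64 bits */
    }
    return total;
}
```

The inner loop then maps onto the cheap 32-bit MAC instructions, with the 64-bit add amortized over the block.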

> Also, the 54xx has a FIRS instruction for symmetric filters
> which does two MACs per cycle.

I'd say acc += (A + B) * C does 1 MAC and 1 ADD, not 2 MACs... But yes, it would effectively run twice as fast. The ARM11 version would run faster too, of course; my guess is that the 54xx would be around 30% faster in this case.

The DMA stores the data in DTCM (fast local memory), which doesn't have any of the issues associated with caches, such as misses, consistency, etc. So there is really no cost in accessing the incoming data - that's the point of DMA!

This is mind-boggling. I showed you actual code that uses 75MHz on the ARM11 to do the same work as a 54xx at 100MHz, and then you do some handwaving and suddenly it needs 500MHz? Can you explain where the other 425MHz is going? (Not on cache misses.)

Caches allow you to do the same; it's why they exist. DSP programs generally exhibit ideal cache behaviour compared to general purpose programs, so they behave like fast, high-bandwidth memory without any overhead. But if you dislike caches, there are TCMs anyway.

I agree the 54xx can do 3 memory accesses per cycle; however, what I was asking is why you think that is the only way another CPU could achieve the same performance. My code example proves you don't need it.

32 registers would have been nice indeed, but it's not a big problem in integer code. SIMD always wants more though, so the next generation uses 32 64-bit registers.

In what sense is what I do, or who I work for, relevant to this discussion? Would it make what I wrote any less true? In my spare time I post about subjects I'm interested in, that's all. I could argue you have a hidden agenda, by repeatedly posting false statements about how much faster DSPs are compared to general purpose processors with DSP extensions.

Wilco

Reply to
Wilco Dijkstra

I agree. Your postings on this topic have been very professional and well informed. It doesn't matter whether you're speaking for yourself or your employer. Keep up the good work.

Reply to
Bob White

So much for your code example. Which registers do you use for the 64-bit accumulate? Depending on the incoming signal and the coefficients, you may need to do a 64-bit accumulate every few loops, so accumulating in memory will not do any good.

Because it does 3 data accesses per cycle, yes.

Do you have > 64k of that memory (something like 80 would be tight but might be enough) to dedicate to that alone? If not, you are out of business. Remember, 10 MSPS is not sound sampling at a few kSPS, where some of your examples might be applicable. Pointing to some real-world working product which does 10 MSPS / 16-bit data sampling based on your ARM would help a lot - can you do that?

I know what caches are. I have about 20 MB of source text, running on a PPC, which I have written over the past 10 years - a full-blown OS included, with MMU, VM and all. Now tell me what caches are about.

I agree, but it is the major problem in the example you are trying to sell me. The registers are just too few for the 40-bit accumulate to fit in at the speeds you claim.

Well, good luck with the next architectural generation. I am sure it will be better - and the current one, like I said before, is not bad at all; perhaps the new architecture will be as successful.

This is wrong. To operate on a cached value, you need at least two cycles - one to load and one to execute. The wider bus combined with multiple operands per register minimises and can even beat this, of course - wherever applicable. Notice that I am not at all advocating some weird DSP architecture against a "normal" one. The PPC architecture is by far the most advanced I know of, and most likely development will continue in this direction. DSPs are highly specialised things which, like other specialised logic circuits in the past, will probably disappear. But, like I said before, there is still a way to go. Even if we assume the ARM you are talking about is as good at 130 MHz as a 100 MHz 54xx (which it is not), matching the 5420 (two 100 MHz cores) will take 260 MHz. How much does it consume at that speed? The 5420 needs something like 300 mW, with a lot (times) more on-chip memory than the ARM.

So which is my hidden agenda? Yours is the fact that you "forgot" to mention you were an ARM employee. Using your @arm.com email address would have been enough for me, but you did not do so. Now tell me you did not know this was unethical.

Dimiter


Wilco Dijkstra wrote:

Reply to
Didi
