Compare ARM MCU Vendors

Like you would any other vendor! See who has what you want/need. How much they want for it. What their reputation is. etc.

Then, see who *else* has "something that you can *tweak*" to do the same job -- possibly better/worse -- and repeat the process.

Finally, make a "value judgement" on all of the candidates that fall through the above process.

I don't think you will find "the same part" from any two vendors. The ARM world is like the "stereo" (HiFi) business of ages past (modern parallel would be multimedia): you bought a turntable from vendor A, the *stylus* for that turntable from vendor B, the (phono) preamp from vendor C, amplifier from vendor D, speakers from vendor E, etc. Until you got the "system" that fit your price/performance/ego.

With ARM, each vendor *packages* various "components" (referencing the above analogy) into an MCU. So, the processing power of the "core", amount of memory (and flavors thereof) included/supported, other peripherals onboard, etc. varies.

In theory, you can find The Ideal MCU for your application -- but, chances are, it is only sold by *one* vendor (though the various components inside it may appear in a smattering of offerings from other vendors... though not in the exact same configuration).

I dunno... google doesn't tell me which *car* is "right" for me, either! Amazing!

That's a value judgement. Do you want to establish a relationship with *one* company? (there are pros and cons, of course) Do you want to tailor your solutions to your problems (or pick the closest fit from the offerings of that *one* company)?

That's why they call it "Engineering" instead of "shopping for shoes"...

--don

Reply to
D Yuniskis

Google is just a way to search for info that others provide. There are some comparisons of ARM devices, at one time I put up a comparison of ARM7 devices myself. But it is very hard to keep updated. Now ARM7 is on the down slope and the Cortex architectures are the hot, new thing. Good luck trying to keep up with all the new product introductions there! I have a lot of things on my plate before I could do this, but I may take a stab at another comparison chart for ARM CM devices over the winter. The last one was done when I needed the info for myself and I may need to evaluate ARM cores again soon.

The reasons for standardizing on one company in the (distant) past had to do with the differences in CPU architectures. Once you learned the PIC12 devices you didn't want to restart with the MSP430 parts, both learning about the CPU as well as the tools. Now that ARM has provided a more complete MCU core in the CM3/1/0 the CPUs are nearly all the same eliminating this issue. But the peripherals are very different between brands. The tools have a lot more support for the peripherals. So there is still reason for developing a brand loyalty. At some point I expect the tool vendors may have reason to help mitigate this issue and it will be much easier to port between brands. But that may be a long time off, if ever.

If you think you will have many designs that need a variety of MCU parts with different capabilities, I would suggest that you consider the major players with broad product lines. This can save cost on tools as well as relearning how to use the peripherals. It should cost you little in terms of recurring part costs or a match to your application. In other words, don't switch vendors unless you have a reason.

Rick

Reply to
rickman

The disadvantage of having a 256 byte wide memory is power consumption. You will have 2048 active sense amplifiers. I don't see that coming soon.

--
Best Regards
Ulf Samuelsson
Reply to
Ulf Samuelsson

You don't need 256 byte wide memory - you need a 256 byte sram buffer on the flash. If we assume that the processor ideally wants to read 32-bit wide data from the flash at 100 MHz, and the flash itself is capable of providing data once per cycle at 50 MHz (perhaps with a couple of cycles delay for initial access to a page), then the flash-to-buffer width should be 64 bits. Then there is a brief stall when accessing a new page, but otherwise the processor gets its instructions at full speed.
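
The arithmetic behind that width choice can be sketched quickly (illustrative Python, using only the figures assumed in the post):

```python
# Back-of-the-envelope check of the flash-to-buffer width argument.
# Figures from the post: the CPU fetches 32-bit words at 100 MHz,
# the flash delivers one access per cycle at 50 MHz.
cpu_hz = 100e6
cpu_word_bits = 32
flash_hz = 50e6

# To keep the CPU fed, flash bandwidth must match CPU fetch bandwidth:
#   flash_width_bits * flash_hz >= cpu_word_bits * cpu_hz
flash_width_bits = cpu_word_bits * cpu_hz / flash_hz
print(flash_width_bits)  # 64.0, matching the 64-bit width suggested above
```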

Yes, those 64 bits means 64 sense amplifiers, compared to 16 amplifiers that might be used on a slower flash setup. But apart from a small leakage current, the amplifiers only take power when they are used, so the number of amplifiers doesn't affect the power much - the total power is proportional to the bits read from the flash. With a buffer arrangement, you'll get some unnecessary reads to fill the buffer, but you'll avoid duplicate reads on many loops - my guess is you'd reduce the total number of reads.
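
A toy illustration of the "power follows bits read" argument (all numbers hypothetical, not from any datasheet):

```python
# Toy energy model: assume flash read energy is proportional to bits read,
# regardless of how many sense amplifiers physically exist.
ENERGY_PER_BIT = 1.0  # arbitrary units

def narrow_fetch_energy(n_fetches):
    """32-bit reads straight from flash: every loop pass re-reads the flash."""
    return n_fetches * 32 * ENERGY_PER_BIT

def buffered_fetch_energy(n_line_fills, line_bits=512):
    """Buffer fills of line_bits each; loops that fit re-run from the buffer."""
    return n_line_fills * line_bits * ENERGY_PER_BIT

# Hypothetical loop: 16 instructions (64 bytes) executed 100 times.
print(narrow_fetch_energy(16 * 100))   # 51200.0 - a flash read on every fetch
print(buffered_fetch_energy(2))        # 1024.0 - two line fills, then buffer hits
```

The crossover obviously depends on how often code loops within a buffered region, which is the "duplicate reads" point above.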

Reply to
David Brown

LPC1800... can operate at 150MHz straight from its 1Mbyte flash memory, or from RAM... The flexible dual-bank 256bit wide flash memories...

Dual-bank seems to be not for performance - the banks aren't interleaved, so you don't get the benefit of an effective 512bit width.

See:

formatting link

Interesting trade-off ! Best Regards, Dave

Reply to
Dave Nadler
2010-09-20 21:04, Dave Nadler wrote:

formatting link

In practice you see that the 128 bit flash LPC2xxx draws a lot more current than the 32 bit SAM7. In Thumb mode, the SAM7 is faster than the LPC (at the same clock frequency) due to the faster flash. The wide flash memories will give you some extra boost at the top performance level. The programmable nature of the SAM3 allowed me to test the difference between 64 & 128 bit widths, and it is ~5%. Normally it is better to increase the clock than it is to increase the width of the flash. Same performance, but less power.

--
Best Regards
Ulf Samuelsson
Reply to
Ulf Samuelsson

Not sure what you mean by "Atmel is weak in Cortex-M3". The CM3 is new enough that not everyone has their products out yet. I think Atmel dilly-dallied too long with the CM3, but I expect this was due to company goal issues and not because of "weakness" of any kind. They have a competing 32 bit MCU product and I expect they could only throw so many resources at bringing out a totally new MCU line. Give them a few more months and I think they will not disappoint.

"Fastest" is always a short lived title. Clock speed is seldom a determining criterion in selecting an MCU and I expect it is often given too much weight by engineers when initially winnowing their MCU choices. It is a simple number that is easy to verify. CPU speed is a much more complex measurement that is very hard to verify for your application, but this is the one that may actually make a difference in your design.

http://www.cypress.com/psoc5 is a good place to start.

How can you plan to use a part, even if you can wait six months for production, if you don't know the price? Has anyone heard a number for production pricing on the PSOC5?

Some three or four years ago I put together a list of ARM7 devices available. By the time Luminary came on the scene it got to be too much work to update. Now with all the CMx devices out there it would be a major effort to keep this updated. Does anyone have a comprehensive comparison of features and capabilities of the CMx MCUs available?

Rick

Reply to
rickman

I hope you aren't involved in architecting new MCU designs. I don't think anyone said they wanted 2048 sense amplifiers. I would either interpret the above to be "256 bits" or I would consider an implementation that used a 256 byte cache of some sort. What would be the utility of a 256 byte wide interface to the Flash? Even the fastest CM3 CPUs can't run at nearly that speed.

Rick

Reply to
rickman

I was referring to a 256 byte cache, but perhaps I wasn't clear in my description. Such a page cache will be filled from the flash at a speed that suits the flash, with a width that matches the flash (perhaps something like 64-bit or even 128-bit for performance-optimised parts, and maybe as small as 16-bit for price or power optimised parts). On the other side of the cache, the processor will read out with a speed and width that matches its instruction bus - typically 32-bit.

It is effectively a specialised type of instruction cache - less flexible, but much simpler to implement.
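
A minimal sketch of that single-page buffer in Python (the class and names are illustrative, not from any real part):

```python
# One cached flash page, refilled whole on a miss - the "specialised
# instruction cache" described above, reduced to its simplest form.
class PageBuffer:
    def __init__(self, flash, page_bytes=256):
        self.flash = flash          # backing store: a bytes-like object
        self.page_bytes = page_bytes
        self.tag = None             # page number currently held
        self.data = b""

    def read_word(self, addr):
        """32-bit instruction fetch through the buffer."""
        page = addr // self.page_bytes
        if page != self.tag:        # miss: refill the whole page from "flash"
            base = page * self.page_bytes
            self.data = self.flash[base:base + self.page_bytes]
            self.tag = page
        off = addr % self.page_bytes
        return int.from_bytes(self.data[off:off + 4], "little")

flash = bytes(range(256)) * 4       # 1 KB of fake flash contents
buf = PageBuffer(flash)
print(hex(buf.read_word(0)))        # 0x3020100 (little-endian bytes 0..3)
```

In hardware the refill would of course happen over several narrower flash reads, as discussed above; the model only shows the hit/miss behaviour.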

I've read about such a cache, but I can't remember which chip used it - it may not even have been an ARM device (perhaps it was a ColdFire v2 microcontroller). And many parts have some sort of "flash accelerator" in their feature list, which are probably a similar idea.

Reply to
David Brown

Yes, simpler to implement, but definitely less effective. For example, let's assume the flash reads out 32 bytes (256 bits) at a rate of 50 MHz. That's 1600 MB/s. It would take 160 ns (8 reads) to fill the buffer on a jump. If the destination instruction was in the last line read, that would be a long stall of the processor. Of course, you could make the fill a bit smarter, reading the needed line first, but if the second instruction word was in the next line the processor would still have to wait for both reads to complete, rather slow in that case.
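
The fill-time arithmetic above can be checked in a few lines (same assumed figures: 32-byte reads at 50 MHz, a 256-byte buffer):

```python
# Checking the numbers in the post.
line_bytes = 32            # one 256-bit flash read
flash_hz = 50e6            # one read per 20 ns
buffer_bytes = 256

bandwidth_mb_s = line_bytes * flash_hz / 1e6
reads_per_fill = buffer_bytes // line_bytes
fill_time_ns = reads_per_fill * 1e9 / flash_hz

print(bandwidth_mb_s)      # 1600.0 MB/s
print(reads_per_fill)      # 8 reads per buffer fill
print(fill_time_ns)        # 160.0 ns worst-case stall on a jump
```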

So yes, there are tradeoffs and the fact that this sort of cache is seldom seen makes me think the bottom line is either work with no cache (meaning a very minimal cache like a single line cache) or design an associative cache that doesn't need to refill the whole cache. There are volumes of material written on cache memory designs and yet we keep seeing the same basic ones used in practice... for the most part.

Rick

Reply to
rickman

I am certainly involved in the definition of new MCU designs, although mostly by providing ideas.

He said that he wanted a 256 byte buffer, and I really doubt that this should be interpreted as bits.

He only said that the buffer will be filled when you access a new page, and did not state how many cycles it would take. From a performance point of view, it makes more sense to load it in one cycle. If you start loading using sequential accesses to the flash, you will probably waste both cycles and power.

The proposal is already implemented in page mode DRAMs, so it may make sense at first, unless you know more about flash internals.

--
Best Regards
Ulf Samuelsson
Reply to
Ulf Samuelsson

Except for the AT32UC3L, which has better power consumption than the Energy Micro on a pure CPU comparison. I didn't study the power consumption of the EFM32 peripherals, but I know it has a UART which can run at low frequency. The UC3L has the "Sleep-Walking" feature which will turn on/off power to the peripherals using the event system, rather than waking up the CPU to do this.

The EFM32 is very limited in flash size. Almost all customers I talk to want to have fairly large amount of flash.

I have no clue about the decision criteria, but I have always had the opinion that as long as only ST has the CM3, Atmel does not need it. If/when NXP & others go for it, then Atmel needs it as well, but there is time to catch up.

While this strategy will lose some designs, I know that if I sum up all the "big" designs lost by being a tad late, this is less than half the volume of a single project where additional focus on the SAM7 enabled Atmel to win a design which will move to CM3 once Atmel has that available.

The parallel strategy of having an 8 & 32 bit AVR has enabled Atmel to enter the mobile phone market, which people watching NASDAQ have noticed this year.

There are two groups AVR (8 and 32 bit) is handled by one group and the ARM products are handled in another group. You will see competition for resources between Cortex-M3 chips and ARM9 chips, but not between Cortex-M3 and 32 bit AVR chips.

http://www.cypress.com/psoc5 is a good place to start.

--
Best Regards
Ulf Samuelsson
Reply to
Ulf Samuelsson

From the performance viewpoint, loading in a single cycle would be ideal - but from the space and power viewpoint that would be a bad idea. So loading sequentially with a medium-width bus (I suggested 64 bit) is likely to be the best compromise.

I know enough about flash internals to know it is a useful idea, and could be a cheap, simple and low-power method to improve flash access speeds. I know enough about chip design and logic design to know that de-coupling the flash access and control logic from the processor's memory bus will simplify some of the logic, and reduce the levels of combination logic that must be completed within a clock cycle. It also allows the processor and the flash module to run at independent speeds.

I also know that it would complicate other parts of the design, and the extra unnecessary flash reads may outweigh the flash reads spared.

In effect, my suggestion is a cache front-end to the flash with just one line, but a large line width and perhaps two-way associativity. The ideal balance may be different - half the line width and four-way associativity might be better. It's all a balancing act.

I also know that I don't know nearly enough detail to judge whether the sums will add up to making this a good idea in practice. It depends on so many factors such as flash design (some incur extra delays when switching pages), access times, power requirements of the different parts, access patterns on the instruction bus, area costs, design times and design costs, etc., and I don't know anything about these.

I am also fairly sure that the designers who /are/ capable of calculating and balancing these tradeoffs will have thought of doing something like this. There are certainly similar solutions used on many high-speed flash microcontrollers, though they may be much smaller. It could well be that my suggested 256 byte buffer is far too big, and that an 8 or 16 byte buffer is fine when your cpu clock speed is not too much higher than the flash access speed.

Reply to
David Brown

I think that the way this is implemented is through an instruction queue. This was implemented in early 32 bit chips, like the NS32016 and the MC68010. The MC68010 even allowed you to loop in the queue.

It is not implemented on the ARM, and I do not think that it exists in the Cortex-M3 either. The AVR32 does have a queue and will fetch instructions faster than it will execute them, and this is one reason why the AVR32 can handle waitstates better than the Cortex-M3.

On the AVR32 you lose about 7% due to the waitstate on the first access, and you only need one waitstate at 66 MHz, the top speed of current production parts.

You will not get a 100% hit rate, so your boost will be less than 7%. If you do add SRAM, you might be better off adding a branch-target cache to get rid of the initial waitstates. Once you start running sequential fetches the wide memory will give you a benefit, but even a 128 bit flash can be a hog on power.
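
The bound being argued here is simple: a cache can only win back the fetches it actually hits, so the gain can never exceed the waitstate penalty itself (the hit rate figure below is a made-up example):

```python
waitstate_penalty = 0.07   # ~7% lost to the first-access waitstate (from the post)
hit_rate = 0.90            # hypothetical cache hit rate - not a measured number

# The cache recovers the penalty only on hits, so the boost is bounded by it:
boost = waitstate_penalty * hit_rate
print(round(boost * 100, 1))  # 6.3 - always below the 7% ceiling
```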

The SAM7 with a 32 bit flash is faster than an LPC2xxx with 128 bit flash at the same frequency when running Thumb mode, and it draws much less current. The faster flash makes all the difference. The LPC2xxx can offset this with a slightly higher clock rate, but that will not make power consumption better.

--
Best Regards
Ulf Samuelsson
Reply to
Ulf Samuelsson

I wish I had a nickel for every time someone said bytes when they meant bits or the other way around... especially when it was me!

I'm not following you really. You say using 2048 sense amps is power hungry and then you say loading it in sequential accesses will waste power. You can't have it both ways, one is worse than the other unless you are saying each is equally bad. The difference is that using 2048 sense amplifiers pulls the data out of the flash some huge factor faster than the CPU can use it! So it has pretty much no upside to match the downside.

BTW, the power consumption is not because of having 2048 sense amplifiers. The power consumption comes from making the reads. So if the CPU only needed the Flash to make new reads proportionally less often, the power consumption might not be much, if any, higher than if it were read out sequentially. That, however, is a big IF.

My point is that there are very many tradeoffs and very many solutions. Only a few have worked out in practice given the fundamentals of IC design. As the processing makes more and more transistors cheaper and cheaper, the tradeoffs shift to different solutions. So there is no one answer and yesterday's bad idea can be tomorrow's great idea. But we only have to concern ourselves with today.

Rick

Reply to
rickman

So many IFs, so little time. Benchmarking is an art, not a science. Best to run your app and see what is faster for your app.

Rick

Reply to
rickman

Very few speak about bits for buffers.

If you jump to a position in a flash page, and the next instruction is a jump to another flash page, then if you have a 2048 bit flash you certainly waste power.

If you have a 64 bit flash which starts reading until it has 256 bytes of cache, then again you waste power.

If you jump forward within the page, then do you read all intermediate values? Then you waste power and performance. If you don't read intermediates, then you have to skip the contents of the buffer, or move to a real cache with valid bits.

Reading a word at a time is not wasting power. Then you read as much as you need. The drawback is that you do not have fast sequential access.

The "locality" of instructions is important. What is the likelihood that the CPU will execute 2, 3, 4, ..., n instructions in a sequence? That gives you the ideal buffer size. The ideal buffer size is of course application dependent.
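
That locality argument can be played with in a toy simulation (entirely synthetic trace - the branch probability and sizes are made up):

```python
import random

def hit_rate(trace, line_bytes):
    """Hit rate of a single-line prefetch buffer holding line_bytes."""
    current_line, hits = None, 0
    for addr in trace:
        line = addr // line_bytes
        if line == current_line:
            hits += 1
        else:
            current_line = line   # miss: refill the line
    return hits / len(trace)

# Synthetic fetch trace: 4-byte sequential fetches, ~10% taken branches.
random.seed(0)
pc, trace = 0, []
for _ in range(10000):
    trace.append(pc)
    if random.random() < 0.1:
        pc = random.randrange(0, 1 << 16) & ~3   # branch to a random word
    else:
        pc += 4                                  # fall through

for size in (8, 16, 32, 64, 256):
    print(size, round(hit_rate(trace, size), 3))
```

Diminishing returns show up quickly: past a certain line size the branch rate, not the buffer, dominates the miss rate - which is the application-dependent "ideal size" point above.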

--
Best Regards
Ulf Samuelsson
Reply to
Ulf Samuelsson

If fast is the parameter you are looking for! Many applications need a certain speed, but once it is there, they will not use additional performance.

You have a basic selection between speed and code size on the ARM7, but with waitstates the lower memory use of the Thumb instruction set can make it faster than the ARM instruction set.

--
Best Regards
Ulf Samuelsson
Reply to
Ulf Samuelsson

I think it is interesting to look at the history of instruction sets. Long ago, there were two competing ideas: CISC instruction sets that were highly varied (typically in 8-bit parts), and RISC instruction sets that were consistent and wide (typically 32-bit). It turns out that both extremes were "wrong", and the most efficient modern instruction sets for small devices are 16 bits wide for most instructions, with some 32-bit (or 48-bit) instructions for flexibility. Consistency and orthogonality of the architecture are important, but should not be taken to extremes. There is a lot to like about the Thumb2 set - I think it's a big improvement on the original ARM ISA.

Of course, the 68000 designers at Motorola figured this out about 30 years ago...

Reply to
David Brown

There is a lot I like about the Thumb 2 ISA. I have worked on ISA design on several commercial processors. The M68K (which you mention and I clipped), patterned after the PDP11, is the classic orthogonal instruction set. It takes a lot more than that to make an efficient processor. The TI9900, a contemporary of the 68K development with similar roots, was less effective at executing applications. The difference between the 68K and the 9900 was essentially data flow inside the processor. The 9900 was easier to program in many ways BUT it relied on more indirect accesses to data and was significantly less efficient.

Clean data flow between executing instructions is as important as the instructions. The classic example of how to kill a processor is to need to process memory management through primary accumulator(s). This killed several processors in the 90's.

RISC can be very efficient but requires a different approach to code generation. The XGATE is a simple 16 bit RISC that, driven by a well-designed code generator, will compete with well-designed CISC processors. Our application-based benchmarks showed that the difference was about 10%. There is a whole area of instruction design that trades compile-time complexity for processor simplicity or timing.

Many of the most successful ISAs make very good use of redundant instructions. This has been done four ways.

1) Conceptually have a page-0 space where some RAM areas are more valuable because access is quicker and requires less generated code.
2) Memory-to-memory operations that don't require intervening register involvement.
3) Instructions with implied arguments, for example inc, dec, complement.
4) Mapping registers (real and virtual) onto RAM space reduces register-specific instructions; an extreme example is the move machines with one instruction.

Regards,

w..

--
Walter Banks
Byte Craft Limited

formatting link

Reply to
Walter Banks
