New ARM Cortex Microcontroller Product Family from STMicroelectronics

My contacts are very credible. And, unfortunately, what we are discussing is something that would not be made obvious in datasheet diagrams. Datasheet diagrams are meant to provide an overview of functionality in as clear a visual format as possible. Like the model of the atom, they almost never reflect 100% of what's inside the chip.

Chip buffers are RAM, and RAM is very greedy when it comes to die area. Allowing for an extra buffer could price out the chip as non-competitive, or it could be unwieldy from a layout POV.

My GUESS - the chip was designed for a primary customer, who wanted either USB or CAN (but not both). During chip design, a smart marketing person asked about adding the extra peripheral. Adding the CAN or USB adds fractions of a penny to the chip cost. But adding an extra buffer probably priced the chip beyond what was quoted to the target customer.

Jim, if you want to take this off-line, I can be reached at the first email address on this page:

formatting link

-Bill.

Reply to
Bill Giovino

In the case of my original article:

formatting link
the answer is YES - the ST part can execute code from RAM off the data (system) bus. Of course, there will be extra cycles.

Back when I was but a fledgling FAE, in my presentations I used to label architectures as Harvard, Modified Harvard, Von Neumann, etc.

In PPT presentations, engineers would ALWAYS debate amongst themselves as to the differences between these architectures, and sometimes whether or not Von Newman (What, Me Worry???) was spelled right...

What you decide to call the architecture is much less important than what you decide to do with it.

Bill

Reply to
Bill Giovino

In some parts, RAM CODE execution is promoted for speed (due to slower FLASH speeds). Is that not the case in the ST device Core/RAM/FLASH combination?

-jg

Reply to
Jim Granville

Good question... in the ST part, if you are running out of Flash at zero wait states, then you are getting simultaneous fetches from both the instruction & data buses - taking full advantage of the Harvard (ahem!) architecture gets you *mostly* single-cycle execution.

But for the same example, if you are running out of RAM, then you are using the same bus for instructions and data; you lose the advantages of the Harvard architecture and so it's slower.

However, if you are running out of Flash with the CPU at a higher speed than the Flash - so the Flash requires wait states even while taking advantage of the Harvard architecture - then comparing that against running instructions & data out of RAM off the data bus, with its extra cycles, the answer is - it depends...
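The "it depends" can be made concrete with a toy fetch-bandwidth model (my own illustrative assumptions, not ST-measured numbers): code in flash on the instruction bus delivers one word every (1 + wait states) cycles with no contention from data accesses, while zero-wait-state RAM on the system bus must share that single bus, so every data access steals an instruction-fetch slot.

```c
#include <assert.h>

/* Toy model: words fetched per CPU cycle from flash on a dedicated
 * instruction bus, where each fetch takes (1 + wait_states) cycles. */
static double flash_fetch_rate(unsigned wait_states)
{
    return 1.0 / (1.0 + wait_states);
}

/* Toy model: zero-wait-state RAM on a shared system bus - the fraction
 * of cycles spent on data accesses is lost to instruction fetching. */
static double ram_fetch_rate(double data_access_fraction)
{
    return 1.0 - data_access_fraction;
}
```

With these toy numbers, flash at 2 wait states (0.33 words/cycle) loses to shared RAM when only 20% of cycles touch data (0.8 words/cycle), while flash at zero wait states always wins - and a prefetch buffer shifts the balance back toward flash. Hence: it depends.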

Reply to
Bill Giovino

Any idea if the ST Cortex M3 can run without wait states from flash at their rated speed? That would be quite impressive.

Eric

Reply to
Eric

The data sheet says it requires one wait state from 24 to 48 MHz and 2 wait states above 48 MHz. So compared to the Luminary parts running at 50 MHz with *NO* wait states, I say the ST M3 parts are dogs.
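The datasheet rule quoted above can be written as a small helper. This is only a sketch: the 24/48 MHz thresholds are taken from the post, and the function name is mine, not from any ST header.

```c
#include <assert.h>

/* STM32F103 flash latency per the datasheet rule quoted above:
 * 0 wait states up to 24 MHz, 1 up to 48 MHz, 2 above that. */
static unsigned flash_wait_states(unsigned long hclk_hz)
{
    if (hclk_hz <= 24000000UL) return 0;
    if (hclk_hz <= 48000000UL) return 1;
    return 2;
}
```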

The power consumption is not great either, at least not compared to parts like the Atmel SAM7. The advertisement says it gets "0.5 mA/MHz in RUN mode from Flash", but this is not very accurate. The power curve does not have a 0.5 mA/MHz slope. The STM32F103 data sheet shows higher current per MHz at low clock speeds, with a Y intercept of about 9 mA.

I think the lower mA/MHz at higher clock speeds reflects the lower MIPS available due to the required wait states. Accounting for that, the mA/MHz ranges from 0.54 at 24 MHz to 0.88 at 72 MHz. I think this may be better than the Luminary Stellaris parts, but not as good as the Atmel SAM7 parts which are claimed to be a true 0.5 mA/MHz with very low static current in the uA range. I have not looked at the newer Luminary parts in detail.
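The Y-intercept effect can be shown with a simple linear model (my own illustration; the 9 mA intercept comes from the datasheet reading above, while the slope value below is chosen only to make the shape visible, not taken from ST):

```c
#include <assert.h>

/* Linear current model I(f) = I0 + s*f: with a 9 mA intercept, the
 * effective mA/MHz is much worse at low clock rates than the slope
 * alone suggests - which is why a headline "0.5 mA/MHz" can mislead. */
static double ma_per_mhz(double i0_ma, double slope_ma_per_mhz, double f_mhz)
{
    return (i0_ma + slope_ma_per_mhz * f_mhz) / f_mhz;
}
```

With I0 = 9 mA and an assumed slope of 0.43, the effective figure is about 1.56 mA/MHz at 8 MHz but only 0.56 mA/MHz at 72 MHz - the intercept dominates at low clocks.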

Actually, I guess a power factor would be required for the SAM7 parts as well since they run with one wait state at their top speed. So maybe the STM32 parts do better on power than I realized!

I am still waiting for Luminary to announce parts on a smaller geometry process. I was told they would be out toward the end of the year in a 130 nm process, IIRC. These parts should be very low power, but I don't know if they will keep 5 volt tolerance and what the static current will be.

Reply to
rickman

FreeRTOS.org wrote:

Note to Richard: When posting Google Groups links, the browse_frm paradigm is nicer.

formatting link

In addition, using periods (or hyphens[1]) to form phrases makes things more searchable (no %22ST stuff), and lnk=gst& is just noise.

[1] A hyphen (grease-monkey) will find e.g. BOTH **grease monkey** AND **greasemonkey**.

Reply to
JeffM

It's not that bad. Cortex-M3 has a prefetch buffer and branch prediction. This means that the cost of a single wait state can be hidden for conditional branches, i.e. only indirect branches have a penalty. With 2 wait states the branch prediction only works on unconditional branches, so you'll get a slowdown. However, you can change loops to use an unconditional branch at the end so they run at the speed of zero-wait-state memory.
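The loop shape Wilco describes can be sketched in C, though actual codegen depends entirely on the compiler and flags, so treat this as illustration only: a test-at-top while loop is typically emitted as a conditional forward exit plus an unconditional backward branch, which is the form he says the M3 can keep at full speed even with wait states.

```c
#include <assert.h>

/* Typical naive codegen for this shape: compare + conditional forward
 * exit at the top, loop body, then an UNCONDITIONAL branch back to the
 * top - so the loop-closing backward branch is the kind the M3's
 * prefetch/branch handling can hide wait states behind. (Compiler-
 * dependent; shown only to illustrate the branch structure.) */
static int sum_while(const int *a, int n)
{
    int s = 0, i = 0;
    while (i < n) {
        s += a[i];
        i++;
    }
    return s;
}
```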

There is also the flash power consumption. When you add wait states, the flash power consumption drops to 50% (1 wait state) or 33% (2 wait states). I.e. the flash has identical power consumption at 24, 48 and 72MHz.

Of course the secondary effect of adding wait states is the core slows down and so uses less power. Based on their numbers I estimate the slowdown is between 10 and 15% - not too bad for 2 wait states.

I calculate 40mA at 72MHz, so 0.56mA/MHz. Not quite 0.5, but close. But I don't see where you get the idea they are worse than SAM7. I'm not sure what part you were comparing with, but the SAM7A3 (which also has CAN and USB like the STM32F103) shows 70mA at 60MHz, or more than twice the current at the same frequency.

Now consider that an M3 runs twice as fast as a SAM7 at the same frequency, so the MIPS/Watt is 4 times as good!

If you're trying to compare MIPS/Watt don't forget that different cores running at the same frequency do not run at the same speed.
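Wilco's arithmetic can be checked against the numbers quoted in the thread - 40 mA at 72 MHz for the STM32, 70 mA at 60 MHz for the SAM7A3, and his assumption of roughly 2x work per clock for the M3. All three figures are claims from the posts, not independently verified.

```c
#include <assert.h>

/* Relative performance per mA at a given clock:
 * (work-per-clock ratio) * MHz / mA. Using the thread's figures, the
 * STM32/M3 vs SAM7A3 ratio comes out to about 4.2 - matching the
 * "4 times as good" claim (if you accept the 2x-per-clock premise). */
static double perf_per_ma(double rel_work_per_clock, double f_mhz, double i_ma)
{
    return rel_work_per_clock * f_mhz / i_ma;
}
```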

Wilco

Reply to
Wilco Dijkstra

I don't follow what you are saying at all. Branch prediction relates to pipelining. I don't see how it relates to wait states. The required wait states are added because of a fundamental limitation in the bandwidth of the Flash memory. You can look ahead all you want, but you can still only return one word from Flash per 3 clock cycles when running at full speed. Unless the Flash word width is increased (as in the NXP designs) or the instruction size is reduced (many Cortex M3 instructions are 16 bits, but they would need to be 10 bits with two wait states and 32-bit memory) this will limit performance in the Cortex M3.
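The bandwidth argument above reduces to simple arithmetic (a sketch of the reasoning, not measured data): a fixed-width flash read every 1 + wait states cycles caps the average instruction bits deliverable per CPU cycle.

```c
#include <assert.h>

/* Average instruction bits per CPU cycle from flash:
 * fetch_width_bits / (1 + wait_states). With a 32-bit flash and 2 wait
 * states that is ~10.7 bits/cycle - not enough even for 16-bit Thumb
 * instructions - while a 64-bit fetch gives ~21.3 bits/cycle, which is
 * enough for straight-line 16-bit code. */
static double flash_bits_per_cycle(unsigned fetch_width_bits, unsigned wait_states)
{
    return (double)fetch_width_bits / (1.0 + wait_states);
}
```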

Am I completely missing something? I always leave that possibility open...

Yes, that is all pretty obvious. But it does not address the point of the Y intercept being a hefty 9 mA. This is not as high as the Analog Devices ARM parts, but it is significant. It means you need to use modes and hardware features to get better power savings, compared to just slowing the clock, which is much simpler to do.

The SAM7A3 is one of the oldest SAM7 parts and is not a useful basis for comparison. Personally, I do not expect to have a use for the CAN controller and I don't expect it was running when the power measurements were made. I was using the SAM7S parts as a point of comparison. I have a spreadsheet that was provided by Atmel which shows the power rating of the CPU, since you can control all the various power consuming sections. Ignoring the peripherals, the CPU (with PLL running) consumes 0.5 mA/MHz with a very small Y intercept (as I initially said).

The power for the STM32 is from the data sheet and includes basic power to the peripherals, although since they are not performing work the power they draw is less than typical. So the comparison is not perfect.

How do you support the claim that the M3 runs twice as fast as the SAM7 at the same frequency??? Maybe I don't want to know...

I have not seen anyone claim that the M3 runs twice as fast as an ARM7 clock for clock. I don't even think ARM claims that. I seem to recall that after all the hoopla is removed, you might see from 10% to 25% speedup from the ARM7 to the M3 depending on your application. If you disagree on this basic point, then I think we should not discuss it further. I have seen it discussed before ad nauseam with no hard information to support any given number.

Yes, but that is a small delta compared to adding waitstates with a 2x or 3x reduction in performance and therefore the same effect on power efficiency.

Reply to
rickman

Completely correct. But you must remember that devices like these are often not used at their full speed.

ST certainly has excellent embedded Flash processes that can run faster than 24MHz, and they deliberately chose not to use any of them for this product. In the case of this device, it looks like it was developed specifically for low power applications, where the issue isn't really instructions per second, but milliamps. The intelligent peripherals, and especially the non-intrusive DMA, allow developers to run the core slower.

When competing with commodity devices, (and anything licensed from ARM has become a commodity), a microcontroller company needs a competitive advantage. ST's advantage is their superior in-house process technology. Only TI (who also licenses the ARM Cortex) competes with ST when it comes to superior in-house process technology, and, hey, ST and TI are so close in process ability I wouldn't bet on the difference between the two.

Bill Giovino

formatting link
formatting link

Reply to
Bill Giovino

Adding a wait state is the same as increasing the pipeline depth, and branch prediction coupled with prefetching can hide some of that latency.

You have to increase the fetch width if you add wait states, that's a given. The M3 TRM recommends a 64-bit flash fetch. While this allows straight-line code to run at full speed, branches are still slow. What I meant is that the M3 has branch optimizations that reduce this slowdown.

Regarding power consumption at lower speeds: the specs showed various settings that use significantly less than 9mA below 8MHz. So I don't think it really burns 9mA at 0MHz (which is what I think you mean with Y intercept, right?).

The ST specs list some numbers with peripherals off, and that is less than half the normal current, 21mA from flash at 72MHz IIRC - pretty good.

Because I've benchmarked it myself?

ARM claims 70% improvement, but that is an understatement due to the Dhrystone benchmark. EEMBC is more accurate here.

You seem to forget that the M3 was designed from the ground up to be an efficient MCU, while ARM7 wasn't:

  1. Thumb-2 is more efficient than Thumb-1 (just like ARM is faster than Thumb-1)
  2. Cortex-M3 has a more efficient micro architecture than ARM7 (it beats ARM9)
  3. Cortex-M3 slows down less with wait states than ARM7
  4. New features like unaligned access, hardware divide, better interrupts

Both 1 and 2 provide about 40% improvement, so about 2 times speedup together.

3 can give somewhere between 10 and 20% extra performance, but only at higher frequencies when waitstates are used. Depending on the application used, 4 can make a huge difference (I've seen benchmarks go twice as fast just because of hardware divide).
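The "about 2 times speedup together" is just the two quoted ~40% figures compounding (the 40% figures themselves are Wilco's claims):

```c
#include <assert.h>

/* Independent speedup factors multiply, they don't add:
 * 1.4 * 1.4 = 1.96, i.e. roughly a 2x combined speedup. */
static double compound_speedup(double a, double b)
{
    return a * b;
}
```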

I've personally benchmarked the effects of 1, 2 and 4 on a large amount of benchmarks. Now where did you get that 10-25% number from?

Not at all. Adding wait states doesn't slow you down by that much, as you make the memory wider (not doing that makes no sense at all, so I do not consider it a valid option). But instruction set and microarchitecture differences can easily make a factor of 2 difference, as shown above.

Wilco

Reply to
wilco.dijkstra

Whether you can transfer data from code to data memory in a single instruction is not relevant. You always need 2 accesses anyway (read from code into a register, store register into memory). Instruction sets that can do this in a single instruction still need these 2 steps and an intermediate buffer.

Put simply, the question is "Can you read data from code memory at a random address?".
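On a modified-Harvard part with a unified address space (Cortex-M style - my example, not from the thread), Wilco's test is trivially satisfied: a const table is placed in code memory (flash) by the linker, yet it is read at a random address like ordinary data.

```c
#include <assert.h>

/* On Cortex-M class parts the linker places const data in flash (code
 * memory), and the core reads it over the bus like any other load -
 * i.e. "read data from code memory at a random address" just works. */
static const unsigned char table_in_flash[] = { 10, 20, 30, 40 };

static unsigned char read_code_memory(unsigned i)
{
    return table_in_flash[i];
}
```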

Wilco

Reply to
wilco.dijkstra

That's the question you'd ask to decide between a pure Harvard and a modified Harvard? (The discussion makes me think so, but I'm not sure.)

Jon

Reply to
Jonathan Kirwan

Oops, why is this changed from your previous assertion:

"That means there is a connection (at least an uni-directional bus) between code and data memories. "

If you are going into the fine ground of semantics (which I find rather a waste of time, as end users frankly don't care), then you do need to be consistent in your details ;)

I.e. an even better question is: can you buy silicon that fits in each of your categories? If not, then what is the point?

-jg

Reply to
Jim Granville

Any Harvard architecture can read data into registers using the immediate addressing mode.

getdata:
        execute label[r0:d]
        ret

label:
label0: ld r0,#0
label1: ld r0,#1
label2: ld r0,#2
label3: ld r0,#3
label4: ld r0,#4
label5: ld r0,#5
label6: ld r0,#6

This should work on a Harvard architecture without going to a "modified Harvard", so the question is irrelevant for determining Harvard/non-Harvard.
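A C analog of Ulf's trick (my sketch, not from the thread): the data exists only as immediates inside the instruction stream, so a pure Harvard machine can "return data" without ever performing a data-side read of code memory - which is also exactly why Wilco objects below that it doesn't let you read arbitrary bit patterns.

```c
#include <assert.h>

/* Each value is an immediate baked into an instruction; dispatching on
 * the index executes the right "load immediate", so no data-side read
 * of code memory ever happens. */
static int getdata(int i)
{
    switch (i) {
    case 0: return 0;
    case 1: return 1;
    case 2: return 2;
    case 3: return 3;
    case 4: return 4;
    case 5: return 5;
    case 6: return 6;
    default: return -1;   /* out of table range */
    }
}
```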

--
Best Regards,
Ulf Samuelsson
Reply to
Ulf Samuelsson

Wilco always modifies his statements, and forgets what he is saying....

--
Best Regards,
Ulf Samuelsson
Reply to
Ulf Samuelsson

Yes, this question would be sufficient to decide between the early Harvards and everything that came after them. However, the address-space question "can you run code from data memory?" is a more important distinction for today's Harvards.

What I am getting at is that just about all CPUs can read data from code memory and so are not as pure as the early Harvards. As I'm sure you are aware, at the time Von Neumann was seen as a major improvement over Harvard, and Harvards were quickly forgotten. It was much later that modified Harvards were used for performance reasons.

Wilco

Reply to
wilco.dijkstra

If you can read data from code memory, there must be a connection via a bus. If you have a bus, you can at least read and usually write. What difference do you see?

My point is precisely that the early Harvards are no longer in use. I.e. current Harvards are not "true Harvards", so it makes no sense to use that definition for current CPUs.

Wilco

Reply to
wilco.dijkstra

So how does this read the *data* contained in the code memory? I'd like to read the bit patterns of the instructions, not execute them.

Wilco

Reply to
wilco.dijkstra
