What is happening to Atmel EEPROMs?


When you run in Thumb mode with 1 waitstate, all instructions are 16 bits and the SAM7 memory controller fetches 32 bits, so with prefetch there should always be zero waitstates for sequential fetches.

For Thumb mode, you have several cases depending on processor speed. Figures are for (non-sequential/sequential) access.

            LPC   SAM7
< 24 MHz:   0/0   0/0   same speed
24-33 MHz:  1/0   0/0   (SAM7 faster)
33-48 MHz:  1/0   1/0   same speed
48-66 MHz:  2/0   1/0   (SAM7 faster)

So the LPC2xxx has to run at a higher clock frequency to match the SAM7S performance. The 128 bit memory is overkill for Thumb mode and just wastes power.

You really need to run ARM mode for the 128 bit memory to make sense.

You can try overclocking the SAM7S if you are not running over the full temp range.

48 MHz with zero waitstates seems to work OK, but not up to +85°C.

Reply to
Ulf Samuelsson

Ulf, let me remind you of something you wrote about the SAM7:

"In thumb mode, the 32 bit access gives you two instructions per cycle so in average this gives you 1 instruction per clock on the SAM7."

I gather this is regarding the case where there is 1 wait state reading the 32-bit flash line -- so 2 clocks per line and thus the 1 clock per 16-bit instruction (assuming it executes in 1 clock.)

Nico's comment about the NXP ARM, about the 128-bit wide flash line width, would (I imagine) work about the same, except that it reads at full clock rate, with no wait states. So I gather, if it works similarly, that there are eight Thumb instructions per line (roughly). I take it your point is that since each instruction (things being equal) cannot execute faster than 1 clock each, it takes 8 clocks to execute those Thumb instructions.

The discussion could move between discussing instruction streams to discussing constant data tables and the like, but staying on the subject of instructions for the following....

So the effect is that it takes the same number of clocks to execute 1-clock thumb instructions on either system? (Ignoring frequency, for now.) Or do I get that wrong?

You then discussed power consumption issues. Wouldn't it be the case that since the NXP ARM is accessing its flash at 1/8th the clock rate and the SAM7 is constantly operating its flash, the _average_ power consumption might very well be better with the NXP ARM, despite somewhat higher current when it is being accessed? Isn't the fact that the access cycle takes place far less frequently observed as a lower average? Perhaps the peak divided by 8, or so? (Again, keep the clock rates identical [downgraded to SAM7 rates in the NXP ARM case].) Have you computed figures for both?

Jon

Reply to
Jon Kirwan

Some ARM-cored micros are on 40-week lead times; good thing Atmel manufactures the AVRs themselves. For related news see here:

formatting link

Reply to
bigbrownbeastie

I think this depends a lot on what method you use to measure this. Thumb code is expected to be slower than ARM code. You should test with Dhrystone and make sure the same C library is used, since Dhrystone results also depend on the C library!

That doesn't surprise me. From my experience with STR7 and the STM32 datasheets it seems ST does a sloppy job putting controllers together. They are cheap but you don't get maximum performance.

NXP has some sort of cache between the CPU and the flash on its M3 devices. According to the documentation, NXP's LPC1700 M3 devices use a Harvard architecture with three buses, so multiple data transfers (CPU-flash, CPU-memory and DMA) can occur simultaneously. Executing from RAM would occupy one bus, so you'd have less memory bandwidth to work with.

--
Failure does not prove something is impossible, failure simply
indicates you are not using the right tools...
nico@nctdevpuntnl (punt=.)
--------------------------------------------------------------
Reply to
Nico Coesel

Different manufacturers have different levels of outsourcing, from everything outsourced (100%) to everything in house.

Sometimes some processes are outsourced because the majority of a manufacturer's machinery is now for smaller geometries, and for some products only the wafers may be outsourced, to somewhere that still has the larger-geometry processes.

....

Reminds me of an ASIC company whose customer's purchasing department wanted to bring forward the next 6 months of production to that week, and asked "can't you just put more people on it?". At the time that would have been impossible even with stocks of wafers, as this was an avionics ASIC.

The testing procedure for this avionics ASIC was:

  • Wafer test electronically at room temperature
  • Package the good parts
  • Package test electronically at room temperature
  • Place a large batch in an oven, power all devices with clocks attached, and leave all parts running for a week at 125 deg C
  • After a week, slowly drop the temperature, then test electronically at room temperature
  • Lower the temperature to -55 deg C and test electronically
  • Second packaging process, then retest at room temperature

All with the full device serial number and batch testing logged.

If new wafers are needed you can add 12 weeks in front of that.

Environmental chambers and testing over the full temperature range is a long job, and about every 12 to 18 months you have to strip down and replace ALL the internal wiring, connectors and boards.

Imagine the setups required for testing up to 120-off 100-pin devices in environmental chambers, and how many chambers you require.

Designing the PCBs is also fun...

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
 Timing Diagram Font
  GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
 For those web sites you hate
Reply to
Paul Carpenter

Ulf Samuelsson wrote

We have a development kit for the 128 (bought ~ 2 years ago) so we will get a new one of those.

What kind of price is the 128A, 1k+, these days?

Reply to
Peter

Peter skrev:

No clue, but they should be lower than the ATmega128. Should have lower power consumption as well.

BR Ulf Samuelsson

Reply to
Ulf Samuelsson

Jon Kirwan skrev:

Yes, the SAM7 is very nicely tuned to Thumb mode. The LPC2 provides much more bandwidth than is needed when you run in Thumb mode. Due to the LPC's slower flash and hence higher latency, the SAM7 will be better at certain frequencies, but the LPC will have a higher maximum clock frequency.

The real point is that you are not necessarily faster because you have a wide memory. The speed of the memory counts as well. There are a lot of parameters to take into account if you want to find the best part.

People with different requirements will find different parts to be the best.

If you start to use high-speed communications, then the PDC of the SAM7 serial ports tends to even out any difference in performance vs the LPC very quickly.

Yes, this will have an effect. Accessing a random word should be faster on the SAM7, and, assuming you copy a large area sequentially, having 128 bit memory will be beneficial.

Yes, the LPC will at certain frequencies have longer latency, so it will be marginally slower in Thumb mode.

As far as I understand, the chip select for the internal flash is always active when you run at higher frequencies, so there is a lot of wasted power.

Best is to check the datasheet. The CPU core used is another important parameter. The SAM7S uses the ARM7TDMI, while most others use the ARM7TDMI-S (S = synthesizable), which inherently has 33% higher power consumption.

Reply to
Ulf Samuelsson

It is pretty clear that, if you

  • execute out of flash in thumb mode
  • do not access flash for data transfers
  • run the chips at equivalent frequencies
  • run sequential fetch at zero waitstates.

the difference will be the number of waitstates in non-sequential fetch.

The SAM3 uses the same AHB bus as the ARM9. The "bus" is actually a set of multiplexers, where each target has a multiplexer with an input for each bus master.

As long as no one else wants to access the same target, a bus master will get unrestricted access.

If you execute from flash, you will get full access on the instruction bus (with the exception of the few constants). If you execute out of a single SRAM, you have to share access with the data transfers, which will slow you down.

BR Ulf Samuelsson

Reply to
Ulf Samuelsson

I think I gathered that much and didn't disagree, just wondered.

I remember you writing that "the SAM7 uses a 33 MHz flash, while the LPC uses a 24 MHz flash." It seems hard to imagine it being actually slower, though, except perhaps in data-fetch situations or branching, if it fetches something like 8 Thumb instructions at a time anyway. As another poster pointed out, the effective rate is much higher for sequential reads no matter how you look at it. So it would take branching or non-sequential data fetches to highlight the difference.

One would have to do an exhaustive, stochastic analysis of application spaces to get a good bead on all this. But ignorant of the details as I truly am right now, not having a particular application in mind and just guessing where I'd put my money if betting one way or another, I'd put it on the 384 MB/s memory over the 132 MB/s memory for net throughput.
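For reference, the raw sequential-read arithmetic behind those two figures — a sketch, assuming the widths and flash clocks quoted elsewhere in this thread (128-bit lines at 24 MHz for the LPC, 32-bit at 33 MHz for the SAM7):

```python
# Peak sequential flash bandwidth = line width (in bytes) x flash clock.
lpc_mb_s = 24 * 128 / 8    # 128-bit lines at 24 MHz -> 384 MB/s
sam7_mb_s = 33 * 32 / 8    # 32-bit fetches at 33 MHz -> 132 MB/s
print(lpc_mb_s, sam7_mb_s)
```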

Yes, but the key here is the careful "not necessarily" wording. "Not necessarily" is true enough, as one could construct specific circumstances where you'd be right. But it seems to me they'd be more your 'corner cases' than 'run of the mill.'

Of course. So people who seem to care about the final speed and little else should indeed do some analysis before deciding. But if they don't know their application well enough to make that comparison... hmm.

Yes. That seems to ever be true!

Yes, no argument. I was merely curious about something else which you mostly didn't answer, so I suppose if I care enough I will have to go find out on my own.... see below.

Some parts have such wonderfully sophisticated peripherals. Some of these are almost ancient (68332, for example.) So it's not only a feature of new parts, either. Which goes back to your point that there are a lot of parameters to take into account, I suppose.

The 'random' part being important here. In some cases, that may be important where the structures are 'const' and can be stored in flash and are accessed in a way that cannot take advantage of the 128-bit wide lines. A binary search on a calibration table with small table entry sizes, perhaps, might be a reasonable example that actually occurs often enough and may show off your point well. Other examples, such as larger element sizes (such as doubles or pairs of doubles) for that binary search or a FIR filter table used sequentially, might point the other way.
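To make that binary-search point concrete, here's a toy model (hypothetical sizes: a 256-entry const calibration table, one 32-bit word per entry) counting how many distinct flash lines a search actually touches. The probe pattern is near-random, so the wide lines buy almost nothing:

```python
def lines_touched(n_entries, key_index, words_per_line):
    """Count distinct flash lines read while binary-searching a const
    table stored in flash (one 32-bit word per entry, assumed)."""
    touched = set()
    lo, hi = 0, n_entries
    while lo < hi:
        mid = (lo + hi) // 2
        touched.add(mid // words_per_line)  # flash line holding entry `mid`
        if mid < key_index:
            lo = mid + 1
        elif mid > key_index:
            hi = mid
        else:
            break
    return len(touched)

wide = lines_touched(256, 200, words_per_line=4)    # 128-bit lines
narrow = lines_touched(256, 200, words_per_line=1)  # plain 32-bit accesses
# Every probe lands on a different line either way, so the wide memory
# performs just as many non-sequential fetches as the narrow one.
```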

I find this tough to stomach when talking about instruction streams, unless there are lots of branches salted into the mix. I know I must have read someone's analysis of many programs and the upshot of this, but I think it was for the x86 and a product of Intel's research department some years ago, and I've no idea how well that applies to the ARM core. I'm sure someone (perhaps you?) has access to such analyses and might share it here?

By "at higher frequencies" do you have a particular number above which your comment applies and below which it does not?

In any case, this is the answer I was looking for, and you don't appear to answer it now. Why would anyone "run the flash" when the bus isn't active? It seems... well, bone-headed. And I can't recall any chip design being that poor. I've seen cases where an external board design (not done by chip designers, but more your hobbyist-designer type) did things like that. But it is hard for me to imagine a chip designer being that stupid. It's almost zero work to be smarter than that.

So this suggests you want me to go study the situation. Maybe someone already knows, though, and can post it. I can hope.

I wondered if you already knew the answer. I suppose not, now.

I'm aware of the general issue. Your use of "most other" does NOT address itself to the subject at hand, though. It leaves open either possibility for the LPC2. But it's a point worth keeping in mind if you make these chips, I suppose. For the rest of us, it's just a matter of deciding which works better by examining the data sheet. We don't have the option to move a -S design to a crafted ASIC.

So this leaves some more or less interesting questions.

(1) Where is a quality report or two on the subject of instruction mix for ARM applications, broken down by application spaces that differ substantially from each other, and what are the results of these studies?

(2) Does the LPC2 device really operate the flash all the time? Or not?

(3) Is the LPC2 a -S core? (This doesn't matter that much, but since the topic was brought up it might be nice to put it to bed.)

I don't know.

Jon

Reply to
Jon Kirwan

Jon Kirwan skrev:

That is because you ignore the congestion caused by the fact that the ARM7 core only fetches 16 bits per access in Thumb mode. At 33 MHz, the CPU can only consume 66 MB/second; at 66 MHz, the CPU can only consume 132 MB/second. Since you can sustain 132 MB/second with a 33 MHz 32-bit memory, you do not need it to be wider to keep the pipeline running at zero waitstates for sequential fetch. For non-sequential fetch, the width is not important, only the number of waitstates, and the SAM7 has the same or fewer waitstates than the LPC.
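A quick check of that arithmetic (assuming one 16-bit fetch per clock in Thumb mode, and one access per clock on the flash side):

```python
def fetch_bw_mb_s(clock_mhz, width_bits):
    # One access per clock -> bandwidth in MB/s.
    return clock_mhz * width_bits / 8

thumb_demand_33 = fetch_bw_mb_s(33, 16)  # core demand at 33 MHz: 66 MB/s
thumb_demand_66 = fetch_bw_mb_s(66, 16)  # core demand at 66 MHz: 132 MB/s
flash_supply_33 = fetch_bw_mb_s(33, 32)  # 32-bit flash at 33 MHz: 132 MB/s
# The 33 MHz 32-bit flash already saturates a 66 MHz Thumb core.
```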

The 128 bit memory is really only useful for ARM mode. For Thumb mode it is more or less a waste.

I don't think running in Thumb mode is a corner case.

LPC with 1 waitstate at 33 MHz:

NOP  2   (fetches 8 instructions)
NOP  1
NOP  1
NOP  1
NOP  1
NOP  1
NOP  1
NOP  1
.........
Sum = 9

Same code with SAM7, 0 waitstates at 33 MHz:

NOP  1   (fetches 1 instruction)
NOP  1   (fetches 1 instruction)
NOP  1   (fetches 1 instruction)
NOP  1   (fetches 1 instruction)
NOP  1   (fetches 1 instruction)
NOP  1   (fetches 1 instruction)
NOP  1   (fetches 1 instruction)
NOP  1   (fetches 1 instruction)
.........
Sum = 8

It should not be too hard to grasp.
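Those NOP counts generalize to a small model (my formulation, sequential single-cycle instructions only: the SAM7 delivers two Thumb instructions per 32-bit fetch at zero waitstates, the LPC eight per 128-bit line at one waitstate):

```python
import math

def seq_cycles(n_instr, waitstates, instrs_per_fetch):
    """Cycles to run n sequential single-cycle Thumb instructions:
    each new flash fetch pays the waitstate penalty once."""
    fetches = math.ceil(n_instr / instrs_per_fetch)
    return n_instr + fetches * waitstates

lpc = seq_cycles(8, waitstates=1, instrs_per_fetch=8)   # 9, as in the post
sam7 = seq_cycles(8, waitstates=0, instrs_per_fetch=2)  # 8, as in the post
```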

Each chip designer makes their own choices. I know of some chips that start to strobe the flash chip select when below 1-4 MHz.

This is an automatic mechanism which measures the clock frequency against another clock, and the "other" clock frequency is often not that quick.

Looking at the LPC2141 datasheet, which seems to be the part closest to the SAM7S256, you get:

57 mA @ 3.3 V = 188 mW @ 60 MHz = 3.135 mW/MHz.

The SAM7S datasheet gives 33 mA @ 3.3 V @ 55 MHz = 1.98 mW/MHz. On the SAM7S you can also choose to feed VDDCORE from 1.8 V.

The SAM7S is specified with USB enabled, so this has to be enabled for the LPC as well for a fair comparison.
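Those mW/MHz figures follow directly from current times voltage divided by clock:

```python
def mw_per_mhz(i_ma, v_volts, f_mhz):
    # Active power (mW) per MHz of core clock.
    return i_ma * v_volts / f_mhz

lpc = round(mw_per_mhz(57, 3.3, 60), 3)   # 3.135 mW/MHz (LPC2141 @ 60 MHz)
sam7 = round(mw_per_mhz(33, 3.3, 55), 3)  # 1.98 mW/MHz (SAM7S @ 55 MHz)
```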

You do not have any figures in the datasheet indicating low power mode.

Yes it is. It should be enough to look in the datasheet.

Ulf

Reply to
Ulf Samuelsson

I'm not entirely sure I understand. If both processors are internally clocked at the same rate, they both have exactly the same fetch rate in thumb mode.

Okay. I'm with you. Except that I haven't looked at the data sheets to check for maximum core clock rates, since that might bear on some questions.

In thumb mode and only talking about instructions and assuming 66MHz peak. Do the processors (either of them) sport separate buses, though, which can compete for the same memory system? (Data + Instruction paths, for example.)

In the case of instructions, I think I take your meaning. Regarding data, no, I don't.

... In the case of non-sequential instruction fetch.

All this still fails to account for actual application mix reports. I'm still curious (and I'm absolutely positive that this is _done_ by chip designers because I observed the sheer magnitude of the effort that took place at Intel during the P2 design period) about application analysis that must have been done on ARM (32-bit, 16-bit, and mixed modes) and should be available somewhere. Do you have access to such reports? It might go a long way in clarifying your points.

Actually, I meant this plural, not singular. And I don't have a perspective on actual applications in these spaces. So I'll just plead mostly ignorance here and hold off saying more, as I'm mostly trying to understand, not claim, things.

What you wrote is obvious. But it is completely off the question I asked. Take a close look at my words. I am asking about the kind of analysis I observed taking place at Intel during the P2 development. It was quite a lot of work getting applications, compiler tools, and so on and generating actual code and then analyzing it before continuing the processor family design.

Such a simple NOP case would have been laughed at, had it been presented as representative in such meetings. I'm looking for the thorough-going analysis that often takes place when smart folks attack a design.

I guess I can't follow your words, here, at all. Maybe I didn't write well, myself. In any case, I will just leave this with my question still hanging there for me. Someone else may understand and perhaps answer.

Again, this misses my question entirely. But it may provide some answers to some questions not asked by me.

A question which you went around completely in the above and which still remains...

I don't think I was asking about low power modes. I think there must be a language problem, now. Let me try this again.

When a memory system is cycled, there is power consumption due to state changes and load capacitance and voltage swings based upon the current from C*dV/dt and the supply voltages involved. When the memory system isn't clocked, when it remains 'static', leakage current can take place but the level is a lot less. This isn't about a low power mode. It's simply something fairly common to memory systems. I don't know enough about flash to know exact differences here, but I suspect that an unclocked flash memory consumes less power than one being clocked consistently. Let me use your simplistic example from above:

LPC with 1 waitstate at 33 MHz:

NOP  2  (fetches 8)   1 memory cycle
NOP  1                0 memory cycles
NOP  1                0 memory cycles
NOP  1                0 memory cycles
NOP  1                0 memory cycles
NOP  1                0 memory cycles
NOP  1                0 memory cycles
NOP  1                0 memory cycles
......................................
Sum  9                1 memory cycle

Same code with SAM7, 0 waitstates at 33 MHz:

NOP  1  (fetches 1)   1 memory cycle
NOP  1  (fetches 1)   1 memory cycle
NOP  1  (fetches 1)   1 memory cycle
NOP  1  (fetches 1)   1 memory cycle
NOP  1  (fetches 1)   1 memory cycle
NOP  1  (fetches 1)   1 memory cycle
NOP  1  (fetches 1)   1 memory cycle
NOP  1  (fetches 1)   1 memory cycle
......................................
Sum  8                8 memory cycles

As you say, "It should not be too hard to grasp."

I am imagining that 8 cycles against the flash will cost more power than 1. But I may not be getting this right.

Thanks. That's a much clearer statement than before.

Jon

Reply to
Jon Kirwan

Jon Kirwan skrev:

Yes, but the memory speed is important whenever you do a jump.

In some frequency ranges the LPC has more waitstates than the SAM7, so the jump will be one clock cycle slower. The more jumps you have, the slower the relative performance of the LPC is.

Yes, as I already mentioned, the LPC can run at a slightly faster clock, but this will not improve power consumption.

This is one of the weaknesses of the ARM7. It only has a single bus, so data and instructions are shared. The LPC adds a bridge from ASB to AHB, which allows multiple transfers, but this also causes synchronization delays for the CPU. Pros and cons of doing this.

I am not sure, but my guess is that one of the most common reasons for data access to the flash is that the compiler loads 32 bit constants by PC-relative reads, and then the number of waitstates is really critical.

and random data fetch

I know that this was done for the AVR32, but I don't have those reports. It is fairly obvious from the design decisions that NXP is focusing on people using ARM mode and Atmel is focusing on people running Thumb mode.

If you can meet your design goals in Thumb mode, then it is almost always better to go for Thumb mode (due to code size). If you run in ARM mode, you can run at a lower frequency, assuming zero waitstates. If you add waitstates, then Thumb mode may actually be faster, so you may have to run at a higher frequency in ARM mode to compensate.

It is just to show that the SAM7 does not need a 128 bit memory to achieve better performance than the LPC in Thumb mode. A faster flash memory is what is needed.

The real performance is application specific, so the example may well be useless for that. My real purpose is showing that the 128 bit memory does not necessarily perform better than a faster 32 bit memory.

For that, the example is good and simple enough.

You have to figure out whether you are running faster or slower than a certain frequency, and you have a limited number of clocks available inside the chip. If you keep the flash on, then it can respond instantly. If it is off, then it may take some extra time to start it up, so you need clock edges on the alternate clock before the access, which can be used to start up the flash.

It shows the expected power consumption, with any tricks applied.

Don't have any real data, but I would expect that jumps are

I think that when the flash is activated, you have quite a lot of static current in the sense amplifiers. The total sense amplifier current is proportional to the number of active sense amplifiers.

Not if the sense amplifiers are turned on in both cases. There will be some switching current, but I have been told that the current in the sense amplifiers is much more significant.

The 1st generation AVRs did not turn off the chip select. At 16 MHz, the flash will provide data out in < 67 ns. When you run at 33 kHz, the instruction cycle is 30 us. The flash still delivers data out in 67 ns, but it burns power almost as if the chip were running at 16 MHz.
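The waste is easy to quantify from those numbers (a 67 ns access against a ~30 us instruction period at 33 kHz):

```python
t_access_ns = 67          # flash data-out time, independent of clock rate
t_instr_ns = 1e9 / 33e3   # instruction period at 33 kHz, ~30303 ns
duty = t_access_ns / t_instr_ns
# The array is actually needed well under 1% of the time; with the chip
# select held on, the rest is burned as static sense-amplifier current.
```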

Reply to
Ulf Samuelsson

So much being reposted. I'll focus on this one point, since it is the more interesting to me, right now.

Again the *if* here. I had imagined (perhaps incorrectly) that any chip designer worth their salt (and perhaps working on generation N, where N > 1) would arrange to turn them off when not in use. It's not complex to do, and it doesn't take up enough real estate to notice. And the power savings are, as you indicate here, worth the doing.

I can't imagine they'd miss that opportunity.

But I'm speaking ignorantly and don't know. It just doesn't pass a basic rationality test for me, is all. But irrational things do happen. So... I still wonder.

Note N=1 here. That makes sense. Job 1 is getting a foot in the door, so to speak. And priorities are much different at this point. Forgiven.

You don't say, but I assume this is true because you claim here that the sense amplifiers were always on regardless and that this accounts for the power burn on those parts. I also gather, reading between your lines here, that this was fixed and pretty much everything later on now turns these sense amps off when not in use.

Why don't you imagine others do the same thing??

I'm sure the lesson was learned, especially when considering opening the door to applications where they might want to run at 32/33 kHz with your parts and don't want to pay the power cost.

I just don't get your argument here about why Atmel learns and "gets it" and other chip designers are somehow blind and dumb by comparison.

I would imagine, in this market today, that this is a well-understood issue. TI has forged new territory with the MSP430 in hyper-low-power apps, and Microchip, for one (and I'm sure you and other companies as well), has decided that this market is worth at least some attention. Besides, it's simply not hard to turn the static draw for the flash OFF. There is no reason not to do it. Yet you seem to suggest that there __might__ be a situation here where a fairly sophisticated team (I can assume) didn't get it and didn't do it, when Atmel only made this mistake on, I take it, generation 1 AVRs.

Something doesn't make sense here. Are you arguing you folks get this right and no one else does?

Jon

Reply to
Jon Kirwan

The datasheet I've seen for one of the NXP ARMs did not seem as elaborate as what Atmel usually supplies. Probably need the general family user manual to cover the gray areas.

M
Reply to
TheM


The User Manual is essential, the data sheet only has the pinouts and the electrical characteristics.

Reply to
Leon


You need the user manual. Just scroll down on NXP's web page for the specific controller and you'll find it between the appnotes and example software.

--
Failure does not prove something is impossible, failure simply
indicates you are not using the right tools...
nico@nctdevpuntnl (punt=.)
--------------------------------------------------------------
Reply to
Nico Coesel


It all starts with Amdahl's Law. Double the speed of one thing, and if none of the rest of the system can use the speed increase, you get nothing. (Slightly overstated.)

Reply to
JosephKK

Sounds like the famous "weakest link" phrase, "A chain is no stronger than its weakest link," which apparently traces back to the English clergyman Charles Kingsley's letter, dated December 1, 1856, where he wrote "The devil is very busy, and no one knows better than he, that 'nothing is stronger than its weakest part.'"

Others have also written similarly, since. See very near the bottom of page 433 here, for example:

formatting link

I guess we can add Amdahl to a long list of many stating the exact same thing in slightly different words.

Jon

Reply to
Jon Kirwan

Amdahl's Law is more quantitative than that:

formatting link

Reply to
krw
