Atmel releasing FLASH AVR32 ?

It seems Atmel are doing the next logical thing, and releasing FLASH variants of their AVR32.

See:

formatting link

Top MHz still seems to be Flash constrained - we've been stuck in the 50-100MHz zone for what seems like years....

Seems to make the Cortex M3 look 'ordinary' - but will the peripheral specs be as impressive as the new Infineon XC2200 ?

formatting link

5V operation, automotive specs, and ECC flash

-jg

Reply to
-jg

"-jg" skrev i meddelandet news: snipped-for-privacy@p15g2000hsd.googlegroups.com...

The AVR32 chips so far have been using the same set of peripherals as the SAM7/SAM9

formatting link

The article mentions Ethernet and USB. This is not available for the XC2200.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

I am having difficulty understanding why they cannot read 10 instructions at 50 MHz from flash in parallel and run the processor at 500 MHz. Branch operations (unpredictable PC changes depending on user input etc.) will still run at 50 MHz, but it is still a good gain.
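As a sanity check on that idea, here is a back-of-envelope model (all numbers illustrative, not from any datasheet) of what such a wide-fetch scheme would yield once branch stalls are counted:

```python
# Back-of-envelope model (illustrative numbers, not from any datasheet):
# flash delivers 10 instructions per 50 MHz access, the core runs at 500 MHz,
# and every unpredictable branch stalls until the next flash access completes.
flash_mhz = 50
instrs_per_fetch = 10
core_mhz = flash_mhz * instrs_per_fetch  # 500 MHz peak issue rate

def effective_mips(branch_every_n_instructions, branch_penalty_cycles=10):
    """Average sustained MIPS when a stall occurs every n instructions."""
    n = branch_every_n_instructions
    cycles = n + branch_penalty_cycles   # n issue cycles + one refetch stall
    return core_mhz * n / cycles

print(round(effective_mips(5)))    # branch every 5 instructions -> 167
print(round(effective_mips(50)))   # branch every 50 instructions -> 417
```

The gain is real for straight-line code, but with a branch every few instructions (typical for control code) the sustained rate falls well below the peak, which is the objection raised later in the thread.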

Reply to
tesla

"tesla" ...

I think most of these "50 MHz, directly from Flash" processors are already fetching 256 bits at a time from the flash. See for example the LPC series from Phi^H^H^H NXP.
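Assuming a 256-bit fetch and a 50 MHz flash access rate (illustrative figures), the arithmetic works out as:

```python
# Width of a 256-bit flash fetch in 32-bit (ARM) or 16-bit (Thumb)
# instructions, and the peak instruction supply at a 50 MHz access rate.
fetch_bits = 256
arm_instrs = fetch_bits // 32    # 8 ARM instructions per fetch
thumb_instrs = fetch_bits // 16  # 16 Thumb instructions per fetch
flash_mhz = 50
peak_mips = arm_instrs * flash_mhz
print(arm_instrs, thumb_instrs, peak_mips)  # 8 16 400
```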

Arie de Muijnck

Reply to
Arie de Muynck

I realise they push into different sectors, but the industrial/embedded market is somewhere in the middle. [Article does not mention CAN bus] The 3 key points I mentioned are relevant to industrial users, perhaps less so to an MP3 vendor!

When will the Atmel link referenced in the press release above actually be relevant to the UC3 ?

With no package or price info, it's hard to judge just how significant this really is.

-jg

Reply to
Jim Granville

The NXP parts are the only ones I am familiar with that do that trick to get speeds up to 75 MHz, IIRC. The Luminary Micro Stellaris parts run at up to 50 MHz with no wait states without prefetch from flash. Most other devices run at up to about 35 MHz or so before the flash wait states have to be added.
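The wait-state thresholds quoted above follow from simple division; a sketch, assuming an illustrative 30 ns flash access time (not any particular vendor's spec):

```python
import math

def wait_states(core_mhz, flash_access_ns):
    """Wait states needed so one flash access fits in whole core cycles."""
    cycle_ns = 1000.0 / core_mhz
    return max(0, math.ceil(flash_access_ns / cycle_ns) - 1)

# Illustrative: a 30 ns flash at various core clocks
for f in (33, 50, 75):
    print(f, wait_states(f, 30))   # 33 MHz: 0, 50 MHz: 1, 75 MHz: 2
```

This is why parts without prefetch top out in the mid-30s MHz at zero wait states, and why faster cores need either wait states or a wide prefetch buffer.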

That reminds me, I need to update the ARM MCU reference sheet at

formatting link
There are a few typos I need to fix and there are a number of new devices I need to add.

Reply to
rickman

More info has popped up here :

formatting link
formatting link

This shows 512K Flash, 64K RAM, in 100-pin/144-pin packages

10/100 Ethernet; USB OTG 12 Mbps [shame it's not 480 Mbps, like the other AVR32's :) ]; two master/slave SPIs; one SSC; one master/slave TWI; four USARTs with hardware flow control. One USART has special extensions to support modem, IrDA and smart-card ISO7816 serial protocols.

"DSP Instructions - The UC3 ?s multiply-accumulate unit executes, in a single cycle, a plethora of multiply and multiply-and-accumulate instructions on standard and fractional numbers, with and without saturation and rounding. Multiply or MAC results can be 32-, 48- or

64-bit wide; 48- and 64-bit results are placed in two registers. DSP instructions also include many add and subtract instructions as well as data formatting instructions such as data shift with saturation and rounding."

"The two first devices of the UC3 Series A are sampling now and will be available in volume production in 4Q-2007. The AT32UC3A0512, with EBI, comes in a QFP144 and is priced at $9.24 in quantities of 10,000 The AT32UC3A1512, without EBI, comes in a QFP100 package and is priced at $8.67 in quantities of 10,000."

Compare with the ARM7 line: "Pricing for the 512K Flash variants of the SAM7S, SAM7X and SAM7XC devices start at US$6 in quantities of 10,000 units."

- so there is a slight premium for the higher performance AVR32 core.
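A quick check of that premium, using the prices quoted above:

```python
# Prices from the press-release quotes above (10k quantities)
uc3_price = 9.24   # AT32UC3A0512, QFP144, with EBI
arm7_price = 6.00  # 512K-flash SAM7 starting price
premium = (uc3_price - arm7_price) / arm7_price
print(f"{premium:.0%}")   # 54%
```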

-jg

Reply to
Jim Granville

"tesla" skrev i meddelandet news: snipped-for-privacy@e65g2000hsc.googlegroups.com...

Maybe the sense amplifiers for the flash are large, or draw a lot of current. I still remember page-mode DRAM memories with 4096 bits per page, and no one has been able to tell me why this is not possible with flash.

After talking to multiple memory companies about this, my conclusion is that memory people do not understand microprocessors and their needs.

Reply to
Ulf Samuelsson

There is no point if you branch every few instructions as most programs do...

Not at all. If you run a CPU at 500MHz but branches take 10 cycles then you're lucky if you get the performance of a 150MHz CPU with 5 times the power consumption...

The solution is to use a cache and branch prediction.

Even if it were feasible, a cache with 1 line of 512 bytes is totally useless. A fully associative cache with 32 lines of 16 bytes would be better, but likely still too small to be useful (about 4KB is the absolute minimum). Combining prefetch with a branch target instruction cache would make even better use of such a small cache.
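To illustrate the point about line granularity, here is a toy model (not any real cache controller) of the fully associative, 32-line x 16-byte organisation mentioned above, with LRU replacement, run over straight-line code:

```python
from collections import OrderedDict

class FullyAssociativeCache:
    """Toy model of a fully associative cache with LRU replacement."""
    def __init__(self, lines=32, line_bytes=16):
        self.lines, self.line_bytes = lines, line_bytes
        self.tags = OrderedDict()          # tag -> None, kept in LRU order
        self.hits = self.misses = 0

    def access(self, addr):
        tag = addr // self.line_bytes
        if tag in self.tags:
            self.tags.move_to_end(tag)     # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            self.tags[tag] = None
            if len(self.tags) > self.lines:
                self.tags.popitem(last=False)  # evict least recently used

# Straight-line code: one miss per 16-byte line fill, then hits
c = FullyAssociativeCache()
for pc in range(0, 1024, 4):               # 256 sequential 32-bit fetches
    c.access(pc)
print(c.hits, c.misses)                    # 192 hits, 64 misses
```

Small lines keep the fill cost per miss low; a single 512-byte line would refill everything on every out-of-page jump, which is the "totally useless" case above.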

Most memory (flash, DRAM, even SRAM) is optimised for density, not speed. RLDRAM attracts a premium, so is rarely used. Hopefully new technologies like MRAM will become mainstream soon.

Wilco

Reply to
Wilco Dijkstra

Unless you branch to another part of the page. I think most branches are pretty short, even though I do not have hard data.

Really, I think a large-page memory and H/W multithreading is a much better solution. Cache and branch prediction are a waste of energy and gates. H/W multithreading simplifies the CPU: no need for nasty feedback muxes in the datapath, allowing higher frequencies, and no need for branch prediction, since you can execute computable threads while you are waiting for the flash access to complete.

I think you need to think about worst case and best case behaviour. Adding a cache leads to more unpredictability. A cache can even reduce worst case performance since it can introduce delays in the critical path.

I would not be surprised if a 512 byte page could fit an entire interrupt routine. If you can read the flash in one cycle in page mode, then I do not see how the cache/branch prediction brings a lot of benefit.
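For scale, assuming typical 32-bit or 16-bit instruction encodings (my assumption, not a figure from the thread), a 512-byte page holds:

```python
# Instruction capacity of a 512-byte flash page
page_bytes = 512
print(page_bytes // 4, page_bytes // 2)   # 128 32-bit or 256 16-bit instructions
```

A hundred-plus instructions is indeed enough for many small interrupt handlers, which is the basis of the argument.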

Reply to
Ulf Samuelsson

I'm guessing it makes more sense to compare AVR32 with ARM9? Once the Cortex-A8 comes out, I think that will be a better comparison. It's my guess the AVR32 is targeting the same applications as the A8?

Eric

Reply to
Eric

Yes, the A8 has a 13-stage pipeline vs. 3 for the M3. I am sure you pay for it in price and power consumption.

Reply to
linnix

The UC3 is similar to ARM7 and Cortex-M3 in terms of target market, performance etc. ARM9 is significantly faster, but uses caches and external memory. Note AVR32 is the architecture, not one of the two implementations.

Cortex-A8 is in a completely different league. The fastest AVR32 does 150MHz I believe and has 32-bit wide SIMD; Cortex-A8 runs at 1GHz and does 128-bit wide SIMD...

Wilco

Reply to
Wilco Dijkstra

You are right, and wide reads will work well, until you hit the fishhook that page reads are absolute, whilst code relocates.

This adds a compile-dependent variance to code execution.

Imagine if your routine that fits well into one page, has some minor changes elsewhere, and now that moves to be across two pages...

Of course, the tools could be made smarter, so they page-snap code blocks if told to....

Fundamental memory structure has faster access on some pins than others (as some have to go through the cells, and some just de-mux the array out), but that is rarely spec'd in modern data sheets.

I've seen memories that issue a pause/busy flag when they cross such boundaries, but can be faster on sequential read, and there was one memory (even Atmel's, IIRC) that did interleaved sequential reads.

I see the press release in the links above says "128-bit wide bus with 40ns Tacc" for the AT91SAM9XE512.
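Taking that "128-bit wide bus with 40ns Tacc" figure at face value, the peak fetch rate works out as (assuming 32-bit instructions and back-to-back accesses):

```python
# Peak fetch rate implied by a 128-bit bus with 40 ns access time
bus_bits, tacc_ns = 128, 40
instrs_per_fetch = bus_bits // 32                  # 4 instructions per access
peak_minstr = instrs_per_fetch / (tacc_ns * 1e-9) / 1e6
print(peak_minstr)   # ~100 M instructions/s peak supply
```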

-jg

Reply to
Jim Granville

Well, seems Atmel want to target everything :) Their press release says

" Atmel?s Cost-optimized AVR32 UC3 Core Targets ARM7/9 and Cortex-M3 Sockets "

but this on-chip FLASH advance does move AVR32 from microprocessor usage to microcontroller usage. ARM9s with flash are also appearing.

-jg

Reply to
Jim Granville

That's true, but function calls are common too and they would typically branch between pages. And then you have the nasty case of a function or a loop split between 2 pages...


On the contrary, caches typically reduce power consumption as code runs faster and you avoid having to use the bus. A small and fast local memory always wins.

Similarly, branch prediction makes a CPU go faster and so it burns less power to do a given task. Cortex-M3 has a special branch prediction scheme to improve performance when running from flash with wait states, so it makes sense even in low-end CPUs.

Unless the other threads are also branching... Multithreading is not relevant in the embedded space, it would add a lot of complexity and die area for hardly any gain. It really only makes sense on high-end CPUs, but even there the gains are not that impressive.

A single page cache is a cache. Performance of otherwise identical code is completely unpredictable due to code layout. Adding more cachelines evens this effect out, making performance more predictable.

So would a page cache. That is the price you have to pay when improving performance: the best case is better but the worst case is typically worse. Overall it is a huge win.

If you can read the flash in one cycle then you don't need page mode! If reading takes several cycles then it makes sense to fetch more than one instruction. Page mode is bad because you fetch a whole page even if you only call a small function. So you burn power for fetching a lot of data you didn't need. A proper cache has smaller lines, thus reducing wastage.
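A rough illustration of the wastage argument, with made-up sizes (a 40-byte helper function, 16-byte cache lines, 512-byte pages):

```python
# How many bytes get fetched to run a small 40-byte function (illustrative)
useful = 40
page_fetched = 512                          # page mode reads the whole page
line_bytes = 16
lines_fetched = -(-useful // line_bytes) * line_bytes  # ceil to 3 lines = 48 B
print(page_fetched / useful, lines_fetched / useful)   # 12.8x vs 1.2x overfetch
```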

Wilco

Reply to
Wilco Dijkstra

"Wilco Dijkstra" skrev i meddelandet news:2y_Lh.16902$ snipped-for-privacy@newsfe6-win.ntli.net...

Fixed by compiler pragma...

On an ARM7, adding a cache also adds one wait state to all non-cache accesses.

Branch prediction cost is chasing an ever-eluding target. With multithreading you can swap in a computable process and use EVERY cycle.

Yes it is - just look at a mobile phone: lots of ~20 MIPS CPUs handling Bluetooth, WLAN, GPS etc., just because no one has designed proper multithreading for embedded.

If you believe that, you don't understand multithreading for embedded. The purpose is not to increase performance, it is to improve real-time response so you do not have to have multiple CPUs.

Can, and should be handled by tools.

No, your unpredictability comes from jumping to a place where, instead of accessing memory to fetch the page, you get a cache hit, and then your timing is screwed.

No it is not a win if you have to guarantee that a job completes in a certain time.

The cache in itself draws power, and you cannot compare accesses to cache with accesses to flash memory.

Totally different technology. I have never been able to get any hard data on this, but I suspect that flash sense amplifiers do not exhibit a linear curve for access speed vs. current.

You have to run the cached CPU at a higher clock frequency to compensate for the loss of worst-case performance.

Reply to
Ulf Samuelsson

I'm with you up to this point, but the challenge with hard-real-time multithreading is that the code fetches feeding that "computable process" still have to come from somewhere. I can see multithreading doing good things for removing SW task switches and lowering interrupt latencies, but unless you do fancy things with the code pathway, you are actually thrashing the memory about even more.

One solution is what some called a locked cache, where small critical code is fetched from fast local RAM, others really do have separate cores, and separate memory pathways.

but this is also a memory-bandwidth problem. If you had your magic CPU, that could do 5 x 20 MIPS, how do you keep that fed, with today's memory technologies ?
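The bandwidth question can be put in numbers (illustrative, assuming 32-bit instructions and no cache):

```python
# Raw instruction bandwidth needed to feed five 20 MIPS threads
threads, mips_each, instr_bytes = 5, 20e6, 4
bytes_per_sec = threads * mips_each * instr_bytes
print(bytes_per_sec / 1e6)   # 400 MB/s of sustained fetch bandwidth
```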

IP core vendors do not design memory, so they get fancier and fancier with the cache handling, and of course, they pitch peak MIPS, not real world mips.

-jg

Reply to
Jim Granville

I've found the device, an AT27LV1026, and it dates from 1999, when it offered 35ns (double speed) sequential access, _without_ page-boundary gotchas.

It was a clever way to get more bandwidth from memory buses, and even reduced the pin count needed via the ALE, but because CPUs are designed for very 'dumb' memory interfaces, the idea has never hit critical mass.
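For reference, a 35 ns sequential access on a 16-bit-wide part (the bus width here is my assumption, not stated above) implies roughly:

```python
# Sequential-read throughput at 35 ns per 16-bit word, assuming no gaps
access_ns, bus_bytes = 35, 2
mb_per_s = round(bus_bytes / (access_ns * 1e-9) / 1e6, 1)
print(mb_per_s)   # ~57.1 MB/s
```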

-jg

Reply to
Jim Granville

If you fetch a large chunk of code in each fetch into a prefetch buffer, this is not a problem.

No, I see embedded multithreading as one thread accessing external memory while all the other threads (mostly) access internal high-bandwidth memory without any nasty cache in between.

Single core, single data bus and plenty of register banks and program counters. Any cache will be used by the generic thread; all other threads run on tightly coupled memory.

You can do a WLAN MAC in tens of kB with a 20 MIPS CPU. No need for external memory. Same for Bluetooth implementing up to HCI. The Bluetooth stack will run on the "generic" thread.

Reply to
Ulf Samuelsson
