Atmel releasing FLASH AVR32 ?

I think you have mentally added quite a bit of hardware. If you do what you describe, then you need a wide buffer per thread?

- so you have dictated a quite special memory architecture, and that has to be on-chip. [it is still simple, and deterministic to a point, but it is special]

If you extend that wide buffer to be interleaved (see the AT27LV1026), then you can cross a boundary (sequentially) without that affecting things

- so the tools can be simpler.

Someone like Atmel could do this, but the IP suppliers who sell Microprocessors as Microcontrollers are pushing in a different direction.

-jg

Reply to
Jim Granville

Easy to say, a bit harder in reality. If you don't care about code size you could align big functions to 512-byte boundaries and pack small functions in the gaps. But even that is hardly a solution, as every minor change in the code results in a different memory layout, making performance unpredictable. Basically it is an unsolvable problem.

No, a cache doesn't impact other accesses to non-cacheable memory areas. A local flash cache is something you could just drop into an existing design without even worrying about needing to turn it on or flush it. It's completely transparent.

Branch prediction is pretty trivial, as branches are very predictable. A small global branch predictor (for example as used in the ARM1156) gives amazingly good prediction at a negligible hardware cost.

So what? There are few wasted cycles on modern embedded CPUs. Only very high-end CPUs are waiting a lot for slow memory.


No, phones are extremely integrated and usually have only one CPU, one DSP and perhaps a microcontroller in the flash card.

Hardware multithreading doesn't give much performance on a high-end CPU, and it gives almost no benefit on a low-end one. Less than 10% of the memory bandwidth is unused in an ARM7, so running a second thread either means it runs at 10% of the maximum speed or it slows down the main thread.

You don't understand multithreading at all. Interrupt latency is completely unaffected by multithreading. Whether you run 2 interrupts in parallel at half the speed or one after the other at full speed is irrelevant.

You confuse multiprocessing with multithreading. A 2-core CPU can indeed deal with 2 interrupts in parallel at full speed.

It is impossible to run code at a predictable speed, so you're screwed no matter whether you use a cache or not.

Wrong. Code is highly repetitive, so even if you assume the cache is invalidated at the start of a task, using a cache results in much faster execution.

Of course the cache burns power, but then you're not using the flash. Which uses less power is highly dependent on their size and implementation. From what I've heard, caches are extremely efficient for sequential accesses - i.e. code accesses.

No, it would be virtually impossible to find code that actually can't meet its deadline with a cache.

Wilco

Reply to
Wilco Dijkstra

I assume that there are certain critical paths which need this determinism. Those can be handled by pragmas. It is also entirely possible that most threads only execute out of zero-wait-state SRAM.

Adding a cache to the ARM7 CPU (not to the flash) will add wait states to ALL non-hit accesses according to chip designers. It also adds wait states to ARM9s if you put the memory on the AMBA bus. The only way to allow zero-wait-state operation is to put the SRAM in TCM.

If you make a jump to a location outside the cache, then you are dead in the water:
Sync with the AMBA bus
Sync with the bus interface
SDRAM precharge cycle
70 ns access...

With a multithreaded CPU you can use those cycles for something good.

Then you do not know what a modern phone looks like.
Each Bluetooth chip normally has an ARM.
Each WLAN chip normally has one or more ARMs.
On smartphones, you have a GSM/WCDMA controller and an application CPU.
GPS functions will add one more ARM.
Then you have a micro doing the charging algorithm.

It quickly adds up.

Multithreading for embedded systems is not about increasing performance. It is about replacing 2 CPUs capable of 50 MIPS which only run at 20 MIPS with a single CPU which can run 2 x 20 MIPS threads. I.e. it is trying to fix the real-time response problem.

It is cheaper to have one CPU doing the job of two CPUs than having two CPUs each doing the job of half a CPU. Someone is going to get very rich, once they understand this. I am too lazy...

You are locked into conventional thoughts on multithreading.

Not if you need both interrupts to respond within 200 ns. With multithreading, you do not even need interrupts; you can schedule a thread.

No I don't. A multithreaded CPU running at 400 MHz can do the task of 40 CPUs running at 10 MHz.

No, you can measure how many cycles each thread uses within a certain time quantum, and ensure that each thread gets its fair share.

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

?! - what ? Or are you talking only within the ARM subset of the CPU universe here ?

-jg

Reply to
Jim Granville

Plus once you have this, you can often drop the MAX clock speed, which may have been hiked in the first place, to try and reduce the SW latencies to a tolerable level....

For an example of someone already doing this, look at Ubicom's devices. It's what I'd call hard multithreading, where they have timeslices and can allocate them to tasks: if you want, you can map 29/64 to a high-priority task, 4/64 to lower-priority ones, and 1/64 to a background watchdog-type task, and get full independence. (etc) Then there is the Parallax Propeller: multiple cores, with small code storage per core.

-jg

Reply to
Jim Granville

That sounds like the explanation.

There are CPUs with exact, predictable execution times, where the only place for unavoidable variability is in recognizing and synchronizing interrupt code execution to an asynchronous external event (variability here can be kept to a cycle). And where interrupts generated from internal timers do NOT have this unavoidable variability, since their generation is synchronous with the CPU, and they are exactly predictable in terms of their latency.

Jon

Reply to
Jonathan Kirwan

I know; I wrote a white paper on multithreading for embedded control when I worked in the National Semiconductor research labs and presented it to the microcontroller division. Bulent Celebi, the head of the NSC microcontroller division, became the CEO of Ubicom, and Gideon Intrater, the head of the architecture group, became VP of the MIPS architecture group (MIPS has also introduced a multithreaded MIPS core).

As I said, I am too lazy...

Reply to
Ulf Samuelsson

It doesn't have to if you divide the memory map. Accesses to non-cacheable memory simply bypass the cache, cacheable accesses try the cache first. It does take some extra logic as the ARM7 isn't built for caches so you get a slightly lower maximum frequency. That is why doing it on the flash is a better solution.

Correct.


Sure, cache misses are bad. But caches work extremely well; we're using 3GHz CPUs with 10MHz memory, after all...

The other thread also needs to use part of the cache for its code and data, so the cache becomes less effective. It is a difficult tradeoff, not as simple as you claim.

You haven't seen an average phone then. Yes, the most complex smart phones use 5-6 chips with several ARMs. Most phones are far more integrated and use 2-3 chips containing just one ARM and a DSP.

What realtime response problem? Interrupt latency of a modern CPU is only a few cycles. Cortex-R4 has a 20 cycle latency even though it has caches, branch prediction, and runs at 500MHz...

This is already happening, but you don't need multithreading.

Sorry, it's simple maths. If we have 2 20MHz CPUs that have a 200ns interrupt deadline, then a 40MHz CPU takes 100ns for the deadlines (as it is twice as fast), so it meets the 200ns deadline.

And a non-multithreaded CPU running at 400MHz can do the task of 40 CPUs running at 10MHz. Multithreading doesn't enter the picture at all...

Wilco

Reply to
Wilco Dijkstra

I guess you haven't heard about interrupts, wait states, cycle-stealing DMA and other niceties then. Some of us live in the real world...

Wilco

Reply to
Wilco Dijkstra

Which has nothing to do with the false, sweeping claim you made above.

Not only is it possible to run code at predictable speeds, a large number of designs out there are doing this on a daily basis....

I'm glad the systems I ship do not have to conform to your idea of the 'real world', or they would fail.... :)

-jg

Reply to
Jim Granville

I think you are talking about a minimum service time, while Wilco is talking about entirely predictable times. You can't have the latter when essentially random asynchronous events steal processing time. We can control the net effect by adding timer interrupts.

--
Chuck F (cbfalconer at maineline dot net)
   Available for consulting/temporary embedded and systems.
Reply to
CBFalconer

I wrote an application where cycle by cycle exact counts are precisely required, with repeatability of less than 8ns variation of signal observed externally with a high speed scope. It's crucial because the timing is the divisor in some vital calculations where error cannot be well tolerated and I have no external feedback about its actual value so I need to know it, a priori from the crafted design. The only interrupts present in the system are those from a timer, which is set to interrupt only when I happen to _know_ that there is free time to tolerate the interruption. Even the serial ports operation is synchronized to an available window of time. The operation of external hardware by the software must be extremely precisely controlled and there are multiple lines to control in certain sequences, driven by zero-overhead loops that the DSP supports.

Entirely predictable times, known to the cycle. Like clockwork. But then, I carefully crafted the entire chain of timing sequences and the asynchronous events to occur exactly when I could afford them to occur.

Jon

Reply to
Jonathan Kirwan

Wilco inhabits the world Planet-ARM, whilst I live on Planet-Microcontroller, but his statement did not say 'entirely predictable times', it said: "It is impossible to run code at a predictable speed,..", !? which is simply nonsense, but has some merit as a good example of a "rash generalisation" :)

Jon has shipping examples, so do we, and many others.

Some uC have jitter free interrupts, others have interrupts that can be made jitter free, with the right design skills.

There was an earlier thread about the merits of designing a core with a fixed INT latency, even if that meant inserting delays on the faster paths. The silicon cost of this is quite low. What you gain is a drop in jitter from multiple cycles, to clock-edge levels - that can be a 100:1 improvement.

We have also routinely branch-delay mapped code, to get phase-error free output, and one design was a PAL signal generator, where you certainly DO notice any jitter, and if "It is impossible to run code at a predictable speed,.." were true, we could not have built this in SW.

-jg

Reply to
Jim Granville

... snip ...

I haven't done that for about 25 years, when I built a cheap timer for swimming meets. I didn't have any good touchpads though. The timer was built around an 8080, and needed careful construction to make all paths through routines take constant time.

Reply to
CBFalconer

Since you insist on not understanding:

Try this example:

spi_task(unsigned char *mbox)
{
    while (1) {
        data = 0;
        waitfor(!CS);                  /* wait for chip select to go low */
        for (i = 0; i < 8; i++) {
            waitfor(SCK);              /* wait for the clock edge */
            data = (data << 1) | MOSI; /* shift in the next bit */
        }
        *mbox = data;                  /* deliver the byte to the mailbox */
    }
}

Reply to
Ulf Samuelsson

Indeed.

I was talking about current micro controllers, and that includes ARM and many others.


I did indeed mean entirely predictable; this was clear from the context - which you left out. We were talking about caches, and Ulf mentioned that if you have a cache hit, execution becomes unpredictable. I.e. code runs too fast rather than failing to meet a realtime deadline!

I stand by my claim that it is impossible to make code run with a fixed timing on current micro controllers (just to make it 100% clear, I mean non-trivial code, and dealing with realtime events).

Microcontrollers typically have different memory timings for the different memories, and there are data-dependent instruction timings to worry about, so you need to write everything in assembler and carefully balance the timings of if/then statements. If you pass pointers then you'd need to take the memory timing into account wherever the pointers are used.

Then there is the interrupt problem. If you do service (asynchronous) interrupts, then only the highest priority interrupt could run with a fixed execution time - assuming the controller has a fixed interrupt latency, which is rarely true. If you use polling to avoid this then you have a different interrupt latency problem as you can only poll once in a while, so asynchronous events cannot be handled in a fixed time.

Of trivial programs, yes. In the original post mobile phones, WLAN, GPS, Bluetooth were mentioned - could you do any of that?

Your example of a PAL generator proves my point, it can't react to anything else while you're emitting a frame.

Wilco

Reply to
Wilco Dijkstra

In a multithreaded core, if you have a thread allocated to that event you can guarantee a response time.

See previous SPI slave example. You need to guarantee that the thread reads the input pin before the SPI master toggles the clock.

You need one instruction to read that input pin as fast as possible, the rest of the thread can execute at any time.

In a single threaded core, you would have problems due to overhead in interrupt entry/exit.

Reply to
Ulf Samuelsson

Using polling in both cases would result in about the same max frequency. Assuming all ports run at the same frequency and are active, the amount of code that needs to execute to receive 40 8-bit values is the same, whether multithreaded or not. If not all ports are active then multithreading has much lower CPU utilization (as only a few threads are running).

Using interrupts in both cases would result in about the same max frequency. The maximum frequency is lower compared to polling (due to the interrupt latency overhead - twice as slow is possible in a worst case scenario). Multithreading will have a similar interrupt latency as taking an interrupt is virtually identical to starting a new thread (some CPUs even switch to a different set of registers). The advantage of using interrupts is that CPU utilization is much lower if only a few SPI ports are active.

Peripherals typically have some buffering to reduce the interrupt rate, so the overhead is minimal (this is a little extra hardware, far less than hardware multithreading needs). Therefore the advantage of polling when all devices are active is pretty small. So there is little difference between multithreaded polling and non-multithreaded interrupts.

If you're claiming that polling has lower CPU utilization in a multithreaded environment, then I agree. If you're claiming that interrupts have a large overhead if you do very little work per interrupt (i.e. no buffering), then I agree.

But I still don't see any advantage inherent to multithreading.

Wilco

Reply to
Wilco Dijkstra

Please see my reply...

Starting a thread on an event is just as complex as handling an interrupt.

Wilco

Reply to
Wilco Dijkstra

"Wilco Dijkstra" skrev i meddelandet news:C2wMh.3352$ snipped-for-privacy@newsfe3-win.ntli.net...

No, you start a thread containing a loop, and at the beginning of the loop you wait for an event. Once that event occurs, the thread becomes runnable and you can read data on the next CPU cycle.

Reply to
Ulf Samuelsson
