New embedded CPU architecture

Jon Beniston · 2003-09-24T13:18:54+00:00

Hi all,

I'm looking for ideas as to what features should be included in a new embedded CPU architecture. What are current microcontrollers missing? All suggestions greatly received!

Cheers, JonB

Andras Tantos 22 years ago

Doesn't sound too good to me: You would limit the number of threads in HW or your OS kernel would have to revert to the conventional context switching - with the additional overhead of checking if there's enough HW resource. You would also have to design/manufacture/pay for HW for the high (constant) number of threads even if your particular application would use only a fraction of those.

Many CPU architectures support alternative register sets for interrupt handlers (ADSP 21xx and ARM come to my mind) but that's not exactly what you're talking about, it's far from being generic.

Some RISC architectures support a moving window over their register bank (I don't know the right term) so that the caller and the callee can work on a different set of registers, thus implementing some limited stack in the registers. That's also somewhat different from your idea though.

Regards, Andras Tantos

R Adsett 22 years ago

Yes and no. The applications I've done have had only a few threads that required the fastest response. The 80c196 (Intel) and 80c166/ST10 (SGS & Infineon) use mechanisms similar to this. Although I've yet to use an RTOS that takes advantage of them, they are very useful for fast interrupt responses, since you only have a few registers to save for context rather than tens. Something like:

  Receive interrupt
  save current context pointer
  load interrupt context pointer
  do interrupt processing
  restore previous context pointer
  return from interrupt

Where the context pointer register points to a bank (window) of registers that contains the stack pointer, ALU registers and a working register set.

Compare that to the more usual

  Receive interrupt
  save current register set (all registers in set)
  do interrupt processing
  restore previous register set
  return from interrupt

The key question, of course, is how many registers you have in the working set. If it's only a few you've gained nothing, but if it's on the order of tens there is the potential for a much faster context switch. Also, the interrupt gets its own stack rather than needing to reserve space in all the task stacks for interrupt overhead.

If you extend this context switch to task level (and nothing in either of the architectures I've mentioned would appear to make that difficult), you would presumably reserve the limited available register contexts for your most critical tasks which will need the extra planning overhead anyway.
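To make the comparison between the two sequences concrete, here is a minimal C model of both. It is only a sketch: `cpu_model`, `NUM_BANKS`, `BANK_REGS` and the function names are made up for illustration and do not describe the real 80c196 or 80c166 register layout.

```c
#include <stdint.h>
#include <string.h>

#define NUM_BANKS 4   /* hypothetical: number of register banks */
#define BANK_REGS 16  /* hypothetical: registers per bank */

typedef struct {
    uint16_t bank[NUM_BANKS][BANK_REGS]; /* register banks held in on-chip RAM */
    unsigned cp;                         /* context pointer: index of the active bank */
} cpu_model;

/* Banked entry: remember the old context pointer and select the ISR's bank.
   Only one word has to be saved, whatever BANK_REGS is. */
static unsigned enter_isr_banked(cpu_model *cpu, unsigned isr_bank)
{
    unsigned saved = cpu->cp;
    cpu->cp = isr_bank;
    return saved;
}

/* Banked exit: restore the previous context pointer. */
static void leave_isr_banked(cpu_model *cpu, unsigned saved)
{
    cpu->cp = saved;
}

/* Conventional entry: copy every working register to a save area. */
static void enter_isr_copy(const cpu_model *cpu, uint16_t save[BANK_REGS])
{
    memcpy(save, cpu->bank[cpu->cp], sizeof cpu->bank[0]);
}

/* Conventional exit: copy every working register back. */
static void leave_isr_copy(cpu_model *cpu, const uint16_t save[BANK_REGS])
{
    memcpy(cpu->bank[cpu->cp], save, sizeof cpu->bank[0]);
}
```

The banked pair moves one word each way, while the copying pair moves `BANK_REGS` words each way, which is where the saving comes from once the working set reaches tens of registers.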

Robert

Ben Bradley 22 years ago

That's actually what I meant, an "orthogonal instruction set" where instructions can address any register in a bank. All (most?) RISC architectures have this, whereas some CISC architectures do (68xxx) and some don't (80(x)86 - I haven't kept up, but by the time Intel made the Pentium I would think/hope they had cleaned it up, except of course for legacy instructions).

Rick Lones 22 years ago

A few kilobytes of RAM writeable only from certain processor states (for OS data structures, e.g.) would be nice . . .

-rick-

Ben Bradley 22 years ago

There's probably a few more things: maximum clock speed, minimum clock speed (zero (static) is best), total power at max and min speeds, MIPS/milliwatt, ALU features such as MAC (as an instruction in the CPU core, not as a peripheral as the 430 does it), barrel shifter...

Morris Dovey 22 years ago

Jon...

I'd like:

[1] A "bool" instruction that can perform all of the sixteen possible boolean operations. (It can be implemented in two levels of logic using either all nand or all nor gates.)

[2] I'd like a unix-like clock/calendar with CPU clock resolution and alarm interrupt capability, so that I can time activities, and schedule/trigger events.
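One way to picture the "bool" instruction in [1] is as a 4-bit truth table applied bitwise across two words. The C sketch below is an assumption about how such an instruction might be encoded, not a description of any existing CPU; the name `bool_op` and the bit assignment of `op` are invented here. It builds the result as an OR of selected minterms, which is the two-level form mentioned above.

```c
#include <stdint.h>

/* op is a 4-bit truth table (hypothetical encoding):
     bit 3: result when a=1, b=1
     bit 2: result when a=1, b=0
     bit 1: result when a=0, b=1
     bit 0: result when a=0, b=0
   The operation is applied independently to each of the 32 bit positions. */
uint32_t bool_op(unsigned op, uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    if (op & 8u) r |=  a &  b;   /* minterm a AND b         */
    if (op & 4u) r |=  a & ~b;   /* minterm a AND NOT b     */
    if (op & 2u) r |= ~a &  b;   /* minterm NOT a AND b     */
    if (op & 1u) r |= ~a & ~b;   /* minterm NOT a AND NOT b */
    return r;
}
```

With this encoding, `op = 0x8` is AND, `0xE` is OR, `0x6` is XOR, `0x7` is NAND and `0x1` is NOR, so a single opcode with a 4-bit field covers all sixteen functions.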

Morris Dovey West Des Moines, Iowa USA C links at http://www.iedu.com/c

Meindert Sprang 22 years ago

Nah, not a fixed number that limits the amount of tasks. Better is to have two single instructions that save/restore the entire processor context and an accompanying pointer register.

Meindert

Paul Keinanen 22 years ago

Time to reinvent the TI TMS9900 architecture?

It kept its set of sixteen 16-bit general purpose registers somewhere in RAM, so at each context switch you just switch to another set of 16 memory words. Unfortunately the processor was slow in normal operation, since each register reference actually meant a main memory reference. This was long before caches appeared in single chip microprocessors.

However, with current hardware architectures, all the registers in a single register set would fit into a single cache line in the L1 cache, so the context switch overhead would be in the same order as a cache miss.
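The workspace idea can be sketched in a few lines of C. This is a simplified model, not a TMS9900 emulator: the names `ws_cpu`, `reg`, `switch_workspace` and `return_workspace` are invented, addresses are word indices rather than byte addresses, and only the old workspace pointer is kept on a switch (the real BLWP instruction also stores the old PC and status in R14 and R15 of the new workspace).

```c
#include <stdint.h>

#define MEM_WORDS 256  /* hypothetical size of the modelled RAM, in 16-bit words */

typedef struct {
    uint16_t mem[MEM_WORDS];
    uint16_t wp;  /* workspace pointer: word index of R0 for the running context */
} ws_cpu;

/* Every register reference is really a memory reference at wp + n. */
static uint16_t *reg(ws_cpu *c, unsigned n)
{
    return &c->mem[(c->wp + n) % MEM_WORDS];
}

/* Switch to another workspace, keeping the old WP in the new R13. */
static void switch_workspace(ws_cpu *c, uint16_t new_wp)
{
    uint16_t old = c->wp;
    c->wp = new_wp;
    *reg(c, 13) = old;
}

/* Return to the workspace recorded in R13. */
static void return_workspace(ws_cpu *c)
{
    c->wp = *reg(c, 13);
}
```

The switch itself touches one pointer and one memory word. The cost moves into `reg()`, which is the main memory reference per register access described above, and which a cache line holding the whole workspace would hide.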

Paul

CBFalconer 22 years ago

... snip ...

Sounds good, BUT: The critical thing in interrupts is the time required to switch to a new environment. The cache is all very well for execution, but loading that cache requires reading all those memory items, and ensuring the previous group is safely stored in real memory. So there is no real substitute for an alternate register set.

Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net) Available for consulting/temporary embedded and systems. USE worldnet address!

Paul Keinanen 22 years ago

Even with 16 kB of L1 cache, you would have quite a few alternate register sets :-).

Especially with interrupt service routines, which are quite small, the likelihood of being forced to write back the interrupted program register set into main memory is quite small. However, a complete task switch would sooner or later force writing back the old register set.

Loading or storing a full cache line can be quite efficient compared to programmatically loading or storing registers on the stack one by one, as must be done on some architectures. A wide memory bus of course helps a lot. When accessing off-chip memory, the data bus cannot be very wide because of cost constraints. But take old PCs with a 64-bit (8-byte) data bus: transferring 32 bytes (a cache line on x86 processors) requires a full RAS/CAS cycle for the first 8 bytes, during which the next 24 bytes are also latched into the memory chip's I/O register. For the next three cycles only the low address is supplied, and the data moves from the I/O buffers to the CPU. Quite good transfer rates can therefore be obtained when loading or storing entire cache lines. In effect the memory is 256 bits (32 bytes) wide, multiplexed for transmission over the data bus.

Paul

Anton Erasmus 22 years ago

My idea is that you keep the hardware threads for device driver type code. One can still run a normal RTOS, with an arbitrary number of tasks, as one of the threads. Also, I do not mean that one should try to speed up context switching. The hardware threads must be able to execute concurrently, i.e. either interleave instructions of different threads, or have some parallel hardware to execute code from different threads simultaneously. One thread can execute a TCP/IP stack, another thread can execute some real-time filter on audio data, or a software modem.

With something like a TCP/IP stack on current processors, the number of CPU cycles needed by this code is highly dependent on network traffic. Most of this traffic might even be discarded by the stack.

All these architectures try to minimise the time for a context switch. I think if one can get away with no context switching for at least some or most of the device driver type tasks, then the overall speed will be improved. I also think that one can make things a bit more deterministic for protocol stacks etc.
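As a rough illustration of interleaving instructions of different threads, the C sketch below picks which hardware thread issues on the next cycle using plain round-robin over the threads that are ready. The round-robin policy, `HW_THREADS` and `next_thread` are assumptions made for the example; other issue policies would fit the idea equally well.

```c
#define HW_THREADS 4  /* hypothetical number of hardware threads */

/* ready_mask has bit t set when hardware thread t can issue an instruction.
   last is the thread that issued on the previous cycle (or -1 at start).
   Returns the next ready thread in round-robin order, or -1 if none is ready. */
int next_thread(unsigned ready_mask, int last)
{
    for (int step = 1; step <= HW_THREADS; step++) {
        int t = (last + step) % HW_THREADS;
        if (ready_mask & (1u << t))
            return t;
    }
    return -1;
}
```

With all threads ready, each one gets every fourth issue slot, so a thread's worst-case wait is fixed by `HW_THREADS` rather than by how busy the other threads happen to be.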

Regards Anton Erasmus

Brad Eckert 22 years ago

Look at the Mcore and NIOS instruction sets to see what instructions they thought were important enough to keep in.

Certain bit operations are hard to do in software but trivial in hardware. For example, find-first-bit is just a priority encoder. Nice to have if you simulate floating point. Also, a REP instruction (repeat next instruction n times) doesn't take much hardware.
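For comparison, this is roughly what find-first-bit costs in software. The sketch assumes "first" means the most significant set bit of a 32-bit word; `find_first_bit`, `soft_float` and `normalise` are illustrative names, and the mantissa/exponent layout is invented for the example rather than taken from any floating point format.

```c
#include <stdint.h>

/* Index (31..0) of the most significant set bit, or -1 if x is zero.
   In hardware this is a priority encoder; here it is a data-dependent loop. */
int find_first_bit(uint32_t x)
{
    int n = 31;
    if (x == 0)
        return -1;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n--;
    }
    return n;
}

/* Hypothetical software-float representation: value = mant * 2^exp. */
typedef struct {
    uint32_t mant;
    int      exp;
} soft_float;

/* Normalise so that bit 31 of the mantissa is set, adjusting the exponent. */
soft_float normalise(soft_float f)
{
    int msb = find_first_bit(f.mant);
    if (msb >= 0) {
        int shift = 31 - msb;
        f.mant <<= shift;
        f.exp -= shift;
    }
    return f;
}
```

The loop takes up to 31 iterations per call, and a software float library calls it after nearly every add or subtract, which is why a one-cycle instruction is attractive.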

You can check out the instruction set of my soft CPU at

formatting link

which is being used in a .35u SoC so it won't be too long before I find out how well it works for real.

-- Brad Eckert

CBFalconer 22 years ago

You make some good points. However I wouldn't like to limit it to 'short interrupt routines'. We are talking about embedded systems, remember, and I can envision a set of processes each operating in their own space, as determined by the register set. A further process could be the scheduler, which is entered by a timer interrupt (or other condition). The longer anything runs, the larger the chance the cache for some other process is swapped out. Yet that process may need very short response time.

This sort of system will have a very simple kernel, with very little kernel data to be maintained. This helps to make it reliable.

Probably the cache memory shouldn't even exist - the register set would be kept in a known set of memory addresses, and those addresses are using high speed external (or internal) memory.

Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net) Available for consulting/temporary embedded and systems. USE worldnet address!

David Brown 22 years ago

REP should be avoided like the plague - it might not take much hardware to implement the execution, but it would play havoc with interrupts. You'd either have to wait for it to complete (giving you unpredictably long interrupt delays) or make it restartable (leading to much more complex interrupt handling overhead just to cover this one case). It's far better to have a sort of DBNZ (decrement and branch if not zero) instruction to repeat the *previous* instruction, using a general register, which you have loaded earlier, as the counter. If you have an instruction prefetch queue a few instructions long, you can get this to run at top speed with no memory accesses to read the program - without having to make any changes to the interrupt system. (The 68332 has this "loop mode" - it is totally transparent to the user, and to almost all of the processor's hardware, but greatly speeds up memory copy loops.)
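In C terms, the loop shape that a DBNZ-style instruction serves looks like the sketch below. The function name `copy_words` is invented for the example, and how a compiler maps the loop onto actual instructions depends on the target.

```c
#include <stdint.h>

/* Copy count 16-bit words from src to dst.
   The loop body is one move; the loop control is "decrement the counter and
   branch back if it is not zero", with the counter in an ordinary register. */
void copy_words(uint16_t *dst, const uint16_t *src, unsigned count)
{
    if (count == 0)
        return;
    do {
        *dst++ = *src++;       /* the single repeated instruction */
    } while (--count != 0);    /* DBNZ: decrement and branch if not zero */
}
```

All the loop state (both pointers and the counter) lives in general registers, so an interrupt taken between any two iterations saves and restores it like any other code, with no special restart logic.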

Jim Granville 22 years ago

Is this going to be a soft core, or fabbed in silicon, and if so in what process?

What are the target RAM/code sizes and speeds it will work with?

Does it need to fetch code from external memory ?

One feature, seen in the Z8, and C166 but missing in the AVR, is a register frame pointer. Some 80C51 variants are being talked about with a RAM frame pointer, and Ubicom have a natural extension to task switch, by using HW to time-slice such Frame pointers.

-jg

Morris Dovey 22 years ago

I agree. How about an on-chip high-speed "register space" adequate to provide, say, 256 levels of context switching?

Morris Dovey West Des Moines, Iowa USA C links at http://www.iedu.com/c

Jon Beniston 22 years ago

It will be available as a soft core (Verilog). It will be free to hobbyists / academics. It will be in silicon before the end of the year as part of a licensee's SoC.

Work on the compiler is continuing, but currently code size is at least as good as m6811/arm-thumb/mips-16. Clock-for-clock, performance is better.

As it's a soft-core, the RAM interface is up to you.

You can have it on or off chip, it's up to you.

Cheers, Jon

Jon Beniston 22 years ago

Do you really need that many on-chip? IMO, you only need a couple on chip. The others can be loaded in/out in parallel with the execution of other threads.

Jon

Peter Bushell 22 years ago

You can dump the old context while the new thread is executing, but you can't do much execution on the new thread until you have its context! Some form of caching for commonly-used threads could therefore be useful - more than just a couple of register sets.
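A small model of such a cache of register sets is sketched below in C. Least-recently-used replacement is an assumption picked for the example, since the post does not say how "commonly-used" would be decided, and `ctx_cache`, `HW_SLOTS` and the function names are invented.

```c
#define HW_SLOTS  4     /* hypothetical number of on-chip register sets */
#define NO_THREAD (-1)  /* marks an empty slot */

typedef struct {
    int      thread[HW_SLOTS];     /* which thread's context each slot holds */
    unsigned last_used[HW_SLOTS];  /* time of last use, 0 = never used */
    unsigned clock;                /* counts switches */
    unsigned misses;               /* switches that had to load a context */
} ctx_cache;

void ctx_cache_init(ctx_cache *c)
{
    for (int i = 0; i < HW_SLOTS; i++) {
        c->thread[i] = NO_THREAD;
        c->last_used[i] = 0;
    }
    c->clock = 0;
    c->misses = 0;
}

/* Switch to a thread. Returns the slot holding its context.
   On a hit nothing is loaded; on a miss the least recently used slot is
   reused and the miss counter stands in for the load-from-memory delay. */
int ctx_cache_switch(ctx_cache *c, int thread)
{
    int victim = 0;
    c->clock++;
    for (int i = 0; i < HW_SLOTS; i++) {
        if (c->thread[i] == thread) {
            c->last_used[i] = c->clock;
            return i;
        }
        if (c->last_used[i] < c->last_used[victim])
            victim = i;
    }
    c->thread[victim] = thread;
    c->last_used[victim] = c->clock;
    c->misses++;
    return victim;
}
```

Switching among up to `HW_SLOTS` hot threads never misses after warm-up, which is the case where the new thread can start executing at once. Only a switch to a thread outside that set pays for loading its context first.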

Peter.

Ulf Samuelsson 22 years ago

I am in total agreement with you... A multithreaded processor would be real cool. I did a white paper on multithreading at National, but they seem to be focusing only on analog nowadays. The National MCU management/apps people have since left the company, and I believe they are doing MT technology nowadays... The reason it has not taken off, and why Intel's HT technology only gives you a few %, is the evil cache-thrashing effect. If you always execute from internal memory, you are safe and clear.

Best Regards, Ulf Samuelsson ulf@a-t-m-e-l.com This is a personal view which may or may not be shared by my employer, Atmel Nordic AB
