Integrated TFT controller in PIC MCUs

MMIX (instructional architecture by Donald Knuth) has 256 registers. I don't think any MMIX chips have been made, but there are FPGA implementations around. I believe it had some resemblance to a 1990s-era processor that was actually produced, but I don't remember which one.

The Intel HD Graphics processors have 128 SIMD registers if I remember right. They have a relatively conventional instruction set (resembling a typical computer) compared with other GPUs.

Reply to
Paul Rubin

I am still somewhat amazed at what was said. What on earth is there to stop people from doing something *that* simple - here are two IRQ handlers on a small MCU. I used an mcf52211 a couple of months back to make a HV source - it does the PWM/regulating, overcurrent protection/limiting, serial communication etc., all in all 4 tasks and several IRQs. Took me about 2 weeks to program (I had hoped it would take 2 days, but I had completely forgotten the insides of the 52211 so I had to recall a lot, which is where the 2 weeks went). A total of about 250k of sources, the object code being almost 9 kilobytes.

So here are two IRQ handlers where hopefully it is obvious how only what is needed is saved and restored, pretty basic stuff:

formatting link

This is not VPA, just plain 68k (well, CF) assembly, so it is far from being taken from my own world.

I just wonder how hopeless things must have become to question the viability of doing something that basic.

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI

formatting link

------------------------------------------------------

formatting link

Reply to
Dimiter_Popoff

The "compact" I can easily agree with.

For the latter, are you referring to ARM7TDMI core in ARM mode, or some ARMv7 core in ARM mode?

The modes + stacks were okay, even useful for some things. The interworking and associated cruft I avoided by not using Thumb at all. There was no narrow bus to give Thumb an advantage.

Reply to
Vladimir Ivanov

Yes. And there is a speed penalty for using R8-R15 even if the code size remains no larger than in ARM mode. That's why claims that Thumb2 is almost as fast as ARM mode smell of marketing.

But Anders posted in the previous post a link to an interesting paper on the subject, which I am going to read now.

Reply to
Vladimir Ivanov

Or maybe I read that wrong and there is really no speed penalty if we're talking about the wider 32-bit version of the same 16-bit instruction.

Do all instruction forms have a wider version to accommodate R8-R15 usage?

Reply to
Vladimir Ivanov

Thanks for pointing this out. There must have been big marketing pressure to introduce the multiple load/store instructions and make the hardware hairier. Consider that all the bigger iron should also support microMIPS and this stuff.

I have excluded the system control instructions, since that seems mostly like shortsightedness when they devised the compressed instruction sets. But, yes, Thumb2 and microMIPS fixed that.

Now that you mention it, I remember seeing pointers about the future PIC32MM being stuck to microMIPS only. Again marketing pressure?

Reply to
Vladimir Ivanov

Renesas' SH-5 had 64 64-bit GPRs and 64 32-bit floating-point registers. I think there were silicon implementations, but that architecture never went anywhere.

-a

Reply to
Anders.Montonen

Sorry I meant ARM7TDMI!

Remember the Cortex M was for single-chip *microcontrollers* (where the majority of the chip is memory).

--

John Devereux
Reply to
John Devereux

As far as I can tell from the header files and compiler source code, the PIC32MM could be a replacement/follow-up for the PIC32MX1xx/2xx. There's no DSP ASE, and no shadow registers, so it's clearly not a high-performance chip, and it doesn't seem like it has any special peripherals either. Using microMIPS at the low end makes sense, as you can fit more code in a smaller flash. I don't know how much silicon area is saved by having only the one instruction set, but that kind of makes sense for a low-end chip as well.

-a

Reply to
Anders.Montonen

I believe that is the case.

-a

Reply to
Anders.Montonen

That is a nice concept, but it has two problems:

- when you overflow the available register sets, you must spill to memory. Whether you need to do this depends on where you are in the register set. This makes timing difficult to predict, which is not nice for a real-time system

- imagine a context switch: now you have to save/restore all register sets!

In general, fat-context CPUs are better at single-threaded no-interrupt applications, but worse for switch-often interrupt-heavy applications.

Wouter van Ooijen

Reply to
Wouter van Ooijen

Not in Cortex-M0. About the only thing you can do with the upper registers (r8-12) is copy to/from a lower register. As an illustration: a cooperative context switch on a cortex-m0:

        .cpu cortex-m0
        .global switch_from_to
        .text
        .align 2

// extern "C" void switch_from_to(
//     int ** current_stack_pointer,
//     int * next_stack_pointer
// );
switch_from_to:

    // save current context on the stack
    push { r4 - r7, lr }
    mov r2, r8
    mov r3, r9
    mov r4, r10
    mov r5, r11
    mov r6, r12
    push { r2 - r6 }

    // *current_stack_pointer = sp
    mov r2, sp
    str r2, [ r0 ]

    // sp = next_stack_pointer
    mov sp, r1

    // restore the new context from the new stack
    pop { r2 - r6 }
    mov r12, r6
    mov r11, r5
    mov r10, r4
    mov r9, r3
    mov r8, r2
    pop { r4 - r7, pc }

Wouter van Ooijen

Reply to
Wouter van Ooijen

That is true, but they implement ARMv6-M Thumb, not Thumb-2.

-a

Reply to
Anders.Montonen

In that case I was confused about the context of the statement :)

Wouter

Reply to
Wouter van Ooijen

I think it would be wrong to talk about /proving/ points here - proper proof would require implementing the same algorithms on different architectures (or preferably on the same basic architecture, but with different register counts) and comparing code densities, run times, cache hit/miss counts, memory bandwidth, etc. That is clearly far beyond what anyone will do for a Usenet post!

Context switching is not a valid example for /you/, based on the figures /you/ gave. But it is certainly not something that can be dismissed so easily. I have written systems that had timer interrupts 100,000 times per second - thus interrupt overhead is 100 times as important as in your single example with 1000 interrupts per second. I have written one system where there were 40 clocks between interrupts - that does not leave a lot of time for context saving and restoring.

In general, in bigger systems (and PowerPC cores tend to be used in bigger systems than most of the embedded cores we see here) you try to avoid many interrupts, and prefer DMA and more sophisticated peripherals to keep the interrupt rate low. In smaller microcontrollers, you don't have such sophistication - your UART might have only a single buffer, so you will have interrupts for every character transmitted or received. But interrupts are less of an overhead, partly because of the lower register count. It's a different balance.

Other than interrupt context switches, there is also function call overhead. I don't know what sort of calling conventions you use in your VPA, but for high-level languages there is normally a defined convention with some registers being caller-saved (or "volatile") and others being callee-saved (or "non-volatile"). The split can vary between compilers on a target, or can be defined by the target's standard ABI. When you are calling unknown code (i.e., code from a separately compiled module), the compiler (or assembly programmer) must follow the calling convention. That means a function must save any callee-saved registers before using them, in case the calling function had data there. And it must save caller-saved registers before function calls, in case the called function uses them. There is always a significant amount of unnecessary saving and restoring in this process, and it increases with the number of registers in the system and with the number of small and simple functions (because with larger functions that really use the data, the save/restores are no longer unnecessary and therefore not overhead).

So if a PPC function (following the standard PPC EABI conventions) needs to use any of the 18 "non-volatile" registers, it must save the old values and restore them on exit - even if the calling function does not need the old values. And if it is calling another function, then it must save any of the 11 "volatile" registers it wants to keep - even if the called function does not touch them.

I can't produce a /function/ that is more efficient with 16 registers than 32 registers - but I hope that above I have explained how it makes a difference with chains of small functions (or functions with few register demands).

I think it is reasonable to say that as programs (in the embedded world) have got bigger, memories have got bigger, and compilers have got better, then the balance for many systems has moved more towards 32 registers rather than 16 registers, in the same way that it has moved towards 32-bit cpus from 8 and 16-bit cpus.

The one I am most familiar with is VLE, which is used by many of Freescale's PPC microcontrollers. (It may be used by others too, but I have only used PPC from Freescale.) I also remember that Freescale had some chips with a sort of compression scheme where code was decompressed while it was loaded into cache - but I have forgotten the details. VLE is the same basic idea as ARM's Thumb2 and the MIPS equivalent.

The barrel shifter bits apply to a wide range of ARM instructions - meaning you can effectively fold a zero-cost shift into many other instructions. Thus if you want "a = b + c * 16;", on the PPC you need two instructions (shift then add) with pipeline/scheduling considerations between them, while on the ARM it is done in one instruction and one cycle.

With 16-bit Thumb instructions, the barrel shifter only works on loads, stores, and specific rotate/shift instructions - with 32-bit Thumb and full ARM instructions it works on a wide range of instructions.

And you in turn are generalising based on an example and your own very specialised experience. I hope that what I wrote earlier makes it clear why I think 16 registers can be an advantage, and why I think your example cannot count as a general proof or argument (though I happily accept it as an example of when 32 registers is very useful).

Beyond that, I would say that even if 16 registers is not /more/ efficient than 32 registers, I have yet to see any reasoning for why 32 registers is /significantly/ more useful for the type of code generally seen in microcontrollers (and yes, I have no choice here but to talk of generalities). A filter algorithm might run faster with more registers - but it would do even better by using additional DSP-specific support (such as the DSP registers and instructions on the Cortex M4 compared to the M3) or SIMD support (Altivec on PPC, Neon on ARM).

I will try not to repeat myself, but I hope you can see that this is necessarily a generalised discussion, based strongly on opinion and personal experience - unless you want to give several whole programs implemented on at least two architectures using commonly used tools and techniques (i.e., C code rather than VPA code) as real evidence.

Yes, it is fine when you are used to it - but very weird to start with, and different from everything else (including the numbering used on other big-endian processors I have used, such as M68K/ColdFire). You have to double-check everything when you are connecting address line A31 on the MPC to A0 on the external RAM chip! And then Freescale's documentation freely mixes up 64-bit PPC conventions with the 32-bit conventions, so you find you are trying to set "bit 59" of a register by writing the value 16... It was even more "fun" when I found Freescale had also mixed it up in a couple of register definitions in a header file.

To me, "bit 0" is always the least significant bit regardless of the endianness of the chip. But I try not to start wars over it :-)

Fair enough.

When your interrupt function is a leaf function, it's no problem to only save the registers you need (regardless of the size of the register set).

But when it is not a leaf function, and you are calling other functions (whose code you do not know), you have to follow the calling conventions and assume that the called function will destroy all "volatile" registers - that's at least 11 extra register saves in a PPC EABI system, as well as the link register, CCR, etc.

On one PPC system I worked with which did not support individual interrupt vectors, /all/ registers were saved (and later restored) because the vector code was calling unknown external code. 32 x 32-bit general purpose registers plus 32 x 64-bit floating point registers, stored on external SRAM with 2-cycle accesses, meant 400 cycles just for the register save and restore - not including the memory bandwidth for reading the code or any other processing time. Even if I had spent time optimising it by limiting the storage to the volatile registers (since I knew the interrupt functions all followed the EABI conventions correctly), and therefore halved the overhead, it would still have been very large.

Reply to
David Brown

That was needed for Thumb, but not for Thumb2 - you simply use the 32-bit instructions and have access to the same registers as you would with 32-bit ARM code. If you like, you can think of Thumb2 as being mostly the same as 32-bit ARM (losing a little barrel shifter and conditional execution capability) with the addition of 16-bit "short-cuts" for the most commonly used instructions.

I haven't studied either Thumb2 or MIPS16e (or PPC VLE) in detail, but they all seem to be a similar solution to a similar problem - making a variable-length encoding scheme that is easy to decode, keeps common instructions short, but makes it easy to access the full range of the cpu's abilities.

No, the original Thumb instruction set only gave access to some of the cpu and let you write significantly slower but more compact code than full ARM. That's why they had to keep the ARM decoder too - if you needed fast code, you had to use the full instruction set. And no one considered the mix of two instruction sets to be "balance" - polite people called it a pain in the neck.

Thumb2 lets you write code that is about 60% of the size of ARM code, and is often /faster/ than 32-bit ARM code, since you can get almost all of the functionality while being more efficient on your memory bandwidth and caches.

For backwards compatibility. In Cortex M applications, code is generally compiled specifically for the target - so there is no need for binary compatibility. But for Cortex A systems, you regularly have pre-compiled code from many sources, and binary compatibility with older devices is essential.

Reply to
David Brown

OK, I really meant "make" your point. Though in technical terms making a point which cannot be proven is fairly pointless...

So you understand that these figures are correct - but imply that the example applies just to me. I thought we could agree at least on the meaning of numbers.

Which has *nothing* to do with context switching; if you do that and save all registers instead of the minimum you have to, you just don't know what you are doing.

I know from threads from years past that you tend to mix up task scheduling and interrupt processing, but please understand that there is a world of difference between interrupt processing and a task switch initiated by an interrupt.

You have not made a point - context switching is *not* a case where 32 registers can be worse off than 16 in a non-negligible way (negligible meaning performance cost within say 0.1%, latency-wise same as 16 or better).

You have yet to give a valid example for what you claim.

This is wrong, it is not true that on larger systems interrupts must be avoided.

You should understand that there is no such animal as "in general" in engineering. Things we make have to *work*, so we have to go down to the details.

For example, on an mpc5200b based system - the one I use every day for programming, emailing etc., running DPS of course - I have plenty of interrupts all the time: from the display controller vertical retrace, from the two PS2 ports where the mouse and the keyboard are connected - each PS2 clock causes an interrupt, and yes, they can come at a faster rate than every 10uS - then there are the ATA interface interrupts etc. etc. These have *nothing* to do with task switching, neither do they initiate one. Say the decrementer interrupt might initiate one - and then of course all registers will be shuffled, but *AFTER* the interrupt has been served and unmasked again so that the interrupt latency stays low (so a 10uS IRQ rate cannot impress the machine).

Whatever example you try to come up with, you will never find one which requires saving *all* of the registers with the interrupts masked - which means there is no performance advantage in having 16 rather than 32 registers, while the opposite is true most if not all of the time.

Same concept: save/restore only what needs to be saved/restored. Fewer registers will be as good only as long as you do not need more registers; beyond that you have to save/restore more *because* you have fewer registers. I explained that once already and I am doing it again for you; please do not make me do it for a third time. Just think and be willing to understand the obvious.

This is irrelevant, we are comparing cores. Whether this or that compiler got some of its basics right or wrong has nothing to do with it. The fundamental principle - "save/restore only what you have to" - applies in all cases of programming.

IOW if you waste resources by saving/restoring more than you have to you are doing only a little better than masking all interrupts and jumping into a " bra *" loop; the ways to destroy something working are probably infinite, these are only two of them.

I know you cannot produce an example - the above (much of it clipped) was irrelevant.

You still do not get it, do you.

32 registers are not just *better*, they are a necessity on a load/store machine with a pipeline (something like 5-6 stages deep). You just cannot keep the pipeline full with only 16 registers without stalls because of data dependencies.

My FIR example demonstrated about a 3-fold improvement, and above I just explained - *again* - why. It should be easy to see how this applies not just to FIR but to any computationally intensive algorithm where data dependencies would kick in.

Obviously hardware other than a general purpose core can be built to do things the core cannot do.

We are comparing the cores here.

Poor programming. IRQ routines may never call unknown external code.

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI

formatting link

------------------------------------------------------

formatting link

Reply to
Dimiter_Popoff

(snip)

(snip,then I wrote)

Seems that they thought of that.

and that.

formatting link

Section D.8 explains that one. For the more usual case where function calls are much more common than context switch, and you want to minimize total overhead, more register windows are better.

For SPARC, there is a register mask that controls how many are used by user mode code. Changing that allows supervisor code to use some, such that one can optimize overall between user and supervisor code.

And finally, they consider the case where context switch time is most important. In that case, they allocate register windows between tasks, such that one can do a task switch by only changing the register mask and current window, and not storing any to memory. The OS has complete control over how the windows are used.

-- glen

Reply to
glen herrmannsfeldt

That just shows the disconnect between what you do and what the rest of the world does. :-)

While the actual wrapper around the device specific interrupt handler (i.e., the generic IRQ code which runs before you get to the device specific handler itself) is generally still assembly language, most people don't write the actual device specific handler in assembly language any more, but use a higher level language such as C instead.

Once you do that, the IRQ wrapper needs to save all the registers the C compiler could potentially use, including all the temporary registers, before the wrapper calls the device specific handler.

Different set of tradeoffs. The higher level language code can potentially be reused on multiple architectures (or, if that's not possible in a specific case, can at least be used as the starting point for another driver); your PowerPC specific assembly language code cannot be reused in such a way.

What works for you in your restricted environment doesn't work when you need your code to work in a generic environment across a wide range of architectures.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world
Reply to
Simon Clubley

I didn't know that. Thanks.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world
Reply to
Simon Clubley
