Integrated TFT controller in PIC MCUs

No. The point of the Cortex-M automatic interrupt prologue is to allow ISRs to be normal C functions without any assembly glue.

It is optional.
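For illustration, a minimal sketch (the handler name follows the usual CMSIS startup-file convention; the counter is hypothetical):

/* On Cortex-M the hardware automatically stacks r0-r3, r12, lr, pc and
   xPSR on exception entry - exactly the registers the AAPCS lets a
   function clobber - so an ISR can be a plain C function with no special
   attribute or assembly wrapper; the startup file just puts its name in
   the vector table. */
volatile unsigned int tick_count;     /* hypothetical counter */

void SysTick_Handler(void)            /* ordinary C function, no glue */
{
    tick_count++;
}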

-a

Reply to
Anders.Montonen

Thank God, the first two tons of ink seem to have worked eventually. You claimed exactly the opposite for a long time.

This is where the next few tons of ink will have to go, apparently. What on Earth makes you think that having 32 registers rather than 15 gives you more volatile registers? Starting to spend the third ton of ink: you only have to save the registers which you actually use. Nothing stops you from using only 3-4 registers in an interrupt handler, thus saving only 3-4 registers on either machine. If the third ton of ink does not make that clear to you, please recycle back to the first two tons; let us be environmentally friendly.

I already explained that - when you have data dependencies. The FIR implementation is a classic example. Everything else being equal, if you have only 15 registers a 6-stage pipeline will stall about 2/3 of the time; check the former ton of ink we spilled.
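To make that concrete, here is a rough C sketch (hypothetical code, not my production filter):

/* Eight taps unrolled: 8 coefficients, 8 samples and the partial sums
   are live at once, before counting pointers and loop state.  In a real
   FIR the x[] window slides, so the samples are reused across output
   points and are worth keeping in registers.  With 32 GP registers all
   of this stays register-resident; with ~15 the compiler must spill to
   the stack, and on a pipelined load/store machine each reload feeding
   the next multiply is a stall waiting to happen. */
int fir8(const int *x, const int *h)
{
    return x[0]*h[0] + x[1]*h[1] + x[2]*h[2] + x[3]*h[3]
         + x[4]*h[4] + x[5]*h[5] + x[6]*h[6] + x[7]*h[7];
}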

They do have that, and it does not obsolete the need in question. It saves you from unnecessary serializations, yes, but it does not help against data dependencies - which is what makes 15 registers too few for a load/store machine (unless it is non-pipelined, which I am sure is how ARM was at least initially, but that is even more crippling).

So eventually you do understand that having 32 registers makes a (load/store) machine more efficient by definition. My FIR example demonstrated this can be up to a few *times* more efficient. And yet you call an architecture which is crippled by design - unable to keep up with the one it is compared to, simply because of how it was designed - non-crippled. Well, your choice of words does not alter the reality, which is that with 15 registers you just cannot design the equivalent of a load/store machine with 32 registers. You can build hardware around that - Intel have done it for ages to keep their even more crippled x86 model alive - but you can build hardware to do about anything (we covered that, too, so hopefully we will not go there again).

Clearly ARM was initially designed to save on design resources - time, designer skill - in order to have something working to sell. Performance-wise its architecture is dramatically inferior to power exactly because they made it with only 16 registers, perhaps targeting it at small, low power applications. It has obviously been superior to power for the smallest of applications (like in the first phones), but when it comes to performance it is what it is. Notice that "crippled" does not mean unusable; it only means that under equal conditions (for large enough systems, say 1M+ RAM, we covered that already) ARM will be at a significant disadvantage compared to power, up to a few times slower. Of course certain tasks can be done no slower by the crippled CPU; it is just that the opposite is never the case.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
------------------------------------------------------

Reply to
Dimiter_Popoff

In article , snipped-for-privacy@hesbynett.no says... .....

The whole 32 v 16 register 'debate' has been a "how many fairies fit on a pinhead" type of discussion, all based on the types of applications individual posters normally write.

We have no idea what type of application or even range of applications the processor is for, let alone what type of processing is required.

Personally, observations on the following might have been more useful:

1/ Package type options for the processor, and compiler support merits

2/ Merits of TFT controller flexibility

3/ What type of things he will do with the TFT, and whether the UI has to have animated or moving widgets, or even phone-style windowing and wipe effects

4/ Does the TFT controller have its own frame buffer(s), and what are their limits

5/ Does it have hardware assist, or does it rely on memory-to-memory DMA for copying screens or parts of screens

6/ Graphics library support and limitations (there was a bit on this early on)

If the application is going to be busy doing lots of memory moves while the TFT controller is accessing shared memory, that is going to put a bigger load on the application than most other things, in the MAJORITY of applications.

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk 
    PC Services 
Reply to
Paul

I disagree. Registers are not free: they cost die space, power, and, probably most important, bits in the opcode. Other things being equal (and instruction bandwidth being a limit), more registers means fewer bits for other things, with the potential for slower code.

Reply to
Wouter van Ooijen

Because it depends on the ABI in use.

If you use an ABI in which most of the 32 registers are callee-saved, or you write your device-specific handler in assembly language and hence have direct control over the registers in use, then you are correct.

If you use a higher-level language to write your handler and the ABI in use states that around half of those registers are caller-saved, then, in the general case, your IRQ wrapper must save those registers before it calls that handler, because the compiler will generate code which conforms to that ABI.

These days, most people write their drivers in a higher-level language such as C, and code from different people/teams has to work together, so the compiler must conform to the ABI in use.

This means that, in the general case, if your ABI requires the caller to save (say) ~16 registers out of the 32, but the code generated by the compiler for a specific driver only uses 6 of the caller-saved registers, then those ~16 registers still need to be saved, because the wrapper doesn't know any different.
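A sketch of the situation (hypothetical names; the actual register lists come from the ABI document):

/* The wrapper dispatches through a function pointer, so neither it nor
   the compiler can know which registers a particular handler's code will
   actually touch.  The (assembly) entry stub must therefore save every
   caller-saved register the ABI permits the compiler to use before this
   C-level dispatch runs - even if today's handler only touches six. */
typedef void (*irq_handler_t)(void);

extern irq_handler_t irq_table[64];    /* hypothetical per-IRQ handlers */

void irq_dispatch(unsigned irq)        /* called from the assembly stub */
{
    irq_table[irq]();                  /* any ABI-conforming C function */
}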

The upside is that you get a general purpose ABI in which everyone's higher level language code can work together.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world
Reply to
Simon Clubley

Like almost everything in engineering, it is a trade-off. The number of registers needed to accomplish a task efficiently also depends on other aspects of the ISA. For example, with an ISA with more sophisticated addressing modes one may need fewer registers than with a minimalistic RISC ISA. Many modern (superscalar) processors internally have more registers than are exposed via the ISA; register renaming reduces the chance that registers become a performance bottleneck. With the x86 64-bit instruction set, its designers chose to expand the number of general purpose registers from 8 to 16. They could easily have chosen a larger number, but apparently their analysis showed that the benefit of more registers did not outweigh the downsides. I'd say it is a bit too simplistic to state that an ISA that has only 15 GP registers must be crippled.
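As an illustration of the addressing-mode point (a sketch; the exact instructions generated are of course compiler- and flag-dependent):

/* On x86/amd64 a read-modify-write can be a single instruction with a
   memory operand, e.g. "incl (%rdi,%rsi,4)", holding no value in a GP
   register.  A minimal load/store RISC must do a load into a register,
   an add, and a store back - three instructions and one live register -
   so richer addressing modes directly reduce register pressure. */
void bump(int *counts, long i)
{
    counts[i] += 1;
}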

I think this discussion about the optimum number of processor registers would be more appropriate in comp.arch, where the people who are/were involved with processor design can be found.

Reply to
Dombo

This is at least the third time I have explained this to you, but I don't mind, I'll do it as many times as it takes: there are many ways to destroy something working other than inept programming, some of them much easier.

So what is the guaranteed IRQ latency on your ARM core of choice running linux with some SATA drives, multiple windows, ethernet and some serial interfaces? Try to give a figure - please notice the word "guaranteed"; I know how much the linux crowd prefers to talk "in general".

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
------------------------------------------------------

Reply to
Dimiter_Popoff

It's useful to make the distinction between /named/ registers (exposed in the ISA to the programmer) and /unnamed/ registers (implementation dependent, internal registers for register renaming). When designing the amd64 ISA, the AMD folks, working closely with gcc developers, Linux kernel developers, and presumably many other people, concluded that 16 named GP registers was the right balance for the architecture. It was long established that the 8 registers of x86 were too few, but as you say their analysis did not show much benefit from more than 16 registers - and the disadvantages (opcode space, and extra register stores in function calls) outweighed any advantage.

Internally, implementations of amd64 might have hundreds of unnamed GP registers.

Also note that the amd64 architecture has lots of SIMD registers as well as GP registers. I think in most examples where large numbers of GP registers would help, SIMD registers are a better solution - and are therefore implemented on most fast cpu designs.
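As a sketch of that (my illustration using SSE intrinsics, not anything measured in this thread) - an 8-tap dot product whose live values sit in a couple of XMM registers instead of sixteen GP registers:

#include <immintrin.h>

/* The 8 multiplies and the reduction live entirely in XMM registers. */
float fir8_sse(const float *x, const float *h)
{
    __m128 lo = _mm_mul_ps(_mm_loadu_ps(x),     _mm_loadu_ps(h));
    __m128 hi = _mm_mul_ps(_mm_loadu_ps(x + 4), _mm_loadu_ps(h + 4));
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));      /* fold upper half down */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));  /* add remaining lane  */
    return _mm_cvtss_f32(s);
}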

Finally, the discussion was centred on load-store architectures such as ARM, MIPS and PPC. x86/amd64 are not load-store, and can do more with fewer named registers. Dimiter's assertion was that a load-store architecture is inherently crippled if it has only 16 registers - he has not commented on CISC architectures.

A more relevant example is the 64-bit ARM architecture - which has 32 GP registers. That does not in any way prove that the old 32-bit ARM was "crippled" with only 16 registers - but it does show that for such a large processor, the extra registers give a positive trade-off.

Reply to
David Brown

Having L1/L2/L3 caches will instantly introduce a high variation between the mean and max latencies. Even for i486s with their minimal cache and no operating system, a 10:1 variability was visible.

Any variability to do with register saving will be completely insignificant compared to the effects of caches. Unless, of course, you are having to dump the entire hidden state of an Itanic processor :)

Reply to
Tom Gardner

Yes, though on some processors one has the ability to lock part of the L1 cache - which allows it to be dedicated to interrupts and can make things a lot tighter (by avoiding the need to update entire cache lines).

Overall the latency variability obviously increases as processor sizes increase, but then total execution times decrease, memories get faster etc., so the worst case latency can still be very low. On the 5200b which I use I have never needed to resort to any cache locks; all I do is stay masked only as long as absolutely necessary.

Well, we have not come to that obvious point yet, I am afraid :-). Let us first have the figure I asked for on the worst-case linux IRQ latency, then put into its context the claim of the ARM/linux devotees about achieving lower latency by not having enough registers :-).

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
------------------------------------------------------

Reply to
Dimiter_Popoff

...*mean* total... But you know that!

It can be very difficult to /measure/ the maximum latency of even main processing loops, let alone interrupts.

Calculation of maximum times is only possible on the XMOS processors AFAIK.

Well, the ARM (embedded with an FPGA) I'm about to start using is dual core, each core with 32+32K L1, then 512K L2 and 256K RAM. Maybe I'll do some serious timing, but the hard realtime stuff will be in the FPGA.

Reply to
Tom Gardner

Neither you nor anyone else can give worst-case IRQ latencies for Linux running on PPC, MIPS, ARM, x86 or anything else - there is too much variation. It is a rare system that can give any useful worst-case IRQ latencies for /any/ software on processors running at many hundreds of MHz, with multi-layer caches, heavy pipelines, MMUs, etc. When you take into account all the possible issues, your true worst-case IRQ latency can be enormous, measured in thousands of clock cycles, and orders of magnitude greater than the realistic average latencies. That's why such systems are great for throughput, but poor for real-time systems.

This is comp.arch.embedded. While some people here use embedded Linux, the majority do not - Cortex M3 cores running FreeRTOS or no OS are far more common than Cortex A9 cores running Linux. So for a real comparison, an M3 is ready for user interrupt code (all volatile registers stacked, ready for a non-trivial handler) in 12 cycles. On the 180 MHz PPC microcontroller I used, I'd guess (I haven't measured, and don't intend to measure) a dozen cycles for the interrupt vectoring and pipeline flushing, then 20 instructions to save the interrupt registers and volatile registers - taking more than 20 cycles because of the instruction fetch times. If you are maximally unlucky with the caching, it will take perhaps twice that.

The chip with the smaller register set and dedicated interrupt hardware reacts faster, with lower variation, and puts you directly into the user code. The bigger and more complex chip has longer delays, more variation, and requires more user code. And in this case, the faster clock speed of the PPC device does not outweigh the higher clock cycle count in interrupt handling.

Reply to
David Brown

If I needed to meet guaranteed timing schedules, I wouldn't be using Linux to try and achieve them - it simply hasn't been designed for that.

I would use an RTOS, or maybe even push the hard realtime part of the problem onto its own bare metal board if the constraints were too tight for even an RTOS.

Note that even in the case of an RTOS, your drivers are still generally written in a HLL these days, so the RTOS will still push the caller-saved registers even if your driver doesn't use them, because the RTOS has to assume the HLL compiler could potentially use all the caller-saved registers the ABI allows it to.

I don't understand your fixation on the number of registers pushed; pushing a few extra registers is a _very_ small price to pay for all the advantages of being able to write drivers and other code in a HLL.

Note that even when writing HLL code to run on bare metal, the compiler still has to generate code against an ABI and hence follow the ABI's rules unless you modify the compiler to use your own custom ABI.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world
Reply to
Simon Clubley

This answer means it is infinite - a nice figure in the context of saving a few registers, no doubt about that. Am I supposed to laugh or to cry?

I can give a figure for DPS - and guarantee it, commercially. As an OS, DPS is by now no smaller than linux - it is just that the applications written for it are much, much fewer. VM, windows, filesystem, networking etc. - it is all in there. And I do have a figure for the latency. So this figure for linux is infinity?

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
------------------------------------------------------

Reply to
Dimiter_Popoff

So your answer is "too huge to even look up the exact figure", fairly similar to the "infinite" David gave.

Oh, but this is your fixation, not mine. You argued that ARM is at an advantage because it has only 15 registers rather than 32, and you put that in the linux context by talking all that ABI and whatever other abbreviation gibberish the linux crowd constantly invents to mask the mess they live in.

My point was - still is - that ARM is a crippled load/store machine because it has too few registers to be a viable (i.e. pipelined) one.

You (and a few others) wrote tons of irrelevant nonsense about saving registers, latency etc. - clearly talking without knowing what you are talking about.

Oh, this will be the fourth or fifth time I have to explain this to you: there are easier ways to destroy something working than inept programming; a hammer or even a piece of rock will do nicely.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
------------------------------------------------------

Reply to
Dimiter_Popoff

You are supposed to use something other than standard Linux (or Windows) when you need hard real time. If you really need to use Linux and you also really need real time, then you can use one of several real-time extensions to Linux which will give you a high (compared to dedicated RTOS's and more suitable hardware) but definitely not infinite maximum latency.

Of course, since you sell a real-time system which /does/ have guaranteed worst-case latencies, obviously you should be laughing :-)

I am sure DPS has lots of useful and important features - including everything you and your customers need. But I am also sure it /is/ smaller than Linux (which is currently at about 17e6 lines for the kernel alone) - the comparison is not useful. Comparing to vxworks, QNX, RTEMS, etc., would make more sense. (And these folks will also give you figures for latencies - assuming you can give details of the hardware, and perhaps pay them enough money!)

Unless you have calculated it, or at least measured it to a desired statistical level of accuracy, then by the definition of "worst case", it is infinite. (You might prefer to say "real time" requires calculation, not just measurement - but that gets increasingly difficult for more complex systems. If your tests suggest that missing a timing deadline is statistically less likely than being struck by lightning, that is often good enough.)

One report I found with Google is for an 800 MHz Cortex A8 chip with kernel 2.6.31, testing with and without the "real-time" patch (this is not a "real-time extension" to Linux - those work in a different way; basically the "real-time patch" sacrifices total throughput but allows most system calls and functions to be pre-emptable). Without the "real-time patch", maximum measured latencies were 2465 us - with the patch, the maximum measured latency was 58 us.

Measurements will only be valid on a particular system, with particular kernel versions, and typical realistic (and worst case) loads - but that 58 us will give you a ballpark figure that's a little lower than infinity.
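For what it's worth, such figures are typically gathered with a measurement loop along these lines - a bare-bones sketch of what tools like cyclictest do; the period and iteration count are arbitrary:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec next, now;
    long worst_ns = 0;

    clock_gettime(CLOCK_MONOTONIC, &next);
    for (int i = 0; i < 100000; i++) {
        next.tv_nsec += 1000000;                 /* 1 ms period */
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec++;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);
        long late = (now.tv_sec - next.tv_sec) * 1000000000L
                  + (now.tv_nsec - next.tv_nsec);
        if (late > worst_ns)
            worst_ns = late;                     /* record worst wakeup */
    }
    printf("worst observed wakeup latency: %ld ns\n", worst_ns);
    return 0;
}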
Reply to
David Brown

Perhaps the abbreviation "ABI" has multiple uses, and you are thinking of a different one than the rest of us? In this context, it is "Application Binary Interface", and is a set of rules for code and calling conventions for a particular target system. In some cases, the ABI will vary from compiler to compiler, or between target OS's - in other cases, the cpu manufacturer will control it tightly. In the x86 world, Intel gave very little guidance on an ABI - hence x86 compilers use wildly different calling conventions. AMD did better for amd64 - almost all compilers and OS's on amd64 use AMD's ABI, but of course Microsoft picked their own incompatible (and inferior) ABI.

In the PPC world, PPC EABI is the standard for embedded systems, with other ABI's used for AIX, Linux, etc. The PPC EABI (with 32-bit and 64-bit variations) covers a wide range of standardisations, including register usage, stack alignment, size of standard types, section names, standard functions, etc. It is the ABI that says register R1 is the stack pointer on the PPC, that R2 and R13 are anchors for small data areas (constant and read/write respectively), and that registers R0 and R3-R12 are "volatile" and must be saved by interrupt wrappers that call other EABI functions.

I don't know why you assumed the mention of ABI meant people were talking about Linux.

Reply to
David Brown

Oh, but it is - if we compare the OS itself, not the applications; meaning what you as a programmer get as functionality via system calls. 17e6 lines of wasteful programming could well amount to less than my 1.7e6 lines (not sure about the exact figure), hard to say. Does their kernel include support for windows, offscreen buffers, graphics draw calls etc.?

Do these come with all the features like windows, VM, filesystem, networking?

Measuring is OK; calculating is not just difficult, it can be outright impractical nowadays. One should do it to get a ballpark figure of what to expect, then measure - over a long enough time the worst case response is not so hard to measure, provided you know what is going on.

Well, 58 us is still OK - only about 5 times (or is it 10 times? I am not sure whether the 10 us figure was not on a 200 MHz machine) worse than DPS at a 400 MHz power (mpc5200b). The question why this real-time patch is not universally applied remains, of course - how much of the functionality do they have to sacrifice if they use it?

I asked for this figure only to put into its context the claim about the "need" to save all 32 registers. So let us see: saving 16 more registers to, say, the slower of the two DDRAMs - the one on the 400 MHz 5200b, 133 MHz clocked DDRAM, which IIRC does something like 10 ns per .l on average - would, assuming a complete cache miss, add about 16 x 10 ns = 160 ns to the 58 us, i.e. roughly 0.3% of it. I think we all can only laugh here. The funnier thing of course is that there is no justified necessity to waste these 160 ns - but I can understand the programmer who may have wasted them; why would he bother? It would be just a waste of his time to chase nanoseconds when the system stays masked for tens of microseconds. I would not have bothered.

Dimiter

------------------------------------------------------
Dimiter Popoff, TGI
------------------------------------------------------

Reply to
Dimiter_Popoff

I am not sure that this is the best place to give a beginners' course on Linux, QNX, RTEMS, or operating systems in general. Obviously you have vast experience with DPS - yet your questions show a lack of knowledge of how these sorts of OS's are built up and structured. I can't tell whether you really know so little about what Linux is and what an OS kernel is. I don't want to be patronising and write about things you have worked with every day for twenty years, but equally I am happy to explain things if it is helpful to you. Can I just say you should read the Wikipedia articles plus each project's home page, and if we need to go further then we'll take it from there?

Agreed.

In Linux, interrupts get passed on to kernel interrupt threads, and thus involve a (limited) context switch. That is always going to be more costly than handling the interrupt directly, but allows the interrupt code more access to kernel functions.

As far as I understand the RT patch, there are two issues regarding universal application in the kernel. One is that improving worst-case response times means minimising the size of critical sections with interrupts disabled. The other is that much more of the kernel is preemptable and re-entrant, and uses finer-grained locking. So code that used to be "get lock, do A, B, C, release lock" might be changed to "get lock, do A, release lock, do B, get lock, do C, release lock". The locked (or interrupt disabled) sections are shorter, but total throughput is reduced as there is more overhead in the locking. In particular, I gather that most spin-locks (which are very fast at taking a free lock) are replaced by mutexes with priority inheritance.
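In code terms the transformation looks roughly like this (hypothetical names; pthread mutexes standing in for the kernel's locking primitives):

#include <pthread.h>

extern void do_a(void), do_b(void), do_c(void);  /* hypothetical work items */

static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

void coarse(void)   /* throughput-friendly: one long critical section */
{
    pthread_mutex_lock(&lk);
    do_a(); do_b(); do_c();
    pthread_mutex_unlock(&lk);
}

void fine(void)     /* latency-friendly: shorter sections, more lock traffic */
{
    pthread_mutex_lock(&lk); do_a(); pthread_mutex_unlock(&lk);
    do_b();         /* assume B can safely run unlocked in this sketch */
    pthread_mutex_lock(&lk); do_c(); pthread_mutex_unlock(&lk);
}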

Certainly some aspects have moved into the main kernel - modern Linux kernels have much finer-grained locking than older ones, which tended to use the "big kernel lock" a great deal. The main motivation here is SMP systems - when Linux systems were generally on one core, a single "large" lock was okay, but with multiple cores it gets very inefficient.

Other aspects are configurable (as are many things in Linux) - you often want a different balance between throughput and response times for server systems, desktops, and embedded systems.

Register counts are not relevant in this context (which is why people can't understand your jump to Linux) - clearly the number of registers saved is going to be a drop in the ocean when you are talking about big CPUs running big OS's, rather than microcontrollers running bare-bones or dedicated OS's (and we established long ago that the saved register count is usually, but not always, negligible in those systems too). The size of the register save is relevant when it is useful to have a response time of 12 cycles rather than 30 cycles - it is not an issue when the response time is 5000 cycles!

Yes indeed - premature optimisation is the root of all evil, after all.

Reply to
David Brown

They will shave some Flash space in the MIPS16e -> microMIPS transition, but that won't be revolutionary. The MIPS32 -> microMIPS saving will be noticeable, yes. Maybe MIPS16e is not that popular after all.

The silicon savings of MM's core are probably close to none; I think this is mostly for the user's comfort of staying in a single mode, and to have a distinct Thumb-2 competitor. I wouldn't be surprised if the MIPS32 decoder is present in the macro cell, just fused off/disabled. It is only speculation, of course, but maintaining fewer cores is a sane choice.

Still, the MM might be interesting.

Reply to
Vladimir Ivanov
