highest frequency periodic interrupt?

Not exactly periodic, but I did 2 Mb/s interrupt-driven bidirectional serial communication. That is about 5 us between characters, and there were 2 interrupts per character (one to receive, the other to transmit the answer). In other words, about a 400 kHz interrupt rate. That was on an STM32F103 running at 72 MHz (a Cortex-M3). I also tried 3 Mb/s, but apparently that was too much for the USB bus in the PC (a standard 12 Mb/s port).

Concerning interrupt overhead, for an STM32F030 running code from RAM the overhead seems to be between 26 and 28 clocks. More precisely, I had a very simple interrupt handler that just increments a variable (a millisecond counter). The "work" part of the handler should execute in 7 clocks. When I timed a busy loop, the interrupt increased the loop's execution time by 33-35 clocks. That agrees reasonably well with the cycle counts for Cortex-M0 published on the ARM forums: 16 clocks to enter the interrupt handler and 12 clocks to get back to the main program. The processor in the Pi Pico is a Cortex-M0+, which is supposed to take 15 clocks to enter the interrupt handler, so you can expect 1 clock less overhead than for a Cortex-M0.

Concerning useful procedures, there are a lot of things which can slow down the code. For example, a read-modify-write cycle on an I/O port is likely to insert some extra wait states. Most MCUs execute code from flash, and flash usually cannot run at max CPU speed, so there are extra wait states. For example, a Cortex-M4 running from one RAM bank with its stack in a separate RAM bank can do an interrupt like the one above in 27-28 clocks, so the overhead is probably 20-21 cycles (I write "probably" because the Cortex-M4 has complex rules for instruction timing, so I am not sure the handler itself takes 7 clocks). But a different configuration can bring the time up to 42-48 clocks. A Cortex-M3 (which should have timings very close to the Cortex-M4) running from flash with 0 wait states (8 MHz clock) needs 24 clocks to execute the interrupt handler, but with 2 wait states (needed to run at 72 MHz) it needs 29 to 31 clocks, and more when there are more wait states.

The RP2040 in the Pi Pico normally runs from RAM, so it should be free from slowdown due to flash. But with two cores and several DMA channels there may be bus contention. Still, interrupt rates on the order of 1M/s should not be a problem.

Reply to
antispam

A proportional+integral error amplifier is all that most power supplies need. That is easily Spiced and then easily turned into a few lines of code.

An integrator is of course IIR. A FIR filter has finite gain, hence some DC error.
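A PI error amplifier really is just a few lines of code. Here is a minimal fixed-point sketch; the Q16 scaling, gain names, and the simple integrator clamp are my assumptions, not anything from the thread:

```c
#include <stdint.h>

/* Fixed-point PI regulator state; gains are scaled by 2^16 (Q16).
   All names and scalings here are hypothetical. */
typedef struct {
    int32_t kp_q16;     /* proportional gain, Q16 */
    int32_t ki_q16;     /* integral gain per sample, Q16 */
    int64_t integ_q16;  /* integrator state, Q16 (IIR: unbounded DC gain) */
    int32_t out_min, out_max;
} pi_t;

/* One control step: call once per sample from the ADC/timer ISR. */
static int32_t pi_step(pi_t *pi, int32_t setpoint, int32_t measured)
{
    int32_t err = setpoint - measured;

    pi->integ_q16 += (int64_t)pi->ki_q16 * err;

    int64_t out = (((int64_t)pi->kp_q16 * err) + pi->integ_q16) >> 16;

    /* clamp the output and stop the integrator winding up past it */
    if (out > pi->out_max) { out = pi->out_max; pi->integ_q16 = (int64_t)pi->out_max << 16; }
    if (out < pi->out_min) { out = pi->out_min; pi->integ_q16 = (int64_t)pi->out_min << 16; }
    return (int32_t)out;
}
```

The integrator being IIR is visible directly: `integ_q16` accumulates without decay, so any persistent error keeps driving the output until it is gone.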

I'm not afraid of integrators!

Reply to
John Larkin

Sounds like roughly 200 ns of overhead, interrupt entry and exit, on the 133 MHz Pico. That's not bad for a 100 kHz interrupt.

I probably don't even need 100 kHz for a power supply control loop. A 1 ms step response would be fine.

It would be fun to do a DDS in software, for an AC supply.

Reply to
John Larkin

Sort of off the point, though. An AC supply has to deliver a sine-wave voltage while looking like a low-impedance source to the load.

You can use software to calculate what that voltage ought to be (which is what direct digital synthesis is all about), but the switching arrangements that connect a more or less stable DC voltage source to the load, and deliver the desired voltage and the currents required to sustain it, might require quite a lot of fast processing capacity to create the desired effect without wasting a lot of power in the process.

It wouldn't look much like a regular DDS source.

Reply to
Anthony William Sloman

You have to make a test framework to exercise the code in as realistic a manner as you can - that isn't quite the same as instrumenting the code (although it can be).

I have never found profile-directed compilers to be the least bit useful on my fast maths codes, because their automatic code instrumentation breaks the very code it is supposed to be testing (in the sense of wrecking cache lines, locality, etc.).

The only profiling method I have found to work reasonably well is, probably by chance, the highest-frequency periodic ISR I have ever used in anger: profiling code by accumulating a snapshot of PC addresses, allowing just a few machine instructions to execute at a time. It worked well back in the old days when 640k was the limit and code would reliably load into exactly the same locations on every run.

It is a great way to find the hotspots where most time is spent.
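The same statistical PC-sampling idea still works on a modern MCU. A sketch for a Cortex-M style machine, where the interrupted PC is the 7th word of the exception stack frame; `CODE_BASE` and the bucket granularity are hypothetical:

```c
#include <stdint.h>

/* PC-sampling profiler sketch: a fast periodic timer ISR reads the
   interrupted PC out of the exception stack frame and bumps a
   histogram bucket. Hotspots show up as tall buckets. */
#define CODE_BASE    0x08000000u  /* hypothetical start of code */
#define BUCKET_SHIFT 4            /* one bucket per 16 bytes of code */
#define NBUCKETS     4096u

static uint32_t hist[NBUCKETS];

/* 'frame' is the stack pointer at exception entry; frame[6] is the
   stacked PC on Cortex-M. */
void profile_sample(const uint32_t *frame)
{
    uint32_t pc = frame[6];
    uint32_t b  = (pc - CODE_BASE) >> BUCKET_SHIFT;
    if (b < NBUCKETS)             /* ignore PCs outside the code region */
        hist[b]++;
}
```

Because it only samples, it perturbs cache lines and locality far less than compiler-inserted instrumentation does.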

True enough. Sometimes you need a logic analyser for weird behaviour: we once caught a CPU chip where RTI didn't always do what it said on the tin, and the instruction following the RTI instruction got executed with a frequency of about 1:10^8. They replaced all the faulty CPUs FOC, but we had to sign a non-disclosure agreement.

It just means that you have to collect an array of data and take a look at it later and offline. Much like you would when testing that a library function does exactly what it is supposed to.

It is quite unusual to see bad behaviour from the multilevel caches, but it can add to the variance. You always get a few outliers here and there in user code when a higher-level disk or network interrupt steals cycles.

Instrumenting for timing tests is very much development rather than production code, i.e. is it fast enough or do we have to work harder?

Like you I prefer HLL code, but I will use ASM if I have to or there is no other way (like wanting 80-bit reals in the MS compiler). Actually I am working on a class library to allow somewhat clunky access to them.

They annoyingly zapped access to 80-bit reals, in v6 I think it was, for "compatibility" reasons, since SSE2 and later can only do 64-bit reals.

That will be their problem not mine ;-)

Reply to
Martin Brown

Generally good advice unless the purpose of the interrupt is to time share the available CPU and FPU between various competing numerical tasks. Cooperative multitasking has lower overheads if you can do it.

For my money, ISRs should do as little as possible at such a high privilege level, although checking whether their interrupt flag is already set again before returning is worthwhile for maximum burst transfer speed.
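That flag re-check looks like the sketch below. The UART "registers" are mocked as plain variables so the control flow can run off-target; on real hardware they would be the device's status and data registers:

```c
#include <stdint.h>

/* Hypothetical device mock: a few bytes "arrive" back to back. */
static uint8_t  rx_queue[] = {0x41, 0x42, 0x43};
static uint32_t rx_level   = 3;
#define UART_RXNE() (rx_level > 0)               /* receive-not-empty flag */
#define UART_READ() (rx_queue[3 - rx_level--])   /* pops the next byte */

static uint32_t handled;
static void handle(uint8_t b) { (void)b; handled++; }

/* One ISR entry drains every byte already waiting, instead of paying
   the interrupt entry/exit overhead once per byte. Entered only when
   the RXNE flag is set, so the do-while is safe. */
void uart_isr(void)
{
    do {
        handle(UART_READ());
    } while (UART_RXNE());
}
```

During a burst this turns N interrupts into one, at the cost of staying in the handler a little longer.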

Actually there were processors which took the exact opposite position quite early on, and they were incredibly good for realtime performance, but their registers were no different from RAM: they were *in* RAM, as was the program counter return address. There was a master workspace pointer and 16 registers in RAM; the TI TMS9900 series, for instance.

formatting link
I didn't properly appreciate at the time quite how good this trick was for realtime work until we tried to implement the same algorithms on the much later, and on paper faster, 68000 series of CPUs.

Reply to
Martin Brown

Am 16.01.23 um 11:23 schrieb Martin Brown:

At TU Berlin we had a place called the Zoo where there was at least one sample of each CPU family. We used the Zoo to port Andrew Tanenbaum's Experimental Machine to all of them under equal conditions. That was a p-code engine from the Amsterdam Free University Compiler Kit.

The 9900 was slowest, by a large margin: Z80 league. Having no cache AND no registers was a braindead idea.

Some friends built a hardware machine around the Fairchild Clipper. They found out that moving the hard disk driver by just a few bytes made the difference between speedy and slow as molasses. By the time the driver was ready, the data had already passed under the head and you had to wait for another disc revolution.

It turned out that Fairchild had simply lasered away some faulty cache lines and sold it. No warning given. It was entertaining to watch, not being in that project.

Gerhard

Reply to
Gerhard Hoffmann

That was not a benchmark; that was a given large p-code machine with the intent to use the same compilers everywhere. Not unlike UCSD-Pascal.

with a non-existent cache controller and cache RAMs that cost as much as the CPU. I got a feeling for the price of cache when I designed this: <

formatting link
>

8086 was NOT slow. Have you ever used an Olivetti M20 with a competently engineered memory system? That even challenged early ATs when protected mode was not needed.

That benchmark was Unix System V, as licensed from Bell. Find something better to do when you need to swap.

Gerhard

Reply to
Gerhard Hoffmann

PowerBasic has 80-bit reals as a native variable type.

As far as timing analysis goes, we always bring out a few port pins to test points, from uPs and FPGAs, so we can scope things. Raise a pin at ISR entry, drop it before the RTI, scope it.
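The raise-at-entry, drop-before-return trick in code form, with the port register mocked as a plain variable; on real hardware you would write the port's set/clear registers directly (single-cycle writes), and the pin assignment is hypothetical:

```c
#include <stdint.h>

/* Sketch: bracket the ISR body with a test-point pin so its execution
   time shows directly on a scope. gpio_set/gpio_clear stand in for the
   target's fastest single-write pin access. */
static volatile uint32_t gpio_out;      /* stand-in for the port register */
#define TP_PIN (1u << 5)                /* hypothetical test-point bit */
#define gpio_set(m)   (gpio_out |= (m))
#define gpio_clear(m) (gpio_out &= ~(m))

static volatile uint32_t ms_ticks;

void timer_isr(void)
{
    gpio_set(TP_PIN);    /* rising edge = ISR entry */
    ms_ticks++;          /* the actual work */
    gpio_clear(TP_PIN);  /* falling edge = about to return */
}
```

The scope then shows both the ISR duration (pulse width) and the interrupt rate and jitter (pulse spacing) with no software overhead beyond two port writes.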

We wrote one Linux program that just toggled a test point as fast as it could. That was interesting on a scope, namely the parts that didn't toggle.

Reply to
John Larkin

Apply all of your engineering creativity.

Reply to
John Larkin

On a sunny day (Mon, 16 Jan 2023 06:29:25 -0800) it happened John Larkin snipped-for-privacy@highlandSNIPMEtechnology.com wrote in snipped-for-privacy@4ax.com:

It all depends. Using a Raspberry Pi as an FM transmitter (80 to 100 MHz or so):

formatting link
That code gave me the following idea, freq_pi:
formatting link
and that was for a very old Pi model; somebody then ported it to a later model. No idea how fast you can go on a Pi 4.

Reply to
panteltje

In a system with multiple ISRs, spending too long in a single ISR is a bad idea. Better to just read inputs in the ISR and postpone the time-consuming processing to a lower-priority "pseudo ISR" (SW ISR).

Many processors have software interrupts (SWI), traps, or whatever each manufacturer calls them.

In such an environment the HW ISR, close to its end, issues the TRAP instruction (SWI request), which is not activated as long as the HW ISR is still executing. When the HW ISR exits, interrupts are enabled.

If other HW interrupts are pending, they are executed first. When no more HW interrupts are pending, the SW ISR can start executing. This SW ISR can be quite time-consuming; a new hardware interrupt request may interrupt it.

When the SW ISR finally exits, the originally interrupted program is resumed.
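On Cortex-M this pattern maps naturally onto PendSV. A sketch of the split, with the pend request mocked as a variable so the logic runs on a host (on target it would be the CMSIS write `SCB->ICSR = SCB_ICSR_PENDSVSET_Msk;`); the FIFO layout is my assumption:

```c
#include <stdint.h>

#define FIFO_LEN 64u                      /* power of two: indices wrap cleanly */
static volatile uint8_t  fifo[FIFO_LEN];
static volatile uint32_t head, tail;

/* On a real Cortex-M:  SCB->ICSR = SCB_ICSR_PENDSVSET_Msk;
   Mocked here so the control flow can be tested off-target. */
static volatile int pendsv_pending;
#define pend_sw_isr() (pendsv_pending = 1)

/* Fast HW ISR: just capture the byte and request deferred processing.
   (Simplification: no overflow check.) */
void uart_rx_isr(uint8_t byte)
{
    fifo[head++ % FIFO_LEN] = byte;
    pend_sw_isr();
}

/* Low-priority "SW ISR": runs only when no HW interrupt is pending,
   and may itself be interrupted by new HW interrupts. Returns the
   number of bytes drained (for testing; the real handler is void). */
uint32_t pendsv_handler(void)
{
    uint32_t n = 0;
    while (tail != head) {
        (void)fifo[tail++ % FIFO_LEN];    /* process the byte here */
        n++;
    }
    pendsv_pending = 0;
    return n;
}
```

Giving PendSV the lowest priority reproduces exactly the ordering described above: all pending HW interrupts run first, then the deferred work.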

Reply to
upsidedown

That was not created as a benchmark. The goal was to have the same compiler and operating system on most of the upcoming microsystems available. Not too unexpected for an operating-systems department at a university. And when there were underperformers, that would not go unnoticed.

We had some ICL Perqs; not my cup of meat. I had a small Prolog system on my Z80, funny but nothing for real work.

And yes, I was interested in how fast my machines ran. In the VLSI course, I talked a group of other students into doing a stack machine much like Tanenbaum's, only simpler, in HP's dynamic NMOS process. Unluckily, we caught a metal flake from a neighboring project that the DRC did not catch.

So is Intel.

Ah, I had both of them, in the same 19" box.

Pinball machines with a Fairchild Clipper? Do you have an idea what a Clipper module cost? The machine was intended as a multi-user Unix machine. I later got a paid project to build a VME-bus terminal concentrator based on the 80186 for it.

Why should I care about medical devices, video games, pinball or navigation systems? GPS was an experiment at that time, and the 50 baud navigation strings were no problem for sure.

The product WAS running Unix. I wrote that they had bought a commercial source license from Bell.

Cheers, Gerhard

Reply to
Gerhard Hoffmann

mandag den 16. januar 2023 kl. 18.39.13 UTC+1 skrev snipped-for-privacy@downunder.com:

I've sometimes done that by setting the pending bit on an otherwise unused interrupt set to a low priority. Cortex-M does tail chaining, so an RTI from an interrupt while another is pending is effectively just a jump and a change in priority.

Another (and tricky to get right) way is to add a stack frame with the new code's address and do an RTI, a la a task switch in an OS.

Reply to
Lasse Langwadt Christensen

Lasse also mentioned it, I see, and it makes sense; I did not realize this was a "small" flavour of ARM. I am not familiar with any ARM.

On power, you get the PC and the MSR saved in special-purpose registers assigned to do exactly that. You have to stack them yourself, even that. You don't even have a stack pointer; it is up to you to assign one of the GPRs to that. The 603e core maximizes delegation to a huge extent, and this is very convenient when you know what you are doing. Even when you get a TLB miss, all you get is an exception plus r0-r3 being switched to a shadow bank, Z80 style, meant for you to locate the entry in the page translation table and put it in the TLB; if it is not in the table you take a page fault, go into allocation, swapping if needed, fixing, etc., you know how it goes. You don't need to stack anything; the 4 registers you have are enough to do the TLB fix.

Getting a page fault in an ISR is hugely problematic; if this is possible it compromises the entire design (so much for interrupt latency). In dps for power there is a "block translated area" (no page translation for it; it is always mapped) where interrupt handling code is placed. And there are 3 stack pointers in dps for 32-bit power: user, supervisor and interrupt. The interrupt stack pointer is always translated (also in BAT memory), and any exception first stacks a few registers on that interrupt stack; then it can switch to, say, the supervisor stack pointer and go on to preempt a task, do a system call, etc.

I talked about this above, must have anticipated the question :).

The power core I mostly use can lock parts of the cache, IIRC in 1/4 (i.e. 4k) increments. I have never used that though.

I was somewhat surprised that ARM has the ability to truly prioritize interrupts, 68k style. Both you and Lasse said that; this is an important thing to have.

Reply to
Dimiter_Popoff

Now there are basically two types: the cortex-Mx, which is a 32-bit microcontroller where, with increasing x, features are added, like DSP instructions and single- and double-precision FPU,

and the cortex-A, which is a 32/64-bit CPU used in cell phones etc.

What you describe looks similar to the older-generation ARM, the ARM7-TDMI:

the stack pointer was also just a GP register defined as the stack pointer.

It had one IRQ, which shadowed only the return address and status register, and an FIQ (fast) that shadowed the return address, the status register, and (afair) seven general-purpose registers.

Quite a bit of code was needed to find the interrupt source, do the stacking, etc., and even worse if preemption was needed.

Reply to
Lasse Langwadt Christensen


Finding the source does not take much on SoCs with an interrupt priority encoder; they all have one nowadays (off-core, it only prioritizes which vector will be supplied to the core, which has just one IRQ line). Preemption from a hardware interrupt is a complex thing to do under any scheme, unless the interrupt is guaranteed to be received only while running at user level. But true prioritized interrupts, 68k style, are a huge step forward ARM has made; this does make a difference.

Reply to
Dimiter_Popoff

Yes, the cortex-M basically has only one IRQ like the old ARM, but a module called the Nested Vectored Interrupt Controller was added to the cortex-M that does all the decoding, vectoring, priority, stacking and unstacking through a "back door".

Reply to
Lasse Langwadt Christensen

Wait a second, this is not like the 68k. If there is just one IRQ line, the external controller cannot interrupt the core while it is dealing with an interrupt and the line is still masked. Or is there something more sophisticated in that?

Reply to
Dimiter_Popoff
