ISA Support for Multithreading

- E
- Eric
  
  Contact options for registered users
posted
17 years ago

Tue, Feb 20, 2007 12:09 PM

Hi,

I'm looking for the minimum set of instructions that a processor must implement to support some kind of simultaneous multithreading. I found the MIPS MT extension. What do you think of the instructions/ mechanisms that it offers? What would you change? Do you know other ISAs implementing other usefull features related to SMP?

Many thanks for your help.

Eric

- N
- Nick Maclaren
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Feb 20, 2007 12:15 PM

In article , "Eric" writes: |> |> I'm looking for the minimum set of instructions that a processor must |> implement to support some kind of simultaneous multithreading. I found |> the MIPS MT extension. What do you think of the instructions/ |> mechanisms that it offers? What would you change? Do you know other |> ISAs implementing other usefull features related to SMP?

Well, the minimum number is generally zero. Your other questions relate to what may be a good idea to provide. If you want a better answer, you need to be clearer about your objective.

Regards, Nick Maclaren.

- S
- slebetman
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Wed, Feb 21, 2007 2:07 AM

For timeslice and/or preemptive multithreading/multitasking ONLY, without other modern OS features like virtual memory etc, you basically need to be able to arbitrarily save and restore the call stack. Without it the only kind of multitasking you can implement is cooperative multitasking.

Oh and obviously you need to be able to execute code outside the context of the currently running code/application -- interrupt timer, supervisory mode, ring 0, doesn't matter how you just need a mechanism. Otherwise again the only kind of multitasking you can implement is cooperative multitasking.

- N
- Nick Maclaren
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Wed, Feb 21, 2007 9:47 AM

In article , " snipped-for-privacy@yahoo.com" writes: |> |> For timeslice and/or preemptive multithreading/multitasking ONLY, |> without other modern OS features like virtual memory etc, you |> basically need to be able to arbitrarily save and restore the call |> stack. Without it the only kind of multitasking you can implement is |> cooperative multitasking.

Otherwise commonly known as coroutines. Yes. But the term "call stack" is unhelpful, as that is generally in memory; it is actually the local environment (registers, state etc.) that needs saving.

|> Oh and obviously you need to be able to execute code outside the |> context of the currently running code/application -- interrupt timer, |> supervisory mode, ring 0, doesn't matter how you just need a |> mechanism. Otherwise again the only kind of multitasking you can |> implement is cooperative multitasking.

Oh, no, you don't. That is almost true if you have only a single CPU, but definitely isn't if you have more than one. Why shouldn't one CPU run the 'kernel' and the other run ordinary processes?

Regards, Nick Maclaren.

- P
- Paul A. Clayton
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Wed, Feb 21, 2007 9:50 AM

NOTE: The following comments are from one who is not a programmer much less a processor architect. Where there is disagreement with the work of experienced architects, it should be assumed that the author has an inadequate understanding of the tradeoffs involved or of the architecture in question or is evaluating the architecture based on different criteria.

While I perceive the MIPS MT-ASE as clean and professionally architected, I disagree with some of the decisions made. I am disappointed that there was no adoption of reduced contexts beyond simply not including or sharing coprocessor state. While the MIPS ISA makes this slightly awkward (since the multiplier has state), it would allow for more contexts for a given area. This would be especially powerful with the identification of Shadow Register Sets with Thread Contexts. While I could see user applications usefully spawning threadlets, it seems especially useful for an interrupt context to be smaller than 31 registers (i.e., for there to be nearly twice or four times as many contexts). (MIPS MT-ASE defines a Thread Context as containing 38 to 42 registers [31 GPRs, 2-3 multiplier registers,

5-7 MT-ASE-specific CP0 registers, 0-1 CP0 register; a TC can also contain additional registers such as those for Floating Point], a Virtual Processing Element has all previously defined CP0 registers (minimum 14?), and a Processor has 2-3 MT-ASE-specific CP0. A Shadow Register Set contains 31 registers.)

I also disagree with the choice to have only a RISC-y 'fork' instruction (MIPS MT-ASE 'fork' loads the value in one GPR to the spawned thread's PC [actually TCRestart], the value of another GPR to the target register in the spawned thread, and updates the Status register [or generates a Thread Exception of the Thread Overflow type {or, of course, a Reserved Instruction Exception if the instruction is unimplemented}].) While this provides the greatest flexibility within the constraints of RISC, it prohibits an application from taking advantage of the thread locality (other than through a shared cache) and forces more heavyweight thread spawning to pass most values through the memory system. Using a RISC-y instruction does allow implementations to be simpler, but it effectively prohibits the exploitation of the fact that localized bandwidth is easier to implement and that register copies can be virtualized. It might be desirable, e.g., to support forking directly to a generic procedure.

Of course, with the earlier complaint, it also effectively prohibits thread splitting (i.e., one context generating two contexts each with a little more than half the content as the originating context, for which most of the data movement can be virtualized even in a fairly simple in-order implementation).

I would also be highly tempted to provide a PC-related (like MIPS jump-and-link) fork to slightly reduce the software overhead of some simple threadlet spawning. ("MIPS MT Principles of Operation" assumes that only one type of fork could be implemented and concludes that a PC-relative or PC-related fork would force an excessive burden in the case of distant targets ["forcing them to perform a double control transfer"].)

It might also be desirable to make hardware forking more flexible (MIPS MT-ASE only allows forking within a VPE. "MIPS MT Principles of Operation": the objective is to minimize the "payload" of a FORK operation, in part to minimize the hardware implementation cost, but also in anticipation of multicore "remote" FORKs, where the information would need to be transmitted between processing elements. This implies that either the fork instruction will be extended to allow forking to a different VPE [it seems improbable that multicore VPEs would be defined] or a new fork instruction will be provided to provide such functionality.)

There is some attractiveness to allowing the fork operation to be of variable 'aggressiveness'. E.g., a fork might fail if no TC is available within the specified resource group (e.g., within the threads sharing L2 cache) and within the TCs that have a certain degree of execution resources available (failing, e.g., when the only free TCs have high run costs [as would be the case if the only free TC was allocated to a second level of context storage requiring the moving of a TC in the first level to the second level] or when an independent execution pipeline is not associated with the TC). This would seem to require at least one additional register argument AND a return value (to indicate failure and perhaps the reason for the failure); both conditions which are non-RISC-y. It seems desirable to provide a mechanism in the hardware interface to communicate failure of a fork (so that the original thread can, e.g., choose a more expensive forking such as a system call or even just follow a different code path on failure). (Alternately, there could be a lightweight mechanism to test if the appropriate resources are available. This would allow the remote possibility of performing a more expensive [or less beneficial] than desired fork if another thread grabbed the only remaining TC that meets the resource criteria. However, I think checking for failure would be sufficiently common to justify a return value [i.e., a fast mechanism to check for failure].)

It also seems appropriate that a TC could be allocated to a 'open pool' such that when activated the TC could be associated with a specific VPE.

It might also be appropriate to provide configuration control (per VPE) to allow hardware-forking to other VPEs (presumably with per VPE configuration that the VPE does not accept outside spawned threads); this would make hardware forking more flexible.

(I also think the limit of 8 VPEs is inappropriate, but that can be easily fixed because there are 13 adjacent reserved bits in the TCBind register.)

The MIPS MT-ASE also leaves much as implementation specific. While this provides maximum implementation flexibility, it might be desirable to define one or more processor families with more of the implementation-specific features defined.

(It might be desirable for there to be a yield instruction variant that defines the restart condition via a single condition value rather than using a bit vector [so more than 31 {63 in

64-bit versions???} conditions could be defined without requiring an extra source register {the case of waiting on a single condition would seem likely to be fairly common}]. It is not clear to me whether each VPE defines the set of 31 conditions. [I receive the impression that the 31 conditions are defined by the Virtual MultiProcessor {"a collection of interconnected VPEs"}, but I could be mistaken.] If each VPE defines its conditions, the 31-condition limit might not be particularly constraining.)

(MIPS' branch delay slots also cause some complications for gating storage accesses.)

MIPS MT-ASE gating storage (and particularly InterThread Communications Storage) is interesting, but I do not think I have an adequate understanding to critique such.

Paul A. Clayton just a technophile

- B
- Bill Todd
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Wed, Feb 21, 2007 8:52 PM

Possibly just one instruction that says, "Give me your current internal state for thread N and replace it with this one." As long as the OS was told the number of thread slots that the hardware supported and processor-generated interrupts included the applicable thread ID (as an output) I suspect (having thought about it for less than a minute) that the OS could potentially handle everything else.

Of course, this might not support anything remotely approaching

*optimal* SMT, but that wasn't what you asked.

- bill

- A
- Anne & Lynn Wheeler
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Wed, Feb 21, 2007 9:16 PM

in the mid-70s i designed some dispatching microcode for multiprocessor project (that was eventually canceled and never shipped a product).

it basically moved part of the virtual machine dispatcher and interrupt handler into the microcode. the status for the virtual machine was already being kept ... so it required two queues ... one of available stuff to execute ...and one of stuff that had finished execution (and needed something from the kernel for one reason or another). how many different real processors and therefor concurrently executing virtual machines was somewhat transparent to the kernel code. this is similar in concept to what was later done by 432i.

misc. past posts mentioning the project

formatting link

later there was the "SIE" instruction ... which wasn't particularly queue or multiprocessor oriented ... just packaged all the the stuff needed to start/stop process operation into a single instruction.

SIE was somewhat part of migration of more and more support for virtual machine operation into hardware of the machine. Major part of that was virtual machine subset that appeared as "LPARS" (logical partitions) ... where a subset of virtual machine operation was provided directly by the hardware ... w/o requiring virtual machine operating system. for a little drift ... post with old email discussing some philosiphy differences between SIE implementation on the 3081 and 3090

formatting link

virtual memory

- N
- Nick Maclaren
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Wed, Feb 21, 2007 9:43 PM

If the architecture keeps all state in memory, you don't need even that! The ICL 1900 mapped its registers to memory, so that can work even for register designs.

Regards, Nick Maclaren.

- P
- Paul Keinanen
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Wed, Feb 21, 2007 9:47 PM

At least in processors with only a few registers, the interrupt usually saves the registers into the current stack. If the interrupt causes a task switch, you just have to save the stack pointer into the current task data area. Then load the stack pointer for the task to be activated (which was suspended by a previous hardware or software interrupt with registers on the stack) and issue a return from interrupt instruction, which will load the saved registers from the stack into the active registers.

In a small system with 3-5 tasks, scanning the task list in the ISR for the highest priority runnable task does not increase the interrupt latency very much.

Paul

- C
- CBFalconer
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Thu, Feb 22, 2007 1:24 AM

A stack machine normally has about 4 registers not visible to the programmer - PC, SP, localbase, globalbase. Everything else is already in memory. With memory management globalbase can be taken as 0 and replaced by a pointer to the appropriate page tables. So scheduling is a matter of changing 4 hidden registers and returning.

--
 
 
 
 "A man who is right every time is not likely to do very much."
                           -- Francis Crick, co-discover of DNA
 "There is nothing more amazing than stupidity in action."
                                             -- Thomas Matthews

- B
- Bill Todd
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Thu, Feb 22, 2007 1:33 AM

With no special SMT instructions, how do you tell the processor that you want to replace a specific thread that it is currently executing (e.g., one that is about to wait for some lock) with another? Or, for that matter, give it multiple threads to begin with?

- bill

- R
- robertwessel2
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Thu, Feb 22, 2007 1:39 AM

That's a particularly bad idea if you expect there to be any isolation between user mode tasks and the OS. A user mode stack is always unsafe to use in an ISR.

That obviously does not apply to small embedded system where you know that each task's stack always has enough room to service an interrupt, but any other system must switch to a safe stack when taking an interrupt.

- G
- Grant Edwards
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Thu, Feb 22, 2007 2:19 AM

You save the processor context in RAM (on the current thread's stack is common), and load the processor context from the thread you want to execute. Then you execute it.

You don't "give a processor multiple threads". You write a scheduler that knows about threads and knows how to switch between them.

--
Grant Edwards                   grante             Yow!  Yow! Are we wet yet?
                                  at               
                               visi.com

- B
- Bill Todd
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Thu, Feb 22, 2007 6:07 AM

Are you sure that you understand what SMT is? It certainly appears that one of us doesn't.

- bill

- N
- Nick Maclaren
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Thu, Feb 22, 2007 10:00 AM

In article , Bill Todd writes: |> |> With no special SMT instructions, how do you tell the processor that you |> want to replace a specific thread that it is currently executing (e.g., |> one that is about to wait for some lock) with another? Or, for that |> matter, give it multiple threads to begin with?

Special memory locations? Ordinary control instructions with a different set of operands? Control register contents? Anything else in common use I have forgotten?

There's more than one way to kill a cat.

Regards, Nick Maclaren.

- A
- Andrew Reilly
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Thu, Feb 22, 2007 1:05 PM

What makes you think that an SMT processor needs to be in any way distinguishable (including instruction set) from a pair (or whatever ordinality) of processors that share memory? You don't need special instructions to schedule processes or threads, other than the usual sort that have been available in almost every processor since the 68010 (and many before that, of course).

Cheers,

--
Andrew

- G
- Gil Hamilton
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Thu, Feb 22, 2007 2:57 PM

In particular, Bill appears to be referring to something like the TCs (thread contexts) supported by the MIPS MT ASE (MultiThreading Application Specific Extension). This is a step beyond mere multiple hardware threads, each of which appears to be a separate processor (such as Intel's HyperThreading or what MIPS calls VPEs [Virtual Processing Elements]). If interested, download the programmer's reference from this page (registration required):

formatting link

itecture.php

I can't claim to have a complete understanding, but it appears that you can actually allocate and schedule TCs from user space. That is, you can create additional threads of execution and switch amongst them without making a system call (though OS support is required to initially enable and configure the TC mechanism and, presumably, to manage them appropriately when exceptions or interrupts occur). There is also apparently some limited support for inter-thread communication: essentially, block until some data value is written / condition satisfied by another thread.

GH

- T
- Tim McCaffrey
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Thu, Feb 22, 2007 4:08 PM

In the late 70's a company call Systems Research (later bought by Burroughs), created custom microcode for the HP21MX mini computer that did something similar. Every interrupt, the microcode was invoked, the current task was put at the tail of the ready-to-run Queue (for its priority level), then the device was serviced (if possible) (timers & serial devices were handled directly for most interrupts), and/or a ISR was called. Afterwards, the microcode picked the highest priority ready-to-run task and started it.

SRI wrote alot of microcode....

- Tim

- T
- Tim McCaffrey
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Thu, Feb 22, 2007 4:11 PM

The perpherial processors on the CDC Cyber series had 10 18 bit accumulators and 10 program counters. The processor switched between them every cycle. I suppose you could call that multithreading support, of course your limited to

10 threads (both max and min). :)

- Tim

- W
- Walter Banks
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Thu, Feb 22, 2007 4:55 PM

started it.

We implemented threads support in the RS08 and S08 and are extending this support to all the other micros Byte Craft supports ( ad off technology begins ).

The dispatcher groups the threads into three categories INTERRUPT,SOFTWARE and RESET

RESET are one shot threads that only ever execute once . INTERRUPT are threads that require an outside event as part of the execution conditioning.

SOFTWARE are threads that whose conditioning is entirely system status variables and flags. When all the open software threads have been serviced the processor can sleep.

Priority is linear within each group. On small processors each thread is run to completion. (It may spawn other threads)

The biggest advantage we have seen is lower RAM requirements and execution cycles. Overall ROM usage is similar or marginally less than other implementations.

Our surprise was how effective it is in coding small state machine tasks often found in single purpose micros for example a quadrature decoder

Walter Banks

--

1 519 888 6911 Byte Craft Limited A2-490 Dutton Drive, Waterloo, Ontario N2L 6H7 Canada

formatting link

email snipped-for-privacy@bytecraft.com