NOTE: The following comments are from one who is not a programmer, much less a processor architect. Where there is disagreement with the work of experienced architects, it should be assumed that the author has an inadequate understanding of the tradeoffs involved or of the architecture in question, or is evaluating the architecture based on different criteria.
While I perceive the MIPS MT-ASE as clean and professionally architected, I disagree with some of the decisions made. I am disappointed that there was no adoption of reduced contexts beyond simply not including or sharing coprocessor state. While the MIPS ISA makes this slightly awkward (since the multiplier has state), it would allow for more contexts for a given area. This would be especially powerful with the identification of Shadow Register Sets with Thread Contexts. While I could see user applications usefully spawning threadlets, it seems especially useful for an interrupt context to be smaller than 31 registers (i.e., for there to be nearly twice or four times as many contexts). (MIPS MT-ASE defines a Thread Context as containing 38 to 42 registers [31 GPRs, 2-3 multiplier registers, 5-7 MT-ASE-specific CP0 registers, 0-1 CP0 register; a TC can also contain additional registers such as those for Floating Point]; a Virtual Processing Element has all previously defined CP0 registers [minimum 14?]; and a Processor has 2-3 MT-ASE-specific CP0 registers. A Shadow Register Set contains 31 registers.)
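As a rough illustration of the "nearly twice or four times" figure, the arithmetic can be sketched as below. The reduced context sizes of 16 and 8 registers are assumptions chosen to match that ratio, not sizes defined by MIPS MT-ASE, and the function names are hypothetical. Only GPR storage is counted; per-context CP0 and multiplier state is ignored.

```python
# Rough arithmetic only. FULL_CONTEXT_GPRS comes from the text (31 GPRs in a
# Thread Context or Shadow Register Set); the reduced sizes are assumptions.
FULL_CONTEXT_GPRS = 31


def context_ratio(reduced_gprs, full_gprs=FULL_CONTEXT_GPRS):
    """How many times more contexts fit if each holds `reduced_gprs` GPRs."""
    if not 0 < reduced_gprs <= full_gprs:
        raise ValueError("reduced context must hold 1..full_gprs registers")
    return full_gprs / reduced_gprs


def whole_contexts(storage_gprs, gprs_per_context):
    """Whole contexts that fit in a register store of `storage_gprs` entries."""
    if gprs_per_context <= 0:
        raise ValueError("a context needs at least one register")
    return storage_gprs // gprs_per_context
```

Under these assumptions, context_ratio(16) gives 1.9375 and context_ratio(8) gives 3.875, and storage sized for four full contexts (124 registers) would hold 15 eight-register contexts.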
I also disagree with the choice to have only a RISC-y 'fork' instruction. (MIPS MT-ASE 'fork' loads the value in one GPR to the spawned thread's PC [actually TCRestart], the value of another GPR to the target register in the spawned thread, and updates the Status register [or generates a Thread Exception of the Thread Overflow type {or, of course, a Reserved Instruction Exception if the instruction is unimplemented}].) While this provides the greatest flexibility within the constraints of RISC, it prohibits an application from taking advantage of thread locality (other than through a shared cache) and forces more heavyweight thread spawning to pass most values through the memory system. Using a RISC-y instruction does allow implementations to be simpler, but it effectively prohibits the exploitation of the fact that localized bandwidth is easier to implement and that register copies can be virtualized. It might be desirable, e.g., to support forking directly to a generic procedure.
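The fork behavior described above can be sketched as a toy behavioral model. This is only an illustration of the description in this post, not of the actual specification: the class names, the operand names (target_reg, pc_reg, value_reg), and the use of a single `active` flag in place of the real status update are all assumptions.

```python
class ThreadOverflow(Exception):
    """Stands in for a Thread Exception of the Thread Overflow type."""


class ThreadContext:
    """Toy Thread Context: GPRs plus a restart address (models TCRestart)."""

    def __init__(self):
        self.gpr = [0] * 32   # gpr[0] models the hardwired zero register
        self.restart = 0
        self.active = False   # simplified stand-in for the status update


def fork(contexts, parent, target_reg, pc_reg, value_reg):
    """Spawn a thread on the first free TC, passing exactly one register value."""
    for tc in contexts:
        if not tc.active:
            tc.restart = parent.gpr[pc_reg]
            if target_reg != 0:          # writes to register 0 are discarded
                tc.gpr[target_reg] = parent.gpr[value_reg]
            tc.active = True
            return tc
    raise ThreadOverflow("no free Thread Context")
```

In this model the child receives one start address and one value; every other register in the child is untouched, so anything else the parent wants to hand over has to be reached through memory (typically via a pointer passed as the single value).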
Of course, combined with the earlier complaint, it also effectively prohibits thread splitting (i.e., one context generating two contexts, each with a little more than half the content of the originating context, for which most of the data movement can be virtualized even in a fairly simple in-order implementation).
I would also be highly tempted to provide a PC-related (like MIPS jump-and-link) fork to slightly reduce the software overhead of some simple threadlet spawning. ("MIPS MT Principles of Operation" assumes that only one type of fork could be implemented and concludes that a PC-relative or PC-related fork would force an excessive burden in the case of distant targets ["forcing them to perform a double control transfer"].)
It might also be desirable to make hardware forking more flexible. (MIPS MT-ASE only allows forking within a VPE. "MIPS MT Principles of Operation" states that the objective is to minimize the "payload" of a FORK operation, in part to minimize the hardware implementation cost, but also in anticipation of multicore "remote" FORKs, where the information would need to be transmitted between processing elements. This implies that either the fork instruction will be extended to allow forking to a different VPE [it seems improbable that multicore VPEs would be defined] or a new fork instruction will be added to provide such functionality.)
There is some attractiveness to allowing the fork operation to be of variable 'aggressiveness'. E.g., a fork might fail if no TC is available within the specified resource group (e.g., within the threads sharing L2 cache) and within the TCs that have a certain degree of execution resources available (failing, e.g., when the only free TCs have high run costs [as would be the case if the only free TC was allocated to a second level of context storage requiring the moving of a TC in the first level to the second level] or when an independent execution pipeline is not associated with the TC). This would seem to require at least one additional register argument AND a return value (to indicate failure and perhaps the reason for the failure), both of which are non-RISC-y. It seems desirable to provide a mechanism in the hardware interface to communicate failure of a fork (so that the original thread can, e.g., choose a more expensive forking such as a system call or even just follow a different code path on failure). (Alternatively, there could be a lightweight mechanism to test if the appropriate resources are available. This would allow the remote possibility of performing a fork that is more expensive [or less beneficial] than desired if another thread grabbed the only remaining TC that meets the resource criteria. However, I think checking for failure would be sufficiently common to justify a return value [i.e., a fast mechanism to check for failure].)
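A minimal sketch of what such a fork with a return value might look like from the software side follows. Everything here is hypothetical: the status codes, the FreeTC record, the notion of a numeric run cost, and the try_fork and spawn names are invented for illustration and are not part of MIPS MT-ASE.

```python
from dataclasses import dataclass
from enum import Enum


class ForkStatus(Enum):
    OK = 0
    NO_FREE_TC = 1      # nothing free anywhere
    NONE_IN_GROUP = 2   # free TCs exist, but not in the requested group
    TOO_COSTLY = 3      # free TCs in the group exceed the run-cost limit


@dataclass
class FreeTC:
    group: int      # hypothetical resource group (e.g., shared L2 cache)
    run_cost: int   # hypothetical cost of running on this TC


def try_fork(free_tcs, group, max_run_cost):
    """Return (status, tc); on success the TC is removed from the free list."""
    if not free_tcs:
        return ForkStatus.NO_FREE_TC, None
    in_group = [tc for tc in free_tcs if tc.group == group]
    if not in_group:
        return ForkStatus.NONE_IN_GROUP, None
    best = min(in_group, key=lambda tc: tc.run_cost)
    if best.run_cost > max_run_cost:
        return ForkStatus.TOO_COSTLY, None
    free_tcs.remove(best)
    return ForkStatus.OK, best


def spawn(free_tcs, group, max_run_cost):
    """Caller-side pattern: use the hardware fork or fall back on failure."""
    status, _tc = try_fork(free_tcs, group, max_run_cost)
    if status is ForkStatus.OK:
        return "hardware fork"
    return "fallback (" + status.name + ")"
```

The extra `group` and `max_run_cost` arguments correspond to the additional register argument, and the returned status to the return value, that make such an instruction non-RISC-y. Because allocation and the check happen in one step, the race described for a separate test-then-fork mechanism does not arise in this sketch.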
It also seems appropriate that a TC could be allocated to an 'open pool' such that when activated the TC could be associated with a specific VPE.
It might also be appropriate to provide configuration control (per VPE) to allow hardware forking to other VPEs (presumably with per-VPE configuration to indicate that the VPE does not accept externally spawned threads); this would make hardware forking more flexible.
(I also think the limit of 8 VPEs is inappropriate, but that can be easily fixed because there are 13 adjacent reserved bits in the TCBind register.)
The MIPS MT-ASE also leaves much implementation-specific. While this provides maximum implementation flexibility, it might be desirable to define one or more processor families with more of the implementation-specific features defined.
(It might be desirable for there to be a yield instruction variant that defines the restart condition via a single condition value rather than using a bit vector [so more than 31 {63 in 64-bit versions???} conditions could be defined without requiring an extra source register {the case of waiting on a single condition would seem likely to be fairly common}]. It is not clear to me whether each VPE defines the set of 31 conditions. [I have the impression that the 31 conditions are defined by the Virtual MultiProcessor {"a collection of interconnected VPEs"}, but I could be mistaken.] If each VPE defines its conditions, the 31-condition limit might not be particularly constraining.)
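The difference between the two encodings can be sketched as follows. This is a toy comparison under stated assumptions: the function names are hypothetical, `pending` is an arbitrary-width integer standing in for the set of currently signalled conditions, and the 31-bit mask width is taken from the figure quoted above.

```python
MASK_WIDTH = 31  # conditions nameable by a bit vector, per the text


def mask_for(index, width=MASK_WIDTH):
    """Bit-vector encoding of a single condition; fails past the mask width."""
    if not 0 <= index < width:
        raise ValueError("condition %d does not fit in a %d-bit mask" % (index, width))
    return 1 << index


def wake_on_mask(mask, pending):
    """Bit-vector form: resume when any requested condition is pending."""
    return (mask & pending) != 0


def wake_on_index(index, pending):
    """Single-value form: resume when condition number `index` is pending."""
    if index < 0:
        raise ValueError("condition numbers are non-negative")
    return (pending >> index) & 1 == 1
```

The bit-vector form can wait on several conditions at once but runs out of names at the mask width, while the single-value form waits on only one condition but can name any condition number that fits in the source register.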
(MIPS' branch delay slots also cause some complications for gating storage accesses.)
MIPS MT-ASE gating storage (and particularly InterThread Communications Storage) is interesting, but I do not think I understand it well enough to critique it.
Paul A. Clayton just a technophile