I recall a discussion about the design of an instruction set architecture where someone was saying an instruction was required to test and set a bit or word as an atomic operation if it was desired to support multiple processors. Is this really true? Is this a function that can't be emulated with other operations including the disabling of interrupts?
AFAIK as long as you surround your "test and set" with an interrupt disable and an interrupt enable then you're OK. At least, you're OK unless you have a processor that treats interrupts really strangely.
Dekker's algorithm (circa 1962) can be used to implement a mutex without test-and-set, or even masking interrupts, and really only requires sequential consistency (and if that's an issue for your hardware, it should have appropriate memory barriers to force that).
A limitation is that Dekker's only works with two threads/processors.
Several related algorithms exist, and generalizations of some of those (for example, Peterson's) can handle more than two threads.
And once you have a mutex, you can implement everything else.
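For concreteness, here is a minimal sketch of a Peterson-style two-thread mutex in C11 (function names and the demo driver are mine; it leans on seq_cst atomics to supply the sequential consistency mentioned above):

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_int interested[2];  /* interested[i]: thread i wants the lock */
static atomic_int turn;           /* which thread has priority to wait */

void peterson_lock(int self) {
    int other = 1 - self;
    atomic_store(&interested[self], 1);
    atomic_store(&turn, other);   /* give way to the other thread */
    /* spin while the other is interested and has priority */
    while (atomic_load(&interested[other]) && atomic_load(&turn) == other)
        ;
}

void peterson_unlock(int self) {
    atomic_store(&interested[self], 0);
}

static long counter;  /* shared, protected only by the Peterson lock */

static void *worker(void *arg) {
    int self = (int)(long)arg;
    for (int i = 0; i < 100000; i++) {
        peterson_lock(self);
        counter++;                /* non-atomic increment, guarded by the lock */
        peterson_unlock(self);
    }
    return 0;
}

long run_demo(void) {
    pthread_t t0, t1;
    counter = 0;
    pthread_create(&t0, 0, worker, (void *)0L);
    pthread_create(&t1, 0, worker, (void *)1L);
    pthread_join(t0, 0);
    pthread_join(t1, 0);
    return counter;               /* 200000 if mutual exclusion held */
}
```

Note there is no atomic read-modify-write anywhere: only plain loads and stores. Whichever thread writes turn last is the one that waits.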
OTOH, proper atomic instructions are a huge boon (even if only in LL/SC form), so they're provided by everyone who's serious about multiprocessor support.
Interesting. I'm not sure it's limited to two threads per processor rather than just two threads. Even if they are on separate processors it's the same problem and the same limit, no?
The situation I am looking at resolving is a technique where rather than using a pipeline of length N to allow N-fold improvements in speed (other than when the pipeline has to be flushed), it can be used to get N-fold improvements in clock speed by treating the CPU as N processors. They are time-sliced, if you will. It also relieves a lot of the complexity of detecting problems and contingencies in the pipeline, streamlining it that much more. But...
When someone was proposing this design some time back another poster was insistent the design required an instruction level mutex, if I am using the term right. I don't recall the details of what he wrote, but I've always wondered how important this is.
One of the things that can further streamline a processor is to make all instructions single cycle, truly single cycle. So a read-modify-write instruction may be hard to implement depending on the memory used. But in the context of the above multi-processor design, if the RMW instruction takes multiple clock cycles, it would not be truly uninterrupted since the other N-1 processors would all get a clock (and an instruction) between the multiple clocks of the RMW instruction.
I need to bat this around in my head a bit more I think.
It's two threads, on one processor or two. Many discussions of Dekker's don't really consider software threads (as we know them today), as it predates common support for that in OSs. Call it two competitors for the lock.
This is one of the oldest, if not the very oldest, multi-threading technique for processors, dating back to the PPUs on the CDC-6600 (circa 1965). The PPUs were 10 time-sliced "threads" on the I/O processor.
As soon as you have memory or FP, not all instructions are single cycle any more. But as described below, LL/SC is the standard way of avoiding having a RMW instruction.
It is important to understand how much synchronization you need between the threads. If the individual tasks are very independent, and need little or no synchronization with tasks running on other threads, then you need only weak support for such in the hardware (and your code would probably run well on a GPU).
Load-linked/store-conditional. That's what most RISCs use to implement atomic operations. Basically, the load-linked establishes that you want to atomically update a location (as well as loading the existing value there); the store-conditional attempts to update that value (it's a store), but if something else has hit that location*, the SC will fail and you have to retry the update. In that sense it's similar to a compare-and-swap, rather than a true atomic update like test-and-set.
The desire to avoid multi-cycle instructions is the prime driver for LL/SC, as it lets the operation be split into individual simple operations. So let's say you wanted to atomically add R2 to a memory location, you'd code something like:
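Something like this Alpha-flavoured sketch (mnemonics approximate; R3 holds the address of the shared location):

```
retry:
    LDL_L   R1, 0(R3)     ; load-linked: R1 = [R3], start watching the location
    ADDL    R1, R2, R1    ; R1 = R1 + R2
    STL_C   R1, 0(R3)     ; store-conditional: R1 <- 1 on success, 0 on failure
    BEQ     R1, retry     ; something else touched [R3]: try the whole thing again
```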
The success flag is typically returned in a register. Alpha, for example returned it in the register operand of the SC (STL_C or STQ_C) so a simple BEQ (to test if the returned value was zero) was all that was needed to trigger the retry.
*FSVO "location" - usually the granularity is on something like a cache line or larger. There are often other restrictions as well - no interrupts or exceptions between the LL and SC, a maximum number of instructions between the pair, no other LL, loads or stores in general, branches, etc. The exact details varied considerably. Commonly this is described as having a flag set by the LL, and various conditions can cause the flag to be cleared before the SC executes - and if that flag is clear the SC fails.
I don't know what you mean when you use abbreviations that aren't clear. Is FP floating point? There is nothing inherent in floating point that makes it multiple cycles in a custom processor design.
There certainly is no reason for memory to be multiple cycle. I think you are picturing various implementations where memory is slow compared to the CPU. That's not a given.
How dependent the tasks are will vary with the problem being solved. I'm just looking at how best to utilize the facility available in FPGAs. The guy who was talking about the time slicing was looking into that when he was told about the need for a mutex.
I need to think how that could be implemented. One way would be to have an extra bit per memory location. Another would be to make the memory interface "smart" where it tracks the location you read as part of this function and handles the logic of monitoring other writes to it. That could be pretty simple really as long as it is one location per processor.
Would an instruction that reads a location while also writing to it do the job? If you are trying to lock a resource by setting a location to a 1 say, you set the 1 but also read it. If it comes back already a 1 you know it was already locked. There doesn't need to be the same difficulty of setting it to a 0 if you hold the lock.
Doing a read and a write at the same time in most FPGAs is not so hard. The block memory in an FPGA has simultaneous read/write functionality. In fact it looks like a large addressable register set requiring a clock edge to do the read. By clocking the memory on the opposite edge of the clock from the rest of the CPU, the write and read can happen at the same time, with the read result coming out half a clock before it has to be written to its destination. Win-win.
While you can imagine some FP formats that you might reasonably implement in a single cycle, they'd be pretty far out of the mainstream. The width, normalization, and rounding requirements make all but the simplest FP operations implausible to implement in what's usually made the time for a single cycle. OTOH you *could* slow down the rest of your processor so that you can accomplish the implemented FP operations in a single cycle. That is trivial to do, but why would you accept a three-fold reduction in clock rate for the integer part of your core just so you could have single-cycle FP instructions?
No, but such a situation is sufficiently uncommon that you'd really want to spec it up front. Most cores these days don't even have single cycle L1 caches.
The granularity can be pretty coarse. Commonly a cache-line or bigger. But you don't want to add a bit per memory word. With a cache, you'd keep track of any changes of ownership of the cache line. With direct memory, you can just track the memory area (perhaps make it blocks of 64B for simplicity) in each "thread", and then just watch all other writes from other threads/time-slices for matching addresses.
No, that's not enough in typical multi-processor implementations - if two cores/threads/whatever hit the same location, you can only return 0 to one of them, or both might think they acquired the lock. What you've described *is* a TAS, but not an atomic one.
OTOH, since you have only a single core, and you appear to be proposing that your TAS *will* be atomic (since it's going to execute in a single clock, and presumably not interleave with any other TAS in another timeslice), you're good to go.
Of course, there are alternatives. If you need only moderate locking for your one time-sliced CPU, you can implement a single global lock, and then just stall (prevent them from doing anything in their timeslice) any threads trying to acquire that lock if it's not available, or just spin. IOW this is a simplified form of TAS, and you'd use it to implement bigger atomic updates. You could also have several global flags ("lock0..lock15"), so you could have separate, non-conflicting things happening on different locks without contention.
IOW, (with a single global flag), you might end up with something like:
retry_lock:
        try_global_lock_bit     ;acquire global lock, set carry if success
        jnc     retry_lock      ;just a spin lock
        ...update shared item(s)...
        release_global_lock
The try_global_lock and release_global_lock would be the two instructions you'd implement.
Even on FPGAs, where the hard memory blocks are absurdly fast in comparison to logic (as compared to most of the real world), you're likely to see memory access to be multiple clocks unless your CPU logic is really slow.
There are many, many ways to implement synchronisation between threads, processors, whatever. In theory, they are mostly equivalent in that any one can be used to implement the others. In practice, there can be a lot of differences in the overheads in the hardware implementation, and the speed in practice.
Typical implementations are "compare-and-swap" instructions (x86 uses these) and load-linked/store-conditional (common on RISC systems, where instructions either load /or/ store, not both). And of course, on single processor systems there is always the "disable all interrupts" method.
But if you can use dedicated hardware, there are many other methods. The XMOS devices have hardware support for pipelines and message passing. On a dual-core PPC device I used, there is a hardware block of semaphores. Each semaphore is a pair of 16-bit ID, 16-bit value that you can only access as a 32-bit read or write. You can write to it if the current ID is 0, or if the ID you are writing matches that of the semaphore. There is plenty of scope for variation based on that theme.
I received my first XMOS board from Digi-Key a couple of days ago, and I'm looking forward to using it for some simple experiments. I /feel/ that many low-level things will be much simpler and with fewer potential nasties lurking in the undergrowth. (I felt the same with the Transputer, for obvious reasons, but never had a suitable problem at that time)
With your experience, did you find any undocumented gotchas and any pleasant or unpleasant surprises?
Before saying anything else, I would first note that my work with XMOS systems was about four years ago, when they first started getting popular. I believe many things that bugged me most have been improved since then, both in the hardware and software, but some may remain.
I think the devices themselves are a really neat idea. You have very fast execution, very efficient hardware multi-threading, very predictable timings, and a variety of inter-thread and inter-process communication methods.
Their "XC" programming language was also a neat idea, based on C with additional primitives to support the hardware features and multi-threading stuff, and an attempt to make some aspects of C safer (real arrays, control of when you can access variables, etc.).
However, IMHO the whole thing suffered from a number of serious flaws that limit the possibilities for the chips. Sure, they would work well in some circumstances - but I was left with the feeling that "if only they had done /this/, the devices would be so much better and could be used for so many more purposes". It is a little unfair to concentrate on the shortcomings rather than the innovations and features, but that is how I felt when using them. And again, I know that at least some issues here have been greatly improved since I last used them.
An obvious flaw with the chips is lack of memory. The basic device with one cpu and 8 threads had 64K ram that was for program memory and run-time data. There was no flash - you had to use an external SPI flash which used valuable pins (messing up the use of blocks of 8, 16 or 32 pins), and used up a thread if you wanted to be able to access the flash at run-time. And while you could implement an Ethernet MAC or a 480 Mbps USB 2.0 interface on the chip, there was nowhere near enough ram for buffering or to do anything useful with the interface. Adding external memory was ridiculously expensive in terms of pins, threads, and run-time inefficiency.
The hardware threading is great, and provides a really easy model for all sorts of things. To make a UART transmitter, you have a thread that waits for data coming in on a pipe. To transmit a bit, you set a pin, wait for a bit time (using hardware timers), then move on to the next bit. The code is simple and elegant. A UART receiver is not much harder. There is lots of example code in this style.
Then you realise that to implement a UART, you have used a quarter of the chip's resources. Your elegant flashing light is another thread, as is your PWM output. Suddenly you find you are using a 500 MIPS chip to do the work of a $0.50 microcontroller, and you only have a thread or two left for the actual application.
And you end up trying to run FreeRTOS on one of your threads, or make your own scheduler to multiplex several PWM channels in one thread. Much of the elegance quickly disappears for real-world applications.
Then there is the software. The XC language lets you write code that starts tasks in parallel, automatically allocates channels for communication, lets you declare timers and wait on them. That's all great in theory - but it quickly gets confusing when you try to figure out the details of when you can pass these around, when they get allocated and deallocated, or when you can have a thread create new threads. XC carefully tracks threads and data accesses, spotting and blocking all sorts of possible race conditions. If a variable is written by one thread, then it can't be accessed from another. You can work with arrays safely, but you can't take addresses. Data gets passed between threads using communication channels that are safe from race conditions and nicely synchronised.
And then you realise that to actually make the thing work, you would need far more channels than there are on the device, and they would need to be far faster - all you really wanted was for two threads to share a circular buffer, and you know in your application code when it is safe to use it. But you can't do that in XC - the language and the tools won't let you. So you have to write that code in C, with calls back and forth with the XC code that handles the multi-threading stuff.
And then you realise that from within the C, you need to access some hardware resources like timers, that can't be expressed properly in C, and you can't get back to the XC code at the time. So you end up with inline assembly.
Then there are the libraries and examples. These were written in such a wide variety of styles that it was impossible to figure out what was going on. A typical example project would involve a USB interface and, for example, SPDIF channels for a USB audio interface. The Eclipse-based IDE was fine, but the example did not come as a project - it came as a collection of interdependent projects.

Some bits referred to files in different projects. Some bits merely required other projects to be compiled. Some bits of the code in one project would use assembly for hardware resources, others would use XC, others would use C intrinsic functions, and others would use a sort of XML file that defines the setup for your chip resources.

If you change values in one file in one project (say, the USB vendor ID), you have to figure out which sub-projects need to be manually forced to re-build in order for it to take effect consistently throughout the project. Some parts use a fairly obvious configuration file - a header with defines that let you control things like IDs, number of channels, pins, etc. Except they don't - only /some/ of the sub-projects read and use the configuration file; other parts are hard-coded or use values from elsewhere. It was a complete mess.
Now, I know that newer XMOS devices have more resources, built-in flash, proper hardware peripherals for the devices that are most demanding or popular, and so on. And I can only hope that the language and tools have been improved to the point where inline assembly is not required, and that the examples and libraries have matured to the point that the libraries are usable as-is, and the examples show practical ways to develop code.
I really hope XMOS does well here - it is so good to see a company that thinks in a very different way and brings in these new ideas. So if your experience with modern XMOS devices and tools is good, I would love to hear about it.
Thanks for a speedy, comprehensive response. I'll re-read and digest it properly later.
My initial gut feel is that many of your points were valid and probably are still valid - because they /ought/ to still be valid.
The issues that most interest me relate to where you found it necessary to step outside the toolchain. Part of me thinks (hopes, really) that it is merely because your problem wasn't well suited to the device's strengths (esp. guaranteed timing), and/or was too big, and/or importing existing code/thinking led to friction, and/or the tools were immature.
I expect I'll end up agreeing with many of your observations, but I'll have fun finding that out :)
I know that at least some of my points are no longer an issue, or at least not as much of an issue - XMOS have devices with flash, USB hardware, etc. At least some of the toolchain issues should be fixable. And the mess of the examples and libraries is certainly fixable - at least, if one disregards the time and effort it would involve!
The existing code was mainly XMOS's own examples, libraries and reference designs...
I do agree that much of their USB stuff was poorly suited to the devices and too big for them, and that probably made things worse - but it was XMOS's own code. With newer devices with hardware USB peripherals, I expect fewer such issues.
I will go along with your hope - expectation, even - that the tools have matured and improved over time.
Yes, those are precisely the aspects that interest me. I'm particularly interested in easy-to-implement hard realtime systems.
As far as I am concerned, caches and interrupts make it difficult to guarantee hard realtime performance, and C's explicit avoidance of multiprocessor support biases C away from "easy-to-implement".
Yes, I know about libraries and modern compilers that may or may not compile your code in the way you expect!
I'd far rather build on a solid foundation than have to employ (language) lawyers to sort out the mess ;)
Besides, I want to use Occam++ :)
Yes, those did strike me as limitations to the extent I'm skeptical about networking connectivity. But maybe an XMOS device plus an ESP8266 would be worth considering for some purposes.
Yes. However I don't care about wasting some resources if it makes the design easier/faster, so long as it isn't too expensive in terms of power and money.
Needing to run a separate RTOS would be a code smell. That's where the CSP+multicore approach /ought/ to be sufficient. For practicality, I exclude peripheral libraries and networking code from that statement.
That's the kind of thing I'm interested in exploring.
That's the kind of thing I'm interested in exploring.
At which point many advantages would have been lost.
Irritating, but not fundamental, and as you point out, they could be fixed with application of time and money.
I think we are largely in violent agreement.
A first experiment: get the StartKIT to echo anything it receives on the USB line, with a second task flipping uppercase to lowercase. That should give me a feel for low-end resource usage.
Then I'd like to make a reciprocal frequency counter to see how far I can push individual threads and their SERDES-like IO primitives.
And I'll probably do some bitbashing to create analogue outputs that draw pretty XY pictures on an analogue scope.
The XMOS devices have a lot more control over timing than you get on a normal processor (especially a superscalar one with caches). The development tools include a lot of timing analysis stuff too.
XC is still based on C !
It is a /long/ time since I used Occam (except for shaving away unlikely explanations). It was fun, as long as you were careful with the spacing in your source code.
XMOS programming is more inspired by Hoare's CSP than Occam. They should probably implement Go for the devices.
There are XMOS devices with Ethernet hardware, I believe - that seems the best idea.
The chip has 8 hardware threads (per core). If you want more than 8 threads, you either have a bigger and more expensive chip, or you have to do something in software.
Yes. Hopefully, newer versions of the tools make this sort of hack unnecessary.
You'll have fun, anyway - unless you get issues with dependency messes from sub-projects under Eclipse. If that happens, I recommend re-arranging things so that you have one project with multiple directories - it will save you a lot of hair-pulling.
There is nothing implausible about single cycle floating point. It is all a matter of context and application. Not every design has to scream with the highest possible clock rate at the expense of pipeline complications. Don't impose an architecture until you know the requirements.
Again, assuming an architecture and requirements. This conversation is not about designing the next generation of ARM CPU.
Why don't I want to add a bit per memory location? If that bit happens to be free, I'd be silly to let it go to waste.
You are ignoring that the instruction also writes a 1 to the location. So two processors *can't* both read a zero. Why is it not atomic? The write and read happen on the same clock cycle. Much like Dekker's algorithm it sets the flag to exclude other processors as a priority. Then it can at leisure check if the flag was already set or if it "captured" the flag.
Again, an abbreviation I don't know. What's a TAS?
Not sure what you are assuming here. There is one "core", but it is time shared N ways giving N functional processors. Each one operates on a different clock cycle however, so any single instruction which takes one clock cycle is truly atomic and the other processors can't interfere.
Stalling the other processors is a bad idea. The primary apps for my designs are hard real-time and if one processor is dealing with something that needs 100% of its attention at that moment and it is stalled because another processor wants to lock some resource... not good.
Limiting the stall to processors that are accessing the same location in memory would be very messy in terms of logic.
The multiple channel approach was suggested by the previous designer and was shot down by the antagonist. I don't recall any of the detail though. If there is a resource that needs to be shared, it requires a mutex. I'm not sure how to get around that.
Isn't try_global_lock what my read/write (essentially a swap) instruction does? A one is written to the memory location no matter what. The previous value is returned. The previous value tells you if you acquired it or not (return of one means it was already locked, return of zero means you locked it). To perform the release you just write a zero, but only if you have the lock.
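In C11 terms, that swap instruction behaves like atomic_exchange - a sketch, with names of my own choosing (the single-cycle guarantee is the hardware's job; atomic_exchange just models the semantics):

```c
#include <stdatomic.h>

static atomic_int lock_word;  /* 0 = free, 1 = held */

/* The swap described above: write a 1 unconditionally, get the old
   value back. Old value 0 means we acquired the lock; 1 means it was
   already held. */
int try_lock(void) {
    return atomic_exchange(&lock_word, 1) == 0;  /* 1 if we acquired it */
}

void release(void) {
    atomic_store(&lock_word, 0);  /* only the holder writes the 0 */
}
```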
Not sure what you are referring to here. FPGA memory is *very* fast, almost as fast as registers. Likely you are thinking of external memory, in particular DRAM.
Yes, but hopefully with little need to use the important areas that keep language lawyers employed.
If the "awkward bits" still /have/ to be used, then Opportunities Will Have Been Missed.
Indeed. I've toyed with Python and don't dislike its ethos, but semantically significant spaces make me shudder. For a start, how does an editor's copy-and-paste work reliably? Then consider refactoring browsers.
I still remember makefiles, where tabs were invisibly different to spaces.
I've always regarded Occam (and now XC) as an instantiation of CSP in a language rather than a library. To that extent I haven't needed to distinguish.
The hardware seems the least of the problem; the software stack is more of an issue.
Yes. But I haven't thought through the ramifications of that, yet. In particular, I haven't developed my set of design patterns.
There's too much unreasoned prejudice against cut-n-paste, and too much in favour of common code bases. Sometimes cut-and-paste is optimum, e.g. when developing several PCBs over several years, the best way of ensuring you can regenerate a design is to copy all library components into project-specific libraries.
I don't mind code duplication if it means that I can find what needs to be modified, and then modify it in the knowledge that I haven't affected other parts of a system. I've seen companies get that dreadfully wrong, with the result that mods took 6 months to get to a customer.
It would be making architecture assumptions to require single-cycle memory - saying that memory may be multiple cycle is the general case.
(You might /want/ to make architecture assumptions here - I gave some other suggestions of specific hardware for locking in another post. The solutions discussed here are for general memory.)
Normally, one does not have a bit free per memory location.
LL/SC solutions usually have only a single tracker per cpu, or even just a single tracker per shared memory bus. The granularity might be a machine word, or a cache line. The expectation is that these sequences will be so short that it is very rare for a conflict to occur, so it is fine to have just one tracker. You set the tracker when reading from the memory to get the old value, then use a conditional store that will only do the write if nothing else has touched that same location since your load. It gives you a "test and set" type operation, but split into RISC-friendly separate load and store operations.
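That split can be sketched in C11, with compare-exchange standing in for the store-conditional (an assumption: real LL/SC hardware tracks the address or line rather than comparing values, but the retry structure is the same):

```c
#include <stdatomic.h>

static atomic_int word;  /* the location being "test and set" */

/* Load the old value, then conditionally store a 1; if the location
   changed in between, the store fails and we retry. */
int fetch_and_set(atomic_int *p) {
    int old = atomic_load(p);                    /* the "load-linked" part */
    while (!atomic_compare_exchange_weak(p, &old, 1))
        ;  /* "SC" failed: old now holds the fresh value, retry */
    return old;                                  /* value seen before we set it */
}
```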
Normally, you cannot read and write a memory location at the same time. It would only be possible with specialised hardware, or with a bus locking mechanism (which can be the basis of your synchronisation primitive). In a normal memory bus, you have a read operation then a write operation. If you don't have bus locking of some sort, then cpu 1 could read 0, then cpu 2 could read 0, then cpu 1 would write 1, then cpu 2 would write 1.
"test and set". It reads the old value in a memory location, sets the memory to 1, and returns the old value - all as an atomic operation. Atomicity is key here - and normal memories are not atomic.
(If you are thinking of using an FPGA here, it is often possible to implement a memory block where you /do/ read the old value on the same clock cycle as writing the new one, making a TAS operation simple to implement.)
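C's standard library happens to expose exactly this operation as atomic_flag - a small illustration, not tied to any particular hardware:

```c
#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

/* atomic_flag_test_and_set: atomically set the flag and return its
   previous state - the TAS semantics described above. */
int already_locked(void) {
    return atomic_flag_test_and_set(&lock_flag);  /* 0 first time, 1 after */
}

void unlock_flag(void) {
    atomic_flag_clear(&lock_flag);
}
```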
If this is going to be efficient, it is unlikely that the core supports reading and writing within the same instruction - LL/SC is usually a better choice.
We are having two different conversations here. I am designing a CPU, you are talking about the theory of CPU design.
As I have said before, I am designing a CPU in an FPGA. I have FPGA features available to me that are different than what is available to chip designers. I do not have the same constraints as chip designers.
I was asking what is required in an instruction set to support multi-processors and I think the answer is that the memory swap instruction would do the job efficiently.
I am not sure a bit per reservation unit will work in a multiprocessor environment (it certainly will on your single core design, obviously).
Both the reserved address (range, typically) and the reserving processor will need to know who reserved what. But this is just a "gut feeling", I did not think seriously about it.
Test And Set, a classic 68k RMW opcode. Read a byte (just byte size allowed), adjust the Z flag (Z IIRC...) to the state of bit 7 (MSB), then write the byte back with bit 7 set. While doing this the processor provides some means (holds AS asserted on the 68k IIRC) such that access to the address cannot be granted to another processor for the entire duration of the TAS opcode.
Looks like in your case a simple bset (again 68k lingo, bit test and set, but on two separate memory accesses, potentially interruptible by another bus master - not by a processor interrupt, it is a single instruction) will do what you need.