Speaking of Multiprocessing...

I recall a discussion about the design of an instruction set  
architecture where someone was saying an instruction was required to  
test and set a bit or word as an atomic operation if it was desired to  
support multiple processors.  Is this really true?  Is this a function  
that can't be emulated with other operations including the disabling of  
interrupts?

--  

Rick C

Re: Speaking of Multiprocessing...
On Thu, 23 Mar 2017 18:38:13 -0400, rickman wrote:


AFAIK as long as you surround your "test and set" with an interrupt  
disable and an interrupt enable then you're OK.  At least, you're OK  
unless you have a processor that treats interrupts really strangely.
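
For the uniprocessor case it's only a few lines of C.  A minimal
sketch, where disable_interrupts()/restore_interrupts() are
hypothetical stand-ins for whatever intrinsics your compiler or RTOS
actually provides:

    /* Uniprocessor-only: nothing else can run between the two calls.
       disable_interrupts()/restore_interrupts() are hypothetical
       stand-ins for the target's real interrupt-control intrinsics. */
    extern unsigned disable_interrupts(void);
    extern void restore_interrupts(unsigned state);

    static volatile int lock_word;

    int test_and_set_uni(void)
    {
        unsigned state = disable_interrupts();
        int old = lock_word;            /* test... */
        lock_word = 1;                  /* ...and set */
        restore_interrupts(state);
        return old;                     /* 0 means we got the lock */
    }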

--  

Tim Wescott
Wescott Design Services
Re: Speaking of Multiprocessing...
On 3/23/2017 4:19 PM, Tim Wescott wrote:

Rethink that for the case of SMP...  (coincidentally, "Support Multiple
Processors"  :> )


Re: Speaking of Multiprocessing...
On Thu, 23 Mar 2017 16:26:46 -0700, Don Y wrote:


D'oh.  Atomic to the common memory, not to each individual processor, yes.

Although it wouldn't have to be an instruction per se: you could have it  
be an "instruction" to whatever hardware is controlling the common  
memory, to hold off the other processors while it does a read/modify/
write cycle.

--  

Tim Wescott
Wescott Design Services
Re: Speaking of Multiprocessing...
On 3/23/2017 4:47 PM, Tim Wescott wrote:

Yes.  If the processor supports a RMW memory cycle AND the memory
arbiter honors that contract, then any competing processors would
explicitly be held off from accessing the location in question
until the RMW cycle terminated.


Yes.  The point is that you have to be able to examine and alter
the state in specific memory locations indivisibly.  You can also
introduce specific hardware mutexes, but then you are limited (in
practical terms) by their number.

[I designed a processor many years ago that used such a mechanism to
synchronize with its peers]

You can also have the memory arbiter notice the cycle and "remember"
it even while allowing it to be "interrupted" (poor choice of
words) in the hope that the interruption does not COMPETE for that
particular memory location.  *If* the arbiter notices that the
interruption actually *does* compete for that location, then it
can explicitly act to stall that competing request until the
original "remembered" request has a chance to finish (presumably
"in short order").

It boils down to memory topology and the amount of resources you want
to throw at supporting multiple concurrent accesses.  As RMW's *tend*
to be infrequent, you can just opt for the easy way out...

By far, the easiest solution is a RMW memory cycle as you can then
design an arbiter to honor *it* while, conceivably, allowing OTHER
accesses to proceed in parallel (you just have to ensure that
the RMW accessed location is treated atomically; all others are
free to be accessed at will, concurrent with that RMW!)

Re: Speaking of Multiprocessing...
On Thu, 23 Mar 2017 17:49:39 -0700, Don Y


The R/M/W cycle was popular a few decades ago, when core memory was
much slower than the processor.  The read operation in core memory is
destructive, so the original value has to be written back.  This is
usually done within the memory controller, but the same or a modified
value could also be written back by the processor, so you get the
R/M/W sequence practically for free.

In modern systems, things get complicated, since you may have to read
a full 64 bit memory word, bypassing caches on both read and write
while keeping the RAS active through the whole sequence.


Re: Speaking of Multiprocessing...
On 3/24/2017 12:48 AM, snipped-for-privacy@downunder.com wrote:

On many "microprocessors", there are hints as to when RMW cycles are
undertaken.  E.g., the m68k would hold its address strobe asserted
across the "two phase" RMW cycle (a consequence of the TAS opcode).

But, this requires the memory arbiter (for closely coupled coprocessors)
to monitor /AS and not attempt early (read vs write) cycle termination
(which is a potential performance hack in a shared memory system) by
just watching the individual data strobes.

Other legacy processors usually had exploits that could be leveraged
to deduce when RMW-ish cycles were in effect -- at the cost of a bit
of external logic (e.g., decoding opcode fetch cycles to provide
cleaner arbitration points for memory sharing).

However, as bus interface units became increasingly decoupled from
execution units, it got harder to reliably infer what is ACTUALLY
happening in the CPU just by watching the bus.


With SoC's, there's very little you can do to second-guess the processor
so you have to rely on the processor to perform this sort of access
(esp as the memory in question might be entirely "internal" to the processor).



Re: Speaking of Multiprocessing...
On 24/03/17 10:47, Tim Wescott wrote:

Yes. But when you have multi-level caching, perhaps some with
write-back semantics, it needs to force write-through, and be
bus-locked all the way to the common memory.

X86 has a LOCK prefix which acts on certain following instructions
to make this happen, and SMP and multi-CPU architectures honor it.

Clifford Heath.

Re: Speaking of Multiprocessing...
On Fri, 24 Mar 2017 21:04:25 +1100, Clifford Heath



Well, if the memory is write-through, then yes, the write will go all
the way to memory.

OTOH, that's certainly *not* the case for the vast majority of atomic
or "locked" operations, on x86, or any other high end CPU.  This
almost always happens in cache, but the line in question needs to be
held exclusively by the CPU issuing the atomic operation.  If it's
already in that state, then it happens very fast.  If it's not in that
state, the line may need to be fetched from main memory (if no other
core has it), from another core (if another core has that line and
it's modified), or the state may need to be changed to exclusive, by
invalidating the shared copy in any other cores.  The exact details
vary, but "lock" does not force a bus lock all the way to main memory,
except, possibly, in the case of write through and/or non-cached
memory.

Re: Speaking of Multiprocessing...
On 25/03/17 14:01, Robert Wessel wrote:

Thanks for the clarification. I was aware of the result, but not of
the recent implementations. It certainly seems to work in the software
I've shipped anyhow.

Clifford Heath.


Re: Speaking of Multiprocessing...

How does disabling interrupts prevent another processor from messing
up your "atomic" operation?

--  
Grant Edwards               grant.b.edwards        Yow! I want EARS!  I want
                                  at               two ROUND BLACK EARS
Re: Speaking of Multiprocessing...



Dekker's algorithm (circa 1962) can be used to implement a mutex
without test-and-set, or even masking interrupts, and really only
requires sequential consistency (and if that's an issue for your
hardware, it should have appropriate memory barriers to force it).

https://en.wikipedia.org/wiki/Dekker%27s_algorithm

A limitation is that Dekker's only works with two threads/processors.

Several related algorithms exist, and generalizations of some of those
(for example, Peterson's) can handle more than two threads.

And once you have a mutex, you can implement everything else.
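
For two competitors it is only a few lines; a minimal sketch in C11
(the default seq_cst ordering of the atomics supplies the sequential
consistency mentioned above; the names are mine):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool wants_entry[2];  /* one flag per thread (0, 1) */
    static atomic_int  turn;            /* whose turn it is to back off */

    void dekker_lock(int self)
    {
        int other = 1 - self;
        atomic_store(&wants_entry[self], true);
        while (atomic_load(&wants_entry[other])) {
            if (atomic_load(&turn) != self) {
                /* Not our turn: withdraw, wait, then try again. */
                atomic_store(&wants_entry[self], false);
                while (atomic_load(&turn) != self)
                    ;                   /* spin */
                atomic_store(&wants_entry[self], true);
            }
        }
    }

    void dekker_unlock(int self)
    {
        atomic_store(&turn, 1 - self);
        atomic_store(&wants_entry[self], false);
    }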

OTOH, proper atomic instructions are a huge boon (even if only in
LL/SC form), so they're provided by everyone who's serious about
multiprocessor support.

Re: Speaking of Multiprocessing...
On 3/23/2017 11:43 PM, Robert Wessel wrote:

Interesting.  I'm not sure it's limited to two threads per processor  
rather than just two threads.  Even if they are on separate processors  
it's the same problem and the same limit, no?

The situation I am looking at resolving is a technique where, rather
than using a pipeline of length N to allow an N-fold improvement in
throughput (other than when the pipeline has to be flushed), the
pipeline is used to get an N-fold improvement in clock speed by
treating the CPU as N processors.  They are time-sliced, if you will.
It also relieves a lot of the complexity of detecting hazards and
contingencies in the pipeline, streamlining it that much more.  But...

When someone proposed this design some time back, another poster was
insistent that the design required an instruction-level mutex, if I am
using the term right.  I don't recall the details of what he wrote, but
I've always wondered how important this is.

One of the things that can further streamline a processor is to make all  
instructions single cycle, truly single cycle.  So a read-modify-write  
instruction may be hard to implement depending on the memory used.  But  
in the context of the above multi-processor design, if the RMW  
instruction takes multiple clock cycles, it would not be truly  
uninterrupted since the other N-1 processors would all get a clock (and  
an instruction) between the multiple clocks of the RMW instruction.

I need to bat this around in my head a bit more I think.



LL/SC???  Low level, s.... c...?

--  

Rick C

Re: Speaking of Multiprocessing...



It's two threads, on one processor or two.  Many discussions of
Dekker's don't really consider software threads (as we know them
today), as it predates common support for them in OSs.  Call it two
competitors for the lock.




This is one of the oldest, if not the very oldest, multi-threading
techniques for processors, dating back to the PPUs on the CDC-6600
(circa 1965).  The PPUs were 10 time-sliced "threads" on the I/O
processor.




As soon as you have memory or FP, not all instructions are single
cycle any more.  But as described below, LL/SC is the standard way of
avoiding having an RMW instruction.

It is important to understand how much synchronization you need
between the threads.  If the individual tasks are very independent,
and need little or no synchronization with tasks running on other
threads, then you need only weak support for such in the hardware (and
your code would probably run well on a GPU).




Load-linked/store-conditional.  That's what most RISCs use to
implement atomic operations.  Basically, the load-linked establishes
that you want to atomically update a location (as well as loading the
existing value there); the store-conditional then attempts to update
that value (it's a store), but if something else has hit that
location* in the meantime, the SC will fail and you have to retry the
update.  In that sense it's similar to a compare-and-swap, rather than
a true atomic update like test-and-set.

The desire to avoid multi-cycle instructions is the prime driver for
LL/SC, as it lets the operation be split into individual simple
operations.  So let's say you wanted to atomically add R2 to a memory
location, you'd code something like:

 retry:
    ll   r1,mem           ; load-linked: read mem, start watching it
    add  r1,r2            ; compute the updated value
    sc   r1,mem           ; store-conditional: fails if mem was touched
    branch-if-fail retry

The success flag is typically returned in a register.  Alpha, for
example returned it in the register operand of the SC (STL_C or STQ_C)
so a simple BEQ (to test if the returned value was zero) was all that
was needed to trigger the retry.
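
In C11 the same retry loop is usually written with a compare-exchange
standing in for the SC; a sketch equivalent to the assembly above:

    #include <stdatomic.h>

    /* Atomically add 'r2' to *mem.  The weak compare-exchange fails
       if someone else updated *mem since the load - playing the role
       of a failing SC - and refreshes 'old' so we can retry. */
    void atomic_add(atomic_int *mem, int r2)
    {
        int old = atomic_load(mem);                    /* the "ll" */
        while (!atomic_compare_exchange_weak(mem, &old, old + r2))
            ;                                          /* the retry */
    }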


*FSVO "location" - usually the granularity is on something like a
cache line or larger.  There are often other restrictions as well - no
interrupts or exceptions  between the LL and SC, a maximum number of
instructions between the pair, no other LL, loads or stores in
general, branches, etc.  The exact details varied considerably.
Commonly this is described as having a flag set by the LL, and various
conditions can cause the flag to be cleared before the SC executes -
and if that flag is clear the SC fails.

Re: Speaking of Multiprocessing...
On 3/24/2017 1:37 AM, Robert Wessel wrote:

I don't know what you mean when you use abbreviations that aren't clear.  
  Is FP floating point?  There is nothing inherent in floating point  
that makes it multiple cycles in a custom processor design.

There certainly is no reason for memory to be multiple cycle.  I think  
you are picturing various implementations where memory is slow compared  
to the CPU.  That's not a given.



How dependent the tasks are will vary with the problem being solved.
I'm just looking at how best to utilize the facilities available in
FPGAs.  The guy who was talking about the time slicing was looking into
that when he was told about the need for a mutex.



I need to think how that could be implemented.  One way would be to have  
an extra bit per memory location.  Another would be to make the memory  
interface "smart" where it tracks the location you read as part of this  
function and handles the logic of monitoring other writes to it.  That  
could be pretty simple really as long as it is one location per processor.

Would an instruction that reads a location while also writing to it do
the job?  If you are trying to lock a resource by setting a location to
a 1, say, you set the 1 but also read it.  If it comes back already a 1,
you know it was already locked.  Clearing it back to a 0 when you hold
the lock presents no such difficulty.

Doing a read and a write at the same time in most FPGAs is not so hard.
The block memory in an FPGA has simultaneous read/write functionality.
In fact it looks like a large addressable register set requiring a clock
edge to do the read.  By clocking the memory on the opposite edge of the
clock from the rest of the CPU, the write and read can happen at the
same time, with the read result coming out a half clock before it has to
be written to its destination.  Win-win.




--  

Rick C

Re: Speaking of Multiprocessing...



Yes.




While you can imagine some FP formats that you might reasonably
implement in a single cycle, they'd be pretty far out of the
mainstream.  The width, normalization, and rounding requirements make
all but the simplest FP operations implausible to implement in what is
usually allotted as the time for a single cycle.  OTOH you *could* slow
down the rest of your processor so that you can accomplish the
implemented FP operations in a single cycle.  While that is trivial to
do, why would you accept a three-fold reduction in clock rate for the
integer part of your core just so you could have single-cycle FP
instructions?




No, but such a situation is sufficiently uncommon that you'd really
want to spec it up front.  Most cores these days don't even have
single cycle L1 caches.




The granularity can be pretty coarse.  Commonly a cache-line or
bigger.  But you don't want to add a bit per memory word.  With a
cache, you'd keep track of any changes of ownership of the cache line.
With direct memory, you can just track the memory area (perhaps make
it blocks of 64B for simplicity) in each "thread", and then just watch
all other writes from other threads/time-slices for matching
addresses.
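
A behavioural sketch of that kind of tracker in C (the 64B block size
and all the names here are my own assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_MASK (~(uintptr_t)63)   /* track 64B blocks */

    /* One reservation per thread/time-slice. */
    typedef struct {
        uintptr_t block;
        bool      valid;
    } reservation;

    void ll_track(reservation *r, uintptr_t addr)   /* on LL */
    {
        r->block = addr & BLOCK_MASK;
        r->valid = true;
    }

    /* Fed every write address from *other* threads. */
    void snoop_write(reservation *r, uintptr_t addr)
    {
        if (r->valid && (addr & BLOCK_MASK) == r->block)
            r->valid = false;
    }

    bool sc_check(reservation *r, uintptr_t addr)   /* on SC */
    {
        bool ok = r->valid && (addr & BLOCK_MASK) == r->block;
        r->valid = false;              /* consumed either way */
        return ok;
    }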




No, that's not enough in typical multi-processor implementations - if
two cores/threads/whatever hit the same location, you can only return
0 to one of them, or both might think they acquired the lock.  What
you've described *is* a TAS, but not an atomic one.

OTOH, since you have only a single core, and you appear to be
proposing that your TAS *will* be atomic (since it's going to execute
in a single clock, and presumably not interleave with any other TAS in
another timeslice), you're good to go.

Of course, there are alternatives.  If you need only moderate locking
for your one time-sliced CPU, you can implement a single global lock,
and then just stall (prevent them from doing anything in their
timeslice) any threads trying to acquire that lock if it's not
available, or just spin.  IOW this is a simplified form of TAS, and
you'd use it to implement bigger atomic updates.  You could also have
several global flags ("lock0..lock15"), so you could have separate,
non-conflicting, things happening on different locks without
contention.

IOW, (with a single global flag), you might end up with something
like:

  retry_lock:
     try_global_lock        ;acquire global lock, set carry if success
     jnc retry_lock         ;not yet ours - just a spin lock
     ...update shared item(s)...
     release_global_lock

The try_global_lock and release_global_lock would be the two
instructions you'd implement.
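
In C11 terms the pair maps onto atomic_flag; a sketch mirroring the
pseudo-ops above:

    #include <stdatomic.h>

    static atomic_flag global_lock = ATOMIC_FLAG_INIT;

    void acquire_global_lock(void)
    {
        /* test-and-set returns the previous value, so 'true' means
           someone else already holds the lock - spin until clear. */
        while (atomic_flag_test_and_set(&global_lock))
            ;                          /* like jnc retry_lock */
    }

    void release_global_lock(void)
    {
        atomic_flag_clear(&global_lock);
    }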




Even on FPGAs, where the hard memory blocks are absurdly fast in
comparison to the logic (as compared to most of the real world), you're
likely to see memory accesses take multiple clocks unless your CPU
logic is really slow.



Re: Speaking of Multiprocessing...
On 3/24/2017 3:07 AM, Robert Wessel wrote:

There is nothing implausible about single cycle floating point.  It is  
all a matter of context and application.  Not every design has to scream  
with the highest possible clock rate at the expense of pipeline  
complications.  Don't impose an architecture until you know the  
requirements.



Again, assuming an architecture and requirements.  This conversation is  
not about designing the next generation of ARM CPU.



Why don't I want to add a bit per memory location?  If that bit happens  
to be free, I'd be silly to let it go to waste.



You are ignoring that the instruction also writes a 1 to the location.  
So two processors *can't* both read a zero.  Why is it not atomic?  The  
write and read happen on the same clock cycle.  Much like Dekker's  
algorithm it sets the flag to exclude other processors as a priority.  
Then it can at leisure check if the flag was already set or if it  
"captured" the flag.

Again, an abbreviation I don't know.  What's a TAS?



Not sure what you are assuming here.  There is one "core", but it is  
time shared N ways giving N functional processors.  Each one operates on  
a different clock cycle however, so any single instruction which takes  
one clock cycle is truly atomic and the other processors can't interfere.



Stalling the other processors is a bad idea.  The primary apps for my  
designs are hard real-time and if one processor is dealing with  
something that needs 100% of its attention at that moment and it is  
stalled because another processor wants to lock some resource... not good.

Limiting the stall to processors that are accessing the same location in  
memory would be very messy in terms of logic.

The multiple channel approach was suggested by the previous designer and  
was shot down by the antagonist.  I don't recall any of the detail  
though.  If there is a resource that needs to be shared, it requires a  
mutex.  I'm not sure how to get around that.



Isn't try_global_lock what my read/write (essentially a swap)  
instruction does?  A one is written to the memory location no matter  
what.  The previous value is returned.  The previous value tells you if  
you acquired it or not (return of one means it was already locked,  
return of zero means you locked it).  To perform the release you just  
write a zero, but only if you have the lock.



Not sure what you are referring to here.  FPGA memory is *very* fast,
almost as fast as registers.  Likely you are thinking of external  
memory, in particular DRAM.

--  

Rick C

Re: Speaking of Multiprocessing...
On 24/03/17 18:36, rickman wrote:

<snip>


It would be making architecture assumptions to require single-cycle
memory - saying that memory may be multiple cycle is the general case.

(You might /want/ to make architecture assumptions here - I gave some
other suggestions of specific hardware for locking in another post.  The
solutions discussed here are for general memory.)


Normally, one does not have a bit free per memory location.

LL/SC solutions usually have only a single tracker per cpu, or
even just a single tracker per shared memory bus.  The granularity might
be for a machine word, or for a cache line.  The expectation is that
these sequences will be so short that it is very rare for a conflict to
occur, so it is fine to have just one tracker.  You set the tracker when
reading from the memory to get the old value, then use a conditional
store that will only do the write if nothing else has touched that same
location since your load.  It gives you a "test and set" type operation,
but split into RISC-friendly separate load and store operations.


Normally, you cannot read and write a memory location at the same time.
 It would only be possible with specialised hardware, or with a bus
locking mechanism (which can be the basis of your synchronisation
primitive).  In a normal memory bus, you have a read operation then a
write operation.  If you don't have bus locking of some sort, then cpu 1
could read 0, then cpu 2 could read 0, then cpu 1 would write 1, then
cpu 2 would write 1.


"test and set".  It reads the old value in a memory location, sets the
memory to 1, and returns the old value - all as an atomic operation.
Atomicity is key here - and normal memories are not atomic.

(If you are thinking of using an FPGA here, it is often possible to
implement a memory block where you /do/ read the old value on the same
clock cycle as writing the new one, making a TAS operation simple to
implement.)
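
In C11 terms, for comparison, the whole operation is a single exchange
(a sketch):

    #include <stdatomic.h>

    /* atomic_exchange writes the 1 and returns the previous contents
       in one indivisible step: 0 means we acquired the lock. */
    int test_and_set(atomic_int *lock)
    {
        return atomic_exchange(lock, 1);
    }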


If this is going to be efficient, it is unlikely that the core supports
reading and writing within the same instruction - LL/SC is usually a
better choice.



Re: Speaking of Multiprocessing...
On 3/24/2017 2:06 PM, David Brown wrote:

We are having two different conversations here.   I am designing a CPU,  
you are talking about the theory of CPU design.



As I have said before, I am designing a CPU in an FPGA.  I have FPGA  
features available to me that are different than what is available to  
chip designers.  I do not have the same constraints as chip designers.

I was asking what is required in an instruction set to support  
multi-processors and I think the answer is that the memory swap  
instruction would do the job efficiently.

--  

Rick C

Re: Speaking of Multiprocessing...
On 24/03/17 19:12, rickman wrote:

Aha!  That is some useful information to bring to the table.  You told  
us earlier about some things that you are /not/ doing, but not what you  
/are/ doing.  (Or if you did, I missed it.)

In that case, I would recommend making some dedicated hardware for your  
synchronisation primitives.  A simple method is one I described earlier  
- have a set of memory locations where the upper half of each 32-bit
entry is for a "thread id".  You can only write to the entry if the
current upper half is 0, or it matches the thread id you are writing.
It should be straightforward to implement in an FPGA.
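
A behavioural C model of one such entry (field widths and names are my
own assumptions; in the FPGA this check is just the registered
write-enable condition):

    #include <stdbool.h>
    #include <stdint.h>

    /* Upper 16 bits: owning thread id (0 = free).  Lower 16: payload. */
    typedef struct { uint32_t entry; } hw_sem;

    bool hw_sem_write(hw_sem *s, uint16_t thread_id, uint16_t value)
    {
        uint16_t owner = (uint16_t)(s->entry >> 16);
        if (owner != 0 && owner != thread_id)
            return false;                  /* held by someone else */
        s->entry = ((uint32_t)thread_id << 16) | value;
        return true;
    }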

The disadvantage of this sort of solution is scaling - if your hardware  
supports 64 such semaphores, then that's all you've got.  A solution  
utilizing normal memory (such as TAS, CAS, LL/SC) lets user code make as  
many semaphores as it wants.  But you can always use the hardware ones  
to implement more software semaphores indirectly.
