timestamp in ms and 64-bit counter

- David Brown
  
posted
4 years ago

Thu, Feb 13, 2020 8:57 AM

Yes, cpus with CAS and other locked instructions (like atomic read-modify-write sequences) need bus lock signals. These are quite easy to work with from the software viewpoint, and a real PITA to implement efficiently in hardware in a multi-core system with caches. Thus you get them in architectures like x86 that are designed to be easy to program, but not in RISC systems that are designed for fast and efficient implementations.

CAS can be useful even on a single cpu, if you have multiple masters (DMA, for example). And CAS or LL/SC can be useful on a single cpu if you have pre-emptive multi-tasking and don't want to (or can't) disable interrupts.

On a small processor like yours, disabling interrupts around critical regions is almost certainly the easiest and most efficient solution.
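As a rough sketch of that approach for the 64-bit millisecond counter in the thread title: the names `irq_save_disable`, `irq_restore`, `timer_isr`, `tick_ms` and `get_tick_ms` are all made up for illustration, and the two interrupt-control functions are host-side placeholders for whatever intrinsic or inline assembly the real target provides.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the target's interrupt control. On real
   hardware these would be intrinsics or inline assembly; here they only
   track a flag so that the sketch compiles and runs on a host. */
static volatile int irq_enabled = 1;

static int irq_save_disable(void)
{
    int was_enabled = irq_enabled;
    irq_enabled = 0;
    return was_enabled;
}

static void irq_restore(int was_enabled)
{
    irq_enabled = was_enabled;
}

/* 64-bit millisecond counter, incremented only by the timer ISR. */
static volatile uint64_t tick_ms;

/* Assumed to be called from a 1 kHz timer interrupt. */
void timer_isr(void)
{
    tick_ms++;
}

/* Read the counter without tearing: on a CPU narrower than 64 bits the
   read takes several loads, so the ISR must not run in between them. */
uint64_t get_tick_ms(void)
{
    int state = irq_save_disable();
    uint64_t now = tick_ms;
    irq_restore(state);
    return now;
}
```

Saving and restoring the previous state, rather than unconditionally re-enabling, keeps `get_tick_ms` safe to call from code that already runs with interrupts off.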

(If I were making a cpu, I'd like to have a "temporary interrupt disable" counter as well as a global interrupt disable flag. I'd have an instruction to set this counter to perhaps 3 to 7 counts. That's enough time to make a CAS, or an atomic read-modify-write.)

There is a whole field of possibilities with locking, synchronisation mechanisms, and lock-free algorithms. Generally speaking, once you have one synchronisation primitive, you can emulate any others using it - but the efficiency can vary enormously.
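To illustrate building one primitive from another, here is a minimal sketch of a spinlock emulated with CAS, using C11 `<stdatomic.h>`. The type and function names (`cas_spinlock` and friends) are invented for this example, and a busy-waiting lock like this is only one of many possible constructions, not necessarily an efficient one.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative spinlock built on compare-and-swap only. */
typedef struct {
    atomic_uint locked;   /* 0 = free, 1 = held */
} cas_spinlock;

/* Try once: swap 0 -> 1; succeeds only if the lock was free. */
static inline bool cas_spinlock_try_lock(cas_spinlock *l)
{
    unsigned expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1u,
        memory_order_acquire, memory_order_relaxed);
}

/* Spin until the CAS succeeds. */
static inline void cas_spinlock_lock(cas_spinlock *l)
{
    while (!cas_spinlock_try_lock(l)) {
        /* busy-wait; a real implementation would back off or yield */
    }
}

/* Release: a plain atomic store with release ordering is enough. */
static inline void cas_spinlock_unlock(cas_spinlock *l)
{
    atomic_store_explicit(&l->locked, 0u, memory_order_release);
}
```

A lock object would be declared as `cas_spinlock l = { 0 };` and the critical region wrapped in `cas_spinlock_lock(&l)` / `cas_spinlock_unlock(&l)`.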

- George Neuner
  
posted
4 years ago

Thu, Feb 13, 2020 3:54 PM

Ditto.

I spent ~7 years in a small company as acting network admin in addition to my regular development work. I watched over a pair of NT4 servers, a dozen NT4 workstations, and a handful of Win98 machines.

The NT servers never gave any problems. They ran 24/7 and were rebooted only to replace a disk or install new software. We didn't install all the service packs, so sometimes the servers would run for more than a year without a reboot.

The workstations only rarely had problems despite being exposed to software that was being developed on them. The machines ran 24/7 - backups done after hours and on weekends. I can speak only to my own experience as a developer: my workstation took a fair amount of abuse from crashing and otherwise misbehaving software, but generally it was rock solid and would run for months without something happening that required a reboot to fix.

In my experience, W2K was a bit flaky until SP2. After that, it generally was stable.

Poster "upsidedown" (sorry, don't know your name) was right though about the NT4 service packs. In my own experience:

- the initial OS release was a bit flaky
- SP1 was stable (at least for English speakers)
- SP2 was really flaky
- SP3 was stable
- SP4 was stable
- SP5 was a bit flaky
- SP6 was stable

I have been using Windows since 3.0 (which still ran DOS underneath). I was quite happy with the reliability of NT4. I have had far more problems with "more modern" versions: XP, Win7, and now Win10.

YMMV, George

- Bernd Linsel
  
posted
4 years ago

Thu, Feb 13, 2020 6:40 PM

One should mention that, at least in ARM and MIPS architectures, LL and SC are not implemented with a global lock signal, but instead using cache snooping (for uni- and multiprocessing systems).

LL just performs a simple load and additionally locks the (L1) data cache line of that address (so that it cannot be replaced until SC or another LL).

SC checks if data in that cache line has been modified since the last LL; if so, it fails, otherwise it succeeds and writes the datum (whether write-through or write-back is dependent on the CPU cache configuration and the virtual address).

An SC instruction targeting an address that hasn't been an LL source before always fails and invalidates all LL atomic flags, so that their corresponding SCs will fail. Thus, an SC to a dummy address is exploited to implement synchronization barriers (in addition to cache sync instructions).

The possible number of concurrent LL/SC pairs depends on the CPU model: most support only 1 pending SC after an LL, while some allow up to 8 parallel LL/SC pairs (on different cache lines).

Finally, an example: emulated CAS on a MIPS32 CPU, which works independently of the number of processors in the system:

    // compare_and_swap
    // input:   a0 = unsigned *p, a1 = unsigned old, a2 = unsigned new
    // returns: v0 = 1 (success) | 0 (failure), v1 = old value from *p

        .set nomips16, nomicromips, noreorder, nomacro
    compare_and_swap:

    1:  ll    v1, 0(a0)       // load linked from a0+0 in v1
        bne   v1, a1, 9f      // if v1 != a1 (old),
                              //   branch forward to label 9
        move  v0, zero        // branch delay slot: load result 0
                              //   executed "while" taking the branch

        move  v0, a2          // load a copy of a2 (new) into v0
        sc    v0, 0(a0)       // store conditionally into a0+0
        beq   v0, zero, 1b    // if unsuccessful (v0 == 0)
                              //   retry at label 1
        nop                   // branch delay slot: nothing to do

    9:  jr    ra              // else (v0 == 1) return v0
        nop                   // jump delay slot: nothing to do

Note: The assembly example could be further optimized for speed at the cost of program space by reordering the opcodes so that the preferred case (successful CAS) executes linearly and branchless, with forward branches likely not taken and backward branches likely taken (usable branch prediction was only introduced with the MIPS R8). L1 cache latency is usually 1 clock; when executing linear code, it is hidden by prefetch and the pipeline. An L1 cache line is typically 64 bytes (16 words) wide, i.e. if the CPU supports parallel LL/SCs, their addresses must be at least 16 words apart, otherwise the SC to the address of the first LL will always fail.
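For comparison with the MIPS32 routine, a portable sketch of the same contract can be written with C11 atomics. This is an illustration and not part of the original routine: the `seen` out-parameter is an assumption introduced to carry what the assembly returns in v1, and the instruction sequence the compiler emits depends on the target (on LL/SC architectures it is typically an LL/SC retry loop much like the one shown).

```c
#include <stdatomic.h>

/* Returns 1 on success, 0 on failure. *seen receives the value that was
   read from *p, mirroring v0 and v1 of the assembly version. */
static inline int compare_and_swap(atomic_uint *p, unsigned old_val,
                                   unsigned new_val, unsigned *seen)
{
    unsigned expected = old_val;

    /* On failure, C11 writes the current value of *p into 'expected';
       on success, 'expected' still holds old_val, which is what was read. */
    int ok = atomic_compare_exchange_strong(p, &expected, new_val);

    *seen = expected;
    return ok;
}
```

The strong variant is used because the assembly retries internally when SC fails, so it only reports failure when the comparison itself fails.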

Regards, Bernd