atomic test_and_set on ADSP21020

Indeed, the 'lock' is to prevent writing to the serial port while another transmission is going on, since this would result in lost characters.

Why do you need the pointer to be volatile? Isn't it sufficient for the pointed-to structure to be volatile?

AFAIK, the volatile qualifier on a structure makes each member of the structure volatile, but in this case you have a recursive structure and I'm not sure I fully understand how the volatile qualifier is interpreted here.

[...]

This is a good hint that I missed in my implementation. Assigning q_tail->next allows the ISR to send this one as soon as the previous transmission is complete.

Since I did not have a 'next' pointer in my structure, I had to call the transmit() function from the ISR in order to pull data from the request queue. It seems to me your approach is more elegant than mine.

In a two element request queue it seems to me that q_head->next should always be equal to q_tail. Correct?

It seems to me this is your 'locking' variable. You do not allow any other transmission if the head of the queue is != NULL.

[pasted from your next post] buf->transmitted = TRUE;

I appreciate your effort to sketch out the implementation. It indeed gave me a valuable hint on how to write the transmit() function.

I think I still have to pin down a nasty bug in moving pointers around, since the program always seems to crash at the same point.

Reply to
alb

My mistake (a remnant from an early version of the code). Well spotted. Since q_tail is accessed only in the non-interrupt processes, which do not preempt each other, it does not have to be volatile.

Yes, I agree.

That is my understanding too. But note that there is a paper by Eric Eide and John Regehr saying that compilers often (well, relatively often) have bugs in their treatment of volatiles (google for "volatiles are miscompiled"). I would definitely check the compiler-generated assembly code, at least once, to verify that my compiler is doing the right thing.

I assume that by recursion you mean that the buffer_t field "next" points to another buffer_t. Since q_head is defined as a pointer to a volatile object, accesses to the "next" field of the buffer at q_head are treated as access to a volatile. Since the "next" field is not defined as a pointer *to* a volatile object, accesses to fields of the buffer that q_head->next points to are not treated as volatile accesses, when the code uses the form q_head->next->Xxx. As I understand it.
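This distinction can be illustrated with a minimal compilable sketch (the type and field names here are my assumptions modeled on the discussion, not the original code):

```c
#include <stddef.h>

/* Hypothetical buffer type modeled on the thread's discussion. */
typedef struct buffer {
    struct buffer *next;   /* plain pointer: NOT a pointer to volatile */
    int data_index;
} buffer_t;

/* q_head is a pointer to a volatile buffer_t: any access through it,
   such as q_head->next, is a volatile access. */
static volatile buffer_t *q_head = NULL;

/* But the "next" field itself is a plain "struct buffer *", so a
   second dereference like q_head->next->data_index is an ordinary,
   non-volatile access, as described above. */
int second_index(void)
{
    return q_head->next->data_index;
}
```

So the volatility "stops" one level deep: only accesses made directly through the volatile-qualified lvalue are protected.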

Yes, if you observe the state when neither output_isr() nor transmit() is running.

Yes, q_head is certainly the main controlling variable. q_head and the field q_tail->next are the only variables that are accessed concurrently by the interrupt handler and the non-interrupt processes. The code assumes (and I should have remarked on it) that reads and writes (loads and stores) of these variables are atomic in the sense that when a read and a write are concurrent, the read returns either the old or the new value of the variable, and not some half-updated value. This assumption holds for the ADSP21020, where memory read and write are wide enough for a whole pointer, but it would not hold for example on a processor with 8-bit read/write but 16-bit or wider pointers.

(In C, unfortunately, there is no "atomic" qualifier similar to the "volatile" qualifier, that could ensure at compile-time that a given variable is atomically read and written. There is only the standard type sig_atomic_t that is ensured to have atomic read/write, but this type may not be large enough to hold a pointer; it is only required to be 8 bits or larger.)
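The classic safe use of sig_atomic_t is exactly the flag case, not the pointer case; a minimal sketch (names here are hypothetical, and note that C11 later added _Atomic, which postdates this discussion):

```c
#include <signal.h>

/* A completion flag is safe to share with a handler because
   sig_atomic_t is guaranteed to be read and written atomically.
   A pointer-sized queue variable would not necessarily be. */
static volatile sig_atomic_t tx_complete = 0;

/* Stands in for the interrupt/signal handler side. */
void on_tx_complete(void) { tx_complete = 1; }

/* Main-line code polls the flag. */
int tx_is_complete(void) { return tx_complete != 0; }
```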

Note that while my code has concurrent (racing) reads and writes of some variables (q_head and q_tail->next), it avoids concurrent writes, which could lose (overwrite) data. The transmit() function only writes to q_head when it knows that transmission is not going on, and it only writes to the "transmitted" field of a buffer before the buffer is enqueued. Therefore, output_isr() cannot try to write to those variables concurrently.

A warning about ignoring this warning :-) : The C99 standard says, in paragraph 5 of the section "6.7.3 Type qualifiers", that "If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined", which might apply to this initialization of the "buf" variable. However, the same paragraph has a foot-note saying that this applies to objects that were "never actually defined as objects in the program (such as an object at a memory-mapped input/output address)", which is not the case here.

You are very welcome. But note how simple the transmit() code becomes if interrupts can be disabled or masked:

void transmit (buffer_t * volatile buf)
// Starts transmitting the buf, or enqueues it for
// later transmission.
{
   // Assume that buf->data and buf->length are set
   // by caller and that buf->length > 0.

   // Prepare the buffer for enqueuing or transmitting:
   buf->transmitted = FALSE;
   buf->next = NULL;

   prevent_output_interrupts();

   if (q_head != NULL) {
      // Transmission is going on. We can just enqueue
      // the buffer, and it will be transmitted.
      buf->data_index = 0;
      q_tail->next = buf;
   } else {
      // No transmission going on now.
      // This buf becomes the first in the queue, and
      // we start its transmission.
      q_head = buf;
      buf->data_index = 1;
      // Start transmission:
      write_to_output_port (buf->data[0]);
   }
   q_tail = buf;

   allow_output_interrupts();
}

The interrupt handler code is the same in this case.
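For completeness, here is a host-testable sketch of what the matching output_isr() might look like, reconstructed from the description in this thread (the buffer fields follow the transmit() sketch; the write_to_output_port() stub and its log are my assumptions so the code can run off-target):

```c
#include <stddef.h>

#define FALSE 0
#define TRUE  1

typedef struct buffer {
    const char *data;
    int length;
    int data_index;
    int transmitted;
    struct buffer *next;
} buffer_t;

static buffer_t *volatile q_head = NULL;   /* head of request queue */

/* Host-side stub standing in for the serial-port register write. */
static char port_log[32];
static int port_count = 0;
static void write_to_output_port(char c) { port_log[port_count++] = c; }

/* Sketch of the interrupt handler: called when the port has finished
   sending the previous character. */
void output_isr(void)
{
    buffer_t *buf = q_head;
    if (buf == NULL)
        return;                 /* spurious interrupt */

    if (buf->data_index < buf->length) {
        /* More characters left in the current buffer. */
        write_to_output_port(buf->data[buf->data_index++]);
    } else {
        /* Current buffer done: mark it and dequeue it. */
        buf->transmitted = TRUE;
        q_head = buf->next;
        if (q_head != NULL) {
            /* Start the next queued buffer. */
            q_head->data_index = 1;
            write_to_output_port(q_head->data[0]);
        }
    }
}
```

The handler only writes q_head when dequeuing, which is consistent with the no-concurrent-writes argument made earlier in the thread.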

--
Niklas Holsti
Tidorum Ltd
niklas.holsti@tidorum.fi
Reply to
Niklas Holsti

You can trace the roots of load/store back to the CDC 6600 (1964), but IMO "modern" really began in the 80's with the interlocked load/store RISC designs: MIPS2, PA-RISC, etc. All modern chips effectively are a load/store architecture internally even if externally their programmer visible ISA presents complex addressing modes. But pinning down when the mass market adopted load/store is difficult because it happened in stages.

The i486 and the 68040 could be called "mostly load/store" as each had only a couple of (internally) complex modes. The Pentium Pro was a true load/store design, but probably wasn't popular enough to count ... so call the Pentium II the first one for Intel. The 88K was a market failure, so call POWER the first for Motorola.

I think a case could be made that ARM really was the first widely used load/store design, but until VMM became common in designs, the issue of interruptible vs restartable instructions didn't much matter.

String instructions are a special case. They really are a loop of microinstructions which can be interrupted between iterations.

[I know not all chips technically execute microcode ... what I'm referring to as "microinstructions" may simply be states in the execution sequencer.]

This is where terminology becomes troublesome 8( ... what exactly does it mean to be "restartable" vs "interruptible"? Everyone has a slightly different understanding based on which vendors' documentation they have been reading.

FWIW: I consider that an instruction is "interruptible" if a programmer visible context switch can occur during its execution. Excepting some special cases like string operations, most ISAs specify that an interrupt logically can occur only between ISA level instructions.

"Restartable" is a different matter. I consider that an instruction is restartable if it can be halted and resumed in-progress or can be aborted and reexecuted without noticeable side effects. At the ISA level, instructions may be either resumable or reexecutable. At the microinstruction level, instructions almost always are aborted and reexecuted.

On almost all chips today, all but the very simplest ISA level instructions decompose into a sequence of microinstructions. When a microinstruction sequence representing a memory accessing instruction takes a fault, the entire sequence is rolled back to the beginning and the VMM interrupt is generated. Once the fault is cleared and the interrupt returns, the microinstruction sequence is reexecuted. Whether this is a full reexecution from the beginning or a resume from the previous point of failure is implementation dependent.

WRT software mutual exclusion algorithms, realizing that any memory access may be arbitrarily delayed is where the difference between in-order and out-of-order execution becomes important. These algorithms all depend on specific ordering of memory accesses to work properly. As long as those ordering requirements are respected, mutual exclusion will be guaranteed even in the presence of arbitrary thread delays. Special care has to be taken on OoO chips to guarantee that the memory accesses do occur in the required order.
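In later C (the C11 &lt;stdatomic.h&gt; facilities, which postdate much of this discussion), the required ordering can be requested explicitly instead of relying on in-order hardware; a minimal producer/consumer sketch with hypothetical names:

```c
#include <stdatomic.h>
#include <stdbool.h>

static int shared_data;
static atomic_bool data_ready;

void producer(int value)
{
    shared_data = value;
    /* Release store: the write to shared_data becomes visible
       before the flag does, even on an OoO machine. */
    atomic_store_explicit(&data_ready, true, memory_order_release);
}

int consumer(void)
{
    /* Acquire load: once the flag is seen, the data write is
       guaranteed to be visible too. */
    while (!atomic_load_explicit(&data_ready, memory_order_acquire))
        ;   /* busy-wait */
    return shared_data;
}
```

On an in-order single processor the same effect falls out of program order, which is why plain volatile sufficed in the single-CPU discussions above.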

Lamport's algorithm is unique because it purports to work with multiple CPUs, however it still requires in-order accesses from all participating CPUs.

Hope this ... doesn't confuse the issue more 8-) George

Reply to
George Neuner

No. If you examine them carefully, each of the algorithms relies on 2 separate flags being set and tested in a particular order. The use of 2 flags guarantees that the interruption of any operation does not compromise the integrity of the algorithm.

I agree that their operation is not particularly obvious, but these algorithms have been very well proven. They are basic knowledge in introductory CS operating-system courses, and now that many (most?) programmers are writing threaded code, IMO every programmer should be aware of them and how they work.
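One of these two-flag algorithms is Peterson's, which can be sketched in a few lines of C (this is the textbook version, not code from this thread; as discussed elsewhere here, it relies on the flag and turn accesses becoming visible in program order):

```c
#include <stdbool.h>

/* Peterson's two-process mutual exclusion: each process raises its
   own flag, yields the turn, and waits only while the other process
   is interested AND it is the other's turn. */
static volatile bool flag[2] = { false, false };
static volatile int turn = 0;

void peterson_lock(int self)
{
    int other = 1 - self;
    flag[self] = true;      /* announce interest */
    turn = other;           /* give priority to the other process */
    while (flag[other] && turn == other)
        ;                   /* busy-wait; needs in-order accesses */
}

void peterson_unlock(int self)
{
    flag[self] = false;
}
```

The order matters: setting the flag before yielding the turn is what makes an interruption at any point harmless.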

No, it only means the way is more complicated.

George

Reply to
George Neuner

Thank you very much for this description, I haven't followed the detailed HW development since the 1980's.

In the x86 family, the CPUID instruction seems to be quite a popular method of draining the pipeline.

Reply to
upsidedown

Indeed, I was wrong. Taking advantage of "in order" memory access (which can be enforced on the "out of order" machines I know) can be used. I do not really remember what exactly I considered 13 years ago... :-). I suppose the fact that expanding this to multiple tasks (say, 64) would require each task to examine all 63 other flags etc. made it less appealing to me, or something. Or the fact that I just had an easier option. But it can be made to work indeed - as far as I can see at a glance, there is no fundamental flaw in it.

Dimiter

Reply to
dp

There are other algorithms that are more efficient for multiple tasks - as you note, Dekker's is not very scalable. But in most cases, there should only be a few tasks competing for any given mutex/critical section/semaphore - you don't really want 64 different tasks all fighting for the same shared resource!

Note also that you don't actually need proper in-order memory access here (as enforced by things like the eieio or msync instructions on a PPC, or by memory area policies) - you only need the /appearance/ of in-order memory access. So if you are talking about multiple tasks on the same processor, you don't need anything beyond "volatile" accesses - even if the reads and writes are cached or buffered in some way, the processor will still view them in the same order.

Reply to
David Brown

(I haven't studied the code here - this is just general comments.)

The paper looks mainly at complex uses of "volatile", especially in connection with "sequence points" and with poorly specified areas of C. An example would be the expression "(v + v)", where "v" is declared volatile - should the compiler read "v" twice and add the results, can it simply read "v" once and double the result, or perhaps read "v" twice and double one of the results? The C specifications are very unclear on such matters, and compiler writers differ in their interpretations - as do programmers writing the code. I am not convinced that E&R are correct in all /their/ interpretations either. All we can be sure of is that with complex expressions using volatiles, we can't all be right - so such code should be avoided.

When using "volatile", the key to correct coding is to realise there is no such thing as volatile data, or volatile variables. There are only volatile /accesses/. A "volatile variable" is nothing more than an ordinary variable with a note to the compiler that all normal accesses should be "volatile". Think about when you want your /reads/ and /writes/ to be volatile, and you will get it right. I often don't even bother declaring the variables to be volatile - I use macros to enforce volatile reads or writes as needed.

And keep all volatile-related code simple - then the compiler will get it right too. Rather than writing "v = (v + v)", write "int v1 = v; int v2 = v; v = v1 + v2" to make explicit exactly the functionality you want.
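The access-macro approach mentioned above can be sketched as follows (a minimal example, assuming a GCC/Clang-style compiler since __typeof__ is an extension; the names are hypothetical):

```c
#include <stdint.h>

/* Force a single volatile read or write of an ordinary object.
   The variable itself need not be declared volatile: only these
   particular accesses are treated as volatile. */
#define VOLATILE_READ(x)     (*(volatile __typeof__(x) *)&(x))
#define VOLATILE_WRITE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static uint32_t status;   /* an ordinary, non-volatile variable */

uint32_t get_status(void)   { return VOLATILE_READ(status); }
void set_status(uint32_t v) { VOLATILE_WRITE(status, v); }
```

This makes the volatile reads and writes explicit at each use site, which is exactly the "think about the accesses, not the data" point made above.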

Reply to
David Brown

In fact, none of the software only algorithms scale particularly well to large numbers of tasks.

Yes, with a caveat which may make no difference to anyone here but is mentioned for completeness.

For SMT ("hyperthreaded") chips, each physical core behaves as 2 or more logical cores executing separate instruction streams. But unlike software multitasking, in SMT the logical cores really do execute *simultaneously* (assuming sufficient chip resources are available). Exactly when and in what order writes become visible to all logical cores is implementation dependent. Code running on such a chip may have to obey the same restrictions as for a true multiprocessor even when there is only one physical core present.

George

Reply to
George Neuner

Indeed - SMT has to be treated like true SMP as far as these things are concerned.

Reply to
David Brown

Interesting link. Thanks

Bye Jack

--
Yoda of Borg am I! Assimilated shall you be! Futile resistance is, hmm?
Reply to
Jack
