I'm mostly responding to your point about "not a periodic task" when discussing GPIO delays. None of what I'm saying should either encourage you or discourage you from weighing all your other concerns. But I thought I might point out one way to solve one problem to see if that helps.
You pointed out the basic idea of having a regular timer (you suggested a periodic one at 1ms.) That's reasonable and you can use it for a lot of things (adjust the rate per your minimum delay need, of course.) For some things you may have to deal with, you could set up state machines attached to the timer which would get a small bit of cpu for a moment each tick (or every so many ticks.) An example of something somewhat complex that works okay this way would be reading and writing a serial EEPROM "in the background." Normally, the state machine would just stay in a quiescent state of "do nothing" until some data needs to be transferred. Then the state machine would transition out of the quiescent state and begin the transfer, moving from state to state at the timer rate. (Assuming you didn't have a hardware peripheral for all this, of course, and had to bit-bang the transfer.)
However, you also said that "the toggling of a pin after a certain delay ... [is] not a periodic task." Well, counting down the delay counter can be thought of as a periodic task. The problem then remains that you might have a lot of counters to count down. A solution, so that you only have one to worry about _ever_, is to use a delta queue for such timing and to re-insert the desired routine given the delay to the next event.
For example, suppose you wanted to toggle a particular pin so that it was high for 50ms and low for 150ms. (Assume your 1ms timer event exists and that delta queue insertion is already implemented [to be discussed in a moment.]) The code might look like:
    void SetPinHigh( void ) {
        // some code needed to set the pin high
        insertdq( SetPinLow, 50 );
    }

    void SetPinLow( void ) {
        // some code needed to set the pin low
        insertdq( SetPinHigh, 150 );
    }
In the above case, you are responsible for queueing up the next operation. But that is pretty easy, really. The number there just specifies the number of timer ticks to wait out "from now." The triggered events will appear to be consistently accurate if the delta queue handler does a reasonably fast job of calling code when it is supposed to and the operation isn't itself slow, and/or if the timer is slow by comparison. [If the operation is slow, just re-insert the process at the start of the procedure instead of the end of it.]
Okay. So what is a delta queue? Assume the queue is empty. When you try to insert a new entry (function pointer and time), it is inserted immediately into the queue with the delay time as given. The timer then decrements that counter and, when the counter reaches zero, the timer code removes the entry from the queue and executes the function. For the case where a second entry is to be added before the first is executed, an example might help:
    insertdq( P1, 150 );
    insertdq( P2, 250 );
    insertdq( P3, 100 );
    insertdq( P4, 200 );
Assume these happen back-to-back in short order, between 1ms timer ticks, and assume the queue is empty to start. The queue will look like the following:
After the 1st insert:
    --> [ P1, 150 ]   ; P1 wants to wait out 150 timer events

After the 2nd insert:
    --> [ P1, 150 ]   ; P1 still wants to wait out 150 timer events
        [ P2, 100 ]   ; P2 will then wait another 100 timer events

After the 3rd insert:
    --> [ P3, 100 ]   ; P3 is to wait out 100 timer events
        [ P1,  50 ]   ; P1 will now have to wait out only 50 more
        [ P2, 100 ]   ; P2 waits yet another 100 timer events

After the 4th insert:
    --> [ P3, 100 ]   ; P3 is to wait out 100 timer events
        [ P1,  50 ]   ; P1 will wait out an additional 50 (150 total)
        [ P4,  50 ]   ; P4 will wait yet another 50 (200 total)
        [ P2,  50 ]   ; P2 will wait still another 50 (250 total)
What happens in the insert code is that the time is compared to the current entry in the queue (starting at the first.) If the time of the process to be inserted is less than or equal to the current entry's time, then the process and its time are inserted at that point (before the current entry) and the time of the newly inserted process is subtracted from what was the 'current' entry, which had the greater or equal value. If the time of the process to be inserted is greater, though, then the time value of the current queue entry is subtracted from the inserting entry's time, the current entry is advanced to the next entry, and the examination continues as already described. Obviously, if you reach the end of the queue, you just add the entry with whatever time it has left.
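To make that walk concrete, here is a rough sketch of the insertion in C. Only insertdq and P1..P4 come from the discussion above; dq_entry, dq_pool, dq_head, DQ_MAX, dq_example and the int return value are names and choices I made up for illustration. It assumes a small fixed pool of entries where a slot counts as free while its function pointer is NULL. Treat it as one way to do it, not the only one.

```c
#include <assert.h>
#include <stddef.h>

#define DQ_MAX 8                      /* hypothetical pool size */

typedef void (*dq_fn)(void);

struct dq_entry {
    dq_fn            fn;              /* routine to call when the delay expires */
    unsigned         delta;           /* ticks to wait AFTER the entry ahead of it */
    struct dq_entry *next;
};

static struct dq_entry  dq_pool[DQ_MAX];  /* zero-initialised: all slots free */
static struct dq_entry *dq_head;          /* first entry due to expire */

/* Returns 0 on success, -1 if the pool has no free slot (assumed convention). */
int insertdq(dq_fn fn, unsigned ticks)
{
    struct dq_entry *e = NULL, **pp;
    size_t i;

    assert(fn != NULL);
    for (i = 0; i < DQ_MAX; i++)          /* a slot is free while fn is NULL */
        if (dq_pool[i].fn == NULL) { e = &dq_pool[i]; break; }
    if (e == NULL)
        return -1;

    /* While the new time is greater than the current entry's time,
       subtract that entry's time and move on to the next entry. */
    for (pp = &dq_head; *pp != NULL && ticks > (*pp)->delta; pp = &(*pp)->next)
        ticks -= (*pp)->delta;

    /* Insert before the current entry (or at the end of the queue). */
    e->fn = fn;
    e->delta = ticks;
    e->next = *pp;
    if (e->next != NULL)
        e->next->delta -= ticks;          /* the follower now waits that much less */
    *pp = e;
    return 0;
}

/* Placeholder routines standing in for P1..P4 from the example. */
static void P1(void) {}
static void P2(void) {}
static void P3(void) {}
static void P4(void) {}

/* The four back-to-back inserts from the example. */
void dq_example(void)
{
    insertdq(P1, 150);
    insertdq(P2, 250);
    insertdq(P3, 100);
    insertdq(P4, 200);
}
```

If insertdq can be called from both the main code and the timer interrupt, you would also want to keep the timer event from firing while the list is being relinked. How you do that depends on your part, so I left it out.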
If you use this method, your timer code will never have to decrement more than one timer -- the one belonging to the queue entry at the top of the queue. All of the other entries are automatically decremented by doing that, as the times listed for them in the queue are simply how much "more time" each needs beyond the entry just ahead of it.
If the timer event handler encounters a situation where the top entry on the queue is decremented to zero AND where one or more entries after it also have their times listed as zero, then the timer event handler should call the top entry's code and when that returns then call the next entry, etc., until all 'zero' entries have been both run and removed from the queue. It stops when it finds a queue entry with a non-zero counter value. At that point, the count down can continue at the next timer event.
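A sketch of what that timer event handler might look like in C follows. The names dq_entry, dq_head, dq_tick, task_a and task_b are hypothetical, and the entry layout (function pointer, delta, next pointer, with a NULL function pointer marking a free slot) is just an assumption so the fragment compiles on its own. The one detail I'd point out is that each entry is unlinked before its function is called, so that the function can safely re-insert itself.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*dq_fn)(void);

struct dq_entry {                 /* assumed layout, see the note above */
    dq_fn            fn;
    unsigned         delta;       /* ticks to wait after the entry ahead of it */
    struct dq_entry *next;
};

static struct dq_entry *dq_head;  /* first entry due to expire */

/* Call this once per timer event (every 1ms in the discussion). */
void dq_tick(void)
{
    if (dq_head == NULL)
        return;                   /* nothing queued, nothing to count down */

    if (dq_head->delta > 0)
        dq_head->delta--;         /* only the top entry is ever decremented */

    /* Run the top entry and every following entry whose time is zero. */
    while (dq_head != NULL && dq_head->delta == 0) {
        struct dq_entry *e = dq_head;
        dq_fn fn = e->fn;

        assert(fn != NULL);
        dq_head = e->next;        /* unlink first, so fn may re-insert itself */
        e->fn = NULL;             /* mark the slot free (assumed convention) */
        fn();
    }
}

/* Hypothetical tasks that just count how often they were run. */
int ran_a, ran_b;
void task_a(void) { ran_a++; }
void task_b(void) { ran_b++; }
```

One caution with this arrangement: a routine that always re-inserts itself with a delay of zero would be run again in the same loop and the handler would never return, so keep re-insert delays at one tick or more.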
I think you can see the value, and perhaps some of the cautions in the above method. But it is simple enough. If you also use something more complicated than just toggling a port pin, you can set up quite a few 'state machines', each hooked to the timer event with an appropriate delay queued up for the next time they need a little cpu. The main code for a state machine might have a static 'state' variable used to index through an array of state machine functions, calling each one in turn and allowing each state to return the value of the next state to use. This main code would then re-schedule another execution by inserting itself after a fixed delay, if you want.
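As a rough illustration of that last idea, here is a small C sketch of a state machine whose main routine re-schedules itself. Everything in it is made up for the example: the states, eeprom_task, bits_left and request_pending are hypothetical, and insertdq here is only a stand-in that records the request so the fragment compiles by itself. A real one would put the routine on the delta queue.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*dq_fn)(void);

/* Stand-in for the real delta queue insert: it only records the request. */
static dq_fn    last_fn;
static unsigned last_ticks;
static int insertdq(dq_fn fn, unsigned ticks)
{
    last_fn = fn;
    last_ticks = ticks;
    return 0;
}

enum { ST_IDLE, ST_START, ST_XFER, ST_COUNT };

static int bits_left;             /* bits still to be shifted out */
static int request_pending;       /* set by the main code to start a transfer */

/* Each state does a little work and returns the next state to use. */
static int st_idle(void)  { return request_pending ? ST_START : ST_IDLE; }
static int st_start(void) { request_pending = 0; bits_left = 8; return ST_XFER; }
static int st_xfer(void)
{
    /* bit-bang one bit here */
    return --bits_left > 0 ? ST_XFER : ST_IDLE;
}

static int (*const states[ST_COUNT])(void) = { st_idle, st_start, st_xfer };

/* Main code for the state machine: run one state, then queue up again. */
void eeprom_task(void)
{
    static int state = ST_IDLE;   /* quiescent until a request shows up */

    assert(state >= 0 && state < ST_COUNT);
    state = states[state]();
    insertdq(eeprom_task, 1);     /* fixed delay of one tick, as an example */
}
```

You would start it off with a single call to eeprom_task() (or one insertdq of it) at initialization, and after that it keeps itself going.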
You could also consider looking up concepts such as thunks, iterators of the CLU variety, and cooperative coroutines. It's not hard to write up a little bit of code to support a thread switch.
One of the nice advantages of doing the little bit of coding you need, yourself, is that you know what you have and understand it well. Hauling in an operating system written for general purpose use can be a fair learning curve, which may easily compare to the work involved in writing your own delta queue code or thread switch function. Worth considering.
Jon