Resource revocation


Don Y 12 years ago

Hi,

What's the current "best practices" regarding asynchronous notifications (in a multithreaded environment)?

I have a system wherein "tasks" (omit a formal definition) request resources from a service that meters out their use, waiting until the resource has been granted to them "officially" (in some cases, this is all trust based).

When done, they surrender the resource to the service where it can be reused by other consumers.

But, there are times when the service must revoke a granted use of a particular resource. In some cases, it "asks" for the resource back (giving the current consumer time to tidy up before releasing it). In other cases, it just *seizes* the resource -- and notifies the consumer after-the-fact.

Presently, I use signals to notify the consumer when this sort of thing is happening.

But, my personal experience is such that folks have problems writing this sort of code. *Remembering* that they have to register a handler for the signal; remembering that said handler can be invoked at any time (including immediately after it has been registered); etc.

Is there a new "safer" way of implementing these types of notifications?

Thx,

--don


Richard Damon 12 years ago

My first thought about your seizing mechanism is that you are going to need to either kill the task you granted the permission to, or tolerate it using the resource after you have seized it (perhaps with some error occurring on the use). The problem being that there will always be a point in time between it testing that it still has the right to use it, and the actual access, unless of course you need to fill your code with critical sections for EVERY use of the resource.


Don Y 12 years ago

Or, let the task *think* it is still using the resource even though its actions on/with that resource aren't having the effects the task *thinks* they are having!

I.e., it depends on what that "resource" is and how it is accessed.

If, for example, it is a (shared) communication channel, the task can *think* it still has (exclusive?) use of the channel but the mechanism that actually pushes messages onto/from the channel is actually silently discarding everything to/from the task when it no longer *owns* the resources (while the actions of the NEW owner are now proceeding properly).

If the resource is a piece of virtual memory, the OS can allow accesses to continue (without faulting) and just ignore all the writes attempted and return garbage for reads.

Or, these "uses while not currently owned" could result in errors reported to the user -- that may or may not be expressly indicative of the fact that the resource is no longer owned. E.g., "read failure" and the task scratches its head wondering if the medium is faulty or .

There's also the (un-illustrated) example of the task actually being *allowed* to continue using the resource under the assumption that it will, in fact, "soon" honor the *request* that it release the resource. ("Gimme a minute...")

Remember, a "resource" is anything that the system *decides* is a resource. And, asynchronous notifications can originate for a variety of *other* reasons in addition to resource revocation.

The "resource" that prompted my question is an abstract resource with very loose constraints -- and, no real downside to having it revoked "in use". I was coding an algorithm for an irrigation "zone" wherein the process responsible for the zone *requests* a certain "water flow rate" (water being the scarce resource and it is infinitely divisible).

The task can't begin monitoring the amount of water dispensed for its needs until it knows that water *is* being dispensed for its needs. ("OK, 1 gallon per minute so I need to wait 14.34 minutes to ensure the required 14.34 gallons are dispensed. THEN, I can turn off the water and let the system use it for some other purpose. Shower, anyone??")

Delays in acquiring the resource have consequences (i.e., the task doesn't just want to *block* awaiting it) since an indefinite delay means the zone never gets serviced (things die, etc.). So, the task (or a surrogate operating on its behalf) needs to be able to watch (and worry!) when a request is lingering, unfulfilled.

Similarly, if something "more important" needs that resource, the task needs to know that it has been "reappropriated" and take remedial actions ("Hmmm... I was able to dispense 6.2 gallons. *If* I can reacquire the resource, soon, I can just dispense another 8.14 gallons and I'm golden! OTOH, if I have to wait hours or days to reacquire it, I may have to start over again. *Or*, signal a failure as the plants relying on that water have probably died from dehydration!")

In other cases, the resource may be a computational one. I.e., having access to CPU time/memory on another node. If that resource is revoked, the workload manager has to find some other node to satisfy the request *and* figure out what portion of the operation previously scheduled on that node must be recreated, etc.

[My point here is: different tasks tend to need different recovery strategies.]

Exactly! Hence the need for an asynchronous notification mechanism. E.g. a "signal". So, when the signal is sent, a thread processes that notification *before* the task actually is allowed to execute another instruction (that handler could kill the task, suspend it until the handler can reacquire ownership of the resource, or set a flag that the task can examine at some convenient point in its process, etc.).

I.e., the remedy tends to be defined by the use (UNuse)?

--don


Paul Rubin 12 years ago

This type of program typically doesn't compute very much. It's either acting on some message, or sleeping til the next message arrives.

I think overall it's preferable to not confuse the issue by moving stuff around between processes without the processes knowing. The resource should be under control of one process, and relinquished by 1) sending a message asking for the process to give it back; or 2) killing the process, preferably with automatic cleanup actions when the process dies.

OK, the usual sense of signals that I thought was reflected in your code sample, is basically delivering a simulated hardware interrupt to a running task, so it needs locks, critical sections and all that messy stuff.

In the case of your lawn sprinkler application I think that is fine. IMHO in this day and age, it's only worth dealing with low-level approaches if you're doing hard-real-time or have to run on 10-cent processors or something like that.

Yeah, the way I'm imagining, I wouldn't do it that way, as described above.

In this case I'd say just kill the task, so it can restart in a completely known state. Admittedly I am somewhat under the influence of Erlang right now, and this is a core tenet of Erlang philosophy.

The task would periodically post updates saying how far it has gotten (how much water has been dispensed, or whatever). When it's killed and restarts, it can take up where it left off.
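A minimal sketch of that checkpoint-and-resume idea, in Python. The JSON layout and the helper names (`save_progress`, `load_progress`, `remaining`) are hypothetical, and it assumes some storage that outlives the task is available; the figures in the usage note are the ones from the irrigation example earlier in the thread.

```python
import json
import os


def save_progress(path, dispensed_gal):
    """Record how much water has been dispensed so far.

    Writes to a temporary file and renames it, so a task killed
    mid-write never leaves a half-written checkpoint behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"dispensed_gal": dispensed_gal}, f)
    os.replace(tmp, path)  # rename over the old checkpoint


def load_progress(path):
    """Return the last recorded amount, or 0.0 if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)["dispensed_gal"]
    except FileNotFoundError:
        return 0.0


def remaining(target_gal, path):
    """How much is still owed after a restart (never negative)."""
    return max(0.0, target_gal - load_progress(path))
```

With 6.2 of the required 14.34 gallons recorded before the task was killed, `remaining(14.34, path)` reports roughly 8.14 gallons still to dispense after the restart.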

I don't understand this example--what would the "resource" be? In general terms I'd say kill the process and let the crash handler park the tool in a safe position. But in this machining example, I'm imagining some kind of low level PID loop that would keep checking a flag to know if it had to bail out. In either case, the idea is to get to a place where you can restart later.

The only reason for such "rope" is to push the limits of the hardware because more modular approaches are too slow or whatever. Computers are ridiculously powerful these days, so unless you're doing something extremely demanding (basically something that would have been impossible or economically unfeasible 10 years ago), seeking "rope" is probably a sign of doing something wrong.

Right, they are messy and it's preferable to avoid them. E.g. by using message passing instead of signals.

You should probably look into model-checking tools if you absolutely have to pursue this approach. Dawson Engler's papers on using such tools to find crash bugs in Unix file systems might be of interest.

Actually Tom Hawkins' "ImProve" program might be of some use to check that you got all your watering stuff right, in terms of turning correct combinations of valves on and off etc., if you're interested in experimenting with high-tech approaches. I haven't used it but have been interested in it for a while.

I did play around with Atom (a hard realtime DSL written by the same guy) and I think the approach is pretty powerful.

Even someone capable of running with scissors without stabbing himself every time shouldn't do it outside of some dire emergency.

In this watering application you have (presumably) rather loose timing constraints, and roughly unlimited CPU resources. So I think you can do fine using safe, simple methods instead of running with scissors.


Don Y 12 years ago

Actually, the irrigation program "computes" a fair amount, given how "slow" it is expected to operate. E.g., it has to identify the "plants" ("water consumers") that are serviced by its zone; identify their individual water needs; identify the "emitters" that service their respective root systems and the flow rates associated with each of those emitters; track the water available to them "recently" (rainfall, water from other nearby irrigation zones that happen to overlap their root systems, supplemental water "injected" by the user manually; etc.); the amount of sunshine falling on them (some might be shaded during some seasons while others are in "full/reflected sun") as well as the desiccating effects of the wind (again, noting the individual "exposure" to wind from particular directions); etc.

And, it has to continue to update these data while waiting for the "water resource". Or, waiting for its *return* (if it has been revoked). Along the way, it may have to escalate its request as the *hard* deadline approaches. ("Hey, if you don't let me water these things soon, they will die -- in which case, there is no point in my continuing to execute as a task!")

[I.e., this is a soft realtime problem layered inside a hard realtime shell. There *is* a point where a missed deadline results in a failure.]

The question becomes one of whether you inform the process *before* you take action (or, even let the process itself "relinquish" the resource) or, if you inform the process after-the-fact. (or, if you just kill the process and don't even worry about informing it! :> )

If you "request" the process to relinquish the resource, then the system (i.e., all other consumers of that resource) is at the mercy of the developer who coded that application. If he fails to relinquish it (perhaps even failing to notice the notification!) or intentionally delays relinquishing it (like a youngster trying to postpone bed-time), then other consumers suffer.

I.e., if everyone adopts that sort of attitude, then you've got a sluggish system.

And, you *still* would need a kill switch so a stubborn consumer could be forced to relinquish the resource, regardless of his wishes.

[Note this all ignores the timeliness issues involved. How *quickly* must a task relinquish a resource when commanded? What happens if the task isn't a high enough priority to even be granted a significant slice of the CPU to process that request?]

I've taken the other approach. A process owns (permanently?) the resources. It then doles them out, on request, to other consumers. When it wants/needs to give the resource to another consumer, it does so -- and notifies the previous owner that it has LOST the resource. (of course, it can also *request* a current consumer to release a resource... but, it has to be capable of withdrawing them in the presence of uncooperative consumers!)

This allows me to ensure "policy" over how a resource is managed is centralized in one place: the (permanent) "owner" of the resource.
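One way to picture this centralized policy is the sketch below (Python). It is not the code under discussion: the `WaterService` class, the numeric priorities, and the `on_revoked` callback standing in for the signal are all assumptions made for illustration. The owner seizes flow from lower-priority holders when a more important request arrives and tells each one only after the fact.

```python
class WaterService:
    """Permanent owner of a divisible resource (flow rate, gallons/min)."""

    def __init__(self, capacity_gpm):
        self.capacity_gpm = capacity_gpm
        # consumer name -> (priority, granted gpm, on_revoked callback)
        self.grants = {}

    def available(self):
        return self.capacity_gpm - sum(g[1] for g in self.grants.values())

    def request(self, name, gpm, priority, on_revoked):
        """Grant gpm to name, seizing from lower-priority holders if needed.

        Returns True if granted. If the request cannot be met even by
        seizing, nothing is touched and False is returned.
        """
        # Only strictly lower-priority holders may be displaced,
        # least important first.
        victims = sorted(
            (n for n, g in self.grants.items() if g[0] < priority),
            key=lambda n: self.grants[n][0],
        )
        reclaimable = sum(self.grants[n][1] for n in victims)
        if self.available() + reclaimable < gpm:
            return False
        for n in victims:
            if self.available() >= gpm:
                break
            _, seized_gpm, callback = self.grants.pop(n)
            callback(n, seized_gpm)  # after-the-fact notification
        self.grants[name] = (priority, gpm, on_revoked)
        return True

    def release(self, name):
        """Voluntary surrender by the consumer."""
        self.grants.pop(name, None)
```

Because the seizure decision lives entirely in `request`, an individual zone needs no knowledge of who displaced it or why; it only has to cope with the callback.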

Ponder:

We don't notify a task that we are going to take the *CPU* away from it (timeslice) and expect the task to respond, "OK, you can have it". Instead, we just *take* the processor and give it to whatever *we* (scheduler) decide is the most important use for that resource. There are no guarantees that the interrupted task will ever regain the CPU. Nor any notification that it has *lost* the CPU!

Yet, this is something we are comfortable with...

Yes, the signal *is* delivered as a simulated hardware interrupt (targeted towards that task). But, it is passed to the "task" as a message from the task (the one who raises the signal) *through* the kernel and to the scheduler as it (eventually) prepares to resume that signalled task. (I.e., I need to be able to raise a signal on one physical processor and have the task that it affects reside on *another* physical processor).

I chose the irrigation example because it avoids the issues of timescale. So, we're not distracted by the efficiency of the delivery mechanism, etc.

But, just because it's "slow" and "computationally simple" (compared to rendering video), that doesn't make it any less of a concern. E.g., if there are only a few KIPs of spare capacity in the processor (since processors do more than just control water valves), then this can be just as constrained as trying to implement a mouse in 200 bytes of FLASH...

I had a friend who coded like that. Spawn hundreds of processes... then, kill off the ones he decided weren't important. :-/

Yes. All of these approaches are just juggling "responsibilities". E.g., in my case, a task only checkpoints when it knows the resource has been revoked *and* the nature of the task requires remembering state (vs. simply restarting from scratch). If you require the task to *periodically* checkpoint itself, then it has to come to some sort of balance between spending all of its time checkpointing (so it has very fine-grained resumption capability) vs. very little checkpointing (so it doesn't waste its time keeping track of where it was).

[Recall, the checkpointing must be done in a medium that is more persistent than the task's context -- since the task's execution environment can be torn down (completely) at any time. So, now the system must provide a service for this -- and, one that is sufficiently lightweight that invoking it OFTEN doesn't affect performance... or, the performance of other tasks.]

(remember, you don't necessarily have a big disk sitting there or scads of RAM... how much is a task granted access to? What if it requires *more* to preserve its "significant" state?)

Maybe a cutting tool. Maybe a power source because the next operation puts significant demands on the available power supply (which is shared by other machines in the facility). Maybe a coolant system (what happens if you withdraw the coolant before it has had a chance to achieve its intended goal... is the "piece" ruined?)

The point is, you may not be able to "resume" the operation. You've just made an expensive piece of "scrap". And, even if this is unavoidable, you have to *know* that it is scrap and must be disposed of... not "resumed".

Dealing with consumer markets, *everything* is economically unfeasible! (unless you are catering to consumers who are not cost conscious). E.g., I would imagine most irrigation controllers are implemented with little PICs -- because their approach to the problem is much more naive: turn the water on for X minutes, then advance to the next zone. They don't look at the *needs* of their "consumers" (plants/landscape). Nor do they worry about the availability of the resource they are using. I.e., only a single zone is active at a time (often, though not necessarily) and they assume someone else has ensured an adequate supply *to* the valve manifold.

Here, for example, if I turned on all of the irrigation valves simultaneously, several things would happen:

- household water pressure would drop noticeably

- the implied flow rates of each irrigation "emitter" would not be correct (because the water pressure in the irrigation system would have fallen below nominal)

- some of the irrigation loads probably wouldn't "work" at all (i.e., not having enough static head to meet the required rise)

But, no particular "zone" should have to worry about this. It's a system constraint. One that should be enforced by whatever doles out the "water resource". If someone decides to "run a bath", the individual irrigation zones shouldn't need to know that their water use will interfere with that activity. (OTOH, something "higher" in the system design should be able to enforce that policy on the irrigation system)

But messages only "exist" when they are being examined. If, for example, you issue a query to the database, you either assume the query happens "fast enough" (whatever that means in your application) *or* spawn a separate thread to process that query so you can keep watching for messages. You now have yet another synchronization issue to contend with, etc.

I've avoided the "combinations" issue entirely (at least in the "zone controller tasks"). An individual zone deals only with the needs of its "consumers" (plants). It knows that it may not have access to the resources that it requires at all times (water, "information", etc.) So, it knows how to deal with these deficits on its own.

Similarly, the "water controller" only has to worry about the needs of *its* consumers (several of which are the individual irrigation zone controllers). And, how it can recover from those cases where it can not supply the needs of those consumers (loss of water supply, master valve failure, etc.)

It boils down to what you consider an "emergency". And, how much "spare capacity" you have available. E.g., if you can afford to *walk* from place to place with those scissors, then there is no need to take on the risk of running!

OTOH, if you don't have the luxury of being able to take a leisurely stroll with them, then you either *break* (fail to meet your goals) or you learn to run, safely!

Again, irrigation is a trivial example. Imagine, instead, the resources are physical processors and you are rendering video in real time for distribution over the network. If one of those resources becomes unavailable (crashes or is repurposed to put out something more important than watching TV), how much spare capacity do you design into the system so you can *leisurely* go about recovering? (Remember, if the next frame isn't there in a few tens of milliseconds, the user will perceive a visual artifact in the rendered image!)

I.e., I would like to find *one* way of dealing with this sort of thing instead of one way for "fast" things and another for "slow" things, etc. (because that leaves "fast" and "slow" up to debate and requires developers to appreciate the differences)

--don


Paul Rubin 12 years ago

That all seems like a trivial amount of computation.

Hard deadlines in realtime programming usually mean microseconds or so. This plant watering stuff isn't even soft realtime (where you generally want responses within milliseconds but are allowed to miss occasionally).

Sounds like you might want a two-step approach: 1) ask the process to give the resource back; 2) if that doesn't work within a reasonable timeout, kill the process. I'd really stay away from this notion of reassigning the resource while the process is trying to use it and doesn't know what has happened.
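A minimal sketch of that ask-then-kill sequence in Python, using threading events in place of real messages. `Grant`, `reclaim`, and the `kill` callable are hypothetical names; how a task is actually killed (and cleaned up after) depends on the OS and is left to the caller here.

```python
import threading


class Grant:
    """State shared between the service and one consumer of a resource."""

    def __init__(self):
        # Set by the service: "please give it back".
        self.release_requested = threading.Event()
        # Set by the consumer once it has tidied up and let go.
        self.released = threading.Event()
        # Set by the service if it ran out of patience.
        self.revoked = False


def reclaim(grant, grace_s, kill):
    """Step 1: ask. Step 2: if nothing comes back within grace_s, kill.

    Returns "released" if the consumer cooperated in time,
    otherwise "killed" (after calling kill()).
    """
    grant.release_requested.set()
    if grant.released.wait(timeout=grace_s):
        return "released"
    grant.revoked = True
    kill()
    return "killed"
```

A cooperative consumer waits on (or polls) `release_requested` and sets `released` when done; one that never does is dealt with by the timeout, so other consumers are delayed by at most `grace_s`.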

If you have hostile applications in your system, you have a completely different design problem. In a normal system, just have a few safety timeouts and kill processes that miss them.

That is fine.

I think I'd use an extra protocol step so that the process can give back the resource gracefully. If you've already decided you want to do something different (and in my opinion more error-prone), I don't understand why you're asking for advice.

That is different. Tasks that get preempted in normal OS's usually don't even know that anything has happened.

I don't think they even make processors slow enough for a few KIPs to matter for something like this.

Processes in Erlang are very cheap, and having 1000's of them is no big deal. On other systems they may cost more.

IIRC one of the first questions I asked you was what OS you were using, what language, etc. I've run Python on boards with as little as 64 meg of ram and Erlang's requirements are comparable. Of course 64MB was a lot of memory not that long ago, and I tossed out that number casually just to make you jump ;-).

If you can't, then you can't. The best you can do is to try to plan the operation to be resumable, if you think you might have to preempt it.

You were the guy trying to make patch panels on laser printers and buy solenoid valves from Wal-mart or something? I don't think you're in a consumer market making millions of devices. In your situation for a one-off thing I'd use something like a Beagleboard. If you're making millions of devices and have to squeeze pennies out of the hardware cost, you do that by spending far more up front on software development. But, even then, message passing is a reasonable approach. Traditional Forth multitaskers use a few hundred bytes and are bloody fast and can run fine on a Cortex M0 or equivalent. If you're using some simple C RTOS then it's probably comparable.

Yes, in a hard realtime system you can't necessarily use that approach and you may end up having to resort to something much more difficult. In soft realtime with these relaxed requirements, you can use simpler methods without much trouble.

Just like in everyday life, dealing with "slow" things is often (at least up front) much cheaper and less hassle than "fast" things. Consider mailing an envelope somewhere vs. paying Federal Express for same day delivery. Or using an off-the-shelf CPU board instead of making an ASIC. One of the consequences of cheap powerful SBC's (Raspberry Pi etc) is that you can use relatively resource hungry programming approaches to drastically shorten development effort (and therefore decrease cost), even for relatively low-budget embedded projects. You are freed from a lot of constraints.

It's completely sensible for the most economical programming techniques to be different than what you'd do in a resource constrained system, just as those techniques are different again than what you'd do in hardware (or Verilog). Obviously you can use constrained methods on big processors, so that lets you use the same approach everywhere, but it means you do a lot of unnecessary work.


Don Y 12 years ago

In terms of "number crunching", it's trivial. A four function calculator would suffice.

But, in terms of processor cycles, there's a lot more than meets the eye. E.g., querying the database requires issuing the RPC for the actual "SELECT", concurrently setting a timer to ensure the task doesn't "wait too long" for the reply; then, parsing the reply to examine each emitter, its flow rate, the permeability of the soil around it so you know what the water *previously* dispensed there has "done" in the time since it was dispensed; how much the plant's root systems will have taken up, how stressed *that* plant has been (wind/sun) in the days (?) since it was last watered so you understand *its* needs (and how close its effective HARD deadline is); meanwhile, querying any other emitters (possibly serviced by other zones) that have added to the moisture content in that area; then, looking at all of the plants and making a decision as to how critical the provision of water from *this* zone is at this time -- along with how large a "dose".

[BTW, this is what industrial commercial systems do to varying degrees]

[Remember, conditions are *always* changing. You can't just make a decision and sit on it until you think it is time to act on it (which is what a naive controller does)]

Just moving messages (RPC's) up and down through the network stack consumes more resources than a "conventional" irrigation controller would in a *week*!

By contrast, a PIC-based controller does:

    while (FOREVER) {
        sleep(water_interval - water_time);  // typically days!
        valve(X, ON);
        sleep(water_time);                   // typically minutes
        valve(X, OFF);
    }

No. This is a common misconception.

"Hard" and "soft" have ABSOLUTELY NO BEARING ON THE MAGNITUDE OF THE TIMES INVOLVED!

Rather, they are concerned with the shape of the value function associated with the deadline. HRT problems have a value function that "goes to zero" at the deadline. I.e., missing a HRT deadline means you might as well reset the processor and start working on something else -- there is no value left to continued work (i.e., expenditure of resource) towards the goal.

[NON-realtime problems have no "deadlines"]

By contrast, an SRT problem has a value function that decreases at and after the deadline. I.e., there is more value to getting it done BEFORE the deadline -- though there may still be value to getting it done *after* the deadline has passed! (Of course, most SRT problems are encapsulated within a "final", HARD deadline beyond which their value is inconsequential).

[Note a *system* can contain hard and soft real-time problems.]
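The two shapes can be written down as functions of completion time, as in the Python sketch below. The function names and the linear decay between the soft and hard deadlines are assumptions chosen for illustration; the definition above only requires that the soft value decreases after its deadline and that the hard value is gone once its deadline passes.

```python
def hard_value(t, deadline, value=1.0):
    """Hard real-time: full value up to the deadline, none afterwards."""
    return value if t <= deadline else 0.0


def soft_value(t, soft_deadline, hard_deadline, value=1.0):
    """Soft real-time wrapped in a final hard deadline.

    Full value until the soft deadline, then (by assumption) a linear
    decline that reaches zero at the enclosing hard deadline.
    """
    if t <= soft_deadline:
        return value
    if t >= hard_deadline:
        return 0.0
    return value * (hard_deadline - t) / (hard_deadline - soft_deadline)
```

Note that the time units never enter into it: the same functions describe a thruster burn years away or a mouse event milliseconds away.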

How far in the future a deadline is -- or, how *often* it is -- has no bearing on the HARD vs. SOFT distinction. Sending a probe to Pluto could have a deadline *years* in the future. Does that make it "soft"? Even an *abacus* onboard the spacecraft could process innumerable "instructions" in that time period! But, when it comes time for a maneuvering thruster to be engaged, it had *better* be engaged (else the spacecraft misses its orbital insertion, etc.)

Similarly, events from a mouse "wheel" can come at tens of Hz. Yet, if you miss 80% of them *completely*... . Or, if you handle them *late*, it could still be acceptable.

If you don't dispense water for the plants in a given zone *exactly* when you would like to, it's not the end of the world. For whatever reason (too busy computing pi? water resource not yet available?), a SOFT deadline is missed. But, there is still value to supplying water -- whether that's a few minutes later or a few hours! (i.e., the shape of the value function depends on the needs of the particular plants being serviced, their recent watering history, environmental factors and the value of the plants themselves! It's relatively easy to regrow wildflowers; considerably harder to regrow a fruit tree -- or ensure it doesn't shed its blossoms and, thus, lose an entire crop of fruit for this growing season!)

OTOH, there comes a point where you've simply waited too long and any attempt at watering is going to yield no results (or, even NEGATIVE results!). This is the HARD deadline that represents sink or swim -- beyond which it is silly to even compute how late you are!

If the cacti in the side yard don't get watered at 5PM today, they won't mind if it happens 5PM a *week* from today! OTOH, if the rose bushes aren't watered *twice* a day, they are toast! (unless, of course, it is winter time in which case they should be watered very INfrequently lest the roots rot!)

Again, how a process is coded can vary with the consequences of that implementation. I.e., deliver a signal and the process can be prevented from doing *anything* in the absence of the revoked resource. The signal can even cause a "message" to be delivered to the task saying "please release the resource, NOW!".

On the other hand, if you rely on "cooperation", then you have to qualify this cooperation as well as quantify it. I.e., when requested, you *must* relinquish the resource within time T. Even if your task has insufficient priority to claim use of the CPU in that period!

Many systems have to tolerate potentially hostile processes. Esp if they are designed to be "open". What's to prevent an "app" in your smartphone from taking a resource and holding onto it indefinitely? Or, an application on your PC? What do you do in the absence of a human presence to intervene and "kill" the offending task? What do other tasks do *while* this condition persists?

(What do you do when there is no "human" available to kick your system back into operation?)

I was actually hoping for a mechanism that more intuitively allowed the developer to *see* this "event" without the explicit coding that, e.g., signals require.

E.g., if you are talking of a single resource (within an app), then:

    handle = await_resource(resource_sought)
    spawn(use_resource)
    result = await_release(handle)
    if (result == I_released_it_when_I_was_finished)
        // success
    else
        // use_resource was not able to complete as expected

is a more robust coding style. Something that can easily be applied boilerplate style. Get what you need. Have something do the work while you wait for it to "finish". Then, verify that its "finishing" was truly the result of the task completing as expected vs. something else causing it to terminate.
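For concreteness, here is one way that boilerplate might look with Python threads. Everything in it is an assumption layered on the pseudocode: the `Handle` class, the `Outcome` values, and `run_with_resource` (playing the role of `spawn`) are hypothetical, and acquiring the resource in the first place is left out.

```python
import threading
from enum import Enum


class Outcome(Enum):
    FINISHED = "released by the worker when it was done"
    REVOKED = "taken back by the service"


class Handle:
    """Tracks why the use of one granted resource came to an end."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = threading.Event()
        self.revoked = threading.Event()  # the worker may poll or wait on this
        self.outcome = None

    def _finish(self, outcome):
        with self._lock:
            if self.outcome is None:  # the first cause wins
                self.outcome = outcome
                if outcome is Outcome.REVOKED:
                    self.revoked.set()
                self._done.set()

    def release(self):  # called on behalf of the worker
        self._finish(Outcome.FINISHED)

    def revoke(self):  # called by the service
        self._finish(Outcome.REVOKED)

    def await_release(self, timeout=None):
        """Block until the use ends; returns an Outcome (None on timeout)."""
        self._done.wait(timeout)
        return self.outcome


def run_with_resource(handle, work):
    """Run work(handle) in a thread and release when it returns normally."""
    def body():
        work(handle)
        handle.release()  # ignored if the service already revoked

    t = threading.Thread(target=body, daemon=True)
    t.start()
    return t
```

The caller then branches on `handle.await_release()` being `Outcome.FINISHED` or not, exactly as in the pseudocode. It inherits the same weakness: a worker that needs a second resource has to nest another handle inside itself.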

However, it falls down (gets ugly) when "use_resource" must then request some *other* resource, etc.

How's that different from pulling a resource out from under the eyes of a task? The task doesn't know anything has happened. Or, I could conceivably block the task until I was able to restore the resource to it!

I.e., we treat "CPU time" as a *different* sort of resource... And, don't seem to have any problem with that.

Similarly, we treat physical memory as a different resource (in a VM system) than "logical" memory. We ignore the fact that accessing location X might be nearly instantaneous while X+1 may take milliseconds to access (if the page containing its backing store has to be swapped in)

You are assuming the processor is *only* working on this task. Or, are accustomed to applications/systems where the system idle loop *always* gets a chance to run (i.e., when the processor is never overloaded)

Yup. In my friend's case, a task (process an inappropriate term) was a handful of bytes! Hence my caution in using terms like threads, processes, tasks, etc. I've deployed systems where an "execution unit" was as small as a couple of bytes and as large as many megabytes. So, what's "normal"?

But, "cheap" is a relative term. 1000's of processes on a machine with limited resources can be impossible. I.e., C.A.E doesn't presume you're running on a desktop, etc. How many tasks are running inside your mouse? :-/

Scale that back by an order of magnitude or two! :> Think "SoC" not "SBC". Think "several dollars" vs. "dozens of dollars". And, think scores of machines and not singletons.

Why do you assume that because I want to have a way for folks with shallow pockets to ALSO take advantage of a technology that these are the *only* people who will take advantage of that technology?

Why do people build MythTV boxes? Don't their cable providers offer DVR's? Surely, it's got to be cheaper to buy/rent a DVR THAT YOU KNOW WILL WORK than to tinker with trying to get some bits of code to work on a machine you've thrown together from scraps! Even assuming your time is "free", you presumably would want the resulting device to *work*, reliably! ("Dang! My MythTV box didn't record the programs I wanted. I guess my time server was screwed up and it didn't know that today was Friday...")

Similarly, should video recording technology ONLY be available to people who want to hack it together from odds and ends? So, if you aren't technically literate (and motivated!), you can't take advantage of that technology? Regardless of how deep your pockets are?

Rather, wouldn't you want a solution that folks with money (and no time, inclination, etc.) could purchase (subsidize!) while also providing a means by which folks with more *time* (and technical expertise) than money (or, perhaps, more *desire*) can also avail themselves of that technology?

[Well, I have no idea what you would want. But, *I* would want a solution that can be approached from each of these perspectives]

E.g., I designed my network speakers so you could implement them with a bunch of surplus PC's -- one for the server and one for each speaker/speaker-pair. (assuming you have the space for a whole PC where you would want one!)

Or, you could buy a bare board and components and assemble one for yourself -- possibly housing it in a tin can in lieu of a "real" enclosure!

Or, you could purchase one commercially -- for considerably more (as there are people wanting to make a profit in that distribution chain!).

Exactly. You don't write your app in Python. You don't expect it to have GB of RAM available and GIPs of CPU. And, you don't create an environment for others to augment/maintain the design that will lead to the system, as a whole, being flakey.

Again, hard and soft have nothing to do with how *fast* something is. I.e., how many instruction cycles it will take to execute. So, how many cycles something takes has nothing to do with the soft/hard-ness of the RT. Those things only affect the amount of *resources* available to the application.

This is fine if *all* you are doing is soft (or fast). And, if your "solution" doesn't have to change from one domain to the other as it evolves (since folks are hesitant to reengineer an entire application -- preferring, instead, to try to tweak it to death).

E.g., when I originally coded the irrigation controller (different system), it mimicked the "naive" controllers (electromechanical) that I had experience with at the time (growing up, no one "watered" the yard; The Rain did that!). Then, I replaced the sequential zone N then zone N+1 approach with a system that allowed multiple zones to be watered concurrently. This required coordination to ensure "too many" zones didn't try to operate simultaneously. Now, I want to give it more smarts *and* reflect the impact it can have on other "water consumers" here. And, since I can't predict what those uses will be [actually, this is a small lie], I need to be able to abort a watering cycle and resume it at some later time.

[The same sort of algorithm can also be applied to controlling access to other shared resources -- electricity, natural gas, etc. Don't let the air conditioner compressor kick in when you are baking -- especially if you are on a ToU tariff!]
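A minimal sketch of that kind of coordination, assuming a fixed cap on simultaneously open zones and a per-zone count of watering time still owed. The names (`NUM_ZONES`, `MAX_ACTIVE`, `zone_start`, `zone_abort`, `zones_tick`) are hypothetical, and real valve I/O and locking are left out:

```c
#include <stdbool.h>

#define NUM_ZONES  8   /* hypothetical number of zones */
#define MAX_ACTIVE 2   /* hypothetical cap on simultaneously open valves */

struct zone {
    unsigned remaining_s;  /* watering time still owed, in seconds */
    bool     running;      /* valve currently open */
};

static struct zone zones[NUM_ZONES];
static unsigned active_count;

/* Record how long a zone should be watered in this cycle. */
void zone_schedule(unsigned z, unsigned seconds)
{
    if (z < NUM_ZONES)
        zones[z].remaining_s = seconds;
}

/* Try to open a zone; refuse if the cap is reached or nothing is owed. */
bool zone_start(unsigned z)
{
    if (z >= NUM_ZONES || zones[z].running ||
        zones[z].remaining_s == 0 || active_count >= MAX_ACTIVE)
        return false;
    zones[z].running = true;
    active_count++;
    return true;
}

/* Abort a running zone. The owed time is kept, so a later
 * zone_start() resumes where the cycle left off. */
void zone_abort(unsigned z)
{
    if (z < NUM_ZONES && zones[z].running) {
        zones[z].running = false;
        active_count--;
    }
}

/* Advance one second: charge each running zone and close finished ones. */
void zones_tick(void)
{
    for (unsigned z = 0; z < NUM_ZONES; z++) {
        if (!zones[z].running)
            continue;
        if (--zones[z].remaining_s == 0) {
            zones[z].running = false;
            active_count--;
        }
    }
}
```

Because an aborted zone keeps its `remaining_s`, "resume at some later time" is just another call to `zone_start()` once a slot is free.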

I "spend resources" on making products/environments more robust and/or useable. E.g., protected memory domains so task A can't screw with task B's resources (this costs resources which translates to real money when you have to buy chip A vs chip B). RTOS's vs. MTOS's (because MTOS's don't/can't make timeliness guarantees even if they can be implemented more simply/cheaply). "Services" instead of "libraries" (because services can be more universally applied and controlled).

I.e., the goal is to use "big system" capabilities on "tiny iron". So you afford the developer with the environment most conducive to him producing a benevolent/harmless application without incurring all the cost of twiddling individual bits.

To that end, being able to provide a "template" that guides how you can craft a "robust" application -- and, what you can expect *from* the system via that template -- can make these goals much more attainable.

"Invoke this service in this manner; expect these results."

"Request a resource using this mechanism; expect to handle these situations/exceptions."

etc. But, at the same time, protecting the *system* from the developer's greed/folly!

--don

Paul Rubin 12 years ago

I didn't realize there was an SQL database in this, but if there is, the computer is big enough that it's all still trivial.

These same methods work fine in relatively small systems (say a few KB of ram) if you don't mind using low level languages like C or Forth and carefully allocating memory by hand, not having memory protection, etc (see the Mars Rover article). In the smallest 8-bit cpu's or in hardware, you may have less flexibility.

Don Y 12 years ago

No, the RDBMS is in *another* machine (note I said "RPC" and not "IPC" ) elsewhere on the network. This (and others) is just a *client*. As long as you have the resources for the protocol stack, "software" to issue the query and catch/parse the result, you can implement such a client on things as tiny as a PIC (and smaller). You just can't *do* much in terms of issuing lots of requests per unit time. Nor *accumulating* many results!

But, if you are clever/methodical about it, you can handle a virtually unlimited number of emitters, "plants", etc. and come up with *a* number that indicates how much water you should dispense "now".
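One way to read "clever/methodical" here is to fold each result row into a running total as it arrives and then discard the row, so memory use stays constant no matter how many emitters the query returns. A minimal sketch, assuming each row carries a demand in millilitres (the struct, the units and the saturating add are all illustrative assumptions, not something described in this thread):

```c
#include <stdint.h>

struct demand_acc {
    uint32_t total_ml;  /* running total of water demand, in millilitres */
    uint32_t rows;      /* number of result rows folded in so far */
};

/* Fold one result row into the total; the caller can then discard the row. */
void demand_add(struct demand_acc *acc, uint32_t row_ml)
{
    uint32_t sum = acc->total_ml + row_ml;

    if (sum < acc->total_ml)   /* unsigned wrap-around: saturate instead */
        sum = UINT32_MAX;
    acc->total_ml = sum;
    acc->rows++;
}
```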

Again, "SoC" not "SBC" (i.e., think: fraction of an MB of FLASH and tens of KB of RAM -- but nothing beyond that!). A "single chip" solution (plus I/O's).

You don't have to allocate memory by hand -- nor statically (another common misconception). Nor do you have to live without memory protection (though you probably live without backing store unless you *add* secondary storage to the device). There are lots of SoC's nowadays that can give you a full-featured environment without the *quantity* of resources you might have in a desktop, etc.

Even with 8 bitters you can give the developer the "feel" of a big machine. E.g., I had a Z180-based system (essentially 8 bits with a 1MB address space) where you would be hard pressed to know you *weren't* writing UN*X code! (aside from the dreadful slowness of a ~6MHz, 25+ year-old, 8 bit machine!)

--don

Hans-Bernhard Bröker 12 years ago

This is a very common misconception, but still just as wrong. Realtime has nothing to do whatsoever with the length of any particular interval of time. It doesn't matter if your deadline is coming every 50 nanoseconds or once a year. If there's a deadline, and it's defined as a point fixed in time, then you're doing realtime processing.

Nor is the distinction between soft and hard realtime to be found in the timescales involved, but in the gravity of consequences if you miss a deadline.

In other words, realtime is about whether there _is_ a "too late", not _when_ that might be.

Paul Rubin 12 years ago

Pedantically you and Don are correct. In practice in terms of software technique, the interval lengths actually matter.

Tom Gardner 12 years ago

It isn't pedantic, since that is the only difference between hard and soft realtime.

True, to a large extent.

But not if, for example, the processor entered a sleep mode for 364.999 days, woke up and missed the deadline. Engineers are paid to be pessimists; if you don't want that characteristic, hire a marketeer :)

Tom Gardner 12 years ago

I'm probably teaching you to suck eggs, but since this is a distributed system, has your architecture and design considered the cases of partial failure?

The classic problems are where:
- another machine silently stops processing at some level, i.e. possibly above the TCP level
- the network fails, including the subtle failures leading to what is, to all intents and purposes, a self-inflicted DoS attack
- in a high availability system:
  - the network becomes partitioned, leading to duplicate services
  - the network becomes re-joined, leading to the problem of deleting duplicate services

Tom Gardner 12 years ago

I suspect the IRS imposes just as hard a deadline as the IR does here :)

To quote one of the many variants of the joke...

GROUCHO MARX (to woman seated next to him at an elegant dinner party): Would you sleep with me for ten million dollars?

WOMAN (giggles and responds): Oh, Groucho, of course I would.

GROUCHO: How about doing it for fifteen dollars?

WOMAN (indignant): Why, what do you think I am?

GROUCHO: That's already been established. Now we're just haggling about the price.

Paul Rubin 12 years ago

Yes really. If I'm working on desktop tax preparation software, the software is useless if it can't figure out my taxes by April 15th. But to describe that as "hard realtime software" to a newsgroup of embedded developers who actually work on things like motion control is pedantic, ridiculous, or both ;-).

Don Y 12 years ago

Yes. And the consequences vary depending on the nature of the failure.

Certain services are considered "essential". Failure of one or more of them means I'm screwed. Conceivably, these services could be replicated for higher availability -- that's for someone else to worry about! (have to put design limits *somewhere*! :> )

E.g., the database service is heavily relied upon by all clients in the system. Since everything is diskless, the concept of persistent store has to be implemented elsewhere.

Rather than "just" implement a "network file service", I opted to give clients access to *structured* storage. Why have each app create its own "file format" and have to implement code to *parse* (and error check!) that format? Since most configuration and control "files" are really "tables" in some form, why not have a service that can *store* those tables in their logical form?! And, allow clients to grab individual entities from those tables.

And, allow something with persistent store to keep track of them!

So, a conventional irrigation controller might have a table like:

ZONE  FREQUENCY    DURATION    COMMENT
1     3 days       15 minutes  shrubbery
2     daily        5 minutes   flower beds
3     twice a day  10 minutes  rose bushes
...

This makes the job of the actual *controller* task pretty simple! Figuratively:

    for (zone = 1; zone < MAX_ZONES; zone++) {
        query("SELECT duration, frequency FROM irrigation WHERE zone = %d", zone);
        parse_result(&duration, &frequency);
        valve(zone, ON);
        sleep(duration);
        valve(zone, OFF);
    }

(some hand-waving implied since I am assuming the frequency criteria has been met)

Additionally, the RDBMS can enforce checks/criteria on the data stored in it. E.g., limiting the choices for "frequency" or "duration". This code can run *in* the RDBMS instead of burdening the client/app with that run-time checking! You can *assume* the data are meaningful when you get the results of the query -- no (?) further testing required!

Part of the reason behind the "asynchronous notification" issue/resource revocation is that you have to be prepared to deal with a resource (i.e., an entire client node) "going down" -- making the resources that it "published" inaccessible to the system. So, anything using those resources has to be able to deal with the resource being unceremoniously revoked AT WILL.

E.g., in the event of a power outage, only some of the nodes are battery backed. What happens to all the tasks that are expecting to interact with the hardware and/or software (tasks) residing on those nodes?

"Please stop using this resource." "OK" "No! I mean it is no longer available! You *can't* use it!" "Oh..."

So, why implement *two* mechanisms if you will always HAVE TO have the "kill" option working?
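A minimal sketch of what a single "kill only" path could look like from the consumer's side, assuming every use of the resource goes through the service and can come back with a "revoked" status. This is not the signal-based mechanism described earlier in the thread; `res_write`, `res_revoke` and `RES_REVOKED` are hypothetical names, and the sketch is single-threaded (in a real system the service would serialize the check and the use):

```c
#include <stdbool.h>
#include <stddef.h>

enum res_status { RES_OK, RES_REVOKED };

struct resource {
    bool granted;  /* cleared by the service on surrender or seizure */
    int  value;    /* stand-in for the real hardware or state */
};

/* Service side: seize the resource unconditionally. */
void res_revoke(struct resource *r)
{
    r->granted = false;
}

/* Every use goes through the service, which checks the grant itself,
 * so the consumer never has to test-then-use on its own. */
enum res_status res_write(struct resource *r, int v)
{
    if (!r->granted)
        return RES_REVOKED;
    r->value = v;
    return RES_OK;
}

/* Consumer: writes until done or revoked; returns how many writes landed. */
size_t consumer_run(struct resource *r, const int *data, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (res_write(r, data[i]) == RES_REVOKED)
            break;  /* tidy up local state; the resource is already gone */
    }
    return i;
}
```

The consumer has no handler to register and no window in which a notification can be missed; the cost is that every call site has to be prepared for `RES_REVOKED`.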

As to some of your later points:

While the system is distributed, control over it (supervision) is not. I.e., if a node (or process) goes down, the workload manager removes it from the system completely. If the node suddenly becomes visible, again, it won't be given any work until the workload manager formally reintroduces it into the grid.

Technically, there can be periods where some nodes still have connectivity with the node and it appears to be functioning. If so, the results/services that it provides remain useful. But, once the system sees that it is unresponsive (to the supervisor or any client), its death knell has been struck.
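The supervision rule above can be sketched as a two-state machine in which a removed node stays removed until it is explicitly readmitted, even if it starts answering again. The names and the `MISS_LIMIT` threshold are assumptions for illustration; the thread does not say how "unresponsive" is detected:

```c
#include <stdbool.h>

#define MISS_LIMIT 3  /* hypothetical: consecutive misses before removal */

enum node_state { NODE_ACTIVE, NODE_REMOVED };

struct node {
    enum node_state state;
    unsigned        missed;  /* consecutive unanswered polls */
};

/* Called by the workload manager once per polling period. */
void node_poll(struct node *n, bool responded)
{
    if (n->state == NODE_REMOVED)
        return;                       /* reappearing alone changes nothing */
    if (responded)
        n->missed = 0;
    else if (++n->missed >= MISS_LIMIT)
        n->state = NODE_REMOVED;
}

/* The only way back into the grid: formal reintroduction. */
void node_readmit(struct node *n)
{
    n->state  = NODE_ACTIVE;
    n->missed = 0;
}

/* Work is only ever dispatched to active nodes. */
bool node_may_receive_work(const struct node *n)
{
    return n->state == NODE_ACTIVE;
}
```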

--don

Don Y 12 years ago

I think "gravity" suggests "safety critical", "life support", etc. Rather, I view it as the *value* of a timely result. I.e. if your product picks bottles off a conveyor belt, then missing a deadline means a bottle crashes onto the floor. This might be relatively insignificant in terms of its financial value, etc.

But, the effort expended trying to *get* the bottle before it crashed to the floor has been wasted. *And*, no further effort will change the outcome! (i.e., *drop* that task from your workload)
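In scheduler terms, "drop that task from your workload" can be sketched as a pass that marks every unfinished job whose deadline has passed, so no further cycles are spent on it. The `struct job` layout and tick-based deadlines are assumptions, and tick counter wrap-around is ignored:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct job {
    uint32_t deadline;  /* absolute deadline, in ticks */
    bool     done;      /* finished in time */
    bool     dropped;   /* abandoned because the deadline passed */
};

/* Mark every unfinished job whose deadline has passed as dropped.
 * Returns how many jobs were dropped by this call. */
size_t drop_expired(struct job *jobs, size_t n, uint32_t now)
{
    size_t dropped = 0;

    for (size_t i = 0; i < n; i++) {
        if (!jobs[i].done && !jobs[i].dropped && now > jobs[i].deadline) {
            jobs[i].dropped = true;  /* the bottle is already on the floor */
            dropped++;
        }
    }
    return dropped;
}
```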

This is an excellent way of putting it!

--don

Tom Gardner 12 years ago

Fine :)

I may be over-interpreting the words you have used, but what would happen if:
- the controller/manager dies; what do the subset of nodes that have decided the controller has failed actually do? Would there be a conflict with the other subset of nodes?
- ditto intermittent network connectivity (yes, I have seen that even in benign environmental conditions)

Paul Rubin 12 years ago

This really sounds more and more like you're reinventing Erlang. That's ok, it happens all the time. You might benefit from:

formatting link

FWIW, Erlang has a replicated, distributed database (non-relational, more like an object db) built into its runtime.

Don Y 12 years ago

Exactly.

I wrote a 9-track tape "driver" that ran on a 25MHz i386. In "polled" mode, it had to pull a datum off the read head every 6 microseconds. (that's 10^-6, not 10^-3!)

But, it wasn't *hard* real time!

If something happened to interrupt me (e.g., NMI), then I would miss a datum (i.e., deadline). But, I could stop the transport, "backup" (which could actually be in the forward direction if I happened to be "reading backwards" at the time) to the previous file mark and then reread the record.

Of course, if this happened continuously, I would fail in a big way. (But, NMI's weren't expected to be common!)

So, it's a real-time problem because there *is* a deadline; and a SOFT real-time problem because there is value to pursuing the goal after the deadline has passed -- by, essentially, recreating the original problem and making a second attempt at it! :>
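The "recreate the problem and try again" pattern can be sketched as a bounded retry loop. This is not the original driver: `try_read` and `reposition` are hypothetical hooks standing in for the polled read and the back-up-to-file-mark step, `fake_tape` is a stand-in transport for illustration only, and the attempt limit is an assumption that covers the "if this happened continuously" case:

```c
#include <stdbool.h>

/* Hypothetical hooks supplied by the transport layer. */
typedef bool (*read_fn)(void *ctx);    /* true if no datum was missed */
typedef void (*rewind_fn)(void *ctx);  /* back up to the previous file mark */

/* Try to read one record, repositioning after every failed attempt.
 * Returns the attempt number that succeeded, or 0 if all attempts failed. */
unsigned read_record_with_retry(read_fn try_read, rewind_fn reposition,
                                void *ctx, unsigned max_attempts)
{
    for (unsigned attempt = 1; attempt <= max_attempts; attempt++) {
        if (try_read(ctx))
            return attempt;
        reposition(ctx);  /* recreate the original problem */
    }
    return 0;             /* missed the deadline every time: give up */
}

/* Stand-in transport: fails a fixed number of times, then succeeds. */
struct fake_tape {
    unsigned failures_left;
    unsigned rewinds;
};

static bool fake_read(void *ctx)
{
    struct fake_tape *t = ctx;

    if (t->failures_left > 0) {
        t->failures_left--;
        return false;
    }
    return true;
}

static void fake_rewind(void *ctx)
{
    struct fake_tape *t = ctx;

    t->rewinds++;
}
```

A hard real-time version of the same read would have no loop at all, since a late record would be worth nothing.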

Exactly. "The spacecraft has LEFT the solar system..."

Engineers find the "least bad" solution to problems! (acknowledging that there are rarely any "correct" ones)

--don
