Dynamic upgrading/Hot-swapping a service

My application runs 24/7/365.  There's no "reset" switch.  So,
components (hardware AND software) are upgraded while the
system is live.

[I suspect there are six nines services on-line that behave
similarly.  But, if inspected "in the small", I wonder how
clean the switchovers are?  And, how durable the connections?]

I can instantiate a new instance of a server and replace the
bindings to the old server with bindings to the new.  So, any
*new* attempts to use the service are automatically routed away
from the old server and handled, instead, by the new instance.

Any connections/transactions to the old server can continue to
be handled, there.  When each client closes its connection to
the server, the old server will see one less client to service.
This is sort of equivalent to a zombie process (even though the
process is actually still functioning as intended)

And, no *new* clients will ever appear (because the new instance
is handling them).  So, eventually, the server will detect "no
more clients" and exit().  This frees up the resources that the
old service was using.

[This is true even if an "old" client tries to reopen a connection
to the service!]
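
A minimal sketch of the rebind-and-drain behaviour described above. The names `Registry`, `Server`, `bind`, `connect`, and `retire` are invented for illustration; they are not taken from any real system mentioned in this thread.

```python
class Server:
    """A service instance that exits once it is retired and has no clients."""

    def __init__(self, name):
        self.name = name
        self.clients = set()
        self.retired = False   # True once the registry no longer points here
        self.exited = False

    def open(self, client):
        self.clients.add(client)

    def close(self, client):
        self.clients.discard(client)
        self._maybe_exit()

    def retire(self):
        self.retired = True
        self._maybe_exit()

    def _maybe_exit(self):
        # No clients left and no way for new ones to arrive: free resources.
        if self.retired and not self.clients:
            self.exited = True


class Registry:
    """Maps a service name to the instance that should take NEW connections."""

    def __init__(self):
        self._bindings = {}

    def bind(self, service, server):
        old = self._bindings.get(service)
        self._bindings[service] = server
        if old is not None and old is not server:
            old.retire()

    def connect(self, service, client):
        server = self._bindings[service]
        server.open(client)
        return server


registry = Registry()
old = Server("files-v1")
registry.bind("files", old)
a = registry.connect("files", "client-a")    # lands on the old instance

new = Server("files-v2")
registry.bind("files", new)                  # new connections go to files-v2
b = registry.connect("files", "client-b")

a.close("client-a")                          # last old client leaves
a2 = registry.connect("files", "client-a")   # a reopen is routed to the new one
```

The old instance never needs to be told to stop; it only needs to know it has been retired and to count its clients down to zero.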

The problem comes with connections that are more durable/persistent.
There comes a time when you want to more aggressively *kill* the
"zombie", even though it is performing as intended, etc. -- if only
to reclaim those resources!

Doing so WITHOUT INTERRUPTING THE SERVICE is, of course, the desired
way of meeting this goal (crashing the service is almost always a
sign that you got lazy in the implementation).  So, the old service
needs to be told to "migrate clients" to the new instance of the service.

Of course, the mechanics of migrating a specific client for a specific
service are highly dependent on that service, etc.  Any internal state
associated with the client has to be abstracted to a form that the
new server can interpret and map to its new implementation.  This, then,
has to be conveyed to the new service along with the endpoints for the
existing client(s).

The first question:  should this be implemented as a "force clients
to the new service" (i.e., shed ALL clients)?  Or, as "force THIS client
to the new service" -- iterated over the set of existing current clients?

[The net result in each case should be the same]

The latter gives finer-grained control over the reallocation of
resources/connections/activities.  But, I can't really see how it
would be used with *less* than the full set of current clients
(so, is it needless detail?).

The second question goes to who should effect this decision.
Imposing it "from on high" ignores the particular requirements
of the service in question -- it assumes all services are
equally easy/difficult from which to migrate clients.  OTOH,
the only place that has global knowledge of what's happening
(resource-wise) in the system is "on high".

My current thinking is between two alternatives:
- signal the new service that it should *acquire* the old
   clients (from OLD_SERVICE) and let "it" sort out the most
   expeditious way of doing so;
- signal the old service that it should *shed* the old clients
   (to NEW_SERVICE) and let it figure out how best to effect that.

I think I prefer the former as it lets the new service assume the
agency implied by "on high"; *it* acts to effect the desired
change but with a better idea of what is involved in doing that
("on high" being ignorant of the specifics of this service).
So, it can ensure its own house is in order before contacting
the old service and imposing the directive on it.

In this way, the old service can make decisions as to which
clients should be migrated first (there is always the threat
of an unceremonious process shutdown imposed without consent
"at some time", hereafter).  And, it can negotiate with the
new service as to the best way of exchanging state information
for those connections (e.g., define the protocol as well as
the mechanism)
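
One way to picture the "acquire" alternative, as a sketch only: the new service drives the transfer, the old service chooses the order, and the two agree on a state encoding first. The class names, the `formats` lists, and the priority numbers are all assumptions made up for this example.

```python
class OldService:
    formats = ["v1-binary", "json"]           # state encodings it can export

    def __init__(self, clients):
        # client id -> (priority, state); lower priority value migrates first
        self.clients = dict(clients)

    def migration_order(self):
        return sorted(self.clients, key=lambda c: self.clients[c][0])

    def export(self, client, fmt):
        _, state = self.clients.pop(client)    # old service lets go of the client
        return {"format": fmt, "client": client, "state": state}


class NewService:
    formats = ["json", "v2-binary"]           # encodings it can import, preferred first

    def __init__(self):
        self.clients = {}

    def acquire_from(self, old):
        # Negotiate: first format in our preference list the old side supports.
        fmt = next((f for f in self.formats if f in old.formats), None)
        if fmt is None:
            raise RuntimeError("no common state format")
        for client in old.migration_order():   # old service picks the order
            record = old.export(client, fmt)
            self.clients[record["client"]] = record["state"]
        return fmt


old = OldService({"logger": (2, {"offset": 900}), "editor": (1, {"offset": 12})})
new = NewService()
fmt = new.acquire_from(old)
```

"On high" only has to call `acquire_from`; everything service-specific stays inside the two service implementations.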

Re: Dynamic upgrading/Hot-swapping a service
On 4/20/2017 4:05 PM, Don Y wrote:

Re: Dynamic upgrading/Hot-swapping a service
On Thu, 20 Apr 2017 16:05:00 -0700, Don Y


Before (re)inventing the wheel, take a look how VAXcluster (now
VMScluster) has done it since the 1980's.

Re: Dynamic upgrading/Hot-swapping a service
On 4/21/2017 12:45 AM, snipped-for-privacy@downunder.com wrote:

In cluster environments, nodes tend to be indivisible entities.
When you update the software on a node, you update the node,
itself.  You "kick off" the processes that are running on the
node just prior to the upgrade (even if that means migrating them
to another node -- with an OLD copy of the service they are using
at the time) and summarily replace the node (and the services it

[If you've migrated the existing connections to services that WERE
running on that node to some other node, then you still have those
clients running on that other node -- potentially indefinitely with
the OLD server code!]

Imagine, instead, that some processes are using the "file service"
(whatever THAT is!) on *a* node.  You want to upgrade the file service
code (without affecting any of the other services that are running
on the node) WHILE the file service remains in use on that node.

I.e., install the new service and start it running.  Change the
service registry to reference the new server instance for *new*
service requests (i.e., any files that are accessed AFTER this
point will be handled by the NEW file service).  Allow the old
file service instance to remain active to finish servicing any
existing connections.

*Eventually*, the preexisting connections will be completely serviced
(those files closed, etc.).  Because all NEW requests are handled by
the NEW service, the old service will eventually find itself with no
work to do -- no active connections (clients).  At that point, it
can terminate itself with no deleterious impact on the system.

The problem comes with clients that "linger" on the old service longer
than you'd like.  E.g., imagine a process that opens a dribble file and
leaves it open FOREVER.  That would stake a continuous claim on the
old service preventing it from ever being "replaced".

Or, you might be in a *hurry* to replace a service -- before the
clients currently using it are naturally *done* with it.

So, you need a way of migrating the *active* connections to another
server ALONG WITH THE INTERNAL STATE associated with each of those
connections.

For a file service, that state might include an inode number, access
mode (R/W), current file offset (for read or write), any buffered data
(to be written or already read-ahead), any media I/O actions "in progress",
etc.

But, the *new* service may associate different state with each connection
as dictated by *its* implementation.  So, simply "copying" the state
from the old service to the new service won't suffice; there needs to
be some "state translation" that takes place to ensure the client's
connection remains semantically intact across the transition between
the two.

Or, a modification of the server contract that allows any server to
simply state, "I quit" and let the clients figure out how to recover
or restore their use of that service (boo, hiss!).
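
The "state translation" idea for a file service can be sketched as an export to an implementation-neutral record followed by an import into the new server's own layout. Everything here is hypothetical: the field names, the 4096-byte block size, and the assumption that buffered write data sits immediately before the client-visible offset.

```python
from dataclasses import dataclass, field

BLOCK = 4096  # assumed block size used by the (hypothetical) new server


@dataclass
class OldConn:                 # old server's per-connection state
    inode: int
    mode: str                  # "R" or "W"
    offset: int                # offset of the next byte the client will touch
    write_buf: bytes = b""     # data accepted from the client, not yet on media


@dataclass
class NewConn:                 # new server: block-oriented internals
    inode: int
    writable: bool
    block: int                 # offset split into a block number ...
    within: int                # ... and a position inside that block
    dirty: list = field(default_factory=list)   # (offset, bytes) still to write


def export_neutral(c: OldConn) -> dict:
    """Old server -> implementation-neutral description of the connection."""
    return {"inode": c.inode, "mode": c.mode, "offset": c.offset,
            "pending": c.write_buf}


def import_neutral(s: dict) -> NewConn:
    """Neutral description -> the new server's own representation."""
    conn = NewConn(inode=s["inode"], writable=(s["mode"] == "W"),
                   block=s["offset"] // BLOCK, within=s["offset"] % BLOCK)
    if s["pending"]:
        # Assumption: pending bytes logically precede the current offset.
        conn.dirty.append((s["offset"] - len(s["pending"]), s["pending"]))
    return conn


old_conn = OldConn(inode=42, mode="W", offset=10000, write_buf=b"abc")
new_conn = import_neutral(export_neutral(old_conn))
```

Neither server needs to understand the other's internals; both only need to agree on the neutral record.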

I'll be meeting up with some local colleagues, tonight, for 12oz curls.
I'll see if any of the guys who work in "enterprises" can shed some light
on the approach they take to this sort of thing.  Though I suspect their
users are more "transient" than persistent.  So, will leave a service in
short order as a natural consequence of their operation (in which case,
just registering the new service and waiting would suffice).

Re: Dynamic upgrading/Hot-swapping a service
On Fri, 21 Apr 2017 10:43:28 -0700, Don Y


It is more than a quarter of a century since I ran a
large VAXcluster with a dozen cabinet-sized CPUs, but I will try to
remember some of the details.

If you have multiple CPUs with shared (and mirrored) disks, switching
an active process from one CPU to another is quite easy. As long
as the OS supports process checkpointing or swapping out a complete
process to disk, things are easy. Instead of swapping in a process
from disk back into the original CPU, just swap it into another
CPU :-)

In a VAX cluster, application programs refer to resources, such as
disks by logical names. It is the responsibility of the system manager
to maintain the day to day mapping between the logical disk names and
the physical disk names.  Some logical names are lists (one logical
name translates to multiple physical resources, and the OS selects the
first physical device available from the list).
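
The lookup described here (one logical name, several candidate devices, first available wins) can be illustrated in a few lines. This is plain Python for illustration, not VMS syntax, and the logical name and device names are invented.

```python
def resolve(logical, table, available):
    """Return the first available physical device for a logical name."""
    for physical in table[logical]:
        if physical in available:
            return physical
    raise LookupError(f"no device available for {logical!r}")


table = {"DATA_DISK": ["DUA0", "DUB0"]}      # one logical name, two candidates
primary = resolve("DATA_DISK", table, available={"DUA0", "DUB0"})
failover = resolve("DATA_DISK", table, available={"DUB0"})
```

Applications only ever see `DATA_DISK`, so the mapping can change underneath them.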

In those days, dumb terminals were used. With the Ethernet/serial
converters (DECserver xxx) running the LAT protocol, it was quite easy
to automatically connect a dumb terminal user from one CPU to another.

Re: Dynamic upgrading/Hot-swapping a service
On 4/21/2017 1:06 PM, snipped-for-privacy@downunder.com wrote:

That's not the same thing.   That's *migrating* a process to a different
CPU.  You're moving the entire state of the process to resume execution
on another CPU.  All the "variables" AND all the instructions that
interpret those variables!

I want to "alter the executable" while it is running -- change the
instructions and (somehow) tweak the variables so their current
values "make sense" when interpreted by a different set of instructions!

I'm typing a "followup" to your message using Thunderbird.  I
(the human) can be regarded as a client of Thunderbird.  I am engaged
in an interaction with it -- my CONNECTION to it persists continuously
as I am typing this message.

WHILE I AM TYPING, I want something to be able to sneak in and
REPLACE the copy of Thunderbird that is executing in my computer's
memory -- not just the copy that resides on the disk (which Windows
won't examine until the next time I *load* Thunderbird) -- and to
do so such that this message ends up intact as it is eventually
posted to the NNTP server.

I.e., to do this, you'd need to capture a copy of what I've typed
up to the instant the upgrade is switched in *under* me.  It
would have to know how the windows that the old Thunderbird
instance was using were maintained by the OS, and the source
of keystrokes and other user interface events.

It would be messy and tedious to get it "right" -- but not impossible.

A far easier goal would be to swap the executable bound to "Thunderbird.exe"
so that the next time I invoked Thunderbird, I'd get the NEW executable;
let my current interaction run to completion with the *old* executable!

But, there's no guarantee that I will terminate this Thunderbird session
anytime soon.  Or *ever*!

The OS can forcibly move the user interface connections to another
process running on the same -- or different -- node.  But, that doesn't
mean the client's (i.e., user's) experience will be "continuous"
or coherent.


Re: Dynamic upgrading/Hot-swapping a service
On Fri, 21 Apr 2017 17:51:42 -0700, Don Y


I still do not see what your actual problem is.

Just swap the MAC addresses between the activating and passivating
server, and neither the client nor the client application notices
anything special.

On the server side with stateless protocols such as UDP and LAT things
are quite straightforward.

With stateful protocols like TCP, things get hairy if the protocol
state is maintained in kernel mode and is not swapped out and into
another process along with the user mode code. With the TCP stack in
user mode, this should not be a big problem.

Re: Dynamic upgrading/Hot-swapping a service
On 4/22/2017 4:09 AM, snipped-for-privacy@downunder.com wrote:

Find a piece of software that is currently executing:  your
microwave oven controller, your PC (consider it a *collection*
of software), your calculator, your ....

Now, WHILE it is "solving some particular problem for which it
was designed", pause the clock and replace all the INSTRUCTIONS
in the program(s) with a new, revised program (it does <whatever>
only "better":  the 8 digit calculator now handles 12 digits; the
microwave oven now has 6 other types of cycles; the PC is now
running Windows 11 instead of DOS 3.3; etc.).

Let the clock resume.  None of the actions that were running
at the time the clock was PAUSED should have been affected by
the upgrade.  I.e., if the calculator was in the middle of
computing "14!", it should continue to completion -- from
wherever it happened to have been, at the time -- yielding
the correct result.

Note, however, that the result should now be displayed as
8.71782912*10^10 or 87178291200 and NOT as 8.7178291*10^10
to reflect the extra precision that it has internally
as well as the extended "display"/reporting capability
(assuming, of course, that the original executable was
interrupted before any loss of precision).
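
The calculator case can be mimicked in a few lines, as a sketch under simple assumptions: the in-progress state is just a counter and a running product, the "upgrade" happens between loop iterations, and only the display routine differs between the old and new code. The function names are made up.

```python
def step(state):
    """One iteration of the factorial loop; shared by both 'firmwares'."""
    state["result"] *= state["n"]
    state["n"] -= 1


def display_old(value):
    return f"{value:.7e}"       # old code: 8 significant digits


def display_new(value):
    return f"{value:.11e}"      # new code: 12 significant digits


state = {"n": 14, "result": 1}  # computing 14!
for _ in range(5):              # old code runs for a while ...
    step(state)
# ... clock "paused" here: instructions replaced, state left in place ...
while state["n"] > 1:
    step(state)

shown_before_upgrade = display_old(state["result"])
shown_after_upgrade = display_new(state["result"])
```

Because the state is held as an exact integer, nothing was lost before the swap and the extra digits are available afterwards; had the old code already rounded its intermediate result, the new display could not recover them.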

Put something in your microwave oven.  Set the timer to X.
After an arbitrary amount of time, pause the process (processor)
and replace the ROMs.  Resume the process.  EXPECT the entire
process -- start to finish -- to proceed exactly as it would
have had you not replaced the ROMs!


Communication protocols aren't the only places where state is involved.
Start counting out loud.  The next time you encounter a person, switch
to another language.  I.e., the algorithm by which you determine the
next ordinal to speak has changed.  But, you've still got to remember
which was the *last* previously spoken!


Re: Dynamic upgrading/Hot-swapping a service
On 4/22/2017 11:55, Don Y wrote:
Since this group is for embedded processing, it is fair to ask why the
original calculator would have a display with more than 8 significant
digits.

This assumes that you can replace the ROMs by some hot-swap process that
does not kill power to the RAM/registers that hold the state and quickly  
enough that the food will not cool substantially.  Also, the old program  
state must be coded so that the new ROMs read and operate on it properly.

It sounds like a lot of work.

Best wishes,

Re: Dynamic upgrading/Hot-swapping a service
On 4/23/2017 10:01 AM, Phil Martel wrote:

Why does the calculator *function* have to be implemented in
a calculator *package*?  Do you not use <math.h> in your
embedded applications?

With the tiniest bit of imagination, one should be able to consider
a new math library that had greater precision *or* different
algorithms that converged faster than the previous implementation.

Given that you (I) can not shut the application down "for
maintenance", how would you replace the library (used by multiple
modules) in the application while the system was powered up and
operating?   (see my previous examples for steps)

Replace "library" with "service" and you have my original question
(i.e., most libraries can be implemented *as* services with the
re-formalization of the interface communication overhead)


Again, imagination suggests you could implement the ROMs (i.e., the
program TEXT) in other media that *can* be (effectively) replaced "between
one clock cycle and the next".  This is all old technology.  The problem
lies in doing so while some consumer (client) might be ACTIVELY executing
within that block of program TEXT.


No, that isn't necessary.  In fact, different algorithms may use
inconsistent state vectors so that mapping from one algorithm to
another is not possible.  That doesn't preclude "interrupting"
existing processing, replacing the TEXT and finishing the
processing with the "new" algorithm.


That's why things like Windows want you to reboot so often!  :>

OTOH, web sites and enterprise systems regularly roll out
updates WHILE still providing services -- because the cost
of shutting the systems/services down for that update can
be substantial ("We're sorry, but the on-line banking transaction
that you are engaged in AT THIS MOMENT will be aborted.  Please
try again later.")

(Would you want to have to *stop* your car to have the code in the
ABS system updated -- given that stopping the car might not be
possible, reliably, given the current state of the ABS code?  :> )

Re: Dynamic upgrading/Hot-swapping a service
On 4/23/2017 13:52, Don Y wrote:
Obviously it doesn't have to be, but it may be. Perhaps "calculator" is  
a poor example of what you're trying to explain

I'm not familiar with *how* these systems do what they do.  Keeping the  
old copy running while clients are in the middle of a transaction and  
perhaps warning them to finish up is an option.

Provided you translate and replace the existing state vector also.



Would you want to rely on the company that wrote the bad ABS code to fix  
it and do so while your car was moving?  I suspect that the "fix it  
live" problem is tougher than the "ABS" problem.

FILAAS (Fix it live as a service) might be possible if the processor and  
system the ABS was running on was standard, but what about your cardiac  
pacemaker?  Is that running on the same processor?

Best wishes,

Re: Dynamic upgrading/Hot-swapping a service
On 4/24/2017 7:48 AM, Phil Martel wrote:

I'm trying to pick examples of "programs" that people can easily
understand.  A calculator evaluating a transcendental function
(i.e., something with some "meat" in it) could approach the
problem in different ways (Taylor series, CORDIC, etc.) in
different "revisions"/versions.

So, (*ignoring* the desire to upgrade due to a *flaw* in the
implementation,) it is conceivable that you would want to
upgrade the algorithm to adopt an approach that converges
more quickly.

And, because the algorithm would be iterative, it is likely
that it could be "in progress" when you choose to upgrade the
software (e.g., an 80b floating-point "FMUL" can be a single
instruction but FTANH probably isn't!).

Finally, the approaches can vary significantly in terms of
their resource requirements (e.g., temporary storage) making
a direct mapping of one to the other virtually impossible.


That assumes they *will* "finish up" (consider a "black box" service that
is always receiving "log" information) and in the time frame that *you*
consider appropriate.  If you're shutting down a node in a cluster for
periodic maintenance, you can probably afford to wait seconds/minutes
for everything to come to an orderly state.  But, you can't make that
generalization about all clients and dependencies (recall, many clients
are, typically, *agents* -- "serving" clients of their own!)

You can always ensure no *new* clients avail themselves of the "old"
instance of the service thereby (hopefully) expediting its "release".


That may not be practical.

factorial(n: int) : int
    ASSERT( n >= 1 )
    result := 1
    while (n > 1) {
       result *= n
       n -= 1
    }
    return result

factorial(n: int) : int
    ASSERT( n >= 1 )
    if (n == 1)
       return 1
    return n * factorial(n-1)

have vastly different state vectors (assuming I haven't botched the

So, just assuming you can <somehow> map one state vector into another
won't give you a "fix".
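
To make the difference concrete, here is a small sketch that records what each of the two factorial() versions above would be "holding" if interrupted partway through. The generator names are invented, and the recursive version is modelled with an explicit stack so that its pending frames can be inspected.

```python
def iterative_states(n):
    """Yield the (n, result) pair held before each loop iteration."""
    result = 1
    while n > 1:
        yield (n, result)
        result *= n
        n -= 1


def recursive_states(n):
    """Yield the stack of pending multiplicands at each level of descent."""
    stack = []
    while n > 1:
        stack.append(n)          # a frame waiting for factorial(n-1) to return
        yield tuple(stack)
        n -= 1


it = list(iterative_states(5))
rec = list(recursive_states(5))
# it  -> [(5, 1), (4, 5), (3, 20), (2, 60)]
# rec -> [(5,), (5, 4), (5, 4, 3), (5, 4, 3, 2)]
```

The iterative version carries a counter and a partial product in constant space; the recursive one carries a growing stack of frames and no partial product at all until it starts returning.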


I think most of these types of services are short-lived and/or
transactional.  And, for services with human interaction, you can
always hope the human "client" is "understanding"/patient (which
is possible IF these types of inconveniences aren't frequent)


*Undoubtedly* tougher!  OTOH, if there was sufficient risk (death or injury)
to applying the brakes *prior to* installing the upgrade, I'd much prefer
<someone> invest in *that* solution!  You can't tell the Apollo 13 crew
that you'll fix their problem -- AFTER they return home...  :>


Pacemaker is a perfect example of upgrade /in situ/.  Of course, the chances
of the pacemaker needing to perform its function during the upgrade AND being
unable to do so AND the patient dying while the doctor is standing nearby is
probably pretty slim.  And, the pacemaker designer undoubtedly considered
this capability in their design of the product.

We worked out a bunch of different approaches to the problem Friday night.
Unfortunately, no *one* is a panacea.  So, I'm working through the costs
(and consequences) of each approach.  I've got an off-site/retreat coming
up RSN so I hope to bring my problem to the table, there.

As I can't rely on others (writing code to run in my system) to design
components with this capability in mind, I need a fall-back strategy that
will allow me to upgrade *those* components in the least painful way possible
(if those folks' products end up "looking bad" as a result, it's their "image"
to attend to).

Re: Dynamic upgrading/Hot-swapping a service
On 4/24/2017 15:33, Don Y wrote:
So, let's say you're in the middle of calculating
factorial(1,000,000,000,000) with algorithm 2.  Then you find out about  
algorithm 1 (or maybe decide that Stirling's approximation is close  
enough).  What *can* you do with the unfinished solution other than dump  
the work and restart the problem with the new algorithm or let it  
finish? (and next time use the new algorithm)?

Best wishes,

Re: Dynamic upgrading/Hot-swapping a service
On 4/24/2017 7:02 PM, Phil Martel wrote:

You (as an executing client who has called upon the "factorial service"
to perform that calculation) don't "find out about" anything!  To *you*,
nothing appears remiss.  That's the whole point; as long as the API
hasn't changed, you shouldn't care that the service has been replaced
with an equivalent service.

How the *system* ensures that illusion is maintained is the problem
being addressed.

The remedy that "makes most sense" will vary with the design (and
functionality) of the service being upgraded.  And, the approach the
maintainer chooses to address those "rolling updates".

As it would be heavy-handed for the system to dictate how EVERY service
is coded AND the constraints placed upon their algorithms, the system
can only offer (prefabricated) *mechanisms* that the service designer
(and maintainer) can exploit to facilitate the upgrade.

And, the system has to rely on the designer/maintainer to make best use
of the mechanisms that it provides -- because the designer/maintainer
has more intimate knowledge of the way the service is intended to work.

A "lazy" designer may choose not to address live upgrade issues.
In which case, the system will resort to draconian measures when
an upgrade is installed:  it will KILL the running service and
let the clients deal with the resulting mess.  *Users* will then
either avoid products from that provider *or* will avoid upgrading
(if the consequences are too painful -- where "too" is a subjective
criterion defined by the user in question).
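
A sketch of how a system might offer mechanisms while keeping the draconian fallback. The three strategy names, the idea of a drain deadline, and the function `choose_strategy` are assumptions made for illustration, not anything specified in this thread.

```python
from enum import Enum


class Strategy(Enum):
    DRAIN = "drain"       # rebind, then wait for old clients to leave
    MIGRATE = "migrate"   # hand live connections and state to the new instance
    KILL = "kill"         # terminate the old instance; clients must recover


def choose_strategy(declared, deadline_passed):
    """Pick what the system does with the old instance.

    `declared` is what the service's maintainer registered (None if nothing).
    """
    if declared is None:
        return Strategy.KILL      # "lazy" designer: no cooperation offered
    if declared is Strategy.DRAIN and deadline_passed:
        return Strategy.KILL      # lingering clients outlasted the system's patience
    return declared
```

The system decides *when* to act; the maintainer's declaration decides *how*, as long as one was supplied.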


I prepare a document using /WordProcessor25/.  The document can be seen
as a snapshot of the "conceptual document" that I seek to prepare.
I upgrade to /WordProcessor29/.  Is all of the work that I did prior
to that upgrade lost?  (Why not?  :>  )

Re: Dynamic upgrading/Hot-swapping a service
On 4/25/2017 0:37, Don Y wrote:

I used the word "you" to mean the system providing the service  
(including the programmer who implemented the new algorithm).  However,  
a factorial calculation is a poor example in that it is not persistent.  
It may make sense for you (the system) to dump the work you've done and  
start over, or to continue with the old algorithm for this instance.

I think the example you're trying for is that you run a word processor  
service and that I'm a client.  I'm typing into a document using  
/WordProcessor25/ (which I think of as /WordProcessor/).  You want to  
upgrade to /WordProcessor29/ while I'm typing right here.
In this case, perhaps in most cases, there's some point where the system  
can save its state as a checkpoint, start the new software and continue.  
If the system can do the change between user inputs, the change will
be transparent.  The case where the inputs come too fast is where it  
gets tricky and you may have to keep a copy of the old code running.

Best wishes,
pomartel At Comcast(ignore_this) dot net

Re: Dynamic upgrading/Hot-swapping a service
On 4/25/2017 9:19 AM, Phil Martel wrote:

There are no conditions placed on what a service can provide.  E.g.,
my calculator is a service; in your world, it might be a library.

I use factorial as an example of a "job" that can take some "macroscopic"
amount of time -- rather than arguing about whether 10 microseconds or
200 hours is "too long" for a service to "linger" in the face of a
pending/desired upgrade.


There are *many* possible courses of action that the developer could
apply to providing a rolling upgrade of his service.  The system can't
impose *one* -- without sharply constraining the types of services that
can be implemented as well as the "time" each takes to operate.

If a service has side-effects, then you can't (typically) start over
as you would have to consider which side effects had already taken
place.




I want to upgrade.  *When* is a separate issue.

The point I am making is that you *can* upgrade /WordProcessor/ and
not lose the "work in progress" -- even if it has been sitting on an
offline floppy for 2 years -- BECAUSE THE NEW WORDPROCESSOR KNOWS

(i.e., yet another "upgrade strategy")
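
Read as "the new version can interpret what the old version saved", this strategy amounts to versioned state plus converters. A minimal sketch, assuming a made-up document layout in which version 25 stored one text blob and version 29 stores a list of paragraphs:

```python
def v25_to_v29(doc):
    """Hypothetical converter: v25 stored plain text, v29 stores paragraphs."""
    return {"version": 29, "paragraphs": doc["text"].split("\n\n")}


CONVERTERS = {25: v25_to_v29}    # source version -> function producing a newer one
CURRENT = 29


def load(doc):
    """Bring a saved document up to the current format, one step at a time."""
    while doc["version"] != CURRENT:
        convert = CONVERTERS.get(doc["version"])
        if convert is None:
            raise ValueError(f"unknown document version {doc['version']}")
        doc = convert(doc)
    return doc


saved = {"version": 25, "text": "Dear Sir,\n\nPlease find enclosed."}
opened = load(saved)
```

The saved state can sit untouched for any length of time; the cost of the upgrade is paid only when the new code next reads it.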


Again, the system can't know how the service behaves or how its clients
expect to use it.  So, you can't *impose* an upgrade strategy on the service.
Instead, you provide mechanisms that allow many approaches to be used and
count on the designer/maintainer to use their specific knowledge of the
service (THEIR service!) to decide which of them to exploit.

Part of installing the upgrade "software" is the specification of the
upgrade strategy mechanism to be used -- along with any ancillary
requirements FOR THE UPGRADE.  Make it easier for developers to
do what they "should" instead of forcing them to do ALL of the
heavy lifting (which would tend to result in more "please reboot, now"

What I've been doing (since Friday evening) is codifying the different
strategies and drafting guidelines for when each is "preferred" along
with when each is contraindicated.  Then, I'll sort out the consequences
of a developer incorrectly specifying the "wrong" strategy/mechanism;
or, incorrectly implementing it in their upgrade:  how might this
screw up other aspects of the system and what can I do to guard against
