If you've ever made a mistake, you're good to go for this document. I also like the descriptions in the Field Guide to Human Error; reading that first might help you understand this document.
What I've tried to provide is enough feedback and cross-checks in the operational processes and development work-flows to enable workers to compare their intent to the actual status of the automation they are driving. This is a balance that requires a large measure of diligence and a trust-but-verify attitude at all times.
No complex system is inherently safe, and the master source structure is just as dangerous as any other. The intent is to balance the power to make broad changes with the assurance that your actions are changing what you meant to change.
The goal of any configuration management structure is to "build what you promised from what you have, without errors, and in a timely manner"; see the master source top-level document. Mistakes in execution are more likely to break ongoing production than errors in content. Since our goal is always to make the structure better, we should take steps to avoid either type of failure.
To some extent every configuration management structure is clumsy and complex. People balance and mitigate these issues to differentiate success from failure. This document explains the reasoning and tactics I use to train people, maintain site policy, and justify my salary.
The proximal cause of this type of error is almost always a missing close-the-loop step.
The proximal cause of this type of error is usually a loss of context: either an interruption in the process, or the lack of a secondary checking your work. Sometimes it is also caused by normalization of deviance.
In the new version of the master source I've tried to make all of the data proximal and available to the driver of each structure. The local recipe file (Makefile or Msrc.mk) and the platform recipe file (Makefile.host or Makefile) are both kept in the current working directory. No data is stored in a non-text format (we strongly prefer text files to database tables). There are command-line options to display the derived values of each step in the process, and options to dry-run most every operation.
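For example, make itself offers a dry-run: the -n flag prints the commands a target would run without executing them (the target name here is illustrative):

$ make -n install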
These give the driver feedback. That feedback must be taken seriously by every driver. A push from the wrong directory, or with the wrong version of a key file, is just about the worst thing you can do to any production operation. I also include the current working directory in my shell prompt, PS1.
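A minimal sketch of such a prompt, assuming a bash-style shell (\w expands to the current working directory):

PS1='\w\$ '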
Consider how little visual difference there is between a test push and a production push:

$ msrc -Ctest.cf -P10 -E ....
vs:
$ msrc -Cprod.cf -P10 -E ....
To mitigate that, we added a step to the procedure to stress that running a "do nothing update" before any push that might damage production is mandatory:
$ msrc -C.... : test

The : test command is selected because forgetting the colon runs an empty test command (which fails silently), and missing the test word doesn't hurt anything either. (Omitting the space fails to find the command :test, which is also harmless.)
The output from that command includes the list of instances updated. That gives the driver two items that might trigger an abort reaction: an unexpected set of hostnames, or a host list that is too long or too short for the expected change targets.
The attempt here is to offer feedback before the actual commit; with history editing, replacing the : test with make install is trivial.
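With that habit the whole exchange looks something like this sketch (the configuration file and options are carried over from the examples above):

$ msrc -Cprod.cf -P10 : test
  ... review the list of instances before going on ...
$ msrc -Cprod.cf -P10 make install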
The fix for the aborted update is also clear: if you got the wrong set of hosts, then you should use efmd to produce the correct list. That gives you the updated options for msrc, since they take identical options.
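A sketch of that recovery (the option values are placeholders; the point is that an efmd command line mirrors the msrc command line):

$ efmd -Cprod.cf -E ....
  ... adjust the selection until the host list is correct ...
$ msrc -Cprod.cf -E .... : test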
Use the recipe file to record all the update commands you intend to run. Testing a recipe file's install target which updates 20 files on a test host is great. Keying in 20 install commands in a change window is insane. I don't know how anyone can justify the odds of a mistake in the latter case.
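A minimal sketch of such a target (the file names and paths are illustrative):

install:
	install -c -m 0644 app.conf /usr/local/etc/app.conf
	install -c -m 0644 app.magic /usr/local/etc/app.magic

Test the target once on a scratch host, then replay the identical target in the change window.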
For similar reasons I avoid punned recipe files. When a single make recipe named Makefile serves as both the master and platform recipe file, one might activate an update in the wrong context.
If you don't put it in a make recipe, embed it in a comment using mk markup. Never type a utility command of more than a few words.
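A sketch of the idea, assuming mk's usual convention of a named mark in a comment (the mark name, file, and command here are all illustrative):

$ head -2 push.sh
#!/bin/sh
# $Push: msrc -Cprod.cf -P10 make install
$ mk -mPush push.sh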
I try to avoid quoting shell meta-characters for the remote command as well. If you need a pipe, put it in the recipe. There may be an occasional need for a remote shell meta-character (usually &&), which is why msrc passes them quoted to each target host.
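For example, give the pipeline a target of its own (the target name and pipeline are illustrative), so the remote command stays a single short word:

status:
	ps -ax | grep '[h]ttpd'

$ msrc -Cprod.cf -P10 make status

(The [h] in the pattern keeps grep from matching its own process.)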
On the other end of the typing errors: recipe names should be long enough to avoid single-letter mistakes. Steps named n and m are easy to mistype and mistake for each other.
People rely on their experience to recognize key patterns that indicate if things are going according to plan, or not. The idea is that the two people making a change share a common mental model of what is supposed to happen, and are constantly checking their mental model against what is actually happening and each other. This situational awareness is core to preventing mistakes. (This is also a core concept in pair programming, for the same reasons.)
I am assuming that all scripts used to make production changes were reviewed and tested on a non-production set of hosts, well before the change window. If that is not the case, fix that first. There is little-to-no excuse to run any change without prior testing.
This gives the person running the change a close-the-loop metric which enables them to close the change ticket with a positive assertion ("checkout complete") rather than an observation that they didn't see any obvious errors.
Note that the checkout recipe should never be a step in the update recipe. It might be run before the update to verify that the update has not been done, and it may well be run multiple times (by the third eye requirement) as a post-update step.
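A sketch of a stand-alone checkout target (the product path and version check are illustrative):

checkout:
	test -x /usr/local/bin/app
	/usr/local/bin/app -V

It asserts the installed state without repeating any update action, so running it multiple times is always safe.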
Similarly a structure that changes modes without a clear request from the operator is really bad. The old complaint from emacs users that vi's modes were bad is ironic, in that emacs has even more modes, and it can change modes without keyboard input.
Just as bad are processes that offer no feedback. This is why file transfer programs (like scp) offer an updated progress metric as they copy data over the network. Show status for long processes. Status is more important than behavior: don't tell the operator about details they didn't already know about. People do not deal well with extra information they do not understand.
Knowledge of the current situation prevents mistakes. This is true for errors in execution. It is not true for errors caused by distant events.
The biological term for "the part of an organism that is distant from the point of inspection (or connection)" is distal. Failures that result from external forces, or actions taken by parties outside of the driver's work-group (or span-of-control), are therefore distal sources of error.
So I don't subscribe to that doctrine: the weak link is almost always a process that provided little feedback or visibility (e.g. a GUI), or a procedure that had no useful cross-checks before the commit action was taken. The cross-check missing in that case: the cost of ongoing changes and the added risk to those changes, versus the small savings in capital costs for the slightly larger disks.
Distant sources of data need to be observable: the list of hosts we are about to update needs to be visible to the driver (as above). But the reasons for each step in the process need to be just as clear to the driver. Steps which add no certainty to the process are of little value to the driver. What gives each step in the process value? Here is a list I would start with:
The output from the process is organized and fairly easy to read (possibly with some training).
The basic UNIX™ shell commands have a common pattern for error messages:

command: noun: error-message

For example, I'll spell the null device wrong:

$ cat /dev/mull
cat: /dev/mull: No such file or directory
This error message doesn't tell the driver which component of the path is wrong, but it gives her a finite number of places to inspect.
If a key step fails, then any automation should stop as soon as
possible. Never depend on the driver to interrupt the process from
the keyboard.
The failure should be as close to the last line of output as you can make it, and include a key phrase like "noun failed to verb".
The best thing about these errors is that they are common across many tools, and the error messages are available in most locales. They are also clearly spelled out for each system call in the manual pages for sections 2 and 3. That is not to say they are clear to a novice, but they are consistent and can be learned.
And nearly every base tool exits with a non-zero exit code when it fails. So check the status of any command that matters, and don't run commands that don't matter.
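A minimal sketch in sh (the make target is carried over from the examples above):

#!/bin/sh
# stop as soon as the key step fails; never depend on the driver
# to interrupt the process from the keyboard
if ! make install; then
	echo "recipe failed to install" >&2
	exit 1
fi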
Investigation of failures should include cross-checks from the point-of-view of any distal inputs. Any distal part that has a way to cross-check our work should have an interface to test it ad hoc. Use those to recover from failures.
If there is a possible termination-point in the process, then there should be a clear on-ramp to resume the process after the issue is resolved. This may require a return to a previous step. This may even require a wholesale backout of the work already done. Live with that, and learn to accept temporary failure as long-term success.
In the physical world we are bombarded by our senses with input data, so much so that we have to ignore most of it. In the digital world one must request data to see it.
Actions to prevent mistakes require not assuming that others have a similar understanding of the situation. Verification steps assure that the driver and their secondary agree on the status of the change. Never let a chance to run a verification pass you by.
Avoid any "normalization of deviance". If any output in the process looks funny, then stop to confirm that the output was (in some way) expected. Situational awareness is key in configuration management, and viewing all available data before taking actions (planned or unplanned) is the key to stable operations.
Avoid dealing with newly emerging requirements in an event-driven or uncoordinated way. Discovering "new knowledge" as part of a planned change takes you out of the "envelope of pathways to a safe outcome".
Anticipate the resources available before your change. If the resources do not match what you expected, find out why.
Error messages from remote commands should extend that pattern with the instance name:

instance: command: noun: error-message
We also should carry exit codes back from remote commands. We should build a structure to examine the exit code from each update, and take action on unexpected results.
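A sketch of such a structure with a plain ssh loop (the host list file is hypothetical; msrc drivers would let the tool report per-instance status):

#!/bin/sh
# run the update on each host, report surprises in the
# instance: command: noun: error-message pattern
while read host; do
	ssh -n "$host" make install	# -n keeps ssh from eating the host list
	rc=$?
	if [ "$rc" -ne 0 ]; then
		echo "$host: make: install: exit code $rc" >&2
	fi
done < hosts.list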
People running operations, development, and change management are working under rules that make sense given the context, indicators, operational status, and their organizational norms. Always look for the outcomes and messages that will cause them to take the best action after each step. Make them aware of something that is not "normal" and they may take action to avoid making it worse. Hiding failures, cross-checks, or other related data from them gives them no context to take compensatory actions.
The common GNU build is a great example of this. A README file in each product source directory is visible in the output of ls. This is being offered to the builder in a culturally normal way, because the most common action of an operator after unpacking a source directory is to cd to it, then run ls. In fact a source directory without one of these is quite rare.
Along the same lines, the file configure in the directory is usually a script built by autoconf. If the README instructs the operator to run that script, then they will usually do that, the expectation being that the operation of that script does no harm. Moreover, that configure script shows you what it finds as part of the process of execution.
After the product is built (and maybe installed) the operator may request the version of the application under a common command-line option, usually -V or --version. This is compared to the last known version to assure that the update did the right thing.
This canonical chain (README to configure to -V) has changed very little in the last 20 years.
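The chain in its most common form (the product name app is illustrative):

$ ls
README  configure  ...
$ ./configure
$ make
$ make install
$ app --version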
Your local site policy should call out which style of information each process requires. If all information must be requested, you should expect more failures.
Changes happen because we need them. I have run production machines with uptimes of more than 2,400 days. There was no compelling reason to update the operating system, so no need to reboot them.
Someone must request each change. That's not to say that the same group issues every change request. Some changes are triggered by different stakeholders than others. Local site policy should state as clearly as possible who requests different changes.
This might be as easy as setting up a pair of nntp news servers and publishing a log of each request and the log of each change supporting that request. I've done that for more than 25 years, with absolutely no regrets. This solution also allows operators to Cc: an e-mail gateway to the news service in Customer correspondences.
TODO notes: rcsdiff, cvs diff, and the like; tickle + email; msync + email? TODO files; level2s msync (all the checks it makes); level3s build; level3s diff.

Focus attention on the basis of earlier assessments (of errors). Limit the scope of new learning (sh, make, m4, rdist, maybe xapply markup). Keep similar circumstances truly similar. History editing is great, but view the entire command, not just the first 80 characters! Adapt this to the way you do your work, or at least meet in the middle: local site policy. Don't get complacent.

Stability is more important than a short-term plan. If a change is so important that it cannot be aborted, then you've already failed. Address organizational pressures to choose schedule over system safety before the change starts. Lack of a testing environment is not acceptable for a risky change.

Error types: mode error; getting lost in display architectures; lack of coordination when changing common configuration inputs; wrong task priority; data overload; not noticing changes (in status, metrics, paths, names, values) [~graphic].

Cognitive consequences of computerization: computers increase demands on people's memory; computers ask people to add to their package of skills and knowledge; computers increase the risk of people falling behind in high-tempo operations; computers can undermine people's formation of accurate mental models of how the system and underlying process works; the knowledge calibration problem: thinking you know how the system works when you know very little of the actual model; compartmentalization limits the reach of relevant information.

Making a list of hosts with a kludge is a bad idea -- be sure that the reason a host is selected is the right one.

Automation traces: op rules log; describe that logging; install's log, if you use ksb's install (describe how to enable that!); local.defs might record each command.
$Id: error.html,v 1.5 2012/11/10 23:14:57 ksb Exp $