I like to use an analogy to explain how poorly managed sites operate today.
My grandmother made the best chicken soup stock. She made large batches of it a few times a year to freeze for later use. Some of these she gave to my mother. The secret of her stock died with her, so we'd take some of her Original Stock to mix with our imitation stock. And that way we could pad it for a while longer. Eventually we would have had to dilute what was left and repackage it to continue the adventure. Thawing the frozen stock to add our imitation stock, then re-freezing the result, was deemed a `bad idea'.
Building new instances by copying existing instances and mixing in replacement finger-files is just like that. You are making use of grandma's stock to make your soup, and eventually you'll find it is too diluted (or tainted) to consume. Every site needs to be able to build new instances of each type from raw materials.
I promise you are not wasting your time here. Even if you never install any of my code, you might learn how to organize your configuration management structures better just by reading this document.
"People who know what they're talking about don't need PowerPoint." -- Steve JobsNow for some PowerPoint I shouldn't need:
Files are managed by a revision control structure (git, RCS, CVS, or the like). That structure may (or may not) manage folders or the migration of files to other names. But it must keep track of "milestones". That is, automation needs to extract each file at a stable revision, even while new revisions are in process.
Products are built from those files with automation (make, ant, or the like) to form a set of files that represent the product. This may require compilation, linking, or any other process that can be done strictly with automation. No fingers allowed after the process starts.
Packages are usually built with a standard tool (rpm, pkg-add, or the like), but could be a simple archive that is installed by some local tool.
All files were input by people, in one way or another. So having been input by fingers is not the issue: having to be input again by fingers is the issue.
A file that is committed to a revision control structure gains more substance than a finger-file by virtue of the ability to recall an exact copy via automation. A product built entirely from revision controlled files can be rebuilt from those files. A package of such products, constructed from a revision controlled recipe, is as repeatable as the parts that made it. Any instance built from those files, products, and packages is just as repeatable as any other artifact.
To make it clear: no finger-files are used to configure any layer. Finger-files are the raw material to build more revision controlled structures, in the longer term. This larger commit-loop is called `progress'.
The gross simplification most technicians enjoy is that these two different sources of data can be grouped into a single bundle. It is true that site policy is just more files in the source repository. But it is not true that we make `everyday changes' to those files. Changing the signature of a production database host, web-server, or application instance without a way to manage and track those changes is a sure way to make management of your site impossible.
That's the whole goal of configuration management: build what you promised from what you have, without errors, and in a timely manner.
A corollary to that: never build any managed element twice. If we can build it well once, we can use that same process to build it as needed.
The most important part of operational configuration management is on-going updates, not the first build or first boot. Having data that was right long ago (but might not be now) is worse than knowing you don't know.
Given that we are not going to reset the clock to get the same timestamps on the contents of a build, we must either ignore the timestamps, or never rebuild an artifact with the same identifier, but different contents. This is a matter for you to decide in your local site policy. (I almost never construct a new product or release with the same name as a previous one.)
Since I extract my revisions with automation (via rcsvg(1), cvs, or git), I do not worry about random changes made by peers. We only move to stable symbolic labels at known intervals.
Moreover, some configured files are different for every instance: for example, the name of the instance itself, its serial number, and at most sites the IP address. Even though these elements change for every instance, they are tracked in local site policy files.
And some configuration files are different on each instance due to differences in the applications and services provisioned. For example, sudo and op configuration files should include only the escalation rules needed to manage each host. Sending unrequired escalation rules to a host is a security incident waiting to happen.
I record all of the site policy for layers 1-4 in a few common ways. This lets a team of fewer than 10 people run more than 3,200 instances without breaking themselves or production. Much higher scale-factors are possible with more support from development groups.
We need a way to record recipes to avoid finger mistakes when driving automation. I use two ways because there are 5 layers and no single way works for all of them. The file tactic is to record recipes in the in-line comments in each file; for the other (multiple-file) layers I use a separate recipe, script, or feed-back-loop to automate each process. Every file can be marked-up with comments, and every process can be automated with a recipe. Every locally built processor should accept comments, for that reason alone.
At layers 1 and 5 we manage a single file by revision. Separate recipe files would double the number of files we manage, and imply a link between revisions and files that you do not want to manage. If you think you've found a file that can't be marked-up, you've never used uudecode, m4, or you are limiting yourself in some other unreasonable way.
For each file I add any required recipe to the comments within the file.
I even use comments to markup parts of a file I need to extract later.
We'll talk about mk and explode later in this document.
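Before we get there, here is a minimal sketch of the single-file tactic; the markup shown is an illustration only, not the actual mk(1) syntax, and the file and install paths are made up:

    $ head -3 ntp.conf
    # ntp.conf -- site NTP policy
    # Recipe: install -c -o root -g wheel -m 444 ntp.conf /etc/ntp.conf
    server 0.pool.ntp.org iburst
    $ sed -n 's/^# Recipe: //p' ntp.conf | sh    # recover and run the embedded recipe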
At layers 2 and 3 I use make recipe files. This is the obvious choice because it works on every platform I manage, and I don't have to use any of the `advanced' features of any specific version. I stick with (almost) plain old V7 recipes.
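For example, a recipe in that spirit might be no more than the sketch below; the program name and install path are hypothetical, and recipe lines must start with a tab:

    $ cat Makefile
    # plain old V7-style recipe: no vendor-specific make extensions needed
    PROG=   whoson
    BIN=    /usr/local/bin

    all: ${PROG}

    ${PROG}: ${PROG}.c
            ${CC} ${CFLAGS} -o ${PROG} ${PROG}.c

    install: all
            install -c ${PROG} ${BIN}/${PROG}
    $ make install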
At layer 4 I use the master source structure that is contained in this directory (mmsrc(8), msrc(8), hxmd(8), wrapw(1), xapply(1), xclate(1), ptbw(1), explode(1), and mkcmd(1)) plus the tools from install_base.
Then some close-the-loop processes check the signature of each instance against either a known-good signature or the last known signature, to look for regressions, failures, or human mistakes.
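A toy version of such a check might look like the sketch below; the host list, signature files, and the set of files hashed are all assumptions, and a real check would also cover packages, permissions, and running services:

    #!/bin/sh
    # Compare each instance's current signature to the last known-good one,
    # and report any drift (regression, failure, or human mistake).
    for host in $(cat hosts.list); do
        ssh "$host" 'cksum /etc/passwd /etc/services /etc/ssh/sshd_config' \
            >"sig/$host.now" 2>/dev/null
        cmp -s "sig/$host.good" "sig/$host.now" || echo "signature drift on $host"
    done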
For all the forms of pull logic I use msrcmux(7), mpull(8), muxcat(1), and rsync(1).
For the forms of push logic I use msrc(8), ssh(1), scp(1), and rdist(1).
The key issue is knowing that the state of the resources you are about to use is stable. That is, that you are not going to incorporate a file that is partially committed (in effect a finger-file) into your deployment. This is the first requirement of any configuration process: "start with what you have". If you start with something you didn't expect to have, you will get results you didn't expect to get.
Use a process to advance (or regress) labels that makes sense in your environment. Close the loop by always reviewing all the uncommitted changes before any update to production. Any uncommitted change should stop the process. Files that have not been committed are (by definition) finger-files and must not be part of a production update. They could build a test environment, but that's site policy -- any local policy allowing uncommitted changes to move to production is a bad one.
I use rcsdiff(1) to check for layer 1 issues. Files with no symbolic label are usually excluded from the build. If they are not in this layer's context, then a check that displays them and stops the process is part of the update.
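A sketch of that kind of gate is below; the label and file list are hypothetical, and under git the same gate could test that git status --porcelain prints nothing:

    #!/bin/sh
    # Refuse to build when any working file differs from the labeled revision,
    # or lacks the label entirely -- either way it is still a finger-file.
    LABEL=PROD_2012_09
    for f in *.conf Makefile; do
        rcsdiff -q -r$LABEL "$f" >/dev/null 2>&1 || {
            echo "$f: not at label $LABEL, stopping the update" >&2
            exit 1
        }
    done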
It is also poor form to leave uncommitted files in the revision control
structure. These are trip-hazards for other engineers.
So I run a recurring tickle(8) task to e-mail engineers that have idle locks older than a few weeks. If that doesn't prompt them I take more direct action, by shaming them before their peers.
I always gather files by symbolic name with rcsvg(1).
Any production build stages a copy of the source under a temporary directory.
That directory is where the build process runs, not any other working copy.
Under git we'd use a known 160-bit SHA1 hash (40 hexadecimal digits, or an unambiguous shorter prefix), but in either case we'd extract the stable source as directly as possible.
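For example, under git the staging step might look like this sketch (the tag and module names are invented; rcsvg does the analogous extraction for RCS files):

    #!/bin/sh
    # Stage the labeled source in a scratch directory and build only there,
    # never in anyone's working copy.
    STAGE=$(mktemp -d /tmp/build.XXXXXX) || exit 1
    git archive --prefix=dns-zone/ REL_1_12 | (cd "$STAGE" && tar xf -)
    (cd "$STAGE/dns-zone" && make all) || exit 1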
In addition to that I check the control recipe used by msrc with msync(8) (or see the HTML document). (The msrc_base package was built that way.)
Once built or packaged the release of the package is known by the name of the package and a unique identifier. The identifier could be a number, a date, or any other unique key that distinguishes that build from any other. For example "msrc_base-2.31" would be a good specification for an older release of these tools.
The same close-the-loop checks that msync uses are used at this level. A hook in the recipe file allows recursion into the product directories to check them as well.
Not much different from a package recipe at this level. I have built instances from source code (FreeBSD), from ISO images of mostly RPM files (Linux), from network boot images (Solaris, HP-UX), and from boot tapes (AIX). Given all those tactics, I can tell you that the details do not matter as much as the structure underneath. Having a manifest of parts and knowing that that manifest is complete and stable makes the process work.
The mechanics of getting an instance booted are widely available: kickstart, jumpstart, DHCP boots, as well as using a remote protocol to mount an ISO image from the boot ROM (iLO, ALOM, virtual provider, or the like). Once you get the image booted, it is quite possible to complete the whole of the configuration with automation. There is no file in a computer that is not made of bits, and bits are easy to write.
You just have to have a policy for the contents of each file, and the order to build and install each part.
If you need to gather site policy, it is going to be for an internal presentation layer (a web site, or the like). That just means converting all the documents to HTML or some other format. So encode your policy in files that are easy to convert to HTML, and easy to process mechanically.
I use the hxmd format for almost all of my automated policy, and HTML for the people-policy.
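For example, a tiny machine-readable policy table might look like the sketch below (an illustrative layout only, not necessarily exact hxmd(8) syntax; the hosts and attributes are invented), and automation can consume it mechanically:

    $ cat site.cf
    # HOST             attributes the configuration structure may ask about
    www1.example.com   TIER=web   LOC=nyc
    db4.example.com    TIER=db    LOC=sfo
    $ awk '!/^#/ && $2 == "TIER=web" {print $1}' site.cf
    www1.example.com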
Each configuration for your hosts, routers, switches, disk arrays, and other IT instances decays just as the rice does. Trying to do a lot of up-front configuration when you install the instance just means it gets stale. Stale, because it is continuous updates that keep instances fresh. Computational, application, and capacity demands change, and software evolves; these factors continuously move the goals your site structure has to meet. Moving goals means changing configuration; it is really that simple.
If it doesn't get stale it gets lost. Lost when you lose the configuration of your instances by leaving stale copies in /etc like group.2006-12-12, or resolv.conf.old. Files like these mean that admins (with a superuser shell) have no confidence that they can fall back to a previous revision of that configuration file. So they leave trash in the filesystem, rather than chance a much harder recovery.
I have confidence, because I know how to make the correct file all the time. And if it doesn't work I know where to fix it, and how to test it. That lets my whole team work faster and with much more agility than anyone using their fingers alone.
That confidence doesn't mean we are careless. We build back-out copies of files we change. In fact install puts them in a directory named OLD for us. But we do not keep those files forever. A recurring purge task removes all the backup clutter from the filesystem. This keeps junk from accumulating, but allows a quick recovery from a fat-finger error, or even a bad commit.
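A sketch of such a purge task is below; the OLD directories and the 90-day window are assumptions to be set by site policy:

    #!/bin/sh
    # Recurring cleanup: drop back-out copies old enough that nobody will
    # ever fall back to them, keeping the quick-recovery window intact.
    find /usr/local/OLD /etc/OLD -type f -mtime +90 -exec rm -f {} + 2>/dev/null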
There are a few mistakes or failures that cause an instance to become inaccessible from the network:
/etc/resolv.conf
/etc/services
ssh access to the instance:
/etc/passwd
missing the privilege separation login for sshd
/var/empty
/etc/ssh/sshd_config
/etc/ssh/ssh_host* keys
/etc/pam.d/* or /etc/pam.conf
SHELL (e.g. nonexistent)
ssh keys may be corrupt
That is about 20 files that could lock you out of a running instance.
There are a few more that can keep an instance from booting, depending on
the operating system.
In all these cases having console access via a conserver with serial port access, out-of-band management iLO access, or other remote console access will save you from anything short of a hardware failure.
Think about the number of files that you might update without locking yourself out of the instance. Pretty much all the other 20,000 files installed on my workstation, which is more than 99%. Don't let the 20 files stop you from automating the 20,000.
I would argue that updating the 20 files with fingers is actually worse than updating all the rest, since the time-to-recover is higher for mistakes in the 20. Since you've automated the risky ones, you would certainly automate the less-risky ones. I don't see any valid argument to not automate as much of the configuration of an instance as possible.
All the files used to update an instance are always from a revision control structure, with the recipes from the same source. What else might impact the results of a build process? I want to consider 3 contextual factors that create differences when building and updating the configuration of an instance. So let me describe those 3 factors, then tackle how to install updates.
The first is which update-target is selected.
We may need to update a configuration file under /etc or a binary file under /usr/local/bin, but we rarely need to rebuild every possible update in a single change.
We select a target to update under master source by directory and possibly by
the update command we apply to that directory.
Thus we must create a unique directory for every target application.
Most applications install the program, the manual page, and any default
configuration (if none exists), as asked. Once we have a way to install
every configuration directory, we can automate installing them all in
the proper order.
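A sketch of driving those per-target directories in a fixed order is below; the directory names are invented, and msrc(8) is what actually runs this kind of loop across many hosts:

    #!/bin/sh
    # Install every configuration directory in a known, repeatable order.
    for dir in base/etc base/ssh app/ntp app/sudoers; do
        (cd "$dir" && make install) || { echo "$dir failed" >&2; exit 1; }
    done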
The next factor is the meta-data about the desired state of the target instance used to configure the directory. Since instances run different mixes of applications, data services, command and control services, and other IT facilities, we need an authoritative data source that tells the configuration structure which to configure on every managed instance. That meta-data is a layer 5 policy which is machine readable. Changes to key meta-data elements may require many configuration directory updates; it is hard to predict which directory uses which meta-data, and for what purpose. This is why changes to site-policy take more care and skill.
The last factor is the build environment: the version of any compiler, library, or other tool that impacts the exact rendering of the source files into a product. Include support for any cross-compiler and machine architecture flags in this group, but the need for such options is meta-data. Any update to the build environment might imply both a major update to the client instances, and a rebuild of all the binary files currently installed.
The usual case is that most modern instances are provisioned with a compiler, most build tools, and any run-time libraries likely to be needed when they are installed. But it is possible to run all the preparation work on a specially provisioned `build farm'. In that case, all the client instances simply install packages of files (via RPM, PKG, or the like).
The difference between push and pull is tactical at that scale. The larger issue is package versus not. When you package an update you must assure that any automated installation accounts for all possible transitions from the existing state to the desired state. A failed installation, via a package installation script, has little choice other than a non-zero exit for any failure. Recovery is much harder to automate if you lack invariants with regard to the availability of required prerequisites.
By packaging only those directories without instance-specific elements (only platform-specific configurations), you may get a higher success rate for package installations. Stand-alone package installations simply fail when any prerequisites are not up to their needs, which is all they can do.
An active push or pull of a product may be able to discover missing
prerequisites to trigger the automated update of out-of-date ones.
Similarly, a structure around packages (like apt, yum, or pca) manages the prerequisites and failures to provide better service. But such a service alone doesn't update every file on a host, because it doesn't have a source of meta-information.
That binary file might be installed someplace under /usr/local or /opt/ depending on the type of operating system. The manual page could be under /usr/local/man or some place else.
There are 4 combinations of policy and environment that we should be able to create. First we'll look at the push model.
mmsrc builds a shadow directory under /tmp (aka $TMPDIR), then uses the local build environment to run the build recipe. This is exactly what we need to build a product on the master host with the local environment. When it is finished it removes any temporary files, so there is no cleanup required. Save a copy of the build directory, when required, with cp or tar.
msrc builds the shadow directory on the client host (in the directory specified in the make macro INTO), then ssh's to the client instance to run the build process.
It leaves the remote copy on the host. This helps debug any failed builds. A cleanup task on each client could remove the (usually small) shadow directory after some delay. I usually just remove the whole shadow hierarchy as a clean-up task after a major update has been in production for a month.
mpull builds a local copy of the master directory using rsync to fetch it from the master server. Then it uses mmsrc to build the directory using the site policy visible on the client. It leaves the configured copy of the source in the directory specified in the make macro INTO, just as msrc would.
The remaining combination would use mmsrc, copy the configured directory back to the master server, build with the local tools, then mock the install process to see what needs to be updated (or package the directory for later update). I have never once needed this.
It is possible to project a copy of the master source for any product (via rsync, rdist, or NFS) to the client, then use an msrcmux service on that client so a build host can request the configured directory from it. Then the build host would trigger the build portion of the recipe, and copy the directory back to the client for the installation or packaging.
Since many master source directories may have file caches or other files created at build-time, read-only mounts over NFS might not work. To fix this, use a modern union mount to allow a transparent overlay of a local filesystem over the read-only NFS mount. This allows the client to build on top of the read-only directory. That mitigates some of the pain of this role reversal.
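On a Linux client the union mount might look like the sketch below (mount points and export path are invented; the BSDs have mount_unionfs for the same trick):

    #!/bin/sh
    # Read-only NFS copy of the master source, with a writable tmpfs layered
    # over it so builds can drop their temporary files locally.
    mkdir -p /msrc/lower /msrc/upper /msrc/build
    mount -t nfs -o ro master:/export/msrc /msrc/lower
    mount -t tmpfs tmpfs /msrc/upper
    mkdir -p /msrc/upper/up /msrc/upper/work
    mount -t overlay overlay \
        -o lowerdir=/msrc/lower,upperdir=/msrc/upper/up,workdir=/msrc/upper/work \
        /msrc/build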
If I did need this, then I would install msrcmux on each client and mount a read-only NFS mount of a local cache of the master source on each client, with a union mount of a tmpfs over it. Then allow a (local) build server to request a configured copy. That host would then remote install the resultant files back to the client, from a temporary directory. This is a lot of effort for a case I've never needed. But I'm sure it would work, since I just tried it.
rsync the INTO directory from the master server into the same directory on the client instance and trigger the installation recipe.
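A rough equivalent with stock tools would be the sketch below; the INTO value and the client name are made up, and this is exactly the step the structure automates for you:

    #!/bin/sh
    # Push the already-configured directory and run its installation recipe.
    INTO=/usr/msrc/ntp
    rsync -a --delete "$INTO/" "client7:$INTO/"
    ssh client7 "cd $INTO && make install"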
msrcmux allows a client to pull the configured directory to the client instance via a tcpmux service. The muxcat application is the usual client. See the msrcmux HTML document, and the muxcat(1) manual-page.
Use mpull to fetch the master directory and build it with local meta-information. See the mpull HTML document, and mpull(8).
The Genesis tactic uses a complete copy of the master source to build every tool in the correct order. This is often used to bring a fail-over copy of the master source host up at a new datacenter, or to create an archive disk that is known to be "pure". See the Genesis HTML document.
The RPM build process turns a single product or package into an RPM file. These files are used in a pull structure to update many clients, or to speed the local build process. My local policy requires that I name the RPM recipe file ITO.spec. This assures that automation can find the correct specification file. Within that file some mk marked lines have additional meta-data about the directory which contains the file. See level2s(8).
Personal builds, in a mortal login's home directory, contain a (mostly complete) instance of the local tools. This copy references a configuration subspace which is wholly contained under the home directory. I use this tactic to test new versions, new products, and to show other admins how nifty the structure is. See the build plan in ksb's HTML document.
Basically we pull each level 2 product via msrcmux down in turn from a make recipe, which forces the correct order and configuration parameters to install all local tools.
While RPMs may be removed from the target system, and a mortal install could be removed, there is no inverse operation for the Genesis build. Genesis is intended to permanently convert an instance into a master source repository. But the list of products to install could be changed by site policy to create other layer 4 signatures.
The msrc_base package contains these tools.
mmsrc comes as a plain C source, which is built with make and a C compiler. It may be reconfigured with autoreconf. We then use that to build all the tools required to reconstruct it: explode and mkcmd. With those we can build the wrappers: ptbw, xclate, xapply, and wrapw. With those we can build and run the push version of the master source structure: hxmd and msrc. And with those installed we can build a new version of mmsrc to complete the loop.
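A sketch of that bootstrap order is below; the directory names are assumptions, and the package's own recipe records the real order:

    #!/bin/sh
    # Bootstrap: each stage is built with only the tools the previous stages
    # installed; rebuilding mmsrc at the end closes the loop.
    for tool in mmsrc explode mkcmd ptbw xclate xapply wrapw hxmd msrc mmsrc; do
        (cd "$tool" && make all install) || exit 1
    done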
The mkcmd tool allows us to reuse the definition of options between programs that share the same specification. The explode tool allows us to select just parts of a larger layer 1 file to incorporate into another product. Together these two allow my tools to share and mix options, common source code, and recipes so that they are never out of step with each other. See the primary corollary.
When you look at the -V output of mmsrc and hxmd you'll see they share the same hostdb.m module. This is the power of mkcmd: that file contains the options and the C code required to support both programs. At the same time mmsrc shares make.m with msrc. This is more than a "trick"; it allows a level of code reuse others only dream they had.
With just the msrc_base tools you could install the pull version of the source structure: msrcmux, mpull, muxcat. Or you could continue with install_base to add better management of target environments. But the core of the structure is stable with just the msrc_base package.
It is up to you to make something of it, or not. I always add install_base, oue, level2s, rcsvg, and msync to any machine I build.
At worst, installing msrc_base wastes a few hours of your time and 5MiB of disk space. In the best case you find a path to avoid service failures, lots of typing, and wasting your time on processes you should have automated long ago. In the worst case, not installing msrc_base and then de-provisioning a critical service could leave you `without income'.
I hope you learned something, even if you never install the software.
In either case, good luck!
-- KS Braunsdorf, September 2012
$Id: msrc.html,v 1.17 2012/09/27 17:23:36 ksb Exp $