This document assumes you have used sh(1), have an understanding of the UNIX™ process model and exit codes, have coded several scripts, and have used gzip and find. It also assumes that you can read the manual page for any other example command. It would help a little if you've used printf(3) or some other percent-markup function, but it's OK if you've not used any before.
What is xapply?

xapply is a generic loop. It iterates over items you provide, running a customized shell command for each pass through the loop. One might code this loop as something like:

for Item in $ARGV
do
	body-part $Item
done

and feel pretty good about it, so why would you need xapply?
The number one reason to use xapply
is that it runs some
of the body-parts in parallel. It starts as many
as you ask it to (using the -P
option),
then, as processes finish, it launches the next iteration of
body-part, until they are all started.
It waits for the running ones to finish before it exits.
The benefit is that we might take advantage of more CPU resources (either as threads on CPU cores, or multiple CPU packages in a host).
Even better, it can manage the output from those parallel
tasks so that each is not all mixed with the others.
Without the -m
switch xapply
assumes you
can figure out which iteration of body-part output each line.
Under the -m
option xapply
groups
the output from each iteration together,
such that one finishes completely before the next one starts.
Like most loops, xapply can skip through the list more than one item at a time.
The -count
option allows you to
visit the items in the argument list in pairs (or groups of count).
This is handy for programs like diff
that need two targets.
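As a sketch (the file names here are hypothetical), comparing old and new copies of two files pair-wise:

xapply -2 'diff %1 %2' report.old report.new index.old index.new

Each iteration consumes two items from the argument list, so the first diff compares report.old with report.new and the second compares index.old with index.new.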
Unlike common loops xapply
keeps track of
critical resources for each iteration.
A body can be bound to a token which it uses
for the life of its task. That resource token (for example a modem) won't
be issued to another iteration until the owner-task is complete, then
it will be allocated to a waiting task.
This allows xapply
to make very efficient use of
limited resources
(and it honors -P
as an upper limit as well).
Xapply
has other friends. In fact it is the core node
that connects xclate, ptbw, and hxmd to each other.
We'll come back to the usefulness of that fact in a bit.
In summary, xapply
lets you take advantage of
all the CPU resources on a host while keeping the tasks and resources
straight.
To raise the overall torque even more it reaches out to share resources,
to collate output, and reuse configuration data.
These features are all coordinated across multiple instances of
xapply
and the related tools.
The gzip utility can be pretty expensive in terms of CPU.
If we want to compress many output files (say *.txt) we could run
something like:
gzip -9 *.txt
Most modern hosts have more than the single CPU that command is going to use.
We might break the list up with some shell magic (like split
(1))
then start a copy of gzip
for each file.
That won't balance the CPUs as one list will inevitably
have most of the small files.
This short list finishes long before the others leaving an idle
CPU with files left to compress.
The shell code to split the list up is also pretty complex. Given a temporary file, it might look like this:
/bin/ls -1 *.txt >$TMP_FILE
LINES=`wc -l <$TMP_FILE`
split -l $((LINES/4+1)) $TMP_FILE $TMP_FILE,
for Start in $TMP_FILE,*
do
	xargs gzip -9 <$Start &
done
wait
rm $TMP_FILE $TMP_FILE,*
With xapply
, we can keep 4 processes running in parallel with:
xapply -P4 "gzip -9" *.txt

That will keep our machine busy for a while! If there are fewer than 4 files we just start as many as we can. More than that will queue until (the smallest or first) one finishes, then start another. This actually sustains a load average on my test machine right at 4.0. The
xapply
process itself is blocked in the wait
system call and therefore uses no CPU, until it is ready to start another
task.
In some cases the list of files might be too long for an argument list.
We can provide the arguments on stdin (or from a file)
with the -f
switch to xapply
:
find . -name \*.txt -print | xapply -f -P4 "gzip -9" -

This is also good because it won't try to compress a file named "*.txt" in the case where the glob doesn't match anything. The other great thing about that is that the first
gzip
task starts as soon as find
can send the first
filename through the pipe!
When find
has queued enough files to block on
the pipe it gives up the CPU to the gzips,
which is exactly what you want. Just before that there are actually
5 tasks on the CPU, which is OK as find
is largely
blocked on I/O while gzip
is busy on the CPU.
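The list can also come from a file rather than a pipe. Assuming we had already saved the names, one per line, in a (hypothetical) file txt.list, the same loop would be:

xapply -f -P4 "gzip -9" txt.list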
Think of xapply as a filter, reading from stdin and writing
as a filter, reading from stdin and writing
to stdout, like awk
would. We'll see in the custom
command section that this is closer to the truth than it looks.
For now just play along.
Because of the parallel tasks xapply
has some unique
issues with I/O.
On the input side we have issues with processes competing for input from stdin. We take several measures to keep the books balanced.
The -count switch and stdin

This xapply command folds input lines 1 and 2 to a single line, then 3 and 4, then 5 and 6 -- and so on to the end of the file:

xapply -f -2 'echo' - -

The two occurrences of stdin, spelled dash ("-") like most UNIX filters, share a common reference. That is, the code knows to read one thing from stdin for each dash, for each iteration, rather than reading all of stdin for the first dash, leaving nothing for the second.
In other words it does what you'd expect. Using -3
and
three dashes reformats the output to present 3 lines as a single output
line.
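For example, the three-line version of the same filter would look like:

xapply -f -3 'echo' - - -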
A filename might contain a newline or other surprising whitespace, which breaks a line-oriented list of names. Find has the -print0 option for just this reason.
Xapply
has the -z
option to read
-print0
output. Some other programs, like hxmd, also use the nul-terminated format.
So the compress example might become:
find . -name \*.txt -print0 | xapply -fz -P4 "gzip -9" -
The -i input option

Under -f the default value is /dev/null.
This lets the parent xapply
use stdin for
input without random child processes consuming bits from it.
To provide a unique word from $HOME/pass.words
to
each of 5 tasks:
xapply -i $HOME/pass.words 'read U && echo %1 $U' 1 2 3 4 5

This has some limits: when the file is too short for the number of tasks the
read
will fail and
the echo won't be executed. (Put 3 words in the
file and try it.) We might want to recycle the words after they've been
used, see below where we explain how
-t
does that.
Since the read
is part of a program it could be part of
a loop, so a variable number of words from the input file could
be read for each task. Under -P
this could be problematic.
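As a minimal sketch of that idea (run without -P so the reads stay ordered), each task here consumes two words from the shared input file:

xapply -i $HOME/pass.words 'read A && read B && echo %1 $A $B' one two

With several tasks racing under -P there is no guarantee which words each read would get.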
Without the -m option, xapply tasks each send output to stdout all jumbled together. This is not evident until you try a large -P jobs
case with
a task that outputs over time (like a long running make).
If you want an example of this you might compare:
xapply -P2 -J4 'ptbw -' '' ''

to the collated version:
xapply -m -P2 -J4 'ptbw -' '' ''
The xclate
processor is xapply
's output
friend. It is not usually your friend, as it is hard to follow all
the rules. In fact some programs, like gzip
, don't
follow the rules very well.
You'll have to compensate for that in
your xapply
spells.
In our example above we'd like to add the -v
switch to
gzip
to see how much compression we are getting:

find . -name \*.txt -print0 | xapply -fz -P4 "gzip -9 -v" -

Which looks OK, until you run it. The start of all the compression lines comes out all at once (the first 4 of them), then the statistics get mixed up with the new headers as they are output. It is a mess.
By adding the -m
switch to the xapply we should be
able to collate the output. However it doesn't work because
the statistics are sent to stderr, so we must compensate with
the addition of a shell descriptor duplication:
find . -name \*.txt -print0 | xapply -fzm -P4 "gzip -9 -v 2>&1" -
The logic in xapply
to manage xclate
is
usually enough for even nested calls. When it is not you'll have
to learn more about xclate
, I'd save that for a major
rain storm, or long trip on a plane.
The xapply
's command line option -s
passes
the squeeze option (also spelled -s
) down
to xclate
. This option allows any task which
doesn't output any text to stdout to exit without
waiting for exclusive access to the collated output stream.
This speeds the start of the next task substantially in cases
where output is rare (and either long, or evenly distributed).
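For example (a sketch, with hypothetical log file names), searching many files where most tasks print nothing:

xapply -s -m -P8 'fgrep -l ERROR %1' *.log

Only the tasks that actually find something wait their turn for the collated stream; the silent ones exit right away.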
The classic apply command uses a printf-like percent expander to help customize commands. As a direct descendant of apply,
xapply
has a similar expander.
As one of my tools it has a lot more power in that expander.
In addition to the apply
feature of binding %1
to the
first parameter, %2
to the second, and so forth,
xapply
has access to a facility called the
dicer.
The dicer is a shorthand notation used to pull substrings out of
a larger string with a known format. For example a line in the
/etc/passwd
file has a well-known format which uses
colons (":") to separate the fields. In every password file
I've ever seen the first field is the login name of the account.
The xapply command

xapply -f 'echo %[1:1]' /etc/passwd

filters the /etc/passwd file into a list of login names.
The dicer expression %[1:1]
says "take the first parameter,
split it on colon (:) then extract the first subfield".
Here are several possible dicer expressions and their expansions, given that %1 expands to /usr/share/man/man1/ls.1.gz:

Expression | Expansion |
---|---|
%1 | /usr/share/man/man1/ls.1.gz |
%[1/2] | usr |
%[1.1] | /usr/share/man/man1/ls |
%[1.1].%[1.2] | /usr/share/man/man1/ls.1 |
%[1/$.1] | ls |

I stuck a nifty one in there: the dollar sign always stands for the last field. The other important point is that %[1/1] would expand to the empty string, since the first field is empty.
The dicer also lets us remove a field with a negative number:
Expression | Expansion |
---|---|
%1 | /usr/share/man/man1/ls.1.gz |
%[1/-1] | usr/share/man/man1/ls.1.gz |
%[1/-2] | /share/man/man1/ls.1.gz |
%[1.-$] | /usr/share/man/man1/ls.1 |
Because splitting on white-space is so common, the blank character is special in that it matches any number of white-space characters. Escape any of blank, a digit, close-bracket, or backslash with a backslash to force it to be taken literally.
Later versions of xapply
also allow access to the
mixer which allows the selection of characters from a
dicer expression. That is slightly beyond the scope of
this document. As an example, %(3,$-1)
is the
expression to reverse the characters in %3
.
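As a small sketch of the idea (assuming your build of xapply includes the mixer), the same form applied to the first parameter reverses its characters:

xapply 'echo %(1,$-1)' stressed

which should echo "desserts".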
All these tools use the same mixer+dicer expression syntax:
xapply, mk, and sbp.
There are a few other features xapply provides: viz. shells, escape characters, and padding.
The -S shell
option lets you select a shell for
the command built to start each task. I would use ksh
or
sh
if it were me. You could set $SHELL
to anything you like, but that might confuse other programs that use
xapply
, so stick to -S
.
As a special case when you set -S perl
it changes the behavior of xapply
.
To introduce the command string
it uses perl -e
rather than the Bourne shell
compatible $SHELL -c
.
It might also setup -A
differently (see below).
When more than one input file is given with -f,
xapply
matches the corresponding lines from each file as
parameter pairs. When only one of the files runs out of lines the
empty string is provided as the element from the other. You can change this
pad string to anything you like, for example -p /dev/null
.
In one of our first examples we joined pairs of lines. What happens if
there is only 1 line? The echo command gets an extra space on the end,
which it trims. To see that we can replace the default expansion with
a quoted one, and run it through cat -v
:
echo A | xapply -f -2 'echo "%*"' - - | cat -ve

This outputs "A $" (without the quotes).
There are alternatives. Under -p we can detect a sentinel value in place of the missing line. Say, for example, that a comma on a line by itself could never be an element of the input; then -p , would let us detect the missing even line with

xapply -f -2 -p , '... if [ _"%2" = _"," ] ; then ...'
It is usually considered good form to exit from a task
as soon as possible. With this in mind the above trap might be better
coded as:
'... [ _"%2" = _"," ] && exit; ...'
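A complete (if contrived) sketch of the same trap, joining pairs of lines and skipping the padded tail:

echo A | xapply -f -2 -p , '[ _"%2" = _"," ] && exit; echo %1 %2' - -

The single input line leaves %2 padded with the comma, so the task exits before the echo runs.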
You may change the expansion character from percent to almost any other character with the
-a option. Take care that the symbol you pick is quoted
option. Take care that the symbol you pick is quoted
from the shell.
Viz. "xapply -a ~
..." is not what you'd want under
csh
or ksh
, since the tilde gets expanded to a
path to someone's home directory.
Because xapply
is driven from mkcmd
it takes
the full list of character expressions (-a "^B"
is
ASCII stx
, -a M-A
is code 230), that
doesn't mean you should use them. Try to stick with percent if you can.
In ksh
that makes some let
,
$((...))
, and ${NAME%glob}
parameter substitutions require %%
to
get a literal percent sign.
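For instance (my own sketch, not from the manual), moving the marker to @ with -a keeps the shell's percent literal:

xapply -a @ 'B=@1; echo ${B%.txt}' report.txt notes.txt

With the default marker the same template would have to spell the pattern strip as ${B%%.txt}.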
Since xapply is emulating a generic loop it stands to reason that there would be a "loop counter".
The loop counter is
named %u
, which stands for "unique".
Since I'm a C programmer, I start the loop counter at zero (0) and
bump it up one for each trip through the loop.
For example to output the numbers 0 to 4 next to the letter 'A' to 'E':
xapply 'echo %u %1' A B C D E
A better use of this might be to process data from one iteration to
the next (making generations of a file with the extension .%u
).
Use of the ksh
built-in math operations to build a
function based on %u
is common. To queue many
at
jobs about 5 minutes apart:
xapply -x 'at now + $((%u*5)) minutes < %1' *.job

The
-x
option lets you see the commands executed on stderr.
This emulates set -x
in Bourne shell.
Suppose we want to mail a greeting to each person on a list, taking names from names.cl and matching addresses from address.cl:

xapply -f -2 'Mail -s "Hi %1" "%2" <greeting.mail' names.cl address.cl

That will expand an unbalanced grave quote in the subject argument (for a name like "Paul d`Abrose"). Even worse, we might try to run "Abrose" as a shell command.
A program should be safe from such corner cases, like a filename with
a quote or control character in the name. On input xapply
can use the -print0
-style, on output we depend on the shell.
To make a parameter safer there is a
q
modifier that
tells xapply
that you are going to wrap the expansion in shell double-quotes, and that
you'd like the resulting dequoted text to be the original value.
By spelling the expansion as:
xapply -f -2 'Mail -s "Hi %q1" "%q2" <greeting.mail' names.cl address.cl

We're asking
xapply
to backslash any of double-quote, grave,
dollar, or backslash in the target text, so the command is presented to
the shell as:
Mail -s "Hi Paul d\`Abrose" "pa@example.com"...
This is not always enough; sometimes the data should be passed through
a scrubber, or sent to /dev/null
, if you don't trust it.
The %+ markup shifts the parameters over one to the left, then expands
the new cmd
(replacing the
%+
) then continues with the rest of
the original cmd
.
An example makes this a little clearer:

xapply -n -2 "( %+ )" "echo %1 %1" ksb rm /tmp/bob

outputs

( echo ksb ksb )
( rm /tmp/bob )
This is really a lot more useful when the input is a pipe
(viz. under -fz
).
A program can match commands to parameters and send the
paired stream to xapply
for parallel execution.
This is exactly how hxmd
works.
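Here is a toy sketch of that idea (the command/parameter pairs are my own): a generator writes nul-terminated cmd,parameter pairs and xapply runs each pair with %+:

printf '%s\0' 'echo hello %1' world 'uname -s' '' | xapply -fz -2 '%+' - -

The first pair runs "echo hello world", the second runs "uname -s" with an empty (ignored) parameter.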
When xapply doesn't get any arguments to use as parameters it shouldn't run anything (unlike busted xargs).
In a few cases it might be nice to have an "else" part (like a Python
while
loop). The -N else
option allows
a command to run when we didn't get any tasks started.
Let's rework our compression filter: we'll misspell the extension we are looking for (so we don't match anything) and put in a message when we do not find anything to compress.

find . -name \*.text -print0 | xapply -fzm -P4 -N "echo nothing to compress 1>&2" "gzip -9 -v 2>&1" -
This is mostly used in scripts to give the Customer a warm feeling that we looked, but didn't find anything to do.
xapply
is very predictable.
When we run the examples on the same input, we are apt to get the same
output. All that changes when we allow xapply
to
start a ptbw
to manage a resource.
Each line of a ptbw
resource file represents a
unique resource that is allocated to a task.
A resource could be anything: a CPU, a filesystem, a VX disk group, or a network address. I picked a modem in these examples because the need for exclusive use to dial a phone number is clear.
If we have 3 modems connected to a host
on /dev/cuaa0
, /dev/cuaa1
, and
/dev/ttyCA
we can put those strings in a file
called ~/lib/modems
. Then we can ask xapply
to reserve 1 modem for each command:
xapply -f -t ~/lib/modems -R 1 'myDialer -d %t1 %1' phone.list

No matter how many phone numbers are in
phone.list
we
will never try to dial different numbers on the same modem.
This is because xapply
and ptbw
know how
to work with each other to keep the books straight.
We can force a new ptbw
instance into our
process tree by using the -t
option, the -J
,
or a -R
option with any value greater than 0.
If we don't use any of those options xapply
uses
the internal function iota
just as ptbw
does, but doesn't insert an instance in the process tree, so any
enclosing ptbw
will be directly visible to each task.
The new expander form %t1
expands to the modem selected.
The -R option specifies how many resources to allocate
to each task.
All of the dicer forms we saw above might be applied to a resource; given that %t1 expands to /dev/cuaa1:
Expression | Expansion |
---|---|
%t[1/$] | cuaa1 |
%t[1/-$] | /dev |
%t[1.-$] | /dev/cuaa1 |
If we use the resource to allocate CPUs we might want to get
more than one to a task. In that case we can tell ptbw
to just bind unique integers as the resources. On a 16 CPU machine
we could divide the host into 5 partitions of 3 CPUs:

xapply -J5 -R3 -f -P5 'myWorker %t*' task.cl

The
-J5 -R3
is passed along to ptbw
to
build a tableau that is five by three, then xapply
consults that to allocate resources. The %t*
passes
the names of the CPUs provided down to myWorker
.
xapply in xclate

The -e var=dicer option allows any environment variable to be set to a dicer expression.
To specify the modem in $MODEM
(rather than in an option):
xapply -f -t ~/lib/modems -R 1 -e "MODEM=%t1" 'myDialer %1' phone.list
This is also really useful to send options down to xclate
in
XCLATE_1
to set headers and footers on collated output:

XCLATE_1='-T "loop %{L}"' xapply -m -e L=%u 'echo' A B C

For more on the use of
XCLATE_n
see the
xclate HTML document.
Here is why xapply
has to set the variable; the xclate
output filter is launched as a peer process to the echo
command,
so changing $L
in the command won't give it a new value
in the (already running) process. We can't set it in the parent shell
as it won't change for each task, so xapply
needs to be able
to set it.
The -u option forces xapply to pass the value of %u
to any output xclate
as the xid.
Using that, the above example becomes:

XCLATE_1='-T "loop %x"' xapply -m -u 'echo' A B C

but that's not the reason this option exists.
When another processor (say hxmd
) wants to know which of
several tasks has completed it can call xapply
with
-u
and xclate
with -N notify
.
Then xclate
reports the completion of each task with
the number of the task as the xid
on the resource
given to -N
.
This makes xapply
an excellent "back-end" program to manage
parallel tasks, although it works best from a C or perl program.
Here is an example where we use notify to
show the order of completed tasks:

xclate -m -N '|tr -u \\000 \\n|while read N; do echo fini $N; done' -- \
	xapply -m -u -P5 'sleep' 3 2 5 2 3

It would be sad if we couldn't get the exit code from each task, but we can. Try the same with a
-r
switch passed to xclate
.
The two numbers are the exit status, and the xid.
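That variant might look something like this (a sketch; I am assuming the status comes first, as described above):

xclate -m -r -N '|tr -u \\000 \\n|while read S N; do echo task $N exited $S; done' -- \
	xapply -m -u -P5 'sleep' 3 2 5 2 3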
Also try both of those without
the -u
option to xapply
, in
one case you get the number of the task, in the other the number of
seconds slept (which is the value of %1
).
The observant student might think
this looks like it was designed to be given as input to an instance
of xapply -fz
.
Another possible use is hxmd
's retry logic.
One last corner case: the -r
output for -N
's
command is encoded as task "00". Thus it is distinguishable, as a
string, from the first task (given as "0"). This is the same hack
the new rmt
program uses to tell the client it has
a new more advanced command set.
The ptbw program allows a shorthand (the -A option) to access the recovered resources as shell positional parameters. For historical reasons this option is also provided by xapply. In the xapply
case the shell parameters ($1
, $2
, ...) become
run-time versions of the expander names (%t1
, %t2
, ...).
That makes our command line modem example look like:
xapply -f -t ~/lib/modems -R 1 -A 'myDialer -d $1 %1' phone.list

We don't have to specify a -e MODEM, we can just force the name into $1 and use it from there. This even works
and use it from there. This even works
when the -S
option selects
perl
as the shell, or
even worse tcsh
.
See the ptbw HTML document for more ideas about how to setup resource pools and using them from the command-line and from scripts.
xapply as a co-process

See the ksh manual page under Co-Processes if you've never heard of this before.
Because of the way xapply
is designed
it makes a really great co-process. It manages a list of tasks
given to it on stdin
, and outputs a list
of results on stdout
-- which is exactly
what a co-process service should do.
For a real turbo let's start our gzip
loop
as a co-process in a fair mockup of a workstation dump structure.
Say we want to dump many workstations in parallel to a large file server.
We are going to ssh
to each client to run
dump
(8) over a list of filesystems.
But we need to limit the impact to each workstation owner's
desktop, so let's run the compression for the files
locally on the file server. For a start I'm going to assume
that the file server can run at least 4 processes at a time.
I'm going to simplify the code a little to show the inner loop
for a single host here.
We'll start a co-process that keeps 3 gzip
tasks
running. To do that it reads the names
of the files to compress from stdin
, so
the main script outputs each completed dump archive to the co-process
with print -p
, if it is marked in the list
as "gzip
". After all the hosts are
finished we close the co-process's input, then wait
for
it to finish.
#!/bin/ksh
# comments and some argument processing
: ${SSH_AUTH_SOCK?'must have an ssh agent to run automated backups'}
unset TAPE RMP RSH
...
nice xapply -P3 -f 'gzip -7v %1 1>&3 2>&3' - 3>gzip.log |&
...
for TARGET in ... ; do
	...
	while read FS WHERE COMPRESS junk ; do
		ssh root@$TARGET -x -n su -m operator -c "'/bin/sync; exec /sbin/dump -0uL -C16 -f - $FS'" >$WHERE.dump
		[ _${COMPRESS:-no} = _gzip ] && print -p $WHERE.dump
	done <<-\!
		/	slash	gzip
		/var	var	gzip
		/usr	usr	gzip
		/home	home	gzip
		/var/ftp	var_ftp	no
		...
	!
done
exec 3>&p; exec 3>&-
wait
# cat gzip.log
exit 0
In the real code we run several hosts in parallel. Also the list of
target filesystems is not from a here document: but that would be
much harder to explain here. I put in a comment where one might
display (or process) the log from all the gzip
processes. This might be used to feed-back and tune the compression
levels or exclude dumps that grow when compressed (viz. compressed tar files tend to do that, as in /var/ftp).
The reason this is a good structure is that the number of compression
tasks is controlled with a single -P3
specification: when we move the process to a newer host we can tune it
up to use most of the CPU, saving just enough to run ssh
to fetch backups from our client hosts. In the production script
the parallel factor is a command-line option, and the outer loop also
processes multiple client hosts in
parallel with xapply
.
Conversely when we need more resources for the incoming dump streams we
can reduce -P
, or
tune the nice
options to
focus more effort on the ssh
encryption tasks.
And to simplify the code we could use a pipeline to compress the dumps
as they stream in from the client, but that slows down the over-all
throughput of the process to the speed of the backup host, which may
have more disks than brains.
While running xapply as a co-process you might look at
a pstree
(aka. ptree
) of
the processes doing the work. What you should see is the peer
instance of xapply
with some workers below it,
and sometimes a defunct
process or two
waiting to be reaped. These don't hurt anything, it is just the way
xapply
blocks reading input before it checks
for finished tasks. Here is a simple example, using your own
ksh
as the master process.
$ nice xapply -f -P3 'sleep %1; date 1>&3' - 3>log.$$ |&
$ jobs
[1] +  Running                 nice xapply -P3 -f "sleep %1; date 1>&3" - 3>
$ print -p 10
$ ptree -n $$
1380  ksh -i -o vi -o viraw
  31057  xapply -P3 -f sleep %1; date 1>&3 -
    31058  /bin/ksh -c sleep 10; date 1>&3
      31063  sleep 10
  31059  ptree -n 1380
$ print -p 20 ; print -p 22 ; print -p 21
$ ptree -n $$
1380  ksh -i -o vi -o viraw
  31148  xapply -P3 -f sleep %1; date 1>&3 -
    31149  /bin/ksh -c sleep 20; date 1>&3
      31161  sleep 20
    31150  /bin/ksh -c sleep 22; date 1>&3
      31163  sleep 22
    31152  /bin/ksh -c sleep 21; date 1>&3
      31162  sleep 21
  31164  ptree -n 1380
$ sleep 30
$ ptree -n $$
1380  ksh -i -o vi -o viraw
  31148  xapply -P3 -f sleep %1; date 1>&3 -
    31150  ()
    31152  ()
  31168  ptree -n 1380
$ exec 4>&p ; exec 4>&-
[1] +  Done                    nice xapply -P3 -f "sleep %1; date 1>&3" - 3>
$ wc -l log.$$
       4 log.1380
$ rm log.$$
The reason we see 2 exited children under
the co-process xapply
is that xapply
was blocked waiting for a child to
exit
until one did (to free up a slot), then it
noticed that there were no more tasks to launch (when we moved and closed the
p
descriptor). So it waited for the
other children then exit
'd itself.
Always remember that the co-process can be an entire pipeline, which is
better than just a single xapply
.
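For example, a sketch of a pipeline co-process (assuming your ksh accepts a pipeline before |&) that both logs the compression statistics and passes them back to the parent script:

nice xapply -P3 -f 'gzip -9v %1 2>&1' - | tee gzip.log |&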
I use the nice
to start my co-process command
and the |&
to end it as
structural documentation in the script.
The nice
also puts the main script at an advantage,
but you could do the opposite and use op
(or
sudo
) to get better scheduling priority,
a different effective uid, or some other escalation for
the co-process. If you need the exit codes from the processes see
a note above about using
a wrapped xclate
to do that.
Every one of these tools takes -V to output a useful version banner, and -h to output a brief on-line help message. So xapply does.
There is more about xapply in the hxmd HTML document and the msrc HTML document.
$Id: xapply.html,v 3.19 2010/08/13 17:19:58 ksb Exp $