xapply
.
sh
(1), and
have an understanding of the UNIX process model and have used GNU
parallel
for a while. Or if you've used
dsh
see the section below.
If you've been using find
as a parallel
process tool, there is help for that too.
parallel
and xapply
xapply doesn't try to allow multiple sources of tasks to be mixed ad hoc.
Parallel offers to mix iteration elements from the command-line with stdin
and fixed strings, while xapply acts more like a
filter: items come from stdin or from fixed words, not both.
Every replacement markup in
Here is an example of
Or you are looking for
In my tool chain this is best done with
The list of hosts contains more than just the names of the hosts.
In fact any attribute related to the host might
be listed in the file. See the
Given that file we can process the three listed hosts. I'll put the
script I want to run in a file named
The cache recipe file contains the parts of the process that are marked-up
for every execution. This reduces keyboard errors and makes the process
more repeatable. The recipe is run through
By referencing the name of the cache directory on
the
The file
Note that every file may have a revision control comment in it,
which is a very good idea. Also note that
We encapsulate each operation in a directory so we may reuse them in
different combinations (and orders) to provide derived services. It is
possible to have any directory recursively call another, as well.
To run that for the same hosts:
The little detail is that the
The real loss here is in the reuse we got from the combination for
cache directories. In the
I usually use
Using the cache directory above would fetch the report from the
target host, then send it back as the file
By limiting the changes to the command-line we allow rapid development of
common tasks, and then quick integration into existing automation.
By using a configuration file format with arbitrary attribute macros
these 5 tools all read natively (
And everything should always be revision controlled.
Normally
For example we might
It is much more portable to use
In addition to the permutations done by the output order from the
parallel tasks,
Examples from the
For all of these you would actually embed the command in a recipe file:
either a
The other option would be to use
The minimal required recipe (to send no files) would be
To save even more typing add an
Back to Xapply
Xapply takes the filter approach: if you want
a stream of words, then build that stream with a program that does
just that. Xapply reads stdin
in two formats: NUL (aka \000) terminated and NL (aka \n) terminated.
We don't need a way to
mix files, fixed arguments, and stdin built into
the processor (we have perl for that).
Actually I've almost never needed more than sed and oue to
clean up any iteration list.
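For example, a couple of ordinary filters can build the stream, and xapply just consumes it (the file names here are made up):
$ { ls *.log; cat extra.list; } | sed -e '/^$/d' | oue | xapply -fP4 'gzip -9 %1' -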
Markup differences
The secondary difference is that parallel's markup
tries to read your intent more than xapply's.
For example the markup
{.}
under parallel
treats
removing the dot-suffix from "sub.dir/foo.jpg" (.jpg is removed)
differently than "sub.dir/bar" (nothing is removed), while
xapply
allows either
%[1/$.-$]
to remove the suffix on the basename,
or %[1.-$]
to remove a suffix that might
include a slash (/
).
I believe parallel's markup is more likely to
bleed complexity into other programs, and is less likely to solve any
real-world problem. (When you are not sure there is a dot-extender on
the filename, then fix the invariant so you are sure.)
Every replacement markup in parallel is a unique case:
each has a unique command-line option to change the spelling, and unique rules.
Xapply
uses a more complex markup, but every
expansion supported by the dicer and mixer is general. That is to say
that the same dicer markup is used in mk
,
oue
, and other tools.
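For example, a small sketch of the dicer on a single path; %[1/-$] keeps everything before the last slash and %[1/$] keeps the last component (the same spellings used in the find example later on this page):
$ echo sub.dir/foo.jpg | xapply -f 'echo dir=%[1/-$] base=%[1/$]' -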
Simple command-line conversions
-0
or --null
-z
. This is always a good idea if you
had to build an iterator script to output a mix of fixed strings,
files
, and stdin
.
-a
input-file
-f
and list all
input-files
as positional parameters.
There is no interaction between stdin
and
-f
unless you specify a dash in the
positional parameters, in which case the default for -i
is changed to /dev/null
.
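For example (the list files are made up): two input files become two positional files under -f, and a trailing dash takes the remaining items from stdin:
$ xapply -2 -f 'cmp %1 %2' old.list new.list
$ ls *.conf | xapply -f 'cp %1 %1.bak' -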
--basefile
file
msrc
to send files to a remote host (and
recover any output files).
--bg
That is what nohup and daemon are for.
--cleanup
msrc
's -m
option,
which may trigger any recipe cleanup required.
--colsep
regexp
--delimiter
delim
--E
eof-str
Build a NUL terminated list, then use -count to
set the number of columns. If you need a delim,
do it in that filter.
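A sketch of that conversion: split the records in your own filter, then read the columns as consecutive items from stdin (two dashes with -2):
$ awk -F: '{print $1; print $7}' /etc/passwd | xapply -2 -f 'echo %1 logs in with %2' - -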
--dry-run
-n
.
--eta
--group
--ungroup
-m
and wrap your xapply
in an explicit xclate
, if you want even more
features (like exit code processing). Or use the hxmd
program built on top of xapply
.
--halt-on-error
Use the USR1 signal and %p markup to find the correct pid to signal xapply
to stop adding more tasks. Use hxmd's retry
logic to handle any recovery (or code your own with
xclate's -N option).
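A sketch of stopping after the first failure, assuming %p expands to the pid to signal as described above (run-step and tasks.list are made up):
$ xapply -P4 -f 'run-step %1 || kill -USR1 %p' tasks.list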
--joblog
xapply
, but hxmd
's
REDO logic could provide some of the same data.
The date and run-time information could be provided by
changing -c
, but this would really need to
be done in a script that called hxmd
because
the command-line would be quite long and complex.
-k
xclate
manager doesn't have a way to
force the order. If you want to collate the output, then write each
output stream to a file named as a function of
%u
. For example, one may build
a temporary directory (/tmp/ksb
) to
stash the output from several tasks:
$ xapply -P8 -f 'long-shot %1 >/tmp/ksb/%u' moons
Now a second pass with the same list will match keys to output files:
$ xapply -f 'summary %W1 /tmp/ksb/%u' moons
-L
max-lines
-f
and -
count
and -p
pad
to get the same effect.
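A sketch of taking three items per command from a single stream, in the same way the eight-dash example later on this page does (generate-list and process are made up):
$ generate-list | xapply -3 -f 'process %1 %2 %3' - - -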
--load
max-load
xapply
do this. A
task manager like hxmd
could sample the load
before injecting tasks into an xapply
queue.
In my experience the system load average alone is
not enough information to provide a task manager with sufficient feed-back.
It might have to sample any combination of swap space, available physical
memory, disk input/output utilization, and network throughput.
--xargs
fmt
or adjust
to group arguments.
--onall
hxmd
.
--files
-k
above. Another
way to force stdout
to a file is to prefix
the command with "exec >/tmp/ksb/%u;
" which
doesn't limit the number of shell commands which might be listed in
cmd
.
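For example, reusing the /tmp/ksb directory and %u key from the -k entry above:
$ xapply -P8 -f 'exec >/tmp/ksb/%u; long-shot %1; summary %1' moons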
--pipe
--block-size
size
NUL
terminated input blocks, and
use -z
. Any processing you need to do to
the input stream is better encoded in your own filter.
--progress
There is no way to make xapply show you a
progress status, since it doesn't know the total number of tasks
it might run. I have executed xapply processes
that have run for weeks, reading their input from clients connecting to a FIFO.
--number-of-cpus
ptbw
.
--interactive
Prefix the cmd with a prompt, like:
echo Run %W*? 1>&2;read a</dev/tty;expr "_$a" : "_[yY].*" >/dev/null || exit 0
or wrap your command in such a program.
--quote
Xapply
supports
3 levels of quoting:
%q
Escapes only the characters that are active inside sh (or ksh) double quotes,
that is any of these four characters: \, ", $, or `.
%Q
Escapes the shell meta-characters (but not IFS):
`, $, \, ", ~, *, ?, [, |, &, ;, (, ), #, =, ', <, or >.
%W
Escapes everything %Q would, plus the
standard IFS list (space, tab, and newline).
These prefixes allow some parameters to be quoted, while others
are not. For example:
xapply -2 -fp red '%1 %Q2' brush colors
--no-run-if-empty
grep
or
sed
.
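For example, grep drops the empty lines before xapply ever sees them (generate-list and process are made up):
$ generate-list | grep . | xapply -f 'process %1' -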
--retries
Xapply doesn't know a computer name from
any other parameter; you are looking for hxmd.
--return
filename
xapply
. If you are to structure a
task to process data on a remote host and send return files back, I
would use either msrc
-m
(see the hxmd
's
--semaphore
-t
. The
ptbw
token broker acts as a semaphore
handler in most cases.
--sshloginfile
hxmd
's
-C
option, which lets you specify a
whole lot more than a few fixed parameters.
--tty
-i/dev/tty
.
--timeout
sec
There is no time limit built into xapply, in any case. In a pinch you could use
mk's resource limits, but that's a little over-kill. The mk markup to do that:
#!/bin/sh
# Use "mk -mLimit" to run with a 20 second wall-clock time limit:
# $Limit,clock=20: %f -t -P2
# ...
--transfer
--trc
msrc
.
--trim
sed
.
--xapply
--shebang
mk
and hxmd
.
Since hxmd
takes comments in the list of
targets we embed a marked line (see mk
's
I do not think the pun of a configuration file as a script is a
great idea, but local policy allows other things I don't like.
#!/usr/bin/env -S mk -mRun
# $Run: ${hxmd:-hxmd} -C%f some-processor
list-of-hosts and attributes
Remember to chmod it +x.
-F
. You can use
xapply
as an interpreter with something like:
#!/usr/bin/env -S xapply -P2 -F
gzip -9 %1
The push, remote execute, pull model
The most useful meme encoded in parallel is
the idea that one might visit a task on a list of hosts with some
data file, then return the results back to the driving host.
While that's not hard to explicitly spell under xapply,
it is surely easier to cast with parallel.
In my tool chain this is best done with msrc
(or plain hxmd if you'd rather walk).
We break the task down into the orthonormal parts: the list of
target hosts (site.cf
), the recipe
(report/Cache.m4
) and the remote
script to run (report/cmd
). The last two,
taken together, form an hxmd cache directory.
The hxmd configuration file site.cf:
# $Id:...
%HOST COLOR CPUs
sulaco black 2
ripley grey 4
lv426 cream 1
report/cmd
:
# $Id:...
date
uptime
exit 0
The recipe is run through m4 for each host in site.cf
so that the attributes of each host can tune the actions of the recipe. Then the recipe is
used as a make
recipe file to build the required
data for each target host, which is also marked-up in
m4
so that it can be processed for each
target host to tweak the recipe for attribute values (like
CPUs
).
Here is a very simple report/Cache.m4
:
`# $Id:...
report: FRC
	ssh -x 'ifdef(`REMOTE_LOGIN',`REMOTE_LOGIN@')HOST` /bin/sh <cmd
# Shell completion might put a trailing slash on our directory name -- ksb
'HOST`: report
FRC:
'dnl
The file cmd in the report
recipe allows us to push commands to the target host without
quoting them from m4, make, and
the shell. We'll use the cmd script from
above.
By referencing the name of the cache directory on the hxmd command-line, we force
the m4
processing of
the Cache.m4
recipe in that directory and
the make
update of the name of
the directory (as the target
).
The update rule for the HOST
macro is only triggered
when the directory name is suffixed with a slash, due to the rules
hxmd
uses to create the update target.
$ hxmd -P10 -Csite.cf cat report
Which outputs:
sulaco:
Tue May 1 16:21:06 MDT 2012
2:34PM up 55 days, 23:54, ...
ripley:
Tue May 1 16:21:06 MDT 2012
2:34PM up 133 days, 10:00, ...
lv426:
Tue May 1 16:21:07 MDT 2012
2:34PM up 144 days, 20:32, ...
Change "-P10" to the options "-dCX -P1" to see how it works.
The cmd in the report directory
could take any actions required on the remote host (as long as it doesn't
need to read stdin).
This model scales out to thousands of hosts with attribute tunes for as many
cases as needed to meet your needs.
REMOTE_LOGIN
may be defined to map the local login to any remote login, even on a
per-host basis.
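For example, REMOTE_LOGIN could be supplied as one more attribute column in site.cf (a sketch; the logins are made up):
# $Id:...
%HOST COLOR CPUs REMOTE_LOGIN
sulaco black 2 operator
ripley grey 4 ops
lv426 cream 1 operator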
Using msrc to repeat that task
To do that same task with msrc using a punned control
recipe, we need a make recipe that offers the required
macros to msrc, with the report script encoded
as an update rule, and nothing else:
# $Id:...
INTO=/tmp/ksb.1
IGNORE=+++
report: FRC
	date
	uptime
FRC:
$ msrc -P10 -Csite.cf make report
Note that msrc data recovery
only goes to stdout: with
hxmd the data is actually cached in
a local file, which makes it easier to use for
additional processing. Under hxmd
we use
cat
to display the "report", while under
msrc
we use make
to
run the display on the remote host. That is an important detail (the display
runs on the remote host, not on the local host).
Under the msrc tactic we
code the cmd in the recipe, and must
use make
markup to quote dollar sign
($
) and avoid command failures that
would stop the process.
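For example, a sketch of a report rule that copes with both issues: the awk dollar signs are doubled for make, and the leading dash keeps make going if the command fails:
report: FRC
	-df -k | awk '{print $$5, $$6}'
	uptime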
I usually use msrc for software builds,
hxmd for process control scripts, and
xapply for ad hoc status polling.
report
.
This is most useful when the process includes an update to the
content as it is processed (in at least one direction). This would
be triggered by including the name of the directory in the
MAP macro list.
See the msrc documentation; MAPed files are used
much more than MAPed cache directories.
The common wins with hxmd and msrc
With these tools you can specify a subset of a whole population with
some host selection options (which work for both tools exactly the
same way). For example you might target a single test host:
$ msrc -G prometheus -Csite.cf make report
(I replaced "-P10" with an explicit host selection via -G.)
Because these tools all read the same configuration file format natively
(mmsrc, msrc, hxmd, efmd, and distrib)
and others can parse it by proxy (via efmd), we
can share the host data between interactive tasks, across political groups,
and use them in diverse automation applications.
Conversions from find's execution options to xapply
Find
is a great utility for producing a source-stream
for a parallel task. Some non-standard additions have been made to
find
to reduce the number of check processes the
-exec
primitive forks to
search the filesystem. I think there are better ways to improve
the overall through-put of a find
pipeline.
find's -exec should be parallelized with:
$ find pathname ... expression -print0 |xapply -mfzP 'template' -
This pipeline allows find to traverse the
filesystem without any logic to manage forked
processes. We let find focus on the filesystem,
while xapply
manages the processes. Tuning
xapply
's the parallel factor (under
-P
) added more parallel processes, adding
an xclate
wrapper or ptbw
governor, or status code stream is now possible, where it is not
with find
`managing' the execution.
This is a very powerful meme: by running a process in the context of
a different directory we may leverage another invariant to increase
our parallelism.
Find's -execdir
Find
imposes a limit that
we'll refactor here: the name of the file we locate must be
the program we want to run.
By using xapply
we remove that restriction.
For example we might find
a make
recipe file (-name '[mM]akefile'
) or a file
with a locally meaningful extender (viz. ".lme"), neither of which
need be the program we want to execute. Using the dicer we can
select the directory, then run the processor of our choice:
$ find pathname ... -name '[mM]akefile' -print0 |
	xapply -mfzP8 'cd %[1/-$] && make -f %[1/$] found' -
Find's + hack is really a binpack
The OpenBSD hack to find (see the manual page) allows
multiple arguments to be joined into a single execution of the
target utility, but it is really not
portable across versions of find:
$ find pathname ... -name '*.lme' -exec bundle-process {} +
Use -print0 to build a path list that is NUL (\000) terminated.
Then use xapply -z to process the list.
$ find pathname ... -name '*.lme' -print0 |
	xapply -mfzP13 -8 'bundle-process %W1 %W2 %W3 %W4 %W5 %W6 %W7 %W8' \
		- - - - - - - -
If you want to group the maximum number of elements for
each command (like the OpenBSD + feature does)
use the binpack filter under
the -zN options to group the files, then
feed the list to xapply:
$ find pathname ... -print0 | binpack -zN bundle-process | xapply -mfP10 '' -
If you have a lot of filenames with special characters in them this
may exceed kern.argmax; tune the limit down
with -b (divide by 2 always works). Since most
filenames do not have shell meta-characters in them, this almost
never happens. (Or tune -w down to make less optimal
bins.)
binpack
permutes the order of
the files as it packs them into bins. If you require a (more) stable order,
just use a simple perl
filter to
limit the command length. Here is an example:
#!/usr/bin/env perl
use Getopt::Std;
use strict;
# Example linear packer takes -b bytes and -z only, add others as needed --ksb
my(%opts,$q,$l);
my($overhead) = 8;	# 8 >= sizeof(char *)
getopts("b:z", \%opts);
$/ = "\000" if ($opts{'z'});
my($bsize) = $opts{'b'};
if (!defined($bsize)) {
	$bsize = `sysctl -a kern.argmax 2>/dev/null` || 128*1024;
	$bsize =~ s/.*?([0-9]+)\s*$/$1/s;	# keep just the trailing number
	# bias bsize for environment space, ptr+"name=value\000" * envs
	map { $bsize -= 2+length($_)+length($ENV{$_})+$overhead } keys(%ENV);
}
my($cur) = 0;
while (defined($q = <>)) {
	chomp($q);
	$q =~ s/([\"\'\\\#\`\$\&;|*()?><\{~=[])/\\$1/g;	# escape shell meta-characters
	$l = length($q)+$overhead;
	if (0 == $cur) {
		print "$q";
		$cur = $l;
	} elsif ($cur+$l+1 < $bsize) {
		print " $q";
		$cur += $l+1;
	} else {
		print "\n$q";
		$cur = $l;
	}
}
if ($cur > 0) {
	print "\n";	# terminate the last line, if we output anything
}
exit 0;
The difference between dsh and hxmd
The dsh application resembles hxmd, but worries more about
the source host than the clients.
Emphasis is on local resource utilization, over client configuration, and
less on automation of client-side processes. Most trivial cases might
be implemented as straight xapply
commands against
a file which only contains a list of hostnames.
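For example, with nothing more than a file of hostnames (hosts.list is made up):
$ xapply -fP10 'ssh -n %1 uptime' hosts.list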
Dsh
's configuration structure breaks hosts into
groups (posses in hxmd
speak) by listing the
members of a group in a file named for the group.
Hxmd
allows arbitrary posse relationships,
via attribute macros and guards.
The attribute macros also provide configuration options to scripts, recipes,
and other files marked-up with m4.
Conversion of dsh options to hxmd
The dsh options are largely geared towards
interactive use to drive an interactive process, while the
hxmd
options are more geared for
completely automated tasks.
-v
show execution process
Under hxmd you may use -v, -dC, and -dX to
show different aspects of the execution process.
--quiet
hxmd
is very quiet.
--machine
machinenames
-G
followed by
the exact spelling of the hostname as it appears in the configuration file.
--all
hxmd
.
--group
groupname
SERVICE
to form a posse,
see the hxmd
--file
machinefile
-C
, -X
, or
-Z
depending on what you really want.
--remoteshell
shellname
--remoteshellopt
rshoption
control
specification, or use the HX_CMD
attribute
macro to set the default action.
-h
--wait-shell
-P1
for sequential commands. Set a
higher value for parallel access.
Always set $PARALLEL
to a default that
makes sense in any script or recipe file.
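A sketch of honoring that in a script:
PARALLEL=${PARALLEL:-10}
hxmd -P$PARALLEL -Csite.cf 'ssh -n HOST uptime'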
--concurrent-shell
hxmd
alone.
Usually we start a screen
or
tmux
instance, then drive that with
hxmd
or xapply
.
--show-machine-names
Use the xclate options, like:
$ xclate -ms hxmd -Csite.cf -F2 -e XCLATE=-H%2 "%0ssh -n HOST uptime" "HOST"
But that only outputs the hostname as the first line of hosts that
output something, which is actually more useful.
--hide-machine-names
--duplicate-input
--bufsize
buffer-size
tmux
is used. But sending a shell
script or make
recipe to the host is a much better
idea. Fingers on keyboards cause mistakes. Sending mistakes to many
hosts in parallel is a recipe for trouble.
-V
--num-topology
N
msrc
).
--forklimit
fork-limit
hxmd
more than a hard limit (which is set with -P
).
Examples from the dsh web site:
I'll assume $PARALLEL
is set to the parallel
factor you want for these examples.
uname
)
$ dsh -a -c -- uname -a
$ hxmd -P 'ssh -n HOST uname -a'
$ dsh -g children -c -- uname -a
$ hxmd -C children.cf -P 'ssh -n HOST uname -a'
netgroup
Use ypmatch to get the list of hosts;
we can feed them in as a configuration file:
$ dsh -g @nisgroup -- uname -a
$ ypmatch ... |hxmd -C - -P 'ssh -n HOST uname -a'
mk
, make
,
op
or other recipe processor, or in
a shell script, function or alias.
msrc
with a
simple Msrc.mk
, which makes the commands look
more like
$ dsh -g children -c -- uname -a
$ msrc -Cchildren.cf -P uname -a
# $Id....
INTO=.
SEND=.
MAP=.
IGNORE=+++
The nifty thing about that command is that the
directory context supplies the default
-C configuration and other parameters
(via Msrc.hxmd). This saves a lot of typing
for interactive use, and allows scripts to use the same spells over and
over without recoding each service every time it is needed.
Msrc.hxmd with the default -C and -P options:
# $Id....
-Cchildren.cf
-P10
Then the command becomes just:
$ msrc uname -a
(Use the -z command-line option to
defeat the inclusion of options from that file.)
Summary
Any of these tools is better than typing lots of commands by hand.
Pick the ones you like the best and use them; it might save your
hands and wrists.
-- ksb (KS Braunsdorf) Sep 2013
$Id: parallel.html,v 3.21 2013/09/04 13:49:27 ksb Exp $ by ksb.