Replace lesser tools with the one and only xapply

This document is meant to provoke you to try xapply.

To understand this document

This document assumes you are quite familiar with the standard UNIX™ shell, sh(1), that you understand the UNIX process model, and that you have used GNU parallel for a while. If you've used dsh instead, see the section below. If you've been using find as a parallel process tool, there is help for that too.

The difference between parallel and xapply

The main difference between the two is that xapply doesn't try to allow multiple sources of tasks to be mixed ad hoc. Parallel offers to mix iteration elements from the command-line with stdin and fixed strings, while xapply acts more like a filter: items come from stdin or from fixed words, not both.

Xapply takes the filter approach: if you want a stream of words, then build that stream with a program that does just that. Xapply reads stdin in two formats: NUL (aka. \000) terminated and NL (aka. \n) terminated. We don't need a way to mix files, fixed arguments, and stdin built into the processor (we have perl for that). Actually I've almost never needed more than sed and oue to clean up any iteration list.

Markup differences

The secondary difference is that parallel's markup tries to read your intent more than xapply's. For example the markup {.} under parallel treats removing the dot-suffix from "sub.dir/foo.jpg" (.jpg is removed) differently than "sub.dir/bar" (nothing is removed), while xapply allows either %[1/$.-$] to remove the suffix on the basename, or %[1.-$] to remove a suffix that might include a slash (/). I believe parallel's markup is more likely to bleed complexity into other programs, and less likely to solve any real-world problem. (When you are not sure there is a dot-extender on the filename, then fix the invariant so you are sure.)

Every replacement markup in parallel is a unique case, each has a unique command-line option to change the spelling, and unique rules. Xapply uses a more complex markup, but every expansion supported by the dicer and mixer is general. That is to say that the same dicer markup is used in mk, oue, and other tools.

Simple command-line conversions

-0 or --null
Use -z. This is always a good idea if you had to build an iterator script to output a mix of fixed strings, files, and stdin.
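For example, a NUL-terminated stream from find (a sketch; any filter that outputs NUL-terminated words works as well):
$ find . -name '*.log' -print0 |xapply -fz 'gzip -9 %1' -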
-a input-file
Use -f and list all input-files as positional parameters. There is no interaction between stdin and -f unless you specify a dash in the positional parameters, in which case the default for -i is changed to /dev/null.
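For example, with each file holding one argument per line (a sketch; fetch, moons, and planets are hypothetical):
$ xapply -f 'fetch %1' moons planets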
--basefile file
Use msrc to send files to a remote host (and recover any output files).
--bg
This is what nohup and daemon are for.
--cleanup
See msrc's -m option, which may trigger any recipe cleanup required.
--colsep regexp
--delimiter delim
--E eof-str
Use any filter you like to process the arguments into a NUL terminated list, then use -count to set the number of columns. If you need a delimiter, handle it in that filter.
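As a sketch, assuming pairs.txt holds two comma-separated fields per line, tr can turn both delimiters into NULs and let -2 group the columns:
$ tr ',\n' '\000\000' <pairs.txt |xapply -fz -2 'mv %1 %2' -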
--dry-run
Use -n.
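For example, to see the commands without running them (doomed is a hypothetical list of files):
$ xapply -nf 'rm %1' doomed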
--eta
There is no way to estimate the time it takes an arbitrary program to run.
--group
--ungroup
Use -m and wrap your xapply in an explicit xclate, if you want even more features (like exit code processing). Or use the hxmd program built on top of xapply.
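As a sketch, mirroring the xclate wrapper spelling used for dsh's --show-machine-names below (slow-poll and hosts are hypothetical):
$ xclate -ms xapply -mP4 'slow-poll %1' -f hosts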
--halt-on-error
Use a USR1 signal and the %p markup to find the correct pid to signal xapply to stop adding more tasks. Use hxmd's retry logic to process any recovery logic (or code your own with xclate's -N option).
--joblog
There is no good way to produce the exact output from within xapply, but hxmd's REDO logic could provide some of the same data. The date and run-time information could be provided by changing -c, but this would really need to be done in a script that called hxmd because the command-line would be quite long and complex.
-k
The xclate manager doesn't have a way to force the order. If you want to collate the output, then write each output stream to a file named as a function of %u. For example, one may build a temporary directory (/tmp/ksb) to stash the output from several tasks:
$ xapply -P8 -f 'long-shot %1 >/tmp/ksb/%u' moons
Now a second pass with the same list will match keys to output files:
$ xapply -f 'summary %W1 /tmp/ksb/%u' moons
-L max-lines
Use -f with -count and -p pad to get the same effect.
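For example, to take 3 lines per command and pad any short final group (a sketch; parts is a hypothetical list):
$ xapply -f -3 -p /dev/null 'cat %1 %2 %3' parts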
--load max-load
There is no way to make xapply do this. A task manager like hxmd could sample the load before injecting tasks into an xapply queue. In my experience the system load average alone is not enough information to provide a task manager with sufficient feedback. It might have to sample any combination of swap space, available physical memory, disk input/output utilization, and network throughput.
I have thought about coding one of these, but every 18 months I'm 50% less likely to need it.
--xargs
Use a filter like fmt or adjust to group arguments.
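For example, fmt packs several names onto each line, and the unquoted %1 lets the shell split them again (a sketch, assuming no whitespace in the names):
$ find /tmp -name '*.core' |fmt |xapply -f 'rm %1' -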
--onall
You are looking for hxmd.
--files
See -k above. Another way to force stdout to a file is to prefix the command with "exec >/tmp/ksb/%u;" which doesn't limit the number of shell commands which might be listed in cmd.
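For example (a sketch, assuming /tmp/ksb already exists):
$ xapply -P4 -f 'exec >/tmp/ksb/%u; date; long-shot %1' moons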
--pipe
--block-size size
Produce NUL terminated input blocks, and use -z. Any processing you need to do to the input stream is better encoded in your own filter.
--progress
There is no way to make xapply show you a progress status, since it doesn't know the total number of tasks it might run. I have executed xapply processes that have run for weeks, reading their input from clients connecting to a FIFO.
--number-of-cpus
You are looking for ptbw.
--interactive
Add a shell prompt to your command:
echo Run %W*? 1>&2;read a</dev/tty;expr "_$a" : "_[yY].*" >/dev/null || exit 0
or wrap your command in such a program.
--quote
Most people actually need quoting. Xapply supports 3 levels of quoting:
%q
Quote any character that is special inside shell (sh or ksh) double quotes. That is any of these four characters: \, ", $, or `.
%Q
Quote any character that is special to the shell, but not a default internal field separator (aka. from $IFS): `, $, \, ", ~, *, ?, [, |, &, ;, (, ), #, =, ', <, or >.
%W
Quote anything %Q would plus the standard IFS list (space, tab, and newline).
These prefixes allow some parameters to be quoted, while others are not. For example:
xapply -2 -fp red '%1 %Q2' brush colors 
--no-run-if-empty
Strip empty lines with grep or sed.
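For example, to drop blank lines before xapply sees them (raw is a hypothetical list):
$ grep -v '^[[:space:]]*$' raw |xapply -f 'process %1' -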
--retries
Xapply doesn't know a computer name from any other parameter; you are looking for hxmd.
--return filename
There are so many ways to get data back from a remote host that we don't need one in xapply. If you need to structure a task to process data on a remote host and send return files back, I would use either msrc -m (see the HTML document), or build on top of hxmd's cache files (see the example below).
--semaphore
I think you are looking for -t. The ptbw token broker acts as a semaphore handler in most cases.
--sshloginfile
You are looking for hxmd's -C option, which lets you specify a whole lot more than a few fixed parameters.
--tty
Use -i/dev/tty.
--timeout sec
If you want to kill a process based on time, either wrap it in a program that does that or set a resource limit. I don't believe this is a job for xapply, in any case. In a pinch you could use mk's resource limits, but that's a little overkill.

Here is an example of mk markup to do that:

#!/bin/sh
# Use "mk -mLimit" to run with a 20 second wall-clock time limit:
#	$Limit,clock=20: %f -t -P2
# ...
The mk HTML document might be a good thing to read.
--transfer
--trc
You are looking for msrc.
--trim
You are looking for sed.
--xapply
Blush.
--shebang
You are looking for mk and hxmd. Since hxmd takes comments in the list of targets we embed a marked line (see mk's HTML document) to take whatever action is required.
#!/usr/bin/env -S mk -mRun
# $Run: ${hxmd:-hxmd} -C%f some-processor
list-of-hosts and attributes
I do not think the pun of a configuration file as a script is a great idea, but local policy allows other things I don't like. Remember to chmod it +x.

Or you are looking for -F. You can use xapply as an interpreter with something like:

#!/usr/bin/env -S xapply -P2 -F
gzip -9 %1

The push, remote execute, pull model

The most useful meme encoded in parallel is the idea that one might visit a task on a list of hosts with some data file, then return the results back to the driving host. While that's not hard to explicitly spell under xapply, it is surely easier to cast with parallel.

In my tool chain this is best done with msrc (or plain hxmd if you'd rather walk). We break the task down into the orthonormal parts: the list of target hosts (site.cf), the recipe (report/Cache.m4) and the remote script to run (report/cmd). The last two, taken together, form an hxmd cache directory which is a reusable element that many related tasks might share. (There are more parts to a well-constructed cache directory, but they are not necessary for this example.)

The list of hosts contains more than just the names of the hosts. In fact any attribute related to the host might be listed in the file. See the hxmd HTML document for more details. Here is a simple example of my test site.cf:

# $Id:...
%HOST	COLOR	CPUs
sulaco	black	2
ripley	grey	4
lv426	cream	1

Given that file we can process the three listed hosts. I'll put the script I want to run in a file named report/cmd:

# $Id:...
date
uptime
exit 0

The cache recipe file contains the parts of the process that are marked-up for every execution. This reduces keyboard errors and makes the process more repeatable. The recipe is run through m4 for each host in site.cf, so the attributes of each host can tune the actions of the recipe; the result is then used as a make recipe file to build the required data for that target host, with attribute values (like CPUs) tweaking the update rules. Here is a very simple report/Cache.m4:

`# $Id:...
report: FRC
	ssh -x 'ifdef(`REMOTE_LOGIN',`REMOTE_LOGIN@')HOST` /bin/sh <cmd

# Shell completion might put a trailing slash on our directory name -- ksb
'HOST`: report

FRC:
'dnl
The file cmd in the report recipe allows us to push commands to the target host without quoting them from m4, make, and the shell. We'll use the cmd script from above.

By referencing the name of the cache directory on the hxmd command-line, we force the m4 processing of the Cache.m4 recipe in that directory and the make update of the name of the directory (as the target). The update rule for the HOST macro is only triggered when the directory name is suffixed with a slash, due to the rules hxmd uses to create the update target.

$ hxmd -P10 -Csite.cf  cat report
Which outputs:
sulaco:
Tue May  1 16:21:06 MDT 2012
 2:34PM  up 55 days, 23:54, ...
ripley:
Tue May  1 16:21:06 MDT 2012
 2:34PM  up 133 days, 10:00, ...
lv426:
Tue May  1 16:21:07 MDT 2012
 2:34PM  up 144 days, 20:32, ...
Change "-P10" to the options "-dCX -P1" to see how it works.

The file cmd in the report could take any actions required on the remote host (as long as it doesn't need to read stdin). This model scales out to thousands of hosts, with attribute tuning for as many cases as needed to meet your needs.

Note that every file may have a revision control comment in it; that is a very good idea. Also note that REMOTE_LOGIN may be defined to map the local login to any remote login, even on a per-host basis.

We encapsulate each operation in a directory so we may reuse them in different combinations (and orders) to provide derived services. It is possible to have any directory recursively call another, as well.

Using msrc to repeat that task

To do that same task with msrc using a punned control recipe, we need a make recipe that offers the required macros to msrc, with the report script encoded as an update rule, and nothing else:
# $Id:...
INTO=/tmp/ksb.1
IGNORE=+++

report: FRC
	date
	uptime

FRC:

To run that for the same hosts:

$ msrc -P10 -Csite.cf  make report

The little detail is that the msrc data recovery only goes to stdout: with hxmd the data is actually cached in a local file, which makes it easier to use for additional processing. Under hxmd we use cat to display the "report", while under msrc we use make to run the display on the remote host. That is an important detail (the display runs on the remote host, not on the local host).

The real loss here is in the reuse we got from the combination of cache directories. In the msrc tactic we code the cmd logic in the recipe, and must use make markup to quote dollar signs ($) and avoid command failures that would stop the process.

I usually use msrc for software builds, hxmd for process control scripts, and xapply for ad hoc status polling.

Using the cache directory above would fetch the report from the target host, then send it back as the file report. This is most useful when the process includes an update to the content as it is processed (in at least one direction). This would be triggered by including the name of the directory in the MAP macro list. See the msrc HTML document. For most applications MAPed files are used much more than MAPed cache directories.

The common wins with hxmd and msrc

With these tools you can specify a subset of a whole population with some host selection options (which work for both tools exactly the same way). For example you might target a single test host:
$ msrc -G prometheus -Csite.cf make report
(I replaced "-P10" with an explicit host selection via -G.)

By limiting the changes to the command-line we allow rapid development of common tasks, then quick integration into existing automation.

By using a configuration file format with arbitrary attribute macros, which these 5 tools all read natively (mmsrc, msrc, hxmd, efmd, and distrib) and others can parse by proxy (via efmd), we can share the host data between interactive tasks, across political groups, and use it in diverse automation applications.

And everything should always be revision controlled.

Conversions from find's execution options to xapply

Find is a great utility for producing a source-stream for a parallel task. Some non-standard additions have been made to find to reduce the number of check processes the -exec primitive forks to search the filesystem. I think there are better ways to improve the overall throughput of a find pipeline.

Normally find's -exec should be parallelized with:

$ find pathname ... expression -print0 |xapply -mfzP 'template' -
This pipeline allows find to traverse the filesystem without any logic to manage forked processes. We let find focus on the filesystem, while xapply manages the processes. Tuning xapply's parallel factor (under -P) adds more parallel processes; adding an xclate wrapper, a ptbw governor, or a status code stream is now possible, where it is not with find `managing' the execution.

Find's -execdir

This is a very powerful meme: by running a process in the context of a different directory we may leverage another invariant to increase our parallelism. Find imposes a limit that we'll refactor here: the name of the file we locate must be the program we want to run. By using xapply we remove that restriction.

For example we might find a make recipe file (-name '[mM]akefile') or a file with a locally meaningful extender (viz. ".lme"), neither of which need be the program we want to execute. Using the dicer we can select the directory, then run the processor of our choice:

$ find pathname ... -name '[mM]akefile' -print0 |
	xapply -mfzP8 'cd %[1/-$] && make -f %[1/$] found' -

Find's + hack is really a binpack

The OpenBSD hack to find (see the manual page) allows multiple arguments to be joined into a single execution of the target utility, but it is really not portable across versions of find.
$ find pathname ... -name '*.lme' -exec bundle-process +

It is much more portable to use -print0 to build a path list that is NUL (\000) terminated. Then use xapply -z to process the list.

$ find pathname ... -name '*.lme' -print0 |
	xapply -mfzP13 -8 'bundle-process %W1 %W2 %W3 %W4 %W5 %W6 %W7 %W8' \
		- - - - - - - -
If you want to group the maximum number of elements for each command (like the OpenBSD + feature does) use the binpack filter under the -zN options to group the files, then feed the list to xapply.
$ find pathname ... -print0 |
	binpack -zN bundle-process |
	xapply -mfP10 '' -
If you have a lot of filenames with special characters in them this may exceed kern.argmax; tune the limit down with -b (dividing by 2 always works). Since most filenames do not have shell meta-characters in them, this almost never happens. (Or tune -w down to make less optimal bins.)
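For example, to cap each bin at half of a (hypothetical) 256 KiB kern.argmax:
$ find pathname ... -print0 |
	binpack -zN -b 131072 bundle-process |
	xapply -mfP10 '' -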

In addition to the permutations done by the output order from the parallel tasks, binpack permutes the order of the files as it packs them into bins. If you require a (more) stable order, just use a simple perl filter to limit the command length. Here is an example:

#!/usr/bin/env perl
use Getopt::Std;
use strict;

# Example linear packer takes -b bytes and -z only, add others as needed --ksb
my(%opts,$q,$l);
my($overhead) = 8;	# 8 >= sizeof(char *)
getopts("b:z", \%opts);
$/ = "\000" if ($opts{'z'});
my($bsize) = $opts{'b'};
if (!defined($bsize)) {
	$bsize = `sysctl -a kern.argmax 2>/dev/null` || 128*1024;
	$bsize =~ s/.*?([0-9]+)\s*$/$1/;	# keep only the trailing number
	# bias bsize for environment space, ptr+"name=value\000" * envs
	map { $bsize -= 2+length($_)+length($ENV{$_})+$overhead } keys(%ENV);
}
my($cur) = 0;
while ($q = <>) {
	chomp($q);
	$q =~ s/([\"\'\\\#\`\$\&;|*()?><\{~=[])/\\$1/g;
	$l = length($q)+$overhead;
	if (0 == $cur) {
		print "$q";
		$cur = $l;
	} elsif ($cur+$l+1 < $bsize) {
		print " $q";
		$cur += $l+1;
	} else {
		print "\n$q";
		$cur = $l;
	}
}
if ($cur > 0) {		# terminate the last bin, if we printed one
	print "\n";
}
exit 0;

The difference between dsh and hxmd

The dsh application resembles hxmd, but worries more about the source host than the clients. The emphasis is on local resource utilization over client configuration, and less on automation of client-side processes. Most trivial cases might be implemented as straight xapply commands against a file which only contains a list of hostnames.

Dsh's configuration structure breaks hosts into groups (posses in hxmd speak) by listing the members of a group in a file named for the group. Hxmd allows arbitrary posse relationships, via attribute macros and guards. The attribute macros also provide configuration options to scripts, recipes, and other files marked-up with m4.

Conversion of dsh options to hxmd

The dsh options are largely geared toward driving an interactive process, while the hxmd options are geared more toward completely automated tasks.
-v show execution process
Under hxmd you may use -v, -dC, and -dX to show different aspects of the execution process.
--quiet
By default hxmd is very quiet.
--machine machinenames
It is not possible to add a literal machine that is not in any configuration file. It is possible to specify a host that is in a configuration file with -G followed by the exact spelling of the hostname as it appears in the configuration file.
--all
This is the default for hxmd.
--group groupname
Use a macro attribute like SERVICE to form a posse, see the hxmd HTML document.
--file machinefile
Use one of -C, -X, or -Z depending on what you really want.
--remoteshell shellname
--remoteshellopt rshoption
Specify the action as part of the control specification, or use the HX_CMD attribute macro to set the default action.
-h
Exactly the same.
--wait-shell
Set -P1 for sequential commands. Set a higher value for parallel access. Always set $PARALLEL to a default that makes sense in any script or recipe file.
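For example, a strictly sequential pass over the site.cf from above:
$ hxmd -P1 -Csite.cf 'ssh -n HOST uptime'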
--concurrent-shell
There is no way to do this with hxmd alone. Usually we start a screen or tmux instance, then drive that with hxmd or xapply.
--show-machine-names
We could play games with xclate options, like
$ xclate -ms hxmd -Csite.cf -F2 -e XCLATE=-H%2 "%0ssh -n HOST uptime" "HOST"
But that only outputs the hostname as the first line of hosts that output something, which is actually more useful.
--hide-machine-names
This is the default.
--duplicate-input
--bufsize buffer-size
This is where tmux is used. But sending a shell script or make recipe to the host is a much better idea. Fingers on keyboards cause mistakes. Sending mistakes to many hosts in parallel is a recipe for trouble.
-V
Exactly the same.
--num-topology N
Slave instances are driven by a recipe (script or more commonly an instance of msrc).
--forklimit fork-limit
Really you need the slow-start logic in hxmd more than a hard limit (which is set with -P).

Examples from the dsh web site: I'll assume $PARALLEL is set to the parallel factor you want for these examples.

Visit all hosts with display command (uname)
$ dsh -a -c -- uname -a
$ hxmd -P 'ssh -n HOST uname -a'
Limit to a specified posse by file.
$ dsh -g children -c -- uname -a
$ hxmd -C children.cf -P 'ssh -n HOST uname -a'
Limit to the hosts from a netgroup
If we can use ypmatch to get the list of hosts we can feed them in as a configuration file:
$ dsh -g @nisgroup -- uname -a
$ ypmatch ... |hxmd -C - -P 'ssh -n HOST uname -a'

For all of these you would actually embed the command in a recipe file: either a mk, make, op or other recipe processor, or in a shell script, function or alias.

The other option would be to use msrc with a simple Msrc.mk, which makes the commands look more like

$ dsh -g children -c -- uname -a
$ msrc -Cchildren.cf -P uname -a

The minimal required recipe (to send no files) would be

# $Id....
INTO=.
SEND=.
MAP=.
IGNORE=+++
The nifty thing about that command is that the directory context supplies the default -C configuration and other parameters (via Msrc.hxmd). This saves a lot of typing for interactive use, and allows scripts to use the same spells over and over without recoding each service every time it is needed.

To save even more typing add an Msrc.hxmd with the default -C and -P options:

# $Id....
-Cchildren.cf
-P10
Then the command becomes just:
$ msrc uname -a
(Use the -z command-line option to defeat the inclusion of options from that file.)

Summary

Any of these tools are better than typing lots of commands by hand. Pick the ones you like the best and use them; it might save your hands and wrists.
-- ksb (KS Braunsdorf) Sep 2013

$Id: parallel.html,v 3.21 2013/09/04 13:49:27 ksb Exp $