sh(1). In addition, you should have a working understanding of the xapply(1) wrapper.
What is ptbw? The ptbw wrapper is a governor for large parallel tasks.
Without a limiter, many parallel processes would consume more
resources than the underlying system can provide.
This conflict may "thrash" the node, starve it for resources, or
cause competing tasks to double-allocate an asset.
Ptbw provides cooperating processes with a pool of resources in a structure that allows them to be locked and released very quickly.
A managing instance of ptbw
controls the free pool, and
each client instance allocates from that pool on behalf of a
client
process.
When a client process exits, the resources are returned to the free pool.
ptbw is a wrapper with 2 modes: manager and client.
See the ptbw manual page for
more details on these modes.
In manager mode, under the -m
option,
ptbw
starts a process in
an environment that allows it to
allocate resources from a pool.
That process (given as utility
below)
usually starts parallel tasks which allocate elements from the
pool to do work.
That usage looks like:
ptbw -m [-dqQ] [-depth] [-J jobs] [-N path] [-R req] [-t tags] [utility]
Then, each client
process started
directly or indirectly by utility
requests resources from the pool and then executes with an
environment variable, named ptbw_list, set to
indicate which elements of the pool were bound for that
task.
That usage looks like:
ptbw [-AqQ] [-depth] [-R req] [-t tags] [client]
Assume there are 3 modems connected to our server, and that we need to dial 150 hosts to gather some data from each. The usual method to split the work would be to divide the hosts evenly, giving a list of 50 hosts to each dialer process. By doing that, we are assuming that each data recovery task takes about the same amount of time.
Another way would be to start 150 processes and spin-lock for a modem in each. That's exactly how some of the low-level UNIX kernel events work, which is fine for a small set of threads spinning for a diverse resource pool -- not so much for user-level tasks that should be better coordinated.
The method implemented by ptbw
wraps an environment in
which access to a modem is granted to a tree for the life of
the root process in that tree. As modems become available, a
process is fork'd and bound to that resource; as each task finishes
it exits to return the modem to the pool.
Of course, a "modem" could represent any virtual or physical resource.
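As a sketch of that method (dial-one and hosts.list are hypothetical stand-ins; modems.list holds one modem device per line, as described below):

$ ptbw -m -t modems.list sh -c '
    for host in `cat hosts.list`
    do
        # each client blocks until a modem is free, then runs
        # dial-one with the device appended to its arguments
        ptbw -A -R1 ./dial-one "$host" &
    done
    wait'

Only 3 dialers ever run at once; the remaining clients queue for a token instead of spinning.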
A manager instance normally waits for its utility process to exit, then releases the pool.
Normally, clients use the diversion stack managed in
the environment to connect to the
tightest enclosing master instance.
But, in the case of ptbw
,
it is sometimes more effective to
have a "global" instance rooted at a well-known socket to
manage system-wide pools. In that case, the global instance
sets the desired socket name with
-N
path
.
This allows disparate clients to cooperate via the "fixed"
address of the global instance.
The path
selected is only really fixed by
a site (or application) policy; not something you need to
get consensus on
across the IETF, or
even across the street.
Any master instance must draw the list of tokens from someplace.
The non-comment lines from the tags
file
specified under -t
form the resource
pool for most instances. This allows comments for revision control
markup to be included in the file.
For example, a tags
file describing
our three working modems might look like:
# $Source: revision control path...
/dev/cua01
/dev/cua02
#/dev/cua03
/dev/cua04

The 2 commented lines help keep track of the authoritative data source, and the fact that modem 3 is out of service. But the reason why modem 3 is out of service is in the revision log of the file, not in the file itself.
Hxmd
configuration files accept comments
for the same reason.
Feel free to copy that file to "example.list", being careful to trim
any blanks on the end of the lines.
To turn that file into a running instance of ptbw
,
one might execute:
$ ptbw -m -t example.list ${SHELL} -i
By using the shell as the utility
, we can explore the
wrapper environment we've created. Try a few commands from within
that shell; for starters, the normal version option:
$ ptbw -V
ptbw: $Id: ptbw.m,v 1.76 ...
ptbw: default number of tokens: 8
ptbw: environment prefix "ptbw"
ptbw: environment tags: "link", "list", "d", "1"
ptbw: protocol version 0.4
ptbw: accept filter: dataready
ptbw: safe directory template: ptbdXXXXXX
ptbw: 1: /tmp/ptbdX3ORQP/sta0 [target]

Note the additional line at the end of the output, which shows the active diversion and the fact that it is the default target.
Next we should ask to see the whole tableau:
$ ptbw
ptbw: master has 3 tokens, max length 10 (total space 33)
Master tokens from file: example.list
Index  Lock  Tokens (request in groups of 1)
0      0     /dev/cua01
1      0     /dev/cua02
2      0     /dev/cua04
If we were to lock a token with the client program, then look at the tableau before the process releases the lock:
$ ptbw sleep 1024 &
[1] 4181
$ ptbw
ptbw: master has 3 tokens, max length 10 (total space 33)
Master tokens from file: example.list
Index  Lock  Tokens (request in groups of 1)
0      4181  /dev/cua01
1      0     /dev/cua02
2      0     /dev/cua04

We could even see the process-id of the ptbw instance holding the lock.
You can use that process-id to shoot your own processes, but don't believe it if you are the superuser. In fact, I almost never run ptbw as root; there is no reason to.
Either foreground and terminate the sleepers, or kill them by pid before you move on.
$ kill -TERM 4181
$ exit
[1] + Terminated               ptbw sleep 1024
The tags file may also be the name of a master instance's socket. In that case, the new master allocates resources from the (now) enclosing master, manages them, then returns them to the specified master instance. That is one way a "global" master might be leveraged to share resources amongst unrelated tasks. The tags filename "-" is a synonym for the tightest diversion's socket name.
Still under the ptbw from above, we can start another master instance inside the first; just run:

$ PS1="below " ptbw -m -J2 -R1 -t - ${SHELL} -i

Then run the same tests you ran above. Now the version output shows 2 enclosing diversions, and the tableau has only 2 lines.
Next exit the "below" shell and start another, then look at
the tableau output. You might notice that the tokens the nested
instance manages have changed. By default ptbw rolls through the tokens; it doesn't always give out the first available. This gives better performance for most real-world tasks, and follows the Principle of Least Astonishment, because that is what people would do in the real world.
Feel free to exit those master shells.
A client may request up to req tokens in one atomic transaction. Any request for more than that many tokens might be processed as more than one allocation. This gives competing processes a chance to "get in the game" before the big dog consumes all the choice tokens.
The -R
specification to the master should be
tuned to match the clients' minimal needs. A value of 1 is usually
a good default, or don't specify any.
When constrained by a tags file, the
jobs
specification has little effect:
we can't add more tokens to the resource pool just because the command-line
asked for more. But in cases where we are drawing from an enclosing
instance or creating a list, we can use -J
and -R
to specify how many tokens to allocate
for maximum throughput.
For example, to display a tableau with an allocation of 20 integers and an atomic allocation limit of 4 at a time:
$ ptbw -m -J5 -R4 -- ptbw

Which outputs a tableau that looks like:

ptbw: master has 20 tokens, max length 2 (total space 50)
Master tokens from the internal function: iota
Index  Lock  Tokens (request in groups of 4)
0      0     0
1      0     1
2      0     2
3      0     3
4      0     4
5      0     5
6      0     6
7      0     7
8      0     8
9      0     9
10     0     10
11     0     11
12     0     12
13     0     13
14     0     14
15     0     15
16     0     16
17     0     17
18     0     18
19     0     19
That same command, when passed a -t
option with
a file that holds fewer than 20 tokens, produces an error like:
ptbw: token count 11 less than suggested (20)

This error can be suppressed with -q.
In the example above, we used ptbw
as the
client
to output the tableau. That is the
default behavior when no client shell command is specified on the
command-line.
When a shell command is executed from a client instance of
ptbw
, it may accept the tokens allocated on its
behalf in 2 ways. Either as an environment variable, or
as positional parameters to client
.
By default, the environment variable ptbw_list
holds the recovered values (separated with newlines). Here is an example
of a client that just calls echo
:
$ ptbw -R2 sh -c 'echo $ptbw_list'

Run from your login shell, that should output:

ptbw: no enclosing diversion
To make an enclosing diversion you need to start another shell wrapped in
a ptbw
:
$ PS1="inside$ " ptbw -m -R2 -J3 $SHELL -i
inside$

I set the prompt (PS1) to a unique string to mark that the new shell is not your login shell. You can check that you are inside such a wrapper, as above, by running ptbw without arguments.
In that output, notice the line with the "[target]" tag on the end; that is the currently selected diversion. When there is no diversion in the list, there are no enclosing diversions.
All wrappers should have about the same display in their version output; for example, see xclate.
From within that shell, retry the echo
client:
When repeatedly run from that shell, it should output:

inside$ ptbw -R2 sh -c 'echo $ptbw_list'
0 1
inside$ r
2 3
inside$ r
4 5
inside$ r
0 1
inside$

(In ksh, r reruns the previous command.) The allocations roll through all 6 tokens; repeat that sequence as long as you like.
The indirection through a shell (via sh -c
) is
a little cumbersome. We need that to expand the environment variable
set by ptbw
. We can eliminate that by moving
the tokens to the command line with the -A
option. This option appends the tokens to the end of the argument
vector for the client
command, so the
above example becomes:
inside$ ptbw -R2 -A echo
2 3
inside$ exit
$
When calling a shell script that takes the tokens as positional parameters (on the end of the argument list), the -A form is much better.
Use the environment form when the program that consumes the tokens is not the direct client; that is, don't use a shell to set an environment variable if you don't need the shell at all.
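For example, a minimal sketch of such a script; the name poll.sh and its output are made up for illustration:

#!/bin/sh
# poll.sh: usage: poll.sh host token ...
# the payload (host) comes first; ptbw -A appends the tokens
host="$1"; shift
for modem in "$@"
do
	echo "would poll $host via $modem"
done

Run it as a client with the tokens on the end of the argument list:

inside$ ptbw -R2 -A ./poll.sh www.example.com
would poll www.example.com via 4
would poll www.example.com via 5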
Which utility would a "global" instance select? The processes that the global instance serves are not descendants of that process; they locate the service by a well-known socket name, not by an inherited environment. So there is no clear process to wait for.
The special name ":" (a colon) denotes a utility that blocks until the process is explicitly shut down. An instance started with this feature is called a "persistent instance".
There are 2 ways to stop a persistent instance. One way is to send the process a TERM signal. The other is to start a client with the -Q option: when that client disconnects, the master gracefully shuts itself down.
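For the signal method, assuming the persistent instance runs as job %1 in your shell, something like:

$ kill -TERM %1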
To try the -Q method, start a persistent instance in the background (I'm assuming the
environment variable USER
is set to your login name):
$ ptbw -m -R2 -J3 -N /tmp/$USER.ptbw : &
[1] 6765
Then repeat the command below to output pairs of tokens from that instance:
$ ptbw -t /tmp/$USER.ptbw -A -R2 echo
0 1
$ r	# run again
2 3
...
After running those, terminate the background instance with:

$ ptbw -Q -t /tmp/$USER.ptbw -R0 true

You should get notification that the background process terminated. Repeat that experiment with a client that stalls for a bit, say 15 seconds, then ask the master to exit:

$ ptbw -m -R2 -J3 -N /tmp/$USER.ptbw : &
[1] ...
$ ptbw -t /tmp/$USER.ptbw -R2 sleep 15 &
[2] 10946
$ ptbw -Q -t /tmp/$USER.ptbw -R0 true
$ jobs
[2] + Running                 ptbw -t /tmp/$USER.ptbw -R2 sleep 15
[1] - Running                 ptbw -m -R2 -J3 -N /tmp/$USER.ptbw :
$ sleep 15
[2] + Done                    ptbw -t /tmp/$USER.ptbw -R2 sleep 15
[1] - Done (76)               ptbw -m -R2 -J3 -N /tmp/$USER.ptbw :

The master doesn't exit until the client does. This is actually quite a feature. The exit code (76) indicates that a client asked for the shutdown.
The basic wrapper tactic includes detached diversions, that is, diversions
that don't modify the stack environment. Ptbw
implements this with the standard -d
command-line
option.
A detached diversion publishes the client socket in the environment variable ptbw_d.
This is used as a tags
specification under
-t
in a client instance to directly
address the detached diversion. The value could be recorded in a file, or sent to another process through any of the many UNIX IPC facilities.
The value of ptbw_d must be recovered and recorded by client directly, as the variable may be reused by any other program that needs a detached diversion of ptbw. In other words, the variable doesn't need to be preserved by any service that starts a detached instance, but it may be.
It is considered good form to preserve any detached names if
it is not a bother to do so. It is also considered poor form
to depend on this polite behavior in unrelated applications.
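As a short sketch of that flow (the file /tmp/myapp.ptbw is a made-up recording spot, and your socket path and token will differ):

$ ptbw -m -d -t example.list ${SHELL} -i
$ echo $ptbw_d >/tmp/myapp.ptbw    # record the detached socket
$ ptbw -t "`cat /tmp/myapp.ptbw`" -A -R1 echo
/dev/cua01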
Each token in the tags file could be a list of resources, and xapply's dicer mark-up can split those for you in a few characters.
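For example, a made-up tags file where each token bundles a modem with its log file, ready for the dicer to split on the comma:

# each token pairs a device with its log file
/dev/cua01,/var/log/dial01
/dev/cua02,/var/log/dial02
/dev/cua04,/var/log/dial04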
In the rare case where you've "over allocated" by not using part of the resources you were granted, you have to ask yourself if avoiding that waste is worth the risk of live-lock, or the complexity of coding a lot more structure to allocate exactly what you (think) you need now.
The flock(1) program wraps the flock(2) system call.
more than a single resource, you can start a ksh
co-process to hold a lock on a flag file that all consumers
use. After you have the resources, you terminate the co-process
and go on your way. This prevents live-lock from allocation
order inversions in competing code.
See the flock(1) HTML document for more details.
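Here is a sketch of the shape of that trick in plain sh, using the file-descriptor form of the common util-linux flock(1) in place of the ksh co-process (the flag file path is arbitrary):

exec 9>/tmp/myapp.flag    # every consumer opens the same flag file
flock 9                   # wait for, then hold, the allocation lock
# ... allocate from every pool here, always in the same order ...
flock -u 9                # drop the lock, keep the resources, work on

Because every consumer allocates under the same lock, in the same order, none can hold part of a set while waiting on the rest.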
Some resources, like network bandwidth, are already limited by atlq, rsync's --bwlimit, or by some other throttle. In these cases ptbw alone might not be the right tool to partition such a fungible resource.
Ptbw
might help fair-share a larger pool of
bandwidth-limited rsync
instances.
Use --bwlimit
to fix each process to
a hard limit, while using ptbw
to limit
the number of instances. This might be a job for
xapply
or even hxmd
.
So, when do I need ptbw? When you have a list of tasks, and each task requires a "resource", but the resources are general enough to do any single task, you might need ptbw. You would need it to select a resource to process an item from the list, not knowing the pairing a priori.
Looking at it the other way: if I pair these numbers and letters:
0 A
1 B
2 A
3 B
4 A
5 B

Then I give that list of tasks to 2 worker threads: the first uses "A" to process task "0", while the second uses "B" to process task "1". When the one using "A" finishes it could start on "2", but what if the one using "B" finishes first and tries to use resource "A" while task "0" is still in process?
We would need to "hard partition" the list into "A-only" and "B-only", thereby binding the resource to the thread, to keep double allocation from breaking the process. Such hard partitions limit the total throughput of the system by holding one of the resources locked while no process needs it (for example, when A-only finishes, but B-only still has many tasks to complete, we can't use "A" to work on any).
This sounds like a job for xapply, and it is.
Ptbw
itself just manages the environment; some client
code must start the tasks with the payload (unit of work) bound to
the allocated resource, and
xapply
knows all about that.
Xapply
includes a built-in client for
the ptbw
wrapper.
Under -t, xapply figures out whether it needs to wrap itself in an enclosing ptbw instance, or just connect to an existing one, based on the type of the tags parameter (file or socket).
Xapply
knows how to make a client
connection to the enclosing ptbw
instance to allocate
tokens for its many tasks. It returns those tokens only as it
runs out of new tasks to start; it doesn't release tokens if it is going to issue again immediately. The current implementation of xapply doesn't fetch additional tokens after the initial allocation, so when more become available it won't run a wider parallel factor: that's a missing feature.
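As a sketch of the shape of that usage (dial-one and hosts.list are stand-ins, and I'm assuming each task can read its token from ptbw_list as any client would; check the xapply manual for the exact markup):

$ xapply -t example.list -P3 -f 'dial-one %1 "$ptbw_list"' hosts.list

That runs up to 3 dialers at once, each bound to a free modem from example.list.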
See the xapply HTML document for
examples of the xapply
interface.
$Id: ptbw.html,v 1.22 2012/03/29 21:15:48 ksb Exp $