uniproc - Universal data processing tool



NAME

uniproc - Universal data processing tool


SYNOPSIS

uniproc [OPTIONS] INPUTFILE COMMAND [ARGS]


DESCRIPTION

Takes each line from INPUTFILE as DATA (stripping end-of-line chars), passes the TAB-delimited fields of DATA to COMMAND as arguments appended after ARGS (unless a placeholder occurs in COMMAND or ARGS, see below), runs COMMAND, and records its exit status.

Parallelizes well. A single uniproc(1) instance runs COMMANDs only in series, but if you start multiple instances of uniproc(1), COMMANDs run concurrently. Locking ensures that no item is processed by more than one instance, so you do not need special precautions (locking, data partitioning) when starting uniproc(1) multiple times on the same INPUTFILE.

Use a wrapper command/script for COMMAND if you want either of these:

save COMMAND's output as well.

By default it goes to STDOUT. Use redirexec(1) for example.

pass DATA on STDIN or in an environment variable instead of as command arguments.

Use args2env(1) or args2stdin(1) for example.

If re-run after an interruption, uniproc(1) does not process already-processed data. But you may retry the failed items with the --retry option.

You may append new lines of data to INPUTFILE between executions or during runtime - it does not disturb processing. However, editing or reordering lines which are already in the file corrupts the results - do not do it.

ARGS (and COMMAND too, somewhat usefully) supports placeholders: a curly bracket pair {} is replaced by DATA as one argument, including TAB chars if any, anywhere in COMMAND ARGS. If there is a number in it, {N}, then the Nth TAB-delimited field (1-indexed) is substituted. A lone {@} argument expands to as many arguments as there are TAB-delimited fields in DATA. A placeholder with multiple numbers, like {5,3,4}, expands to the data fields specified by the index numbers, in the given order, as multiple arguments. Note that such a multi-index placeholder must stand as its own separate argument, just like the all-fields {@} placeholder. Indexing a non-existing field expands to an empty string. Be aware that your shell (eg. bash(1)) may expand arguments like {5,3,4} before they get to uniproc(1), so escape them if necessary (eg. '{5,3,4}'). If any such curly bracket placeholder is present, DATA's fields are not appended to ARGS as trailing arguments.
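
The field splitting behind these placeholders can be sketched in plain bash. This is an illustration only, not uniproc(1)'s code, and the sample DATA line is made up:

```shell
# Mimic uniproc(1)'s placeholder expansion on one made-up DATA line.
DATA=$'alpha\tbeta\tgamma'

# Split DATA into TAB-delimited fields, as uniproc(1) does.
IFS=$'\t' read -r -a fields <<< "$DATA"

printf '<%s>\n' "$DATA"           # {}  - the whole line as one argument
printf '<%s>\n' "${fields[1]}"    # {2} - the 2nd field (1-indexed)
printf '<%s>\n' "${fields[@]}"    # {@} - one argument per field
```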


OPTIONS

-r, --retry

Process the items which failed earlier (according to the INPUTFILE.uniproc state file), in addition to the unprocessed ones.

-f, --failed

Process only the earlier failed items.

-1, --one-item

Process only 1 item, then exit. Default is to process as many items in series as possible.

-n, --items NUM

Process at most NUM items, then exit.

-e, --errexit

Stop processing items as soon as the first COMMAND exits non-zero; uniproc(1) itself then exits with that exit code (or 128+signal if COMMAND was terminated by a signal).

-Q, --quasilock

Create and check locks using lock files instead of flock(2). Useful for network filesystems which do not support shared locks (eg. sshfs). It is assumed that all instances of uniproc(1), across all hosts working on a given INPUTFILE, run either all in quasi-lock mode or all in flock(2)-lock mode - do not mix them. These quasi-lock files are:

INPUTFILE.uniproc.lock

locking the INPUTFILE.uniproc state file, and

INPUTFILE.uniproc.NUM

locking the command processing the NUMth item. Note, this is the same file which is locked by flock(2) in real-lock mode.

Beware when using quasi-locks: the user may need to manually clean up lock files left behind by an interrupted process. While atomic lock acquisition is approximated using general filesystem primitives, there is no simple race-free way to automatically release the lock when a process terminates. Therefore uniproc(1) does not even try to emulate such a lock-release mechanism, so it neither detects nor reclaims stale lock files. However, to help the user identify possibly-alive processes which expect resources to be exclusively allocated to them, uniproc(1) writes some useful info about the current process into the lock files: PID START_TIMESTAMP HOSTNAME.
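
The quasi-lock idea - atomically creating a lock file which holds PID START_TIMESTAMP HOSTNAME - can be approximated in bash with the noclobber option. A minimal sketch, not uniproc(1)'s actual code (the temporary lock path stands in for INPUTFILE.uniproc.lock):

```shell
lockfile=$(mktemp -u)   # stand-in path for INPUTFILE.uniproc.lock

acquire() {
    # With noclobber (set -C) the redirection fails if the file already
    # exists, so create-and-write acts as an atomic test-and-set.
    (set -C; printf '%s %s %s\n' "$$" "$(date +%s)" "$(uname -n)" > "$lockfile") 2>/dev/null
}

acquire; first=$?       # succeeds: we now hold the quasi-lock
acquire; second=$?      # fails: the lock file already exists
holder=$(cat "$lockfile")
rm -f "$lockfile"       # release by deleting the file
```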

-sp, --show-progress

Show which item is about to be processed.

-sd, --show-data

Show the raw data which is about to be processed.

-ss, --show-summary

Show a statistics summary on exit.

--debug

Output debug messages.


FILES

uniproc(1) maintains the INPUTFILE.uniproc file, writing the processing status of each line of input data into it line by line. The processing status is one of:

all spaces ( )

processing not yet started

periods (...)

in progress

digits, possibly padded by spaces ( 0)

result status (exit code)

exclamation mark (!) followed by hexadecimal digits (!0f)

termination signal (COMMAND terminated abnormally)

EOF (ie. fewer lines than input data)

processing of this item has not started yet

INPUTFILE.uniproc is locked during reads and writes to ensure consistency. INPUTFILE.uniproc.NUM are the names of the files which hold the locks for the currently in-progress processes, where NUM is the line number of the corresponding piece of data in INPUTFILE. A lock is held on each of these INPUTFILE.uniproc.NUM files by the respective instance running COMMAND, so other instances can detect whether the processing is still going on or the process crashed.
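
For instance, a recorded status of !0f can be decoded back to a decimal signal number with printf:

```shell
# Decode a "!HH" state-file entry back to the signal number (0x0f = 15, SIGTERM).
status='!0f'
signal=$(printf '%d' "0x${status#!}")
echo "$signal"    # prints "15"
```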


LIMITATION

Due to the currently used locking mechanism (Fcntl(3perl)), locking may not be respected when running on multiple hosts, depending on the network filesystem. See the --quasilock option.


ENVIRONMENT

When running COMMAND, the following environment is set:

UNIPROC_DATANUM

Number of the particular piece of data (ie. line number in INPUTFILE, 0-indexed) which is to be processed by the current process.

UNIPROC_DATANUM_1INDEX

Same as UNIPROC_DATANUM but 1-indexed instead of 0-indexed.

UNIPROC_TOTALNUM

Total number of items (processed and unprocessed). Note that this figure may be outdated, because INPUTFILE is not always re-measured before each COMMAND start.
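
These variables make simple wrappers possible. A sketch of a wrapper which writes each item's output to its own file - here the variable is exported by hand only for the demonstration; normally uniproc(1) sets it:

```shell
# Normally exported by uniproc(1); set manually here for the demo.
export UNIPROC_DATANUM=3
outdir=$(mktemp -d)

wrapper() {
    # Run the real command, capturing its output in a per-item file.
    "$@" > "$outdir/output-$UNIPROC_DATANUM"
}

wrapper echo "processed item $UNIPROC_DATANUM"
result=$(cat "$outdir/output-3")
rm -rf "$outdir"
echo "$result"    # prints "processed item 3"
```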


EXAMPLES

Display the data processing status before each line of data:

  paste datafile.uniproc datafile

How much is completed?

  awk -v total=$(wc -l < datafile) 'BEGIN{ok=ip=fail=0} {if($1==0){ok++} else if($1=="..."){ip++} else if($1!=""){fail++}} END{print "total: "total", completed: "ok" ("(ok*100/total)"%), in-progress: "ip" ("(ip*100/total)"%), failed: "fail" ("(fail*100/total)"%)"}' datafile.uniproc
  
Output:
  total: 8, completed: 4 (50%), in-progress: 1 (12.5%), failed: 1 (12.5%)

Record output of data processing into a file per each data item:

  uniproc datafile sh -c 'some-command "$@" | tee output-$UNIPROC_DATANUM' --
  uniproc datafile substenv -e UNIPROC_DATANUM redirexec '1:a:file:output-$UNIPROC_DATANUM' some-command

Same as above, plus keep the output on STDOUT as well as in separate files. Note, the {} argument is there to pass DATA to the right command:

  uniproc datafile pipecmd some-command {} -- substenv -e UNIPROC_DATANUM tee -a 'output-$UNIPROC_DATANUM'

Display data number, processing status, input data, (last line of) output data in a table:

  join -t $'\t' <(nl -ba -v0 datafile.uniproc) <(nl -ba -v0 datafile) | foreach -t --prefix-add-data --prefix-add-tab tail -n1 output-{0}