Vhistify Tutorial

Introduction
Prerequisites
Typographic Notations
Installation
First Steps
Examples
Commandline Options
Plugins
Internal Structure of vhistify
Caveats and Things to Keep in Mind

Introduction

In the field of medical image processing and evaluation, most workflows consist of several individual tools and scripts that are combined into one complex pipeline. Since changing even one parameter (e.g. filter method, Matlab version, co-registration template file) can have a large influence on the result of the workflow, documenting all workflow steps is essential (Good Scientific Practice). Manual documentation is error-prone, cumbersome and in some cases outright impossible (opaque relationships within and between complex software packages).

Vhistify is our attempt to create workflow documentation in an automated way. It is based on VHIST, a file format specifically designed to document workflows. VHIST files are self-contained and PDF-compatible: all information stored in a VHIST file is accessible from any PDF viewer. However, VHIST files also contain structured information on each workflow step (embedded XML) suitable for automated processing.

Vhistify executes and monitors another program and gathers lots of information about the monitored process. This information includes:

The command line of the executed command
The hostname of the computer on which the program was executed
The name of the user who executed the command
The time the program took to run
The initial working directory of the command
The return value of the program as well as the reason for termination (killed, segmentation fault, etc)
Standard output and error of the program
A list of files read and written by the program, including paths, filesizes and MD5 fingerprints
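
As a rough illustration of the last item, the following Python sketch records the path, size and MD5 fingerprint of a single file. The function name `fingerprint` and the layout of the returned dictionary are assumptions made for this example; they do not reflect vhistify's actual code.

```python
import hashlib
import os


def fingerprint(path, chunk_size=1 << 20):
    """Return path, size in bytes and MD5 hex digest of a file.

    Hypothetical helper for illustration; not part of vhistify.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return {
        "path": os.path.abspath(path),
        "size": os.path.getsize(path),
        "md5": md5.hexdigest(),
    }
```

Calling `fingerprint("input.txt")` returns a dictionary with the keys `path`, `size` and `md5`.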

With the help of plugins, vhistify can also:

Infer version numbers of tools used
Collect meta-information for files of known file formats
Generate zip archives that contain the source code of a script
Gather more detailed information about the machine used (version of Linux, versions of core libraries and programs, etc.)
Create preview images for plots or graphics
etc.

Prerequisites

Currently, vhistify only supports Linux systems. It requires that strace and Python 2.6 or 2.7 are installed on your system. In some situations, you will need a recent version of strace (Matlab with the Parallel Computing Toolbox does not work correctly with strace 4.5; strace 4.8 works well).
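
To check whether an installed strace is recent enough, a version comparison along the following lines might help. This is a sketch, not part of vhistify: the function names are made up, and it assumes that `strace -V` prints a line of the form `strace -- version 4.8`.

```python
import re


def parse_strace_version(text):
    """Extract a version tuple such as (4, 8) from the output of `strace -V`."""
    match = re.search(r"version\s+(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))


def is_recent_enough(text, minimum=(4, 8)):
    # Tuples compare element by element, so (4, 5, 20) < (4, 8) < (4, 10).
    return parse_strace_version(text) >= minimum
```

The text to parse can be obtained with `subprocess.check_output(["strace", "-V"])`, decoded to a string.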

Typographic Notations

All examples in this tutorial are shell commands and are displayed in the following form:

# change into directory "some/directory"
$ cd some/directory
$ echo "hello world"
   hello world

The $-sign at the beginning of each line marks the shell prompt and does not belong to the command. Do not copy it when you type the command into the terminal! Lines starting with the #-sign are comments and can be left out when executing the commands. Where appropriate, we show the output of the command underneath the command itself, just like the line "hello world" in the example above. Output lines are indented.

We assume that you use the bash shell. To find out which shell you use, open a terminal and enter

$ echo $SHELL
   /bin/bash

If you use another shell, you might have to adjust some of the commands.

Installation

On Debian, Ubuntu and derived distributions of Linux, you can download the .deb installer and install it on your system. vhistify is installed into the /opt/ directory and the installer will create a symbolic link to the vhistify executable in /opt/bin/. On other Linux distributions, download and unpack the tar archive. Make sure that the vhistify executable is in your path.

$ export PATH="$PATH:/path/to/vhistify/"

If you want to use a custom version of strace with vhistify, you can create a symbolic link to the strace executable in the vhistify/ directory (right next to the vhistify executable). Vhistify will prefer this version of strace over the strace binary found via your PATH.
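
The lookup order described above can be sketched in Python as follows. The function `find_strace` is hypothetical and only mirrors the rule "local copy first, then PATH"; it uses `shutil.which`, which requires Python 3.3 or later.

```python
import os
import shutil


def find_strace(vhistify_dir):
    """Prefer an strace next to the vhistify executable, else search PATH."""
    local = os.path.join(vhistify_dir, "strace")
    # os.path.isfile() follows symbolic links, so a link to strace counts.
    if os.path.isfile(local):
        return local
    return shutil.which("strace")  # None if strace is not installed
```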

First Steps

This section shows how to "vhistify" a simple command-line call. On Linux, to copy a file input.txt to the file output.txt, we can write

# create a file input.txt, which contains the text "hello world"
$ echo "hello world" > input.txt 

# copy the file to the destination
$ cp input.txt output.txt

on the command-line. We can now document this operation with vhistify. First, remove the file again with the following command:

$ rm output.txt

Afterwards, repeat the cp command but insert the word vhistify at the beginning of the command-line:

# copy the file and monitor the copying with vhistify
$ vhistify cp input.txt output.txt

Vhistify will create a file output.txt.vhist in the same directory as output.txt. You can open the file with any PDF viewer (Preview on Mac OS X refuses to open files with an extension other than .pdf, so you have to rename the file to output.txt.pdf there). The VHIST file contains lots of information about the copying process. The entry "title", however, is empty, as vhistify cannot derive sensible titles on its own (it does not know what the command cp does). You can add a title with the --title option:

# remove previous outputs
$ rm output.txt
$ rm output.txt.vhist

# rerun the copying process
$ vhistify --title="Copy a File"  cp input.txt output.txt

We have to remove the old VHIST file since vhistify will not overwrite existing VHIST files. Most often, this behaviour is what you want. You can, however, set the environment variable VHISTIFY_RENAME_OLD_VHISTFILES to true to change this behaviour. This option is very convenient when testing vhistify calls, as you do not have to remove or rename old VHIST files by hand.

$ export VHISTIFY_RENAME_OLD_VHISTFILES=true
$ vhistify --title="Copy a File"  cp input.txt output.txt

If we run several vhistified commands in a row and an output file of one step is the input file of the next call, vhistify will use the VHIST file of the input as the base for the newly created VHIST file:

$ export VHISTIFY_RENAME_OLD_VHISTFILES=true
$ vhistify --title="Copy a File"  cp input.txt intermediate.txt
$ vhistify --title="Copy a File"  cp intermediate.txt output.txt

The VHIST file output.txt.vhist contains two sections, one for each of the two cp commands. If more than one input file is associated with a VHIST file, vhistify will append to one of these VHIST files and embed the others inside the created VHIST file. If a command creates several output files, vhistify will create one VHIST file for each one of them. This way, the complete workflow is always documented within one file and you can inspect the creation history of every file by looking into the accompanying VHIST file.

There is much more to vhistify and VHIST than we can explain in this "first steps" section. You can find more information in the following sections and on http://www.nf.mpg.de/vhist. To get started with vhistify, the information in this section should get you pretty far.

Examples

Vhistify contains a directory examples/, which includes a number of vhistify demos. Most demos will run on any Linux system. Some demos require special software, such as FSL, Matlab or SPM. Each example will abort with an error if it cannot find one of the needed tools.

To run the demos, perform the following steps:

1) Add vhistify to your path
You can test if vhistify is in your path by typing
```
$ vhistify --version
```
on the command line. If vhistify is in your path, you should see output of the form
```
vhistify 1.34.0.1234 of Jun 20 2010
```
If your shell displays an error, type
```
$ export PATH="$PATH:/opt/vhistify/bin"
```
on the command line. This code assumes that you installed vhistify into the directory /opt/vhistify/.
2) Copy the examples directory into your home directory

If you installed vhistify to /opt/vhistify/, copy /opt/vhistify/examples/.
3) Switch into one of the example subdirectories

To view the Python example, switch into examples/python-example/.
4) Read the example's description and instructions

Open the file run-example.bash in a text editor. The first lines of each script contain a short summary.
5) Type ./run-example.bash

This will run the example. Some examples take 1-2 minutes to execute. Afterwards, there will be one or several files with the extension .vhist in the example's directory. You can open them with any PDF viewer.

Commandline Options

You can view a short summary of all command line options by typing:

$ vhistify --help

You can also have a look at the vhistify Manual Page for a more detailed explanation of all options and environment variables.

Plugins

Vhistify supplies a large number of plugins, which you can use to enhance and augment vhistify's output. There are plugins for different scripting languages, plugins for tools or software compilations, plugins which handle how files are listed or embedded in VHIST files and so on. You can use the --list-plugins option to get a list of all plugins available. The --plugin-help option displays a detailed description of a given plugin.

$ vhistify --list-plugins
   This is a list of all installed plugins:

   * spm
       Enhances the output of SPM jobs.

   * largefile
       Disables MD5 checksums for very large files and reads MD5 sums from
       .md5 files on the file system.

   * matlab
       Enhances the output of Matlab jobs.
   ....

$ vhistify --plugin-help largefile
   Documentation of plugin largefile:
       The "largefile" plugin disables the generation of MD5 checksums for
       files larger than 100 MB. If a file with the same name as the
   ....

To enable plugins, set the --plugins option:

$ vhistify --plugins=matlab,largefile ....

A number of plugins are active by default. These plugins are marked as [default] in the listing generated by --list-plugins.

Internal Structure of vhistify

For many purposes, it is sufficient to prepend vhistify to your command line and activate some plugins that seem to suit your needs in order to generate adequate VHIST files. In some cases, however, it is helpful to understand the internal structure of vhistify to debug problems and further improve the generated VHIST files.

Info: Per-plugin operations are executed in the order in which the plugins were specified on the command line or in the environment variables VHISTIFY_PLUGINS and VHISTIFY_DEFAULT_PLUGINS. Plugins in VHISTIFY_DEFAULT_PLUGINS are handled first.
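
The ordering rule can be pictured with a small Python sketch. It is not vhistify's code and rests on several assumptions: the environment variables hold comma-separated names like the --plugins option, plugins from VHISTIFY_PLUGINS come before those given on the command line, and a plugin named twice keeps its first position.

```python
def plugin_order(default_plugins, env_plugins, cli_plugins):
    """Merge comma-separated plugin lists; defaults come first.

    Assumption: duplicates keep their first position, and plugins from
    VHISTIFY_PLUGINS precede those given on the command line.
    """
    ordered = []
    for source in (default_plugins, env_plugins, cli_plugins):
        for name in source.split(","):
            name = name.strip()
            if name and name not in ordered:
                ordered.append(name)
    return ordered
```

For example, `plugin_order("unix", "largefile", "matlab,largefile")` yields the order unix, largefile, matlab.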

Read command line and vhistify environment variables

At first, vhistify reads all command line options and environment variables it supports. This way, it knows the command to execute and the list of required plugins.
Load plugins and create plugin instances

Vhistify tries to load every plugin requested by the user and creates an instance of each plugin. At this point, the plugin constructor is called, followed by the _setup() method. The constructor is individually defined for each plugin, whereas the _setup() method is the same for each plugin. If vhistify cannot create one of the plugin instances for some reason, it exits with an error.
Let plugins initialize environment

Each plugin gets the chance to set up the environment (update environment variables, run programs, etc.) before the actual command is executed. This step is usually trivial, as only a few plugins need vhistify-specific setup of the environment. Setting up the environment is done by the initEnvironment() hook.
Run the given program with strace

The actual command is executed and monitored with strace. Moreover, vhistify measures the time required by the command, records standard output and standard error, checks return value and signals (segmentation fault, killed, interrupted by pressing Ctrl+C, etc) and logs username, hostname, current working directory, etc.
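
The bookkeeping part of this step can be sketched in Python as shown below. The sketch covers only a subset of the recorded items and deliberately leaves out strace; the function name `run_and_record` and the dictionary keys are assumptions for this example, not vhistify's internals.

```python
import os
import socket
import subprocess
import time


def run_and_record(command):
    """Run a command and collect basic facts about the run.

    Simplified sketch: the real tool additionally wraps the command in strace.
    """
    start = time.time()
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    stdout, stderr = process.communicate()
    code = process.returncode
    return {
        "command": list(command),
        "hostname": socket.gethostname(),
        "cwd": os.getcwd(),
        "seconds": time.time() - start,
        "stdout": stdout,
        "stderr": stderr,
        # On POSIX systems a negative return code -N means "killed by signal N".
        "returncode": code,
        "signal": -code if code < 0 else None,
    }
```

To trace system calls as well, the command could be prefixed with something like `["strace", "-f", "-o", logfile]`; which strace options vhistify actually uses is not stated in this tutorial.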
Parse the strace log into a list of system calls

Vhistify parses the log produced by strace into a list of system calls.
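
A minimal Python sketch of such a parser is shown below. It assumes log lines of the common strace form `1234 open("input.txt", O_RDONLY) = 3` and skips everything else; the regular expression and the function name are illustrative and certainly less complete than what vhistify does.

```python
import re

# Matches lines such as:  1234 open("input.txt", O_RDONLY) = 3
SYSCALL_LINE = re.compile(
    r"^(?:(?P<pid>\d+)\s+)?"          # optional process id (as with strace -f)
    r"(?P<name>\w+)\((?P<args>.*)\)"  # system call name and raw argument string
    r"\s+=\s+(?P<result>-?\d+|\?)"    # return value
)


def parse_strace_log(lines):
    """Turn strace output lines into a list of dictionaries.

    Lines that do not look like a completed system call (signals, exit
    markers, "<unfinished ...>" fragments) are skipped in this sketch.
    """
    calls = []
    for line in lines:
        match = SYSCALL_LINE.match(line.strip())
        if match:
            calls.append(match.groupdict())
    return calls
```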
Derive list of files and list of programs from the system calls

From the system calls of interest, vhistify tries to determine which files were read and written by the command. Some system calls add new files to the list, others modify it. For example, if your command creates the file X and then moves it to Y, vhistify documents that your command actually created the file Y.
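
The X-to-Y example can be reproduced with a short Python sketch. It assumes the system calls are already available as `(name, args)` tuples and looks only at open() and rename(); treating O_RDWR as "written" is a simplification made for this example.

```python
def derive_files(calls):
    """Derive the sets of files read and written from (name, args) tuples.

    Simplified: only successful open() and rename() calls are considered.
    """
    read, written = set(), set()
    for name, args in calls:
        if name == "open":
            path, flags = args
            if "O_WRONLY" in flags or "O_RDWR" in flags:
                written.add(path)
            else:
                read.add(path)
        elif name == "rename":
            old, new = args
            # A moved file keeps its role but is documented under the new name.
            for group in (read, written):
                if old in group:
                    group.discard(old)
                    group.add(new)
    return read, written
```

Given an open() of X for writing followed by rename(X, Y), the function reports Y as written and no longer mentions X.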
Let plugins filter out unwanted files from file list

Each plugin gets the complete list of all files and can remove any file it wants to exclude. If one plugin removes a file, the file will not be documented in the generated VHIST file(s). This step is useful to exclude executables, libraries and other parts of the base system, which can be considered static. The unix plugin, for example, removes files in paths such as /bin, /usr or /tmp from the list. The plugin hook is called filterFiles().
Let plugins amend filtered-out files

Each plugin receives the list of files removed by any of the plugins. If a plugin wants one of these files to be part of the VHIST file, it can re-add the file. (Example: the "unix" plugin might remove the whole FSL directory but the "fsl" plugin will re-add FSL template files). If one of the plugins decides to re-add a file, the file will be included in the generated VHIST files. The plugin hook is called amendFiles().
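
How the filter step and the amend step interact can be sketched in Python as follows. The hook names filterFiles() and amendFiles() are taken from the text above, but their signatures, the two toy classes and the FSL template path are assumptions made for this example.

```python
class UnixPlugin(object):
    """Toy stand-in for the "unix" plugin; the real signatures may differ."""
    PREFIXES = ("/bin/", "/usr/", "/tmp/")

    def filterFiles(self, files):
        # Return the files this plugin wants to keep.
        return [f for f in files if not f.startswith(self.PREFIXES)]

    def amendFiles(self, removed):
        return []  # this plugin never re-adds anything


class FslPlugin(object):
    """Toy stand-in for the "fsl" plugin; the template path is hypothetical."""
    TEMPLATES = "/usr/share/fsl/data/standard/"

    def filterFiles(self, files):
        return list(files)

    def amendFiles(self, removed):
        # Return the removed files this plugin wants back in the VHIST file.
        return [f for f in removed if f.startswith(self.TEMPLATES)]


def filter_and_amend(files, plugins):
    kept = list(files)
    for plugin in plugins:
        kept = plugin.filterFiles(kept)
    removed = [f for f in files if f not in kept]
    for plugin in plugins:
        for f in plugin.amendFiles(removed):
            if f not in kept:
                kept.append(f)
    return kept
```

With both toy plugins active, a file below /bin is dropped, while an FSL template below /usr is first removed by the "unix" stand-in and then re-added by the "fsl" stand-in.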
Let plugins convert files

Each plugin gets a list of all files and can modify or convert them. Plugins are executed one after the other and each plugin sees the output of the previous plugin. Plugins might use this step to combine source code files into a zip archive, create preview images for data files, add meta information to files or decide whether a file should be embedded or not. The plugin hook is called convertFiles(). Caveat: plugins can in fact remove files from the list without any form of replacement and thereby undo the previous step. This, however, is considered bad practice and should be avoided.
Add VHIST files and find root files

Search the file system for VHIST files associated with the input files. Add the VHIST files to the list of files and choose one as the root file. The new VHIST section will be appended to the root file, all other VHIST files are embedded into the generated VHIST file(s).
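
A simplified Python sketch of this search is given below. It assumes that the VHIST file of a file x is named x.vhist, as in the examples of this tutorial, and it simply picks the first VHIST file found as the root file; how vhistify really chooses the root file is not described here.

```python
import os


def find_vhist_files(input_files):
    """Return (root, embedded) for the VHIST files of the given inputs.

    Assumption: the VHIST file of "x" is named "x.vhist" and the first one
    found becomes the root; the real selection rule may differ.
    """
    found = [f + ".vhist" for f in input_files if os.path.isfile(f + ".vhist")]
    if not found:
        return None, []
    return found[0], found[1:]
```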
Let plugins augment section

Each plugin gets the section object and can add/modify/remove user-defined keywords to/from the section. This step can be used to add version numbers, svn/git revisions, a summary of the packages installed on the system, etc. Each plugin receives the modified section of the previous plugin. The plugin hook is called updateSection().
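
The chaining of updateSection() calls might look like the following Python sketch. The section is modelled as a plain dictionary and the keyword names are made up; the real section object and hook signature may differ.

```python
import platform
import socket


class VersionPlugin(object):
    def updateSection(self, section):
        # Add a user-defined keyword; the key name is made up for this sketch.
        section["python-version"] = platform.python_version()
        return section


class HostPlugin(object):
    def updateSection(self, section):
        # This plugin receives the section as modified by the previous plugin.
        section["hostname"] = socket.gethostname()
        return section


def augment_section(section, plugins):
    for plugin in plugins:
        section = plugin.updateSection(section)
    return section
```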
Create list of output VHIST files

Vhistify will create one VHIST file for each file marked as an output file. The only exception is the file that contains the standard output.
Create VHIST file(s)

Vhistify creates the VHIST file and generates copies for all output files. Afterwards, vhistify terminates.

Caveats and Things to Keep in Mind

Document Small Units of Work

If you document your compute jobs with vhistify, vhistify sees the complete job as one unit and cannot distinguish between individual components of the job. For example, if you run one script that reads 100 files and generates 100 files (one for each input file), it is a good idea to break up the job into 100 smaller jobs and document each one individually. Otherwise, the generated VHIST files will list all 200 files and it becomes hard to see the structure of the compute jobs.
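
One way to split such a job is to generate one vhistify call per input file, as in the following Python sketch. The function name and the title text are made up, and `tool` stands for any program that takes an input path and an output path; the commands are only built here, not executed.

```python
import os


def per_file_commands(tool, input_files, output_dir):
    """Build one vhistify command line per input file.

    `tool` is a placeholder for a program that takes an input and an
    output path; the commands are returned, not executed.
    """
    commands = []
    for path in input_files:
        name = os.path.basename(path)
        output = os.path.join(output_dir, name)
        commands.append(
            ["vhistify", "--title=Process " + name, tool, path, output]
        )
    return commands
```

Each returned list can then be passed to `subprocess.call()`, so that every input file gets its own small VHIST file.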
Datasets with Several Files

Some file formats (such as the DICOM file format) store datasets as directory trees. Such datasets are hard to document without generating clutter in the VHIST file. We have some ideas on how to document such datasets, but there is no solution yet. Currently, each file will be listed individually inside the generated VHIST files.
Interactive Tools

Even though it is possible to document tools with graphical user interfaces with vhistify, the generated VHIST files might not be very useful. The reason for this is that most interactive programs allow you to perform a large number of processing steps in any order and combination. Since vhistify cannot distinguish the individual steps, you will have a hard time determining what happened.
File Dialogs

Some implementations of "open file/save file" dialogs tend to look into files to determine their file types or generate preview images. This causes a lot of noise that cannot be filtered out by vhistify. We found that the KDE and Gnome file dialogs cause this problem. Other file dialogs, such as the ones used by FSL, Matlab, IDL or Qt, do not cause this problem.
Symbolic Links

Vhistify generates the list of accessed and created files after your program has exited. This makes it difficult to handle symbolic links correctly. At the moment, support for symbolic links is rather primitive. Therefore, some operations (create a symbolic link to an existing directory, write to this directory and afterwards delete the symbolic link) will not be handled correctly. We will improve symbolic link support in the future. For the time being, it is a good idea not to create or delete symbolic links in the programs you document with vhistify.
Interactive Shell Scripts

Vhistify documents the standard output of your programs. For technical reasons, however, it currently does not document the standard input. Therefore, your documentation might be incomplete if your program queries crucial information via standard input and does not show your choices in the standard output.

On the other hand, it is safe for your scripts to ask the user for passwords as they will not appear in the generated VHIST files, either.
Missing System Calls

We do not yet handle all Linux system calls that are relevant to file accesses and filesystem operations. The vast majority of programs and scripts, however, should not be affected by this limitation.
Matlab Multiprocessing 1

If you monitor a Matlab job that uses the Parallel Computing Toolbox, the matlabpool() call to open the Matlab pool requires a lot of time (between half a minute and several minutes). The reason for this is that opening a Matlab pool generates lots of system calls which strace will trace (Matlab starts several instances of itself). Once the Matlab pool is open, performance is normal again.
Matlab Multiprocessing 2
We found that documenting Matlab with vhistify and strace 4.5 causes problems when you use the Parallel Computing Toolbox. Strace may generate PANIC messages of the form
```
PANIC: handle_group_exit: 13595 leader 13572
```
when the Matlab pool is opened or closed. This problem seems to be solved with strace 4.8. Sadly, even though strace 4.5.20 was released in 2010, Ubuntu 13.04 still ships this version. You can find more recent versions of strace at http://sourceforge.net/projects/strace/. Compiling and installing strace requires a gcc compiler and is a matter of
```
# download strace and unpack it to /path/to/strace/sourcecode/
$ cd /path/to/strace/sourcecode/
$ ./configure
$ make
$ sudo make install
```
Distributed Computing

Strace can only trace processes on the same machine. Therefore, you cannot use vhistify to document jobs that are automatically distributed amongst several machines or cluster nodes. Applications that use multi-threading or multi-processing on the same machine should work fine.