Vhistify Tutorial
Introduction
In the field of medical image processing and evaluation, most workflows are composed of several individual tools and scripts, which are combined into one complex pipeline. Since changing even one parameter (e.g. filter method, Matlab version, co-registration template file) can have a large influence on the result of the workflow, documenting all workflow steps is essential (Good Scientific Practice). Manual documentation is error-prone, cumbersome and in some cases outright impossible (opaque relationships within and between complex software packages).
Vhistify is our attempt to create workflow documentation in an automated way. It is based on VHIST, a file format specifically designed to document workflows. VHIST files are self-contained and PDF-compatible: all information stored in a VHIST file is accessible from any PDF viewer. However, VHIST files also contain structured information on each workflow step (embedded XML) suitable for automated processing.
Vhistify executes and monitors another program and gathers lots of information about the monitored process. This information includes:
- The command line of the executed command
- The hostname of the computer on which the program was executed
- The name of the user who executed the command
- The time the program took to run
- The initial working directory of the command
- The return value of the program as well as the reason for termination (killed, segmentation fault, etc)
- Standard output and error of the program
- A list of files read and written by the program, including paths, file sizes and MD5 fingerprints
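To give a feeling for the kind of record described above, here is a minimal Python 3 sketch. It is not vhistify's implementation: the function names run_and_record(), describe_file() and md5_fingerprint() are made up for this illustration, and the sketch does not discover accessed files on its own (vhistify uses strace for that); it only fingerprints files you name explicitly.

```python
import getpass
import hashlib
import os
import socket
import subprocess
import sys
import time


def md5_fingerprint(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_file(path):
    """Collect path, size and MD5 fingerprint of one file."""
    return {
        "path": os.path.abspath(path),
        "size": os.path.getsize(path),
        "md5": md5_fingerprint(path),
    }


def current_user():
    """Return the login name, or 'unknown' if it cannot be determined."""
    try:
        return getpass.getuser()
    except Exception:
        return "unknown"


def run_and_record(argv):
    """Run a command and record basic facts about the run (hypothetical helper)."""
    start = time.time()
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return {
        "command": list(argv),
        "hostname": socket.gethostname(),
        "user": current_user(),
        "cwd": os.getcwd(),
        "seconds": time.time() - start,
        # On POSIX, a negative return code means "terminated by that signal".
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


if __name__ == "__main__":
    record = run_and_record([sys.executable, "-c", "print('hello world')"])
    print(record["command"], record["returncode"], record["seconds"])
```

The record is a plain dictionary, so it could be serialized or printed as needed. The VHIST format itself stores such information as a PDF with embedded XML, which this sketch does not attempt to reproduce.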
With the help of plugins, vhistify can also:
- Infer version numbers of tools used
- Collect meta-information for files of known file formats
- Generate zip archives that contain the source code of a script
- Gather more detailed information about the machine used (version of Linux, versions of core libraries and programs, etc.)
- Create preview images for plots or graphics
- etc.
Prerequisites
Currently, vhistify only supports Linux systems. It requires strace and Python 2.6 or 2.7 to be installed on your system. In some situations, you will need a recent version of strace (Matlab with the Parallel Computing Toolbox does not work correctly with strace 4.5; strace 4.8 works well).
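If you are unsure which strace version is installed, a check along the following lines may help. This is a Python 3 sketch and not part of vhistify; it assumes that `strace -V` prints a line of the form "strace -- version 4.8", and the helper names are invented for this example. The minimum of (4, 8) is taken from the remark above and only matters for Matlab jobs that use the Parallel Computing Toolbox.

```python
import re
import subprocess


def parse_strace_version(text):
    """Extract (major, minor) from text such as 'strace -- version 4.8'."""
    match = re.search(r"version\s+(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return int(match.group(1)), int(match.group(2))


def installed_strace_version(executable="strace"):
    """Ask strace for its version with the -V option."""
    proc = subprocess.run(
        [executable, "-V"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_strace_version(proc.stdout)


def new_enough(version, minimum=(4, 8)):
    """Compare a (major, minor) tuple against the assumed minimum version."""
    return version >= minimum
```

On a machine with strace installed, `new_enough(installed_strace_version())` tells you whether the installed version meets that minimum.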
Typographic Notations
All examples in this tutorial are shell commands and are displayed in the following form:
# change into directory "some/directory"
$ cd some/directory
$ echo "hello world"
  hello world
The $ sign at the beginning of each line marks the shell prompt and does not belong to the command. Do not copy it when you type the command into the terminal! Lines starting with the # sign are comments and can be left out when executing the commands. Where appropriate, we show the output of the command underneath the command itself, just like the line "hello world" in the example above. Output lines are shown in an indented way.
We assume that you use the bash shell. To find out which shell you use, open a terminal and enter
$ echo $SHELL
/bin/bash
If you use another shell, you might have to adjust some of the commands.
Installation
On Debian, Ubuntu and derived distributions of Linux, you can download the .deb installer and install it on your system. vhistify is installed into the /opt/ directory and the installer will create a symbolic link to the vhistify executable in /opt/bin/. On other Linux distributions, download and unpack the tar archive. Make sure that the vhistify executable is in your path.
$ export PATH="$PATH:/path/to/vhistify/"
If you want to use a custom version of strace with vhistify, you can create a symbolic link to the strace executable in the vhistify/ directory (right next to the vhistify executable). Vhistify will prefer this version of strace over the strace binary found via the PATH.
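The lookup order described above can be pictured with the following Python 3 sketch. It is only an illustration under assumptions: find_strace() is a hypothetical helper, and how vhistify actually locates strace internally is not documented here.

```python
import os
import shutil


def find_strace(vhistify_dir):
    """Prefer a strace next to the vhistify executable, else search the PATH."""
    local = os.path.join(vhistify_dir, "strace")
    # os.path.isfile() follows symbolic links, so a link to a custom build works.
    if os.path.isfile(local) and os.access(local, os.X_OK):
        return local
    # shutil.which() returns None if no strace is found on the PATH.
    return shutil.which("strace")
```

Calling `find_strace("/path/to/vhistify")` returns the linked strace if it exists and is executable, and falls back to whatever strace the PATH provides.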
First Steps
This section shows how to "vhistify" a simple command-line call. On Linux, to copy a file input.txt to the file output.txt, we can write
# create a file input.txt, which contains the text "hello world"
$ echo "hello world" > input.txt
# copy the file to the destination
$ cp input.txt output.txt
on the command-line. We can now document this operation with vhistify. First, remove the file again with the following command:
$ rm output.txt
Afterwards, repeat the cp command but insert the word vhistify at the beginning of the command-line:
# copy the file and monitor the copying with vhistify
$ vhistify cp input.txt output.txt
Vhistify will create a file output.txt.vhist in the same directory as output.txt. You can open the file with any PDF viewer. (Preview on Mac OS X refuses to open files with an extension other than .pdf, so you have to rename the file to output.txt.pdf there.) The VHIST file contains lots of information about the copying process. The entry "title", however, is empty as vhistify cannot derive sensible titles on its own (it does not know what the command cp does). You can add a title with the --title option:
# remove previous outputs
$ rm output.txt
$ rm output.txt.vhist
# rerun the copying process
$ vhistify --title="Copy a File" cp input.txt output.txt
We have to remove the old VHIST file since vhistify will not overwrite existing VHIST files. Most often, this behaviour is what you want. You can, however, set the environment variable VHISTIFY_RENAME_OLD_VHISTFILES to true to change this behaviour. This option is very convenient when testing vhistify calls as you do not have to remove or rename old VHIST files by hand.
$ export VHISTIFY_RENAME_OLD_VHISTFILES=true
$ vhistify --title="Copy a File" cp input.txt output.txt
If we run several vhistified commands in a row and an output file of one step is the input file of the next call, vhistify will use the VHIST file of the input as the base for the newly created VHIST file:
$ export VHISTIFY_RENAME_OLD_VHISTFILES=true
$ vhistify --title="Copy a File" cp input.txt intermediate.txt
$ vhistify --title="Copy a File" cp intermediate.txt output.txt
The VHIST file output.txt.vhist contains two sections, one for each of the two cp commands. If more than one input file is associated with a VHIST file, vhistify will append to one of them and embed the others inside the created VHIST file. If a command creates several output files, vhistify will create one VHIST file for each of them. This way, the complete workflow is always documented within one file and you can inspect the creation history of every file by looking into the accompanying VHIST file.
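The bookkeeping described in this paragraph can be sketched in a few lines of Python. This is an illustration, not vhistify's code: plan_vhist_files() is a hypothetical helper, and picking the first VHIST file found as the one to append to is an assumption, since the tutorial only says that one of them is chosen.

```python
import os


def vhist_path(path):
    """Name of the VHIST file that accompanies a data file."""
    return path + ".vhist"


def plan_vhist_files(inputs, outputs):
    """Decide which VHIST file is extended, which are embedded, which are written."""
    existing = [vhist_path(p) for p in inputs if os.path.exists(vhist_path(p))]
    return {
        # Assumption: the first VHIST file found becomes the one appended to.
        "root": existing[0] if existing else None,
        # All other VHIST files of the inputs are embedded in the result.
        "embedded": existing[1:],
        # One VHIST file is written per output file.
        "written": [vhist_path(p) for p in outputs],
    }
```

For the two-step copy above, `plan_vhist_files(["intermediate.txt"], ["output.txt"])` would name intermediate.txt.vhist as the file to extend (if it exists) and output.txt.vhist as the file to write.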
There is much more to vhistify and VHIST than we can explain in this "first steps" section. You can find more information in the following sections and on http://www.nf.mpg.de/vhist. To get started with vhistify, the information in this section should get you pretty far.
Examples
Vhistify contains a directory examples/, which includes a number of vhistify demos. Most demos will run on any Linux system. Some demos require special software, such as FSL, Matlab or SPM. Each example will abort with an error if it cannot find one of the needed tools.
To run the demos, perform the following steps:
- 1) Add vhistify to your path
You can test if vhistify is in your path by typing
$ vhistify --version
on the command line. If vhistify is in your path, you should see output of the form
  vhistify 1.34.0.1234 of Jun 20 2010
If your shell displays an error, type
$ export PATH="$PATH:/opt/vhistify/bin"
on the command line. This code assumes that you installed vhistify into the directory /opt/vhistify/.
- 2) Copy the examples directory into your home directory
If you installed vhistify to /opt/vhistify/, copy /opt/vhistify/examples/.
- 3) Switch into one of the example subdirectories
To view the Python example, switch into examples/python-example/.
- 4) Read the example's description and instructions
Open the file run-example.bash in a text editor. The first lines of each script contain a short summary.
- 5) Type ./run-example.bash
This will run the example. Some examples take 1-2 minutes to execute. Afterwards, there will be one or several files with the extension .vhist in the example's directory. You can open them with any PDF viewer.
Commandline Options
You can view a short summary of all command line options by typing:
$ vhistify --help
You can also have a look at the vhistify Manual Page for a more detailed explanation of all options and environment variables.
Plugins
Vhistify supplies a large number of plugins, which you can use to enhance and augment vhistify's output. There are plugins for different scripting languages, plugins for tools or software compilations, plugins which handle how files are listed or embedded in VHIST files and so on. You can use the --list-plugins option to get a list of all plugins available. The --plugin-help option displays a detailed description of a plugin.
$ vhistify --list-plugins
This is a list of all installed plugins:
* spm
Enhances the output of SPM jobs.
* largefile
Disables MD5 checksums for very large files and reads MD5 sums from
.md5 files on the file system.
* matlab
Enhances the output of Matlab jobs.
....
$ vhistify --plugin-help largefile
Documentation of plugin largefile:
The "largefile" plugin disables the generation of MD5 checksums for
files larger than 100 MB. If a file with the same name as the
....
To enable plugins, set the --plugins option:
$ vhistify --plugins=matlab,largefile ....
A number of plugins are active by default. These plugins are marked as [default] in the listing generated by --list-plugins.
Internal Structure of vhistify
For many purposes, it is sufficient to prepend vhistify to your command line and activate some plugins that seem to suit your needs in order to generate adequate VHIST files. In some cases, however, it is helpful to understand the internal structure of vhistify to debug problems and further improve the generated VHIST files.
Info: Per-plugin operations are executed in the order in which the plugins were specified on the command line or in the environment variables VHISTIFY_PLUGINS and VHISTIFY_DEFAULT_PLUGINS. Plugins in VHISTIFY_DEFAULT_PLUGINS are handled first.
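One way to picture this ordering is the Python sketch below. Several details are assumptions and not documented behaviour: that the environment variables hold comma-separated names (as the --plugins option does), that VHISTIFY_PLUGINS is consulted before the command line, and that a plugin named twice runs only once. The function plugin_order() is a hypothetical helper.

```python
def plugin_order(environ, command_line_plugins=""):
    """Return plugin names in the order in which their hooks would run."""

    def split(value):
        # Split a comma-separated list and drop empty entries.
        return [name.strip() for name in value.split(",") if name.strip()]

    ordered = []
    sources = (
        environ.get("VHISTIFY_DEFAULT_PLUGINS", ""),  # handled first
        environ.get("VHISTIFY_PLUGINS", ""),          # assumption: second
        command_line_plugins,                         # assumption: last
    )
    for source in sources:
        for name in split(source):
            if name not in ordered:  # assumption: duplicates run only once
                ordered.append(name)
    return ordered
```

With `os.environ` as the first argument and the value of --plugins as the second, the result lists the plugins in the order assumed here.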
- Read command line and vhistify environment variables
First, vhistify reads all command line options and environment variables it supports. This way, it knows the command to execute and the list of required plugins.
- Load plugins and create plugin instances
Vhistify tries to load every plugin requested by the user. It creates an instance of each plugin. At this point, the plugin constructor is called, followed by the _setup() method. The constructor is individually defined for each plugin whereas the _setup() method is the same for each plugin. If vhistify cannot create one of the plugin instances for some reason, it exits with an error.
- Let plugins initialize environment
Each plugin gets the chance to set up the environment (update environment variables, run programs, etc.) before the actual command is executed. This step is usually trivial as only few plugins need vhistify-specific setup of the environment. Setting up the environment is done by the initEnvironment() hook.
- Run the given program with strace
The actual command is executed and monitored with strace. Moreover, vhistify measures the time required by the command, records standard output and standard error, checks return value and signals (segmentation fault, killed, interrupted by pressing Ctrl+C, etc.) and logs username, hostname, current working directory, etc.
- Parse the strace log into a list of system calls.
Just what the title says.
- Derive list of files and list of programs from the system calls.
From the system calls of interest, vhistify tries to determine which files were read and written by the command. Some system calls add new files to the list, others modify it. For example, if your command creates the file X and then moves it to Y, vhistify documents that your command actually created the file Y.
- Let plugins filter out unwanted files from file list.
Each plugin gets the complete list of all files and can remove any file it wants to exclude. If one plugin removes a file, the file will not be documented in the generated VHIST file(s). This step is useful to exclude executables, libraries and other parts of the base system, which can be considered static. The unix plugin, for example, removes files in paths such as /bin, /usr or /tmp from the list. The plugin hook is called filterFiles().
- Let plugins amend filtered-out files
Each plugin receives the list of files removed by any of the plugins. If a plugin wants one of these files to be part of the VHIST file, it can re-add the file. (Example: the "unix" plugin might remove the whole FSL directory but the "fsl" plugin will re-add FSL template files.) If one of the plugins decides to re-add a file, the file will be included in the generated VHIST files. The plugin hook is called amendFiles().
- Let plugins convert files
Each plugin gets a list of all files and can modify or convert them. Plugins are executed one after the other and each plugin sees the output of the previous plugin. Plugins might use this step to combine source code files into a zip archive, create preview images for data files, add meta information to files or decide whether a file should be embedded or not. The plugin hook is called convertFiles(). Caveat: plugins can in fact remove files from the list without any form of replacement and thereby undo the previous step. This, however, is considered bad practice and should be avoided.
- Add VHIST files and find root files
Vhistify searches the file system for VHIST files associated with the input files. It adds the VHIST files to the list of files and chooses one as the root file. The new VHIST section will be appended to the root file; all other VHIST files are embedded into the generated VHIST file(s).
- Let plugins augment section
Each plugin gets the section object and can add, modify or remove user-defined keywords in the section. This step can be used to add version numbers, svn/git revisions, a summary of the packages installed on the system, etc. Each plugin receives the modified section of the previous plugin. The plugin hook is called updateSection().
- Create list of output VHIST files
Vhistify will create one VHIST file for each file marked as output file. The only exception is the file that contains the standard output.
- Create VHIST file(s)
Vhistify creates the VHIST file and generates copies for all output files. Afterwards, vhistify terminates.
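The middle steps of this list (parsing the strace log, deriving the file list, and the filterFiles() and amendFiles() hooks) can be illustrated with a small Python sketch. It is not vhistify's code: the log lines are a simplified, made-up excerpt (real strace output has more variants, such as calls split into "unfinished" and "resumed" parts), the two plugin classes are toy stand-ins, and the hook signatures (a list of paths in, a list of paths out) are assumptions. Only the hook names and the order of the steps come from the description above.

```python
import re

CALL = re.compile(r'^(?:\d+\s+)?(\w+)\((.*)\)\s+=\s+(-?\d+)')
QUOTED = re.compile(r'"([^"]*)"')

# Made-up, simplified strace excerpt (the template path is hypothetical).
LOG = '''
13572 open("/usr/lib/libc.so.6", O_RDONLY) = 3
13572 open("input.txt", O_RDONLY) = 3
13572 open("X", O_WRONLY|O_CREAT, 0644) = 4
13572 rename("X", "Y") = 0
13572 open("/usr/share/fsl/template.nii", O_RDONLY) = 5
13572 open("missing.txt", O_RDONLY) = -1 ENOENT (No such file or directory)
'''


def parse_strace_log(lines):
    """Turn strace output lines into (name, args, result) tuples."""
    calls = []
    for line in lines:
        match = CALL.match(line.strip())
        if match:
            calls.append((match.group(1), match.group(2), int(match.group(3))))
    return calls


def derive_files(calls):
    """Build {path: 'read' | 'written'} from open, rename and unlink calls."""
    files = {}
    for name, args, result in calls:
        if result < 0:
            continue  # the call failed, so nothing changed on disk
        paths = QUOTED.findall(args)
        if name in ("open", "openat") and paths:
            writing = "O_WRONLY" in args or "O_RDWR" in args
            if writing or paths[0] not in files:
                files[paths[0]] = "written" if writing else "read"
        elif name == "rename" and len(paths) == 2 and paths[0] in files:
            files[paths[1]] = files.pop(paths[0])  # X moved to Y: document Y
        elif name == "unlink" and paths:
            files.pop(paths[0], None)
    return files


class UnixPlugin(object):
    """Toy stand-in for the "unix" plugin: hides system directories."""

    def filterFiles(self, paths):
        return [p for p in paths if p.startswith(("/bin/", "/usr/", "/tmp/"))]

    def amendFiles(self, removed):
        return []


class TemplatePlugin(object):
    """Toy stand-in for a tool plugin that re-adds its template files."""

    def filterFiles(self, paths):
        return []

    def amendFiles(self, removed):
        return [p for p in removed if p.endswith(".nii")]


def apply_plugins(files, plugins):
    """Run the filter step for all plugins, then the amend step."""
    kept, removed = dict(files), {}
    for plugin in plugins:
        for path in plugin.filterFiles(sorted(kept)):
            removed[path] = kept.pop(path)
    for plugin in plugins:
        for path in plugin.amendFiles(sorted(removed)):
            kept[path] = removed.pop(path)
    return kept


if __name__ == "__main__":
    files = derive_files(parse_strace_log(LOG.splitlines()))
    print(apply_plugins(files, [UnixPlugin(), TemplatePlugin()]))
```

In this toy run the C library is filtered out and stays out, the template file is filtered out and then re-added, the failed open of missing.txt is ignored, and the created file shows up under its final name Y.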
Caveats and Things to Keep in Mind
- Document Small Units of Work
If you document your compute jobs with vhistify, vhistify sees the complete job as one unit and cannot distinguish between individual components of the job. For example, if you run one script that reads 100 files and generates 100 files (one for each input file), it is a good idea to break up the job into 100 smaller jobs and document each one individually. Otherwise, the generated VHIST files will list all 200 files and it becomes hard to see the structure of the compute job.
- Datasets with Several Files
Some file formats (such as the DICOM file format) store datasets as directory trees. Such datasets are hard to document without generating clutter in the VHIST file. We have some ideas on how to document such datasets, but there is no solution yet. Currently, each file will be listed individually inside the generated VHIST files.
- Interactive Tools
Even though it is possible to document tools with graphical user interfaces with vhistify, the generated VHIST files might not be very useful. The reason for this is that most interactive programs allow you to perform a large number of processing steps in any order and combination. Since vhistify can not discriminate the individual steps, you will have a hard time determining what happened.
- File Dialogs
Some implementations of "open file/save file" dialogs tend to look into files to determine their file types or generate preview images. This causes a lot of noise that cannot be filtered out by vhistify. We found that the KDE and Gnome file dialogs cause this problem. Other file dialogs, such as the ones used by FSL, Matlab, IDL or Qt, do not cause this problem.
- Symbolic Links
Vhistify generates the list of accessed and created files after your program has exited. This property makes it problematic to handle symbolic links correctly. At the moment, support for symbolic links is rather primitive. Therefore, some operations (create a symbolic link to an existing directory, write to this directory and afterwards delete the symbolic link) will not be handled correctly. We will improve symbolic link support in the future. For the time being, it is a good idea not to create or delete symbolic links in the programs you document with vhistify.
- Interactive Shell Scripts
Vhistify documents the standard output of your programs. For technical reasons, however, it currently does not document the standard input. Therefore, your documentation might be incomplete if your program queries crucial information via standard input and does not show your choices in the standard output.
On the other hand, it is safe for your scripts to ask the user for passwords as they will not appear in the generated VHIST files, either.
- Missing System Calls
We do not yet handle all Linux system calls that are relevant to file accesses and filesystem operations. The vast majority of programs and scripts, however, should not be affected by this limitation.
- Matlab Multiprocessing 1
If you monitor a Matlab job that uses the Parallel Computing Toolbox, the matlabpool() call to open the Matlab pool requires a lot of time (between half a minute and several minutes). The reason for this is that opening a Matlab pool generates lots of system calls, which strace will trace (Matlab starts several instances of itself). After the Matlab pool has opened, performance is normal again.
- Matlab Multiprocessing 2
We found that documenting Matlab with vhistify and strace 4.5 causes problems when you use the Parallel Computing Toolbox. Strace may generate PANIC messages of the form
PANIC: handle_group_exit: 13595 leader 13572
when the Matlab pool is opened or closed. This problem seems to be solved with strace 4.8. Sadly, even though strace 4.5.20 was released in 2010, Ubuntu 13.04 still ships this version. You can find more recent versions of strace at http://sourceforge.net/projects/strace/. Compiling and installing strace requires a gcc compiler and is a matter of
# download strace and unzip it to /path/to/strace/sourcecode/
$ cd /path/to/strace/sourcecode/
$ ./configure
$ make
$ sudo make install
- Distributed Computing
Strace can only trace processes on the same machine. Therefore, you cannot use vhistify to document jobs that are automatically distributed amongst several machines or cluster nodes. Applications that use multi-threading or multi-processing on the same machine should work fine.
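As an illustration of the advice under "Document Small Units of Work", the following Python sketch builds one vhistify call per input file instead of a single call for the whole batch. The script name ./process-one.sh and the directory layout are hypothetical; only the --title option is taken from this tutorial.

```python
import os


def per_file_commands(script, input_files, output_dir):
    """Build one vhistify command line per input file."""
    commands = []
    for path in input_files:
        name = os.path.basename(path)
        commands.append([
            "vhistify",
            "--title=Process %s" % name,
            script,                          # hypothetical per-file script
            path,
            os.path.join(output_dir, name),  # one output per input
        ])
    return commands
```

Each entry of the returned list can be passed to subprocess.check_call(), so that every output file gets its own VHIST file that lists just one input and one output.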