Python Dispatchers for CCP4

Introduction

CCP4Dispatchers is a project currently under development to provide automatically generated Python dispatchers to wrap the executables distributed with the CCP4 Suite.

The motivation is to encapsulate the setup of CCP4 environment variables so that CCP4 programs can be run from Python scripts regardless of whether Python was started from a correct CCP4 environment. This removes the need to source one of the shell environment setup files (such as ccp4.setup-sh or ccp4.setup-csh) on Linux and Mac systems, or to set environment variables centrally on Windows. Instead, the dispatcher sets the correct environment just before the program is executed. Because the environment is encapsulated in this way, clashes with incompatible environment definitions required by non-CCP4 software are avoided. A standard package of Python wrappers may also benefit the many authors who have previously written their own dispatchers to wrap CCP4 programs. The dispatchers are intended to be portable: they are used in the same way on every system with a compatible Python interpreter.

A generated dispatcher can be used from the command line as a drop-in replacement for the executable it wraps, with no change to syntax. Alternatively, the package CCP4Dispatchers can be imported within Python and used to run CCP4 software as subprocesses. This provides a simple mechanism for Python scripting of tasks using CCP4 programs.

Dispatcher generation

For binary distributions of the Suite, the CCP4Dispatchers package will have been generated automatically as part of the installation. When the Suite has been compiled from source, or when alternative environment definitions are required, it will be necessary to use the straightforward procedure explained below to generate the package.

Generating dispatchers is a two-step process. First, an environment definition file must be created. This file contains the variable names and their values, with one definition per line in one of the following formats:

VARIABLE=VALUE
export VARIABLE=VALUE
setenv VARIABLE VALUE

These formats were chosen so that the environment definition file can double as a script that may be sourced on systems with an appropriate shell, giving the same environment as that set by the dispatchers. Any lines that do not adhere to one of these formats are ignored. Shell-like substitutions are allowed if they refer to variables defined on earlier lines. So a line like CBIN=$CCP4/bin will be correctly interpreted as long as the value for CCP4 has already been set.
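The parsing rules just described can be illustrated with a short sketch. This is not the code used by dispatcherGenerator.py; the function name parse_definitions and the regular expressions are assumptions made only for this illustration, and details such as quoting are deliberately not handled.

```python
import re

# Matches "VARIABLE=VALUE", "export VARIABLE=VALUE" and "setenv VARIABLE VALUE".
_DEFINITION = re.compile(
    r"^\s*(?:(?:export\s+)?(?P<name1>[A-Za-z_]\w*)=(?P<value1>.*)"
    r"|setenv\s+(?P<name2>[A-Za-z_]\w*)\s+(?P<value2>.*))$")
# Matches shell-like references of the form $NAME or ${NAME}.
_REFERENCE = re.compile(r"\$(?:\{(\w+)\}|(\w+))")


def parse_definitions(lines):
    """Return (name, value) pairs in the order the names were first defined."""
    env = {}
    order = []

    def lookup(ref):
        key = ref.group(1) or ref.group(2)
        # Only variables defined on earlier lines are substituted;
        # unknown references are left untouched.
        return env.get(key, ref.group(0))

    for line in lines:
        match = _DEFINITION.match(line.rstrip("\n"))
        if match is None:
            continue  # lines in any other format are ignored
        name = match.group("name1") or match.group("name2")
        value = match.group("value1")
        if value is None:
            value = match.group("value2")
        value = _REFERENCE.sub(lookup, value.strip())
        if name not in env:
            order.append(name)
        env[name] = value
    return [(name, env[name]) for name in order]


# Illustrative input only; the paths are made up.
example = parse_definitions([
    "CCP4=/opt/ccp4",
    "export CBIN=$CCP4/bin",
    "setenv CLIB ${CCP4}/lib",
    "# a comment, which is ignored",
])
```

Here the second and third lines resolve correctly only because CCP4 was defined on the first line, which mirrors the ordering requirement stated above.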

To make the process of determining the correct environment definition as easy as possible, the Python program envExtractor.py may be used. This program writes out a definition file based on the result of sourcing an existing 'ccp4.setup' script in a sanitised environment. Therefore, a suitably edited version of the template file 'ccp4.setup-sh' or 'ccp4.setup-csh' may be passed to envExtractor.py, which will write out an environment definition, called for example 'ccp4-env.sh'.

Second, the environment definition file created in the previous step is passed to dispatcherGenerator.py, along with the path to the directory containing the CCP4 executables, to generate the CCP4Dispatchers package. By default, the package is created under the current directory. To match the current behaviour of the binary installer, the package should be generated under $CCP4/share/python/. In addition, symbolic links (or .bat files on Windows) pointing to the dispatchers are written. These have the same names as the programs wrapped by the dispatchers, which makes use of the dispatchers transparent: once the location of the links is added to PATH, the dispatchers can be used from the command line exactly as if the programs were being run directly.

Further usage instructions for envExtractor.py and dispatcherGenerator.py are printed when they are run at the command line, without arguments. These instructions show, for example, how to write the CCP4Dispatchers package elsewhere than the current working directory, or to separate the directory of Python modules from the symbolic links (or .bat files).

Using CCP4Dispatchers within Python

The CCP4Dispatchers directory created by dispatcherGenerator.py is a Python package. The individual modules that set up Dispatcher classes for each wrapped program may not only be run from the command line, but also imported and used from an existing Python process. For this to work, the CCP4Dispatchers directory must be on the Python search path. A typical way to achieve that is to add its parent directory to the PYTHONPATH environment variable.
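As an alternative to setting PYTHONPATH, a script can extend the search path itself before importing the package. The sketch below uses only the standard library; the helper name add_package_parent and the installation prefix /opt/ccp4 are assumptions for illustration, and the real parent directory depends on where the package was generated.

```python
import os
import sys


def add_package_parent(parent_dir):
    """Put parent_dir at the front of sys.path unless it is already listed."""
    parent_dir = os.path.abspath(parent_dir)
    if parent_dir not in sys.path:
        sys.path.insert(0, parent_dir)
    return parent_dir


# Hypothetical location: the directory that contains the package directory,
# not the package directory itself.
package_parent = add_package_parent(
    os.path.join("/opt/ccp4", "share", "python"))

# After this, "from CCP4Dispatchers import dispatcher_builder" can be
# attempted, provided the package really exists under that directory.
```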

The CCP4Dispatchers package is designed to be flexible and may be used in different ways to address different use cases. The best way to demonstrate its functionality is to present the same CCP4 job (here the equivalent of 'refmac5-simple.exam') run through the dispatchers in different ways, each meeting the requirements of one of those use cases.

Use case 1: Simply start a job and wait for the results

The easiest way to run a CCP4 job inside Python, using CCP4Dispatchers, is demonstrated by the following script:

from string import Template
import os
from CCP4Dispatchers import dispatcher_builder

cmd = Template("HKLIN $CEXAM/rnase/rnase18.mtz " + \
               "HKLOUT $CCP4_SCR/rnase_simple_out.mtz " + \
               "XYZIN $CEXAM/rnase/rnase.pdb " + \
               "XYZOUT $CCP4_SCR/rnase_simple_out.pdb")
cmd = cmd.substitute(os.environ)
keywords = """LABIN FP=FNAT SIGFP=SIGFNAT FREE=FreeR_flag
NCYC 10
END
"""

d = dispatcher_builder("refmac5", cmd, keywords)
d.call()

Here, rather than importing the refmac5 Dispatcher class directly and then instantiating it with a separate statement, we import the dispatcher_builder factory function from the package and use that. This has the advantage of brevity if many different Dispatchers are to be used in a script, as only one import line is needed. Additionally, the desired Dispatcher is requested by a string matching the original program name (with '.exe' removed from the name of Windows binaries) rather than its Python module name. This aids the automatic import of valid dispatcher modules, because their file names differ from the original program name whenever that name does not comply with Python's module naming conventions (for example, names containing a '.').

The first time a name is imported from the CCP4Dispatchers package, the CCP4 environment is set automatically. This is what allows us to use the convenient Template strings (available from Python 2.4 onwards) in the above script to fill in the values for the CCP4 environment variables CEXAM and CCP4_SCR in cmd from os.environ. It does not imply that Python has to be started from a sourced CCP4 environment! For scripts that manipulate the environment in some way, a method can be called to reset the CCP4 environment at any time that a Dispatcher is in scope. For example, in the above script it could be called as d.set_env().
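The behaviour of Template substitution can be seen in isolation with a plain dictionary standing in for os.environ. The directory values below are made up for illustration; only the standard library string.Template class is used.

```python
from string import Template

# Stand-in for os.environ; these values are illustrative only.
env = {"CEXAM": "/opt/ccp4/examples", "CCP4_SCR": "/tmp/scratch"}

cmd = Template("HKLIN $CEXAM/rnase/rnase18.mtz "
               "HKLOUT $CCP4_SCR/rnase_simple_out.mtz")
filled = cmd.substitute(env)

# substitute() raises KeyError if a placeholder has no value, which is a
# useful early warning that the environment has not been set.
try:
    Template("XYZIN $MISSING/rnase.pdb").substitute(env)
    missing_detected = False
except KeyError:
    missing_detected = True

# safe_substitute() leaves unknown placeholders in place instead.
partial = Template("XYZIN $MISSING/rnase.pdb").safe_substitute(env)
```

Using substitute() rather than safe_substitute(), as the script above does, means a missing CCP4 variable fails immediately instead of passing a literal '$CEXAM' to the program.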

Using dispatcher_builder allowed us to pass the command line string and program keywords as arguments to that function. If a Dispatcher is constructed directly, both start as None. If command line arguments or keywords are required for the job, they must then be set separately with the Dispatcher methods set_cmd_args(value) and set_keywords(value). In each case, value must be either a string or a list of strings.

The call() method handles dispatch to the refmac5 binary, passing the command line arguments and keywords, and returns the exit code once the process completes. As a side-effect, the attributes stdout_data and stderr_data of d are set, containing the STDOUT and STDERR output from the process respectively.
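A script will usually want to act on the exit code. The following minimal sketch wraps that check in a helper; run_checked is a hypothetical name, not part of the package, and it relies only on the call() return value and the stdout_data and stderr_data attributes described above. No assumption is made about the type of those attributes.

```python
def run_checked(dispatcher):
    """Run a Dispatcher to completion and raise if the exit code is non-zero.

    Returns whatever the Dispatcher stored in stdout_data.
    """
    exit_code = dispatcher.call()
    if exit_code != 0:
        raise RuntimeError("job failed with exit code %s\n%s"
                           % (exit_code, dispatcher.stderr_data))
    return dispatcher.stdout_data
```

With the script above, this would be used as log = run_checked(d) in place of the bare d.call().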

Use case 2: An interactive pipeline

In some cases it is necessary to have more control over the running sub-process. For those instances there is a different interface provided by call(wait=False). This version of the script does not block on the call statement, but returns control immediately to the user script, which would allow instantiation of further Dispatchers for parallel job execution:

from string import Template
import os
from CCP4Dispatchers import dispatcher_builder

cmd = Template("HKLIN $CEXAM/rnase/rnase18.mtz " + \
               "HKLOUT $CCP4_SCR/rnase_simple_out.mtz " + \
               "XYZIN $CEXAM/rnase/rnase.pdb " + \
               "XYZOUT $CCP4_SCR/rnase_simple_out.pdb")
cmd = cmd.substitute(os.environ)
keywords = """LABIN FP=FNAT SIGFP=SIGFNAT FREE=FreeR_flag
NCYC 10
END
"""

d = dispatcher_builder("refmac5", cmd, keywords)
d.call(wait=False)

while d.isRunning:
    stdout_line, stderr_line = d.monitor()

    # Do something with stdout_line. If the job is going badly,
    # d.abort() can be called here.
    if stdout_line:
        print(stdout_line.rstrip())

This level of control does not come for free. Each Dispatcher caches data output to STDOUT and STDERR in internal buffers. To access that data, the user code must monitor the Dispatcher, taking data one line at a time from these buffers. The Dispatcher provides a method, monitor(), to do this. It returns a tuple containing a single line of output from each of the STDOUT and STDERR streams of the process (or None if none is available). As a side-effect, monitor() also fills d.stdout_data and d.stderr_data so that all of the lines can be inspected after the program completes.

With this scheme it is incumbent upon the client code to call monitor() enough times that all of the output of the program is read. One way to ensure this happens is to repeatedly call monitor() while the attribute d.isRunning is True. This is safe because this attribute is set to False by monitor() only when the subprocess has completed execution and there are no more lines of output left to read from either STDOUT or STDERR.

As this approach allows live progress monitoring of a running program, an additional method, abort(), is provided to cleanly stop a job if there is no reason to let it run to completion.
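The monitoring loop extends naturally to several jobs started with call(wait=False). The sketch below polls each Dispatcher in turn and aborts a job when a caller-supplied test matches a line of its output. The names monitor_all and should_abort are hypothetical. It is also an assumption here that monitor() may still be called after abort() and that isRunning eventually becomes False for an aborted job; this should be checked against the package before relying on it.

```python
def monitor_all(dispatchers, should_abort=lambda line: False):
    """Drain STDOUT from several started Dispatchers until all have finished.

    Returns one list of stripped STDOUT lines per Dispatcher, in input order.
    """
    seen = [[] for _ in dispatchers]
    aborted = set()
    active = True
    while active:
        active = False
        for index, d in enumerate(dispatchers):
            if not d.isRunning:
                continue
            active = True
            stdout_line, stderr_line = d.monitor()
            if not stdout_line:
                continue
            seen[index].append(stdout_line.rstrip())
            # Abort each job at most once, on the first matching line.
            if index not in aborted and should_abort(stdout_line):
                aborted.add(index)
                d.abort()
    return seen
```

A call such as monitor_all([d1, d2], lambda line: "FATAL" in line) would stop either job on the first line containing that text; the trigger string is only an example.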

Use case 3: "Fire-and-forget"

For some applications it is useful to be able to start jobs without blocking on the call, but the requirement for a monitoring loop to fill the stdout_data and stderr_data attributes is a needless inconvenience because the user script does not need to interact further with the job. In such cases it is possible to direct STDOUT and STDERR to files, using the Dispatchers only to start the jobs. The following script demonstrates how:

from string import Template
import os
from CCP4Dispatchers import dispatcher_builder

cmd = Template("HKLIN $CEXAM/rnase/rnase18.mtz " + \
               "HKLOUT $CCP4_SCR/rnase_simple_out.mtz " + \
               "XYZIN $CEXAM/rnase/rnase.pdb " + \
               "XYZOUT $CCP4_SCR/rnase_simple_out.pdb")
cmd = cmd.substitute(os.environ)

keywords = """LABIN FP=FNAT SIGFP=SIGFNAT FREE=FreeR_flag
NCYC 10
END
"""

f1 = open("job.log","w")
f2 = open("job.err","w")

d = dispatcher_builder("refmac5", cmd, keywords, capture_streams=False, stdout=f1, stderr=f2)
d.call(wait=False)

f1.close()
f2.close()

The key point here is to request capture_streams=False during the instantiation of the Dispatcher. In that case the STDOUT and STDERR streams are not redirected to the object. Instead, they may be redirected to another location, given by the file objects passed as stdout and stderr. As the above script does not wait for the job to finish it exits quickly, leaving refmac5 writing to the files 'job.log' and 'job.err'.
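It may look surprising that the script closes its file objects while the job is still writing. The sketch below shows why this is safe, using the standard subprocess module directly rather than CCP4Dispatchers: the child process holds its own copies of the file descriptors, so closing the parent's file objects does not interrupt the output. The helper name start_detached is hypothetical, and a short Python one-liner stands in for a CCP4 program.

```python
import os
import subprocess
import sys
import tempfile


def start_detached(args, log_path, err_path):
    """Start args with STDOUT and STDERR sent to files, without waiting."""
    log = open(log_path, "w")
    err = open(err_path, "w")
    try:
        process = subprocess.Popen(args, stdout=log, stderr=err)
    finally:
        # The child has its own copies of the descriptors, so the
        # parent's file objects can be closed straight away.
        log.close()
        err.close()
    return process


workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "job.log")
child = start_detached(
    [sys.executable, "-c", "import time; time.sleep(0.2); print('done')"],
    log_path, os.path.join(workdir, "job.err"))

# Waiting here is only so that the example can read the result back;
# a true fire-and-forget script would simply exit.
child.wait()
with open(log_path) as handle:
    output = handle.read()
```

The stand-in job writes its output after the parent has already closed both files, yet the text still arrives in job.log.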

It is not currently possible to use the abort method for a Dispatcher that was instantiated with capture_streams=False. In principle, any script that might need to abort a job it started should be keeping track of that job using the monitoring loop mechanism described previously.

Fine control of subprocesses

The Dispatcher object exposes the most useful information about the call, including call_val and call_err attributes that provide the exit code from a completed job or the exception from a failed call. However, it is also possible to interact directly with the subprocess.Popen object in case other information or methods are required. The subprocess.Popen object that the Dispatcher call method creates is assigned to the attribute process.
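As a small illustration, the helper below reads a few standard attributes of the subprocess.Popen object through the process attribute. The function name describe is hypothetical; pid and poll() are part of the standard subprocess.Popen interface. The sketch assumes the Dispatcher has already been started with call(), since the source does not state what process holds before then.

```python
def describe(dispatcher):
    """Summarise the state of the subprocess behind a started Dispatcher."""
    process = dispatcher.process
    return_code = process.poll()  # None while the job is still running
    return {
        "pid": process.pid,
        "finished": return_code is not None,
        "returncode": return_code,
    }
```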

Notes

Some further examples of using the dispatchers to run demonstration jobs are provided in the 'example_scripts' subdirectory. See the project homepage at https://fg.oisin.rc-harwell.ac.uk/projects/ccp4dispatchers/.

Author

David Waterman