BLEND (CCP4: Supported Program)

NAME

blend
- management and processing of multiple crystals / multiple data sets

SYNOPSIS

blend -a foo_in.dat

blend -s cut_level_high [cut_level_low]

blend -sLCV cut_LCV_level_high [cut_LCV_level_low]

blend -saLCV cut_absolute_LCV_level_high [cut_absolute_LCV_level_low]

blend -c d1 d2 d3 d4 ... or [d1] [d2] [[d3-d4]] ..., etc.

Description
Input and output files
Keyworded input
Miscellaneous and problems
References
Authors and credits

DESCRIPTION

X-ray data collection from a single crystal is not always feasible. Very often crystallographers collect data from multiple crystals or from multiple locations on a single crystal. The resulting datasets are normally incomplete, or show low redundancy. Using any of them individually does not make for reliable phasing, model building or refinement. It is potentially better to merge the individual datasets into more complete ones. While solving the incompleteness issue, this merging produces datasets which, although inherently less precise, tend to be more accurate, because systematic errors coming from different sources (multiple crystals) interfere destructively. This, in turn, translates into better-quality structure factors, with positive effects for phasing, model building and refinement. High redundancy can also increase the anomalous signal, when this is needed.

BLEND is a program for the management of multiple datasets. It simplifies the analysis and greatly reduces the combinatorial explosion involved in the formation of multiple groups from the original set of data. The program can run in three different modes. In the analysis mode (option -a) it reads in multiple unmerged reflection files produced by an integration program (either MOSFLM [2] or XDS [3]), and carries out cluster analysis on one or two types of statistical descriptors, extracted or calculated from each dataset. Results produced by the program in analysis mode can then be used by running the program in synthesis mode (option -s), or in combination mode (option -c). In synthesis mode datasets belonging to previously determined clusters are scaled together and output into individual merged reflection files in MTZ format, ready to be used in all subsequent stages of phasing, model building and refinement. Variants of the synthesis mode are synthesis mode using LCV values and synthesis mode using absolute LCV values (see later for the meaning of LCV and absolute LCV). The combination mode allows users to carry out the same tasks enabled by the synthesis mode, this time for any combination of datasets, not necessarily those grouped in clusters.

Input preparation

Input for the program is a group of unmerged reflection files produced by integration programs. At present only files from the MOSFLM and XDS integration software can be handled. Reflection files can be either included in a single directory, or spread across several directories. In the first scenario, only the path to the directory needs to be fed into the program. In the second scenario, paths to all individual files have to be listed in a single ASCII file, and this file is fed into BLEND. Program execution is controlled by keywords. From version 0.5.0, keywords are passed as standard input, as is generally the case for the majority of CCP4 programs. If no keywords are passed to the program, default values and / or procedures for the parameters connected to the keywords will be used. More on BLEND keywords later.
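As a rough illustration of passing keywords on standard input, the following Python sketch builds a keyword block and hands it to BLEND. The helper names "build_keyword_input" and "run_blend" are hypothetical, and it is an assumption that one keyword per line is an acceptable layout and that a "blend" executable is available on the PATH.

```python
import subprocess


def build_keyword_input(keywords):
    """Return keyword lines (one 'NAME value' pair per line) as a single string."""
    return "".join(f"{name} {value}\n" for name, value in keywords.items())


def run_blend(args, keywords):
    """Run BLEND with the given command-line arguments, feeding keywords on stdin.

    Assumes a 'blend' executable is available on the PATH.
    """
    return subprocess.run(
        ["blend", *args],
        input=build_keyword_input(keywords),
        text=True,
        check=False,
    )


# Example keyword block: a lower RADFRAC (less cutting) and a higher ISIGI.
example = build_keyword_input({"RADFRAC": 0.5, "ISIGI": 2.0})
```

With these helpers, run_blend(["-a", "original.dat"], {"RADFRAC": 0.5}) would correspond to running "blend -a original.dat" with the RADFRAC keyword supplied on standard input.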

Some problems arising in connection with input preparation can be found in the section Miscellaneous and problems.

Running the program in analysis mode

Once input is ready, BLEND can be run in analysis mode. All datasets will be analysed individually and tested for overall radiation damage. If any dataset is thought to be significantly affected, parts of it will be removed. The amount of data to be removed can be controlled by keyword RADFRAC. Values range between 0 and 1; 0 means keeping everything while 1 means removing all affected parts of data. The default value is 0.75, which essentially tells the program to remove all reflections whose intensity, on average, has been reduced by radiation damage by more than 25% of its true value.

Input files for BLEND in analysis mode contain integrated (not scaled) data. They can be either mtz files produced by MOSFLM, or ASCII files produced by XDS ("INTEGRATE.HKL"). If these files are stored within a single directory, then simply type:

         blend -a /where/integrated/data/are/store

If files are spread across several directories, then the user will have to create an ASCII file with all files (and their exact paths) listed one after the other. The content of one such file, which we will arbitrarily name "original.dat", could for instance look like the following:

         /home/joe/data/xtal1/xl-d01.mtz
         /home/joe/data/xtal1/xf-d03.mtz
         /home/joe/data/xtal5/xl-d12.mtz
         /home/joe/data/xtal12/INTEGRATE.HKL
         /home/joe/data/xtal13/INTEGRATE.HKL

In this case BLEND can be executed as follows:

         blend -a original.dat
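A list file of this kind can also be generated programmatically. The sketch below is one possible way of doing it in Python; the function names are hypothetical, and selecting files purely by name ("*.mtz" or "INTEGRATE.HKL") is an assumption, since not every MTZ file found under a directory tree necessarily contains unmerged integrated data.

```python
import os


def collect_reflection_files(root):
    """Return sorted paths of *.mtz and INTEGRATE.HKL files found under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Selection by file name only; the content is not inspected.
            if name.lower().endswith(".mtz") or name == "INTEGRATE.HKL":
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def write_list_file(paths, list_file="original.dat"):
    """Write one path per line, the layout shown for 'original.dat'."""
    with open(list_file, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
```

For instance, write_list_file(collect_reflection_files("/home/joe/data")) would write "original.dat" in the current directory, ready to be passed to "blend -a".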

Several files will be produced by the program in analysis mode (see Input and Output files section). Those describing the clustering of datasets are "tree.png" (with a PostScript version, "tree.ps") and "CLUSTERS.txt". Other files are needed for bookkeeping. The important binary file "BLEND.RData" contains essential information needed by the program to run in synthesis mode; it must not be deleted.

Running the program in synthesis mode

By running BLEND in synthesis mode the user aims at producing new datasets out of two or more individual datasets. Each node in the dendrogram can give rise to a scaled dataset. The easiest option for the user is to force BLEND to produce scaled datasets for all nodes in the dendrogram. This is, though, also the lengthiest option, because the user might only be interested in part of the nodes, for example those relating to tighter clusters. In order to single out only part of the nodes in the dendrogram one or two numerical levels need to be provided for execution. Consider for instance a case corresponding to the following "CLUSTERS.txt" file, describing a dendrogram with 13 clusters:

          Cluster     Number of         Cluster         LCV      aLCV      Datasets
           Number      Datasets          Height                            ID

              001             2           0.173        0.03      0.02      5 6
              002             2           0.242        0.01      0.01      9 13
              003             2           0.433        0.05      0.05      3 14
              004             2           0.518        0.08      0.07      8 10
              005             3           0.610        0.05      0.04      7 9 13
              006             4           0.702        0.13      0.11      12 7 9 13
              007             4           0.744        0.11      0.09      5 6 8 10
              008             3           0.982        0.17      0.14      4 3 14
              009             4           1.297        0.19      0.17      2 4 3 14
              010             5           1.623        0.28      0.23      11 12 7 9 13
              011             5           2.711        0.48      0.39      1 2 4 3 14
              012             9           3.343        0.44      0.30      5 6 8 10 11 12 7 9 13
              013            14          13.670        1.04      0.84      1 2 4 3 14 5 6 8 10 11 12 7 9 13

To create merged files out of all nodes below height 4 in the dendrogram we type:

         blend -s 4

This will produce 12 new datasets: the one corresponding to node (5+6), the one corresponding to node (9+13), the one corresponding to node (3+14), etc. To produce datasets for all nodes, simply type:

         blend -s 14

because the whole dendrogram is below 14 (the top height is 13.670). Suppose one needs to merge only data sets 1, 2, 4, 3, 14, because they form a rather tight cluster. With:

         blend -s 3

these data sets will be merged, but so will data sets 11, 12, 7, 9, 13, data sets 2, 4, 3, 14, and so on, because they all correspond to nodes at heights lower than 3. Given that 1, 2, 4, 3, 14 form a cluster at a height of exactly 2.711, selecting two levels, one higher and one lower than 2.711, means that a scaled file is calculated for this cluster only. For example:

         blend -s 2.712 2.710

These two numbers are arbitrary numbers that fall just above and below 2.711, and that do not include values for any other node in the dendrogram. It is important to notice that when using two values it is compulsory to type the largest one first.
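The effect of one or two cut levels can be checked quickly against the heights listed in "CLUSTERS.txt". The following Python sketch mimics the selection for the example dendrogram; the function name "select_nodes" is hypothetical, and whether BLEND treats the limits as inclusive is an assumption not covered by the text.

```python
# Cluster heights copied from the "CLUSTERS.txt" example above.
HEIGHTS = {
    1: 0.173, 2: 0.242, 3: 0.433, 4: 0.518, 5: 0.610, 6: 0.702, 7: 0.744,
    8: 0.982, 9: 1.297, 10: 1.623, 11: 2.711, 12: 3.343, 13: 13.670,
}


def select_nodes(heights, high, low=0.0):
    """Return cluster numbers whose height lies between low and high.

    Assumption: the limits are inclusive (low <= height <= high).
    """
    if low > high:
        raise ValueError("the larger cut level must be given first")
    return sorted(n for n, h in heights.items() if low <= h <= high)


print(select_nodes(HEIGHTS, 4))             # clusters 1 to 12
print(select_nodes(HEIGHTS, 14))            # all 13 clusters
print(select_nodes(HEIGHTS, 2.712, 2.710))  # cluster 11 only
```

The three calls correspond to "blend -s 4", "blend -s 14" and "blend -s 2.712 2.710" respectively.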

Variants of run in synthesis mode

As previously mentioned, two variants of synthesis mode are available: synthesis mode using LCV values and synthesis mode using absolute LCV values (see later). In these variants cluster selection uses LCV or absolute LCV values, rather than cluster heights (many thanks to Alkistis Mitropoulou for suggesting these variants!).
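Selecting on LCV does not always pick the same nodes as selecting on height, because LCV need not grow monotonically along the dendrogram. The Python sketch below illustrates this with the LCV and aLCV columns of the example "CLUSTERS.txt"; the function name "select_by_lcv" is hypothetical, and it is an assumption that the selection works like the height-based one, with inclusive limits.

```python
# (LCV, aLCV) per cluster, copied from the "CLUSTERS.txt" example.
LCV_TABLE = {
    1: (0.03, 0.02), 2: (0.01, 0.01), 3: (0.05, 0.05), 4: (0.08, 0.07),
    5: (0.05, 0.04), 6: (0.13, 0.11), 7: (0.11, 0.09), 8: (0.17, 0.14),
    9: (0.19, 0.17), 10: (0.28, 0.23), 11: (0.48, 0.39), 12: (0.44, 0.30),
    13: (1.04, 0.84),
}


def select_by_lcv(table, high, low=0.0, absolute=False):
    """Return cluster numbers whose LCV (or aLCV) lies between low and high.

    Assumption: limits are inclusive, as in the height-based sketch.
    """
    column = 1 if absolute else 0
    return sorted(n for n, values in table.items() if low <= values[column] <= high)


# Cluster 12 (LCV 0.44) is selected while cluster 11 (LCV 0.48) is not,
# although cluster 11 sits lower in the dendrogram.
print(select_by_lcv(LCV_TABLE, 0.45))
print(select_by_lcv(LCV_TABLE, 0.15, absolute=True))
```

The two calls mimic "blend -sLCV 0.45" and "blend -saLCV 0.15".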

Running the program in combination mode

Cluster analysis groups all datasets into several clusters. This restricts merging and scaling to a limited number of groups among the huge number of possible dataset combinations, thus saving processing time. Clustering, though, introduces limitations, because the user is forced to calculate only datasets corresponding to nodes in the dendrogram. For example, referring to the dendrogram described previously, there is no way we could obtain scaled data out of the union of data sets 1, 4, 11 and 13, because there is no node corresponding to this combination. Such a limitation can be overcome by running the program in combination mode. In the specific case, simply:

         blend -c 1 4 11 13

When the number of data sets and clusters is large it is very tedious to type or even cut and paste the long string of numbers forming the combination. For this reason an ad hoc syntax has been created to include groups of numerically-contiguous data sets, whole clusters or groups of clusters, and to exclude individual data sets or groups of data sets. The syntax is made up of the following rules:

         "[]"        a single-square bracket including one or more numbers means all data
                     sets in the clusters corresponding to those numbers.

         "[[]]"      a double-square bracket including one or more numbers indicates that all
                     data sets corresponding to those numbers are to be removed from the final group.

         "-"         an hyphen (minus sign) between two numbers indicates all integers between the two
                     numbers. If the first number is greater than the second, the selection is ignored.

         ","         commas between numbers are sometimes needed to separate data sets or clusters,
                     if they are inside single or double-square brackets.

         EXAMPLES (all referring to the dendrogram previously described).

         1) Combine cluster 2 with cluster 4:

                                               blend -c [2] [4]               equivalent to       blend -c 8 9 10 13

         2) All data sets in cluster 12, with the exception of data sets 7 and 11:

                                               blend -c [12] [[7,11]]         equivalent to       blend -c 5 6 8 9 10 12 13         or    blend -c 5 6 8-10 12 13

         3) Clusters 1 and 9, with the exception of data set 14, and with the addition of data sets 1 and 7:

                                               blend -c [1] [9] [[14]] 1 7    equivalent to       blend -c 1 2 3 4 5 6 7            or    blend -c 1-7

The ability to create scaled data out of any desired combinations confers flexibility to the program.
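The combination syntax can be made concrete with a small expansion routine. The Python sketch below is only an illustration of the rules listed above, not BLEND's own parser; the function names are hypothetical, and it is an assumption that exclusions in double brackets apply to the whole final group, whatever their position on the command line.

```python
import re

# Only the clusters used in the calls below, copied from "CLUSTERS.txt".
CLUSTERS = {
    1: [5, 6],
    2: [9, 13],
    4: [8, 10],
    9: [2, 4, 3, 14],
    12: [5, 6, 8, 10, 11, 12, 7, 9, 13],
}


def expand_numbers(text):
    """Expand text such as '7,11', '8-13' or '1 4' into a list of integers."""
    numbers = []
    for item in re.split(r"[,\s]+", text.strip()):
        if not item:
            continue
        if "-" in item:
            first, last = (int(x) for x in item.split("-"))
            numbers.extend(range(first, last + 1))  # empty if first > last
        else:
            numbers.append(int(item))
    return numbers


def expand_combination(spec, clusters):
    """Return the sorted data set numbers described by a combination string."""
    excluded, selected = set(), set()
    # Double brackets are handled first, so they are not read as single ones.
    for body in re.findall(r"\[\[(.*?)\]\]", spec):
        excluded.update(expand_numbers(body))
    spec = re.sub(r"\[\[.*?\]\]", " ", spec)
    for body in re.findall(r"\[(.*?)\]", spec):
        for cluster in expand_numbers(body):
            selected.update(clusters[cluster])
    spec = re.sub(r"\[.*?\]", " ", spec)
    selected.update(expand_numbers(spec))  # plain data set numbers and ranges
    return sorted(selected - excluded)


print(expand_combination("[2] [4]", CLUSTERS))             # [8, 9, 10, 13]
print(expand_combination("[1] [9] [[14]] 1 7", CLUSTERS))  # [1, 2, 3, 4, 5, 6, 7]
```

The two calls reproduce examples 1 and 3.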

INPUT AND OUTPUT FILES

Input

BLEND can read unscaled reflection files in mtz format, or ASCII files in XDS format.
MTZ files contain, typically, integrated intensities as processed by MOSFLM [2]. XDS files are the unscaled integrated data ("INTEGRATE.HKL") produced by XDS [3].
Input can be either a file (no fixed name, but here it is indicated as "foo_in.dat"), or a directory

foo_in.dat (file): Each line of this ASCII file is the path to a valid unscaled reflection file, to be processed by the program

/path/to/a/valid/directory/ (directory): All valid unscaled reflection files in this directory will be processed by the program

Output

Execution of BLEND in different modes (analysis "-a", synthesis "-s" or "-sLCV" or "-saLCV", combination "-c") produces different output files.

(a) From analysis mode:

BLEND_SUMMARY.txt: is an ASCII file with tabulated information for all datasets being processed. Each dataset is given a serial number and this same number is used throughout the whole statistical analysis
mtz_names.dat: is simply the list of files read in by BLEND. If a previous list was already present (because created by the user), this new list is a copy of it, with invalid files removed
xds_files: is a directory containing files in MTZ format, in those cases where integrated data are in XDS format. The mtz files are obtained from the XDS files using POINTLESS. The newly created MTZ files have names like "dataset_xxx.mtz", where xxx is a number. See also "xds_lookup_table.txt"
xds_lookup_table.txt: this file can be checked in order to keep track of the original XDS files. If no XDS files are involved, neither xds_files nor xds_lookup_table.txt will be created. All logs produced by POINTLESS when converting XDS files into MTZ format will also be dumped in the xds_files directory
tree.png, tree.ps: are graphics files in PNG and POSTSCRIPT format, showing the dendrogram derived from cluster analysis of all input datasets. Individual objects (datasets) are recognizable through their serial number. If the number of datasets is relatively low (15-20 max), the dendrogram can be interpreted quite easily. For larger numbers it might be easier to refer to the ASCII counterpart of the dendrogram, which is the file called "CLUSTERS.txt"
CLUSTERS.txt: In this file the exact numerical value of the dendrogram's merging nodes is also reported, a feature useful to run BLEND in synthesis mode. The dendrogram is the most important outcome of BLEND analysis. The user takes decisions on merged data based on his/her interpretation of the dendrogram. In the dendrogram graphics a "Linear Cell Variation" number is also reported. The Ward distance used to measure cluster mergings indicates the overall loss in cell variability when the number of merged datasets is increased. As cell parameter values are normalised and rotated through principal component analysis, their numerical value in the dendrogram is not immediately related to real cell variation. Therefore it is not possible to get a feeling for structural isomorphism using the Ward distance. To help with this issue a parameter directly related to unit cell differences has been introduced, the Linear Cell Variation (LCV). LCV measures the maximum linear increase or decrease of the diagonals on the 3 independent cell faces. Values below 1% in general indicate a good degree of isomorphism among different crystals. Structural differences start to be noticeable with LCV greater than 1.5%. A value in angstroms associated with LCV is provided by the absolute Linear Cell Variation (aLCV), presented together with LCV in both "CLUSTERS.txt" and the dendrogram. The isomorphism issue will, obviously, have to be considered jointly with the available data resolution
FINAL_list_of_files.dat: is an ASCII file reporting the number of batches kept and the highest resolution recommended for each dataset analysed by BLEND. Batches can be discarded because intensities in them are deemed to be severely affected by radiation damage (see keyword RADFRAC to control the amount of discarded data). The highest recommended resolution is a rough estimate of where data should be cut, if the user wishes the signal-to-noise ratio for the average intensity to be greater than a given value. This value is provided by the user with the keyword ISIGI, followed by a numerical value; the default value is 1.5. The "FINAL_list_of_files.dat" file has 6 columns. The first is the path to the input file, the second is the serial number assigned by BLEND (and used in cluster analysis), the third is the image number after which data are discarded because weakened by radiation damage, the fourth and fifth are the initial and final input image numbers, and the sixth is the resolution cutoff
BLEND.RData: is a binary file produced by the R code. It stores essential information used by all runs of BLEND in synthesis and combination modes
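The six columns of "FINAL_list_of_files.dat" can be read into named fields for further bookkeeping. The Python sketch below is a minimal illustration: the names "DatasetEntry" and "parse_final_list_line" are hypothetical, the example line is made up, and it is an assumption that columns are separated by whitespace, that paths contain no spaces and that image numbers are integers.

```python
from typing import NamedTuple


class DatasetEntry(NamedTuple):
    path: str          # column 1: path to the input file
    serial: int        # column 2: serial number assigned by BLEND
    cut_image: int     # column 3: image after which data are discarded
    first_image: int   # column 4: initial input image number
    last_image: int    # column 5: final input image number
    resolution: float  # column 6: resolution cutoff


def parse_final_list_line(line):
    """Split one whitespace-separated line into its six columns."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}")
    path, serial, cut, first, last, resolution = fields
    return DatasetEntry(
        path, int(serial), int(cut), int(first), int(last), float(resolution)
    )


# Hypothetical line, made up for illustration only.
entry = parse_final_list_line("/home/joe/data/xtal1/xl-d01.mtz 1 45 1 60 2.35")
```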

(b) From synthesis mode:

merged_files: all files produced by BLEND when executed in synthesis mode are stored within this directory, which is created if not already present, or is deleted and recreated if already present. Thus, it is important to rename this directory if more than one run of BLEND in synthesis mode is executed. This is taken care of when BLEND is executed with the CCP4 GUI
MERGING_STATISTICS.info (inside directory "merged_files"): is an ASCII file, essentially a table listing overall merging statistics for all merged datasets produced by the specific run of BLEND. It includes Cluster number, Rmeas, Rpim, Completeness, Multiplicity, Lowest Resolution and Highest Resolution. The table is sorted according to the Rmeas column, from its lowest to its highest value. If scaling with AIMLESS has failed for some reason, NA's are inserted in the corresponding rows. This table should make it easy for the user to select the desired merged dataset, in terms of completeness, multiplicity and data quality
Rmeas_vs_Cmpl.png, Rmeas_vs_Cmpl.ps (inside directory "merged_files"): a plot of all merged datasets in terms of Rmeas vs Completeness, both as PNG and PS graphics file
CLUSTERS.info (inside directory "merged_files"): is an ASCII file listing names and number of batches of each individual dataset composing specific clusters
unscaled_001.mtz, unscaled_002.mtz, ... (inside directory "merged_files"): are unscaled files in mtz format. There are as many of these files as the number of nodes selected by the user in the execution of BLEND in synthesis mode. The number associated with each file name coincides with the cluster (or node) number. Before scaling a dataset obtained by collating individual datasets, all of them need to have the same space group and the same indexing (if they belong to polar groups). Also, individual images will need to have unique numbers. Furthermore some datasets can have some images discarded and resolution limited. All this bookkeeping is taken care of by a script calling POINTLESS which, by default, assigns the most likely space group. This can be changed by using keyword CHOOSE SPACEGROUP followed by the space group name (e.g. P 21 21 21, C 2, etc). Another keyword used in BLEND which relates to POINTLESS is TOLERANCE; this keyword controls how much cell sizes are allowed to change if they are still to be considered as belonging to the same structure. If the user wants to use a specific dataset as reference, the serial number of the desired dataset can be included after keyword DATAREF, if such reference is part of the input data. Alternatively the name of a reference mtz file can be given (as in POINTLESS). The reason why merged but unscaled files are kept for the user is connected with the way subsequent scaling is carried out. At present scaling in default mode is performed by BLEND using AIMLESS. This does not always guarantee the production of final averaged intensities. For example, data could be weak, or the collection followed some unusual set up. A successful scaling could, then, be obtained by running AIMLESS in non-default mode, using specific keywords. The starting files for doing this are the "unscaled_xxx.mtz" files.
There is also, of course, the option to re-run BLEND in synthesis mode by adding specific scaling keywords, but, at present, not all available AIMLESS keywords can be used in BLEND.
scaled_001.mtz, scaled_002.mtz, ... (inside directory "merged_files"): are the final scaled files, for those cases that could be successfully scaled
pointless_001.log, pointless_002.log, ... (inside directory "merged_files"): log files from all POINTLESS jobs executed to produce files "unscaled_001.mtz", "unscaled_002.mtz", ...
aimless_001.log, aimless_002.log, ... (inside directory "merged_files"): full logs from the AIMLESS runs. The user can benefit from these files to find out detailed information on merging statistics and scaling in general
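Because "merged_files" is deleted and recreated at each synthesis run, it can be convenient to move the previous results aside before running BLEND again. The Python sketch below shows one way of doing so; the function name and the "merged_files_N" naming scheme are hypothetical choices, not something BLEND itself prescribes.

```python
import os


def preserve_merged_files(directory="merged_files"):
    """Rename an existing output directory to the first free '<directory>_N'.

    Returns the new name, or None if the directory does not exist.
    """
    if not os.path.isdir(directory):
        return None
    counter = 1
    while os.path.exists(f"{directory}_{counter}"):
        counter += 1
    target = f"{directory}_{counter}"
    os.rename(directory, target)
    return target
```

Calling preserve_merged_files() before each new "blend -s ..." run keeps earlier results under "merged_files_1", "merged_files_2", and so on.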

(c) From combination mode:

combined_files: all files produced by BLEND when executed in combination mode are stored within this directory, which is created if not already present
MERGING_STATISTICS.info (inside directory "combined_files"): same file as the one produced inside "merged_files" when BLEND is executed in synthesis mode. Results are, in this case, not sorted according to decreasing completeness
GROUPS.info (inside directory "combined_files"): this file is the equivalent of "CLUSTERS.info" in the "merged_files" directory when BLEND is executed in synthesis mode
unscaled_001.mtz, unscaled_002.mtz, ... (inside directory "combined_files"): unscaled files corresponding to all combinations tried by the user. See equivalent files in directory "merged_files", created when BLEND is executed in synthesis mode
scaled_001.mtz, scaled_002.mtz, ... (inside directory "combined_files"): scaled files corresponding to all successful scaling jobs of files unscaled_001.mtz, unscaled_002.mtz, ...
pointless_001.log, pointless_002.log, ... (inside directory "combined_files"): log files from all POINTLESS jobs executed to produce files "unscaled_001.mtz", "unscaled_002.mtz", ...
aimless_001.log, aimless_002.log, ... (inside directory "combined_files"): full logs from the AIMLESS runs. The user can benefit from these files to find out detailed information on merging statistics and scaling in general

KEYWORDED INPUT

BLEND keywords can be divided into three groups, as they control essentially three different parts of the program. Keywords with their default values are summarized here:

Group 1

CPARWT              1.000
ISIGI               1.500
LAUEGROUP           (laue or space group symbol, as used in POINTLESS)
RADFRAC             0.750

Group 2

CHOOSE SPACEGROUP   (space group, as used in POINTLESS)
DATAREF             ()
TOLERANCE           5 (same default value as the one used in POINTLESS)

Group 3

ANOMALOUS           OFF
RUN                 (default is to break into different runs at each discontinuity - see AIMLESS)
EXCLUDE             (individual image numbers or images range - see AIMLESS)
RESOLUTION          HIGH [smallest among highest resolutions of all composing data sets - see AIMLESS]
SCALES              ROTATION SPACING 5 SECONDARY  BFACTOR ON BROTATION SPACING 20 (see AIMLESS)
SDCORRECTION        REFINE INDIVIDUAL (see AIMLESS)

Keywords in group 1 control data preparation, analysis and clustering. Keywords in group 2 are keywords used in POINTLESS [4], while keywords in group 3 are keywords used in AIMLESS [4].
In the definitions below "[]" encloses optional items, "|" delineates alternatives. All keywords are case-insensitive, but are listed below in upper-case.

ANOMALOUS, CHOOSE SPACEGROUP, CPARWT, DATAREF, EXCLUDE, ISIGI, LAUEGROUP, RADFRAC, RUN, RESOLUTION, SCALES, SDCORRECTION, TOLERANCE

ANOMALOUS [OFF | ON]

Default value is OFF. ANOMALOUS is the same keyword used in AIMLESS. By default all I+ and I- observations are averaged together in merging. If ANOMALOUS is ON there will be separate anomalous observations in the final AIMLESS output pass, both for statistics and merging. ANOMALOUS will automatically be turned ON if a substantial anomalous signal is detected.

CHOOSE SPACEGROUP

Default value is a blank, i.e. the final space group for the specific group of data to be scaled is the one determined by POINTLESS. If the user wishes to fix the space group, rather than allowing POINTLESS to determine it, then this keyword should be used with the accompanying chosen space group symbol. This is advisable, for instance, when data are of poor quality and fixing the space group is necessary to prevent POINTLESS from selecting a wrong one.

CPARWT

Default value is 1.0. CPARWT is a number between 0 and 1, controlling which type of statistical descriptors are used in cluster analysis. A value of 1 means that cell parameters (known as primary descriptors) are used, while a value of 0 means that essentially averaged integrated intensities (known as secondary descriptors) are used. Numbers between 0 and 1 are possible, and they essentially mean a weighted use of both descriptors. At the present stage of research, though, the advantage of mixing the two descriptors is not clear. Primary descriptors seem to behave systematically better than secondary descriptors. Secondary descriptors can be tried, as a valid alternative, in those cases where cell parameters are known to be changing very little.

DATAREF

Default value is the serial number of the longest-sweep individual data set, as selected by BLEND in analysis mode. DATAREF can, alternatively, be fixed to the serial number of a different data set. For example the default reference file could either belong to a space group different from that of the datasets in the selected cluster, or have a different indexing convention. In this case DATAREF can be assigned the serial number of one of the datasets belonging to the cluster. An mtz file not included among the files processed by BLEND can also be associated with DATAREF, similarly to what happens in POINTLESS when using keyword HKLREF.

EXCLUDE [<batch range> | <batch list>]

Default is no exclusion of any image. This keyword, equivalent to the one used in AIMLESS, controls exclusion from the scaling process of specific images. These can be provided as a series of individual image numbers or as an image range:

         Example 1.   EXCLUDE BATCH 12 18 21 89
         Example 2.   EXCLUDE BATCH 32 TO 46

An easier alternative for excluding images from scaling jobs is to write an AIMLESS keywords file by copying and pasting input keywords found in specific AIMLESS logs included in either "merged_files" or "combined_files" directories, and adding as many EXCLUDE keywords as needed.

ISIGI

Default value is 1.5. ISIGI controls the resolution cut. Integrated intensities and their errors are averaged in resolution shells and interpolated with a polynomial of degree 10. Data are truncated where the signal-to-noise ratio falls below the ISIGI value. The user can assess the signal-to-noise ratio after scaling (from within the "aimless_xxx.log" files). Normally this is higher than the 1.5 value introduced by ISIGI, because the ISIGI value refers to unscaled data. If too much or too little truncation has been applied, BLEND can be executed again with a different value.
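The idea of cutting data where the signal-to-noise ratio crosses the ISIGI threshold can be sketched in a few lines of Python. This is not BLEND's procedure: the sketch swaps the polynomial interpolation for a plain linear interpolation between neighbouring shells, the function name "resolution_cutoff" is hypothetical, and the shell values are made up for illustration.

```python
def resolution_cutoff(shells, isigi=1.5):
    """Estimate the resolution (in angstroms) where <I/sigI> drops to isigi.

    shells: list of (resolution, mean_i_over_sigma) pairs, ordered from low
    to high resolution, i.e. with decreasing values in angstroms.
    """
    if shells[0][1] < isigi:
        # Signal is already below the threshold in the first shell.
        return shells[0][0]
    for (d_prev, s_prev), (d_next, s_next) in zip(shells, shells[1:]):
        if s_prev >= isigi > s_next:
            # Linear interpolation between the two shells bracketing the threshold.
            fraction = (s_prev - isigi) / (s_prev - s_next)
            return d_prev + fraction * (d_next - d_prev)
    # Signal never drops below the threshold: keep all data.
    return shells[-1][0]


# Made-up shell averages: (resolution in angstroms, <I/sigI>).
shells = [(4.0, 12.0), (3.0, 6.0), (2.5, 2.5), (2.0, 0.5)]
cut = resolution_cutoff(shells)  # threshold crossed between 2.5 and 2.0 angstroms
```

Raising the threshold moves the cut towards lower resolution (larger values in angstroms), which is the behaviour expected when a larger ISIGI value is supplied.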

LAUEGROUP [ | AUTO | Point Group | Space Group]

Default value is blank, i.e. the point group is unchanged from the one found in the original reflection file. LAUEGROUP can be used for data preparation, when reading data from "INTEGRATE.HKL" files produced by XDS. These files normally include integration data in a low symmetry space group, typically P1. If such data are fed into BLEND directly, the program treats all 6 cell parameters as independent. This is permitted and feasible, but if the correct Laue group is known to have higher symmetry, then treating all 6 cell parameters as independent could introduce unnecessary statistical noise in the process of cluster analysis. In such cases it is advisable to input the correct Laue or space group after keyword LAUEGROUP. The resulting mtz file includes data and cell parameters of the desired symmetry. If AUTO is used after LAUEGROUP, the conversion to an mtz file will be carried out with POINTLESS in default mode, i.e. leaving POINTLESS to find out the correct symmetry. If LAUEGROUP is not used, then the "INTEGRATE.HKL" file will be converted into an mtz file without changing its Laue group (default).

RADFRAC

Default value is 0.75. The program makes use of this keyword when data are found to be subject to overall radiation damage. RADFRAC controls the fraction of average intensity retained that a user is willing to accept when decay due to radiation damage occurs. When RADFRAC is equal to 1, cutting is quite severe; when RADFRAC is equal to 0 there is no cutting, even when substantial radiation damage is affecting datasets. By default (RADFRAC 0.75), when BLEND detects the occurrence of substantial global radiation damage, all images collected after a certain image are discarded. The discarded images, on average, include intensities that have been reduced by more than 25% of their original value.
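The role of RADFRAC as a threshold on retained intensity can be illustrated with a toy calculation. The Python sketch below is not BLEND's radiation damage model, which is not described here; the function name "last_image_kept" and the decay values are hypothetical, and the input is assumed to be the average intensity of each image relative to undamaged data.

```python
def last_image_kept(relative_intensity, radfrac=0.75):
    """Return how many leading images are kept.

    relative_intensity[i] is the average intensity of image i divided by the
    average intensity of undamaged data (hypothetical input).
    Images are kept until the ratio first falls below radfrac.
    """
    if not 0.0 <= radfrac <= 1.0:
        raise ValueError("RADFRAC must lie between 0 and 1")
    for index, ratio in enumerate(relative_intensity):
        if ratio < radfrac:
            return index
    return len(relative_intensity)


# Made-up decay curve for seven images.
decay = [1.00, 0.95, 0.88, 0.80, 0.74, 0.69, 0.61]
print(last_image_kept(decay))               # 4: images below 75% are dropped
print(last_image_kept(decay, radfrac=0.0))  # 7: nothing is discarded
print(last_image_kept(decay, radfrac=1.0))  # 1: any loss triggers the cut
```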

RUN <Nrun> BATCH <b1> TO <b2>

Default keys are the same as those used in AIMLESS. This keyword is equivalent to the one used in AIMLESS and controls the definition of "runs" (i.e. contiguous batches of data undergoing the same scaling protocol). More details can be found in the AIMLESS documentation pages.

RESOLUTION [[LOW] <Resmin>] [[HIGH] <Resmax>]

Default for subkey HIGH is the biggest among highest resolutions of all composing data sets; for subkey LOW is the smallest among lowest resolutions of all composing datasets, where resolutions are here meant to be indicated in angstroms. The resolution limits computed by BLEND during analysis are determined via keyword ISIGI. When merging several datasets together it is the smallest among high resolutions and the largest among low resolutions to be fixed for subsequent scaling. Such limits can be changed by the user with the keyword RESOLUTION, exactly in the same way it is used in AIMLESS.

SCALES [<subkeys>]

Default keys are the same as those used in AIMLESS. This keyword is equivalent to the one used in AIMLESS and controls the scaling procedure followed. More details can be found in the AIMLESS documentation pages.

TOLERANCE

Default value is the same as that used in POINTLESS, i.e. 5. TOLERANCE is equivalent to the corresponding POINTLESS keyword. Multiple crystals can have cell parameters very dissimilar from each other (non-isomorphism). When only a mid- or low-resolution electron density map is needed, POINTLESS might need instructions to avoid halting because of large cell variations. Essentially the program is told to stop execution when the cell difference among all component data sets goes beyond a threshold (the TOLERANCE value). The higher the TOLERANCE, the more cell parameters are allowed to change, i.e. the more non-isomorphism is tolerated. Use high values (say 100) if you do not care about cell variability.

SDCORRECTION [[NO]REFINE] [INDIVIDUAL | SAME] [FIXSDB]

Default is REFINE INDIVIDUAL. This keyword is analogous to the one used in AIMLESS (see AIMLESS documentation pages). SDCORRECTION plays a role in the determination of each reflection's error. Errors for all reflections undergo a refinement process equivalent to the refinement used for scaling intensities. But it is more unstable than the refinement for the intensities. Thus it is possible that cycles for SD parameters estimation do not converge, ultimately failing an AIMLESS job. In such circumstances it is possible to re-run BLEND using different values for the SDCORRECTION keyword, similarly to what is prescribed in AIMLESS. Quite often the provision,

         SDCORRECTION SAME

is sufficient to take to completion failed scaling jobs. If no solution is found for obtaining refined SD values, no refinement (NOREFINE) is the only option left.

MISCELLANEOUS AND PROBLEMS

Problems (and, alas, crashes!) can happen in BLEND, as is the case with any software. Some of them, and their causes, are known and are described in this section.

(1) Program abrupt terminations

We have made substantial efforts to stop the program from crashing and, rather, to enable it to exit in a clean way with some kind of error message. But crashes are still to be expected. They will become less and less frequent as users report them:

Crashes in analysis mode: At present the program has been reported to crash in analysis mode if the size of the data read in exceeds the available memory. Fortunately memory is plentiful on modern laptops and desktops, so this should not be an issue in the majority of cases. It is likely to become an issue if several very large datasets are read in one run. No other types of crash are known in this mode.
Crashes in synthesis mode: These are generally a consequence of execution terminations by either the POINTLESS or AIMLESS programs. BLEND can handle several of these terminations and carry on executing normally, issuing an error or warning message. If POINTLESS is successful but AIMLESS fails, the user should find that files of type "unscaled_xxx.mtz" have been created under the directory "merged_files", but the corresponding files of type "scaled_xxx.mtz" have not, where "xxx" identifies the clusters concerned. In this case it is likely that the default scaling recipe will have to be changed. Some clusters are made of datasets with different point groups or other indexing inconsistencies. Unless appropriate POINTLESS keywords are used, running BLEND in synthesis mode on these clusters will return an error message, and files of type "unscaled_xxx.mtz" will not be created.
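To spot clusters for which POINTLESS succeeded but AIMLESS failed, one can look for "unscaled_xxx.mtz" files that have no "scaled_xxx.mtz" counterpart. The following is a minimal sketch, assuming the file naming described above; the function name is hypothetical and not part of BLEND.

```shell
# report_unscaled_only [DIR]
# Print the name of every unscaled_xxx.mtz in DIR (default: merged_files)
# that has no scaled_xxx.mtz next to it.
report_unscaled_only() {
    dir=${1:-merged_files}
    for f in "$dir"/unscaled_*.mtz; do
        [ -e "$f" ] || continue      # the pattern matched no file
        base=${f##*/}                # strip the directory: unscaled_xxx.mtz
        tag=${base#unscaled_}        # strip the prefix: xxx.mtz
        [ -e "$dir/scaled_$tag" ] || printf '%s\n' "$base"
    done
}
```

Running report_unscaled_only from the directory that contains "merged_files" lists the clusters whose scaling recipe probably needs changing.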

(2) How to create an ASCII list of input files

Quite often input files are not contained in a single directory, but are spread across several directories. In this case a judicious use of the unix commands "grep" and "find" can quickly produce the input list for BLEND. Suppose all files are spread across directories located under a single directory named, say, "cdir". A quick way to generate the list is to move to directory "cdir" and use "find" as follows (many thanks to Morten Groftehauge for this tip):

         find `pwd` -name "INTEGRATE.HKL" > original.dat

In this case all XDS files found in "cdir" or in its subdirectories will be listed in original.dat with their full path. Variants of the above line will produce results for specific cases.
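One such variant is sketched below as a small shell function that takes the top directory, the file name pattern and the output list as arguments, and sorts the paths so that the list is reproducible. The function name and the "*.mtz" pattern in the usage comments are assumptions for illustration; use whatever pattern matches your own integration files.

```shell
# make_list DIR PATTERN OUTFILE
# Hypothetical helper: write to OUTFILE the full path of every regular
# file under DIR whose name matches PATTERN, one per line, sorted.
make_list() {
    top=$(cd "$1" && pwd) || return 1    # resolve DIR to an absolute path
    find "$top" -type f -name "$2" | sort > "$3"
}

# Usage (XDS files, as in the line above):
#   make_list cdir "INTEGRATE.HKL" original.dat
# Usage (assuming unmerged MTZ files with extension .mtz):
#   make_list cdir "*.mtz" original.dat
```

Keep the output list outside the searched directory, or give it a name that the pattern cannot match, so that it does not end up listing itself.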

(3) Error estimation with AIMLESS

Error estimation and correction for multiple datasets is still not completely reliable in AIMLESS. If AIMLESS crashes while handling errors, or if Mean((I)/sd(I)) has implausibly high values, it is advisable to re-run BLEND (with either the -s option, or the -c option for the specific combination of datasets under scrutiny) using the keywords "SDCORRECTION SAME" or "SDCORRECTION NOREFINE". Error estimation will, in this case, be less reliable, but this is still better than obtaining no results at all. Phil Evans (the author of AIMLESS) is constantly working to improve error estimation for difficult scaling cases (and multiple crystals are difficult!), but this is an inherently challenging theoretical and computational problem, unlikely to be overcome in its entirety any time soon.

REFERENCES

[1] J. Foadi, P. Aller, Y. Alguel, A. Cameron, D. Axford, R.L. Owen, W. Armour, D. Waterman, S. Iwata and G. Evans "Clustering procedures for the optimal selection of data sets from multiple crystals in macromolecular crystallography" Acta Cryst. (2013), D69, 1617–1632
[2] A.G.W. Leslie and H.R. Powell "Processing Diffraction Data with Mosflm" in Evolving Methods for Macromolecular Crystallography (2007), 245, 41–51
[3] W. Kabsch "XDS" Acta Cryst. (2010), D66, 125–132
[4] P.R. Evans "Scaling and assessment of data quality" Acta Cryst. (2006), D62, 72–82

AUTHORS AND CREDITS

James Foadi, Membrane Protein Laboratory, Imperial College and Diamond Light Source (j.foadi@imperial.ac.uk, james_foadi@diamond.ac.uk)
Gwyndaf Evans, Diamond Light Source (gwyndaf.evans@diamond.ac.uk)

Special thanks to David Waterman (CCP4 core team) for implementing the GUI version of BLEND and to Pierre Aller (Diamond Light Source) for the BLEND tutorials.