BLEND (CCP4: Supported Program)
NAME
blend
- management and processing of multiple crystals / multiple data sets
SYNOPSIS
blend -a foo_in.dat
blend -s cut_level_high [cut_level_low]
blend -sLCV cut_LCV_level_high [cut_LCV_level_low]
blend -saLCV cut_absolute_LCV_level_high [cut_absolute_LCV_level_low]
blend -c d1 d2 d3 d4 ... or [d1] [d2] [[d3-d4]] ..., etc.
Description
Input and output files
Keyworded input
Miscellaneous and problems
References
Authors and credits
DESCRIPTION
X-ray data collection from a single crystal is not always feasible. Very often crystallographers
collect data from multiple crystals or from multiple locations on a single crystal. The resulting
datasets are normally incomplete, or show low redundancy, and none of them, used individually,
allows reliable phasing, model building or refinement. It is potentially better
to merge the individual datasets into more complete ones. Besides solving the incompleteness
issue, merging produces datasets which, although inherently less precise, tend to be
more accurate, because systematic errors coming from different sources (multiple crystals) interfere
destructively. This, in turn, translates into better-quality structure factors, with positive effects
on phasing, model building and refinement. High redundancy can also increase the anomalous
signal, when this is needed.
BLEND is a program for the management of multiple datasets. It simplifies the analysis and greatly
reduces the combinatorial explosion involved in the formation of multiple groups from the original
set of data. The program can run in three different modes. In the analysis mode (option -a) it reads
in multiple unmerged reflection files produced by an integration program (either MOSFLM [2] or
XDS [3]), and carries out cluster analysis on one or two types of statistical descriptors, extracted or
calculated from each dataset. Results produced by the program in analysis mode can then be used
by running the program in synthesis mode (option -s), or in combination mode (option -c). In synthesis
mode datasets belonging to previously determined clusters are scaled together and output into
individual merged reflection files in MTZ format, ready to be used in all subsequent stages of
phasing, model building and refinement. Variants of the synthesis mode are synthesis mode using LCV values (option -sLCV) and
synthesis mode using absolute LCV values (option -saLCV); the meaning of LCV and absolute LCV is explained later.
The combination mode allows users to carry out the same
tasks enabled by the synthesis mode, this time for any combination of datasets, not necessarily those
grouped in clusters.
Input preparation
Input for the program is a group of unmerged reflection files produced by integration programs. At present only files from the MOSFLM
and XDS integration software can be handled. Reflection files can be either included in a single directory, or spread across several
directories. In the first scenario, only the path to the directory needs to be fed into the program. In the second scenario, paths to
all individual files have to be listed in a single ASCII file, which is then fed into BLEND.
Program execution is controlled by keywords. From version 0.5.0, keywords are passed as standard input, as is the case
for the majority of CCP4 programs. If no keywords are passed to the program, default values and / or procedures for the parameters
connected to the keywords will be used. More on BLEND keywords later.
Some problems arising in connection with input preparation can be found in the section Miscellaneous and problems.
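As an illustration of keyword input, the following Python sketch assembles a keyword block and passes it to BLEND on standard input. The helper names (keyword_block, run_blend) are hypothetical and not part of BLEND, the keyword values are arbitrary examples, and the sketch assumes one keyword per line and a "blend" executable available on the PATH.

```python
import shutil
import subprocess


def keyword_block(keywords):
    """Format a mapping {KEYWORD: value} as one 'KEYWORD value' line each.

    Assumption: BLEND reads one keyword per line from standard input,
    as most CCP4 programs do.
    """
    lines = [f"{name.upper()} {value}".rstrip() for name, value in keywords.items()]
    return "\n".join(lines) + "\n"


def run_blend(args, keywords):
    """Run 'blend <args>' with the keyword block on standard input.

    Returns None when no 'blend' executable is found on the PATH,
    otherwise the subprocess.CompletedProcess of the run.
    """
    if shutil.which("blend") is None:
        return None
    return subprocess.run(
        ["blend", *args],
        input=keyword_block(keywords),
        text=True,
        capture_output=True,
        check=False,
    )
```

A call such as run_blend(["-a", "original.dat"], {"RADFRAC": 0.5, "ISIGI": 2.0}) would then correspond to running "blend -a original.dat" with the two keywords supplied as standard input.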
Running the program in analysis mode
Once input is ready, BLEND can be run in analysis mode. All multiple datasets will be analysed individually
and tested for overall radiation damage.
If any dataset is found to be significantly affected, parts of it will be removed. The amount of data to
be removed can be controlled with keyword RADFRAC. Values range between 0 and 1; 0 means keeping everything
while 1 means removing all affected parts of the data. The default value is 0.75, which essentially tells the
program to remove all reflections whose intensity, on average, has been reduced by radiation damage by more
than 25% of its original value.
Input files for BLEND in analysis mode contain integrated (not scaled) data. They can be either mtz files
produced by MOSFLM, or ASCII files produced by XDS ("INTEGRATE.HKL").
If these files are stored within a single directory, then simply type:
blend -a /where/integrated/data/are/store
If files are spread across several directories, then the user will have to create an ASCII file with all
files (and their exact paths) listed one after the other. The content of one such file, which we will arbitrarily
name "original.dat", could for instance look like the following:
/home/joe/data/xtal1/xl-d01.mtz
/home/joe/data/xtal1/xf-d03.mtz
/home/joe/data/xtal5/xl-d12.mtz
/home/joe/data/xtal12/INTEGRATE.HKL
/home/joe/data/xtal13/INTEGRATE.HKL
In this case BLEND can be executed as follows:
blend -a original.dat
Several files will be produced by the program in analysis mode (see Input and Output files section).
Those describing dataset clustering are "tree.png" (with a PostScript version, "tree.ps") and "CLUSTERS.txt". Other files are
needed for bookkeeping. The important binary file "BLEND.RData" contains essential information needed by the program to run
in synthesis and combination modes; it must not be deleted.
Running the program in synthesis mode
By running BLEND in synthesis mode the user aims at producing new datasets out of two or more individual datasets.
Each node in the dendrogram can give rise to a scaled dataset. The easiest option for the user is to force BLEND
to produce scaled datasets for all nodes in the dendrogram. This is, though, also the lengthiest option, and often an unnecessary one, because
the user might only be interested in some of the nodes, for example those relating to tighter clusters.
To single out only some of the nodes in the dendrogram, one or two numerical levels need to be provided for
execution. Consider for instance a case corresponding to the following "CLUSTERS.txt" file, describing a dendrogram with 13 clusters:
Cluster   Number of   Cluster   LCV    aLCV   Datasets
Number    Datasets    Height                  ID
001               2     0.173   0.03   0.02   5 6
002               2     0.242   0.01   0.01   9 13
003               2     0.433   0.05   0.05   3 14
004               2     0.518   0.08   0.07   8 10
005               3     0.610   0.05   0.04   7 9 13
006               4     0.702   0.13   0.11   12 7 9 13
007               4     0.744   0.11   0.09   5 6 8 10
008               3     0.982   0.17   0.14   4 3 14
009               4     1.297   0.19   0.17   2 4 3 14
010               5     1.623   0.28   0.23   11 12 7 9 13
011               5     2.711   0.48   0.39   1 2 4 3 14
012               9     3.343   0.44   0.30   5 6 8 10 11 12 7 9 13
013              14    13.670   1.04   0.84   1 2 4 3 14 5 6 8 10 11 12 7 9 13
To create merged files out of all nodes below height 4 in the dendrogram we type:
blend -s 4
This will produce 12 new datasets: the one corresponding to node (5+6), the one corresponding to node (9+13), the one corresponding
to node (3+14), etc. To produce datasets for all nodes, simply type:
blend -s 14
because the whole dendrogram is below 14 (the top height is 13.670). Suppose one needs to merge only data sets 1, 2, 4, 3, 14, because
they form a rather tight cluster. With:
blend -s 3
these data sets will be merged, but so will data sets 11, 12, 7, 9, 13, data sets 2, 4, 3, 14, and so on, because they all
correspond to nodes at heights lower than 3. Given that 1, 2, 4, 3, 14 form a cluster at a height of exactly 2.711, by selecting
two levels, one higher and one lower than 2.711, a scaled file will be calculated for this cluster only. For example:
blend -s 2.712 2.710
These two numbers are arbitrary values that fall just above and below 2.711 and that bracket no other node
in the dendrogram. When two values are used, the larger one must be typed first.
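The effect of the cut levels can be checked before running BLEND with a few lines of Python. The sketch below is not BLEND code: it hard-codes the cluster heights of the example "CLUSTERS.txt" above, the function name select_clusters is hypothetical, and it assumes that a node is selected when its height lies strictly between the low and the high level (how BLEND treats a height exactly equal to a level is not stated here).

```python
# Rows of the example "CLUSTERS.txt": (cluster number, cluster height).
CLUSTER_HEIGHTS = [
    (1, 0.173), (2, 0.242), (3, 0.433), (4, 0.518), (5, 0.610),
    (6, 0.702), (7, 0.744), (8, 0.982), (9, 1.297), (10, 1.623),
    (11, 2.711), (12, 3.343), (13, 13.670),
]


def select_clusters(clusters, high, low=0.0):
    """Return the cluster numbers whose height lies between low and high.

    Mirrors 'blend -s high [low]': the larger level comes first.
    Assumption: both bounds are exclusive.
    """
    if low > high:
        raise ValueError("the larger level must be given first")
    return [number for number, height in clusters if low < height < high]
```

With these data select_clusters(CLUSTER_HEIGHTS, 4) returns clusters 1 to 12, select_clusters(CLUSTER_HEIGHTS, 3) returns clusters 1 to 11, and select_clusters(CLUSTER_HEIGHTS, 2.712, 2.710) returns cluster 11 only.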
Variants of run in synthesis mode
As previously mentioned, two variants of synthesis mode are available: synthesis mode using LCV values (option -sLCV)
and synthesis mode using absolute LCV values (option -saLCV). In these variants cluster selection uses LCV
and absolute LCV values, rather than cluster heights (many thanks to Alkistis Mitropoulou for suggesting these variants!).
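LCV is described in the output files section as the maximum linear increase or decrease of the diagonals on the 3 independent cell faces, with aLCV as the corresponding value in angstroms. The Python sketch below shows one possible reading of that description; it is not the formula implemented in BLEND. In particular, the choice of the diagonal (the one spanning the sum of the two cell edges), the normalisation by the smallest diagonal, and the function names are all assumptions.

```python
import math


def face_diagonals(cell):
    """Diagonals of the ab, ac and bc faces of a cell (a, b, c, alpha, beta, gamma).

    Assumption: the diagonal taken on each face is |x + y|, i.e.
    sqrt(x^2 + y^2 + 2 x y cos(angle)), with the angle between the two edges.
    """
    a, b, c, alpha, beta, gamma = cell

    def diagonal(x, y, angle_deg):
        return math.sqrt(x * x + y * y + 2.0 * x * y * math.cos(math.radians(angle_deg)))

    return (diagonal(a, b, gamma), diagonal(a, c, beta), diagonal(b, c, alpha))


def lcv_and_alcv(cells):
    """Return (LCV in percent, aLCV in angstroms) for a group of unit cells.

    Assumption: on each face the variation is (longest - shortest diagonal);
    LCV divides it by the shortest diagonal, and the largest value over the
    three faces is reported.
    """
    per_face = list(zip(*(face_diagonals(cell) for cell in cells)))
    lcv = max((max(d) - min(d)) / min(d) * 100.0 for d in per_face)
    alcv = max(max(d) - min(d) for d in per_face)
    return lcv, alcv
```

For two orthorhombic cells whose edges differ by 1% (30, 40, 50 and 30.3, 40.4, 50.5 angstroms) this reading gives an LCV of 1% and an aLCV of about 0.64 angstroms.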
Running the program in combination mode
Cluster analysis groups all datasets into several clusters. This makes it possible to carry out a limited number
of merging and scaling jobs among the huge number of possible dataset combinations, thus saving processing time.
Clustering, though, introduces limitations because the user can only calculate datasets corresponding to nodes in the
dendrogram. For example, referring to the dendrogram described previously, there is no way we could obtain scaled data out of the
union of data sets 1, 4, 11 and 13 because there is no node corresponding to this combination. Such a limitation can be overcome
by running the program in combination mode. In this specific case, simply type:
blend -c 1 4 11 13
When the number of data sets and clusters is large it is very tedious to type, or even cut and paste, the long string of numbers forming
the combination. For this reason an ad hoc syntax has been created to include groups of numerically contiguous data sets, whole
clusters or groups of clusters, and to exclude individual data sets or groups of data sets. The syntax is made up of the following rules:
"[]" a single-square bracket including one or more numbers means all data
sets in the clusters corresponding to those numbers.
"[[]]" a double-square bracket including one or more numbers indicates that all
data sets corresponding to those numbers are to be removed from the final group.
"-" a hyphen (minus sign) between two numbers indicates all integers between the two
numbers. If the first number is greater than the second, the selection is ignored.
"," commas between numbers are sometimes needed to separate data sets or clusters,
if they are inside single or double-square brackets.
EXAMPLES (all referring to the dendrogram previously described).
1) Combine cluster 2 with cluster 4:
blend -c [2] [4] equivalent to blend -c 8 9 10 13
2) All data sets in cluster 12, with the exception of data sets 7 and 11:
blend -c [12] [[7,11]] equivalent to blend -c 5 6 8 9 10 12 13 or blend -c 5 6 8-10 12 13
3) Clusters 1 and 9, with the exception of data set 14, and with the addition of data sets 1 and 7:
blend -c [1] [9] [[14]] 1 7 equivalent to blend -c 1 2 3 4 5 6 7 or blend -c 1-7
The ability to create scaled data out of any desired combination makes the program flexible.
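To check which data sets a given combination string selects, the rules above can be emulated in Python. This is only a sketch of the syntax as described in this section, not BLEND's own parser: the function names are hypothetical, the cluster table is copied from the example dendrogram, and hyphenated ranges are assumed to be written without spaces (e.g. 8-13).

```python
import re

# Cluster number -> data sets, from the example "CLUSTERS.txt".
CLUSTERS = {
    1: [5, 6], 2: [9, 13], 3: [3, 14], 4: [8, 10], 5: [7, 9, 13],
    6: [12, 7, 9, 13], 7: [5, 6, 8, 10], 8: [4, 3, 14], 9: [2, 4, 3, 14],
    10: [11, 12, 7, 9, 13], 11: [1, 2, 4, 3, 14],
    12: [5, 6, 8, 10, 11, 12, 7, 9, 13],
    13: [1, 2, 4, 3, 14, 5, 6, 8, 10, 11, 12, 7, 9, 13],
}


def expand_numbers(text):
    """Expand a string such as '7,11' or '5 6 8-13' into a list of integers."""
    numbers = []
    for item in re.split(r"[,\s]+", text.strip()):
        if not item:
            continue
        if "-" in item:
            first, last = (int(part) for part in item.split("-"))
            # range() is empty when first > last, so such a selection is ignored.
            numbers.extend(range(first, last + 1))
        else:
            numbers.append(int(item))
    return numbers


def expand_combination(spec, clusters):
    """Return the sorted data set numbers selected by a combination string."""
    keep, drop = set(), set()
    # Double brackets are handled first so "[[...]]" is not read as "[...]".
    for body in re.findall(r"\[\[([^\[\]]*)\]\]", spec):
        drop.update(expand_numbers(body))
    spec = re.sub(r"\[\[[^\[\]]*\]\]", " ", spec)
    # Single brackets name whole clusters.
    for body in re.findall(r"\[([^\[\]]*)\]", spec):
        for cluster in expand_numbers(body):
            keep.update(clusters[cluster])
    spec = re.sub(r"\[[^\[\]]*\]", " ", spec)
    # Whatever is left outside brackets names individual data sets.
    keep.update(expand_numbers(spec))
    return sorted(keep - drop)
```

Applied to the three examples, expand_combination("[2] [4]", CLUSTERS) gives 8 9 10 13, expand_combination("[12] [[7,11]]", CLUSTERS) gives 5 6 8 9 10 12 13, and expand_combination("[1] [9] [[14]] 1 7", CLUSTERS) gives 1 2 3 4 5 6 7.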
INPUT AND OUTPUT FILES
Input
BLEND can read unscaled reflection files in mtz format, or ASCII files in XDS format.
MTZ files contain, typically, integrated intensities as processed by MOSFLM [2].
XDS files are the unscaled integrated data ("INTEGRATE.HKL") produced by XDS [3].
Input can be either a file (no fixed name, but here it is indicated as "foo_in.dat"), or a directory
- foo_in.dat (file)
- Each line of this ASCII file is the path to a valid unscaled reflection file, to be processed by the program
- /path/to/a/valid/directory/ (directory)
- All valid unscaled reflection files in this directory will be processed by the program
Output
Execution of BLEND in different modes (analysis "-a", synthesis "-s" or "-sLCV" or "-saLCV", combination "-c") implies different output files.
(a) From analysis mode:
- BLEND_SUMMARY.txt
- is an ASCII file with tabulated information for all datasets being
processed. Each dataset is given a serial number and this same number is used throughout
the whole statistical analysis
- mtz_names.dat
- is simply the list of files read in by BLEND. If a previous list was already
present (because created by the user), this new list is a copy of it, with invalid files removed
- xds_files
- is a directory containing files in MTZ format, in those cases where integrated data
are in XDS format. The mtz files are obtained from the XDS files using POINTLESS. The
newly created MTZ files have names like "dataset_xxx.mtz", where xxx is a number.
See also "xds_lookup_table.txt"
- xds_lookup_table.txt
- this file can be checked in order to keep track of the original XDS files. If
no XDS files are involved, neither xds_files nor xds_lookup_table.txt will be created. All logs
produced by POINTLESS when converting XDS files into MTZ format will also be dumped
in the xds_files directory
- tree.png, tree.ps
- are graphics files in PNG and PostScript format, showing the
dendrogram derived from cluster analysis of all input datasets. Individual objects (datasets)
are recognizable through their serial number. If the number of datasets is relatively low (15-20
max), the dendrogram can be interpreted quite easily. For larger numbers it might be
easier to refer to the ASCII counterpart of the dendrogram, which is the file called "CLUSTERS.txt"
- CLUSTERS.txt
- In this file the exact numerical value of the dendrogram's merging nodes is
also reported, a feature useful to run BLEND in synthesis mode. The dendrogram is the
most important outcome of BLEND analysis. The user takes decisions on merged data,
based on his/her interpretation of the dendrogram.
The dendrogram also reports a "Linear Cell Variation" number. The Ward distance
used to measure cluster mergings indicates the overall loss in cell variability when the
number of merged datasets is increased. As cell parameter values are normalised and rotated
through principal component analysis, their numerical value in the dendrogram is not
immediately related to real cell variation. Therefore it is not possible to get a feeling for
structural isomorphism using the Ward distance. To help with this issue a parameter directly
related to unit cell differences has been introduced, the Linear Cell Variation (LCV). LCV
measures the maximum linear increase or decrease of the diagonals on the 3 independent
cell faces. Values below 1% in general indicate a good degree of isomorphism among
different crystals. Structural differences start to be noticeable with LCV greater than 1.5%.
The corresponding value in angstroms is provided by the absolute Linear Cell Variation (aLCV),
reported together with LCV in both "CLUSTERS.txt" and the dendrogram.
The isomorphism issue will, obviously, have to be considered jointly with the available data
resolution
- FINAL_list_of_files.dat
- is an ASCII file reporting number of batches kept and highest
resolution recommended for each dataset analysed by BLEND. Batches can be discarded
because intensities in them are deemed to be severely affected by radiation damage (see
keyword RADFRAC to control amount of discarded data). The highest recommended
resolution is a rough estimate of where data should be cut, if the user wishes signal-to-noise
ratio for the average intensity to be greater than a given value. This value is provided by the
user with the keyword ISIGI, followed by a numerical value; default value is 1.5. The
"FINAL_list_of_files.dat" file has 6 columns. The first is the path to the input file, the
second is the serial number assigned by BLEND (and used in cluster analysis), the third
is the image number after which data are discarded because weakened by radiation damage, the
fourth and fifth are the initial and final input image numbers, and the sixth is the resolution cutoff
- BLEND.RData
- is a binary file produced by the R code. It stores essential information used
by all runs of BLEND in synthesis and combination modes
(b) From synthesis mode:
- merged_files
- all files produced by BLEND when executed in synthesis mode are stored within this directory,
which is created if not already present, or is deleted and recreated if already
present. Thus, it is important to rename this directory if more than one run of BLEND in synthesis
mode is executed. This is taken care of when BLEND is executed with the CCP4 GUI
- MERGING_STATISTICS.info (inside directory "merged_files")
- is an ASCII file, essentially a table listing overall merging
statistics for all merged datasets produced by the specific run of BLEND. It includes Cluster
number, Rmeas, Rpim, Completeness, Multiplicity, Lowest Resolution and Highest
Resolution. The table is sorted according to the Rmeas column, from its lowest to its highest
value. If scaling with AIMLESS has failed for some reason, NA's are inserted in the
corresponding rows. This table should make it easy for the user to select the desired merged
dataset, in terms of completeness, multiplicity and data quality
- Rmeas_vs_Cmpl.png, Rmeas_vs_Cmpl.ps (inside directory "merged_files")
- a plot of all merged datasets in terms of Rmeas vs Completeness, both as PNG and PS graphics file
- CLUSTERS.info (inside directory "merged_files")
- is an ASCII file listing names and number of batches of each individual
dataset composing specific clusters
- unscaled_001.mtz, unscaled_002.mtz, ... (inside directory "merged_files")
- are unscaled files in mtz format. There are
as many of these files as the number of nodes selected by the user in the execution of
BLEND in synthesis mode. The number associated with each file name coincides with the
cluster (or node) number. Before scaling a dataset obtained by the collation of individual
datasets, all of them must have the same space group and the same indexing (if
they belong to polar groups). Also, individual images will need to have unique numbers.
Furthermore some datasets can have some images discarded and resolution limited. All this
bookkeeping is taken care of by a script calling POINTLESS which, by default, assigns the
most likely space group. This can be changed by using keyword CHOOSE SPACEGROUP
followed by the space group name (e.g. P 21 21 21, C 2, etc). Another
keyword used in BLEND which relates to POINTLESS is TOLERANCE; this keyword
controls how much cell sizes are allowed to change if they are to be considered as
belonging to the same structure. If the user wants to use a specific dataset as reference, the
serial number of the desired dataset can be included after keyword DATAREF, if such reference is
part of the input data. Alternatively the name of a reference mtz file can be given (as in
POINTLESS).
The reason why merged but unscaled files are kept for the user is connected with the way subsequent
scaling is carried out. At present scaling in default mode is performed by BLEND using
AIMLESS. This does not always guarantee the production of final averaged intensities. For
example, data could be weak, or the collection could have followed some unusual set-up. A successful
scaling could then be obtained by running AIMLESS in non-default mode, using specific
keywords. The starting files for doing this are the "unscaled_xxx.mtz" files. There is also, of
course, the option to re-run BLEND in synthesis mode by adding specific scaling keywords,
but, at present, not all available AIMLESS keywords can be used in BLEND.
- scaled_001.mtz, scaled_002.mtz, ... (inside directory "merged_files")
- are the final scaled files, for those cases that could be
successfully scaled
- pointless_001.log, pointless_002.log, ... (inside directory "merged_files")
- log files from all POINTLESS jobs executed to produce files "unscaled_001.mtz", "unscaled_002.mtz", ...
- aimless_001.log, aimless_002.log, ... (inside directory "merged_files")
- full logs from the AIMLESS runs. The user can
benefit from these files to find out detailed information on merging statistics and scaling in
general
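Because "merged_files" is deleted and recreated at every run in synthesis mode, results from a previous run have to be moved out of the way first when BLEND is run from the command line. A minimal Python sketch of this housekeeping follows; the function name and the "merged_files_001" naming scheme are arbitrary choices, not BLEND conventions.

```python
from pathlib import Path


def preserve_merged_files(workdir=".", name="merged_files"):
    """Rename '<workdir>/merged_files' to the first free 'merged_files_NNN'.

    Returns the new path, or None if there was no directory to rename.
    """
    source = Path(workdir) / name
    if not source.is_dir():
        return None
    counter = 1
    # Look for the first numbered name that is not already taken.
    while (target := source.with_name(f"{name}_{counter:03d}")).exists():
        counter += 1
    source.rename(target)
    return target
```

Calling preserve_merged_files() in the BLEND working directory before each new "blend -s" run keeps the output of earlier runs.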
(c) From combination mode:
- combined_files
- all files produced by BLEND when executed in combination mode are stored within this directory,
which is created if not already present
- MERGING_STATISTICS.info (inside directory "combined_files")
- same file as the one produced inside "merged_files" when BLEND is executed in synthesis mode.
Results are, in this case, not sorted according to decreasing completeness
- GROUPS.info (inside directory "combined_files")
- this file is the equivalent of "CLUSTERS.info" in the "merged_files" directory when BLEND is executed
in synthesis mode
- unscaled_001.mtz, unscaled_002.mtz, ... (inside directory "combined_files")
- unscaled files corresponding to all combinations tried by the user. See equivalent files in
directory "merged_files", created when BLEND is executed in synthesis mode
- scaled_001.mtz, scaled_002.mtz, ... (inside directory "combined_files")
- scaled files corresponding to all successful scaling jobs of files unscaled_001.mtz, unscaled_002.mtz, ...
- pointless_001.log, pointless_002.log, ... (inside directory "combined_files")
- log files from all POINTLESS jobs executed to produce files "unscaled_001.mtz", "unscaled_002.mtz", ...
- aimless_001.log, aimless_002.log, ... (inside directory "combined_files")
- full logs from the AIMLESS runs. The user can
benefit from these files to find out detailed information on merging statistics and scaling in
general
KEYWORDED INPUT
BLEND keywords can be divided into three groups, as they control essentially three different parts of the program. Keywords with
their default values are summarized here:
- Group 1
CPARWT 1.000
ISIGI 1.500
LAUEGROUP (laue or space group symbol, as used in POINTLESS)
RADFRAC 0.750
- Group 2
CHOOSE SPACEGROUP (space group, as used in POINTLESS)
DATAREF (serial number of the longest-sweep data set, as selected by BLEND in analysis mode)
TOLERANCE 5 (same default value as the one used in POINTLESS)
- Group 3
ANOMALOUS OFF
RUN (default is to break into different runs at each discontinuity - see AIMLESS)
EXCLUDE (individual image numbers or images range - see AIMLESS)
RESOLUTION HIGH [smallest among highest resolutions of all composing data sets - see AIMLESS]
SCALES ROTATION SPACING 5 SECONDARY BFACTOR ON BROTATION SPACING 20 (see AIMLESS)
SDCORRECTION REFINE INDIVIDUAL (see AIMLESS)
Keywords in group 1 control data preparation, analysis and clustering. Keywords in group 2 are keywords used in
POINTLESS [4], while keywords in group 3 are keywords used in AIMLESS [4].
In the definitions below "[]" encloses optional items,
"|" delineates alternatives. All keywords are
case-insensitive, but are listed below in upper-case.
ANOMALOUS,
CHOOSE SPACEGROUP,
CPARWT,
DATAREF,
EXCLUDE,
ISIGI,
LAUEGROUP,
RADFRAC,
RUN,
RESOLUTION,
SCALES,
SDCORRECTION,
TOLERANCE
ANOMALOUS [OFF | ON]
Default value is OFF. ANOMALOUS is the same keyword used in AIMLESS. By default all I+ and I- observations
are averaged together in merging. If ANOMALOUS is ON, anomalous observations are kept separate in the final AIMLESS output pass, both for
statistics and merging. ANOMALOUS will automatically be turned ON if a substantial anomalous signal is detected.
CHOOSE SPACEGROUP
Default value is a blank, i.e. the final space group for the specific group of data to be scaled is the one determined by POINTLESS.
If the user wishes to fix the space group, rather than allowing POINTLESS to determine it, then this
keyword should be used with the chosen space group symbol. This is
advisable, for instance, when data are of poor quality and fixing the space group is necessary to
prevent POINTLESS from selecting a wrong one.
CPARWT
Default value is 1.0.
CPARWT is a number between 0 and 1, controlling which type of statistical descriptors are
used in cluster analysis. A value of 1 means that cell parameters (known as
primary descriptors) are used, while a value of 0 means that essentially averaged integrated
intensities (known as secondary descriptors) are used. Values between 0 and 1 are possible, and
correspond to a weighted use of both descriptors. At the present stage of research,
though, the advantage of mixing the two descriptors is not clear. Primary descriptors seem to
behave systematically better than secondary descriptors. Secondary descriptors can be tried, as a valid
alternative, in those cases where cell parameters are known to change very little.
DATAREF
Default value is the serial number of the longest-sweep individual data set, as selected by BLEND in analysis mode.
DATAREF can, alternatively, be fixed to the serial number of a different data set.
For example the default reference file could either belong to a space group
different from that of the datasets in the selected cluster, or have a different indexing
convention. In this case DATAREF can be assigned the serial number of one of the datasets
belonging to the cluster. An mtz file not included among the files processed by BLEND can also
be associated with DATAREF, similarly to what happens in POINTLESS when using keyword HKLREF.
EXCLUDE [<batch range> | <batch list>]
Default is no exclusion of any image.
This keyword, equivalent to the one used in AIMLESS, controls exclusion from the scaling process of specific images.
These can be provided as a series of individual image numbers or as an image range:
Example 1. EXCLUDE BATCH 12 18 21 89
Example 2. EXCLUDE BATCH 32 TO 46
An easier alternative for excluding images from scaling jobs is to write an AIMLESS keywords file by copying and pasting
input keywords found in specific AIMLESS logs included in either "merged_files" or "combined_files" directories, and adding
as many EXCLUDE keywords as needed.
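When many images have to be excluded, the EXCLUDE lines can be generated rather than typed. The Python sketch below only reproduces the two forms shown in the examples above ("EXCLUDE BATCH n1 n2 ..." and "EXCLUDE BATCH n1 TO n2"); the function name is hypothetical and the grouping of consecutive images into ranges is a convenience chosen for this illustration.

```python
def exclude_keywords(batches):
    """Return EXCLUDE lines for the given image (batch) numbers.

    Consecutive numbers become 'EXCLUDE BATCH a TO b' lines; the remaining
    isolated numbers are collected in a single 'EXCLUDE BATCH ...' line.
    """
    ordered = sorted(set(batches))
    runs, singles = [], []
    start = prev = None
    for number in ordered:
        if start is None:
            start = prev = number
        elif number == prev + 1:
            prev = number
        else:
            (runs if prev > start else singles).append((start, prev))
            start = prev = number
    if start is not None:
        (runs if prev > start else singles).append((start, prev))
    lines = [f"EXCLUDE BATCH {first} TO {last}" for first, last in runs]
    if singles:
        lines.append("EXCLUDE BATCH " + " ".join(str(first) for first, _ in singles))
    return lines
```

The resulting lines can be added to the keywords passed to BLEND, or pasted into an AIMLESS keywords file as described above.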
ISIGI
Default value is 1.5.
ISIGI controls the resolution cut. Integrated intensities and their errors are averaged in
resolution shells and interpolated with a polynomial of degree 10. Data are truncated where the
signal-to-noise ratio falls below the ISIGI value. The user can assess the signal-to-noise
ratio after scaling (from within the "aimless_xxx.log" files).
Normally this is higher than the 1.5 value introduced by ISIGI, because that value refers to
unscaled data. If too much or too little truncation has been applied, BLEND can be executed again with
a different ISIGI value.
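The idea behind the ISIGI cut can be illustrated with a simplified Python sketch. It is not BLEND's procedure: instead of the degree-10 polynomial mentioned above, it interpolates linearly (in d-spacing) between neighbouring resolution shells, and both the function name and the shell values used for illustration are hypothetical.

```python
def resolution_cutoff(shells, isigi=1.5):
    """Estimate the d-spacing (angstroms) where mean I/sigma drops below isigi.

    shells: list of (d_spacing, mean_I_over_sigma), ordered from low to high
    resolution, i.e. with decreasing d_spacing.
    Simplification: linear interpolation between the two shells that bracket
    the threshold, not a polynomial fit.
    """
    previous = None
    for d, ratio in shells:
        if ratio < isigi:
            if previous is None:
                return d  # even the lowest-resolution shell is below the threshold
            d_prev, ratio_prev = previous
            fraction = (ratio_prev - isigi) / (ratio_prev - ratio)
            return d_prev + fraction * (d - d_prev)
        previous = (d, ratio)
    return shells[-1][0]  # the threshold is never crossed: keep all data
```

For shells at 4.0, 3.0, 2.5 and 2.0 angstroms with mean signal-to-noise ratios of 10, 4, 2 and 1, the default ISIGI of 1.5 gives a cut at 2.25 angstroms.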
LAUEGROUP [ | AUTO | Point Group | Space Group]
Default value is blank, i.e. the point group is unchanged from the one found in the original reflection file.
LAUEGROUP can be used for data preparation, when reading data from
"INTEGRATE.HKL" files produced by XDS. These files normally include integration data
in a low symmetry space group, typically P1. If such data are fed into BLEND directly, the
program treats all 6 cell parameters as independent. This is permitted and feasible, but
if the correct Laue group is known to have higher symmetry, then treating all 6 cell
parameters as independent could introduce unnecessary statistical noise in the process of
cluster analysis. In such cases it is advisable to input the correct Laue or space group after
keyword LAUEGROUP. The resulting mtz file includes data and cell parameters of the
desired symmetry. If AUTO is used after LAUEGROUP, the conversion to an
mtz file will be carried out with POINTLESS in default mode, i.e. leaving it to POINTLESS
to find out the correct symmetry. If LAUEGROUP is not used, then the
"INTEGRATE.HKL" file will be converted into an mtz file without changing its Laue group
(default).
RADFRAC
Default value is 0.75.
The program makes use of this keyword when data are found to be subject to overall radiation damage. RADFRAC sets the minimum fraction
of the original average intensity that the user is willing to accept when intensities decay because of radiation damage. When RADFRAC is equal to 1,
cutting is quite severe; when RADFRAC is equal to 0 there is no cutting, even when substantial radiation damage is affecting datasets.
By default (RADFRAC 0.75), when BLEND detects the occurrence of substantial global radiation damage, all images collected
after a certain image are discarded. The discarded images, on average, include intensities that have been reduced by more than 25% of
their original value.
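The role of RADFRAC as a threshold can be pictured with the following Python sketch. How BLEND models the intensity decay is not described here, so the per-image list of retained intensity fractions is a purely hypothetical input, as is the function name; only the thresholding step follows the description above.

```python
def images_kept(retained, radfrac=0.75):
    """Return how many leading images are kept for a given RADFRAC.

    retained: per-image fraction of the original average intensity
    (1.0 = undamaged), in collection order. Hypothetical input: BLEND
    derives its own estimate of the decay.
    All images from the first one whose fraction falls below radfrac
    onwards are discarded.
    """
    if not 0.0 <= radfrac <= 1.0:
        raise ValueError("RADFRAC must lie between 0 and 1")
    for index, fraction in enumerate(retained):
        if fraction < radfrac:
            return index
    return len(retained)
```

With retained fractions of 1.0, 0.95, 0.9, 0.8, 0.74 and 0.7, the default RADFRAC keeps the first four images, RADFRAC 0 keeps all six, and RADFRAC 1 keeps only the first.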
RUN <Nrun> BATCH <b1> TO <b2>
Default keys are the same used in AIMLESS.
This keyword is equivalent to the one used in AIMLESS and controls the definition of "runs" (i.e. contiguous batches of data undergoing
a same scaling protocol). More details can be found in AIMLESS documentation pages.
RESOLUTION [[LOW] <Resmin>] [[HIGH] <Resmax>]
Default for subkey HIGH is the smallest among the highest resolutions of all composing data sets; for subkey LOW it is the largest
among the lowest resolutions of all composing data sets, where resolutions are expressed in angstroms.
The resolution limits computed by BLEND during analysis are determined via keyword ISIGI.
When several datasets are merged together, the smallest
among the high resolutions and the largest among the low resolutions are fixed for subsequent
scaling. Such limits can be changed by the user with the keyword RESOLUTION, exactly in
the same way it is used in AIMLESS.
SCALES [<subkeys>]
Default keys are the same used in AIMLESS.
This keyword is equivalent to the one used in AIMLESS and controls the scaling procedure followed. More details
can be found in AIMLESS documentation pages.
TOLERANCE
Default value is the same used in POINTLESS, i.e. 5.
TOLERANCE is equivalent to the corresponding POINTLESS keyword. Multiple crystals
can have cell parameters that differ considerably from each other (non-isomorphism). When the data are
needed to calculate a mid or low resolution electron density map, POINTLESS might need instructions to avoid
halting because large cell variations are encountered. Essentially the program is told to stop execution when
the cell difference among all component data sets goes beyond a threshold (the TOLERANCE value). The higher the TOLERANCE,
the more cell parameters are allowed to change, i.e. the more non-isomorphism is tolerated.
Use high values (say 100) if you do not care about cell variability.
SDCORRECTION [[NO]REFINE] [INDIVIDUAL | SAME] [FIXSDB]
Default is REFINE INDIVIDUAL.
This keyword is analogous to the one used in AIMLESS (see AIMLESS documentation pages). SDCORRECTION plays a role in the
determination of each reflection's error. Errors for all reflections undergo a refinement
process equivalent to the refinement used for scaling intensities, but less stable.
Thus it is possible that cycles for SD parameters
estimation do not converge, ultimately causing the AIMLESS job to fail. In such circumstances it is
possible to re-run BLEND using different values for the SDCORRECTION keyword, similarly to
what is prescribed in AIMLESS. Quite often the keyword
SDCORRECTION SAME
is sufficient to bring failed scaling jobs to completion. If no way of obtaining refined SD values is found, no refinement
(NOREFINE) is the only option left.
MISCELLANEOUS AND PROBLEMS
Problems (and, alas, crashes!) can happen in BLEND, as is the case with any software.
Some of them and their causes are known (and described in this section).
(1) Program abrupt terminations
We have made substantial efforts to stop the program from
crashing and, rather, to enable it to exit in a clean way with some kind of error message. But
crashes are still to be expected. They will become less and less frequent as users report them:
- Crashes in analysis mode
- At present the program has been reported to crash in analysis
mode if the size of data read in exceeds memory storage capacity. Luckily this is quite high
for modern laptops and desktops, thus should not be an issue in the majority of cases. It is
likely to become an issue if several very large datasets are read in in one run. Other types of
crashes are unknown.
- Crashes in synthesis mode
- These are generally a consequence of execution terminations
by either the POINTLESS or AIMLESS programs. BLEND can handle several of
these terminations and, in such cases, completes normally with an error or warning
message. If POINTLESS is successful, but AIMLESS fails, then the user should find that the
"unscaled_xxx.mtz" files have been created under the directory "merged_files", but
the "scaled_xxx.mtz" files have not, where "xxx" refers to the clusters whose
scaling jobs failed. In this case it is likely that the default
scaling recipe will have to be changed. Some clusters are made of datasets with different
point groups or other indexing inconsistencies. Unless appropriate keywords are used for
POINTLESS, the execution of BLEND in synthesis mode for these cases will return an error
message, and files of type "unscaled_xxx.mtz" will not be created.
(2) How to create an ASCII list of input files
Quite often input files are not included in a
single directory, but are spread across a number of directories. In this case a judicious use of
the unix commands "grep" and "find" can quickly produce the input list for BLEND. Suppose
all files are spread across directories all under a single directory named, say, "cdir".
A quick way to generate the list is to move to directory "cdir" and use "find" as
follows (many thanks to Morten Groftehauge for this tip):
find `pwd` -name "INTEGRATE.HKL" > original.dat
In this case all XDS files found in "cdir" or in its subdirectories will be listed in original.dat with their full path.
Variants of the above line will produce results for specific cases.
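Where the "find" command is not available, the same list can be produced with a short Python script. The sketch below is an equivalent of the command above under the same assumptions (all XDS files are called "INTEGRATE.HKL" and sit somewhere below one top directory); the function name is hypothetical, and a different pattern (e.g. "*.mtz") can be given for MOSFLM files.

```python
from pathlib import Path


def write_file_list(top, list_name="original.dat", pattern="INTEGRATE.HKL"):
    """Write the full paths of all files matching pattern under top.

    One path per line is written to list_name; the sorted list of paths
    is also returned.
    """
    top = Path(top).resolve()
    paths = sorted(str(path) for path in top.rglob(pattern) if path.is_file())
    Path(list_name).write_text("".join(f"{path}\n" for path in paths))
    return paths
```

For example, write_file_list("cdir") creates "original.dat" in the current directory, ready to be used with "blend -a original.dat".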
(3) Error estimation with AIMLESS
Error estimation and correction for multiple datasets is
still not completely reliable in AIMLESS. If AIMLESS crashes while handling errors, or if
the Mean((I)/sd(I)) has ridiculously high values, it is advisable to re-run BLEND (with
either the -s option, or the -c option for the specific combination of datasets under scrutiny)
using keywords "SDCORRECTION SAME" or "SDCORRECTION NOREFINE". Error estimation will be, in
this case, less reliable, but this is still better than obtaining no results at all.
Phil Evans (the author of AIMLESS) is constantly working to improve error estimation for
difficult scaling cases (and multiple crystals are difficult!), but this is an inherently
challenging theoretical and computational problem, not likely to be overcome in its entirety
any time soon.
REFERENCES
[1] J. Foadi, P. Aller, Y. Alguel, A. Cameron, D. Axford, R.L. Owen, W. Armour, D. Waterman, S. Iwata and G. Evans
"Clustering procedures for the optimal selection of data sets from multiple crystals in macromolecular crystallography"
Acta Cryst. (2013), D69, 1617–1632
[2] A.G.W. Leslie and H.R. Powell
"Processing Diffraction Data with Mosflm"
in Evolving Methods for Macromolecular Crystallography (2007), 245, 41–51
[3] W. Kabsch
"XDS"
Acta Cryst. (2010), D66, 125–132
[4] P.R. Evans
"Scaling and assessment of data quality"
Acta Cryst. (2006), D62, 72–82
AUTHORS AND CREDITS
James Foadi, Membrane Protein Laboratory, Imperial College and Diamond Light Source (j.foadi@imperial.ac.uk, james_foadi@diamond.ac.uk)
Gwyndaf Evans, Diamond Light Source (gwyndaf.evans@diamond.ac.uk)
Special thanks to David Waterman (CCP4 core team) for implementing BLEND GUI version and Pierre Aller (Diamond Light Source) for BLEND tutorials.
SEE ALSO
MOSFLM
XDS
POINTLESS
AIMLESS