truncate hklin foo_in.mtz hklout foo_out.mtz
[ plot foo.plt ]
[Keyworded input]
The standard use of the program is to read a file of averaged intensities (output from SCALA, SCALEPACK2MTZ or DTREK2MTZ) and write a file containing mean amplitudes and the original intensities. If anomalous data is present then F(+), F(-), with the anomalous difference, plus I(+) and I(-) are also written out. The amplitudes are put on an approximate absolute scale using the scale factor taken from a Wilson plot.
There are two ways in TRUNCATE to calculate the amplitudes from the intensities. The simplest is just to take the square root of the intensities, setting any negative ones to zero (keyword TRUNCATE NO). Alternatively, the "truncate" procedure (keyword TRUNCATE YES, the default) calculates a best estimate of F from I, sd(I), and the distribution of intensities in resolution shells (see below). This has the effect of forcing all negative observations to be positive, and inflating the weakest reflections (less than about 3 sd), because an observation significantly smaller than the average intensity is likely to be underestimated. See reference below.
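The simpler of the two conversions (TRUNCATE NO) can be illustrated in a few lines. This is only a sketch of the rule described above, not the program's own code, and the function name is hypothetical.

```python
import math


def amplitude_from_intensity_simple(intensity):
    """Square-root conversion: a negative intensity is set to zero first."""
    return math.sqrt(max(intensity, 0.0))


# A strong, a weak and a negative observation.
for i_obs in (400.0, 2.5, -3.0):
    print(i_obs, amplitude_from_intensity_simple(i_obs))
```

Every negative observation maps to the same amplitude of zero here, which is the loss of information that the "truncate" procedure is intended to avoid.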
If the input specified on the LABIN line includes an assignment for F, then no reflections will be output. It is most undesirable to TRUNCATE a set of data where the intensities have already been modified to generate amplitudes.
This program can be used even if the "truncate" procedure is not desired, since it produces some useful statistics on intensity distributions. These can indicate problems with the data; for instance if the data is extremely anisotropic (see the FALLOFF keyword) or if it is likely to be twinned. See the cumulative intensity distribution plot, which becomes sigmoidal for perfect twinning, and the moments of E (or Z), which differ between twinned and untwinned data. If there are indications of twinning they will be noted as warnings in the log file.
The scale factor estimated from the Wilson plot is applied to the data and allows the data to be put on a (very approximate) absolute scale. This at least gives amplitudes of a sensible magnitude for further calculations. The calculation relies on the number of residues/atoms given by the keywords NRESIDUE/CONTENTS being roughly correct. The program does not, however, apply any temperature factor.
Note: this is the same test as before, but the description is now in terms of the normalised structure factor E or the normalised intensity Z, since this takes into account correctly the symmetry factor epsilon and the normalisation.
We define a reduced intensity I' = I / epsilon, where epsilon is the symmetry factor which increases the mean intensities for certain planes or lines in reciprocal space. We also define Z = I' / <I'>. The general formulae for the expected moments of I' and Z (where k-th moment of x = <x^k>) for untwinned acentric data are:
    k-th moment of I' = Gamma(k+1) * <I'>^k
    k-th moment of Z  = Gamma(k+1) = k! if k is an integer
The normalised structure amplitude is given by:
    E = I'^(1/2) / <I'>^(1/2) = Z^(1/2)
The moments of E can then be related to the half-integral moments of Z:
    k-th moment of E = <E^k> = <Z^(k/2)> = Gamma(k/2+1)
For k odd, this can be evaluated as:
    k-th moment of E = Gamma(k/2+1) with k odd
                     = k/2 * (k/2 - 1) * ... * 0.5 * Gamma(1/2)
                     = sqrt(PI) * k/2 * (k/2 - 1) * ... * 0.5
    e.g. <E^3> = sqrt(PI) * 1.5 * 0.5 = 1.329
For k even, <E^k> is simply (k/2)!
One can also calculate expected moments for acentric data in the case of perfect twinning, and for untwinned and twinned centric data, giving:
                    Acentric                          Centric
          Untwinned data  Perfect twin.    Untwinned data  Perfect twin.
<E>            0.886          0.94              0.798          0.886
<E^3>          1.329          1.175             1.596          1.329
<E^4>          2.0            1.5               3.0            2.0
<E^6>          6.0            3.0              15.0            6.0
<E^8>         24.0            7.5             105.0           24.0
Truncate outputs these moments for comparison against these expected values.
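As a quick check, the untwinned acentric values can be reproduced from the relation <E^k> = Gamma(k/2+1) given above. The sketch below is illustrative only; the function names are hypothetical, and it covers only the untwinned acentric case.

```python
import math


def expected_moment_e_acentric(k):
    """Expected <E^k> = Gamma(k/2 + 1) for untwinned acentric data."""
    return math.gamma(k / 2 + 1)


def observed_moment(e_values, k):
    """Sample k-th moment <E^k> of a list of normalised amplitudes."""
    return sum(e ** k for e in e_values) / len(e_values)


for k in (1, 3, 4, 6, 8):
    print(f"<E^{k}> expected (untwinned acentric) = "
          f"{expected_moment_e_acentric(k):.3f}")
```

`observed_moment` shows how the corresponding value would be estimated from a list of E values for comparison with the expected one.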
The various data control lines are identified by keywords. Only the first 4 letters of each keyword are necessary. Most keywords are optional.
ANOMALOUS, CELL, CONTENTS, FALLOFF, HEADER, HISTORY, LABIN, LABOUT, NRESIDUE, PLOT, RANGES, RESOLUTION, RSCALE, SCALE, SYMMETRY, TITLE, TRUNCATE, VPAT
In addition, the following optional keywords control the data harvesting functionality:
DNAME, NOHARVEST, PNAME, PRIVATE, RSIZE, USECWD
(default TITLE='From Truncate') [OPTIONAL INPUT]
Title to write to output reflection file
(Default <nrange>=60) [OPTIONAL INPUT]
<nrange> is the number of resolution bins over the resolution
range for the Wilson Plot. <range> is the width of the bins on 4(sin
theta/lambda)**2 and is an alternative to <nrange>. The resolution
range used for the Wilson Plot is taken from the input data file, or set
with the RESOLUTION keyword. A subset of these bins, covering a resolution
range defined with the RSCALE keyword, is used to estimate the scale and
B-factor.
The use of this card is discouraged, as the choice of the number of bins
is important. With too few bins, the scale and overall B will be less accurate.
With too few reflections per bin, the points may scatter widely about
the straight line, giving a large uncertainty in the values for
scale and B.
If this card is omitted, the program divides the resolution range into
60 bins, then checks that there are no fewer than 40 reflections in any
one bin and if necessary reduces the number of bins until this condition
is satisfied. In all cases, the program will stop if the width of ranges
is greater than 0.03 or the number of bins is greater than 60.
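The default binning rule can be sketched as follows. The text does not say how the program reduces the number of bins, so reducing by one bin at a time, using equal-width bins in 4(sin theta/lambda)**2, and the function name are all assumptions made for illustration.

```python
def choose_wilson_bins(s_values, max_bins=60, min_per_bin=40, max_width=0.03):
    """Pick the largest bin count (<= max_bins) with enough reflections per bin.

    s_values holds 4(sin theta/lambda)**2 for each reflection.
    Assumption: bins have equal width and the count is reduced one at a time.
    """
    lo, hi = min(s_values), max(s_values)
    if hi <= lo:
        raise ValueError("resolution range is empty")
    for nbins in range(max_bins, 0, -1):
        width = (hi - lo) / nbins
        counts = [0] * nbins
        for s in s_values:
            # Clamp so that the highest-resolution reflection falls in the last bin.
            index = min(int((s - lo) / width), nbins - 1)
            counts[index] += 1
        if min(counts) >= min_per_bin:
            if width > max_width:
                raise ValueError("bin width exceeds the 0.03 limit")
            return nbins, width
    raise ValueError("too few reflections for even a single bin")


# 100 evenly spread reflections only support two bins of at least 40 each.
print(choose_wilson_bins([0.05 * i / 99 for i in range(100)]))
```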
[OPTIONAL INPUT]
Resolution limits - either 4(sin theta/lambda)**2 or d in Angstroms (either order). Reflections outside these limits will be excluded from all analysis and omitted on output. Defaults are taken from the range of data in the input file (i.e. all data included).
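The two ways of giving a resolution limit are related through Bragg's law (lambda = 2 d sin theta), so that 4(sin theta/lambda)**2 = 1/d**2. A minimal conversion sketch, with hypothetical function names:

```python
def d_to_four_ssq(d_angstrom):
    """Convert d (Angstroms) to 4(sin theta/lambda)**2 = 1 / d**2."""
    return 1.0 / d_angstrom ** 2


def four_ssq_to_d(four_ssq):
    """Convert 4(sin theta/lambda)**2 back to d in Angstroms."""
    return four_ssq ** -0.5


for d in (20.0, 3.0, 2.0):
    print(d, d_to_four_ssq(d))
```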
[OPTIONAL INPUT]
Resolution limits for scaling (either 4(sin theta/lambda)**2 or d). This option allows you to exclude low resolution reflections from the calculation of the scale and B factor. However, all points in the range defined by RESO are plotted on the Wilson plot. It is probably a good idea to include only high resolution data (beyond 3A, if you have any data there) in the calculation of the scale and B factor. This is because the assumptions behind Wilson statistics are invalid for low resolution data. The default high resolution limit is the same as RESOLUTION. The low resolution limit is, by default, set to 4.0A if the high resolution limit is greater than 3.5A.
[OPTIONAL INPUT]
The default is to apply a scale factor from the Wilson plot. If a scale factor is given here, then that is applied instead. This option is useful if relative scaling is already done in SCALA.
If amplitudes rather than intensities are specified on the LABIN line, then the Wilson scale is not applied, and a default scale of 100 is used.
[OPTIONAL INPUT]
The first argument of the FALLOFF keyword should be "YES" or "NO", followed optionally by subkeywords controlling the detailed behaviour. The default is "YES", which triggers an analysis of the anisotropy of the data according to the "falloff" procedure contributed by Yorgo Modis. This calculates the falloff of mean F and mean F/sigma values as a function of (sin theta/lambda)**2 in 3 orthogonal directions. An overall falloff of all reflections is also calculated. The 3 mutually perpendicular directions are:
DIRECTION 2 = B*-AXIS
DIRECTION 3 = PERPENDICULAR TO A* AND B*
DIRECTION 1 = PERPENDICULAR TO B* AND DIRECTION 3.
If either of the subkeywords PLTX or PLTY is specified, then an output plot file (PLOT) is produced, in which Direction 1 is plotted as a thick line, Direction 2 is plotted as a hollow line with boxes at regular intervals of resolution, and Direction 3 is plotted as a thin line. The resolution range and number of resolution bins used in the calculation can be set by the keywords RESOLUTION and RANGES respectively.
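The three directions defined above can be built from the reciprocal axes a* and b* with two cross products. The sketch below assumes the reciprocal axes are given as Cartesian vectors; the sign (handedness) chosen for Directions 1 and 3 is also an assumption, since the text only states perpendicularity, and the function names are hypothetical.

```python
import math


def cross(u, v):
    """Vector cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    """Scalar product of two 3-vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(u):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(u, u))
    return tuple(a / length for a in u)


def falloff_directions(a_star, b_star):
    """Return unit vectors for Directions 1, 2 and 3."""
    dir2 = unit(b_star)                 # Direction 2: along b*
    dir3 = unit(cross(a_star, b_star))  # Direction 3: perpendicular to a* and b*
    dir1 = unit(cross(dir2, dir3))      # Direction 1: perpendicular to b* and Direction 3
    return dir1, dir2, dir3


# Example with a non-orthogonal pair of reciprocal axes (made-up values).
d1, d2, d3 = falloff_directions((0.02, 0.0, 0.0), (0.005, 0.03, 0.0))
print(d1, d2, d3)
```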
Subkeywords:
Specify input column labels. [OPTIONAL INPUT]
Truncate takes output from SCALA, SCALEPACK2MTZ or DTREK2MTZ which generate standard labels. This is the most common usage of the program, in which case LABIN records are not required. If F is assigned, there will be no reflections output. You must assign either IMEAN/SIGIMEAN or F/SIGF.
The program labels defined are: IMEAN SIGIMEAN I(+) SIGI(+) I(-) SIGI(-) F SIGF FreeR_flag
- IMEAN: Original average Structure Intensity
- SIGIMEAN: Standard deviation of the above
- I(+): Structure Intensity of hkl
- SIGI(+): Standard deviation of the above
- I(-): Structure Intensity of -h -k -l
- SIGI(-): Standard deviation of the above
- F: Original average Structure Amplitude
- SIGF: Standard deviation of the above
- FreeR_flag: Column of free-R flags
Specify output column labels. [OPTIONAL INPUT]
The labels allowed are F SIGF DANO SIGDANO F(+) SIGF(+) F(-) SIGF(-) IMEAN SIGIMEAN I(+) SIGI(+) I(-) SIGI(-) FreeR_flag ISYM. The output labels will default to these unless they are changed by assigning a program label to a user label.
- F: Structure Amplitude
- SIGF: Standard deviation of the above
- DANO: Anomalous difference
- SIGDANO: Standard deviation of the above
- F(+): Structure Amplitude for hkl
- SIGF(+): Standard deviation of the above
- F(-): Structure Amplitude for -h -k -l
- SIGF(-): Standard deviation of the above
- IMEAN: Original average Structure Intensity
- SIGIMEAN: Standard deviation of the above
- I(+): Structure Intensity of hkl
- SIGI(+): Standard deviation of the above
- I(-): Structure Intensity of -h -k -l
- SIGI(-): Standard deviation of the above
- FreeR_flag: Column of free-R flags
- ISYM: Symmetry number for F: normally=0 but 1 or 2 if the F column arises entirely from F+ or F- reflections respectively
If there is no anomalous data present then only the appropriate columns (F, SIGF, IMEAN and SIGIMEAN) are output. Values may be given in any order and as either Proglabel=Userlabel or Userlabel=Proglabel.
[ALTERNATIVE COMPULSORY INPUT]
followed by number of atoms in asymmetric unit, including hydrogens
A maximum of 20 atom (element) types is allowed, each followed by a number, e.g.
CONTENTS H 746 C 454 N 115 O 139 S 12 ! Must include hydrogens
The average scattering power is calculated from a table of form factors. By default the file $CLIBD/atomsf.lib contains this table of form factors. You can change the table used by assigning 'ATOMSF' to your preferred file. [NOTE the program RWCONTENTS provides the information for this keyword; how many Carbons etc., from a PDB file. Also, it gives an estimate of the number of hydrogens there would be.]
[ALTERNATIVE COMPULSORY INPUT]
<Nres> is the number of residues expected in the asymmetric unit
A very approximate atom composition is calculated:
mean mass of an amino acid = 110
add on one ordered water per amino acid = ca. 128
This is then taken as 5 C + 1.35 N + 1.5 O + 8 H per residue to give the number of atoms in the asymmetric unit.
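The per-residue composition above can be scaled to a whole asymmetric unit as in the sketch below. Rounding to whole atoms and formatting the result like a CONTENTS line are assumptions made for illustration; the function names are hypothetical.

```python
# Approximate atoms per residue, as stated for the NRESIDUE keyword.
ATOMS_PER_RESIDUE = {"C": 5.0, "N": 1.35, "O": 1.5, "H": 8.0}


def approximate_contents(n_residues):
    """Approximate atom counts in the asymmetric unit for n_residues residues."""
    return {element: per_residue * n_residues
            for element, per_residue in ATOMS_PER_RESIDUE.items()}


def contents_line(n_residues):
    """Format the approximate composition in the style of a CONTENTS line."""
    counts = approximate_contents(n_residues)
    parts = [f"{element} {round(counts[element])}" for element in ("H", "C", "N", "O")]
    return "CONTENTS " + " ".join(parts)


print(contents_line(100))
```

For 100 residues this gives 800 H, 500 C, 135 N and 150 O.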
[OPTIONAL INPUT]
volume per atom - default = 10
PLOT or PLOT ON produces extra ascii plots in the log output. The default is PLOT OFF.
[OPTIONAL INPUT]
[OPTIONAL INPUT]
Controls printout from reading file and batch headers
[OPTIONAL INPUT]
History strings to be added to history records in output file
[OPTIONAL INPUT]
Controls whether anomalous differences are output. Defaults to YES if anomalous information is present in the input file, otherwise NO.
[OPTIONAL INPUT]
If YES (default) the data will be truncated according to the procedure of French and Wilson. If NO the data are not truncated but the structure amplitudes are calculated simply by taking the square root of the intensities. Negative intensities are set to zero.
[OPTIONAL INPUT]
Default is to use symmetry in input HKLIN file. (Normally OMIT this command.)
[OPTIONAL INPUT]
The cell dimensions in Angstroms and degrees. The angles default to 90 degrees. If this key is omitted then the cell dimensions are taken from the input file (normally OMIT this command)
Provided a Project Name and a Dataset Name are specified (either explicitly or from the MTZ file) and provided the NOHARVEST keyword is not given, the program will automatically produce a data harvesting file. This file will be written to
$HARVESTHOME/DepositFiles/<projectname>/<datasetname>.truncate
The environment variable $HARVESTHOME defaults to the user's home directory, but could be changed, for example, to a group project directory. When running the program through the CCP4 interface, the $HARVESTHOME variable defaults to the 'PROJECT' directory.
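The location of the harvest file can be worked out as in the following sketch, which follows the path pattern and the $HARVESTHOME default described above. The function name and the example project and dataset names are hypothetical, and the sketch does not cover the CCP4 interface default or the USECWD keyword.

```python
import os
from pathlib import PurePosixPath


def harvest_file_path(project_name, dataset_name, harvest_home=None):
    """Build $HARVESTHOME/DepositFiles/<projectname>/<datasetname>.truncate."""
    if harvest_home is None:
        # $HARVESTHOME defaults to the user's home directory.
        harvest_home = os.environ.get("HARVESTHOME", os.path.expanduser("~"))
    path = (PurePosixPath(harvest_home) / "DepositFiles"
            / project_name / f"{dataset_name}.truncate")
    return str(path)


print(harvest_file_path("myproject", "native", "/home/user"))
```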
Dataset information in the input MTZ file header includes a Project Name and a Dataset Name for each dataset. These will be used to define the harvest file, unless overridden by the PNAME and DNAME keywords. Note that the latter only affect the harvest file, and do not change the header information of the output MTZ file. Editing of dataset information can be done with the program CAD.
Project Name. In most cases, this will be inherited from the MTZ file.
A dataset, as listed in the MTZ header, defines a project-name/dataset-name
pair. The project-name specifies a particular structure solution project,
while the dataset-name specifies a particular dataset contributing to the
structure solution. An entry in the PNAME keyword should therefore be
accompanied by a corresponding entry in the DNAME keyword.
Dataset Name. In most cases, this will be inherited from the MTZ file.
A dataset, as listed in the MTZ header, defines a project-name/dataset-name
pair. The project-name specifies a particular structure solution project,
while the dataset-name specifies a particular dataset contributing to the
structure solution. An entry in the DNAME keyword should therefore be
accompanied by a corresponding entry in the PNAME keyword.
Set the directory permissions to '700', i.e. read/write/execute for the user only (default '755').
Write the deposit file to the current directory, rather than a subdirectory of $HARVESTHOME. This can be used to send deposit files from speculative runs to the local directory rather than the official project directory, or can be used when the program is being run on a machine without access to the directory $HARVESTHOME.
Maximum width of a row in the deposit file (default 80). <row_length> should be between 80 and 132 characters.
Do not write out a deposit file; default is to do so provided Project and Dataset names are available.
The input files are:
Control data file.
This will contain one record per reflection with the following items:
SCALA (with or without anomalous)
    H K L IMEAN SIGIMEAN I(+) SIGI(+) I(-) SIGI(-)
SCALEPACK2MTZ or DTREK2MTZ (with anomalous)
    H K L IMEAN SIGIMEAN I(+) SIGI(+) I(-) SIGI(-)
Any set of amplitudes in mtz format for which you wish to analyse statistics.
    H K L F SIGF F(+) SIGF(+) F(-) SIGF(-)
where I+ and I- are used for the anomalous differences
The output files are:
The output file is a reflection data file in standard MTZ format (i.e. one record/reflection) containing 7 or 18 items per reflection as follows (see above for labels used):
    H K L F SDF [DANOM SD(DANOM) F(+) SDF(+) F(-) SDF(-)] Imean SDImean [I(+) SDI(+) I(-) SDI(-) ISYM]
The Fs are multiplied by the Wilson scale and the Is are multiplied by the square of the Wilson scale / 100.0.
Produced if either of the subkeywords PLTX or PLTY of the FALLOFF keyword is specified. This can be viewed with XPLOT84DRIVER, or converted to PostScript with PLTDEV.
The printer output starts with details of the control data and details of the input MTZ reflection data file. Analyses of the data against resolution are then given and include intensity distributions for comparison with Wilson's theoretical distributions. The following graphs are output (which can be viewed via XLOGGRAPH or LOGGRAPH):
The program TRUNCATE reads a reflection data file of averaged intensities (SCALA, SCALEPACK2MTZ or DTREK2MTZ output) and outputs an MTZ reflection data file containing F and DeltaFanom values. The input intensities are assumed to follow a normal distribution with the given standard deviations, i.e. negative observations must have been preserved. The truncation procedure used was devised by French and Wilson and is based on Bayesian statistics. The F's are calculated using the prior knowledge of Wilson's distributions for acentric or centric data (calculated in shells of reciprocal space in a first pass through the data) and the mean intensity and standard deviation values. The output F's are all positive and follow Wilson's distribution. The truncation procedure has little effect on reflections larger than 3 standard deviations but should give significantly better values for the weak data than those obtained by merely taking the square root of the intensities and setting negative intensities to zero. Reflections with intensities below minus four standard deviations are rejected.
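The Bayesian idea can be illustrated by direct numerical integration for the acentric case. This is a sketch under stated assumptions, not the program's implementation: the prior is taken as the acentric Wilson distribution exp(-J/S)/S with S the mean intensity of the resolution shell, the likelihood is normal with the observed standard deviation, and the integral is evaluated with a simple midpoint rule. The function and parameter names are hypothetical.

```python
import math


def posterior_mean_amplitude_acentric(i_obs, sigma, mean_intensity, n_steps=20000):
    """Posterior mean of F = sqrt(J) for one acentric reflection.

    Assumed prior:      p(J) = exp(-J / mean_intensity) / mean_intensity, J >= 0
    Assumed likelihood: i_obs is normal with mean J and standard deviation sigma
    """
    if i_obs < -4.0 * sigma:
        return None  # below minus four standard deviations: rejected
    # Completing the square shows that the posterior in J is a normal
    # distribution centred on mu with width sigma, truncated to J >= 0.
    mu = i_obs - sigma ** 2 / mean_intensity
    j_peak = max(mu, 0.0)
    lower = max(mu - 10.0 * sigma, 0.0)
    upper = j_peak + 10.0 * sigma
    step = (upper - lower) / n_steps
    numerator = 0.0
    denominator = 0.0
    for n in range(n_steps):
        j = lower + (n + 0.5) * step
        # Weight relative to its maximum at j_peak, which avoids underflow.
        log_w = -((j - mu) ** 2 - (j_peak - mu) ** 2) / (2.0 * sigma ** 2)
        w = math.exp(log_w)
        numerator += math.sqrt(j) * w
        denominator += w
    return numerator / denominator


# A strong, a weak and a slightly negative observation (made-up values).
for i_obs, sigma, shell_mean in ((1000.0, 10.0, 500.0), (1.0, 10.0, 100.0), (-5.0, 10.0, 100.0)):
    print(i_obs, posterior_mean_amplitude_acentric(i_obs, sigma, shell_mean))
```

The strong observation comes out very close to its square root, while the weak and negative ones are assigned small positive amplitudes, in line with the behaviour described above.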
The following warnings should be heeded:
Do not truncate data more than once, e.g. do not merge truncated data with untruncated data and then truncate again.
The standard deviations are crucial to the procedure and the standard deviation analysis from SCALA should be checked.
The procedure should not be used on data which has been forced to be positive e.g. from ordinate analysis measurements on a diffractometer or where negative observations have been set to zero.
The Wilson plot part of the program attempts to calculate an absolute scale and temperature factor for a set of observed intensities, using the theory of A.J.C. Wilson. This says that IF the atoms are randomly distributed through the asymmetric unit THEN
<f**2> should equal scale*<Fobs**2> * exp -2B(sin theta/lambda)**2
By fitting a least squares line through ln(<f**2>/<Fobs**2>) v 2(sin theta/lambda)**2 the program derives the scale and B value.
For real structures the assumption that the atoms are randomly distributed is obviously incorrect. The effect of this is most obvious in the low resolution reflections. The Wilson plot will deviate from a straight line from about 3.0A - 4.0A downwards. Although all the points on the Wilson plot are plotted, the scale and B are only determined from a limited resolution range determined by the user (see keyword RSCALE).
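The straight-line fit can be sketched as below, fitting ln(<f**2>/<Fobs**2>) against 2(sin theta/lambda)**2 by ordinary least squares. The sign convention is an assumption: observed intensities are taken to fall off as exp(-2B(sin theta/lambda)**2) relative to <f**2>, so that the fitted slope is +B and the intercept is ln(scale); with the opposite convention the slope changes sign. The function names are hypothetical, and the input is assumed to be already restricted to the RSCALE range.

```python
import math


def fit_line(x_values, y_values):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def wilson_scale_and_b(ssq, mean_f_sq, mean_fobs_sq):
    """Estimate scale and B from per-bin averages.

    ssq          : (sin theta/lambda)**2 at the centre of each bin
    mean_f_sq    : <f**2> expected from the cell contents, per bin
    mean_fobs_sq : <Fobs**2> observed, per bin
    """
    x = [2.0 * s for s in ssq]
    y = [math.log(f / fo) for f, fo in zip(mean_f_sq, mean_fobs_sq)]
    slope, intercept = fit_line(x, y)
    return math.exp(intercept), slope  # (scale, B) under the assumed convention


# Synthetic bins generated with scale = 5 and B = 20 (made-up values).
ssq = [0.02, 0.04, 0.06, 0.08]
f_sq = [100.0, 80.0, 60.0, 40.0]
fobs_sq = [f * math.exp(-2.0 * 20.0 * s) / 5.0 for f, s in zip(f_sq, ssq)]
print(wilson_scale_and_b(ssq, f_sq, fobs_sq))
```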
There may be a problem in evaluating <Fobs**2> if all the weak data have been systematically omitted (this should NOT be the case for data measured in any proper manner: note that if this IS the case, the Truncate procedure will also fail). If this is the case then you need to use TRUNCATE NO. The program estimates the expected number of reflections in each resolution shell and then calculates <Fobs**2> by dividing by the number of predicted reflections.
The program may stop with the following error message:
TRUNCATE: *** Data beyond useful resolution limit ***
This indicates that there is a resolution bin where the <I> is <= 0.0 (which indeed is tantamount to having indexed the data beyond a useful resolution limit).
In this situation it is worth running TRUNCATE again setting the high resolution limit to something below the current maximum (using the RESO keyword). However it is also worth considering whether this is an indication of a more serious problem (for example, is the wrong cell set?).
scala, scalepack2mtz, dtrek2mtz, rwcontents, Data Harvesting
K.S. Wilson and S. French
"falloff" program contributed by Yorgo MODIS, European Molecular Biology Lab (original program: W.G.J. HOL/SINEKE BREEN, part of the Groningen BIOMOL package). Incorporated into TRUNCATE by Martyn Winn.