TRUNCATE (CCP4: Supported Program)

NAME

truncate - obtain structure factor amplitudes using Truncate procedure and/or generate useful intensity statistics

SYNOPSIS

truncate hklin foo_in.mtz hklout foo_out.mtz [ plot foo.plt ]
[Keyworded input]

DESCRIPTION

The standard use of the program is to read a file of averaged intensities (output from SCALA, SCALEPACK2MTZ or DTREK2MTZ) and write a file containing mean amplitudes and the original intensities. If anomalous data is present then F(+), F(-), with the anomalous difference, plus I(+) and I(-) are also written out. The amplitudes are put on an approximate absolute scale using the scale factor taken from a Wilson plot.

There are two ways in TRUNCATE to calculate the amplitudes from the intensities. The simplest is just to take the square root of the intensities, setting any negative ones to zero (keyword TRUNCATE NO). Alternatively, the "truncate" procedure (keyword TRUNCATE YES, the default) calculates a best estimate of F from I, sd(I), and the distribution of intensities in resolution shells (see below). This has the effect of forcing all negative observations to be positive, and inflating the weakest reflections (less than about 3 sd), because an observation significantly smaller than the average intensity is likely to be underestimated. See reference below.
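The simpler of the two conversions (TRUNCATE NO) can be illustrated with a minimal Python sketch. The function name is hypothetical and the sketch only covers the amplitudes; how the program derives sd(F) is not described here and is not modelled.

```python
import math


def sqrt_amplitudes(intensities):
    """Square-root conversion: negative intensities are set to zero first."""
    return [math.sqrt(i) if i > 0.0 else 0.0 for i in intensities]


if __name__ == "__main__":
    # The negative observation becomes a zero amplitude.
    print(sqrt_amplitudes([4.0, -1.0, 0.0, 9.0]))
```

Every negative observation collapses to the same value of zero, which is the loss of information that the "truncate" procedure is designed to avoid.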

If the input specified on the LABIN line includes an assignment for F, then no output reflection file will be generated. It is most undesirable to TRUNCATE a set of data where the intensities have already been modified to generate amplitudes.

This program can be used even if the "truncate" procedure is not desired, since it produces some useful statistics on intensity distributions. These can indicate problems with the data; for instance if the data is extremely anisotropic (see the FALLOFF keyword) or if it is likely to be twinned. See the cumulative intensity distribution plot, which for a perfect twinning becomes sigmoidal, and the moments of E (or Z) which are different for twinned data than for untwinned. If there are indications of twinning they will be noted as warnings in the log file.
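As an illustration of why the cumulative intensity plot becomes sigmoidal for a perfect twin, the sketch below evaluates the standard theoretical cumulative distributions N(z) for normalised intensities Z. The function names are hypothetical, and the formulae are the usual textbook results derived from Wilson statistics rather than anything taken from the program source.

```python
import math


def n_acentric(z):
    """Fraction of acentric reflections with Z <= z, untwinned."""
    return 1.0 - math.exp(-z)


def n_acentric_perfect_twin(z):
    """Same for a perfect twin, where Z is the mean of two independent values."""
    return 1.0 - (1.0 + 2.0 * z) * math.exp(-2.0 * z)


def n_centric(z):
    """Fraction of centric reflections with Z <= z, untwinned."""
    return math.erf(math.sqrt(z / 2.0))


if __name__ == "__main__":
    for z in (0.1, 0.2, 0.5, 1.0):
        print(z, round(n_acentric(z), 3), round(n_acentric_perfect_twin(z), 3))
```

At small z the twinned curve lies well below the untwinned one and starts with zero slope, which gives the sigmoidal shape: twinning averages away the very weak reflections.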

The scale factor estimated from the Wilson plot is applied to the data and allows the data to be put on a (very approximate) absolute scale. This at least gives amplitudes of a sensible magnitude for further calculations. The calculation relies on the number of residues/atoms given by the keywords NRESIDUE/CONTENTS being roughly correct. The program does not, however, apply any temperature factor.

Moments of the intensities

Note: this is the same test as before, but the description is now in terms of the normalised structure factor E or the normalised intensity Z, since this takes into account correctly the symmetry factor epsilon and the normalisation.

We define a reduced intensity I' = I / epsilon, where epsilon is the symmetry factor which increases the mean intensities for certain planes or lines in reciprocal space. We also define Z = I' / <I'>. The general formulae for the expected moments of I' and Z (where k-th moment of x = <x^k>) for untwinned acentric data are:


  k-th moment of I' = Gamma(k+1) * <I'>^k

  k-th moment of Z = Gamma(k+1)
                   = k!          if k is an integer

The normalised structure amplitude is given by:

  E = I'^(1/2) / <I'>^(1/2) = Z^(1/2)

The moments of E can then be related to the half-integral moments of Z:

  k-th moment of E = <E^k>  
                   = <Z^(k/2)> 
                   = Gamma(k/2+1)

For k odd, this can be evaluated as:

   k-th moment of E = Gamma(k/2+1)    with k odd
                    =  k/2 * (k/2 - 1) * ... * 0.5 * Gamma(1/2)
                    =  sqrt(PI) * k/2 * (k/2 - 1) * ... * 0.5

  e.g. <E^3>  = sqrt(PI) * 1.5 * 0.5 = 1.329

For k even, <E^k> is simply (k/2)!
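The Gamma-function expressions above can be checked numerically. The following Python sketch (function names are hypothetical) evaluates <E^k> = Gamma(k/2+1) directly and, for odd k, through the explicit product.

```python
import math


def moment_e_acentric(k):
    """<E^k> = Gamma(k/2 + 1) for untwinned acentric data."""
    return math.gamma(k / 2.0 + 1.0)


def moment_e_odd_product(k):
    """Odd k only: sqrt(PI) * k/2 * (k/2 - 1) * ... * 0.5."""
    if k % 2 == 0:
        raise ValueError("product form applies to odd k only")
    value = math.sqrt(math.pi)
    term = k / 2.0
    while term > 0.0:
        value *= term
        term -= 1.0
    return value


if __name__ == "__main__":
    for k in (1, 3, 4, 6, 8):
        print(k, round(moment_e_acentric(k), 3))
```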

One can also calculate expected moments for acentric data in the case of perfect twinning, and for untwinned and twinned centric data, giving:

                    Acentric                        Centric
              Untwinned data  Perfect twin.   Untwinned data  Perfect twin.
  <E>             0.886         0.94               0.798       0.886 
  <E^3>           1.329         1.175              1.596       1.329
  <E^4>           2.0           1.5                3.0         2.0
  <E^6>           6.0           3.0               15.0         6.0 
  <E^8>          24.0           7.5              105.0        24.0 

Truncate outputs these moments for comparison against these expected values.
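The expected values in the table can be reproduced from closed-form expressions. The sketch below is an illustration only: the function names are hypothetical, and the twinned and centric formulae are derived here from the usual assumptions (a perfect twin averages two independent intensities; centric amplitudes follow a one-dimensional normal distribution) rather than taken from the program.

```python
import math


def acentric_untwinned(k):
    """<E^k> = Gamma(k/2 + 1)."""
    return math.gamma(k / 2.0 + 1.0)


def acentric_perfect_twin(k):
    """Z is the mean of two independent exponential variables."""
    m = k / 2.0
    return math.gamma(m + 2.0) / 2.0 ** m


def centric_untwinned(k):
    """E is the absolute value of a standard normal variable."""
    return 2.0 ** (k / 2.0) * math.gamma((k + 1.0) / 2.0) / math.sqrt(math.pi)


def centric_perfect_twin(k):
    """Mean of two centric Z values follows the acentric distribution."""
    return acentric_untwinned(k)


if __name__ == "__main__":
    for k in (1, 3, 4, 6, 8):
        print(k,
              round(acentric_untwinned(k), 3),
              round(acentric_perfect_twin(k), 3),
              round(centric_untwinned(k), 3),
              round(centric_perfect_twin(k), 3))
```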

KEYWORDED INPUT

The various data control lines are identified by keywords. Only the first 4 letters of each keyword are necessary. Most keywords are optional.

ANOMALOUS, CELL, CONTENTS, FALLOFF, HEADER, HISTORY, LABIN, LABOUT, NRESIDUE, PLOT, RANGES, RESOLUTION, RSCALE, SCALE, SYMMETRY, TITLE, TRUNCATE, VPAT

In addition, the following optional keywords control the data harvesting functionality:

DNAME, NOHARVEST, PNAME, PRIVATE, RSIZE, USECWD

DESCRIPTION OF KEYWORDS

TITLE <title>

(default TITLE='From Truncate') [OPTIONAL INPUT]

Title to write to output reflection file

RANGES <nrange> | <range>

(Default <nrange>=60) [OPTIONAL INPUT]

<nrange> is the number of resolution bins over the resolution range for the Wilson Plot. <range> is the width of the bins on 4(sin theta/lambda)**2 and is an alternative to <nrange>. The resolution range used for the Wilson Plot is taken from the input data file, or set with the RESOLUTION keyword. A subset of these bins, covering a resolution range defined with the RSCALE keyword, is used to estimate the scale and B-factor.
The use of this card is discouraged, as the choice of the number of bins is important. With too few bins, the scale and overall B will be less accurate. With too few reflections per bin, the points may scatter widely about the straight line, giving a large uncertainty in the values for scale and B.
If this card is omitted, the program divides the resolution range into 60 bins, then checks that there are no fewer than 40 reflections in any one bin and, if necessary, reduces the number of bins until this condition is satisfied. In all cases, the program will stop if the width of ranges is greater than 0.03 or the number of bins is greater than 60.
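The default bin selection described above might be sketched as follows. This is an illustration under stated assumptions, not the program's code: the function name is hypothetical, the bins are taken to be equal-width in 4(sin theta/lambda)**2, and the number of bins is assumed to be reduced one at a time.

```python
def choose_bins(s_values, s_min, s_max,
                start_bins=60, min_per_bin=40, max_width=0.03):
    """Return (nbins, counts) for the first acceptable equal-width binning.

    s_values are 4(sin theta/lambda)**2 for each reflection and are
    assumed to lie within [s_min, s_max].
    """
    for nbins in range(start_bins, 0, -1):
        width = (s_max - s_min) / nbins
        if width > max_width:
            break  # bins have become too wide to be useful
        counts = [0] * nbins
        for s in s_values:
            index = min(int((s - s_min) / width), nbins - 1)
            counts[index] += 1
        if min(counts) >= min_per_bin:
            return nbins, counts
    raise ValueError("no acceptable binning found")


if __name__ == "__main__":
    # 1200 reflections spread evenly: 60 bins would hold only 20 each.
    data = [(i + 0.5) * 0.25 / 1200 for i in range(1200)]
    nbins, counts = choose_bins(data, 0.0, 0.25)
    print(nbins, min(counts))
```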

RESOLUTION <Dmin> <Dmax>

[OPTIONAL INPUT]

Resolution limits - either 4(sin theta/lambda)**2 or d in Angstroms (either order). Reflections outside these limits will be excluded from all analysis and omitted on output. Defaults are taken from the range of data in the input file (i.e. all data included).

RSCALE <Dmin> <Dmax>

[OPTIONAL INPUT]

Resolution limits for scaling (either 4(sin theta/lambda)**2 or d). This option allows you to exclude low resolution reflections from the calculation of the scale and B factor. However, all points in the range defined by RESO are plotted on the Wilson plot. It is probably a good idea to use only high resolution data (beyond 3A, if you have any data there) when determining the scale and B factor, because the assumptions behind Wilson statistics are invalid for low resolution data. The default high resolution limit is the same as RESOLUTION. The low resolution limit is, by default, set to 4.0A if the high res. limit is greater than 3.5A.

SCALE <scale>

[OPTIONAL INPUT]

The default is to apply a scale factor from the Wilson plot. If a scale factor is given here, then that is applied instead. This option is useful if relative scaling is already done in SCALA.

If amplitudes rather than intensities are specified on the LABIN line, then the Wilson scale is not applied, and a default scale of 100 is used.

FALLOFF YES | NO [ CONE <cone> ] [ PLTX | PLTY ]

[OPTIONAL INPUT]

The first argument of the FALLOFF keyword should be "YES" or "NO", followed optionally by subkeywords controlling the detailed behaviour. The default is "YES", which triggers an analysis of the anisotropy of the data according to the "falloff" procedure contributed by Yorgo Modis. This calculates the falloff of mean F and mean F/sigma values as a function of (sin theta/lambda)**2 in 3 orthogonal directions. An overall falloff of all reflections is also calculated. The 3 mutually perpendicular directions are:

 

        DIRECTION 2 = B*-AXIS 
        DIRECTION 3 = PERPENDICULAR TO A* AND B*
        DIRECTION 1 = PERPENDICULAR TO B* AND DIRECTION 3.

If either of the subkeywords PLTX or PLTY is specified, then an output plot file (PLOT) is produced, in which Direction 1 is plotted as a thick line, Direction 2 is plotted as a hollow line with boxes at regular intervals of resolution, and Direction 3 is plotted as a thin line. The resolution range and number of resolution bins used in the calculation can be set by the keywords RESOLUTION and RANGES respectively.

Subkeywords:

CONE <cone>
The falloff of mean F-values along each orthogonal direction is calculated using reflections falling within a cone orientated along that direction. <cone> is the angle the surface of the cone makes with the associated direction. Reflections which are located at a greater angle than <cone> from the closest direction will not be included in the falloff calculations.
Default: 30.0 degrees.
PLTX | PLTY
Produce an output plot file (PLOT) and orientate it horizontally or vertically.
Default is horizontally (PLTX).
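The geometry of the three directions and the CONE test can be sketched in Python. This is an illustration only: the function names are hypothetical, reciprocal-space vectors are assumed to be given in Cartesian coordinates, and a reflection and its Friedel mate are assumed to be treated alike (hence the absolute value of the cosine).

```python
import math


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def unit(v):
    length = math.sqrt(dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)


def falloff_directions(a_star, b_star):
    """Directions 1, 2, 3 as defined above, as unit vectors."""
    d2 = unit(b_star)                 # along b*
    d3 = unit(cross(a_star, b_star))  # perpendicular to a* and b*
    d1 = unit(cross(d2, d3))          # perpendicular to b* and direction 3
    return d1, d2, d3


def closest_direction(s, directions, cone_deg=30.0):
    """Return 1, 2 or 3, or None if s lies outside every cone."""
    s_hat = unit(s)
    best, best_angle = None, None
    for number, d in enumerate(directions, start=1):
        cosine = min(1.0, abs(dot(s_hat, d)))
        angle = math.degrees(math.acos(cosine))
        if best_angle is None or angle < best_angle:
            best, best_angle = number, angle
    return best if best_angle <= cone_deg else None


if __name__ == "__main__":
    dirs = falloff_directions((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
    print(closest_direction((1.0, 0.1, 0.0), dirs))
    print(closest_direction((1.0, 1.0, 1.0), dirs))
```

With the default 30 degree cone, a reflection along the body diagonal of an orthogonal cell (about 54.7 degrees from each direction) contributes to none of the three curves.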

LABIN <program label>=<file label>...

Specify input column labels. [OPTIONAL INPUT]

Truncate takes output from SCALA, SCALEPACK2MTZ or DTREK2MTZ which generate standard labels. This is the most common usage of the program, in which case LABIN records are not required. If F is assigned, there will be no reflections output. You must assign either IMEAN/SIGIMEAN or F/SIGF.

The program labels defined are: IMEAN SIGIMEAN I(+) SIGI(+) I(-) SIGI(-) F SIGF FreeR_flag

IMEAN
Original average Structure Intensity
SIGIMEAN
Standard deviation of the above
I(+)
Structure Intensity of hkl
SIGI(+)
Standard deviation of the above
I(-)
Structure Intensity of -h -k -l
SIGI(-)
Standard deviation of the above
F
Original average Structure Amplitude
SIGF
Standard deviation of the above
FreeR_flag
Column of free-R flags

LABOUT <program label>=<file label>...

Specify output column labels. [OPTIONAL INPUT]

The labels allowed are F SIGF DANO SIGDANO F(+) SIGF(+) F(-) SIGF(-) IMEAN SIGIMEAN I(+) SIGI(+) I(-) SIGI(-) FreeR_flag ISYM. The output labels will default to these unless they are changed by assigning a program label to a user label.

F
Structure Amplitude
SIGF
Standard deviation of the above
DANO
Anomalous difference
SIGDANO
Standard deviation of the above
F(+)
Structure Amplitude for hkl
SIGF(+)
Standard deviation of the above
F(-)
Structure Amplitude for -h -k -l
SIGF(-)
Standard deviation of the above
IMEAN
Original average Structure Intensity
SIGIMEAN
Standard deviation of the above
I(+)
Structure Intensity of hkl
SIGI(+)
Standard deviation of the above
I(-)
Structure Intensity of -h -k -l
SIGI(-)
Standard deviation of the above
FreeR_flag
Column of free-R flags
ISYM
Symmetry number for F: normally=0 but 1 or 2 if the F column arises entirely from F+ or F- reflections respectively

If there is no anomalous data present then only the appropriate columns (F, SIGF, IMEAN and SIGIMEAN) are output. Values may be given in any order and as either Proglabel=Userlabel or Userlabel=Proglabel.

CONTENTS <symbol> <n> ...

[ALTERNATIVE COMPULSORY INPUT]

Each element symbol is followed by the number of atoms of that type in the asymmetric unit, including hydrogens.

A maximum of 20 atom (element) types is allowed, each followed by a number, e.g.

  CONTENTS  H 746 C  454  N 115    O 139   S 12  ! Must include hydrogens

The average scattering power is calculated from a table of form factors. By default the file $CLIBD/atomsf.lib contains this table of form factors. You can change the table used by assigning 'ATOMSF' to your preferred file. [NOTE the program RWCONTENTS provides the information for this keyword; how many Carbons etc., from a PDB file. Also, it gives an estimate of the number of hydrogens there would be.]

NRESIDUE <Nres>

[ALTERNATIVE COMPULSORY INPUT]

<Nres> is the number of residues expected in the asymmetric unit

A very approximate atom composition is calculated:

    mean mass of an amino acid = 110
    add on one ordered water per amino acid = ca. 128

This is then taken as 5 C + 1.35 N + 1.5 O + 8 H per residue to give the number of atoms in the asymmetric unit.
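As a worked example of this approximation, the short Python sketch below (the function name is hypothetical) scales the per-residue composition by the number of residues.

```python
def approximate_contents(nres):
    """Approximate atom counts in the asymmetric unit for nres residues."""
    per_residue = {"C": 5.0, "N": 1.35, "O": 1.5, "H": 8.0}
    return {element: count * nres for element, count in per_residue.items()}


if __name__ == "__main__":
    # 100 residues give roughly 500 C, 135 N, 150 O and 800 H.
    print(approximate_contents(100))
```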

VPAT <vpat>

[OPTIONAL INPUT]

volume per atom - default = 10

PLOT [ ON | OFF ]

[OPTIONAL INPUT]

PLOT or PLOT ON produces extra ASCII plots in the log output. The default is PLOT OFF.

HEADER [ NONE | BRIEF | HISTORY | ALL ] [ NOBATCH | BATCH | ORIENTATION ]

[OPTIONAL INPUT]

Controls printout from reading file and batch headers

  1. file header printing:
    NONE
    no header printed
    BRIEF
    brief header (default)
    HISTORY
    brief + history
    ALL
    full header printed
  2. batch header printing:
    NOBATCH
    no batch header printed
    BATCH
    batch titles printed (default)
    ORIENTATION
    batch orientation data also printed

HISTORY string

[OPTIONAL INPUT]

History strings to be added to history records in output file

ANOMALOUS YES | NO

[OPTIONAL INPUT]

Controls whether anomalous differences are output. Defaults to YES if anomalous information is present in the input file, otherwise NO.

TRUNCATE YES | NO

[OPTIONAL INPUT]

If YES (default) the data will be truncated according to the procedure of French and Wilson. If NO the data are not truncated but the structure amplitudes are calculated simply by taking the square root of the intensities. Negative intensities are set to zero.

SYMMETRY <space_group> | <number>

[OPTIONAL INPUT]

Default is to use symmetry in input HKLIN file. (Normally OMIT this command.)

CELL <a> <b> <c> [ <alpha> <beta> <gamma> ]

[OPTIONAL INPUT]

The cell dimensions in Angstroms and degrees. The angles default to 90 degrees. If this keyword is omitted then the cell dimensions are taken from the input file. (Normally OMIT this command.)

Data Harvesting keywords

Provided a Project Name and a Dataset Name are specified (either explicitly or from the MTZ file) and provided the NOHARVEST keyword is not given, the program will automatically produce a data harvesting file. This file will be written to

$HARVESTHOME/DepositFiles/<projectname>/<datasetname>.truncate

The environment variable $HARVESTHOME defaults to the user's home directory, but could be changed, for example, to a group project directory. When running the program through the CCP4 interface, the $HARVESTHOME variable defaults to the 'PROJECT' directory.
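The location of the harvest file can be illustrated with a small Python sketch. The function name and its arguments are hypothetical; the sketch only mirrors the path layout and the $HARVESTHOME default described above, and does not model the CCP4 interface behaviour or the USECWD keyword.

```python
import os


def harvest_file_path(project, dataset, environ=None):
    """Build $HARVESTHOME/DepositFiles/<projectname>/<datasetname>.truncate."""
    env = os.environ if environ is None else environ
    # Fall back to the user's home directory when HARVESTHOME is unset.
    home = env.get("HARVESTHOME") or os.path.expanduser("~")
    return os.path.join(home, "DepositFiles", project, dataset + ".truncate")


if __name__ == "__main__":
    print(harvest_file_path("myproject", "native",
                            {"HARVESTHOME": "/data/group"}))
```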

Dataset information in the input MTZ file header includes a Project Name and a Dataset Name for each dataset. These will be used to define the harvest file, unless overridden by the PNAME and DNAME keywords. Note that the latter only affect the harvest file, and do not change the header information of the output MTZ file. Editing of dataset information can be done with the program CAD.

PNAME <project_name>

Project Name. In most cases, this will be inherited from the MTZ file.
A dataset, as listed in the MTZ header, defines a project-name/dataset-name pair. The project-name specifies a particular structure solution project, while the dataset-name specifies a particular dataset contributing to the structure solution. An entry in the PNAME keyword should therefore be accompanied by a corresponding entry in the DNAME keyword.

DNAME <dataset_name>

Dataset Name. In most cases, this will be inherited from the MTZ file.
A dataset, as listed in the MTZ header, defines a project-name/dataset-name pair. The project-name specifies a particular structure solution project, while the dataset-name specifies a particular dataset contributing to the structure solution. An entry in the DNAME keyword should therefore be accompanied by a corresponding entry in the PNAME keyword.

PRIVATE

Set the directory permissions to '700', i.e. read/write/execute for the user only (default '755').

USECWD

Write the deposit file to the current directory, rather than a subdirectory of $HARVESTHOME. This can be used to send deposit files from speculative runs to the local directory rather than the official project directory, or can be used when the program is being run on a machine without access to the directory $HARVESTHOME.

RSIZE <row_length>

Maximum width of a row in the deposit file (default 80). <row_length> should be between 80 and 132 characters.

NOHARVEST

Do not write out a deposit file; default is to do so provided Project and Dataset names are available.

INPUT AND OUTPUT FILES

The input files are:

  1. Control data file.

  2. HKLIN - Input reflection data file in MTZ format.
    This will contain one record per reflection with the following items:
         SCALA (with or without anomalous)
            H K L IMEAN SIGIMEAN I(+) SIGI(+) I(-) SIGI(-)
    
         SCALEPACK2MTZ or DTREK2MTZ (with anomalous)
            H K L IMEAN SIGIMEAN I(+) SIGI(+) I(-) SIGI(-)
    
      Any set of amplitudes in mtz format for which you wish to analyse statistics.
            H K L F SIGF F(+) SIGF(+) F(-) SIGF(-)
    
    where I+ and I- are used for the anomalous differences

The output files are:

  1. HKLOUT - Output reflection data file. Not generated if F assigned as input column.
    The output file is a reflection data file in standard MTZ format (i.e. one record/reflection) containing 7 or 18 items per reflection as follows (see above for labels used)
       H K L  F SDF [DANOM SD(DANOM) F(+) SDF(+) F(-) SDF(-)] 
              Imean SDImean [I(+) SDI(+) I(-) SDI(-) ISYM]
    
    The Fs are multiplied by the Wilson scale and the Is are multiplied by the square of the Wilson scale / 100.0.
  2. PLOT - plot file showing fall-off of mean F in 3 perpendicular directions.
    Produced if either of the subkeywords PLTX or PLTY of the FALLOFF keyword is specified. This can be viewed with XPLOT84DRIVER, or converted to postscript with PLTDEV.

PRINTER OUTPUT

The printer output starts with details of the control data and details of the input MTZ reflection data file. Analyses of the data against resolution are then given and include intensity distributions for comparison with Wilson's theoretical distributions. Graphs are also output, which can be viewed via XLOGGRAPH or LOGGRAPH.

PROGRAM FUNCTION

The program TRUNCATE reads a reflection data file of averaged intensities (SCALA, SCALEPACK2MTZ or DTREK2MTZ output) and outputs an MTZ reflection data file containing F and DeltaFanom values. The input intensities are assumed to be normally distributed about their true values with the given standard deviations, so negative observations must have been preserved. The truncation procedure used was devised by French and Wilson and is based on Bayesian statistics. The F's are calculated using the prior knowledge of Wilson's distributions for acentric or centric data (calculated in shells of reciprocal space in a first pass through the data) together with the mean intensity and standard deviation values. The F's output are all positive and follow Wilson's distribution. The truncation procedure has little effect on reflections with intensities larger than 3 standard deviations but should give significantly better values for the weak data than those obtained by merely taking the square root of the intensities and setting negative intensities to zero. Reflections with intensities below minus four standard deviations are rejected.
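The idea behind the Bayesian estimate can be illustrated for the acentric case with a minimal numerical sketch. It is not the program's implementation: the function name is hypothetical, the posterior mean of F is obtained by brute-force integration, and the Wilson prior is taken to be the acentric exponential distribution with a given mean intensity for the shell.

```python
import math


def posterior_mean_f(i_obs, sigma, mean_i, steps=20000):
    """Posterior mean of F = sqrt(J) for an acentric reflection.

    Prior:      p(J) proportional to exp(-J / mean_i) for J >= 0
    Likelihood: i_obs normally distributed about J with s.d. sigma
    """
    # Prior times likelihood is a normal in J centred at mu, cut off at J = 0.
    mu = i_obs - sigma * sigma / mean_i
    j_peak = max(mu, 0.0)
    upper = j_peak + 10.0 * sigma
    log_peak = -((j_peak - mu) ** 2) / (2.0 * sigma * sigma)
    h = upper / steps
    numerator = 0.0
    denominator = 0.0
    for n in range(steps):
        j = (n + 0.5) * h  # midpoint rule
        weight = math.exp(-((j - mu) ** 2) / (2.0 * sigma * sigma) - log_peak)
        numerator += math.sqrt(j) * weight
        denominator += weight
    return numerator / denominator


if __name__ == "__main__":
    # A negative observation still yields a small positive amplitude.
    print(posterior_mean_f(-1.0, 1.0, 10.0))
    # A strong observation is left almost unchanged from sqrt(I).
    print(posterior_mean_f(100.0, 1.0, 50.0))
```

The two calls show the behaviour described above: the negative observation is forced positive, while the strong reflection stays close to the plain square root.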

The following warnings should be heeded:

  1. Do not truncate data more than once, e.g. do not merge truncated data with untruncated data and then truncate again.

  2. The standard deviations are crucial to the procedure and the standard deviation analysis from SCALA should be checked.

  3. The procedure should not be used on data which has been forced to be positive e.g. from ordinate analysis measurements on a diffractometer or where negative observations have been set to zero.

WILSON plot

The Wilson plot part of the program attempts to calculate an absolute scale and temperature factor for a set of observed intensities, using the theory of A.J.C. Wilson. This says that IF the atoms are randomly distributed through the asymmetric unit THEN

<f**2> should equal scale * <Fobs**2> * exp(+2B(sin theta/lambda)**2)

By fitting a least squares line through ln(<f**2>/<Fobs**2>) v 2(sin theta/lambda)**2 the program derives the scale and B value.

For real structures the assumption that the atoms are randomly distributed is obviously incorrect. The effect of this is most obvious in the low resolution reflections. The Wilson plot will deviate from a straight line from about 3.0A - 4.0A downwards. Although all the points on the Wilson plot are plotted, the scale and B are only determined from a limited resolution range determined by the user (see keyword RSCALE).
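The straight-line fit can be sketched in Python as below. This is an illustration under stated assumptions: the function name is hypothetical, the inputs are per-bin averages, and the usual convention is assumed in which the observed mean intensity falls off as exp(-2B(sin theta/lambda)**2), so that the slope of ln(<f**2>/<Fobs**2>) against 2(sin theta/lambda)**2 is B and the intercept is ln(scale).

```python
import math


def fit_wilson(s2_values, mean_fobs2, mean_f2):
    """Least-squares line through ln(<f**2>/<Fobs**2>) v 2*s2.

    s2_values are (sin theta/lambda)**2 at the centre of each bin.
    Returns (scale, B).
    """
    xs = [2.0 * s2 for s2 in s2_values]
    ys = [math.log(f2 / fo2) for f2, fo2 in zip(mean_f2, mean_fobs2)]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), slope


if __name__ == "__main__":
    # Synthetic bins generated with scale = 4 and B = 20.
    s2 = [0.03, 0.04, 0.05, 0.06, 0.07, 0.08]
    f2 = [1000.0] * len(s2)
    fobs2 = [f / 4.0 * math.exp(-2.0 * 20.0 * s) for f, s in zip(f2, s2)]
    print(fit_wilson(s2, fobs2, f2))
```

Restricting s2_values to the high resolution bins before fitting corresponds to the use of RSCALE.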

There may be a problem in evaluating <Fobs**2> if all the weak data have been systematically omitted. This should NOT be the case for data measured in any proper manner; if it IS the case, the Truncate procedure will also fail and you need to use TRUNCATE NO. The program estimates the expected number of reflections in each resolution shell and then calculates <Fobs**2> by dividing by the number of predicted reflections.

Errors And Warnings

The program may stop with the following error message:

TRUNCATE:  *** Data beyond useful resolution limit ***

This indicates that there is a resolution bin where the <I> is <= 0.0 (which indeed is tantamount to having indexed the data beyond a useful resolution limit).

In this situation it is worth running TRUNCATE again setting the high resolution limit to something below the current maximum (using the RESO keyword). However it is also worth considering whether this is an indication of a more serious problem (for example, is the wrong cell set?).

SEE ALSO

scala, scalepack2mtz, dtrek2mtz, rwcontents, Data Harvesting

REFERENCES

  1. French G.S. and Wilson K.S. (1978) Acta Cryst. A34, 517.

AUTHORS

K.S. Wilson and S. French

"falloff" program contributed by Yorgo MODIS, European Molecular Biology Lab (original program: W.G.J. HOL/SINEKE BREEN, part of the Groningen BIOMOL package). Incorporated into TRUNCATE by Martyn Winn.

EXAMPLES

Unix example scripts can be found in $CEXAM/unix/runnable/

and non-runnable examples in $CEXAM/unix/non-runnable/