where cov(.,.) and var(.) are the sample covariance and variance (i.e. calculated with respect to the sample means of ρobs and ρcalc).
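As a minimal sketch of the sample correlation coefficient described above, assuming it is formed in the usual way as cov(ρobs,ρcalc)/sqrt(var(ρobs) var(ρcalc)) over the grid points around an atom or residue (the function name and the plain-list inputs are illustrative only and are not part of EDSTATS):

```python
import math

def sample_rscc(rho_obs, rho_calc):
    """Sample correlation of two equal-length lists of density values."""
    n = len(rho_obs)
    mean_obs = sum(rho_obs) / n
    mean_calc = sum(rho_calc) / n
    # Deviations are taken from the sample means, as stated in the text.
    cov = sum((o - mean_obs) * (c - mean_calc)
              for o, c in zip(rho_obs, rho_calc))
    var_obs = sum((o - mean_obs) ** 2 for o in rho_obs)
    var_calc = sum((c - mean_calc) ** 2 for c in rho_calc)
    # The 1/n (or 1/(n-1)) normalisation cancels in the ratio.
    return cov / math.sqrt(var_obs * var_calc)
```

Two density samples related by any positive scale factor and offset give a value of 1, which is one reason why this metric on its own says nothing about the absolute scale of the maps.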
where σ(Δρ) is the standard uncertainty of the difference Fourier map.  Note that this is the standard uncertainty of the 'Fo-Fc' map, NOT the RMS value of the '2Fo-Fc' map, which bears no relationship whatsoever to the uncertainty!
1. Order the values in each set by increasing absolute value (i.e. ignoring the sign).
2. For each of N subsets of size 1, 2, ..., N-1, N of the numerically highest values of the original set of size N, compute the cumulative probability of chi-square (χ² = Σ (Δρ/σ(Δρ))²) for the subset.  So the subset of size 1 is simply the numerically highest value ('maximum order statistic') in the original set, the subset of size 2 consists of the 2 highest values of the set, the subset of size N-1 excludes the lowest value, and the subset of size N is just the set itself.
3. In practice this χ² cumulative probability is very difficult to compute (even by stochastic numerical integration) for subsets other than those of size 1 and N (it involves integrals up to dimension N where N may be anything from 10 to 1000).  Note that the standard χ² cumulative probability assumes that the sample is selected randomly, whereas here we are selecting the highest values.  Therefore we approximate it as the product of two components: the standard cumulative probability of χ² for a randomly selected subset, and a correction, the Dunn-Šidák correction (Sokal & Rohlf, 1995; Gibbons & Chakraborti, 2003), in this case the cumulative probability of the order statistic, for the fact that we are selecting the highest values.
4. Take the highest cumulative probability over all subsets, and convert this to the corresponding normal Z-score, making the Z-score negative for the set of negative values; this is the final RSZD- or RSZD+ score.  The program also computes a combined RSZD score which is simply the maximum of |RSZD-| and RSZD+.
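The simplest case in the procedure above, the subset of size 1, can be sketched as follows. This is only an illustration, not the EDSTATS code: it assumes the N normalised values Δρ/σ(Δρ) are independent standard normal deviates, so that the cumulative probability of the maximum order statistic is the single-value probability raised to the power N (the Dunn-Šidák form), and it assumes a two-tailed convention when converting the probability back to a Z-score. The function names are hypothetical. The last function is the combined score exactly as described in step 4.

```python
import math
from statistics import NormalDist

def max_order_statistic_probability(z_values):
    """P(largest |z| among N independent normal deviates <= observed)."""
    n = len(z_values)
    z_max = max(abs(z) for z in z_values)
    # P(|Z| <= z_max) for one value, i.e. chi-square CDF with 1 d.o.f.
    p_single = math.erf(z_max / math.sqrt(2.0))
    return p_single ** n

def probability_to_z(p):
    """Convert a cumulative probability to a normal Z-score.

    Assumption: two-tailed convention, so p = 0.9545 maps to Z = 2.
    """
    q = 0.5 * (1.0 + p)
    q = min(q, 1.0 - 2.0 ** -53)  # keep strictly below 1 for inv_cdf
    return NormalDist().inv_cdf(q)

def combined_rszd(rszd_minus, rszd_plus):
    """Combined score: maximum of |RSZD-| and RSZD+."""
    return max(abs(rszd_minus), rszd_plus)
```

With a single value the Z-score is returned unchanged; with many values in the set the same maximum gives a much smaller score, because a large deviate is then more likely to occur by chance.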
Note that model accuracy is related to the likelihood of the model (i.e. the consistency of the model with the data), and is what is improved by model building and refinement.  The difference Fourier density is obviously a measure of any discrepancy between the model and the data, so is a direct measure of model accuracy.
Model precision is a property of the crystal and the data (assuming the refinement is done optimally), and is related to data quality and completeness, resolution, atom type (or atomic scattering factor), occupancy and atomic Biso; hence model precision can only be improved by crystallizing in a different crystal form and/or collecting better (e.g. more precise and/or higher resolution) data.  The ρobs density, divided by its standard uncertainty (note: this is not the same as RMS(ρobs)), is a measure of model precision which incorporates all the above factors correlated with precision (e.g. the atomic Biso is also a precision metric but it doesn't take account of the variation of precision with atom type and occupancy).
                B/Å²:    10    20    30    40    50    60    70    80    90
dmin/Å

rmax/Å (SFALL: all atoms)
  All                  2.35  2.67  2.95  3.21  3.45  3.67  3.88  4.08  4.27

rmax/Å (EDSTATS: O atom)
  3.5                  1.72  1.78  1.83  1.89  1.95  2.02  2.08  2.15  2.22
  3.0                  1.51  1.58  1.65  1.72  1.80  1.88  1.97  2.06  2.14
  2.5                  1.31  1.39  1.49  1.59  1.70  1.80  1.91  2.02  2.12
  2.0                  1.12  1.24  1.38  1.52  1.66  1.79  1.91  2.02  2.13
  1.5                  0.96  1.16  1.35  1.52  1.66  1.79  1.91  2.02  2.13
  <=1.0                0.91  1.16  1.35  1.52  1.66  1.79  1.91  2.02  2.13

Note that the limiting high-resolution values of rmax are attained at ~ dmin = 1.5Å.
                            smax
    ρ(r) = FT(f(s)) = (8/r)  ∫  f(s) exp(-Bs²) sin(4πrs) s ds
                            smin

for specified limits smin and smax of sin(θ)/λ.
Then the ratio of the radius integral of ρ(r) integrated out to the outer limit rmax relative to the radius integral integrated to infinite distance is:
                            rmax            ∞
    Radius integral ratio =  ∫  ρ(r) dr  /  ∫  ρ(r) dr
                             0              0

and this equation is solved to obtain rmax for a radius integral ratio = 0.95 (i.e. 95% of the integral lies within distance rmax of the atom centre).  The integrals with respect to r can be obtained analytically; the integrals with respect to s in general have no analytical solution and must be computed numerically (using e.g. the QUADPACK library).  Note that ideally the volume integral of ρ(r):
                      rmax               rmax
    Volume integral =  ∫  ρ(r) dV  =  4π  ∫  ρ(r) r² dr
                       0                  0

should be used, but unfortunately this integral does not converge.
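To illustrate how rmax is obtained from the radius integral ratio, here is a sketch for one special case only: a point atom (f(s) = 1, an assumption made purely for illustration) with no resolution limits (smin = 0, smax = ∞). In that limit the s integral is analytic, ρ(r) is proportional to exp(-4π²r²/B), and the ratio reduces to erf(2πr/√B). The real calculation uses the actual scattering factor and finite limits with numerical quadrature, so this sketch is not expected to reproduce the tabulated values; because the fall-off of f(s) broadens the density, it should give somewhat smaller radii. The function names are hypothetical.

```python
import math

def radius_integral_ratio(r, b_iso):
    """Radius integral ratio within r (Å) for a Gaussian point atom."""
    return math.erf(2.0 * math.pi * r / math.sqrt(b_iso))

def solve_rmax(b_iso, target=0.95, lo=0.0, hi=20.0, tol=1e-10):
    """Solve ratio(rmax) = target by bisection.

    In this limit the ratio increases monotonically with r, so
    bisection on [lo, hi] is safe; hi = 20 Å is an arbitrary bound.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if radius_integral_ratio(mid, b_iso) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, solve_rmax(90.0) returns the radius enclosing 95% of the radius integral for B = 90 Å² in this simplified limit.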
However, for the latter, because we cannot rely on the correct Fourier coefficient for ρcalc being present in the file of map coefficients, it is necessary to obtain it as the difference between the ρobs and Δρ coefficients.  Since we have:
    Δρ = ρobs - ρcalc
or:
    ρcalc = ρobs - Δρ
therefore for acentrics:
    ρcalc = F(2mFo-DFc) - F(2(mFo-DFc)) = F(DFc)
whereas for centrics:
    ρcalc = F(mFo) - F(mFo-DFc) = F(DFc)
Hence the correct Fourier coefficient for ρcalc is DFc for all reflections.  Note that it is frequently stated that the coefficient for acentrics is mFo-DFc but if this were used it would give completely the wrong result for the ρcalc coefficient (it would give mFo!).
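The coefficient arithmetic above can be checked numerically with a few lines. This sketch treats the coefficients as signed amplitudes on a common phase; the function name and the numerical values of m, Fo, D and Fc are made up for illustration.

```python
def rho_calc_coefficient(m, fo, d, fc, centric):
    """Coefficient for rho_calc as (rho_obs coeff.) - (delta-rho coeff.)."""
    if centric:
        obs_coeff = m * fo                  # mFo
        diff_coeff = m * fo - d * fc        # mFo-DFc
    else:
        obs_coeff = 2.0 * m * fo - d * fc   # 2mFo-DFc
        diff_coeff = 2.0 * (m * fo - d * fc)  # 2(mFo-DFc)
    return obs_coeff - diff_coeff

# Illustrative values only.
m, fo, d, fc = 0.5, 100.0, 0.75, 80.0
acentric = rho_calc_coefficient(m, fo, d, fc, centric=False)
centric = rho_calc_coefficient(m, fo, d, fc, centric=True)
```

Both results equal D*Fc, whereas subtracting mFo-DFc from 2mFo-DFc for an acentric reflection would leave m*Fo.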
Available options:
RESI averages all map values for the main-chain atoms in each residue.
ATOM averages the map values for each atom, but reports the extreme values of these as the residue metrics.
This option has no effect on the real-space Z-scores, which are as defined in the DESCRIPTION section above.
Scaling type ALL rescales using a single scale factor and offset based on all map points in the asymmetric unit.
BULK rescales using a single scale factor and offset based only on points in the bulk solvent.
CHAIN independently rescales each chain and the bulk solvent with a separate scale factor and offset for each group (ordered waters are treated as belonging to a single separate chain '0' regardless of their chain IDs in the PDB file).  This is now the recommended procedure.
NONE does no Q-Q plot rescaling; the value of σ(Δρ) read from the map header is used, with zero for the offset.
LS-bit  Value  Output
  0       1    General debugging.
  1       2    P-P & Q-Q difference plots for chains.
  2       4    Memory allocation debugging.
  3       8    ZSCORE s/r debugging for RSZD values.
  4      16    RSZD outliers.
  5      32    Cumulative frequencies for RSZDs > 3 σ.
  6      64    Normality tests.
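Since the outputs above are keyed by bit position, it seems reasonable to assume that several can be requested at once by summing their values, as with any bit mask. A small sketch that decodes such a sum (the function name is hypothetical and this is not part of EDSTATS):

```python
# Bit position -> output, copied from the table of debug outputs.
DEBUG_OUTPUTS = {
    0: "General debugging",
    1: "P-P & Q-Q difference plots for chains",
    2: "Memory allocation debugging",
    3: "ZSCORE s/r debugging for RSZD values",
    4: "RSZD outliers",
    5: "Cumulative frequencies for RSZDs > 3 sigma",
    6: "Normality tests",
}

def decode_debug_value(value):
    """List the outputs selected by a summed debug value."""
    return [name for bit, name in sorted(DEBUG_OUTPUTS.items())
            if value & (1 << bit)]
```

For example, a value of 18 (= 2 + 16) selects the P-P & Q-Q difference plots and the RSZD outliers.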
Both maps should be calculated with a grid spacing between 1/4 and 1/6 of the high resolution cut-off (usually 1/4 is sufficient), and the PDB file and the maps should all be from the same refinement job.
NOTE: it is essential that the MTZ file from the refinement job is run through the MTZFIX program before map calculation with FFT to ensure that the map coefficients are correct and consistent between programs (unfortunately different refinement programs have different conventions for the map coefficients!).
> gnuplot
Terminal type set to 'x11'
gnuplot> plot 'edstats.his' w l, exp(-.5*x**2)/sqrt(2*pi)
> gnuplot
Terminal type set to 'x11'
gnuplot> plot 'edstats.qqd' w l, 0 lt 0
Columns 13-21 contain the same information as columns 4-12 above (i.e. add 9), but for the side-chain atoms (excluding Cβ) if present.
To plot the RSZD- and RSZD+ metrics (in columns 11 & 12) by residue for the main-chain atoms with the suggested threshold lines at ±3σ, using gnuplot:
> gnuplot
Terminal type set to 'x11'
gnuplot> set style data impulses
gnuplot> plot 'edstats.out' u 11, '' u 12, -3 lt 0, 3 lt 0
Similarly use columns 20 & 21 to plot the side-chain values.  See separate section below on interpreting these plots.
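If a list of the offending residues is more convenient than a plot, something along the following lines could be used. This is a sketch under stated assumptions: the output file is taken to be whitespace-separated with one residue per line, any line whose RSZD columns are not numeric (e.g. a header) is skipped, and the first three fields are simply carried along as an opaque residue label because their exact meaning is not specified here. The function name is hypothetical.

```python
def flag_rszd_outliers(lines, neg_col=11, pos_col=12, threshold=3.0):
    """Return (label, RSZD-, RSZD+) for residues beyond +/- threshold.

    neg_col and pos_col are 1-based column numbers: 11 & 12 for
    main-chain atoms, 20 & 21 for side-chain atoms.
    """
    outliers = []
    for line in lines:
        fields = line.split()
        try:
            rszd_neg = float(fields[neg_col - 1])
            rszd_pos = float(fields[pos_col - 1])
        except (IndexError, ValueError):
            continue  # header, blank or short line
        if rszd_neg < -threshold or rszd_pos > threshold:
            outliers.append((" ".join(fields[:3]), rszd_neg, rszd_pos))
    return outliers
```

Typical use would be flag_rszd_outliers(open('edstats.out')), with neg_col=20 and pos_col=21 for the side-chain values.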
The supplied Perl script percent-rank.pl extracts a small subset of the overall metrics from the standard output, compares the results with a pre-calculated set in the supplied data file pdb-edstats.out, and for each metric prints out the per-cent rank (i.e. the percentage of structures in the pre-calculated set which have a worse score, so 0% is 'worst' and 100% is 'best').  This is intended to give a quick overview of the state of the difference Fourier and is not meant as a substitute for interpreting the per-residue metrics (see next section).  Generally you would probably want your structure to score above average on all measures, so at least above the median 50% rank.  But obviously not every structure can be above average!
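The per-cent rank as defined above is simple to compute. The following is a sketch of the definition only, not a transcription of percent-rank.pl: the function name is hypothetical, the caller must say whether higher or lower values of a given metric count as better, and ties are counted as not worse (an assumption).

```python
def percent_rank(score, reference_scores, higher_is_better=True):
    """Percentage of reference structures with a worse score."""
    if higher_is_better:
        n_worse = sum(1 for ref in reference_scores if ref < score)
    else:
        n_worse = sum(1 for ref in reference_scores if ref > score)
    return 100.0 * n_worse / len(reference_scores)
```

A score worse than everything in the reference set gives 0%, and one better than everything gives 100%.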
The data file pdb-edstats.out, or a link to it, must be present in the current directory; alternatively set the environment variable PDB_EDSTATS to point to it.  The data were obtained by running edstats on ~ 600 supposedly 'good' structures (anonymous!) from PDB_REDO with Rfree < 0.175 and > 100 residues (protein only).  This is not ideal, since it would clearly be much better to bin the known structures by high resolution cut-off and compare your structure only with known structures at roughly the same resolution; however this will require a much larger database than I have the resources to set up in the short term.  Hopefully this feature will be developed and improved in a future release.
The columns in the data file pdb-edstats.out contain:
The percent-rank.pl script prints out the per-cent ranks for metrics 3-6 above.
Examples of usage:
percent-rank.pl edstats.log
or
percent-rank.pl *.log
Note that the overall statistics for the RSZO metrics which appear in the standard output are not listed by the percent-rank.pl script; this is deliberate: the RSZO metric is a measure of precision and is really only meaningful when analysed at the residue level.  For example it may be that only say 50% of the residues score above the threshold of the precision metric, but if these 50% tell you all that you wanted to know about the biological function, then clearly the experiment can be counted as a success (assuming of course that all residues have acceptable scores for the accuracy metrics).  So it all depends on which residues have high values of the precision metric.  On the other hand, if only 50% of residues scored above the threshold for the accuracy metric then this would be regarded as a poor result, no matter which residues they were.
The RSZD scores are accuracy metrics, i.e. at least in theory they can be improved by adjusting the model (by eliminating the obvious difference density), so start by checking the worst offenders first.  Use the Fourier and difference maps in your favourite graphics model-building program to guide any adjustments of the model that may be required, in the usual way.  Note that positive density deviations are usually more frequent than negative ones, because they represent uninterpretable, as opposed to incorrectly interpreted density, and are therefore less symptomatic of underlying problems.
The RSZO scores are precision metrics and will be strongly correlated with the Bisos (since the Biso is also a precision metric); so, assuming you've fixed any accuracy issues with that residue, there's nothing you can do about the precision, short of re-collecting the data.
The RSR and RSCC (both 'sample' and 'population') metrics are
tabulated for comparison but are correlated with both accuracy and
precision, so they can be useful in some circumstances, but they don't
always help with telling you whether adjustment of the model is
required, or whether the problem is actually an intrinsic property of
the structure, or lies with the data.  Note that the RSR and RSCC
metrics vary with the program used, since they depend strongly on the
radius cut-off, scaling algorithm and other variables which can vary a
lot between programs.
REFERENCES
C-I. Brändén & T.A. Jones Nature (1990). 343,
687-689.
J.D. Gibbons & S. Chakraborti (2003). Nonparametric statistical
inference, 4th ed., New York: Marcel Dekker, Inc.
T.A. Jones, J-Y. Zou, S.W. Cowan & M. Kjeldgaard Acta Cryst.
(1991). A47, 110-119.
P. Main Acta Cryst. (1979). A35, 779-785.
R.J. Read Acta Cryst. (1986). A42, 140-149.
R.R. Sokal & F.J. Rohlf (1995). Biometry, 3rd ed., New York: WH
Freeman.
I.J. Tickle, R.A. Laskowski, & D.S. Moss Acta Cryst. (1998).
D54, 243-252.
I.J. Tickle CCP4 Study Weekend (2011). Manuscript of
presentation submitted - to be published in Acta Cryst. D.
AUTHOR
EXAMPLES
Example 1
This example illustrates how the maps must be prepared.  Failure to
follow this recipe is likely to give inaccurate results!
#!/bin/tcsh
# Fix up the map coefficients: FLABEL specifies the label for Fobs &
# σ(Fobs) (defaults are F/SIGF or FOSC/SIGFOSC). Here, 'in.mtz'
# is the output reflection file from the refinement program in MTZ
# format.
rm -f fixed.mtz
mtzfix FLABEL FP HKLIN in.mtz HKLOUT fixed.mtz >mtzfix.log
if($?) exit $?
# Good idea to check the mtzfix output before proceeding!
less mtzfix.log
# If no fix-up was needed, use the original file.
if(! -e fixed.mtz) ln -s in.mtz fixed.mtz
# Compute the 2mFo-DFc map; you need to specify the correct labels for
# the F and phi columns: 'FWT' & 'PHWT' should work for Refmac.
# Note that EDSTATS needs only 1 asymmetric unit (but will also work
# with more). Grid sampling must be at least 4.
echo 'labi F1=FWT PHI=PHWT\nxyzl asu\ngrid samp 4.5' | fft \
HKLIN fixed.mtz MAPOUT fo.map
if($?) exit $?
# Compute the 2(mFo-DFc) map; again you need to specify the right
# labels.
echo 'labi F1=DELFWT PHI=PHDELWT\nxyzl asu\ngrid samp 4.5' | fft \
HKLIN fixed.mtz MAPOUT df.map
if($?) exit $?
Example 2
#!/bin/tcsh
# Q-Q difference plot & main- & side-chain residue statistics.
echo resl=50,resh=2.1 | edstats XYZIN in.pdb MAPIN1 fo.map \
MAPIN2 df.map QQDOUT q-q.out OUT stats.out
if($?) exit $?
Example 3
#!/bin/tcsh
# Main- & side-chain atom statistics, using chains A & I only & writing
# PDB file with per-atom Zdiff metrics.
echo mole=AI,resl=50,resh=2.1,main=atom,side=atom | edstats \
XYZIN in.pdb MAPIN1 fo.map MAPIN2 df.map XYZOUT out.pdb \
OUT stats.out
if($?) exit $?