The MTZ file format is used for the storage of reflection data. The file contains the data and a header of metadata. The former is held as a table with rows representing reflections and columns representing different quantities for each reflection. The latter aims to make the file self-contained by including all necessary information, such as symmetry operations, cell dimensions, etc. The MTZ file is a flat-file representation of a particular data model. We first describe the data model, and then the particular implementation used.
File -> Crystal -> Dataset -> Datalist -> Column

A `Crystal' is essentially a single crystal form: usually there will be one crystal per derivative, unless a single derivative can crystallise in several cells (e.g. RT and frozen). A `Dataset' is a set of observations on a particular crystal. If data are collected at several wavelengths, each of these becomes a separate dataset. A `Datalist' is a grouping of associated columns. Thus a single list will hold both F and SigF, and another list will hold all four Hendrickson-Lattman coefficients. Each data list is linked to one of the datasets and each dataset is linked to one of the crystals. There may be several data lists per dataset and several datasets per crystal.
The Datalist level is not yet implemented in the MTZ format, but the remainder of the above hierarchy is recorded in the MTZ file header. The header lists the columns of data held in the file, and identifies which dataset they belong to, and in turn which crystal that dataset belongs to. The crystals, datasets and columns are each identified by a label. The labels for the datasets and columns need not be unique, provided the full identification "crystal name/dataset name/column label" is unique.
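The hierarchy and the uniqueness rule can be illustrated with a minimal Python sketch. The class names and the helpers `full_ids` and `ids_are_unique` are hypothetical and are not part of any CCP4 library; the Datalist level is left out because the format does not implement it, and the crystal name "mercury" is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    label: str
    type: str  # one-character column type, e.g. "F" or "Q"


@dataclass
class Dataset:
    name: str
    columns: List[Column] = field(default_factory=list)


@dataclass
class Crystal:
    name: str
    datasets: List[Dataset] = field(default_factory=list)


def full_ids(crystals):
    """Return 'crystal name/dataset name/column label' for every column."""
    return [
        f"{c.name}/{d.name}/{col.label}"
        for c in crystals
        for d in c.datasets
        for col in d.columns
    ]


def ids_are_unique(crystals):
    """True if no full identifier occurs more than once."""
    ids = full_ids(crystals)
    return len(ids) == len(set(ids))


# The dataset name "native" and the label "FP" are reused, which is allowed
# because the full identifiers still differ.
native = Crystal("wildtype", [Dataset("native", [Column("FP", "F"), Column("SIGFP", "Q")])])
deriv = Crystal("mercury", [Dataset("native", [Column("FP", "F")])])
print(ids_are_unique([native, deriv]))  # True
```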
Each crystal is further identified as belonging to a project, labelled by a "project name". The project name is currently used in Data Harvesting where it corresponds to a particular structure determination (and is equivalent to the mmCIF data item _entry.id). In the current implementation of MTZ files, the project is simply an attribute of a crystal and is not an integral part of the data structure.
The total number of datasets represented in a file is given by the keyword NDIF in the main file header (see below), and a list of the project, crystal and dataset names associated with each dataset is given by the PROJECT, CRYSTAL and DATASET keywords also in the main file header. Each dataset is identified internally by an integer "dataset ID". For a merged single-record-per-reflection MTZ file, each column has as one of its attributes (included in the COL keyword) a "dataset ID", which acts as a pointer to the main list of datasets. For unmerged multi-record MTZ files, a column may be associated with several datasets (corresponding to different batches) and the "dataset ID" is not used. Instead, each batch header contains a "dataset ID", which points to the dataset associated with that batch.
The main file header also contains properties of each dataset. Each crystal can have its own cell dimensions identified by the keyword DCELL, e.g. native and derivative crystals may well have significantly different cells. All datasets belonging to a particular crystal should have the same cell dimensions. The information held in DCELL records is distinct from the general cell held in the CELL record; the use of DCELL is now preferred. A wavelength can also be attributed to each dataset via the keyword DWAVEL. Other dataset information may be added in the future. The records DCELL and DWAVEL are optional; the header reading routines assume that if they are present, then they will occur immediately after the relevant PROJECT, CRYSTAL and DATASET keywords.
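As a rough sketch of how these per-dataset records fit together, the following Python collects them into one table keyed by dataset ID. It assumes each record is available as a text line of whitespace-separated fields (keyword, dataset ID, then values) and that names contain no spaces; the function name `parse_dataset_records` is hypothetical, and the example values are taken from the MTZDUMP listing in this document.

```python
DATASET_KEYWORDS = ("PROJECT", "CRYSTAL", "DATASET", "DCELL", "DWAVEL")


def parse_dataset_records(lines):
    """Collect dataset-related header records, keyed by integer dataset ID."""
    datasets = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # too short to hold keyword, ID and a value
        keyword, rest = fields[0].upper(), fields[1:]
        if keyword not in DATASET_KEYWORDS:
            continue  # some other header record
        entry = datasets.setdefault(int(rest[0]), {})
        if keyword == "DCELL":
            entry["cell"] = tuple(float(x) for x in rest[1:7])
        elif keyword == "DWAVEL":
            entry["wavelength"] = float(rest[1])
        else:
            entry[keyword.lower()] = rest[1]
    return datasets


# Assumed textual form of the records (layout is an assumption).
header = [
    "NDIF 1",
    "PROJECT 1 HEWL",
    "CRYSTAL 1 wildtype",
    "DATASET 1 native",
    "DCELL 1 79.0026 79.0026 36.8933 90.0000 90.0000 90.0000",
    "DWAVEL 1 1.54180",
]
print(parse_dataset_records(header)[1]["crystal"])  # wildtype
```

Because DCELL and DWAVEL are optional, a dataset entry may lack the "cell" or "wavelength" key.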
The dataset information can be viewed via the program MTZDUMP:
 * Base dataset:

        0 HKL_base
          HKL_base
          HKL_base

 * Number of Datasets = 1

 * Dataset ID, project/crystal/dataset names, cell dimensions, wavelength:

        1 HEWL
          wildtype
          native
             79.0026   79.0026   36.8933   90.0000   90.0000   90.0000
             1.54180
The MTZ reflection file format uses fixed-length logical 'records' written in a byte stream with, in general, four bytes for each data item (REAL*4). Each record holds a minimum of 3 and currently a maximum of 200 columns of data, although these limits could easily be increased. Additional information (title, cell dimensions, column labels, symmetry information, resolution range, history information and, if necessary, batch titles and orientation data) is contained in labelled header records. The columns of the reflection data records are identified by alphanumeric labels held as part of the file header information. The user relates the item names used by the program to the required data items, as identified by the labels, by means of assignment statements in the program control data.
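To make the fixed-length record layout concrete, here is a minimal Python sketch that splits a byte buffer into reflection records of four-byte reals. The function name `unpack_records` is hypothetical, the little-endian default is an assumption (byte order is not specified in this section), and locating the reflection data within a real file is not covered.

```python
import struct


def unpack_records(buffer, ncol, byte_order="<"):
    """Split a byte buffer into records of `ncol` 4-byte reals each."""
    if not 3 <= ncol <= 200:
        raise ValueError("ncol is outside the 3..200 range quoted in the text")
    record_size = 4 * ncol  # four bytes per data item
    if len(buffer) % record_size:
        raise ValueError("buffer is not a whole number of records")
    fmt = f"{byte_order}{ncol}f"
    return list(struct.iter_unpack(fmt, buffer))


# Two made-up records with columns H, K, L, F, SIGF.
raw = struct.pack("<10f", 1.0, 2.0, 3.0, 120.5, 4.25,
                  0.0, 0.0, 4.0, 88.0, 2.5)
records = unpack_records(raw, ncol=5)
print(records[0])  # (1.0, 2.0, 3.0, 120.5, 4.25)
```

Note that every item, including the Miller indices, comes back as a floating-point value in this sketch.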
Record Formats
The file contains two classes of records: header records and reflection data records. A standard reflection data file contains the following items, in the order given; not all items need be present:
- VERS
- Version stamp (Character*10, currently MTZ:V1.1)
- TITLE
- File Title - short identification of file (Character*70)
- NCOL
- Number of columns, number of reflections in file, number of batches (Integer). If the number of batches is greater than 0, the file is a multi-record file.
- CELL
- Global Cell Parameters (Real(6)). The use of these is deprecated in favour of the dataset cell parameters, see DCELL below.
- SORT
- Sort order of 1st 5 columns in file (Integer(5))
- SYMINF
- Number of Symmetry operations (Integer)
Number of Primitive operations (Integer)
Lattice Type (Character*1)
Space Group Number (Integer)
Space Group Name (Character*10)
Point Group Name (Character*6)
- SYMM
- Symmetry operations in international tables style
- RESO
- Minimum (smallest number) and maximum (largest number) resolution, stored as 1/d-squared (Real(2))
- VALM
- Value with which Missing Number Flag is represented.
- COL
- Column Label (Character*30)
Column Type (Character*1) for each column
Minimum and Maximum value in each column (Real)
ID of corresponding dataset (Integer)
- NDIF
- Number of datasets represented in the file.
- PROJECT
- ID of dataset (Integer)
Project Name (Character*64). Normally one for each structure determination.
- CRYSTAL
- ID of dataset (Integer)
Crystal Name (Character*64). May be several for each structure determination, representing the different crystals used.
- DATASET
- ID of dataset (Integer)
Dataset Name (Character*64). May be several for each structure determination, representing the different datasets measured.
- DCELL
- ID of dataset (Integer)
Cell dimensions (Real(6)). These are identical for all datasets belonging to the same crystal.
- DWAVEL
- ID of dataset (Integer)
Wavelength (Real) for dataset.
- BATCH
- Batch Serial Number for each batch present (Integer). This line is only present in `multi-record' files.
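
Since the RESO record stores resolution as 1/d-squared, a reader usually wants to convert the two values back to d-spacings. The following Python sketch shows the arithmetic, d = 1/sqrt(1/d^2). The function name `resolution_limits` is hypothetical, and the result is in the same length unit as the cell dimensions.

```python
import math


def resolution_limits(inv_d2_min, inv_d2_max):
    """Convert RESO values (1/d^2) to (low, high) resolution limits as d-spacings.

    The smallest 1/d^2 gives the largest d (low-resolution limit) and the
    largest 1/d^2 gives the smallest d (high-resolution limit).
    """
    low = math.inf if inv_d2_min == 0 else 1.0 / math.sqrt(inv_d2_min)
    high = 1.0 / math.sqrt(inv_d2_max)
    return low, high


# Example values chosen so that the arithmetic is exact in binary floating point.
print(resolution_limits(0.00390625, 0.25))  # (16.0, 2.0)
```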
NB: Column Types are an extra check that the user input assignment for a requested program label is of the correct type. For a list of all column types see section COLUMN TYPES.
Normally the Miller indices will be held in the first three columns though, within the definition of the format, there is no restriction on the use of the columns of the reflection data records. However, the subroutines which output the MTZ header information in a formatted way (e.g. Subroutine LHPRT) presume that the first 3 columns of a standard MTZ file are the Miller Indices, and the first 5 columns of a multi-record MTZ file are H,K,L,M/ISYM and Batch number.
Columns of reflection data in an MTZ file are identified through column labels. Through the LABIN/LABOUT mechanism, it is possible to connect a column of data expected by a particular program with a column of data in a file.
Column labels must be no more than 30 characters long. This limit is hardwired in the software library used to read and write MTZ files. It is also imposed by the MTZ format itself, where several details of a particular column must be fitted onto a header record of 80 characters.
Column labels should be alphanumeric. Avoid special characters (the "/" in the standard label "M/ISYM" is a special case - slashes in other places will break the code).
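The label rules above can be expressed as a small check. This Python sketch is only an illustration: the function name `label_problems` is hypothetical, and accepting the underscore is an assumption made because the conventional file label FreeR_flag contains one.

```python
import re

MAX_LABEL_LENGTH = 30
# Letters, digits and (by assumption) underscores.
_ALLOWED = re.compile(r"[A-Za-z0-9_]+")


def label_problems(label):
    """Return a list of reasons why `label` breaks the rules described above."""
    problems = []
    if not label:
        problems.append("label is empty")
    if len(label) > MAX_LABEL_LENGTH:
        problems.append(f"label is longer than {MAX_LABEL_LENGTH} characters")
    # "M/ISYM" is the one standard label allowed to contain a slash.
    if label and label != "M/ISYM" and not _ALLOWED.fullmatch(label):
        problems.append("label contains special characters")
    return problems


for candidate in ("FP", "M/ISYM", "F/P"):
    print(candidate, label_problems(candidate))
```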
Although you are free to choose any column labels, the following table lists some conventional choices:
Name | Item |
---|---|
H, K, L | Miller indices. |
M/ISYM | Column contains a combination of the partiality flag M and the symmetry number ISYM: 256M+ISYM. M is 0 for fully recorded reflections or 1 for partials. ISYM = 2*isymop - 1 for reflections placed in the positive asu, i.e. I+ of a Friedel pair, and ISYM = 2*isymop for reflections placed in the negative asu, i.e. I- of a Friedel pair. Here "isymop" is the number of the symmetry operator used. |
BATCH | Batch number. |
I | Intensity. |
SIGI | sigma(I). |
FRACTIONCALC | Calculated partial fraction of spot. |
IMEAN | Mean intensity. |
SIGIMEAN | sigma(IMEAN). |
FP | Native F value. |
FC | Calculated F. |
FPH<n> | F value for derivative <n>. |
DP | Anomalous difference for native data. |
DPH<n> | Anomalous difference for derivative <n>. |
SIGFP | sigma(FP). |
SIGDP | sigma(DP). |
SIGFPH<n> | sigma(FPH<n>). |
SIGDPH<n> | sigma(DPH<n>). |
PHIC | Calculated Phase. |
PHIB | Phase from experimental phasing. |
FOM | figure of merit. |
WT | weight |
HLA,HLB,HLC,HLD | Hendrickson-Lattman (HL) coefficients |
FREE | Free R flag (program label) |
FreeR_flag | Free R flag (file label) |
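
The packing used in the M/ISYM column (256M+ISYM) can be undone with integer arithmetic. The Python sketch below follows the definitions in the table; the function name `decode_m_isym` is hypothetical, and the conversion to `int` reflects the fact that data items are generally stored as reals.

```python
def decode_m_isym(value):
    """Split a packed M/ISYM value (256*M + ISYM) into its components.

    Returns (M, ISYM, isymop, friedel), where friedel is "+" for odd ISYM
    (ISYM = 2*isymop - 1) and "-" for even ISYM (ISYM = 2*isymop).
    """
    packed = int(value)
    m, isym = divmod(packed, 256)
    isymop = (isym + 1) // 2  # valid for both the odd and the even case
    friedel = "+" if isym % 2 == 1 else "-"
    return m, isym, isymop, friedel


print(decode_m_isym(259.0))  # (1, 3, 2, '+'): partial, operator 2, I+
print(decode_m_isym(4.0))    # (0, 4, 2, '-'): fully recorded, operator 2, I-
```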
All columns in an MTZ file are assigned a type, taken from the following list. The LABIN line of a particular job connects columns in an input MTZ file with the columns expected by the program. The column types are used to check that a sensible assignment is made, e.g. that you do not try to use an Intensity (type J) column where a Structure Factor Amplitude (type F) is expected. If there is a mismatch between file and program column types, a warning will be issued by the CCP4 library. The allowed column types are as follows:
Type | Description |
---|---|
H | index h,k,l |
J | intensity |
F | structure amplitude, F |
D | anomalous difference |
Q | standard deviation of J,F,D or other (but see L and M below) |
G | structure amplitude associated with one member of an hkl -h-k-l pair, F(+) or F(-) |
L | standard deviation of a column of type G |
K | intensity associated with one member of an hkl -h-k-l pair, I(+) or I(-) |
M | standard deviation of a column of type K |
E | structure amplitude divided by symmetry factor ("epsilon"). Normally scaled as well to give normalised structure factor |
P | phase angle in degrees |
W | weight (of some sort) |
A | phase probability coefficients (Hendrickson/Lattman) |
B | BATCH number |
Y | M/ISYM, packed partial/reject flag and symmetry number |
I | any other integer |
R | any other real |
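
The type check described in this section can be sketched as follows. This is an illustration only and not the CCP4 library's own routine: the function name `check_assignments` and the dictionary-based inputs are hypothetical stand-ins for the LABIN assignments and the column types held in the file header.

```python
# The one-character column types listed in the table above.
COLUMN_TYPES = set("HJFDQGLKMEPWABYIR")


def check_assignments(expected_types, file_types, labin):
    """Return the assignments whose file column type differs from the expected type.

    expected_types: program label -> type the program expects
    file_types:     file column label -> type recorded in the file
    labin:          program label -> file column label
    """
    mismatches = []
    for program_label, file_label in labin.items():
        want = expected_types[program_label]
        have = file_types[file_label]
        if want not in COLUMN_TYPES or have not in COLUMN_TYPES:
            raise ValueError(f"unknown column type in {program_label}={file_label}")
        if want != have:
            mismatches.append((program_label, file_label, want, have))
    return mismatches


expected = {"FP": "F", "SIGFP": "Q"}
in_file = {"IMEAN": "J", "SIGIMEAN": "Q", "FP": "F"}
# Assigning an intensity column where an amplitude is expected is flagged.
print(check_assignments(expected, in_file, {"FP": "IMEAN", "SIGFP": "SIGIMEAN"}))
# [('FP', 'IMEAN', 'F', 'J')]
```

A caller could turn each returned tuple into a warning rather than an error, matching the behaviour described above.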