pdb_extract as part of the CCP4
program suite for protein crystallography, version 5.0.1
July 2004
Documentation and
Examples for using pdb_extract as part of the CCP4 interface (available from http://www.ccp4.ac.uk/)
The PDB_EXTRACT Program Suite contains tools and
examples for extracting mmCIF data from structure determination applications.
Standalone and web versions of pdb_extract are available at:
http://deposit.pdb.org/mmcif/PDB_EXTRACT/
©2004
Research Collaboratory for Structural Bioinformatics
Questions and comments about this manual should be sent to
info@rcsb.org.
The Protein Data Bank (PDB) is operated by
The RCSB PDB is supported by funds from the National Science Foundation (NSF), the
National Institute of General Medical Sciences (NIGMS), the Office of Science,
Department of Energy (DOE), the National Library of Medicine (NLM), the
National Cancer Institute (NCI), the National Center for Research Resources
(NCRR), the National Institute of Biomedical Imaging and Bioengineering
(NIBIB), and the National Institute of Neurological Disorders and Stroke
(NINDS).
Contents
1. General
1.1 Description
1.2 Credits
2. Usage
2.1 The CCP4i interface
2.2 The command line interface
2.3 The script interface
2.4 The Web interface
2.5 Example for using the different interfaces
3. Files
3.1 The data template file
3.2 The script input file
3.3 The output file
3.4 Input log files from various crystallographic applications
3.4.1 Data collection/reduction
3.4.2 Molecular replacement
3.4.3 Heavy atom phasing
3.4.4 Density modification
3.4.5 Final structure refinement
3.4.6 List of crystallographic applications supported by pdb_extract
4. Command line arguments for running pdb_extract
4.1 Arguments for running pdb_extract to prepare the coordinate files
4.1.1 Examples
4.2 Arguments for running pdb_extract_sf to prepare sf files
4.2.1 Examples
4.3 Arguments for running extract to generate data template and script input files
4.3.1 Examples
4.4 Summary of arguments
5. Appendices
5.1 Example of a data template file
5.2 Example of a script input file
6. References
7. Frequently asked questions
1. General
1.1 Description
The pdb_extract software is designed to
automatically extract information and statistics about data reduction, heavy atom
phasing, molecular replacement, density modification, and final structure
refinement from the output and log files produced by many X-ray
crystallographic applications. The program can merge all the extracted
information into macromolecular Crystallographic Information File (mmCIF) data
files for validation and deposition to the PDB.
Some of the
advantages of using this software are listed below:
·
It reduces manual
intervention during the assembly and preparation of coordinate and structure
factor data, thereby making the process quick and accurate.
·
Files prepared using
this software have more detailed information pertaining to the structure
determination and quality.
·
Since this application
is based on the PDB mmCIF exchange dictionary, its use for structure deposition
also facilitates annotation and processing of the data.
·
The coordinate and
structure factor files prepared by pdb_extract
can be validated and deposited using ADIT
(either at http://deposit.pdb.org/adit/ or http://pdbdep.protein.osaka-u.ac.jp/adit/). Alternatively, the files can be validated (http://deposit.pdb.org/validate/ or http://pdbdep.protein.osaka-u.ac.jp/validate/) and directly submitted to the PDB either via email
(deposit@rcsb.rutgers.edu) or via ftp (pdb.rutgers.edu).
·
The program pdb_extract can be used to separately
extract and save relevant information and statistics regarding different stages
of structure determination. This may be useful for situations where the
structure determination process is extended over a long period of time or when
different people are involved in the various steps of structure determination.
1.2 Credits
This program was developed by the RCSB-Protein Data Bank in order to
facilitate deposition of structural data.
2. Usage
The various
interfaces for running the pdb_extract
application are explained in the following sections. A few important points to
keep in mind regarding the use of this program are:
·
A number of trials may
have been used at each step of the structure determination. Please use the
output and log files from the best or final trial of data processing, heavy
atom phasing, density modification and final structure refinement for running pdb_extract.
·
Multiple applications
may have been used at a single step of structure determination. For example, if
program A was used to locate heavy atom positions and program B was used to
refine heavy atom parameters (like x, y, z, occupancy and B factors),
information regarding the phasing statistics should be extracted from the
output of program B.
·
If multiple structures
need to be deposited to the PDB, please run pdb_extract separately for each structure, since each one will be
deposited as a separate entry with its own PDB ID.
·
Once the pdb_extract program has been installed
as part of the CCP4 package, it can be run in any one of the following ways:
o The CCP4i interface
o The command line interface
o The script interface
o The Web interface
2.1 The CCP4i
interface
The
graphical user interface for CCP4 (CCP4i) can be used to prepare structural
data for deposition to the PDB. This interface is intuitive and easy to use.
Steps for running pdb_extract using
this interface are described below:
·
In the main CCP4i window, click on the yellow button in the top left-hand
corner. This lists the different modules of CCP4.
·
Select the ‘Validation and
Deposition’ module.
·
Now select the ‘Data
Harvesting Management Tool’ option from the left hand menu. This opens a
new window titled the 'Harvesting Manager'.
·
In the 'Harvesting Manager' window, under the option ‘Run program to’, select ‘Extract additional information for deposition’.
·
Now under the 'Extract
information from' options, select ‘Generate
a data template’. This opens boxes for the input coordinate file (in PDB or mmCIF
format) and for the name of an output file. Either
type the complete path of the input PDB or mmCIF file in the appropriate box or select it using
the browsing function. Note that blank spaces should not be entered in any of the
boxes, even in those that are left without a file name.
·
Select the 'Run Now' option in the Run button to generate the data
template file.
·
Edit and complete the data template file according to the
instructions included within the file. Please refer to section 3.1 for more
information. An example data template file is included in section 5.1.
·
Return to the 'Extract
information from' category and select the ‘Generate a complete mmCIF file for PDB deposition’ option.
·
Select the program and log file names used for each stage of the
structure determination, such as data scaling, heavy atom phasing, density
modification, molecular replacement and structure refinement. Also include the
names of the data template file generated above and of an output file.
·
Run the pdb_extract
program to obtain a complete mmCIF format file that can be uploaded to the web
version of ADIT (at http://deposit.pdb.org/adit/
or http://pdbdep.protein.osaka-u.ac.jp/adit/) for structure validation and
deposition. Alternatively, the mmCIF format file may be validated either at http://deposit.pdb.org/validate/ or http://pdbdep.protein.osaka-u.ac.jp/validate/, corrected if necessary and submitted to the PDB either via email
(deposit@rcsb.rutgers.edu) or via ftp (pdb.rutgers.edu).
·
The structure factor file for the deposition should be converted to
mmCIF format using the 'Structure factor
for deposition' button in the main CCP4i window. Alternatively, the
mtz2various application can be used to convert mtz format structure factor
file(s) to mmCIF format. Note that the structure factor data should be the set
used for the final refinement and should contain at least h, k, l, F, SigmaF (and/or I,
SigmaI) and, if appropriate, test flags. Another method for preparing the
structure factor file using pdb_extract
is described in sections 2.2 and 2.3.
Also see section 2.5 for an example.
2.2 The command line interface
Once
the CCP4 suite of programs has been completely installed, the pdb_extract application may be run
using a command line interface. This allows greater flexibility for using the
various options of the program.
·
Obtain the data template file 'data_template.text'
using the command
extract -pdb coordinate_PDB_file_name
or
extract -cif coordinate_CIF_file_name
·
Edit and complete the data template file according to the
instructions included within the file. An example of the data template file is
included in section 5.1.
·
Run the pdb_extract
program using the appropriate arguments to include the names of the programs
and their log files in the command line, to obtain a complete mmCIF format file
including coordinates and all the data statistics. Instructions for including
the different filenames and a list of the commonly used arguments for running
this program are included in section 4.1. They are also described in an example
in section 2.5.
·
Run pdb_extract_sf to
convert the structure factor file to mmCIF format. If the structure factor file
is in mtz format, it can also be converted to mmCIF format using the
mtz2various program, available as part of CCP4. If multiple structure factor
data sets were used for phasing the structure (for example in the case of a MAD
experiment), pdb_extract_sf can be used
to concatenate all the data sets into a single file. The first block of structure
factors should be the one used for the refinement. Note that each block of
structure factor data should have h, k, l, F, SigmaF (and/or I, SigmaI) and
test flags if appropriate.
Also see section 2.5 for an
example.
2.3 The script interface
This
interface uses scripts similar to the CNS script input files. It is an easy and
user friendly interface that can be executed without the use of a graphical
interface. The advantage here is that it does not involve the use of specific
arguments to include the names of all the programs, output and log files. All
this information can be included in the script input file 'log_script.inp'.
·
Obtain the data template file 'data_template.text'
and script input file ‘log_script.inp’
using the command
extract -pdb coordinate_PDB_file_name
or
extract -cif coordinate_CIF_file_name
·
Edit the data template file according to the instructions included
in the file. Fill in the names of all relevant software applications, their log
and output file names, as well as the data template file name in the
'log_script.inp' file. Examples of the data template and script input files are
included in sections 5.1 and 5.2 respectively.
·
Run the program using the command
extract -ext log_script.inp
Also see section 2.5 for an
example.
2.4 The Web interface
This is
actually not part of the CCP4 package. However, if internet access is available
on the workstation running CCP4, this option is available at http://pdb-extract.rutgers.edu/.
Detailed instructions and examples for running the program using this interface
are available from this link.
2.5 Examples
Here is an example in which the protein structure was solved by
multi-wavelength anomalous diffraction (MAD). The structure determination details
were as follows:
·
A single crystal was used for data collection.
·
Three datasets were collected for the MAD experiment, at the inflection,
peak and remote wavelengths of the selenium edge.
·
The program HKL2000 was
used for indexing and scaling the data sets.
·
The program SOLVE was
used for phase determination and refinement of heavy atom parameters. All three
reflection data files were used for phasing.
·
RESOLVE was used for density modification.
·
REFMAC5 was used for final structure
refinement.
The output and log files generated
from above programs were as follows:
·
The HKL2000 program generated three reflection data files (scale1.sca, scale2.sca, scale3.sca)
and three log files (scale1.log, scale2.log, scale3.log) for the three data sets collected.
·
The SOLVE program generated one log file (solve.prt) containing phasing statistics and one PDB file (ha.pdb) containing heavy atom (Selenium
in this case) coordinates.
·
The RESOLVE program generated one log file (resolve.log) containing density modification statistics.
·
The REFMAC5 program generated one PDB file (refmac.pdb) containing atomic coordinates and one mmCIF file (native.refmac) containing refinement
statistics. The structure factor data used for the final refinement was refmac_sf.mtz.
·
The steps involved in running pdb_extract
for complete data extraction for this example, using the different interfaces
(sections 2.1-2.4) are described below.
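Before starting with any of the interfaces, it may help to confirm that the files listed above are present in the working directory. The following is only a sketch: the helper name missing_inputs is hypothetical and is not part of pdb_extract or CCP4, and the file names in the comment are simply the ones used in this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdb_extract): print the name of every
# listed file that is missing or empty, and return non-zero if any was found.
missing_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            status=1
        fi
    done
    return "$status"
}

# Example call with the file names from the MAD example:
# missing_inputs scale1.log scale2.log scale3.log solve.prt ha.pdb \
#     resolve.log refmac.pdb native.refmac refmac_sf.mtz
```

A non-zero return value only says that a file could not be found or is empty; it does not check that the file is the log from the best or final trial.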
Using the CCP4i interface:
·
Follow the instructions (in section 2.1) to launch the 'Harvesting
Manager' window and select the ‘Generate
a data template’ option.
·
In the box titled ‘PDB File’, upload the file refmac.pdb.
·
Include the name of an output file, for example ‘data_template.text’
with the complete path.
·
Select the 'Run Now'
option to generate the data template file.
·
Edit and complete this file according to the instructions included
in it. Replace any chain breaks (denoted by question marks ‘????’ in the
one-letter-code sequence listed in the ‘Sequence information’ category), with
the sequence of the residues that were not modeled due to missing density etc.
Also add any residues missing from the N- and C-termini and correct the
sequence where the residues were modeled as
·
Return to the 'Extract
information from' category and select the ‘Generate a complete mmCIF file for PDB deposition’ option.
·
In the data scaling section, select the scaling program HKL2000 and
upload the log file scale1.log to extract scaling statistics (scale1 was the data
used for the final refinement).
·
In the phasing section, select phasing method MAD, program SOLVE and
upload the log file solve.prt to obtain phasing statistics.
·
In the density modification section, select the program RESOLVE and
upload the log file resolve.log to obtain density modification statistics.
·
In the structure refinement section, select the program REFMAC5 and
upload the PDB coordinate file refmac.pdb and the data harvest file
native.refmac to obtain the PDB coordinates and refinement statistics.
·
Upload the data template file generated above (data_template.text)
to obtain the sequence information for all unique polymers in the file and any
other non-electronically produced information that you may have added to
the file.
·
Run the program to obtain a complete mmCIF format file.
·
The structure factor data accompanying this file, refmac_sf.mtz, can
be prepared using the 'Structure factor
for deposition' button in the main CCP4i window. Alternatively, the
mtz2various application can be used to convert the mtz format structure factor
file(s) to mmCIF format. For details on generating a file with this structure
factor data and the other data sets used for phasing the structure see
instructions in the command line section below.
·
The coordinate and structure factor files can be validated and
deposited to the PDB as instructed in section 1.1.
Using the command line interface:
·
Generate the data_template file, ‘data_template.text’ using the
command
extract -pdb coordinate_PDB_file_name
or
extract -cif coordinate_CIF_file_name
·
Edit and complete this file according to the instructions included
in it. Replace any chain breaks (denoted by question marks ‘????’ in the
one-letter-code sequence listed in the ‘Sequence information’ category), with
the sequence of the residues that were not modeled due to missing density etc.
Also add any residues missing from the N- and C-termini and correct the
sequence where the residues were modeled as
·
Run the pdb_extract program to obtain coordinates and statistics
using the following command:
pdb_extract -e MAD \
-p SOLVE solve.prt \
-d RESOLVE -iLOG resolve.log \
-r refmac5 -icif native.refmac -ipdb refmac.pdb \
-s HKL -iLOG scale1.log \
-sp HKL scale1.log scale2.log scale3.log \
-iENT data_template.text \
-o output.cif
Note
that the command line can be extended by using a backslash (\) at the end of a
line. There should be no space after the backslash (\). Refer to section 4 for
a list and explanation of arguments used to input the names of applications and
their output and/or log files.
·
Run pdb_extract_sf to
convert HKL format structure factors to mmCIF format. Since the structure
factor file used for the final refinement was in mtz format (refmac_sf.mtz),
convert this to refmac_sf.mmcif either using the CCP4i interface or using the
mtz2various application. Since the structure factor data for all three
wavelengths were used for phase determination, they should be merged into one
file for deposition using the following command:
pdb_extract_sf -rt F -rp refmac5 -idat refmac_sf.mmcif \
-dt I -dp HKL \
-c 1 -w 1 -idat scale1.sca \
-c 1 -w 2 -idat scale2.sca \
-c 1 -w 3 -idat scale3.sca \
-o output_sf.cif
Here the first line gives the data used for refinement, and the lines from
-dt onwards give the data used for phasing.
Note
that each block of structure factor data should have h, k, l, F, SigmaF,
(and/or I, SigmaI) and test flags if appropriate. In this case, a test set was
used for the final structure refinement, thus the file refmac_sf.mmcif should
include a column with the test flags. The output file (output_sf.cif) contains
one reflection data block for refinement (derived from refmac_sf.mmcif) and a
data block for protein phasing (derived from scale1.sca, scale2.sca and
scale3.sca).
·
The coordinate and structure factor files can be validated and
deposited to the PDB as instructed in section 1.1.
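The note above about line continuation (no space after the backslash) is easy to violate when a long pdb_extract command is kept in a script file. As a rough sketch, a standard grep pattern can flag lines in which whitespace follows the trailing backslash. The function name check_continuations and the script name in the comment are hypothetical and are not part of pdb_extract.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) every line of a command
# script in which spaces or tabs follow the trailing backslash.
# Returns zero if such a line is found, non-zero otherwise (grep semantics).
check_continuations() {
    # A literal backslash followed by one or more whitespace characters
    # at the end of the line.
    grep -n '\\[[:space:]][[:space:]]*$' "$1"
}

# Example call (file name is an assumption):
# check_continuations run_pdb_extract.sh
```

Any line reported by this check should have the whitespace after the backslash removed before the command is run.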
Using the script interface:
·
Generate the data_template file, ‘data_template.text' using the
command
extract -pdb coordinate_PDB_file_name
or
extract -cif coordinate_CIF_file_name
Two
files are generated, 'data_template.text’ and 'log_script.inp'.
·
Edit and complete the data template file according to the instructions included
in it. Replace any chain breaks (denoted by question marks ‘????’ in the
one-letter-code sequence listed in the ‘Sequence information’ category), with
the sequence of the residues that were not modeled due to missing density etc.
Also add any residues missing from the N- and C-termini and correct the
sequence where the residues were modeled as
·
Edit the log_script.inp file according to the instructions in the
file to include names of all the applications used for the structure
determination and the names of their output and log files.
·
Run the program using the command:
extract -ext log_script.inp
The coordinate and structure factor
output files generated in this run should be identical to those generated using
the command line interface.
·
The coordinate and structure factor files can be validated and
deposited to the PDB as instructed in section 1.1.
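If both the command line and the script interface have been run on the same example, the statement that their output files should be identical can be checked with the standard cmp utility. This is a minimal sketch: the function name same_output and the directory names in the comment are hypothetical, and a reported difference may be harmless if the files contain run-specific text.

```shell
#!/bin/sh
# Hypothetical helper: report whether two output files are byte-for-byte
# identical. Returns non-zero when they differ.
same_output() {
    if cmp -s "$1" "$2"; then
        echo "identical: $1 $2"
    else
        echo "different: $1 $2"
        return 1
    fi
}

# Example call (directory names are assumptions):
# same_output cmdline/output.cif script/output.cif
```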
3. Files
This section describes various input and output files that are used
for running pdb_extract. Useful tips for using these files are included here. For examples
of a data template file and script input file see sections 5.1 and 5.2.
3.1 The data template file
·
This file is generated by running extract on a coordinate file as
follows:
extract -pdb coordinate_file_name
or
extract -cif coordinate_file_name
·
The data template file contains the sequence information for all
unique polymers (protein or nucleic acids) in the structure and other
non-electronically captured information.
·
Categories 1 and 2 must be filled in before running pdb_extract. Categories 3-18 may
either be filled in here or later during deposition using ADIT.
·
In the data template file, only strings enclosed between the 'less
than' and 'greater than' signs (<.....>) will be parsed by
the program. Therefore, DO NOT write anything to the left of the 'less
than' sign or to the right of the 'greater than' sign.
·
All alphanumeric values or strings that you include in the different
categories should be within double-quotes. Blank spaces or carriage returns
within a pair of double quotes are ignored by the program. DO NOT use double quotes (") within
strings that you enter.
·
See section 5.1 for an example of a data template file.
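One of the rules above, that double quotes must not appear inside the strings you enter, can be partly checked without knowing the template layout: if the file contains an odd number of double-quote characters, at least one quote has no partner. The sketch below relies only on that counting argument. The function name quotes_balanced is hypothetical, and an even count does not prove that the file is correct.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the file holds an even number of
# double-quote characters, fail when the number is odd.
quotes_balanced() {
    # Delete everything except double quotes, then count what is left.
    n=$(( $(tr -cd '"' < "$1" | wc -c) ))
    # An odd count means at least one quote is unpaired.
    [ $((n % 2)) -eq 0 ]
}

# Example call with the default template name used in this manual:
# quotes_balanced data_template.text || echo "check the double quotes"
```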
3.2 The script input file
·
This file is also generated by running extract (in addition to the
data template file). This file is used only for running pdb_extract using the script interface.
·
The script input file is used to enter the names of the
crystallographic software used for structure determination and the log, PDB,
mmCIF or other text files generated by them. Names of the coordinate, structure
factor and data template file should also be included here.
·
The script input file should be completed according to the type of
experiment used for structure determination. The command 'extract -ext
log_script.inp' is then used to obtain the completed structure data files ready
for validation and deposition.
·
Only strings enclosed between the 'less than' and 'greater than'
signs (<.....>) will be parsed by the program. Therefore,
DO NOT write anything to the left of the 'less than' sign or to the right
of the 'greater than' sign.
·
All alphanumeric values or strings that you include in the different
categories should be within double-quotes. Blank spaces or carriage returns
within a pair of double quotes are ignored by the program. DO NOT use double
quotes (") within strings that you enter.
·
The log files used for generating the deposition should be generated
from the best (usually the last) trial for each crystallographic application.
·
See section 5.2 for an example of a script input file.
3.3 Output files
·
The output files generated by pdb_extract
(coordinate and structure factors) are in mmCIF format.
·
These files are ready to be uploaded to the validation server for
validation or to the ADIT tool for validation and deposition.
·
mmCIF files containing
information regarding different stages of structure determination can be
prepared separately. Thus instead of saving all the log files generated during
structure determination, pdb_extract
output files containing relevant information regarding a particular step of
structure determination can be saved. These output files (which are in mmCIF
format) can be read later by pdb_extract
and combined to create a complete mmCIF format file for validation and
deposition.
3.4 Running pdb_extract on the output and log files from various
crystallographic applications
pdb_extract can be run independently on the output and log files obtained from
the various applications used for structure determination. Details about the
information extracted from the log and output files of the different
applications are described in the following sections.
3.4.1 Data collection/reduction/scaling
This is the early stage of solving a crystal structure. The
statistics and details about the data integration and scaling describe the
quality of the structure factor data. Information that can be extracted at this
stage include:
A few data scaling programs that are
commonly used are described in the following sections.
3.4.1.1 Using HKL, HKL2000 or scalepack
(http://www.lnls.br/infra/linhasluz/denzo-hkl.htm)
pdb_extract -s HKL -ilog scale1.log (one dataset for refinement)
pdb_extract -sp HKL -ilog scale1.log scale2.log … (multiple datasets for phasing)
3.4.1.2 Using D*trek (http://www.msc.com/protein/dtrek.html)
pdb_extract -s Dtrek -ilog scale1.log (one dataset for refinement)
pdb_extract -sp Dtrek -ilog scale1.log scale2.log … (multiple datasets for phasing)
3.4.1.3 Using SAINT (http://xray.chm.bris.ac.uk/facilities/smart.html)
pdb_extract -s SAINT -ilog scale1.ls (one dataset for refinement)
pdb_extract -sp SAINT -ilog scale1.ls scale2.ls … (multiple datasets for phasing)
3.4.1.4 Using 3DSCALE
·
This program by Fu et al. is used for data scaling. pdb_extract
can be run on the log file (e.g. scale1.log) as follows:
pdb_extract -s 3DSCALE -ilog scale1.log (one dataset for refinement)
pdb_extract -sp 3DSCALE -ilog scale1.log scale2.log … (multiple datasets for phasing)
3.4.1.5 Using SCALA or TRUNCATE (http://www.ccp4.ac.uk/dist/html/scala.html)
pdb_extract -s scala -icif name.scala (one dataset for refinement)
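The multi-dataset commands for HKL, D*trek, SAINT and 3DSCALE above share one pattern: the -sp argument, the program name, -ilog and then the list of log files. As an illustration only, the sketch below assembles and prints such a command line without running it. The function name scaling_phasing_cmd is hypothetical, and the printed command uses only the arguments shown in this section.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the pdb_extract command for
# several phasing datasets.
# $1 = program name as written above (HKL, Dtrek, SAINT or 3DSCALE)
# remaining arguments = the scaling log files
scaling_phasing_cmd() {
    prog=$1
    shift
    echo "pdb_extract -sp $prog -ilog $*"
}

scaling_phasing_cmd HKL scale1.log scale2.log scale3.log
# prints: pdb_extract -sp HKL -ilog scale1.log scale2.log scale3.log
```

Printing the command first makes it possible to inspect the argument order before it is pasted into a terminal or script.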
3.4.2 Programs for molecular replacement
Information and
statistics regarding molecular replacement that can be extracted from the log
files are listed below:
·
Low and high
resolution used in rotation and translation.
·
Rotation and
translation methods
·
Reflection cut off
criteria, reflection completeness.
·
Correlation coefficients
for I or F between observed and calculated.
·
R_factor, packing
information, and model details.
A few molecular replacement programs
that are commonly used are described in the following sections.
3.4.2.1 Using CNS/XPLOR (http://cns.csb.yale.edu/v1.1/)
pdb_extract -o test.mmcif -e MR -m CNS -ilog translation.list
3.4.2.2 Using Amore (CCP4 version 4.1-5.0)
(http://www.ccp4.ac.uk/dist/html/INDEX.html)
·
Amore is a CCP4-supported program commonly used for molecular replacement. After the rotation and
translation searches, two log files, rotation.log and translation.log, are generated.
pdb_extract can be run on the log files as follows:
pdb_extract -e MR -m amore -ilog rotation.log translation.log -o test.mmcif
3.4.2.3 Using Molrep (CCP4 version 4.1-5.0)
(http://www.ccp4.ac.uk/dist/html/INDEX.html)
·
Molrep is another CCP4-supported program that is used for molecular replacement. All the statistics
regarding the molecular replacement can be recorded in the log file, say
molrep.log. pdb_extract can be run on the log file as follows:
pdb_extract -e MR -m molrep -ilog molrep.log -o test.mmcif
3.4.2.4 Using EPMR (http://www.msg.ucsf.edu/local/programs/epmr/epmr.html)
·
EPMR is a command line program for molecular replacement. Write out a log file when you run the
program as:
epmr [options] files > epmr.log
All the relevant statistics will be recorded in the log file. pdb_extract can be run as follows:
pdb_extract -e MR -m epmr -ilog epmr.log -o test.mmcif
3.4.3 Programs for heavy atom position location and protein phasing
The phase problem lies
at the center of macromolecular crystallography. Heavy atom phasing may be used
to solve this problem. The log files generated at this stage contain important
statistics and information. pdb_extract
can be used to extract the following information from the log files:
·
Wavelength, f′, f″,
resolution range
·
FOM (acentric,
centric, overall, resolution shells)
·
R-Cullis (acentric,
centric, overall, resolution shells)
·
R-Kraut (acentric,
centric, overall, resolution shells)
·
Phasing power
(acentric, centric, overall, resolution shells)
·
Number of heavy atom
sites, heavy atom type.
·
Method used to locate
heavy atom(s).
·
Heavy atom B-factor,
occupancies, and xyz coordinates.
A few commonly used
programs for heavy atom phasing are described in the following sections.
3.4.3.1 Using CNS/XPLOR (http://cns.csb.yale.edu/v1.1/)
·
CNS may be used for initial phase determination for a structure. The scripts for locating heavy
atoms and phase refinement are ‘mad_phase.inp’ or ‘ir_phase.inp’. Running these scripts produces
output files such as ‘phase_final.summary’, ‘phase_final.sdb’ or ‘mad_phase.fp’.
The file phase_final.summary has all the phasing statistics, while the file
phase_final.sdb has all the heavy atom coordinates, occupancies and B factors.
The file mad_phase.fp has the refined f_prime and f_double_prime, if applicable.
(Note: The refined heavy atom coordinates, B factors and occupancies in
‘phase_final.sdb’ may be converted to the PDB format by running the
script sdb_to_pdb.inp. This generates a PDB format file ‘phase_final.pdb’.)
·
To extract phasing information, run the following:
pdb_extract -o test.mmcif -e MAD -p CNS \
-iLOG phase_final.summary phase_final.sdb mad_phase.fp
or, if you have the heavy atom coordinates in PDB format:
pdb_extract -o test.mmcif -e MAD -p CNS \
-iLOG phase_final.summary mad_phase.fp \
-iPDB phase_final.pdb
3.4.3.2 Using MLPHARE (http://www.ccp4.ac.uk/dist/html/INDEX.html)
·
MLPHARE is a CCP4-supported program used for refining heavy atom parameters.
·
When running the program using the CCP4i interface, select the data harvest button. When using
scripts, do not use the keyword NOHARV. In either case a file in mmCIF format (say
name.mlphare) is generated. This contains all the
statistics and information regarding the heavy atom phasing refinement.
·
Run the program REVISE (in CCP4) to extract wavelength information. This generates a log file, say
prephadata.log. Extract phasing information from these files as follows:
pdb_extract -o test.mmcif -e method -p MLPHARE \
-iCIF name.mlphare -iLOG prephadata.log
3.4.3.3 Using SOLVE (http://www.solve.lanl.gov/)
·
SOLVE is a program used for locating heavy atoms and refining their positions and occupancies.
Information regarding these stages of structure solution is summarized in a
file called “solve.prt” (default name used by the program). The program exports
the heavy atom coordinates in a file called “ha.pdb”.
·
The pdb_extract program can be used to
extract phasing information for any one of the following:
o SOLVE log file for a single SAD experiment
o SOLVE log file for a single MAD experiment
o SOLVE log file for a single MIR experiment
o SOLVE log file for phasing based on a single MIR experiment and
anomalous data at the native wavelength (e.g. MIR using two different
derivatives Hg, plus Fe anomalous data in the native dataset)
o SOLVE log file for phasing based on a single MAD experiment and two
sets of anomalous scattering in the native dataset. (e.g. MAD using Se,
combined with anomalous data for Se and Fe at the native wavelength)
o SOLVE log file for phasing based on a combination of MAD and MIR
experiments
o SOLVE log file for phasing based on two MAD experiments (e.g. Using
Se and Hg)
o SOLVE log file for phasing based on more than one MIR experiment
(e.g. using Hg, I, Pt etc.)
The phasing information can be extracted as follows:
pdb_extract -e method -p SOLVE -iLOG solve.prt -ipdb ha.pdb -o test.mmcif
3.4.3.4 Using SHARP (http://babinet.globalphasing.com/sharp/)
·
SHARP is used for finding heavy atom positions and refining the heavy atom parameters. When
running SHARP or autoSHARP, the log files are saved in the directory
sharpfiles/logfiles_local/dirs, where dirs refers to the subdirectories for
your various structures. Please note that the location of log files generated
by the program may vary depending on how the program is installed.
·
Of the numerous output files generated by SHARP, the following are used for extracting information
regarding this stage of structure determination:
(For version 1.3.x)
o Heavy.pdb: contains the heavy atom coordinates.
o FOMstats.html: contains figure of merit statistics.
o Name.sin: a generated input script with input information.
o Otherstat.html: contains Rcullis, Rkraut and phasing power.
(For version 2.0 or above)
o Heavy.pdb: contains the heavy atom coordinates.
o FOMstats.html: contains figure of merit statistics.
o Name.sin: a generated input script with input information.
o RCullis_?.html: contains Rcullis.
o PhasingPower_?.html: contains phasing power.
·
The easiest way to obtain these files is to run the program from the SUSHI interface. Review all
the log files in the internet browser and save the files in plain text (or
html) format. The phasing information can be extracted as follows:
pdb_extract -o test.mmcif -e method -p SHARP -iPDB heavy.pdb \
-iLOG FOMstats.html Otherstat.html Name.sin
3.4.3.5 Using SnB (http://www.hwi.buffalo.edu/SnB/)
·
SnB is a graphical-interface program that uses the Shake-and-Bake
algorithm. It produces heavy atom coordinates (e.g. heavy.pdb) in PDB format.
However, this program does not refine the heavy atom parameters and therefore produces no
statistics for that step. The heavy atom coordinates can be extracted as
follows:
pdb_extract -o test.mmcif -e method -p SNB -iPDB heavy.pdb
Note: If a program like MLPHARE or CNS was used for refining the heavy atom coordinates determined
by SnB, extract the heavy atom coordinates as well as the phasing information
from the MLPHARE or CNS output files (even though SnB may have been used to
find the initial heavy atom positions).
3.4.3.6 Using BnP (http://www.hwi.buffalo.edu/BnP/)
· BnP is a combination of the programs SnB (described above) and PHASES by
Furey (described below). Here, the heavy atom positions are located by SnB
while the heavy atom parameters are refined by PHASES. The log file (for
example auto.log) can be found in the directory ~/PHASES/* and contains the
phasing power for each phasing set. The phasing information can be extracted
as follows:
pdb_extract -o test.mmcif -e method -p BnP -ilog auto.log -iPDB heavy.pdb
or
pdb_extract -o test.mmcif -e method -p phases -ilog auto.log -iPDB heavy.pdb
3.4.3.7 Using SHELXD or SHELXS (http://shelx.uni-ac.gwdg.de/SHELX/)
· These programs are similar to SnB in that they also only compute the heavy
atom substructure in PDB format (e.g. heavy.pdb). The heavy atom coordinates
may be extracted as follows:
pdb_extract -o test.mmcif -e method -p SHELXD -iPDB heavy.pdb
or
pdb_extract -o test.mmcif -e method -p SHELXS -iPDB heavy.pdb
3.4.3.8 Using PHASES (http://imsb.au.dk/~mok/phases/phases.html)
· The PHASES package was developed by Furey and can be used to locate heavy
atom positions and refine the heavy atom parameters. The log file (for example
name.log) can be found in the directory ~/PHASES/* and contains the phasing
power for each phasing set. Heavy atom coordinates and phasing information can
be extracted as follows:
pdb_extract -o test.mmcif -e method -p Phases -ilog name.log -iPDB heavy.pdb
3.4.4 Programs for density modification
Density modification is normally applied after obtaining the phase
information (from heavy atom coordinates, molecular replacement etc.). The
application pdb_extract can be used to extract the following information from
the generated log files:
· Density modification method
· FOM after density modification (overall, resolution shells)
· Solvent mask determination method
· Structure solution software
A few density modification programs that are commonly used are described in
the following sections.
3.4.4.1 Using CNS/XPLOR (http://cns.csb.yale.edu/v1.1/)
· An input script like ‘density_modify.inp’ in CNS runs density modification
and produces a log file called ‘density_modify.list’. The density modification
statistics can be extracted as follows:
pdb_extract -o test.mmcif -e method -d CNS -iLOG density_modify.list
3.4.4.2 Using DM (http://www.ccp4.ac.uk/dist/html/INDEX.html)
· DM is a density modification program supported by CCP4. It generates a log
file (like dm.log) both when the program is run using the CCP4i interface and
when it is run using scripts. The density modification statistics can be
extracted as follows:
pdb_extract -o test.mmcif -e method -d DM -iLOG dm.log
3.4.4.3 Using SOLOMON (http://www.ccp4.ac.uk/dist/html/INDEX.html)
· SOLOMON is also a density modification program supported by CCP4. A log file
(like solomon.log) is generated when the program is run either from the CCP4i
interface or from scripts. The density modification statistics can be
extracted as follows:
pdb_extract -o test.mmcif -e method -d SOLOMON -iLOG solomon.log
3.4.4.4 Using RESOLVE (http://www.solve.lanl.gov/)
· RESOLVE is a density modification program in the solve/resolve package.
Normally it runs together with SOLVE, but it can be run separately. Run
RESOLVE so that a log file (like resolve.log) is written out using
“resolve input_file > resolve.log”. The density modification statistics can
be extracted as follows:
pdb_extract -o test.mmcif -e method -d RESOLVE -iLOG resolve.log
3.4.4.5 Using SHARP (http://babinet.globalphasing.com/sharp/)
· Density modification in SHARP actually runs DM or SOLOMON. Thus, running
density modification in SHARP generates a log file like ‘dm.log’. The density
modification statistics can be extracted as follows:
pdb_extract -o test.mmcif -e method -d SHARP -iLOG dm.log
or
pdb_extract -o test.mmcif -e method -d dm -iLOG dm.log
3.4.5 Programs for final structure refinement
The structure refinement is performed at the end of structure determination.
Normally the atom coordinates are generated in PDB format and the statistics
are generated in log files. The pdb_extract program can be applied to extract
the following information:
· Number of reflections used in refinement, and in R-Free set.
· Resolution range (overall, highest resolution shell)
· R-factor (overall, resolution shells)
· Number of atoms refined
· Cell parameters and space group.
· The xyz coordinates of all the atoms.
· RMS Bond Distances, Bond Angles, Chiral Volume, Torsion Angles
· Isotropic temperature factor restraints
· Non-crystallographic symmetry restraints
· Solvent model used
· Overall Average Isotropic B Factor
· Overall Anisotropic B Factor
· Overall Isotropic B Factor
· Topology/parameter data used to refine the structure
· Refinement software
A few refinement programs that are commonly used are described in the
following sections.
3.4.5.1 Using CNS/XPLOR (http://cns.csb.yale.edu/v1.1/)
· CNS is used for final structure refinement. After completion of the
refinement, a pre-deposition file can be created which is rich in various
statistics regarding the refinement. This is done by running the script
deposit_mmcif.inp to produce a file, say deposit.mmcif. This file should be
used for extracting refinement statistics as follows:
pdb_extract -o test.mmcif -e method -r CNS -iCIF deposit.mmcif
3.4.5.2 Using REFMAC5 (http://www.ccp4.ac.uk/dist/html/INDEX.html)
· REFMAC5 is a program used for structure refinement (also supported by CCP4).
When using the CCP4i interface, select the data harvest button; in script
mode, do not use the keyword NOHARV. The output files generated upon running
this application include the mmCIF format file, name.refmac, which contains
information about the structure refinement, and a PDB format file (name.pdb),
which contains the atomic coordinates. Refinement statistics can be extracted
from these files as follows:
pdb_extract -o test.mmcif -e method -r REFMAC5 -iCIF name.refmac \
-iPDB name.pdb
3.4.5.3 Using SHELXL (http://shelx.uni-ac.gwdg.de/SHELX/)
· SHELXL is a program within the SHELX package and is used for structure
refinement.
· After completion of structure refinement, please run the interactive program
shelxpro and use option B. This program generates a PDB format file (name.pdb)
with header information. Refinement statistics can be extracted from this file
as follows:
pdb_extract -o test.mmcif -e method -r SHELXL -iPDB name.pdb
3.4.5.4 Using TNT (http://www.uoxray.uoregon.edu/tnt/welcome.html)
· TNT is a crystal structure refinement program. After completion of the
structure refinement, the command rfactor is used to generate a log file
(rfactor.log) as follows:
rfactor name.cor > rfactor.log
· The to_pdb command may be used to convert coordinates in TNT format
(name.cor) to the PDB format (name.pdb) as:
to_pdb name.cor
· The symmetry information must also be provided via a symmetry file (e.g.
p6122.dat) in the control file name.tnt
· Complete information regarding the refinement statistics can be extracted
from the output PDB file and log files as follows:
pdb_extract -r TNT -iLOG p6122.dat rfactor.log -iPDB name.pdb
3.4.5.5 Using ARP/wARP (http://www.embl-hamburg.de/ARP/)
· ARP/wARP is a program for automatic structure solution and refinement, where
REFMAC5 is used for the structure refinement.
· The new version (6.0) can use the graphical interface of CCP4i. Thus the
program may either be run from the CCP4i interface or using scripts. The
output files include a log file (warpNtrace_refine.log) and a PDB file
(warpNtrace.pdb). Information can be extracted from these files as follows:
pdb_extract -o test.mmcif -e method -r WARP \
-iLOG warpNtrace_refine.log \
-iPDB warpNtrace.pdb
3.4.5.6 Using RESTRAIN (http://www.ccp4.ac.uk/dist/html/INDEX.html)
· RESTRAIN is a CCP4 supported program, used for structure refinement. When
using the script, do not use the keyword NOHARV. This program generates a
mmCIF format file (name.restrain), which contains information about the
structure refinement, and a PDB format file (name.pdb), which contains the
coordinates. The refinement statistics can be extracted as follows:
pdb_extract -o test.mmcif -e method -r RESTRAIN -iCIF name.restrain \
-iPDB name.pdb
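The input options used in sections 3.4.5.1 to 3.4.5.6 differ from program to program, so it can help to keep them in one lookup table. The following Python sketch is only an illustration and is not part of pdb_extract: the dictionary REFINEMENT_INPUTS and the function refinement_command are hypothetical names, and the file names are the example names used in the sections above. The sketch only assembles the argument list; it does not run the program.

```python
import shlex

# Input options and example file names per refinement program, copied from
# sections 3.4.5.1 to 3.4.5.6. This table is an illustrative assumption and
# is not shipped with pdb_extract.
REFINEMENT_INPUTS = {
    "CNS": {"-iCIF": ["deposit.mmcif"]},
    "REFMAC5": {"-iCIF": ["name.refmac"], "-iPDB": ["name.pdb"]},
    "SHELXL": {"-iPDB": ["name.pdb"]},
    "TNT": {"-iLOG": ["p6122.dat", "rfactor.log"], "-iPDB": ["name.pdb"]},
    "WARP": {"-iLOG": ["warpNtrace_refine.log"], "-iPDB": ["warpNtrace.pdb"]},
    "RESTRAIN": {"-iCIF": ["name.restrain"], "-iPDB": ["name.pdb"]},
}


def refinement_command(program, method, output="test.mmcif"):
    """Assemble a pdb_extract argument list for one refinement program."""
    if program not in REFINEMENT_INPUTS:
        raise KeyError(f"no example inputs recorded for {program}")
    cmd = ["pdb_extract", "-o", output, "-e", method, "-r", program]
    # Append each input option followed by its file name(s).
    for option, files in REFINEMENT_INPUTS[program].items():
        cmd += [option, *files]
    return cmd


refmac_cmd = refinement_command("REFMAC5", "MAD")
print(shlex.join(refmac_cmd))
```

Replace the example file names with the files produced by your own refinement run before using the printed command.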
3.4.6 Summary of crystallographic applications supported by pdb_extract
Category | Software | Versions | Authors
Data collection and reduction | HKL/SCALEPACK | 1.30 - 1.96 | Otwinowski & Minor (1997)
 | d*TREK | 7.0SSI | Pflugrath (1997)
 | SAINT | V6.35A | Siemens (1994)
 | SCALA | 3.1.4 - 3.2.3 | Evans (1997)
Molecular replacement | CNS | 0.9 - 1.1 | Brunger et al. (1998)
 | AMORE | CCP4 (4.0 - 5.0) | Navaza (1994)
 | Molrep | 7.5.01 | Vagin & Teplyakov (1997)
 | EPMR | 2.5 | Kissinger et al. (1999)
Heavy atom phase determination | CNS | 0.9 - 1.1 | Brunger et al. (1998)
 | SOLVE | 2.0 - 2.06 | Terwilliger & Berendzen (1999)
 | MLPHARE | CCP4 (4.0 - 5.0) | CCP4 (1994)
 | SHARP/autoSHARP | 1.3.x - 2.02 | Fortelle & Bricogne (1997)
 | SHELXD/SHELXS | 97 | Sheldrick (1997)
 | PHASES | 95 | Furey (1997)
 | SnB | 2.0 - 2.2 | Weeks & Miller (1999)
 | BnP | 0.93 - 0.94 | Weeks et al. (2002)
Density modification | CNS | 0.9 - 1.1 | Brunger et al. (1998)
 | DM | 2.0 - 2.1 | Cowtan (1994)
 | Solomon | CCP4 (4.0 - 5.0) | Abrahams & Leslie (1996)
 | RESOLVE | 2.0 - 2.06 | Terwilliger (2000)
 | SHELXE | 97 | Sheldrick (1997)
Structure refinement | CNS | 0.9 - 1.1 | Brunger et al. (1998)
 | REFMAC5 | 5.0 - 5.2 | Murshudov (1997)
 | RESTRAIN | 4.7.7 | CCP4 (1994)
 | SHELXL | 97 | Sheldrick (1997)
 | TNT | 5F | Tronrud (1997)
 | WARP | 5.0 - 6.0 | Lamzin & Wilson (1997)
4. Command line arguments for running pdb_extract
There are three components of
the pdb_extract application (pdb_extract, pdb_extract_sf, and extract).
The following sections describe the arguments used for running each of these
components from the command line. Examples for using these arguments are also
included here.
4.1 Arguments and options for preparing coordinate files using pdb_extract
NAME pdb_extract
SYNOPSIS pdb_extract [OPTIONs]... [FILEs]...
DESCRIPTION
pdb_extract is used to extract information about data processing, heavy atom
phasing, molecular replacement, density modification, and final structure
refinement from the output files produced by many X-ray crystallographic
applications. This program also merges all this information into mmCIF format
files, ready for validation and deposition.
Help on how to run this program is also available by typing ‘pdb_extract -h’
or ‘pdb_extract -help’ on the command line.
OPTIONS
-o Followed by a given output file name.
example: -o outfile.mmcif
Note: if you do not provide an output file name, a default output file name
(pdb_extract.mmcif) will be used.
-e Followed by one of the following experimental methods:
· MR molecular replacement.
· SAD single-wavelength anomalous diffraction.
· MAD multi-wavelength anomalous diffraction.
· SIR single isomorphous replacement.
· SIRAS single isomorphous replacement with anomalous scattering.
· MIR multiple isomorphous replacement.
· MIRAS multiple isomorphous replacement with anomalous scattering.
example: -e MAD
Note: If you have used a combination of methods to solve the structure (e.g.
MR with MAD), you may extract information and details regarding both methods
(e.g. -e MR -m program_mr -ilog Log_file -e MAD -p program_mad -ilog
file_name). Here program_mr and program_mad are the names of the programs used
for molecular replacement and MAD phasing, respectively.
-m Followed by one of the following programs for molecular replacement:
· CNS (versions 1.0 and 1.1).
· Amore from CCP4 suite (versions 4.1-5.0).
· EPMR (version 2.5).
· MOLREP from CCP4 suite (versions 4.1-5.0).
example: -m amore
-p Followed by one of the following program names for phasing:
· CNS (versions 1.0 and 1.1).
· MLPHARE from CCP4 suite (versions 4.0-5.0).
· SOLVE (versions 2.00-2.06).
· SHARP (versions 1.3.x - 2.03).
· SHELXS (version 97).
· SHELXD (version 97).
· SnB (version 2.2).
· BnP (versions 0.93-0.96).
· PHASES (version 0.97).
example: -p CNS
Note: if the program that you have used for phasing is not in the above list,
you should still give its name when running pdb_extract (use as -p
program_name). If the log and/or output file is in PDB or mmCIF format, some
information (like heavy atom coordinates) may still be extracted.
-d Followed by one of the following program names for density modification:
· CNS (versions 1.0 and 1.1).
· DM from CCP4 suite (CCP4 versions 4.0~5.0).
· SOLOMON from CCP4 suite (CCP4 versions 4.0~5.0).
· RESOLVE (versions 2.01~2.06).
· SHELXE (version 97).
· SHARP (versions 1.3.x-2.03, using DM version 2.2 for density modification).
example: -d CNS
-r Followed by one of the following program names for final structure
refinement:
· CNS (versions 1.0 and 1.1).
· REFMAC5 from CCP4 suite version 4.1-5.0 (REFMAC version 5.2).
· RESTRAIN from CCP4 suite version 4.1-5.0 (RESTRAIN v4.6).
· SHELXL (version 97).
· TNT (version 5F).
· WARP (version 6.0; it uses REFMAC5 for refinement).
example: -r CNS
Note: if the program that you used for final structure refinement is not in
the above list, you may still give the program name (use -r program_name).
Some information (like atom coordinates) may still be extracted if the
produced file is in PDB or CIF format.
-s Followed by one of the following programs used for scaling the structure
refinement dataset:
· HKL/HKL2000/SCALEPACK (versions 1.30 ~ 1.96).
· SCALA (version 3.1.4 ~ 3.2.3) or from CCP4 suite version 4.1-5.0
· D*trek (version 7.0SSI)
· SAINT (version 6.35A)
· 3DSCALE
example: -s HKL
Note: The -s option is used to extract statistics from data reduction of the
dataset that is finally used for structure refinement. This option must be
used for preparing all structure factor files. If you would like to deposit
additional datasets that were used for phasing the structure, please use the
-sp option described below, in addition to the -s option.
-sp Followed by one of the following programs used for scaling the dataset(s)
used in phasing the structure:
· HKL/HKL2000/SCALEPACK (versions 1.30 ~ 1.96).
· SCALA (version 3.1.4 ~ 3.2.3) or from CCP4 suite version 4.1-5.0
· D*trek (version 7.0SSI)
· SAINT (version 6.35A)
· 3DSCALE
example: -sp HKL
Note: This option is different from -s, since it is used to extract statistics
from data reduction of the dataset(s) used for phasing the structure (e.g. by
SAD, MAD, SIR, MIR). Normally, this option is followed by multiple data sets,
as in a MAD or MIR experiment.
-iPDB Followed by an input file in PDB format.
example: -iPDB test1.pdb
Note: PDB files are usually generated from heavy atom phasing (heavy atom
coordinates) or the final structure refinement.
-iCIF Followed by an input file in mmCIF format.
example: -iCIF deposit_cns.cif
Note: This option may
be used to read in any mmCIF format file at different stages of structure
determination. For instance, if you used MLPHARE for refining the heavy atom
parameters, the output file is in mmCIF format. Another instance where mmCIF
format files are produced is by running the deposit.inp script in CNS. This is
run at the end of the refinement to generate a file that contains the final
coordinates and refinement statistics.
-iLOG Followed by one or more input LOG files.
example: -iLOG mad_sdb.dat mad_summary.dat
Note: All stages of
structure determination produce log files. The specific format of the file depends
on the program used. They may contain phasing statistics or heavy atom
coordinates. In some cases, multiple log files may be generated, each
containing a different type of information regarding that stage of structure
determination. For instance, when CNS is
used for heavy atom phasing, it generates log files mad_sdb.dat, which contains
the heavy atom coordinates and mad_summary.dat, which contains phase refinement
statistics.
-iENT Followed either by the data template file (in plain text format) or a
mmCIF format file with additional information that you may wish to include in
your deposition (add_info.mmcif).
example: -iENT data_template.text or -iENT add_info.mmcif
Note: The data template file is generated by the program extract (see section
4.3) and contains sequence information of all unique polymers present in the
structure. It also has tokens for including other non-electronically produced
information regarding the deposition. The -iENT option also allows you to
include any additional information regarding the deposition in a mmCIF format
file. For further details regarding the mmCIF format, please consult the mmCIF
dictionary at: http://pdb.rutgers.edu/mmcif.
4.1.1 Examples for using pdb_extract options
Note: You can run pdb_extract to separately extract information and statistics
from each step of structure determination (data processing, heavy atom
phasing, density modification, molecular replacement and final structure
refinement). Alternatively, pdb_extract may be run to extract and combine
information from all these stages and add non-electronically produced
information for a complete deposition. A few examples of running pdb_extract
are shown below:
· Command for extracting information about heavy atom phasing:
pdb_extract -e experimental_method -p program_name_phasing \
-iPDB pdb_files -iLOG log_files \
-iCIF mmCIF_files -o output_file_name
(The experimental_method must be given for this step)
· Command for extracting information about density modification:
pdb_extract -d program_name_for_dm -iLOG log_files -o output_file_name
· Command for extracting information about molecular replacement:
pdb_extract -m program_name_for_mr -iLOG log_files -o output_file_name
· Command for extracting information from final structure refinement:
pdb_extract -r program_name_for_refinement -iPDB pdb_files \
-iLOG log_files -iCIF mmCIF_files -o output_file_name
· Command for extracting information from data scaling log files (for the
dataset used for refinement):
pdb_extract -s program_name_scaling -iLOG log_file -o output_file_name
· Command for extracting information from data scaling log files (for the
dataset(s) used for phasing):
pdb_extract -sp program_name_scaling -iLOG log_file1 log_file2 \
-o output_file_name
· Command for extracting information and generating a complete mmCIF file for
deposition:
pdb_extract -e experimental_method -r program_name_for_refinement \
-iPDB pdb_files -iLOG log_files -iCIF mmCIF_files \
-p program_name_for_phasing -iPDB pdb_files \
-iLOG log_files -iCIF mmCIF_files \
-d program_name_for_dm -iLOG log_files \
-s program_name_for_scaling -iLOG log_files \
-sp program_name_for_scaling -iLOG log_files \
-iENT data_template.text -o output_file_name
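Because each stage option in the complete command must be followed directly by its own input files, assembling the command by hand is error prone. The following Python sketch shows one possible way to build it from a list of stages. The helper names stage_args and build_pdb_extract are hypothetical, the sketch only prints the command rather than running it, and scale_refine.log is an assumed file name; the other file names are the examples used elsewhere in this manual.

```python
import shlex


def stage_args(flag, program, pdb=(), log=(), cif=()):
    """Return one stage option, its program name and that stage's input files."""
    args = [flag, program]
    if pdb:
        args += ["-iPDB", *pdb]
    if log:
        args += ["-iLOG", *log]
    if cif:
        args += ["-iCIF", *cif]
    return args


def build_pdb_extract(method, stages, template=None, output="pdb_extract.mmcif"):
    """Assemble a full pdb_extract argument list, keeping files next to their stage."""
    cmd = ["pdb_extract", "-e", method]
    for stage in stages:
        cmd += stage_args(**stage)
    if template:
        cmd += ["-iENT", template]
    cmd += ["-o", output]
    return cmd


stages = [
    {"flag": "-r", "program": "CNS", "cif": ["deposit.mmcif"]},
    {"flag": "-p", "program": "CNS", "log": ["mad_sdb.dat", "mad_summary.dat"]},
    {"flag": "-d", "program": "CNS", "log": ["density_modify.list"]},
    # scale_refine.log is a hypothetical scaling log file name.
    {"flag": "-s", "program": "HKL", "log": ["scale_refine.log"]},
]
full_cmd = build_pdb_extract("MAD", stages, template="data_template.text",
                             output="test.mmcif")
print(shlex.join(full_cmd))
```

Keeping each stage as one record makes it harder to attach a log file to the wrong program option.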
4.2 Arguments and options for preparing structure factor files using
pdb_extract_sf
NAME pdb_extract_sf
SYNOPSIS pdb_extract_sf [OPTIONs]... [FILEs]...
DESCRIPTION
This program can be used to prepare either
(a) a single reflection dataset used for final structure refinement, or
(b) multiple reflection datasets (e.g. from MAD, MIR …) used for phasing the
structure.
OPTIONS
-o Followed by an output file name.
example: -o outfile.cif
Note: if you do not specify the output file name, a default output file name
(pdb_extract_sf.mmcif) will be used.
-dt Followed by the data type for initial data processing (Amplitude (F) or
Intensity (I)). The data type at this step is usually intensity.
example: -dt I
-dp Data format for initial data processing. It is followed by one of the
following program names: HKL/SCALEPACK, DTREK, SAINT, XPREP, 3DSCALE, SCALA,
OTHER.
example: -dp HKL
Note1: If the program used for data scaling is not in the above list, please
use “OTHER” as the program name. Please provide the reflection data in a text
format file including h, k, l, F, SigmaF (and/or I, SigmaI), and test flags.
These columns should be separated by spaces.
Usage: -dt I -dp OTHER -idat file_name
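As an illustration of the plain text layout described in Note1, the following Python sketch writes one reflection per line with space-separated columns. It is a sketch under stated assumptions: the function names, the column order (h, k, l, value, sigma, test flag), the two-decimal number format and the 0/1 encoding of the test flag are not specified by this manual, and the reflection values are made-up placeholders.

```python
import os
import tempfile


def format_reflection(h, k, l, value, sigma, test_flag):
    """Format one reflection as space-separated columns.

    value/sigma stand for I/SigmaI (or F/SigmaF); the column order and the
    two-decimal format are assumptions, not a documented requirement.
    """
    return f"{h} {k} {l} {value:.2f} {sigma:.2f} {test_flag}"


def write_reflections(path, reflections):
    """Write one reflection per line to a plain text file."""
    with open(path, "w") as handle:
        for reflection in reflections:
            handle.write(format_reflection(*reflection) + "\n")


# Placeholder reflections: (h, k, l, I, SigmaI, test flag).
reflections = [(1, 0, 0, 152.3, 4.1, 0), (1, 1, 0, 87.6, 2.9, 1)]
out_path = os.path.join(tempfile.mkdtemp(), "reflections.txt")
write_reflections(out_path, reflections)
print(open(out_path).read())
```

A file written this way would then be passed with -dt I -dp OTHER -idat reflections.txt.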
Note2: If the structure factor data is in mtz format (processed by MOSFLM and
SCALA), you must convert it to either CNS format, scalepack format or mmCIF
format. This may be done using the mtz2various application of CCP4.
If the data is converted to CNS format use:
-dp CNS -idat file-name
For a scalepack format file use:
-dp HKL -idat file-name
For a mmCIF format file, use:
-dp SCALA -idat file-name
-c Followed by the crystal index. This is the number of the crystal which was
used for data collection (this value is always an integer like 1, 2, 3, …).
example: -c 2
(Thus the reflection dataset was collected using crystal 2)
-w Followed by the wavelength index. This is the number of the wavelength at
which the data was collected (this value is also an integer like 1, 2, 3, …).
example: -w 2
(Thus the dataset was collected at the second wavelength).
-idat Followed by the reflection data file name.
example: -idat scalepack.sca
Note: Please take care with the order of the options for each reflection file.
They should be given as -c i -w j -idat file_name, in that order, where i is
the crystal index, j is the wavelength index, and file_name is the name of the
file containing the reflections.
-rt Followed by the data type used for final structure refinement (Amplitude
(F) or Intensity (I)).
example: -rt F
-rp Data format in the final structure refinement. It is followed by the data
format name: CNS/XPLOR, REFMAC5, SHELX, TNT, HKL/SCALEPACK, DTREK, SAINT,
XPREP, 3DSCALE, SCALA
example: -rp CNS
Note1: If you used REFMAC5 for the final structure refinement, the mtz format
structure factor file should be converted to mmCIF or CNS format using the
mtz2various application of CCP4.
If it is converted to mmCIF format use:
pdb_extract_sf -rt I -rp REFMAC5 -idat data-file-name
For a CNS format file use:
pdb_extract_sf -rt I -rp CNS -idat data-file-name
Note2: If the program that you used for structure refinement is not in the
above list, please use “OTHER” as the program name and provide either a plain
text or mmCIF format file with reflection data. The file should contain h, k,
l, F, SigmaF, (and/or I, SigmaI), and test flags if appropriate. These columns
should be separated by spaces.
Usage: -rt F -rp OTHER -idat file_name
-imgCIF Followed by an input file name in imgCIF format.
example: -imgCIF example.cbf
Note: Only some header information can be extracted from the imgCIF file. This
format is not commonly used.
4.2.1 Examples for using pdb_extract_sf options
· Extracting reflection data used for final structure refinement:
pdb_extract_sf -rt data-type -rp data-format-for-refinement \
-idat data-file-name -o output-file-name
This option is used to prepare the dataset used for the final refinement for
deposition to the PDB. If you collected several datasets and merged them
together for the structure refinement, use the merged file here.
· Extracting reflection data used for phase determination of the structure:
pdb_extract_sf -dt data_type -dp program_name_for_scaling \
-c crystal_number_1 -w wavelength_number_1 -idat data_file_name_1 \
-c crystal_number_2 -w wavelength_number_2 -idat data_file_name_2 \
…
-o output_file_name
Include the details regarding all datasets used for phasing the structure
(e.g. by MAD, MIR …). The initial scaled reflection dataset files are used
here.
Note: Even if only one reflection dataset is included here (e.g. in the case
of a SAD experiment), structure factor data used for phasing should always be
accompanied by the crystal and wavelength numbers (using -c and -w).
· Preparing a mmCIF format structure factor file with all the reflection data:
pdb_extract_sf -rt data-type_refine -rp data-format-for_refine \
-idat data-file-name_refine -dt data_type_scaling \
-dp program_name_for_scaling \
-c crystal_number_1 -w wavelength_number_1 -idat data_file_name_1 \
-c crystal_number_2 -w wavelength_number_2 -idat data_file_name_2 \
…
-o output_file_name
The output_file_name contains blocks of reflections used for the final
structure refinement and for phasing.
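The ordering rule for phasing datasets (-c, then -w, then -idat for each file) can be enforced by generating the options from a list of records. The following Python sketch is an illustration only: build_pdb_extract_sf is a hypothetical helper, the three .sca file names are assumed example names for a three-wavelength MAD experiment on one crystal, and the command is printed rather than executed.

```python
import shlex


def build_pdb_extract_sf(data_type, program, datasets,
                         output="pdb_extract_sf.mmcif"):
    """Assemble pdb_extract_sf arguments for phasing datasets.

    datasets is a list of (crystal_index, wavelength_index, file_name)
    tuples; each tuple is emitted as -c i -w j -idat file_name, in that order.
    """
    cmd = ["pdb_extract_sf", "-dt", data_type, "-dp", program]
    for crystal, wavelength, file_name in datasets:
        # The manual describes both indices as integers like 1, 2, 3.
        if crystal < 1 or wavelength < 1:
            raise ValueError("crystal and wavelength indices start at 1")
        cmd += ["-c", str(crystal), "-w", str(wavelength), "-idat", file_name]
    cmd += ["-o", output]
    return cmd


# Hypothetical MAD experiment: one crystal, three wavelengths.
datasets = [(1, 1, "peak.sca"), (1, 2, "inflection.sca"), (1, 3, "remote.sca")]
sf_cmd = build_pdb_extract_sf("I", "HKL", datasets, output="test_sf.mmcif")
print(shlex.join(sf_cmd))
```

For a SAD experiment the list would hold a single tuple, which still carries its crystal and wavelength numbers as the note above requires.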
4.3 Arguments and options for running extract
NAME extract
SYNOPSIS extract [OPTIONs] [FILE]
DESCRIPTION
This program can be
used to generate the data template and script input files. Both these files are
in plain text format and used for running pdb_extract.
OPTIONS
-pdb Followed by the coordinate PDB file name.
example: -pdb pdb_file_name
Note: this generates two plain text files (data_template.text and
log_script.inp). See sections 3.1, 3.2, 5.1 and 5.2 for more details on these
files.
-cif Followed by the coordinate mmCIF file name.
example: -cif mmCIF_file_name
Note: this also generates the same files as above. See sections 3.1, 3.2, 5.1
and 5.2 for more details on the data template and script input files.
-ext Followed by the completed log script file.
example: -ext log_script.inp
Note: The script input file should be completed appropriately by including
names of programs and their output/log files generated at different stages of
structure determination. Since the name of the data template file is included
in the script input file, at least the sequence information should be
completed in the data template file. Use ‘extract -ext log_script.inp’ to
generate complete mmCIF format coordinate and structure factor files.
4.3.1 Examples for using extract options
· Generate the data template and log script input files:
extract -pdb pdb_file_name
or
extract -cif cif_file_name
· Get a complete mmCIF file for deposition:
extract -ext log_script.inp
4.4 Summary of arguments
Command line options for the three components of pdb_extract are summarized
below: pdb_extract_sf (used to capture structure factors), pdb_extract (used
to capture the details of data scaling, molecular replacement, heavy atom
phasing, density modification and structure refinement), and extract (used to
generate the data_template.text and log_script.inp files).
pdb_extract_sf [OPTION]... [FILE]...
Option | Argument descriptions
-o | output file name (default name is pdb_extract_sf.mmCIF)
-dt | data type (I or F) after data processing at beam line
-dp | program for processing data (e.g. HKL/Scalepack, D*Trek, SCALA)
-rt | data type (I or F) used for final structure refinement
-rp | program for structure refinement (e.g. CNS|REFMAC5|SHELX|TNT)
-c | crystal number (like 1, 2, 3 …) for diffraction
-w | wavelength number (like 1, 2, 3 …) for diffraction
-idat | data file name used for phasing or structure refinement
-ilog | log file name obtained from data processing
-icif | file name obtained from data processing (in mmCIF format)

pdb_extract [OPTION]... [FILE]...
Option | Argument descriptions
-o | output file name (default name is pdb_extract.mmcif)
-e | experimental method (e.g. MR|SAD|MAD|SIR|MIR|SIRAS|MIRAS)
-m | program for molecular replacement (e.g. CNS|AMORE|MOLREP|EPMR)
-p | program for heavy atom phasing (e.g. CNS|MLPHARE|SOLVE|SHARP|SHELXD|SnB|BnP)
-d | program for density modification (e.g. CNS|DM|SOLOMON|RESOLVE)
-r | program for final structure refinement (e.g. CNS|REFMAC5|RESTRAIN|SHELXL|TNT|WARP)
-s | program for reflection data scaling (only for refinement) (e.g. HKL/Scalepack, D*Trek, SAINT, SCALA, 3DSCALE)
-sp | program for reflection data scaling (only for phasing) (e.g. HKL/Scalepack, D*Trek, SAINT, SCALA, 3DSCALE)
-ilog | the input file with format corresponding to the program used
-ipdb | the input file with PDB format
-icif | the input file with mmCIF format
-ient | the input file data_template.text (for complete sequence)

extract [OPTION] [FILE]
Option | Argument descriptions
-pdb | input coordinate file name (PDB format)
-cif | input coordinate file name (mmCIF format)
-ext | input script file name log_script.inp
5. Appendices
5.1 An example of the data template file (data_template.text)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
THE DATA_TEMPLATE.TEXT FILE
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
NOTES AND REMINDER
The data template file contains data entries for unique chemical sequences
present in the structure and other non-electronically captured information.
PLEASE CHECK CATEGORIES 1 & 2: Before proceeding any further, make necessary
corrections here so that all information in these categories are complete
and correct.
You may choose to fill in CATEGORIES (3-18) either here or later in ADIT.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
GUIDELINES FOR USING THIS FILE
1. Only strings included between the 'lesser than' and 'greater than'
signs (<.....>) will be parsed for evaluation by the program. Therefore,
DO NOT write either on the left or right of the 'less than' and 'greater
than' signs respectively.
2. All alphanumeric values or strings that you include in the different
categories should be within double-quotes. Blank spaces or carriage
returns within a pair of double quotes are ignored by the program.
DO NOT use double quotes (") within strings that you enter.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
~~~~~~~~~~~~~~~~~~~~~~~~~~~~START INPUT DATA BELOW~~~~~~~~~~~~~~~~~~~~~~~
================CATEGORY 1: Crystallographic Data=======================
Enter crystallographic data
<space_group = "P3221 "> (use International Table conventions)
<space_group_number = "? ">
<unit_cell_a = " 120.831 " >
<unit_cell_b = " 120.831 " >
<unit_cell_c = " 185.222 " >
<unit_cell_alpha = " 90.00 " >
<unit_cell_beta = " 90.00 " >
<unit_cell_gamma = "120.00 " >
================CATEGORY 2: Sequence Information =======================
Enter one letter sequence for each polymeric entity in asymmetric unit
--------------------------------------------------------------------------
SOME DEFINITIONS
An ENTITY is defined as any unique molecule present in the asymmetric
unit. Each unique biological polymer (protein or nucleic acids) in the
structure is considered an entity. Thus, if there are five copies of
a single protein in the asymmetric unit, the molecular entity is still
only one. Water and non-polymers like ions, ligands and sugars are
also entities.
Here we only consider the sequences of polymeric entities (protein or
nucleic acid).
GUIDELINES FOR COMPLETING THIS CATEGORY
* In a PDB or mmCIF format file, all residues of a single polymeric
entity should have one chain ID. Multiple copies of the same entity
should each be assigned a unique chain ID. The multiple chain IDs
should be separated by commas as 'A,B,C,...'. If incorrect chain IDs
are used the entity groups extracted by this program will not be
correct. To avoid this, make necessary corrections in the PDB or mmCIF
file used to generate the data_template file and regenerate the
data_template.text file. Alternatively, edit the extracted sequence
in this file to correctly represent the sequence and chain IDs of each
polymeric entity.
* In addition to chain IDs, this program uses distance geometry to
assess if there are any breaks in the polymer sequence. These breaks
may occur due to missing residues (not included in the model due to
missing electron density) or due to poor geometry. Four question marks
'????' are used to denote these chain breaks. Replace these question
marks with the sequence of residues missing from the coordinates. Also
add any residues missing from the N- and/or C-termini here.
* If there are non-standard residues in the coordinates, this program
lists them according to the three letter code used in the coordinate
file as (ABC). If all the residues in your sequence are nonstandard,
check and edit the sequence manually to represent it correctly in this
file.
* If any residue was modeled as Ala or Gly due to lack of the side-chain
density, the sequence extracted here will represent them as A or G
respectively. Correct this to the original sequence that was present in
the crystal.
----------------------------------------------------------------------------
Below is the one
letter chemical sequence extracted from your PDB
coordinate file. The
molecular entities are grouped and listed
together.
PLEASE CHECK THE SEQUENCE of each entity carefully and modify it, as
necessary.
Make sure that you REVIEW THE FOLLOWING:
* chain breaks due to
missing residues,
* missing residues in
the N- and/or C-termini,
* non-standard
residues and
* cases of residues
modeled as Ala or Gly due to missing
side-chain
density.
<molecule_entity_id="1" >
<molecule_entity_type="polypeptide(L)" >
<molecule_one_letter_sequence="
SASFDGPKFK(MSE)TDGSYVQTKTIDVGSSTDISPYLSLIREDSILNGNRAVIFDVYWDVGF????TKTSGWSLSSV
KLSTRNLCLFLRLPKPFHDNLKDLYRFFASKFVTFVGVQIEEDLDLLRENHGLVIRNAINVGKLAAEARG
TLVLEFLGTRELAHRVLWSDLGQLDSIEAKWEKAGPEEQLEAAAIEGWLIVNVWDQLSDE"
>
<molecule_chain_id="A,B,C,D,E,F" >
Copy the following template to add information regarding more
entities:
<molecule_entity_id=" " >
<molecule_entity_type=" " >
<molecule_one_letter_sequence=" " >
<molecule_chain_id=" " >
================CATEGORY 3:
Contact Authors=============================
Enter information about the contact authors.
Information about the
Principal investigator (PI) should be given.
For principal investigator
<contact_author_PI_name = " ">
<contact_author_PI_email = " ">
<contact_author_PI_phone = " ">
<contact_author_PI_fax = " ">
<contact_author_PI_address = " ">
For other contact authors
<contact_author_name_1 = " ">
<contact_author_email_1 = " ">
<contact_author_phone_1 = " ">
<contact_author_fax_1 = " ">
<contact_author_address_1 = " ">
<contact_author_name_2 = " ">
<contact_author_email_2 = " ">
<contact_author_phone_2 = " ">
<contact_author_fax_2 = " ">
<contact_author_address_2 = " ">
...(add more if needed)...
================CATEGORY 4:
Release Status==============================
Enter release status for the coordinates, structure factors and sequence
Status should be chosen
from one of the following:
(release now, hold for
publication, hold for 6 months,
hold for 1 year)
<Release_status_for_coordinates = " ">
<Release_status_for_structure_factor = " ">
<Release_status_for_sequence = " ">
================CATEGORY 5:
Title=======================================
Enter the title for the structure
<structure_title = " ">
================CATEGORY 6:
Citation Authors============================
Enter citation authors (e.g. Surname, F.M.)
The primary citation is
the article in which the deposited coordinates
were first reported.
Other related citations may also be provided.
For the primary citation
<primary_citation_author_name_1 = " ">
<primary_citation_author_name_2 = " ">
<primary_citation_author_name_3 = " ">
<primary_citation_author_name_4 = " ">
<primary_citation_author_name_5 = " ">
...add more if needed...
For other related citations (if applicable)
<citation_1_author_name_1 = " ">
<citation_1_author_name_2 = " ">
<citation_1_author_name_3 = " ">
<citation_1_author_name_4 = " ">
<citation_1_author_name_5 = " ">
...add more if needed...
<citation_2_author_name_1 = " ">
<citation_2_author_name_2 = " ">
<citation_2_author_name_3 = " ">
<citation_2_author_name_4 = " ">
<citation_2_author_name_5 = " ">
...add more if needed...
...(add more citations if needed)...
================CATEGORY 7:
Citation Article============================
Enter citation article (journal, title, year, volume, page)
If the citation has not
yet been published, use 'To be published'
for the category
'journal_abbrev'. The order of citations in this
category should
correspond to that in CATEGORY 6.
For primary citation
<primary_citation_journal_abbrev = " ">
<primary_citation_title = " ">
<primary_citation_year = " ">
<primary_citation_journal_volume = " ">
<primary_citation_page_first = " ">
<primary_citation_page_last = " ">
For other related citation (if applicable)
<citation_1_journal_abbrev = " ">
<citation_1_title = " ">
<citation_1_year = " ">
<citation_1_journal_volume = " ">
<citation_1_page_first = " ">
<citation_1_page_last = " ">
<citation_2_journal_abbrev = " ">
<citation_2_title = " ">
<citation_2_year = " ">
<citation_2_journal_volume = " ">
<citation_2_page_first = " ">
<citation_2_page_last = " ">
...(add more citations if needed)...
================CATEGORY 8:
Molecule Names==============================
Enter the name of the molecule for each entity
The name of molecule
should be obtained from the appropriate
sequence database
reference, if available. Otherwise the gene name or
other common name of the
entity may be used.
e.g. HIV-1 integrase for
protein
RNA Hammerhead
Ribozyme for RNA
The number of entities
should be the same as in CATEGORY 1.
<molecule_name_1 = " "> (entity 1)
<molecule_name_2 = " "> (entity 2)
<molecule_name_3 = " "> (entity 3)
...(add more if needed)...
================CATEGORY 9:
Molecule Details============================
Enter additional information about each entity
Additional information
would include details such as fragment name
(if applicable),
mutation, and E.C. number.
For entity 1
<Molecular_entity_id_1 = " "> (e.g. 1, 2, ...)
<Fragment_name_1 = " "> (e.g. ligand binding domain, hairpin)
<Specific_mutation_1 = " "> (e.g. C280S)
<Enzyme_Commission_number_1 = " "> (if known: e.g. 2.7.7.7)
For entity 2
<Molecular_entity_id_2 = " ">
<Fragment_name_2 = " ">
<Specific_mutation_2 = " ">
<Enzyme_Comission_number_2 = " ">
For entity 3
<Molecular_entity_id_3 = " ">
<Fragment_name_3 = " ">
<Specific_mutation_3 = " ">
<Enzyme_Comission_number_3 = " ">
...(add more if needed)...
================CATEGORY 10:
Genetically Manipulated Source=============
Enter data in the genetically manipulated source category
If the biomolecule has
been genetically manipulated, describe its
source and expression
system here.
For entity 1
<Manipulated_entity_id_1 = " "> (e.g. 1, 2, ...)
<Source_organism_scientific_name_1 = " "> (e.g. Homo sapiens)
<Source_organism_gene_1 = " "> (e.g. RPOD, ALKA...)
<Expression_system_scientific_name_1 = " "> (e.g. Escherichia coli)
<Expression_system_strain_1 = " "> (e.g. BL21(DE3))
<Expression_system_vector_type_1 = " "> (e.g. plasmid)
<Expression_system_plasmid_name_1 = " "> (e.g. pET26)
<Manipulated_source_details_1 = " "> (any other relevant information)
For entity 2
<Manipulated_entity_id_2 = " ">
<Source_organism_scientific_name_2 = " ">
<Source_organism_gene_2 = " ">
<Expression_system_scientific_name_2 = " ">
<Manipulated_source_description_2 = " ">
For entity 3
<Manipulated_entity_id_3 = " ">
<Source_organism_scientific_name_3 = " ">
<Source_organism_gene_3 = " ">
<Expression_system_scientific_name_3 = " ">
<Manipulated_source_description_3 = " ">
...(add more if needed)...
================CATEGORY 11:
Natural Source=============================
Enter data in the natural source category
If the biomolecule was
derived from a natural source, describe
it here.
For entity 1
<natural_source_entity_id_1 = " "> (e.g. 1, 2, ...)
<natural_source_scientific_name_1 = " "> (e.g. Homo sapiens)
<natural_source_details_1 = " "> (any other relevant information
e.g. organ, tissue, cell ..)
For entity 2
<natural_source_entity_id_2 = " ">
<natural_source_scientific_name_2 = " ">
<natural_source_description_2 = " ">
For entity 3
<natural_source_entity_id_3 = " ">
<natural_source_scientific_name_3 = " ">
<natural_source_description_3 = " ">
...(add more if needed)...
================CATEGORY 12:
Keywords===================================
Enter a list of keywords that describe important features of the
deposited structure.
For example, beta
barrel, protein-DNA complex, double helix,
hydrolase, structural
genomics etc.
<structure_keywords = " ">
================CATEGORY 13:
Biological Assembly========================
Enter data in the biological assembly category
Biological assembly
describes the functional unit(s) present in the
structure. The asymmetric unit may contain
part of a biological assembly, exactly one assembly,
or more than one biological assembly.
Case 1
* If the asymmetric unit
is the same as the biological assembly
nothing special needs
to be noted here.
Case 2
* If the asymmetric unit
does not contain a complete biological unit,
please provide the
symmetry operations, including translations, required
to build the
biological unit.
(example:
The biological
assembly is a hexamer generated from the dimer
in the asymmetric unit
by the operations: -y, x-y-1, z-1 and
-x+y, -x-1, z-1.)
Case 3
* If the asymmetric unit
has multiple biological units,
please specify how to
group the contents of the asymmetric unit into
biological units.
(example:
The biological unit is
a dimer. There are 2 biological units in the
asymmetric unit
(chains A & B and chains C & D).)
For biological unit 1
<biological_assembly_1 = " ">
For biological unit 2
<biological_assembly_2 = " ">
....(add more if needed)....
================CATEGORY 14:
Crystals===================================
Enter the number of crystals used for diffraction
<number_of_crystals = " ">
================CATEGORY 15:
Methods and Conditions=====================
Enter the crystallization conditions for each crystal
For crystal 1:
<crystal_number_1 = " "> (e.g. 1, 2, ...)
<crystallization_method_1 = " "> (e.g. vapor diffusion, hanging drop)
<crystallization_pH_1 = " "> (e.g. 7.5 ...)
<crystallization_temperature_1 = " "> (e.g. 298) (in Kelvin)
<crystallization_components_1 = " "> (e.g. PEG 4000, NaCl etc.)
For crystal 2:
<crystal_number_2 = " ">
<crystallization_method_2 = " ">
<crystallization_pH_2 = " ">
<crystallization_temperature_2 = " ">
<crystallization_components_2 = " ">
...(add more if needed)...
================CATEGORY 16:
Crystal Property===========================
Enter details about the crystals used
Include additional information
about the crystals used
for example: solvent
content, Matthews coefficient
For crystal 1:
<crystals_number_1 = " "> (e.g. 1, 2, ...)
<crystals_solvent_content_1 = " "> (e.g. 63.7 )
<crystals_matthews_coefficient_1 = " "> (e.g. 2.5 ...)
For crystal 2:
<crystals_number_2 = " ">
<crystals_solvent_content_2 = " ">
<crystals_matthews_coefficient_2 = " ">
...(add more if needed)...
================CATEGORY 17:
Radiation Source===========================
Enter the details of the source of radiation, the X-ray generator,
and the wavelength for each diffraction experiment.
For experiment 1:
<radiation_experiment_1 = " "> (e.g. 1, 2, ...)
<radiation_source_1 = " "> (e.g. rotating-anode, synchrotron ...)
<radiation_source_type_1= " "> (e.g. Rigaku RU200, CHESS Beamline A1 ...)
<radiation_wavelengths_1= " "> (e.g. 1.502 ...)
<radiation_protocol_1= " "> (e.g. MAD, SINGLE WAVELENGTH ...)
<radiation_detector_1 = " "> (e.g. CCD, IMAGE PLATE ...)
<radiation_detector_type_1= " "> (e.g. SIEMENS-NICOLET, RIGAKU RAXIS ...)
For experiment 2:
<radiation_experiment_2 = " ">
<radiation_source_2 = " ">
<radiation_source_type_2 = " ">
<radiation_wavelengths_2 = " ">
<radiation_protocol_2= " ">
<radiation_detector_2 = " ">
<radiation_detector_type_2= " ">
....(add more if needed)....
================CATEGORY 18:
Collection Temperature=====================
Enter the Temperature for data collection (in Kelvin)
<collection_temperature_crystal_1 = " "> (for crystal 1)
<collection_temperature_crystal_2 = " "> (for crystal 2)
....(add more if needed)....
=====================================END==================================
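The entity definition used in the sequence category (several identical copies of a polymer count as one entity, listed with chain IDs such as 'A,B,C,...') can be illustrated with a short Python sketch. This is only an illustration of the definition, not the method pdb_extract itself uses (the program relies on chain IDs and distance geometry). The function name group_entities and the sample sequences are hypothetical.

```python
from collections import defaultdict


def group_entities(chain_sequences):
    """Group chain IDs that share an identical sequence into one entity.

    chain_sequences maps a chain ID (e.g. "A") to its one-letter sequence.
    Returns a list of dicts, one per unique sequence, in first-seen order.
    """
    groups = defaultdict(list)
    for chain_id, sequence in chain_sequences.items():
        groups[sequence].append(chain_id)

    entities = []
    for entity_id, (sequence, chains) in enumerate(groups.items(), start=1):
        entities.append({
            "entity_id": entity_id,
            "sequence": sequence,
            # Chain IDs are written comma-separated, as in the template.
            "chain_ids": ",".join(sorted(chains)),
        })
    return entities


# Hypothetical example: two copies of one protein plus one nucleic acid.
chains = {"A": "SASFDGPKFK", "B": "SASFDGPKFK", "C": "GATTACA"}
entities = group_entities(chains)
for entity in entities:
    print(entity["entity_id"], entity["chain_ids"], entity["sequence"])
```

A comparison like this can help when checking that the number of entities in the sequence category matches the number of molecule names entered in CATEGORY 8.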
5.2 An example of the script input file (log_script.inp)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
THE LOG_SCRIPT.INP FILE
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
NOTES AND REMINDER
This script file is used to enter the names of the crystallographic
software used for structure determination and the log, PDB, mmCIF or
text files generated by them.
PLEASE COMPLETE the ENTRY FIELDS according to the type of your experiment
and use the command 'extract -ext log_script.inp' to obtain the completed
structure data ready for validation and deposition.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
GUIDELINES FOR USING THIS FILE
1. Only strings enclosed
between the 'less than' and 'greater than'
signs (<.....>)
are parsed by the program. Therefore,
DO NOT write anything to
the left of a 'less than' sign or to the right of a
'greater than' sign.
2. All alphanumeric values
or strings that you include in the different
categories should be
within double-quotes. Blank spaces or carriage
returns within a pair of
double quotes are ignored by the program.
DO NOT use double quotes
(") within strings that you enter.
3. Log files used for
generating the deposition should be generated from
the best (usually the
last) trial for each crystallographic software.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
~~~~~~~~~~~~~~~~~~~~~~~~~~~~START INPUT DATA
BELOW~~~~~~~~~~~~~~~~~~~~~~~
===============PART 1: Structure Factor for Final
Refinement==============
Enter reflection data file used for final structure refinement
NOTE:
* Usually the highest resolution or best data
set is used for the
refinement. Use that
structure factor file here.
* In some cases, it may not be possible to
collect a complete dataset
from a single crystal.
Thus, multiple data sets have to be scaled
and merged together
for refinement. Use the merged reflection file
here.
* If the reflection data format is not one of
those listed below,
please use OTHER for
the data format, and provide an ASCII file
that has at least five values [H, K, L, I
(or F), sigmaI (or sigmaF)]
for each reflection.
Include the test flags as the sixth column
in the file (if
available).
* If the reflection file is in mtz format,
convert it to mmCIF format
using the mtz2various
application provided by CCP4. This can be
Reflection data
format:
CNS|SHELX|TNT|REFMAC5|HKL|SCALEPACK|DTREK|SAINT|3DSCALE
<reflection_data_type = "F" > (enter I (intensity) or F (amplitude))
<reflection_data_format = "CNS" >
<reflection_data_file_name = " " >
==============PART 2: Structure Factors for Protein
Phasing================
Enter reflection data files used for heavy atom or MAD phasing
NOTE:
* Complete this category if you have more than one
complete reflection
file (e.g. in the case
of MAD, SIRAS, MIR). The LOG files generated
by the data scaling
software for all these data sets are also needed.
* If the scaling program is not one of those
listed below
(HKL|SCALEPACK|DTREK|SAINT|3DSCALE),
enter OTHER for the program
name and provide an
ASCII file with at least five values
[H, K, L, I (or F),
sigmaI (or sigmaF)] for each reflection.
* If the same crystal was used for collecting
multiple data sets, the
crystal number will
remain '1' as the wavelength numbers change.
However, if multiple
crystals were used for the data collections,
the corresponding
crystal numbers should be used for each data set.
<scale_data_type = "I" > (enter I (intensity) or F (amplitude))
<scale_program_name = "HKL" >
For data set 1:
<crystal_number = "1" >
<diffract_number = "1" >
<scale_data_file_name_1 = " " >
<scale_log_file_name_1 = " " >
For data set 2:
<crystal_number_2 = "1" >
<diffract_number_2 = "2" >
<scale_data_file_name_2 = " " >
<scale_log_file_name_2 = " " >
For data set 3:
<crystal_number_3 = "1" >
<diffract_number_3 = "3" >
<scale_data_file_name_3 = " " >
<scale_log_file_name_3 = " " >
==================PART 3: Statistics for Data
Scaling=====================
Enter log file and software name for data scaling
NOTE:
* The log file included here should have
scaling statistics of
the file used for the
final structure refinement. If multiple data
sets were scaled and
merged for refinement (as described in Part 1
above) use the log
file generated during merging of the data sets.
* While SCALA produces an mmCIF format file with
the scaling statistics,
most other programs
produce ASCII LOG files with this information.
Software for scaling
is one of the following:
(HKL|SCALEPACK|DTREK|SAINT|3DSCALE|SCALA)
<data_scaling_software = "HKL" >
<data_scaling_LOG_file_name = " " >
<data_scaling_CIF_file_name = " " > (in mmcif format)
==============PART 4: Statistics for Molecular
Replacement================
Enter log files and software name for molecular replacement
NOTE:
Software is one of the
following:
(CNS|AMORE|MOLREP|EPMR)
The log file should be
from the best trial of MR.
<mr_software = " " >
<mr_log_file_LOG_1 = " " >
<mr_log_file_LOG_2 = " " >
=================PART 5: Statistics for Protein
Phasing===================
Enter log files and software name for heavy atom phasing
NOTE:
Software is one of the
following:
(CNS|MLPHARE|SOLVE|SHELXS|SHELXD|SNB|BNP|SHARP|PHASES)
The log file should be
from the best trial of phasing.
<phasing_method = "MAD" > (SAD|MAD|SIR|SIRAS|MIR|MIRAS)
<phasing_software = "SOLVE" >
<phasing_log_file_LOG_1 = " " >
<phasing_log_file_PDB_1 = " " > (in PDB format)
<phasing_log_file_CIF_1 = " " > (in mmCIF format)
<phasing_log_file_LOG_2 = " " >
<phasing_log_file_PDB_2 = " " >
<phasing_log_file_CIF_2 = " " >
... add more if needed ...
===============PART 6: Statistics for Density
Modification================
Enter log files and software name for density modification
NOTE:
Software is one of the
following:
(CNS|DM|RESOLVE|SOLOMON|SHELXE|SHARP)
The log file should be
from the best trial of density modification.
<dm_software = "RESOLVE " >
<dm_log_file_LOG_1 = " " >
<dm_log_file_CIF_1 = " " > (in mmCIF format)
===============PART 7: Statistics for Structure
Refinement================
Enter log files and software name used for final structure
refinement
NOTE:
Software is one of the
following:
(CNS|REFMAC5|SHELXL|TNT|PROLSQ|RESTRAIN)
The log file should be
from the final trial of structure refinement.
<refine_software = "REFMAC5" >
<refine_log_file_PDB_1 = " " > (coordinate file in PDB format)
<refine_log_file_CIF_1 = " " > (LOG file in mmCIF format)
<refine_log_file_LOG_1 = " " >
<refine_log_file_PDB_2 = " " >
<refine_log_file_CIF_2 = " " >
<refine_log_file_LOG_2 = " " >
=======================PART 8: Data Template
File=========================
Enter file name of the data template file
NOTE:
This file
'data_template.text' was generated by using the
command 'extract -pdb
pdb_file' or 'extract -cif cif_file'. It
contains the sequences
of all unique polymers (protein or nucleic
acid) present in the
structure. It also contains other
non-electronically
captured information. Please complete the
data template file
before running pdb_extract.
<data_template_file = "data_template.text" >
=====================================END==================================
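The entry convention described in the guidelines of this file (only text between '<' and '>' is parsed, and values are given in double quotes) can be checked before running the program. The Python sketch below is a hypothetical pre-check, not part of pdb_extract: the regular expression, the function names parse_entries and blank_entries, and the decision to collapse whitespace inside the quotes are all assumptions made for illustration.

```python
import re

# Assumed entry shape: <token_name = "value" >, possibly split over lines.
ENTRY = re.compile(r'<\s*(\w+)\s*=\s*"([^"]*)"\s*>')


def parse_entries(text):
    """Return a dict of token name -> value for every <name = "value"> entry.

    Text outside the angle brackets (notes, examples) is skipped.
    Whitespace inside the quotes is collapsed (an assumption).
    """
    entries = {}
    for name, value in ENTRY.findall(text):
        entries[name] = " ".join(value.split())
    return entries


def blank_entries(entries):
    """List the token names whose value was left empty."""
    return [name for name, value in entries.items() if not value]


sample = '''
Enter reflection data file used for final structure refinement
<reflection_data_type = "F" > (enter I (intensity) or F (amplitude))
<reflection_data_format = "CNS" >
<reflection_data_file_name = " " >
<scale_data_file_name_1 =
" " >
'''

entries = parse_entries(sample)
print("filled:", {k: v for k, v in entries.items() if v})
print("blank: ", blank_entries(entries))
```

Listing the blank entries in this way is a quick reminder of which file names still need to be supplied for the parts that apply to the experiment.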
Westbrook, J., Feng, Z., Burkhardt, K. & Berman, H. M. (2003). Meth. Enz. 374, 370-385.
7. Frequently Asked Questions
Q. What does pdb_extract do?
A. pdb_extract can read in
log files from various crystallographic applications, coordinates and structure
factor files to automatically extract relevant information regarding the data
reduction, scaling, heavy atom phasing, molecular replacement, density
modification and final structure refinement. This program can combine all this
information to prepare an mmCIF format file for validation and deposition to the
PDB.
Q. What should I do if the program that I used for solving the structure
is not supported by pdb_extract?
A. If the program generates a coordinate file in the PDB format and
any log files in mmCIF format, include these files and the name of the program
and pdb_extract should be able to
prepare a deposition file for you. Please send the name of the unsupported
program, any other relevant details about it and its log file to
help@rcsb.rutgers.edu. We will add this program to our list of supported
applications.
Q. I included all the appropriate file names in the log_script.inp
file but the program does not run to completion. What should I do?
A. Check the data template file to make sure that there are no ‘????’
in the sequence of the polymers. This represents a break in the chain due to
missing residues. Edit the sequence information appropriately to ensure that
all residues that were not modeled due to missing density or residues that were
modeled as Ala or Gly due to missing side chain density have been appropriately
corrected.
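A quick check for unresolved chain breaks can be scripted before running the program. The Python sketch below is a hypothetical helper, not part of pdb_extract: it assumes the sequence notation shown in the data template (one-letter codes, '????' for a chain break, and a three-letter code in parentheses such as (MSE) for a non-standard residue). The function name check_sequence and the sample sequence fragment are illustrative only.

```python
import re

# One token is a chain break (????), a parenthesised non-standard
# residue such as (MSE), or a single one-letter residue code.
TOKEN = re.compile(r"\?{4}|\([A-Za-z0-9]{1,3}\)|[A-Z]")


def check_sequence(sequence):
    """Summarise a one-letter sequence copied from the data template."""
    text = "".join(sequence.split())  # drop spaces and line breaks
    tokens = TOKEN.findall(text)
    return {
        "chain_breaks": tokens.count("????"),
        "nonstandard": [t.strip("()") for t in tokens if t.startswith("(")],
        "residues": sum(1 for t in tokens if t != "????"),
    }


report = check_sequence("SASFDGPKFK(MSE)TDGS\nYWDVGF????TKTSGW")
if report["chain_breaks"]:
    print("Replace", report["chain_breaks"], "chain break(s) with the missing residues")
print("Non-standard residues:", report["nonstandard"])
print("Residues counted:", report["residues"])
```

A non-zero chain break count means the sequence still contains '????' and should be edited as described in the answer above.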
Q. All the residues in my file are non-standard or modified. Will
pdb_extract be able to extract the sequence from the coordinate file?
A. pdb_extract can recognize and extract the sequence of polymers (protein or
nucleic acid) including some non-standard residues. The non-standard or
modified residues are denoted by their 3 letter code as '(MSE)' for
selenomethionine. However, if all the residues in the polymer are non-standard,
the program may fail to get a correct register for the sequence. Thus it is
recommended that in such cases the entity_poly (sequence) category should be
manually edited in the data_template.text file to ensure that the sequence
included is complete and correct.
Q. I am behind a firewall so I cannot use the web version of ADIT.
How do I complete my deposition?
A. Please use either the validation server (web or desktop versions)
or the command line option for validating the files that you prepared using pdb_extract. You can email the
validated coordinate and structure factor files to deposit@rcsb.rutgers.edu or
ftp them to pdb.rutgers.edu.
Q. It will probably take me a really long time to complete solving
this structure. Why should I bother with pdb_extract right now?
A. pdb_extract can help you
keep track of all the relevant information from the different stages of
structure solution required for depositing the structure. Apply pdb_extract to the output and log files
of each step of structure determination (scaling, molecular replacement,
density modification etc.). Finally you can combine all these output files
(using the -icif cif_file_name
option) to generate an mmCIF format file that contains all the information
regarding the different stages of structure solution.