topp (CCP4: Supported Program)

NAME

topp - an automatic topological and atomic comparison program for protein structures

SYNOPSIS

topp
[Keyworded input]

top3d foo_1.pdb foo_2.pdb

topsearch foo_1.pdb

AUTHOR

Author: Guoguang Lu
Div. of Molecular Structural Biology
Dept. of Medical Biochemistry and Biophysics,
Karolinska Institute, Stockholm, 17 177, Sweden
E-mail: Guoguang.Lu@mbfys.lu.se

NOTES ON CCP4 VERSION

Note: TOPP has been renamed from the original TOP to avoid a clash with the UNIX command of that name.

TOPP can be run directly using the command topp with Keyworded input, or via the script top3d which takes two file names as arguments and program parameters from the file $CLIBD/TOP.PARM (see examples section). A search with one file against a database of structures can be done using the script topsearch which takes one file name as argument and program parameters from the file $CLIBD/SEARCH.PARM (see examples section).

Use of the browser facility to search a Protein Data Bank site requires two commands to be on the user's path, namely wget and pdbhtf. The latter is part of the CCP4 suite and should have been compiled and installed. On the other hand, wget is not part of CCP4, but is a GNU program available via internet from the usual GNU sites.

DESCRIPTION

TOP is a protein TOPological comparison program which detects structural similarities between two proteins. It superimposes two protein structures automatically, without any prior knowledge of a sequence alignment. The program can be used to find out whether a newly determined protein structure is similar to any structure in the Protein Data Bank, and to rank the homologous proteins according to their topological and structural diversities (similarities). The program (version 6 or higher) can browse data directly from the Protein Data Bank or its mirror sites via the internet, so that users can search the most recent data without regularly downloading the whole database to their local disks. The program has a 3DB browser interface, so it can perform a rapid structure similarity search if users limit the search range by sequence, keywords, resolution, date or other restraints. This makes TOP convenient for modelling homologous proteins and for automatically tracking newly released structures of special interest without reading the literature.

TOP is designed to be user friendly. For example, once the program is properly set up on a Unix computer, users can use simple commands such as top3d file1 file2, so that the coordinates in file2 are automatically superimposed onto file1. The program recognizes Protein Data Bank (PDB) entry codes. For example, if the second molecule is 2cnd in the PDB, the user can just type top3d file1 2cnd@pdb, and the program will fetch the coordinates of 2cnd to the local disk and perform the comparison. If a user wants to know whether a structure in a file is similar to any structure in the PDB, they can type topsearch file.pdb, and the program will output a list of PDB codes ranked according to 3D structure similarity. The user can then type top3d file.pdb code@pdb to superimpose the coordinates of interest onto the probe model. The program can detect sequence permutation and can be used for special purposes, such as motif searching.

The program runs two steps in each structure comparison. In the first step, the topologies of the secondary structures in the two proteins are compared. The program uses two points to represent each secondary structure element (SSE; alpha helix or beta strand), then systematically searches all possible superpositions of these elements between the two protein structures. Once a pair of elements in the two structures fit each other in 3D space (that is, the rms, the angle between the two lines formed by the two points, and the line-line distance are all smaller than the given values), the program searches for further secondary structure elements that fit under the same superposition operation. If the number of fitting secondary structures exceeds a given number, the program declares the two structures similar, outputs the names of the secondary structures that correspond to each other in the two proteins, and outputs the superimposed coordinates. It also outputs a matrix with which one molecule can be rotated and translated onto the other. The program outputs a comparison score called "Topological Diversity", which takes into account both the rate of matching SSEs and the structural difference of the representing points. In database searching, this parameter can be used to rank the topological similarities of SSEs.

When Ca atoms are available, the program can run a second step, which finds an alignment based on the Ca atoms of all residues starting from the initial comparison matrix, and improves the comparison matrix based on the superposition of the newly aligned Ca atoms. The procedure is iterated until the number of matching residues converges. The program is able to handle sequence permutation in the superpositions. From both the r.m.s. deviation and the number of matching residues, the program calculates a "Structure Diversity" score, which can be used to rank the structural differences of homologous proteins.
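
The geometric quantities tested in the first step can be illustrated with a short sketch. It represents each SSE by its two end points and computes the angle between the two axes and a line-line distance. The function names, the example coordinates and the choice of the shortest distance between the two infinite lines as the "line-line distance" are assumptions made for illustration; they are not taken from the TOP source, and the rms criterion is left out.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def axis_angle(sse_a, sse_b):
    """Angle in degrees between the axes of two SSEs, each a (start, end) pair of 3D points."""
    u = _sub(sse_a[1], sse_a[0])
    v = _sub(sse_b[1], sse_b[0])
    cos_t = _dot(u, v) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def line_line_distance(sse_a, sse_b):
    """Shortest distance between the infinite lines through the two axes (assumed definition)."""
    u = _sub(sse_a[1], sse_a[0])
    v = _sub(sse_b[1], sse_b[0])
    w = _sub(sse_b[0], sse_a[0])
    n = _cross(u, v)
    if _norm(n) < 1e-9:
        # Parallel axes: fall back to the distance from a point to a line.
        return _norm(_cross(w, u)) / _norm(u)
    return abs(_dot(w, n)) / _norm(n)

def sse_pair_fits(sse_a, sse_b, max_angle, max_dll):
    """True if both the axis angle and the line-line distance are below the given limits."""
    return (axis_angle(sse_a, sse_b) < max_angle
            and line_line_distance(sse_a, sse_b) < max_dll)

# Hypothetical example: two helices after a trial superposition.
helix_1 = ((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
helix_2 = ((0.0, 1.0, 0.0), (10.0, 1.0, 0.0))
print(axis_angle(helix_1, helix_2), line_line_distance(helix_1, helix_2))
```

The limits passed to sse_pair_fits play the role of the "given values" mentioned above; the numbers a user would actually choose are set by the keywords described later (for example ERRANG and ERRDLL).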

Use of a SSE database

The optimized way of database searching in TOP is to use a library of Secondary Structure Elements (SSEs). This can be created from a set of PDB files with the command MAKEVEC (see below).

The compact SSE library is updated automatically at the Karolinska Institute every week. It includes not only the structures currently released in the Protein Data Bank, but also compact SSE databases of the independent family and super-family structures classified in the SCOP database, for efficient similarity searching. It can be obtained from ftp://gamma.mbb.ki.se/pub/guoguang/sndlib.tar.Z . After you have fetched this TAR file by FTP and saved it to your local disk as, for example, /dir/sndlib.tar.Z, use the following commands:


cd $TOPHOME
zcat /dir/sndlib.tar.Z | tar -xvf -

This gives you the most recently updated SSE databases.

Keyworded Input

The parameters of the TOP program can be controlled by different lines of text, each of which is a "keyword command". Any command line which starts with "!" will be ignored. Available keywords are:

Location of protein coordinates
LIBDIR, MAKEVEC, MOL1, MOL2, MOLVEC, PDBSITE, SERVER, SITE, WEBSITE
Input for structure comparisons
AMPLIFY, APPEND, BTARMS, DIRWEIGHT, DISTANCE, ERRANG, ERRDLL, ERRRMSA, FAST, HLXRMS, MATCH, MAXANG, MAXDLL, MULTIPLE, REFWEIGHT, RESIDUE, SINGLE/NOSINGLE, SND1, SND2, WRITE
3DB interface
3DBAFTER, 3DBBEFORE, 3DBFASTA, 3DBHET, 3DBKEYWORD, 3DBLOWER, 3DBRESOLUTION, 3DBSEQ, 3DBSITE, 3DBTEXT, 3DBUPPER

Keywords for location of protein coordinates

The TOP program can compare two structures, or search for similarities in a database by comparing one structure with a group of other structures. The MOL1 command specifies the data location of one molecule (called Molecule 1), while the commands MOL2, LIBDIR, MOLVEC, PDBSITE or WEBSITE specify the data location of the other molecule (called Molecule 2) or the other molecules (called the database).

TOP can read 3D coordinates of protein structures in "Brookhaven" (PDB) format from the user's local disk, from CD-ROM or via the internet. For structure similarity searching, there are many ways to read the data. The recommended setup is to search an automatically updated secondary structure element (SSE) library (see automatic updating of SSE library and MOLVEC). In this way the program can search the most recent database through the compact SSE library and fetch the detailed coordinates of only those structures which are found to be similar to molecule 1. This is considerably faster and requires no regular database maintenance after setup.

MOL1

If you don't have the coordinates on your local disk and wish to read them directly from a Web site by giving a PDB entry code, give the filename in the form code@pdb in this command, for example: MOL1 2cnd@pdb. The program will use the code to fetch the coordinates from a PDB mirror site or another web site, whose URL is specified in the PDBSITE or WEBSITE commands.

MOL2 Coordinate_file_name or @List_file_name or @URL_address [zone]
This command determines whether the program compares two structures or does a similarity search in the Protein Data Bank. If the filename is something like 2cnd.pdb or 2cnd@pdb, the program simply superimposes the two structures and gives the sequence comparison.

If the second text string in the command starts with @ and the remaining text does not start with http: or ftp:, the remaining text is taken to be the name of a List_file, which lists the names of a number of coordinate files, such as:


/nfs/protein/pdb/current_release/uncompressed_files/00/pdb200d.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb200l.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb300d.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb100d.ent
....

If the PDBSITE or WEBSITE command is given before this command, or the LIBDIR command specifies a directory name which contains "current_release", the List_file can be a list of PDB entry codes, such as:


200d			|		|	pdb200d.ent
200l			|	or 	|	pdb200l.ent
300d			|		|	pdb300d.ent
.... 			|		|	....

This list of PDB codes can be obtained from the "3DB browser" at the Protein Data Bank or from other bioinformatics tools outside the program. It allows TOP to search a chosen group of structures for a special purpose.

LIBDIR directory_name
If the program is searching a number of coordinate files (see MOL2) and those files are all in the same directory, the user can indicate the directory in which the coordinate files are located. For example, if users have pdb200d.ent pdb3001.ent ... in the /nfs/pdb/all_entries/uncompressed_files/ directory, they can use the UNIX command ls -1 /nfs/pdb/all_entries/uncompressed_files/ > allpdb.lis to create a list file, which will look something like:


pdb100d.ent
pdb101d.ent
pdb101m.ent
pdb102d.ent
pdb102l.ent
...


libdir /nfs/pdb/all_entries/uncompressed_files/
mol2 @allpdb.lis

Alternatively, one can use UNIX command


find /directory_name/ -name "*.ent" -print > pdball.lis

LIBDIR

If the directory name in the LIBDIR command contains the substring ".../current_release/uncompressed_files", the program assumes that this directory is organised like the "current_release" directory of the Protein Data Bank, i.e. PDB entries are distributed over subdirectories whose names correspond to the 2 middle characters of the PDB id code, e.g.


...pub/pdb_data/current_release/uncompressed_files/00
...pub/pdb_data/current_release/uncompressed_files/zy


100d				pdb100d.ent
100e		or 		pdb100e.ent
.....                           ....
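
The mapping from a PDB entry code to a file in a "current_release" style tree can be sketched as follows. The function name and the root directory are hypothetical; the pdbXXXX.ent naming and the use of the two middle characters of the code follow the examples in this section.

```python
import posixpath

def current_release_path(root, code):
    """Build the expected file path for a 4-character PDB entry code."""
    code = code.lower()
    if len(code) != 4:
        raise ValueError("expected a 4-character PDB entry code")
    subdir = code[1:3]  # the 2 middle characters, e.g. "00" for 200d
    return posixpath.join(root, subdir, "pdb" + code + ".ent")

# Hypothetical root directory, for illustration only.
root = "/nfs/protein/pdb/current_release/uncompressed_files"
print(current_release_path(root, "200d"))
```

For the code 200d this gives a path ending in 00/pdb200d.ent, the same layout as the file names listed under MOL2.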

If the text after the first character "@" starts with "http:", the program assumes there is a 3DB browser at this URL and tries to get a list of currently released entries. (This command is not necessary if the PDBSITE command is present.)

If the text after the first character "@" starts with "ftp:", the program lists all the files under the directory. This can be used for an anonymous ftp site in which one directory contains all the coordinate entries (such as the old PDB directory .../all_release/compressed_files/*.pdb ). In this form, however, all the PDB files must be in one directory, not distributed over sub-directories.

PDBsite URL_address
[default: http://www.pdb.bnl.gov]
This command specifies the URL of one of the official mirror sites of the Protein Data Bank. Given a recognized mirror site, the program can browse the most recent data in the PDB. A collection of URLs which have been tested with the program is listed at http://gamma.mbb.ki.se/~guoguang/webtop/pdb_url_collect.html. For efficient and fast data browsing, users should choose a site in or close to their own country.

If this command is given, the commands WEBSITE, PDBSITE and LIBDIR need not be present.

WEBsite URL_address (or SITE or SERVER)
Sometimes users prefer to read data from a Web site other than a standard PDB site (for example, a laboratory on the same campus or in the same city). In that case, use WEBSITE instead of PDBSITE, for example:


WEBSITE http://pdb.pdb.bnl.gov/	or http://www.rcsb.org/pdb/ 
WEBSITE  ftp://ftp.ebi.ac.uk/pub/databases/msd/pdb_uncompressed/ 

WEBSITE  ftp://gamma.mbb.ki.se/pub/pdb/current_release/uncompressed_files

Protein Data Bank Quarterly Newsletter

http://gamma.mbb.ki.se/~guoguang/webtop/url_collect.html

If it is an FTP site and the directory name contains the sub-string "current_release", the program can automatically find the PDB entries in the sub-directories. Otherwise, it assumes that all the files are in the directory given as the argument of this command.

MOLVEC vector_file_name
Instead of reading all the PDB files in the Protein Data Bank, the TOP program can use a compact database, which is a library of the secondary structures of each protein. This command gives the filename of the database, so that the program can perform the topological comparisons based on secondary structures. If the WEBSITE or LIBDIR commands are also present, the program first performs the rapid topological search in the compact database. Once a similar structure with a PDB entry code is found in the database, TOP fetches the PDB file from the internet or local disk and performs the comparison based on Ca atoms. For users who search the database repeatedly, this is the fast and efficient way, because it saves the time otherwise spent repeatedly fetching files and assigning secondary structures.

The MAKEVEC command can be used to update the compact database so that it follows the most recent changes in the Protein Data Bank. The updated database can also be obtained via the Web (see example 3).

MAKEVEC output_database_filename pdb_list_file_name [format]
If this command is present, the SSE library mentioned above is made. The program can read coordinates either from local disk/CD, as specified by LIBDIR, or via the internet, as specified by PDBSITE or WEBSITE. The first argument of this command is the name of the output SSE library file. The second argument is the name of a List_file (as in the MOL2 command), which can contain either a list of file names or PDB entry codes. If the third (format) argument is ZONE or SCOP, the program assumes that the second column in the pdb_list_file specifies the residue range (see ZONE1 and ZONE2 keywords), while the first column specifies the PDB code or file name of the structure.


example:
MAKEVEC sndnew.vec pdb.list


101l.pdb
102l.pdb
103l.pdb
104l.pdb
....


example: 
PDBSITE http://www2.ebi.ac.uk
MAKEVEC sndnew.vec

example:
MAKEVEC snd.vec ftp://pdb.pdb.bnl.gov/pub/pdb/all_entries/compressed_files/


example:
MAKEVEC snd.vec ftp://gamma.mbb.ki.se/pub/guoguang/scop_family.lis scop


3sdh             a:  1.001.001.001.001.001   d3sdha_
1phn             a:  1.001.001.001.002.001   d1phna_
1grj           2-79  1.001.002.001.001.001   d1grj_1
....

example: makevec.com
# for PDB on local disk
$LUEXE/top << 'end-top'
LIBDIR /nfs/protein/pdb/current_release/
MAKEVEC sndlib.vec pdblist.txt
'end-top'
#


cd /nfs/pdb/full/
ls -1 *.pdb > /nfs/ylgs/guoguang/pdblist.txt

In fact, the keywords 3DBBEFore and 3DBAFTer, together with MAKEVEC, make it possible to build an SSE library of newly released structures automatically and append it to the old one. This should be very quick.

Keyword Input for structure comparisons

MATCH


example: MATCH RATE 0.35 0.8
	 MATCH auto 		[DEFAULT]
	 MATCH 5

RAT1 is the minimum matching rate of secondary structures. The program takes either the smaller number of secondary structures in the two molecules (comparing mode) or the number of secondary structures in mol1 (searching mode) and multiplies it by rat1. If the number of matching secondary structures in the two compared proteins reaches this value, the program considers the two structures similar. For example, if mol1 has 12 secondary structures, mol2 has 10, and rat1 is 0.5, the program considers the two structures similar when 5 secondary structures match each other in comparing mode (or 6 in searching mode).

AUTO is equivalent to RATE 0.35 0.8
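
The arithmetic behind the first RATE argument can be sketched as follows. The function name is hypothetical, and rounding up is an assumption, since the text does not say how fractional values are handled. The second RATE argument is not described here and is not modelled.

```python
import math

def min_matching_sses(n_sse_mol1, n_sse_mol2, rat1, searching=False):
    """Number of matching SSEs needed before two structures are called similar."""
    # Searching mode uses the SSE count of mol1; comparing mode uses the smaller count.
    base = n_sse_mol1 if searching else min(n_sse_mol1, n_sse_mol2)
    # Rounding up is an assumption; the documentation does not specify it.
    return math.ceil(base * rat1)

# The example from the text: mol1 has 12 SSEs, mol2 has 10, rat1 is 0.5.
print(min_matching_sses(12, 10, 0.5))                  # comparing mode: 5
print(min_matching_sses(12, 10, 0.5, searching=True))  # searching mode: 6
```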

Alternatively, users can give this number directly, by estimating before running the program the minimum number of secondary structures that can match each other. It has to be lower than the real number. If the number is overestimated, the program will fail to superimpose two similar structures. Underestimating is usually OK. However, if the value is too low (for example 3), the program might superimpose a motif instead of the overall structures. This can give many alternative superpositions, many of which are of no real interest to the user. In database searching, a severely underestimated value can also slow the search down unnecessarily.

If users have no idea how to set this parameter, they can start either with 5 or with 30%-50% of the number of secondary structures in molecule 1 (use RATE). This will be successful in 95% of cases. If the comparison fails, look at the Hint section to see how to fix the problem.

RESIDUE lstres
LSTRES is the minimum number of residues in a consecutive fragment of protein. Default is 3. If lstres is smaller than or equal to 0 the program only compares the structures based on SSEs. In this case, no superimposed coordinates will be output. If lstres is larger than 0, the program will improve the comparison based on Ca atoms. When all Ca atoms in a fragment with more than LSTRES (usually 3) consecutive residues in one protein are closest to a fragment in the other protein and all the distances are smaller than DSTMIN, all the Ca atoms in these two corresponding fragments will be included in the superposition calculations. The rms and sequence comparison will be presented.

DISTance dstmin
[Default 3.8]
This value represents the maximum distance between the Ca atoms of matched residues (see RESIDUE). If dstmin is more than 3.0, the exact value is not very important, because of the rule that the Ca atoms of matched residues must be closest to each other. A value between 3 and 7 usually does not change which residues match each other in the comparisons.

WRITE
If this statement is present and the Ca comparison is carried out, the program will write out the coordinates of Mol2 superimposed onto Mol1. The file name will be something like mol2_mol1.xxx. For example, if the name of mol1 is sfv.pdb and the name of mol2 is sin.pdb, the output name will be sin_sfv.pdb.

APPEnd yes/no
If the input is yes and there are no secondary structure assignments in the input coordinates file, the program will append the assignment at the end of the coordinate file. [Default: NO]

HLXRMS hlxrms
If rms between an alpha helix and standard helix is higher than this value, this helix will not be used for the comparisons.

BTARMS btahlx
If rms between a beta strand and a straight line formed by the two representing points is higher than this value, this strand will not be used for comparisons.

ERRRMSA errrmsa
If rms value of certain helix or sheet is higher than this value, this helix or sheet is not considered to be similar.

ERRANG errang_alpha, errang_beta
If the direction difference of a certain helix in the two structures is higher than errang_alpha, this helix is not considered to be similar.
If the direction difference of a certain sheet in the two structures is higher than errang_beta, this sheet is not considered to be similar.

ERRDLL errdll_alpha, errdll_beta
If the line-line distance of a certain helix in the two structures is higher than errdll_alpha, this helix is not considered to be similar.
If the line-line distance of a certain strand in the two structures is higher than errdll_beta, this strand is not considered to be similar.

MAXANG angmax
When expanding the search for similar secondary structures, if the maximum direction difference exceeds this angle, the last expansion is rejected.

MAXDLL dllmax
When expanding the search for similar secondary structures, if the maximum line-line distance exceeds this number, the last expansion is rejected.

SINGLE/NOSIngle (or MULTiple)
If the SINGLE statement appears, comparison is only carried out on one polypeptide chain. If NOSINGLE or MULTIPLE appears, the program can compare protein structures with multiple chains.

FAST
When this option is chosen, if a helix disturbs the match of a beta strand, the program will delete the first helix and re-search for the match.

DIRWEIGHT dirweight
Weight of the direction in the refinement.

REFWEIGHT refwalpha, refwbeta
Weights of alpha helices and beta strands in the least-squares refinement.

SND1 Yes/No [CA]
If the input is yes, the program will not read the secondary structure assignment in the coordinate file of Mol1 but will make the assignment itself, using an algorithm defined by Smith/Laskowski (SECSTR program from PROCHECK). If the input is no, the program will first try to use the secondary structure assigned in the coordinate file. If that does not exist or does not work, the program will make the assignment itself. If CA is present in the second input column after the keyword, the program will assign the secondary structures based only on Ca atoms.

SND2 Yes/No [CA]
same as SND1 but for Mol2

AMPLify ampl ampltop [default: 1.5 2.0]
ampl is the amplification order for structure diversity
ampltop is the amplification order for topological diversity
The values of Structural Diversity and Topological Diversity are used in TOP to describe the structural difference between the two compared structures, based on both the r.m.s. deviation and the number of matched residues or SSEs. The "amplification order" controls the influence of the number of matched residues or SSEs (see the conventions for more details).

Keywords for 3DB interface

3DB browser

Jaime Prilusky

3DB Browser's Help file

PDB SearchFields Help

3DBSITE Site_name
Example: 3dbsite http://www.pdb.bnl.gov
If users wish to read data from their local disk/CD or a nearby Web site, but use the 3DB browser to choose the search range, they can use LIBDIR or WEBSITE to specify the location of the coordinates and use this command to specify the URL of the 3DB server. The 3DB URL must be one of the mirror sites of the Protein Data Bank. This 3DB server site need not be the same as in the WEBSITE command: the program can obtain the PDB entry list from the 3DB server and fetch the coordinates from another URL. If the WEBSITE, LIBDIR and PDBSITE commands are not given, the program will use this 3DB address for fetching coordinates. If this address is not given, the default server is the one at BNL. However, I strongly recommend choosing a PDB mirror site close to the user's local lab.

3DBKEYword word1 word2


	example: 3DBKEYWORD FAD + FMN + FLAVIN
		 3DBKEYWORD NITRATE REDUCTASE
		 3DBKEYWORD FAD .or. FMN .or. FLAVIN

3DBTEXT Word


		example:  3DBTEXT FAD + FMN + FLAVIN
			  3DBTEXT REDUCTASE

3DBSEQ (or 3DBFASTA) cutoff sequence (or cutoff @seq_file_name)


Example: 3DBSEQ 0.02 GXGXTGGTX
     or	 3DBSEQ 0.02 @zm.seq


SYTVGTYLAERLVQIGLKHHFAVAGDYNLVLLDNLLLNKNMEQVYCCNEL
NCGFSAEGYARAKGAAAAVVTYSVGALSAFDAIGGAYAENLPVILISGAP
NNNDHAAGHVLHHALGKTDYHYQLEMAKNITAAAEAIY

see 3DB Browser Help File

3DBRESOlution res1-res2 or RESO res1 res2


example: 3DBRESOLUTION 0.1-3.0	 or 3DBRESOLUTION 0.1 3.0

3DBBEFore (or 3DBUPPer) date
Example: 3dbbefore 12/3/1998
Equivalent to the "Date (upper)" column in 3DB. If this command appears, the TOP program searches only those structures which were deposited before this date.

3DBAFTer (or 3DBLOWer) date
Example: 3dbafter 12/1/1998
Equivalent to the "Date (lower)" column in 3DB. If this command appears, the TOP program searches only those structures which were deposited after this date. This allows users to track new structures which are similar to a certain family. The procedure can be made fully automatic with a simple Unix script file.

3DBHET compound_name
Example: 3dbhet FMN
Equivalent to the "Associated group" column in 3DB. If this command appears, the TOP program searches only those structures which contain this hetero compound.

Conventions of the Coordinate files

When comparing two protein structures, the program needs two coordinate files in Brookhaven format. It can read the secondary structure elements (SSEs, i.e. alpha helices and beta strands) which are pre-assigned in the PDB format file, as in the following example:


HELIX    1  F1 LEU     96  SER    103  
HELIX    2  N1 ILE    148  ARG    160 
HELIX    3  N2 ARG    184  GLU    193 
HELIX    4  N3 GLU    223  HIS    229 
HELIX    5 N4A PRO    245  GLN    249 
HELIX    6 N4B SER    253  GLU    257 
HELIX    7  N5 MET    263  SER    266 
SHEET    1  FB 6 LYS    58  TYR    64  0 
SHEET    2  FB 6 HIS    48  ILE    55 -1
SHEET    3  FB 6 TYR   109  LEU   116 -1
SHEET    4  FB 6 ILE    13  SER    24 -1
SHEET    5  FB 6 VAL    27  SER    33 -1
SHEET    6  FB 6 HIS    75  LYS    81 -1
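
As an illustration of what these records carry, the sketch below extracts the first and last residue number of each element from lines laid out as in the example above. The function name is hypothetical, and splitting on whitespace is an assumption that only suits this simplified layout: real PDB files use fixed columns and include chain identifiers, so a production reader should parse by column instead.

```python
def parse_sse_record(line):
    """Return (kind, first_residue, last_residue) for a HELIX or SHEET line, else None."""
    fields = line.split()
    if not fields:
        return None
    if fields[0] == "HELIX" and len(fields) >= 7:
        # HELIX serial id resname start resname end
        return ("helix", int(fields[4]), int(fields[6]))
    if fields[0] == "SHEET" and len(fields) >= 8:
        # SHEET strand sheet_id n_strands resname start resname end [sense]
        return ("strand", int(fields[5]), int(fields[7]))
    return None

records = [
    "HELIX    1  F1 LEU     96  SER    103",
    "SHEET    1  FB 6 LYS    58  TYR    64  0",
]
for rec in records:
    print(parse_sse_record(rec))
```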

If there are no SSE assignments in the coordinate file, the program will take some CPU time to calculate them. If the file contains coordinates of all mainchain atoms, the program will use the "Smith-Laskowski method" as in the PROCHECK package. If the file contains only Ca coordinates, or many mainchain atoms are missing, the program can still assign the secondary structures automatically using another method, but some elements, especially beta strands, might not be as accurate as when all the mainchain atoms are provided. However, this does not influence the structure comparisons in most cases.

Conventions of some output parameters

Matching Residues
Number of matched residues. The program counts a pair of residues as matched when:

1) They lie in a consecutive fragment of at least a certain number of residues whose Ca atoms in the two superimposed structures are closer than a certain distance. The distance is defined by the DISTANCE command (default 3.8 angstrom), while the number of consecutive residues is defined by the RESIDUE command (default 3).
2) The Ca atoms of the matched residues in the two superimposed structures are closest to each other.
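
The two criteria can be sketched as a small function. This is an illustration under stated assumptions, not the TOP implementation: the function name is hypothetical, list indices stand in for residue numbers, and "at least lstres" is used as the fragment length test.

```python
import math

def matched_pairs(ca1, ca2, dstmin=3.8, lstres=3):
    """Return (i, j) index pairs of Ca atoms counted as matched.

    ca1, ca2: lists of (x, y, z) Ca coordinates of the two superimposed chains.
    """
    def nearest(point, others):
        return min(range(len(others)), key=lambda k: math.dist(point, others[k]))

    # Criterion 2: keep only mutually closest Ca atoms that lie within dstmin.
    mutual = {}
    for i, p in enumerate(ca1):
        j = nearest(p, ca2)
        if math.dist(p, ca2[j]) < dstmin and nearest(ca2[j], ca1) == i:
            mutual[i] = j

    # Criterion 1: keep only fragments of at least lstres consecutive residues.
    pairs, run = [], []
    for i in sorted(mutual):
        j = mutual[i]
        if run and i == run[-1][0] + 1 and j == run[-1][1] + 1:
            run.append((i, j))
        else:
            if len(run) >= lstres:
                pairs.extend(run)
            run = [(i, j)]
    if len(run) >= lstres:
        pairs.extend(run)
    return pairs

# Hypothetical coordinates: four residues superimpose well, the fifth does not.
chain_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0), (50.0, 0.0, 0.0)]
chain_2 = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0), (11.4, 1.0, 0.0), (80.0, 0.0, 0.0)]
print(matched_pairs(chain_1, chain_2))
```

Because fragments are collected independently of their order along the chain, a scheme of this kind does not require the two sequences to be collinear.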
Identical residues and Identity

Identical residues is the number of matched residues whose amino acid types are identical.
Identity is (Identical residues)/(matched residues).
r.m.s. deviation
```
r.m.s. = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} d_i^2 }
```
where
N is the number of the matchable Ca atoms
d_i is the distance between the i'th matched Ca atoms of the 1st and 2nd molecules
Mean distance:
```
d_mean = \frac{1}{N} \sum_{i=1}^{N} d_i
```
where
N is the number of the matchable Ca atoms
d_i is the distance between the i'th matched Ca atoms of the 1st and 2nd molecules

Usually, if the differences are distributed evenly over the two structures, the values of d_mean and r.m.s. are close. If some parts of the two structures differ much more than the other parts, r.m.s. is usually significantly higher than d_mean. In my opinion, d_mean reflects the distance between the two structures in a comparison better than r.m.s. does.
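
A minimal sketch of the two measures, using invented distance values, shows how a single badly fitting residue raises the r.m.s. more than the mean distance. The function names are hypothetical.

```python
import math

def rms_deviation(distances):
    """Root mean square of the Ca-Ca distances of the matched residues."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def mean_distance(distances):
    """Arithmetic mean of the Ca-Ca distances of the matched residues."""
    return sum(distances) / len(distances)

even = [1.0, 1.0, 1.0, 1.0]    # differences spread evenly (invented values)
uneven = [0.5, 0.5, 0.5, 2.5]  # one region fits much worse (invented values)

for name, dists in (("even", even), ("uneven", uneven)):
    print(name, round(mean_distance(dists), 3), round(rms_deviation(dists), 3))
```

Both sets have the same mean distance of 1.0, but the r.m.s. of the uneven set is about 1.32.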
Structural Diversity
This value is used to describe the difference between the two structures, based on distance of matched Ca atoms and number of matched residues. The definition is:
```
Structure Diversity = (r.m.s)*(N_mol1/N_fit)^A
```
where
N_fit is the number of matched residues (Ca atoms)
N_mol1 is the total number of residues in the 1st molecule.
A is the amplification order for the number of matched residues (defined in the AMPLIFY command, default 2.0). The higher this value is, the more the structure diversity is influenced by the number of matched residues rather than by the r.m.s. deviation.
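
The formula can be evaluated directly. In this sketch the function name and the input numbers are invented for illustration, and the amplification order is passed explicitly rather than assuming a default.

```python
def structure_diversity(rms, n_mol1, n_fit, a):
    """Structure Diversity = rms * (N_mol1 / N_fit) ** A."""
    if n_fit <= 0:
        raise ValueError("n_fit must be positive")
    return rms * (n_mol1 / n_fit) ** a

# Invented example: 200 residues in molecule 1, r.m.s. deviation 1.5.
print(structure_diversity(1.5, 200, 200, 2.0))  # every residue matched
print(structure_diversity(1.5, 200, 100, 2.0))  # only half the residues matched
```

With all residues matched the score equals the r.m.s. deviation (1.5); with half of them matched and A = 2.0 it rises to 6.0, which shows how the matched fraction dominates as A grows.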
Topological Diversity
This value is used to describe the topological difference of Secondary Structure Elements between the two molecules. The definition is

Examples

In many cases, users can quickly learn how to use the program just by studying appropriate examples. One can use one of two ways to run TOP: simple commands or Unix script files. The simple commands are designed for the convenience of those users who don't have the Protein Data Bank in their local lab and use TOP for ordinary purposes. The Unix command files are more flexible for special purposes.

Simple commands:

Comparing two structures: top3d
For comparing two structures which are similar, the program can do two things:
1. superimpose the two structures so that the user can display them in graphics.
2. Output sequence alignment and statistics about the differences such as r.m.s deviation, fitting residues and so on.
For these purposes, one can just type top3d file1 file2, or type top3d and answer the questions. For example, if you type top3d mol1.pdb mol2.pdb (and the two structures are similar), the program will output a sequence alignment of the two proteins and a coordinate file mol2_mol1.pdb in which mol2.pdb is superimposed onto mol1.pdb.
If one or both of the molecules have been deposited in the Protein Data Bank and the entry code is known, you can tell the program using the special format code@pdb. For example, if you want to compare PDB entries 1KXD and 1VCP, you can just type top3d 1kxd@pdb 1vcp@pdb, and the program will output a file 1vcp_1kxd.pdb in which 1VCP is superimposed onto 1KXD.
If the user wishes to change the parameters for the TOP program, they can edit the file TOP.PARM in the directory.
Searching a database for proteins which are similar in 3D: topsearch
If users have a protein structure (for example mol1.pdb) and wish to find out which proteins in the Protein Data Bank are similar to it, they can type topsearch mol1.pdb to start the database search. When the procedure has finished, there will be a long output file topsearch_name.log and two shortened lists, strdiv_name.lis and topdiv_name.lis.
The file strdiv_name.lis is a list of similar structures ranked by "Structure Diversity" (based on Ca atoms). The file topdiv_name.lis is a list of similar structures ranked by "Topological Diversity" (based on Secondary Structure Elements). For a detailed comparison, pick a code from one of these two lists and use the command top3d for further information.

Unix script file

There are several examples files available at http://gamma.mbb.ki.se/~guoguang/webtop/examples showing how to use the TOP program. Here is a summary of them

Name	PDB data from	Function
top.com	local disk or internet	Superimposing two protein structures and compare them
pdbscan.com	local disk	Searching similar structures in Protein Data Bank
topscan.com	internet	Searching similar structures in Protein Data Bank
pdbsearch.com	local disk	Searching similar structures in a compact database.
topsearch.com	internet	Searching similar structures in a compact database.
top3db.com	internet	Searching similar structures with 3DB restraints
makevec.com	local disk	Making SSE library
makevec_web.com	internet	Making SSE library

Example 1: Compare two structures
Two files, 1kxd.pdb and 1vcp.pdb, are compared by the following script file ($TOPHOME/examples/top.com in the distribution package).


#
rm fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
$LUEXE/top << 'end-top'
MOL1 1kxd.pdb
MOL2 1vcp.pdb
RESIDUE 3
WRITE
'end-top'
#

Type "top.com > top.log". The program will output which secondary structure elements correspond to each other in the two structures. Optionally, the program also superimposes the two structures based on the Ca atoms and outputs the sequence comparison (see the instructions for the keyword RESIDUE). The rms deviation is output. When the WRITE statement appears, the program writes a file in which molecule 2 is superimposed onto molecule 1. In this case the output file name is 1vcp_1kxd.pdb. Sometimes there is more than one way to superimpose the two structures (e.g. when the two structures are dimers AB, the program can superimpose AB onto A'B' and AB onto B'A'). In this case the program will output several superimposed coordinate files, called 1vcp_1kxd.pdb, 1vcp_1kxd.pdb_2, 1vcp_1kxd.pdb_3, .... One can use any graphics program (such as O, Insight or Frodo) to display the superimposed coordinates together with 1kxd.pdb. Look at top.log for more information.

There are other commands concerning the parameters for different purpose of the comparisons. For details, please see "Keyworded Input"

The TOP software can browse coordinates directly from the Protein Data Bank (PDB) if the URL of a PDB mirror site is provided. In this example, if you know that the PDB entry code of one of the structures is 1vcp, you can do the following:

1) Add a command to indicate which site you want to browse:

    PDBSITE http://www.pdb.bnl.gov/

2) Use xxxx@pdb in the MOL2 command:

    MOL2 1vcp@pdb

The program will then read 1vcp directly from Brookhaven National Laboratory via the internet.
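As a rough sketch of the two steps above, the keyworded input can be generated by a small POSIX shell function. The helper name write_top_input is hypothetical (it is not part of TOP), and the keyword order simply follows the example scripts in this document. Only the keywords PDBSITE, MOL1, MOL2, RESIDUE and WRITE are taken from the text.

```sh
# Sketch only: write_top_input is a hypothetical helper, not part of TOP.
# It prints keyworded input that compares a local file (MOL1) with a PDB
# entry read from a mirror site (MOL2 given as code@pdb).
write_top_input() {
    mol1=$1      # local coordinate file, e.g. 1kxd.pdb
    code=$2      # PDB entry code, e.g. 1vcp
    site=$3      # URL of the PDB mirror site to browse
    printf 'PDBSITE %s\n' "$site"
    printf 'MOL1 %s\n' "$mol1"
    printf 'MOL2 %s@pdb\n' "$code"
    printf 'RESIDUE 3\n'
    printf 'WRITE\n'
}
```

The output could then be piped to the program, for example "write_top_input 1kxd.pdb 1vcp http://www.pdb.bnl.gov/ | $LUEXE/top > top.log", assuming the fort.10 to fort.12 links have been set up as in the script of example 1.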

Example 2: Searching similar structures in Protein Data Bank

TOP can be used to check whether a protein is similar to structures in the Protein Data Bank. Depending on how the data are obtained from the database, there are two ways to run a database search:

Search a Protein Data Bank installed on the local disk. Example command files are pdbsearch.com and pdbscan.com in the directory $TOPHOME/examples/.
Search the Protein Data Bank via the internet (see topsearch.com and topscan.com).

The recommended way to run TOP is to search a compact library of Secondary Structure Elements (SSEs) first. If the SSE arrangement of some proteins is found to be similar to that of the studied structure, the program can carry out further comparisons based on Ca atoms (as shown in pdbsearch.com and topsearch.com). This approach requires a regularly updated SSE library, which can be obtained from ftp://gamma.mbb.ki.se/pub/guoguang/sndlib.tar.Z. The library can also be made and updated automatically (see the instructions for "Automatic updating of SSE library").

If you choose not to use the compact SSE library, use pdbscan.com or topscan.com instead of pdbsearch.com or topsearch.com to search the PDB on local disk or via the internet, respectively.

pdbscan.com assumes that all the Protein Data Bank files are under the directory /nfs/protein/pdb/current_release/uncompressed_files and that all the files are called *.ent. In this example file, the command

    find $pdbdir -name "*.ent" -print > current.lis

finds all the PDB entries and writes them to the file current.lis, whose contents look like:


/nfs/protein/pdb/current_release/uncompressed_files/00/pdb100d.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb200d.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb200l.ent
/nfs/protein/pdb/current_release/uncompressed_files/00/pdb300d.ent
/nfs/protein/pdb/current_release/uncompressed_files/01/pdb101d.ent
/nfs/protein/pdb/current_release/uncompressed_files/01/pdb201d.ent
....

In this way all the file names are stored in current.lis, which is read by the MOL2 command in the TOP program:

    MOL2 @current.lis

In fact, one can search not only the whole Protein Data Bank but also a group of selected structures, for example structures that represent independent folds in the SCOP classification.
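The list file can be built, and optionally cut down to a chosen group of entries, with a few lines of POSIX shell. This is only a sketch: make_pdb_list and select_entries are hypothetical helper names, and the file of entry codes (one four-character code per line) is assumed to come from some other source such as a SCOP listing. The pdb<code>.ent naming is taken from the listing above.

```sh
# Hypothetical helper: list every *.ent file below a directory, one path per
# line. The output is sorted so that repeated runs give the same order.
make_pdb_list() {
    find "$1" -name '*.ent' -print | sort
}

# Hypothetical helper: from a list of paths ($1), keep only those whose file
# name is pdb<code>.ent for a code listed, one per line, in the file $2.
select_entries() {
    awk 'NR == FNR { want["pdb" $1 ".ent"] = 1; next }
         { n = split($0, part, "/"); if (part[n] in want) print }' "$2" "$1"
}
```

For example, "make_pdb_list $pdbdir > current.lis" followed by "select_entries current.lis codes.txt > subset.lis" would give a shorter list that can be passed to TOP as MOL2 @subset.lis.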

Taking pdbscan.com as an example again: to run the database search, type "pdbscan.com &". After some hours, all the information will be in pdbscan.log, which users usually do not need to read. Instead, look at the summary files "strdiv.lis" or "topdiv.log". (If the program crashes, you can still inspect the intermediate results by typing "grep Str pdbscan.log | sort +3 -4" or "grep Top pdbscan.log | sort +3 -4".)

The content of strdiv.lis is the following:


 1692 structures are found to be similar under the given criteria
 Best Structure Diversity   7.67  with   52 matched residues to 2cnd
 Best Structure Diversity   7.68  with   56 matched residues to 1azz
 Best Structure Diversity   8.13  with   57 matched residues to 1epa
 Best Structure Diversity   8.33  with   48 matched residues to 1cnf
 Best Structure Diversity   8.48  with   54 matched residues to 1ave
 Best Structure Diversity   8.70  with   54 matched residues to 1hav
 Best Structure Diversity   8.70  with   54 matched residues to 2pia
 Best Structure Diversity   9.28  with   51 matched residues to 1avd
 ............

Here the structures 2cnd, 1azz, 1epa and so on are found to be similar to the search model (2cnd is ranked as the most similar structure by the program). Users can take the command file of example 1 and fetch the coordinates to run an individual comparison, which gives the superimposed structure and details of the comparison such as the r.m.s. deviation and the sequence alignment. (This information is also in pdbscan.log; run nicelist.com or toplist.com to get a more readable output.)
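A short ranked table can be pulled out of the log or of strdiv.lis with standard tools. The sketch below assumes the line layout shown in the strdiv.lis listing above, and rank_hits is a hypothetical helper name. It uses the POSIX key syntax "sort -k" with a numeric comparison; on systems where the older form "sort +3 -4" is no longer accepted, "sort -k4,4" is the equivalent key.

```sh
# Hypothetical helper. Reads lines of the form
#   Best Structure Diversity   7.67  with   52 matched residues to 2cnd
# and prints "code diversity matched_residues", smallest diversity first.
rank_hits() {
    logfile=$1
    top=${2:-10}     # number of hits to print (default 10)
    grep 'Best Structure Diversity' "$logfile" |
        awk '{ print $NF, $4, $6 }' |
        LC_ALL=C sort -k2,2n |      # C locale so "." is the decimal point
        head -n "$top"
}
```

For example, "rank_hits pdbscan.log 20" would print the 20 entries with the lowest structure diversity.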

Example 3: Searching similar structures from a compact SSE library

As described in the description section, in the first step TOP detects similarities based on the SSE topology of the two proteins. Besides coordinate files in PDB format, the program can also read a compact database which contains the SSE topology derived from the Protein Data Bank. Using the SSE library is a fast and recommended way of searching a database for similar structures. To make the library from a PDB on local disk, use $TOPHOME/examples/makevec.com. To make the library from the PDB on the Web, use $TOPHOME/examples/makevec_web.com. The SSE library can be updated automatically according to the most recent PDB data; see the installation section.

The following example shows how to use the SSE library for similarity searching. It is similar to example 2, but has one more command, MOLVEC.


rm -f fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
cat > topsearch.inp << EOF
MATCH auto
PDBSITE http://www2.ebi.ac.uk
!LIBDIR /nfs/pdb/current_release/uncompressed_files/
MOL1 kinA.pdb
MOLVEC $TOPHOME/lib/sndlib.vec
EOF
$TOPBIN/top < topsearch.inp  > topsearch.log
grep Top topsearch.log | sort +3 -4 >> topdiv.lis
grep similar topsearch.log > strdiv.lis
grep Str topsearch.log | sort +3 -4 >> strdiv.lis

The running and analysis procedure is similar to example 2.

In this example, if you use LIBDIR /nfs/pdb/current_release/uncompressed_files/ instead of PDBSITE http://www2.ebi.ac.uk, the program will read the coordinates from local disk instead of from the internet.

If you use another SSE database, for example MOLVEC $TOPHOME/lib/scop_structure.vec, you search only about 2000 independent domain structures selected in the SCOP database instead of the 8000 in the Protein Data Bank. The search is then much faster (taking only 1/10 to 1/5 of the time). For the same reason, you can use $TOPHOME/lib/scop_family.vec (about 900 domain structures) or $TOPHOME/lib/scop_superfamily.vec (about 600 domain structures) to shorten the search even further. The SCOP database is not updated as frequently as the PDB, so far once a year. The SSE database for the most recent SCOP release is always kept at our FTP distribution site.

The Web server of TOP offers another way to search all the structures: the program searches a classification unit of independent domain structures, families or super-families in SCOP. Once it finds a similarity, it can optionally go on to search the other structures in the same classification unit. Such a search is very efficient in terms of speed, although it does not cover the most recent data in the Protein Data Bank. Please have a look at: http://alfa.mbb.ki.se:8000/TOP/search_SCOP_new.html

Example 4: Superimpose all the sequence-homologous proteins in PDB

To compare all the structures in the PDB which have sequence homology to a particular structure, the following simple procedure can be used to produce all the superimposed structures.


#!/bin/csh
rm -f fort.10 fort.11 fort.12
ln -s omatrix.ofm fort.10
ln -s mol1.ofm fort.11
ln -s mol2.ofm fort.12
$TOPBIN/top << 'end-top' 
MOL1 zmA.pdb
MOLVEC snd1.vec
pdbsite http://www2.ebi.ac.uk
3dbseq 0.02 @zm.seq
MATCH auto
WRITE yes
'end-top'

In this example zmA.pdb contains the PDB coordinates of the probe structure, and zm.seq is the file which contains its sequence in 1-letter code:


SYTVGTYLAERLVQIGLKHHFAVAGDYNLVLLDNLLLNKNMEQVYCCNEL
TLKFIANRDKVAVLVGSKLRAAGAEEAAVKFTDALGGAVATMAAAKSFFP
EENALYIGTSWGEVSYPGVEKTMKEADAVIALAPVFN
....

The filename for all the superimposed coordinates will be 1pyd_zmA.pdb, 1pvd_zmA.pdb, 1pox_zmA.pdb....

Some hints about the program

Database searching. If you find that structures in the Protein Data Bank are similar to your new structure, the program cannot tell you directly which family it belongs to. However, there are Web sites where you can get this information and classify your new protein according to the results from the TOP program. Some of these sites are listed below.

Name	URL address	Function	Group
SCOP	http://scop.mrc-lmb.cam.ac.uk/scop	Structure Classification of Proteins	Chothia, Murzin...
CATH	http://www.biochem.ucl.ac.uk/bsm/cath	Class Architecture Topology Homology	Thornton...

When searching for similar structures in the whole Protein Data Bank, a lot of time is usually wasted on tens of lysozyme mutants or other closely related homologous proteins. It is possible to make a file list in which only structures with independent folds or super-families are present (see example 2), if such information can be obtained from other sources. So far, no such effort has been made by the author.

Speed. When you have a huge structure with many domains, it is much faster to divide your protein into several independent domains and search with each domain individually. The results will be much easier to understand too.

Parameter of MATCH.

Over-estimation: If the program fails to compare two similar structures, it may be because the parameter value in the MATCH command is too high. You can check this in the following way. For example, if the MATCH number should be 4 or less but you use 7, the program writes something like the following at the end of the output:

    ... No way to align in 12ca.pdb Maximum match : 4 Minimum Align: 7

You can then change MATCH from 7 to 4 and the program will run successfully.

In a database search, too high a value in this command means that no or too few similar structures are found. You can find a suitable value by typing: grep "Maximum match" pdbscan.log | sort +10 -11 (assuming that the log file is called pdbscan.log). For example, if you give MATCH 5 and get no hits, you will see something like
```
 ......
 ... No way to align in 1abj.pdb Maximum match :  3 Minimum Align:  5
 ... No way to align in 1abn.pdb Maximum match :  3 Minimum Align:  5
 ... No way to align in 1abo.pdb Maximum match :  3 Minimum Align:  5
 ... No way to align in 12ca.pdb Maximum match :  4 Minimum Align:  5
 ... No way to align in 1aag.pdb Maximum match :  4 Minimum Align:  5
 ... No way to align in 1aao.pdb Maximum match :  4 Minimum Align:  5
```
In this example, you would get 3 more matching structures by using 4 in the MATCH command.
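Instead of reading the sorted lines by eye, the "Maximum match" values can be tallied. The following is a minimal sketch that assumes the log lines have exactly the layout shown above; match_tally is a hypothetical helper name.

```sh
# Hypothetical helper. Reads "No way to align" lines such as
#   ... No way to align in 1abj.pdb Maximum match :  3 Minimum Align:  5
# and prints one line per "Maximum match" value: the value, then the number
# of structures that reported it.
match_tally() {
    grep 'Maximum match' "$1" |
        sed 's/.*Maximum match *: *\([0-9][0-9]*\).*/\1/' |
        sort -n | uniq -c |
        awk '{ print $2, $1 }'
}
```

Applied to the six lines listed above, the tally contains two lines, "3 3" and "4 3", which shows that lowering MATCH to 4 would admit three more structures.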
Under-estimation: Under-estimating this number is usually harmless. The program will find many structures in which you are not interested, but you can always rank the hits by "Structure Diversity" or "Topological Diversity" and look only at the structures at the top of the rankings. If the search is too slow because the value of this parameter is too low, you can find a suitable value well before the search has finished. For example, suppose you give 5 in the MATCH command. After the program has run for a while, type grep "Max Align" pdbscan.log | sort +3 -4 and you get
```
.......
...(too many hits)...
......
 1cax.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
 1cwa.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
 1cwb.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
 1cwc.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
 1cxf.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
 1cyn.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
 1dlc.pdb<->mol1.pdb  Max Align:  5  Max Match:  5
 1cnd.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
 1cne.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
 1cnf.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
```
If you find that only the last 3 structures meet your "similarity" criterion, you can give "MATCH 6" (or 7) when you re-scan the database.
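The same selection can be made with a threshold rather than by inspecting the sorted list. This sketch assumes the "Max Align" line layout shown above; strong_hits is a hypothetical helper name.

```sh
# Hypothetical helper. Reads lines such as
#   1cnd.pdb<->mol1.pdb  Max Align:  7  Max Match:  7
# and prints the first field of every line whose Max Align value is at
# least the threshold given as the second argument.
strong_hits() {
    awk -v min="$2" '/Max Align:/ && $4 + 0 >= min + 0 { print $1 }' "$1"
}
```

For example, "strong_hits pdbscan.log 6" lists only the pairs that would survive a re-scan with MATCH 6.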

Reference

Lu G., A WWW service system for automatic comparison of protein structures. Protein Data Bank Quarterly Newsletter, #78, 10-11 (1996).
Lu G., An automatic topological and atomic comparison program for protein structures (in manuscript, or see http://gamma.mbb.ki.se/~guoguang/top.html).

Acknowledgment

I am grateful to Dr. Ylva Lindqvist and Prof. Gunter Schneider for encouraging me to write this program and for contributing important ideas. I also thank Dr. Roman Laskowski for permission to use his secondary structure assignment program and Dr. Jaime Prilusky for suggestions on the 3DB interface. Finally, I thank a number of colleagues for suggestions and bug reports.
