MMTSB - All-atom Modeling: PDB File Manipulation

PDB file manipulation

PDB format files from the Protein Data Bank or other sources usually require minor modifications with respect to residue and atom names before they can be read into CHARMM or Amber, the main modeling packages used by the MMTSB Tool Set. CHARMM also requires unique segment IDs at the end of each PDB line.

When MMTSB application tools such as minCHARMM.pl or minAmber.pl are called, the residue and atom names are adjusted automatically and CHARMM segment names are generated if necessary. In other cases it may be necessary to do this manually.
Other PDB file manipulations that may be needed but cannot be done automatically include changing the residue numbering and assigning chain IDs.

All of these functions are available through convpdb.pl, which is introduced in this section.

Format conversion

convpdb.pl can convert atom and residue names for a number of specific formats. Most of the changes involve histidine residue names. The -out option followed by the format name is used for this purpose:

convpdb.pl -out charmm22 1vii.orig.pdb

will match atom and residue names to the naming convention used in the CHARMM22 force field. This will also strip all PDB lines that do not begin with ATOM, i.e. all remarks, crystallographic, sequence and other information. convpdb.pl always ignores such lines in PDB input files since none of the MMTSB tools use any of this extra information.
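The filtering step described above can be illustrated with a short Python sketch. This is not the convpdb.pl implementation; the function name keep_atom_records and the sample coordinates are made up for illustration, and the only rule applied is the one stated in the text: keep lines that begin with ATOM.

```python
def keep_atom_records(pdb_text):
    """Return only the lines whose record name is ATOM."""
    kept = [line for line in pdb_text.splitlines() if line.startswith("ATOM")]
    return "".join(line + "\n" for line in kept)


# Hypothetical input fragment; the coordinates are made up for illustration.
sample = (
    "REMARK   example header line\n"
    "ATOM      1  N   MET A  41       1.000   2.000   3.000\n"
    "HETATM    2  O   HOH A 100       4.000   5.000   6.000\n"
    "END\n"
)
print(keep_atom_records(sample), end="")
```

Only the single ATOM line of the sample survives; the REMARK, HETATM and END lines are dropped.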

Other supported formats are charmm19 for the CHARMM19 force field and amber, which produces input suitable for Amber's leap/tleap program used to set up topology and coordinate files.

The generic format is also supported. It converts atom and residue names in CHARMM or Amber PDB output back to generic PDB names. Again, this mostly involves histidine residue names, since names other than HIS are often not recognized by other programs.
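To make the histidine renaming concrete, here is a minimal Python sketch of mapping force-field-specific histidine names back to HIS. It is an illustration only, not what convpdb.pl does internally. The set of variant names is an assumption: it lists the commonly used CHARMM22 names (HSD, HSE, HSP) and Amber names (HID, HIE, HIP), and the function name to_generic_resname is hypothetical. The residue name is taken from columns 18-20 of the ATOM line.

```python
# Assumed list of histidine variants (CHARMM22 and Amber naming).
HIS_VARIANTS = {"HSD", "HSE", "HSP", "HID", "HIE", "HIP"}


def to_generic_resname(line):
    """Rename a histidine variant in an ATOM line to the generic HIS."""
    # Columns 18-20 (1-based) hold the residue name: slice [17:20].
    if line.startswith("ATOM") and line[17:20] in HIS_VARIANTS:
        return line[:17] + "HIS" + line[20:]
    return line
```

Note that this direction loses the protonation state, which is why the reverse conversion cannot be done by renaming alone.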

The MMTSB Tool Set automatically converts all PDB files to CHARMM22 format before writing them out. This is done so that the histidine protonation state is unambiguously preserved through the corresponding residue naming used in CHARMM22.
All tools, including convpdb.pl, therefore expect input with either canonical or CHARMM22 residue names.

If PDB files that were generated for or by CHARMM19 need to be read by convpdb.pl, problems with histidine residues may arise. In that case the special option -charmm19 must be used to indicate that the input file contains CHARMM19 residue names.

As an example, let us first generate a PDB file with CHARMM19 residue naming:

convpdb.pl -out charmm19 1vii.exp.pdb > c19.pdb

Then try the following command to convert the file back to CHARMM22 format:

convpdb.pl -charmm19 -out charmm22 c19.pdb

Please note that conversion between the CHARMM19 and CHARMM22 formats does not add or remove any hydrogen atoms. When a CHARMM19 output file is converted to CHARMM22, this means only that the naming convention is compatible with the CHARMM22 force field. The structure will still lack the non-polar hydrogen atoms expected by CHARMM22, since only polar hydrogens are included in CHARMM19.
Another utility, complete.pl, is available for completing structures for a given force field. It is explained in more detail in another part of this tutorial section.

CHARMM segment names

When PDB files are read into CHARMM, they are expected to have a four-letter segment ID in columns 73-76 at the end of each PDB ATOM line. Segment IDs are used like the more common chain IDs to distinguish different molecular segments that are not covalently bound to each other.

Segment ID names have to match the segment names given during PSF generation within CHARMM but are otherwise arbitrary.
The MMTSB Tool Set uses names like PRO1 or PR11 or, to reflect chain IDs if present, something like PROA.
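The column layout of the segment ID can be sketched in Python as follows. This is an illustrative sketch rather than the convpdb.pl code; the function names set_segid and get_segid are hypothetical. It assumes fixed-column ATOM lines, pads short lines up to column 72, and writes the ID into columns 73-76.

```python
def set_segid(line, segid):
    """Write a segment ID into columns 73-76 of an ATOM line."""
    if not line.startswith("ATOM"):
        return line
    if len(segid) > 4:
        raise ValueError("segment IDs are at most four characters long")
    # Pad short lines with blanks so that column 72 exists.
    body = line.rstrip("\n").ljust(72)
    # Anything after column 76 (e.g. element symbols) is kept unchanged.
    return body[:72] + segid.ljust(4) + body[76:]


def get_segid(line):
    """Read the segment ID from columns 73-76, without padding blanks."""
    return line[72:76].strip()
```

For example, set_segid(line, "PROA") returns the same ATOM line with PROA in columns 73-76, and get_segid reads it back.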

MMTSB Tool Set tools that interact with CHARMM automatically take care of generating appropriate segment names, but if PDB files are read into CHARMM directly, segment IDs need to be generated manually.

The option to generate segment IDs with convpdb.pl is -segnames. Here is an example:

convpdb.pl -segnames 1vii.orig.pdb

This will write out a CHARMM22 PDB file with segment IDs. CHARMM22 is the default output mode of convpdb.pl, so -out charmm22 can be omitted as in the example.

Normally, convpdb.pl will ignore segment IDs when reading PDB files because PDB files from other sources may contain other information in the segment ID columns which might lead to confusion with CHARMM.

It may be useful, however, to preserve existing valid segment IDs in PDB files written out by CHARMM or by convpdb.pl. In this case, the option -readseg can be given so that existing segment IDs are not discarded.

This is used in the following example where a PDB file written out by CHARMM with segment IDs is converted to CHARMM19 format while preserving the original segment IDs from the input file:

convpdb.pl -readseg -out charmm19 1vii.sample.1.pdb

Residue numbering

Residue numbering in PDB files is important for determining whether fragments are continuous and for identifying structure fragments, e.g. loop regions in loop modeling problems. In particular, if only parts of a given structure are modeled, it is crucial to maintain the correct residue numbering so that the modeled parts can later be merged with the rest of the protein system to form a complete structure.

The tools in the MMTSB Tool Set preserve consistent residue numbering, but if structures are used from external sources it may be necessary to adjust residue numbers accordingly.

A few different options are available in convpdb.pl to change residue numbers:

The first option, -renumber <startvalue>, renumbers residues consecutively from the given start value through the last residue of the structure. This option does not preserve gaps in residue numbering if parts of the structure are missing.

As an example, consider 1vii.orig.pdb, the original Protein Data Bank entry for the villin headpiece. Residue numbering starts at 41, since this is a fragment of a larger protein. One may prefer residue numbering to start at 1. This can be achieved with

convpdb.pl -renumber 1 1vii.orig.pdb > 1vii1.pdb

The result is a new PDB file 1vii1.pdb where residue numbering starts at 1 instead of 41.
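The effect of consecutive renumbering, including the loss of gaps, can be shown with a small Python sketch. It is not the convpdb.pl implementation; the function name renumber is hypothetical, and numbering continuously across chains and clearing insertion codes are assumptions of this sketch. The residue number is read from columns 23-26 and the insertion code from column 27.

```python
def renumber(lines, start=1):
    """Renumber residues consecutively from `start`, closing any gaps."""
    out = []
    current = start - 1
    previous_key = None
    for line in lines:
        if not line.startswith("ATOM"):
            out.append(line)
            continue
        # A new residue begins when chain ID or number + insertion code change.
        key = (line[21], line[22:27])
        if key != previous_key:
            current += 1
            previous_key = key
        # Write the new number into columns 23-26 and blank the insertion code.
        out.append(line[:22] + "%4d" % current + " " + line[27:])
    return out
```

With input residues numbered 41, 42 and 45 and a start value of 1, the output residues are numbered 1, 2 and 3, so the gap between 42 and 45 disappears.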

The second option, -add <shiftvalue>, is used for maintaining relative numbering in fragmented structures while shifting all residue numbers by a constant.
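In contrast to consecutive renumbering, a constant shift keeps gaps intact. The following Python sketch illustrates the idea under the assumption of fixed-column ATOM lines; the function name shift_residue_numbers is hypothetical, and the range check is an addition of this sketch to keep the four-column field from overflowing.

```python
def shift_residue_numbers(lines, offset):
    """Add a constant offset to every residue number, preserving gaps."""
    out = []
    for line in lines:
        if line.startswith("ATOM"):
            new_number = int(line[22:26]) + offset
            # The residue number field (columns 23-26) is four characters wide.
            if not -999 <= new_number <= 9999:
                raise ValueError("residue number does not fit in four columns")
            line = line[:22] + "%4d" % new_number + line[26:]
        out.append(line)
    return out
```

Residues numbered 1, 2 and 5 shifted by 40 become 41, 42 and 45, so the missing residues 3 and 4 remain visible as a gap.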

The third option, -match <reference PDB>, is more sophisticated. It will first align the amino acid sequence with the sequence from the reference PDB. If a complete alignment is not possible, a partial alignment of the largest fragment with exact matching residue names will be done. Any shift in residue numbers after alignment with respect to the reference is then applied to the whole molecule so that the residue numbering agrees with the reference for the matching residues.

This option is useful in the following example. A villin conformation, 1vii.sample.1.pdb, with residue numbering starting at 1 is to be compared with the experimental structure deposited in the Protein Data Bank, where residue numbering starts at 41, by calculating the root mean square deviation (RMSD) between coordinate positions. Using rms.pl directly will not work because the residue numbering does not match. In this case

convpdb.pl -match 1vii.orig.pdb 1vii.sample.1.pdb | rms.pl 1vii.orig.pdb

will first change residue numbering in the sample file to match the original PDB entry before passing the structure on to rms.pl for calculating an RMSD value.
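The idea behind -match, deriving a numbering shift from the largest fragment with exactly matching residue names, can be sketched in Python. This is a simplified illustration and not the alignment used by convpdb.pl: it uses the longest common contiguous run of residue names found by difflib.SequenceMatcher from the Python standard library, and the function names residues and numbering_offset are hypothetical.

```python
from difflib import SequenceMatcher


def residues(lines):
    """Return [(number, name), ...] with one entry per residue, in file order."""
    seen = []
    for line in lines:
        if line.startswith("ATOM"):
            entry = (int(line[22:26]), line[17:20])
            if not seen or seen[-1] != entry:
                seen.append(entry)
    return seen


def numbering_offset(sample_lines, reference_lines):
    """Offset to add to the sample numbering so that it agrees with the reference."""
    sample = residues(sample_lines)
    reference = residues(reference_lines)
    names_s = [name for _, name in sample]
    names_r = [name for _, name in reference]
    matcher = SequenceMatcher(None, names_s, names_r, autojunk=False)
    # Longest run of identical residue names shared by both sequences.
    block = matcher.find_longest_match(0, len(names_s), 0, len(names_r))
    if block.size == 0:
        raise ValueError("no matching residues found")
    return reference[block.b][0] - sample[block.a][0]
```

The returned offset would then be added to every residue number of the sample structure, as in the constant-shift case.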

Chain ID

Single-letter chain IDs are commonly used to distinguish units in multidomain proteins or other types of complexes. As explained above, CHARMM does not recognize chain IDs and uses segment IDs instead. However, chain IDs are recognized by many tools in the MMTSB Tool Set and can be used in residue selection criteria for loop modeling or other applications, e.g. for restraining part of a structure during minimization.

convpdb.pl can be used to set chain IDs for a given structure if chain IDs are needed for this purpose.

With the option -setchain <ID> a chain ID may be set for the whole structure. This is most useful for assembling a complex from different PDB files. Different chain IDs can then be set for each file before merging them into a single file as in the following example:

convpdb.pl -setchain A 1poa.exp.pdb > A.pdb
convpdb.pl -setchain B 1vii.exp.pdb > B.pdb
convpdb.pl -merge B.pdb A.pdb > AB.pdb

The resulting file AB.pdb contains both molecules, 1POA and 1VII, distinguished by the chain IDs A and B.
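What setting a chain ID means at the level of the file can be sketched in Python: the single-letter ID sits in column 22 of each ATOM line. The function name set_chain is hypothetical and the sketch is not the convpdb.pl code; merging is represented here simply as concatenating the two lists of lines, which is an assumption about the resulting order.

```python
def set_chain(lines, chain_id):
    """Set the chain ID (column 22) on every ATOM line."""
    if len(chain_id) != 1:
        raise ValueError("chain IDs are a single character")
    return [
        line[:21] + chain_id + line[22:] if line.startswith("ATOM") else line
        for line in lines
    ]


def merge(first, second):
    """Combine two structures into one list of lines (order is an assumption)."""
    return list(first) + list(second)
```

Calling merge(set_chain(a_lines, "A"), set_chain(b_lines, "B")) yields one list of lines in which the two molecules can be told apart by their chain IDs.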

An alternative method is to set chain IDs automatically from the last letter in CHARMM segment IDs. This is done with the option -chainfromseg and may be useful for multidomain PDB files that were written out by CHARMM without chain IDs.
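Deriving a chain ID from a segment ID can be sketched as follows. This is an illustration under the column assumptions used in this section (segment ID in columns 73-76, chain ID in column 22), not the convpdb.pl implementation; the function name chain_from_segid is hypothetical, and lines without a segment ID are left unchanged by choice.

```python
def chain_from_segid(line):
    """Copy the last character of the segment ID into the chain ID column."""
    if not line.startswith("ATOM"):
        return line
    segid = line[72:76].strip()
    if not segid:
        # No segment ID present: leave the line as it is.
        return line
    return line[:21] + segid[-1] + line[22:]
```

An ATOM line carrying the segment ID PROB would thus be assigned chain ID B.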