MMTSB - All-atom Modeling: Structure Manipulation

Structure manipulation

The MMTSB Tool Set provides a limited range of functions for manipulating (protein) structures. They are available through the convpdb.pl and vicinity.pl utilities and involve atom coordinate translations, substructure selections, merging of structure fragments, as well as completion of incomplete structures.

More complicated structure editing tasks are not supported by the MMTSB Tool Set but may be done either directly with CHARMM or by using commercial graphics-based molecular modeling packages.

Structure translation

convpdb.pl has two options for translating a protein structure in cartesian coordinate space. With -center the structure can be centered at the origin while -translate dx dy dz shifts the structure by a given (arbitrary) displacement.

Shifting the coordinates for a protein model may be useful when fitting a protein molecule into a given fixed grid. Typical applications would be low-resolution lattice-based simulations or grid-based Poisson-Boltzmann calculations.

As an example, let's take a look at the coordinates of the experimental villin headpiece structure in 1vii.exp.pdb:

head -10 1vii.exp.pdb

ATOM      1  N   MET     1       1.177 -10.035  -3.493  1.00  0.00
ATOM      2  CA  MET     1       0.292  -8.839  -3.377  1.00  0.00
ATOM      3  C   MET     1      -0.488  -8.912  -2.063  1.00  0.00
ATOM      4  O   MET     1      -1.039  -9.937  -1.709  1.00  0.00
ATOM      5  CB  MET     1      -0.674  -8.793  -4.565  1.00  0.00
ATOM      6  CG  MET     1      -0.091  -7.889  -5.657  1.00  0.00

Centering the molecule shifts all of the coordinates accordingly:

convpdb.pl -center 1vii.exp.pdb | head -10

ATOM      1  N   MET     1       0.437 -10.341  -5.602  1.00  0.00
ATOM      2  CA  MET     1      -0.448  -9.145  -5.486  1.00  0.00
ATOM      3  C   MET     1      -1.228  -9.218  -4.172  1.00  0.00
ATOM      4  O   MET     1      -1.779 -10.243  -3.818  1.00  0.00
ATOM      5  CB  MET     1      -1.414  -9.099  -6.674  1.00  0.00
ATOM      6  CG  MET     1      -0.831  -8.195  -7.766  1.00  0.00

Translating by a given displacement also shifts all of the coordinates with respect to the original file:

convpdb.pl -translate 1 0 0 1vii.exp.pdb | 
head -10

ATOM      1  N   MET     1       2.177 -10.035  -3.493  1.00  0.00
ATOM      2  CA  MET     1       1.292  -8.839  -3.377  1.00  0.00
ATOM      3  C   MET     1       0.512  -8.912  -2.063  1.00  0.00
ATOM      4  O   MET     1      -0.039  -9.937  -1.709  1.00  0.00
ATOM      5  CB  MET     1       0.326  -8.793  -4.565  1.00  0.00
ATOM      6  CG  MET     1       0.909  -7.889  -5.657  1.00  0.00

Please note that displacements in all directions (x, y, and z) are always required as arguments to -translate, even if they are zero, as in the example above.

Substructure selection

In some applications it is possible to reduce the problem size (and the associated computational cost) by limiting modeling efforts to a particular region of interest without affecting results significantly.
A typical example would be the prediction of loop conformations in partial protein models derived through sequence or structure homology. In this case, conformations of a small number of (loop) residues are sampled in the environment of a well-defined, relatively rigid structure template. Template residues beyond the immediate vicinity of the loop may usually be restrained, fixed, or omitted altogether if sufficiently far enough for long-range electrostatic energy contributions to become insignificant.

This section introduces tools from the MMTSB Tool Set that can be used in such applications for extracting substructure parts of proteins or protein assemblies.

Substructures are identified by a residue list consisting of one or more residue number intervals. Each residue interval is given by the first and last residue number, separated with a colon, as in 2:20. Multiple intervals are separated with the equal sign: 2:20=25:90
Such a residue list can be used to extract part of a protein structure in convpdb.pl if given with the option -sel
In the following example residues 10 through 15 and 18 through 20 are extracted from the input PDB file:

convpdb.pl -sel 10:15=18:20 1vii.exp.pdb

For multi-chain structure assemblies it is possible to select an entire chain with the -chain option of convpdb.pl followed by the chain ID of the chain that should be selected.

The -chain option can be combined with -sel, but it is also possible to combine chain IDs with residue numbers in the residue list argument of -sel.
As an example, one could specify a residue list such as A10:20=B50:78 which would select (different) residue subsets from chains A and B.

It is also possible to select a substructure from a larger structure by giving the amino acid sequence of the substructure rather than the residue range. This is more convenient for extracting part of a structure based on sequence alignments with other structures as in homology modeling and fold recognition applications. This feature is available with the -selseq option of convpdb.pl, followed by the desired amino acid sequence as a single string in single-letter form. The sequence has to match exactly part of the sequence in the input structure. Sequence mismatches, gaps or insertions will produce an error message. That means, that only a single continuous part of the sequence can be used at a time. It is possible, though, to extract multiple pieces one after another and merge them subsequently as described in more detail below. If the sequence can be matched more than once, its first occurence is used.
Following is an example of how to extract a four-residue fragment from the villin headpiece structure with this option:

convpdb.pl -selseq FGMT 1vii.exp.pdb | grep CA

ATOM    158  CA  PHE    11      -1.330  -0.424   6.489  1.00  0.00
ATOM    178  CA  GLY    12      -4.783   0.272   7.945  1.00  0.00
ATOM    185  CA  MET    13      -7.005   1.470   5.080  1.00  0.00
ATOM    202  CA  THR    14      -6.738  -0.094   1.592  1.00  0.00

Cutoff-based residue selection

In many cases, the selection of protein substructures is based on residues that lie within a given cutoff from a region of interest. This would be the typical case e.g. in loop modeling applications, where only residues in the protein template that are sufficiently close to the loop influence the energetics for different loop conformations significantly.

Since a manual identification of cutoff-based residue lists would be cumbersome, the utility vicinity.pl is available to generate such lists automatically.

In the simplest case, vicinity.pl needs a list of residues with the option -l that are used as reference for finding other residues in the vicinity and the name of a PDB file with a protein structure. In the loop modeling case the reference residues would usually be the loop residues.

In the following example a cutoff-based residue list is generated with respect to residues 71 through 75 in a 118-residue phospholipase structure (1POA):

vicinity.pl -l 71:75 1poa.exp.pdb

The output of this command consists of two lines. Let's ignore the second line for now and look only at the first one:

1:19=21=40=43=47=51=65:83=86:103

This is the list of residues (1 through 19, 21, 40, 43, etc.) that are within the default cutoff of 16 Å from residues 71 through 75. "within a cutoff" means here that the minimum distance between any heavy atoms of two residues is less than the cutoff. Please note, that the reference residues are also included in the output list.

This list can now be used as is to cut out the corresponding residues from the complete structure with convpdb.pl as in:

convpdb.pl -sel 1:19=21=40=43=47=51=65:83=86:103
1poa.exp.pdb

Using UNIX shell syntax this can be combined into a single command:

convpdb.pl -sel `vicinity.pl -l 71:75 
1poa.exp.pdb | head -1` 1poa.exp.pdb

The cutoff can be changed with the option -hard. So, in order to include only residues that are within 15 Å in the above example, the command would be:

vicinity.pl -hard 15 -l 71:75 1poa.exp.pdb

The first line of the output is now a much shorter residue list compared to the list above with a 16 Å cutoff:

1:18=21=47=51=66:83=86:87=89:102

It is possible to provide not just a single PDB file but multiple files each containing a different conformation for the protein structure of interest. If more than one conformation is given, the vicinity.pl utility determines which residues are within the given cutoff from the reference residues in any of the different conformations. This function is intended for cases where the conformations of the reference residues cover a wide range of configurational space either due to inherent flexibility or when searching for the native conformation during structure prediction applications.

Extra conformations can be given through additional file names on the command line. Alternatively, the option -f is available to read file names from a list file instead. This is useful for a large number of files.

When using a cutoff-based residue subset from a larger protein structure in modeling applications one would usually restrain most of the residues to varying degrees since important interactions and connectivities with the rest of the protein are missing and the cutout fragment would probably not be stable otherwise.

A good strategy that could be used for loop modeling applications would leave the loop residues completely flexible, use light restraints on residues in direct contact with the loop to allow for some structural adjustments in the template as response to different loop conformations, and use strong restraints on the outer shell residues to provide the protein template as a rigid frame away from the loop residues.

The second output line of vicinity.pl that has been ignored so far provides such a two-layer restraint description that can then be used directly with MMTSB modeling utilities such as minCHARMM.pl.

Restraint descriptions look very much like simple residue lists as can be seen in the second line of output from the example above with a 15 Å cutoff:

1:2_10=16:18=21=47=51=66=81=83=86:87=89=92=102=
3:15_0.5=67:70=76:80=82=90:91=93:101

The only difference are force constants (in kcal/mol) that are appended with an underscore sign (_) to residue intervals. In the example above there are only two places at the beginning (1:2_10) and in the middle of the string (3:15_0.5) where force constants are set (to 10 and 0.5 kcal/mol, respectively). For the other residue intervals the respective last value along the list is used by default. So the above list translates into a set of residues 1, 2, 16, 17, 18, 21, 47, 51, 66, 81, 83, 86, 87, 89, 92, and 102 that are restrained strongly with 10 kcal/mol and a set of residues 3 through 15, 67 through 70, 76 through 80, 82, 90, 91, and 93 through 101 for which a much weaker restraint force of 0.5 kcal/mol is applied. Notice, that the reference residues 71 through 75 are not included in either list since they are assumed to be fully flexible.

How is the two-layer restraint list generated? It uses a second "soft" cutoff. For all residues within the soft cutoff a weak restraint force is used while the remaining residues within the overall "hard" cutoff are considered to be on the rigid outer shell and will have a much stronger cutoff.

The hard cutoff is set with -hard as in the example shown above. The soft cutoff is changed from the default value of 12.0 with the option -soft.

Let's look at another example to understand a particular feature of vicinity.pl when the hard cutoff gets close to the soft cutoff. Assuming we want to get all residues within 12 Å from the reference residues. One might try the following command:

vicinity.pl -hard 12 -l 71:75 1poa.exp.pdb

The output looks like this:

2:16=66:83=89:102
2_10=16=66=81=83=89=92=102=3:15_0.5=67:70=76:80=
82=90:91=93:101

Keeping in mind that the soft cutoff radius is 12 Å it should be surprising that there is separate list of residues with a strong restraint force. Since the hard cutoff is set to the same value as the soft cutoff all selected residues should have a weak restraint force. This would mean, however, if such a restraint list is used, that the rigid framework of restrained residues on the outside of the selected fragment is missing and most likely there would be stability issues.

For this reason, the vicinity.pl utility automatically includes additional residues from the template that cap each fragmented chain if the edge residues are not already beyond the soft but within the hard cutoff. In the example above, only the residues in the soft restraint list are in fact within 12 Å, while all the residues in the strong restraint list are beyond 12 Å and included only to provide a rigid frame. Please note, that this also reflected in the residue selection list in the first line that includes all selected residues, some of which are beyond 12 Å.

In order to get a list of residues that are truly within 12 Å in the first output line one has to use a trick. Reducing the soft cutoff to a small number will ensure that there are enough residues within the hard cutoff but beyond the soft cutoff to cap all ends of the fragmented protein chains so that no additional residues have to be added. The following example

vicinity.pl -hard 12 -soft 2 -l 71:75
1poa.exp.pdb

shows how the residue selection list contains fewer residues when the soft cutoff is reduced:

3:15=67:80=82=90:91=93:101
3:15_10=67:69=77:80=82=90:91=93:101=70_0.5=76

The first line of output now really contains only residues within 12 Å from the reference residues.

Finally, it is possible to change the value of the force constants for the inner and outer regions by adding the new value with a colon after the cutoff radius as in:

vicinity.pl -hard 15:5.0 -l 71:75 1poa.exp.pdb

Merging of structures

As shown in examples above it is sometimes necessary to combine structures from different sources into a single PDB file. This may be done to add ligands or additional chains in multi-domain proteins, or structure fragments for the same protein may be combined to form a complete structure. All of these tasks can be done with convpdb.pl and the -merge option.
A simple example has been given before, where two chains are combined into a single PDB file containing both chains:

convpdb.pl -merge B.pdb A.pdb > AB.pdb

As another example let us consider the construction of a template for loop modeling from different sources. In this case we use residues 1 through 28 from the first template, temp1.pdb, and residues 36 through 118 from a second template, temp2.pdb, to generate a structure that contains all residues except for a loop from 29 through 35.
The first step is to generate the fragments from both templates:

convpdb.pl -sel 1:28 temp1.pdb > frag1:28.pdb

convpdb.pl -sel 36:118 temp2.pdb > frag36:118.pdb

Then both fragments can be merged into a single structure:

convpdb.pl -merge frag1:28.pdb frag36:118.pdb > 
frag1:28_36:118.pdb

If multiple fragments are being combined, it is possible to daisy chain merge commands in the following way:

convpdb.pl -merge frag2.pdb frag1.pdb | 
convpdb.pl -merge frag3.pdb | 
convpdb.pl -merge frag4.pdb | 
convpdb.pl -merge frag5.pdb  ...

The merge feature is implemented such that a structure is setup first based on the PDB file given last on the command line. Then the PDB file given as an argument after -merge is read and added to the structure, replacing existing residues from the first structure based on residue number and chain ID.

If this option is used to combine structural fragments from different sources, steric clashes may occur, especially if the fragments are continuous in space. No checks or structural adjustments are done in this regard by convpdb.pl, which combines the atom coordinates from the input as is. In order to relieve steric clashes and/or unfavorable bond geometries etc. one may relax the resulting structures by force-field based minimization.

Structure completion

The MMTSB Tool Set offers the utility complete.pl for completing missing atoms in protein structures if at least C-alpha coordinates are present for a residue. Complete all-atom structures are required for all force-field based methods such as energy evaluations, minimization, and dynamics. However, structures from experiments or other modeling efforts often lack some atoms for different reasons. In experimental data this may be due to insufficient resolution while other modeling techniques may use simplified models based only on the backbone, for example.

The use of this utility is rather simple:

convpdb.pl incomplete.pdb > complete.pdb

As explained on the manual page a few options are available to specify CHARMM19 (united aliphatic carbon atoms) or CHARMM22 (all atoms), request blocked terminal groups, and to specify histidine protonation states.

Depending on the available structure information different algorithms are being used in an automatic fashion. If at least all of the backbone and side chain heavy atoms are present CHARMM is being used to add any missing hydrogen atoms through a quick minimization procedure. If at least coordinates for C-beta are available, the side chain center is approximated and the complete side chain is rebuilt using a side chain based reconstruction procedure (M. Feig, P. Rotkiewicz, A. Kolinski, J. Skolnick, C. Brooks: Accurate Reconstruction of All-Atom Protein Representations From Side-Chain-Based Low-Resolution Models, Proteins (2000) 41, 86-97). With this method the accuracy of the side chain atoms can be expected to be less than 1 Å if at least C-alpha coordinates are given as well and about 1.4 Å if the backbone is not available. Finally, if only backbone atoms are available, at least C-alpha coordinates are necessary, side chains are rebuilt using SCWRL (M. Bower, F. E. Cohen, and R. L. Dunbrack, Jr.: Sidechain prediction from a backbone-dependent rotamer library: A new tool for homology modeling. J. Mol. Biol. (1997) 267, 1268-1282) with an expected accuracy between 1.5 and 1.8 Å. Therefore, as much information as possible should be included to obtain the highest possible accuracy in the completed all-atom structure. The complete.pl utility can handle structures with mixed levels of resolution. So if some residues have only backbone coordinates, but other residues are complete, side chains will only be added to the incomplete residues.

Please note, that a separate tool, rebuild.pl, is available for reconstructing all-atom models from side chain based lattice models that are used with MONSSTER within the MMTSB Tool Set.

Structure mutation

The mutation of residues in computational models of protein structures is rather easy compared to experimental efforts. Mutations might be useful for investigating sequence-dependent effects in much the same way experimental mutation studies are used but also for building models of new proteins from closely related templates with homologous sequences.

Within the MMTSB Tool Set the most practical way to mutate residues is to first convert a given structure to a sequence-insensitive low-resolution representation consisting only of C-alpha and side chain center coordinates for each residue and then reconstruct an all-atom model based on the new sequence. The reconstruction procedure can maintain the original coordinates for residues that are not being mutated and generate mutated residues with an accuracy of less than 1 Å. This would normally require a number of utilities from the Tool Set for lattice modeling that are described in a different section. More convenient is the use of mutate.pl, which combines the necessary steps in a single utility without requiring familiarity with the lattice modeling tools.

The use of this tool involves an input structure in PDB format and a list of residues that are being mutated:

mutate.pl -seq 10:AAF 1vii.exp.pdb

In this example the residues starting at number 10 are mutated to the given sequence AAF (Ala-Ala-Phe). Multiple mutations may be requested like this:

mutate.pl -seq 10:AAF=15:GG 1vii.exp.pdb

Now residues 15 and 16 are also mutated, into two glycines in this example.

One more option, -minimize, is available for automatic minimization of the mutated structure. While the rebuilding procedure works fairly well on its own, a subsequent minimization may be necessary to relieve minor steric clashes and other unfavorable energetic interactions. The force-field minimization procedure uses and requires the CHARMM program. The protocol involves the use of a distance dependent dielectric over a short number of steps in order to provide a somewhat realistic solvent environment without requiring too much computational time. If more extensive minimization is required, the -minimize should be skipped and minCHARMM.pl may be used as in the following example:

mutate.pl -seq 10:AAF 1vii.exp.pdb | 
minCHARMM.pl -par gb,minsteps=1000

Here, a fairly long minimization run over 1000 steps with Generalized Born implicit solvent is used. For more information on force field based minimization please refer to the corresponding section in this tutorial.