Trajectory Analysis Tutorial
 

Objective and Overview

The objective of this tutorial is to introduce users to simple analyses of trajectories with CHARMM. In this tutorial we will use a 1ns trajectory of the small protein Glutaredoxin 1 (grx1) in a 56Ax46Ax46A box of TIP3P water. The secondary structure of the 85 aa protein is shown below, and a schematic 3D representation is to the right. This is a redox protein, with an active site motif Cys-Pro-Tyr-Cys. In this structure the N-terminal Cys (Cys11) is in a reactive thiolate form (Foloppe, N., and Nilsson, L. (2004). The Glutaredoxin -C-P-Y-C- Motif: Influence of Peripheral Residues. Structure 12, 289-300.).

 

Structure of E. coli glutaredoxin 1 (grx1; PDB ID 1EGR), with the active site cysteins shown in green.

We will learn to:

  • Reorient/recenter the trajectory, and also remove all solvent from it
  • Extract RMSD vs initial structure and radius of gyration of the protein
  • Find hydrogen bonds in the initial structure, and from the trajectory
  • Analyze the secondary structure using the Kabsch&Sander (DSSP) algorithm
  • Extract NMR relxation rates and order parameters for the backbone amides
  • Extract histograms of phi/psi distributions
  • Cluster structures from the trajectory, based on selected phi and psi dihedrals
  • Compute a matrix of pair-wise RMSD values for structures in the trajectory, and project these RMSDs onto a 2D plane as another way of clustering
  • Analyze the time-dependent fluorescence anisotropy decay of a Trp residue in the protein
  • Analyze water diffusion and water structure around the protein

We demonstrate the application of CHARMM language scripting to carry-out more routine and fixed tasks. Examine the structure, and the trajectory using VMD; use your favorite graphing software to make diagrams from the data files produced by the examples. This 1ns trajectory fragment contains 200 coordinate frames (sampled every 5ps), which is not optimal for all the phenomena that we are going to look at - eg, water rotational diffusion occurs on a slightly faster time scale.

All the necessary files,  including the PSF and coordinates, are in the "files" subdirectory of this tutorial. See files/00README for a quick description.

CHARMM version 33a2 (33b1) or later is required if all examples are to be used; most work with quite old CHARMM versions, though.

Tutorial produced by Lennart Nilsson, Karolinska Institutet (lennart.nilsson@biosci.ki.se) summer 2006.


Recentering/orienting and removing water from the trajectory

We must first process the raw trajectory to facilitate some of the later analyses. The trajectrory was obtained from a simulation using periodic boundary conditions (PBC), which means that it is possible for the solute (the protein) to be partly out of the primary simulation box if we look at the raw trajectory. By recentering the trajectory we move solvent molecules, according to the PBC, so that the protein is in the center of the box in each frame. Furthermore we then re-orient each frame so that the protein is superimposed on the coordinates of the initial protein structure, thus removing overall protein rotation/translation motions. For a very flexible system the removal of overall rotation may be a non-trivial task, but grx1 is compact and quite rigid so we base the superposition on all atoms in the protein.

You run the job in this way:

$CHARMMEXEC < orient-recenter.inp > orient-recenter.out

The script uses the CHARMM MERGE command (dynamc.doc). The output file contains numerous messages about and is quite large. Two new trajectory files are produced, both of which have the protein translation/rotation removed. One has the solvent recentered around the protein, and the other has the solvent completely removed. MERGE can also be used to join or split trajectory files or to change (reduce) sampling frequency.

Note that this input file (as do all the other examples in this tutorial) starts by "calling" (STREam, see miscom.doc) another file, init.str, which contains all commands necessary to setup this system:

* Open&read rtf,param,psf,crd for grx1 (1 ns part of full simulation)
* Open "raw" trajectory on unit 51, 
* recentered/oriented trajectory on unit 61
* 
read rtf card name top_all22_prot_cust.inp 
read para card name par_all22_prot_3.inp
read psf card name grx1_solv.psf
read coor card name grx1_solv_min.crd
define SOLUTE sele segid grx1 end
set MXTIME 81 
set NF 1
set SG grx1
set xtla 56.0 
set xtlb 46.0 
set xtlc 46.0

open read form unit 50 name grx1_solv.frm
open read unform unit 51 name grx1_solv.trj
open read unform unit 61 name grx1_solv_rc.trj
! TO USE THIS TRAJECTORY: DELETE ATOM SELE .NOT. SOLUTE END
open read unform unit 62 name grx1_prot_orient.trj

return

The idea behind this is that (most of) the analysis scripts can be used on (not too) different systems without change - all that you have to do is to substitute a new "init.str" with the appropriate information for the new system to be analyzed.

Root Mean Square Deviation (RMSD) and Radius of Gyration (RGYR)

Two simple characteristics of a structure are i) its RMSD wrt  to some reference structure (eg, the starting point of a simulation), which tells us something about the dissimilarity between the structures - 1Å RMSD is barely visible to the eye, and RMSD 10Å is very different, and ii) its RGYR, which is a combined measure of its overall size and shape - a sphere has the smallest RGYR of all bodies of the same volume (=0.6*sphere-radius); increasing the volume, making the shape more anisotropic or having more of its mass at the periphery all lead to an increased RGYR.  The proper definition of RGYR in mechanics is mass-weighted, which you can get by adding keywor dMASS to the COOR RGYR command.

You run the job in this way:

$CHARMMEXEC < rmsd-rgyr.inp > rmsd-rgyr.out

This script uses the CORREL module (correl.doc) to extract the RMSD and RGYR timeseries, and the results are output to one file. For single structures the COOR command (corman.doc) is used. The results could have been output to one file for each property, but we use the ability to edit the dimensions of the extracted time series in  CORREL (edit ... veccod) to put all the information into one file.

Hydrogen Bonds

Hydrogen bonding patterns often provide useful information. In this exercise we use the COOR HBOND (corman.doc) command to find hydrogen bonds in a single structure, as well as average number of hydrogen bonds and their average life times from a trajectory. Hydrogen bonds can be defined in many ways, but we use a simple geometric criterion: A hydrogen bond exists if the distance between the hydrogen and acceptor atoms is less than 2.4Å. This works well in practice. Note that when we have information about the hydrogen position, the hydrogen bond definition is not very sensitive to angular criteria, as are often used when determining hydrogen bonds in X-ray structures (lacking hydrogens). We will look at hydrogen bonds within the protein, between protein and water, and also on water-mediated hydrogen bonding contacts between different parts of the protein, ie hydrogen bonded interactions of the form A/D - water - A/D, where A/D denotes a hydrogen bond donor or acceptor in the protein. COOR HBOND uses the information about acceptors and donors in the PSF so we can use quite simple, and general atom-selections; all hydrogen bonds between the acceptors/donors in the two selections are calculated. A second form of the command (COOR CONTact) just applies the distance criterion to all the selected atoms.

You run the job in this way:

$CHARMMEXEC < hbonds.inp > hbonds.out

The results are presented as hydrogen bonds per each acceptor/donor in the first atom-selection; if the VERBose keyword is present a break-down in terms of accptors/donors in the second atom-selection is given. NB! The VERBose keyword is not useful in a CHARMM loop when you want to extract total number of hydrogen bonds through CHARMM variables (subst.doc). <occupancy> is the average number of hydrogen bonds formed by a given accptor/donor during the trajectory.  The <lifetime> (in ps) is the average duration of each instance of a given hydrogen bond.

Secondary Structure

COOR SECS (corman.doc) computes the secondary structure characteristics of a protein using the Kabsch&Sander algorithm (Kabsch, W., and Sander, C. (1983). Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 22, 2577-2637.), which is based on backbone hydrogen bond patterns; this is the same as in the DSSP program (almost... the CHARMM implementation is a slight simplification and uses the hydrogen-acceptor-atom definition of a hydrogen bond).

You run the job in this way:

$CHARMMEXEC < secondary-structure.inp > secondary-structure.out

Results are summarized in the output file, and are also returned as a numerical flag in the WMAIN array.

NMR Relaxation Rates and Generalized Order Parameters

MD simulations and NMR experiments often cover similar time-scales. Relaxation phenomena observed in NMR experiments depend on motional behavior of nuclear dipoles in the macromolecule, and the NMR module in CHARMM has been designed to allow efficient extraction of NMR related parameters from a trajectory (nmr.doc). This example analyzes relaxtaion parameters (relaxation rates, generalized order parameters, configurational entropy estimates) for the grx1 backbone amide groups.

You run the job in this way:

$CHARMMEXEC < nmr.inp > nmr.out

Results, with some details such as the computed correlation functions are given in the output file, and are also summarized in the nh.dat file. Note that we use the trajectory with overall protein (translation)/rotation removed - we assume that internal and overall motions occur on very different time scales so that we can confidently separate these motions and focus on the internal motions.

Phi/Psi Distributions and Phi/Psi Based Clustering

Peptide backbone dihdrals phi and psi determine the overall fold of a protein. In this example we first extract histograms of phi and psi  distribution from the trajectory for a number of residues (correl.doc) and then use some of the dihdrals that vary during the simulation for clustering the structures in the trajectory

You run the dihedral distribution job in this way (NB! this is the one example that requires command-line arguments):

$CHARMMEXEC < phi-psi-dist.inp > phi-psi-dist.out NFIRST=2 NLAST=60

There are more than 60 residues in grx1, but some internal limit in CHARMM restricts us to less than about 70 residues for a single CORREL analysis of phi/psi dihedral; if you want data for all residues  just run the job several times with appropriate NFIRST and NLAST arguments (note that the very first and last residues cannot be analyzed with this script, since the Phi and Psi definitions go beyond one single residue).

After the TRAJ command in the CORREL module has been executed averages and fluctuations are printed out for each time series; you can use this as a guide to find dihedrals that are likely to be of interest (ie, they populate different regions); primary candidates have large fluctuations. Phi and Psi histograms for each residue are output to a file named phi-psi-n.dat, where "n" is the residue number.

Having checked the results of the phi/psi distributions we may now try clustering conformations (correl.doc):

$CHARMMEXEC < cluster.inp > cluster.out 

Pair-wise RMSD Matrix and 2D Projection

In this exercise we compute the a matrix whose of RMSDs between all pairs of N conformations in the trajectory, and then project the result onto a 2D plane, represented by a set of coordinate pairs {pi,qi}, i=1,N, with the property that the distance between (pi,qi) and  (pj,qj) is optimally close to the RMSD between conformers i and j (Levitt, M. (1983). Molecular Dynamics of Native Protein.II. Analysis and Nature of Motion. J Mol Biol 168, 621-657.). The CHARMM implementation of this algorithm is for a 2D projection (easy to visualize), but it is easily generalized to several dimensions.

You run the job in this way:

$CHARMMEXEC < rmsd-matrix.inp > rmsd-matrix.out

The 2D projection requires a non-linear optimization, and thus is dependent on initial estimeates (generated using random numbers), and it is strongly suggested to run the procedure several  times, with different seeds (PQseed keyword) for the random number generator in order to get a feel for the stability and validity of the 2D projection.

Fluorescence Anisotropy

The motion of chromophores may be detected by following the time-dependence of the polarized components of fluorescence emission following a brief pulse of polarized excitation. This time dependent anisotropy can also be computed from the (rank two) auto correlation of a unit vector along the transmission dipole (or, more strictly, the corelation between absorption and emission dipoles). Trp is the most useful intrinsic chromophore in proteins; Tyr also has a certain fluorescence, but less intense than that of Trp. Extrinsic probes (dansyl chloride and many others) are often used. The fast decay of this fluorescence anisotropy means that we usually can decouple the internal motions from the overall rotational diffusion of the protein (see also the nmr exercise).

You run the job in this way:

$CHARMMEXEC < trp-anisotropy.inp > trp-anisotropy.out

Solvent Dynamics and Structure

All cells live an aqueous environment and their insides (cytoplasm) are also largely aqueous. Biomacromolecules thus have evolved to function with this in "mind" (membrane proteins are of course different in this respect); consquently CHARMM has well-developed facilities to analyze solvation behavior (see also the Hydrogen Bonds exercise). Here we will use the COOR ANAL (corman.doc) module to extract solvent dynamics (translational and rotational diffusion) and structure in terms of radial distribution functions.

You run the job in this way:

$CHARMMEXEC < solvent.inp > solvent.out

 

Further Examples

The above exercises have been meant to show some of the analysis capabilities of CHARMM. Most important is that you have to formulate your own scientifically relevant questions, and then find a way to answer them. CHARMM is an evolving research tool, and by combining its existing analysis facilites in various ways using the scripting language you get a long way towards obtaining the data that you need. More examples can be found at the CHARMM Web-site www.charmm.org, in the Script Archive forum, to which everybody is encouraged to submit their own scripts, for analysis, structure manipulation, simulation or ....