0-Application walk thourgh

In this notebook, we introduce the basics of PyLipID via a walk-through application. We keep the details to minimum and invite users to refer to other specific notebooks in which you can find more details, exercise and theory.

In this tutorial, we show the PyLipID workflow of checking the lipid interactions with a Class A GPCR, Adenosine A2a receptor from coarse-grained simulations with a complex membrane bilayer that contains 7 lipid species.

807072de8a7a421cb9fa526f7f778144

The simulation ensemble reported in the original paper contained 10 repeats, but here we took 2 of the repeats for demonstration and saved in two directories run1 and run2, both of which contains a simulation trajectory protein_lipids.xtc and a corresponding topology coordinate protein_lipids.gro. The two simulation trajectories have the same topology but different initial configurations. The trajectories were saved with a 20 ns timestep. The trajectories are available for download here.

In this tutorial, we will first check cholesterol interactions with receptor residues and then interactions with binding sites. We will also illustrate some of the plotting funtions and ways to save the interaction data in various formats that assist different analysis.

For those of you who are impatient, PyLipID documentation website has provided a no-brainer script that runs all PyLipID analysis in one run. The script can be downloaded here.

[1]:
%matplotlib inline
import pylipid
from pylipid.api import LipidInteraction
print(pylipid.__version__) # make sure pylipid version is later than 1.4
1.5.10

Load data

We start by initializing the main class LipidInteraction.

Trajectories and topology coordinates of the two repeats were saved in the directory ../../tests/data. Thus we first create a trajfile_list and topfile_list. Then, we load the trajfile_list and topfile_list to LipidInteraction together with a couple of calculation settings, including what cutoffs to use, which lipid to check, which lipid atoms are used to define the interactions and what time unit the generated data use etc. You can also specify which directory the generated data should be saved at. But here we will use the default value, i.e. the current working directory.

[2]:
trajfile_list = ["./traj_data/A2a/run1/protein_lipids.xtc", "./traj_data/A2a/run2/protein_lipids.xtc"]
topfile_list = ["./traj_data/A2a/run1/protein_lipids.gro", "./traj_data/A2a/run2/protein_lipids.gro"]
lipid = "CHOL"
cutoffs = [0.475, 0.8] # use of dual-cutoff
nprot = 1  # num. of protein copies in the system. if the simulation system has N copies of receptors,
           # "nprot=N" will report interactions averaged from the N copies, but "nprot=1"
           # will ask pylipid to report interaction data for each copy.
timeunit = 'us' # micro-second
save_dir = None  # if None, pylipid data will be saved at current working directory.

# initialize
li = LipidInteraction(trajfile_list, topfile_list=topfile_list, cutoffs=cutoffs, lipid=lipid,
                      nprot=nprot, save_dir=save_dir)

Interaction with residues

PyLipID provides functions to calculate the durations, occupancy (percentage of frames in which lipids formed contacts) and lipid count (the average number of surrounding lipids) of lipid interactions with protein residues. In addition, PyLipID can calculate interaction residence time, which is calculated from 1/koff. The function collect_residue_contacts() is required before any of the calculation of interactions with residues. This function will create a index of contacting lipids for each residue at each frame.

[3]:
li.collect_residue_contacts()
COLLECT INTERACTIONS FROM TRAJECTORIES: 100%|██████████| 2/2 [00:26<00:00, 13.20s/it]

After this calculation, a pandas.DataFrame is created as an attribute, named dataset, of the main class, which is accessible through li.dataset. It currently contains only two columns, Residue and Residue ID, but will be updated each time after a calculation is carried out to the class object LipidInteraction, i.e. what li refers to here.

[4]:
print(li.dataset)
    Residue  Residue ID
0      1ILE           0
1      2MET           1
2      3GLY           2
3      4SER           3
4      5SER           4
..      ...         ...
297  298ARG         297
298  299LYS         298
299  300ILE         299
300  301ILE         300
301  302ARG         301

[302 rows x 2 columns]

Durations, occupancy and lipid count

After collect_residue_contacts(), we can calculate the lipid interactions in a couple of metrics, using different functions. Each function has a parameter of residue_id which allows for calculation of specified residues via residue ID. Calculation will run for all residues if residue_id is set to the default value of None. Let’s check the interaction durations for all residues:

[5]:
durations = li.compute_residue_duration()
CALCULATE DURATION PER RESIDUE: 100%|██████████| 302/302 [00:00<00:00, 306.00it/s]

The returned durations is a list with a length of the number of residues. The n-th list is the durations of lipid interactions with the n-th residue from the trajectories. For example, let’s check the durations of the first residue. It shows a list of two list, corresponding to durations from the two trajectories.

[6]:
print(len(li.residue_list), len(li.trajfile_list))
print(len(durations), len(durations[0]))
print(durations[0])
302 2
302 2
[[0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.06, 0.06, 0.08, 0.1, 0.12, 0.16], [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.04, 0.06, 0.1, 0.1, 0.12, 0.12]]

Alternatively, if we are only interested in checking the interactions with first residue, we can just do the calculation for this residue:

[7]:
durations_of_first_residue = li.compute_residue_duration(residue_id=0)
print(durations_of_first_residue)
CALCULATE DURATION PER RESIDUE: 100%|██████████| 1/1 [00:00<00:00, 198.17it/s]
[[0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.06, 0.06, 0.08, 0.1, 0.12, 0.16], [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.04, 0.06, 0.1, 0.1, 0.12, 0.12]]

Similary, we can check the occupancy and lipid count as follow, the returned data has the same structure as compute_residue_duration()

[8]:
occupancies = li.compute_residue_occupancy()
lipidcounts = li.compute_residue_lipidcount()
print(len(occupancies), len(occupancies[0]))
print(len(lipidcounts), len(lipidcounts[0]))
CALCULATE OCCUPANCY: 100%|██████████| 302/302 [00:00<00:00, 13443.71it/s]
CALCULATE RESIDUE LIPIDCOUNT: 100%|██████████| 302/302 [00:00<00:00, 10329.96it/s]
302 2
302 2

After each of the above calculation, the pandas.DataFrame that is stored as dataset of the main class, is updated to include the average values for each residue. We can have a look at the dataset by using head() to see the columns have expanded to include “Duration”, “Occupancy”, “Lipid Count”:

[9]:
li.dataset.head(10) # show 10 lines
[9]:
Residue Residue ID Duration Duration std Occupancy Occupancy std Lipid Count Lipid Count std
0 1ILE 0 0.048108 0.037333 8.366534 1.195219 1.000000 0.000000
1 2MET 1 0.035000 0.016583 1.792829 0.996016 1.000000 0.000000
2 3GLY 2 0.020000 0.020000 0.199203 0.199203 0.500000 0.500000
3 4SER 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 5SER 4 0.080000 0.178232 7.569721 3.187251 1.000000 0.000000
5 6VAL 5 0.072340 0.068486 12.749004 0.000000 1.015625 0.015625
6 7TYR 6 0.099259 0.149664 31.274900 5.776892 1.021505 0.021505
7 8ILE 7 0.187368 0.236128 6.374502 1.992032 1.000000 0.000000
8 9THR 8 0.074085 0.104188 21.314741 3.784861 1.038600 0.006854
9 10VAL 9 0.105246 0.146742 19.521912 0.398406 1.051250 0.011250
[10]:
print(li.dataset["Duration"])
0      0.048108
1      0.035000
2      0.020000
3      0.000000
4      0.080000
         ...
297    0.063750
298    0.010000
299    0.111765
300    0.086809
301    0.041538
Name: Duration, Length: 302, dtype: float64

Residence time and koff

Residence time provides useful insights into the dynamical behavior of the bound lipids, which due to their interaction with the protein, is no longer diffusive. Indeed, we often see both prolonged interactions and quick diffusive contacts of lipids at the protein surface. PyLipID therefore uses bi-exponentials for residence time calculation to account for the long and short decays of lipid relaxation. Residence time is calculated as 1 over koff, which is calculated from a normalized survival function using the durations. plot_data parameter allows for plotting of the koff data, which provides a more visually straight-forward way to check lipid interactions and understand the data.

[11]:
koffs, res_times = li.compute_residue_koff(plot_data=True, fig_close=True)
# fig_close=True will close all the figures. As it can generate substantial amount of figures
# (one for each residue), leaving all the figures open can consume considerable amount of memory.

# li.compute_residue_koff(residue_id=[10,15,25,35])
# li.compute_residue_koff(residue_id=10)

The koff figure looks like this:

0cf2ce9fe8cc4c1fb1a9770ee23c1ded

The left panel plots the interactions durations in the sorted order and the right panel plots the normalised survival rates of these interactions (the purple doted line) and the bi-exponential curve fitted to the survival rates (red broken line). To check the sampling quality, the interaction durations are bootstrapped and the fitted curve of the bootstraped interaction durations are plotted in gray lines. Insufficient sampling of interactions would give large deviation to the bootstrapped koffs.

Again, the dataset is updated to include information calculated from koff calculation. Added columns include the average value of each residue of “Residence Time”, “Koff”, “R Squared”, “Residence Time Bootstrap avg”, and “R Sqaured Bootstrap”. “R Sqaured” is the r_sqaured values of the curve fitting to the normarlised survival function. It indicates the quality of the curve fitting, hence indicates the reliability of the calculated koff and residence time. “Residence Time Bootstrap avg” and “R Squared Bootstrap avg” are the averaged residence time and r_squared from bootstrapping respectively.

[12]:
li.dataset.head(10)
[12]:
Residue Residue ID Duration Duration std Occupancy Occupancy std Lipid Count Lipid Count std Koff Residence Time R Squared Koff Bootstrap avg R Squared Bootstrap avg
0 1ILE 0 0.048108 0.037333 8.366534 1.195219 1.000000 0.000000 11.804224 0.084715 0.998837 17.188663 0.998392
1 2MET 1 0.035000 0.016583 1.792829 0.996016 1.000000 0.000000 30.345855 0.032953 0.997823 32.201659 0.996950
2 3GLY 2 0.020000 0.020000 0.199203 0.199203 0.500000 0.500000 29.872911 0.033475 0.975950 26.885620 NaN
3 4SER 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 5SER 4 0.080000 0.178232 7.569721 3.187251 1.000000 0.000000 1.790196 0.558598 0.960367 7.793917 0.975108
5 6VAL 5 0.072340 0.068486 12.749004 0.000000 1.015625 0.015625 13.032816 0.076729 0.999307 10.955595 0.998862
6 7TYR 6 0.099259 0.149664 31.274900 5.776892 1.021505 0.021505 4.313793 0.231815 0.996524 4.922636 0.994342
7 8ILE 7 0.187368 0.236128 6.374502 1.992032 1.000000 0.000000 2.335581 0.428159 0.992678 2.604414 0.993542
8 9THR 8 0.074085 0.104188 21.314741 3.784861 1.038600 0.006854 11.245471 0.088925 0.924258 6.244023 0.992139
9 10VAL 9 0.105246 0.146742 19.521912 0.398406 1.051250 0.011250 4.454339 0.224500 0.997426 3.898776 0.995095

Interaction with binding sites

Before calculation of interactions with binding sites, we need to calculate where the binding sites are. PyLipID calculates the lipid interaction correlations between the protein residues and finds clusters of residues that bind to the same lipid molecule at the same time. This is carried out by computing the community structures of the interaction network of the protein residues.

[13]:
node_list, modularity = li.compute_binding_nodes(threshold=4, print_data=False)

# threshold=4 decides that binding sites should contain at least 4 residues. This is particularly
# helpful when itneraction samplings are not sufficient, in which case false positively correlation
# is likely to happen among 2 or 3 residues.
#
# print_data=True will print the residues for each binding site. It can be quite verbose.

This function returned two data, first is the node_list which is a list of binding site lists. Each binding site list contains the IDs of residues in that binding site. Modularity is the network modularity, which is taken from the Louvain’s method of calculating network community structures. Modularity is an indication of the quality of community structures, i.e. binding sites in our cases. The value of modularity is between 1 and -1. The closer to 1, the more distinctive the community structures, hence the more trust-worthy the calculated binding sites. Modularity is dependent on the quality of interaction network, which is ultimately dependent on such factors as the sampling of lipid interactions and choice of interaction cutoffs.

[14]:
print(node_list)
print()
print(modularity)
[[0, 1, 2, 4, 5, 8, 64], [6, 9, 10, 12, 13, 265, 268, 272, 279], [7, 11, 14, 15, 18, 53, 54, 57, 58, 61, 298], [16, 20, 282, 286, 293, 296, 297, 300], [19, 22, 23, 26, 30, 299, 301], [25, 28, 29, 40, 43, 44, 47, 48, 50, 51, 55, 80, 119, 122, 123, 126], [41, 90, 93, 94, 97, 98, 101, 112, 113, 121, 125, 128, 181, 182, 186], [59, 62, 67, 73, 77], [72, 76, 127, 130, 131, 132, 134, 135, 137, 138, 140, 173, 177], [100, 104, 105, 185, 189, 192, 193], [116, 117, 120, 124], [176, 179, 180, 183, 184, 187, 188, 239, 240, 243, 244, 251, 255, 256], [191, 194, 198, 230, 233, 236, 237], [195, 196, 199, 202, 203], [227, 231, 234, 235, 238, 242, 273, 276, 280, 283, 284, 287, 288], [241, 245, 248, 249, 252, 259, 262, 263, 266, 267, 269, 270]]

0.8311593806596966

Afer this calculation, the dataset had a new column Binding Site ID, which gives the residues belonging to the same binding site a binding site ID, whereas labels residues that do not belong to any binding site as ‘-1’. This binding site ID will be an invariable and be accessed through all the binding site calculation.

[15]:
li.dataset.head(10)
[15]:
Residue Residue ID Duration Duration std Occupancy Occupancy std Lipid Count Lipid Count std Koff Residence Time R Squared Koff Bootstrap avg R Squared Bootstrap avg Binding Site ID
0 1ILE 0 0.048108 0.037333 8.366534 1.195219 1.000000 0.000000 11.804224 0.084715 0.998837 17.188663 0.998392 0
1 2MET 1 0.035000 0.016583 1.792829 0.996016 1.000000 0.000000 30.345855 0.032953 0.997823 32.201659 0.996950 0
2 3GLY 2 0.020000 0.020000 0.199203 0.199203 0.500000 0.500000 29.872911 0.033475 0.975950 26.885620 NaN 0
3 4SER 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -1
4 5SER 4 0.080000 0.178232 7.569721 3.187251 1.000000 0.000000 1.790196 0.558598 0.960367 7.793917 0.975108 0
5 6VAL 5 0.072340 0.068486 12.749004 0.000000 1.015625 0.015625 13.032816 0.076729 0.999307 10.955595 0.998862 0
6 7TYR 6 0.099259 0.149664 31.274900 5.776892 1.021505 0.021505 4.313793 0.231815 0.996524 4.922636 0.994342 1
7 8ILE 7 0.187368 0.236128 6.374502 1.992032 1.000000 0.000000 2.335581 0.428159 0.992678 2.604414 0.993542 2
8 9THR 8 0.074085 0.104188 21.314741 3.784861 1.038600 0.006854 11.245471 0.088925 0.924258 6.244023 0.992139 0
9 10VAL 9 0.105246 0.146742 19.521912 0.398406 1.051250 0.011250 4.454339 0.224500 0.997426 3.898776 0.995095 1

Therefore, to check residues of a binding site and look at their interaction properties, we can use this Binding Site ID:

[16]:
mask = (li.dataset["Binding Site ID"] == 0) # check binding site NO. 0
li.dataset[mask]
[16]:
Residue Residue ID Duration Duration std Occupancy Occupancy std Lipid Count Lipid Count std Koff Residence Time R Squared Koff Bootstrap avg R Squared Bootstrap avg Binding Site ID
0 1ILE 0 0.048108 0.037333 8.366534 1.195219 1.000000 0.000000 11.804224 0.084715 0.998837 17.188663 0.998392 0
1 2MET 1 0.035000 0.016583 1.792829 0.996016 1.000000 0.000000 30.345855 0.032953 0.997823 32.201659 0.996950 0
2 3GLY 2 0.020000 0.020000 0.199203 0.199203 0.500000 0.500000 29.872911 0.033475 0.975950 26.885620 NaN 0
4 5SER 4 0.080000 0.178232 7.569721 3.187251 1.000000 0.000000 1.790196 0.558598 0.960367 7.793917 0.975108 0
5 6VAL 5 0.072340 0.068486 12.749004 0.000000 1.015625 0.015625 13.032816 0.076729 0.999307 10.955595 0.998862 0
8 9THR 8 0.074085 0.104188 21.314741 3.784861 1.038600 0.006854 11.245471 0.088925 0.924258 6.244023 0.992139 0
64 65SER 64 0.180000 0.180000 0.199203 0.199203 0.500000 0.500000 3.454593 0.289470 0.973915 2.418215 NaN 0

Durations, occupancy and lipid count

Similar to interaction with residues, interactions with binding sites can also be checked in a couple of metrics. The same syntax is used.

[17]:
durations = li.compute_site_duration()
occupancies = li.compute_site_occupancy()
lipidcounts = li.compute_site_lipidcount()
print(len(node_list), len(li.trajfile_list))
print(len(durations), len(durations[0]))
print(len(occupancies), len(occupancies[0]))
print(len(lipidcounts), len(lipidcounts[0]))
CALCULATE DURATION PER BINDING SITE: 100%|██████████| 16/16 [00:00<00:00, 49.09it/s]
CALCULATE OCCUPANCY PER BINDING SITE: 100%|██████████| 16/16 [00:00<00:00, 134.74it/s]
CALCULATE LIPIDCOUNT PER BINDING SITE: 100%|██████████| 16/16 [00:00<00:00, 132.91it/s]
16 2
16 2
16 2
16 2

Similar to the calculation for residues, the returned list contains calculated values for each binding site from the two trajectories.

Residence Time and koff

Similar to that for residue, residence time and koff can be calculated for binding sites, via the function compute_site_koff

[18]:
koff_BS, res_time_BS = li.compute_site_koff(plot_data=True)

The returned values are koff and residence time for each binding site. plot_data=True will generate the koff figure shown above for each binding site. Now the dataset is expanded to include information for binding site residence time and koff. Residues from the same binding site will record the same binding site interaction data:

[19]:
li.dataset[li.dataset["Binding Site ID"] == 2]
[19]:
Residue Residue ID Duration Duration std Occupancy Occupancy std Lipid Count Lipid Count std Koff Residence Time ... R Squared Bootstrap avg Binding Site ID Binding Site Duration Binding Site Occupancy Binding Site Lipid Count Binding Site Koff Binding Site Koff Bootstrap avg Binding Site Residence Time Binding Site R Squared Binding Site R Squared Bootstrap avg
7 8ILE 7 0.187368 0.236128 6.374502 1.992032 1.000000 0.000000 2.335581 0.428159 ... 0.993542 2 0.160196 73.904382 1.26079 2.358624 1.492173 0.423976 0.998354 0.995259
11 12LEU 11 0.257778 0.351371 23.306773 1.792829 1.059524 0.003968 2.295437 0.435647 ... 0.995768 2 0.160196 73.904382 1.26079 2.358624 1.492173 0.423976 0.998354 0.995259
14 15ALA 14 0.223333 0.182361 1.195219 0.000000 1.000000 0.000000 3.246453 0.308028 ... 0.985841 2 0.160196 73.904382 1.26079 2.358624 1.492173 0.423976 0.998354 0.995259
15 16VAL 15 0.128621 0.182931 19.920319 1.593625 1.040258 0.003221 1.211126 0.825678 ... 0.997457 2 0.160196 73.904382 1.26079 2.358624 1.492173 0.423976 0.998354 0.995259
18 19ILE 18 0.172105 0.172141 12.350598 1.992032 1.038462 0.038462 3.809314 0.262514 ... 0.996998 2 0.160196 73.904382 1.26079 2.358624 1.492173 0.423976 0.998354 0.995259
53 54GLY 53 0.200000 0.161864 0.796813 0.796813 0.500000 0.500000 3.723346 0.268576 ... 0.984969 2 0.160196 73.904382 1.26079 2.358624 1.492173 0.423976 0.998354 0.995259
54 55VAL 54 0.184444 0.235132 22.310757 1.992032 1.026197 0.006590 2.649627 0.377412 ... 0.997034 2 0.160196 73.904382 1.26079 2.358624 1.492173 0.423976 0.998354 0.995259
57 58ILE 57 0.383158 0.329674 24.302789 1.992032 1.008929 0.008929 1.316120 0.759809 ... 0.995668 2 0.160196 73.904382 1.26079 2.358624 1.492173 0.423976 0.998354 0.995259
58 59PRO 58 0.296429 0.301996 16.932271 2.589641 1.037982 0.017574 2.093627 0.477640 ... 0.994831 2 0.160196 73.904382 1.26079 2.358624 1.492173 0.423976 0.998354 0.995259
61 62ILE 61 0.212941 0.269635 6.772908 1.992032 1.000000 0.000000 2.067640 0.483643 ... 0.991763 2 0.160196 73.904382 1.26079 2.358624 1.492173 0.423976 0.998354 0.995259
298 299LYS 298 0.010000 0.010000 0.199203 0.199203 0.500000 0.500000 0.000000 0.000000 ... NaN 2 0.160196 73.904382 1.26079 2.358624 1.492173 0.423976 0.998354 0.995259

11 rows × 22 columns

LipidInteraction also provides a assisting function binding_site to show binding site information and show some stats and returns the pandas.DataFrame containing residues of the specified binding site.

[20]:
df = li.binding_site(binding_site_id=3) # check the binding site No 3
df.head(10)
# Binding site 3
 Binding Site Residence Time:       0.402 us
 Binding Site Koff:                 2.490  R squared:  0.9984
 Binding Site Duration:             0.084 us
 Binding Site Occupancy:           36.056 %
 Binding Site Lipid Count:          1.087
 Pos. Charge:        2/8
 Neg. Charge:        0/8
 Polar:              0/8
 Hydrophobic:        4/8
 Special:            2/8
 Residue Res ID  Res. Time (us)  Duration (us)   Occupancy (%) Lipid Count  Koff  R Squared
 297PHE    296       0.322           0.098          17.131        1.008     3.108   0.998
 301ILE    300       0.108           0.087          13.745        1.000     9.263   0.999
  21GLY    20        0.103           0.077           3.984        1.000     9.719   0.990
 298ARG    297       0.095           0.064           3.386        1.000    10.535   0.997
  17LEU    16        0.052           0.057          13.147        1.014    19.296   0.999
 287ALA    286       0.047           0.030           0.797        0.500    21.240   0.993
 294ARG    293       0.033           0.025           0.598        0.500    30.652   0.996
 283PRO    282       0.030           0.032           0.996        1.000    32.883   0.994



[20]:
Residue Residue ID Duration Duration std Occupancy Occupancy std Lipid Count Lipid Count std Koff Residence Time ... R Squared Bootstrap avg Binding Site ID Binding Site Duration Binding Site Occupancy Binding Site Lipid Count Binding Site Koff Binding Site Koff Bootstrap avg Binding Site Residence Time Binding Site R Squared Binding Site R Squared Bootstrap avg
282 283PRO 282 0.032000 0.024000 0.996016 0.597610 1.000000 0.000000 32.882928 0.030411 ... 0.693587 3 0.08427 36.055777 1.086764 2.489836 6.14596 0.401633 0.998424 0.998174
293 294ARG 293 0.025000 0.021794 0.597610 0.597610 0.500000 0.500000 30.652487 0.032624 ... 0.794795 3 0.08427 36.055777 1.086764 2.489836 6.14596 0.401633 0.998424 0.998174
286 287ALA 286 0.030000 0.030000 0.796813 0.796813 0.500000 0.500000 21.240404 0.047080 ... 0.492717 3 0.08427 36.055777 1.086764 2.489836 6.14596 0.401633 0.998424 0.998174
16 17LEU 16 0.056923 0.047940 13.147410 0.796813 1.014286 0.014286 19.295543 0.051825 ... 0.999266 3 0.08427 36.055777 1.086764 2.489836 6.14596 0.401633 0.998424 0.998174
297 298ARG 297 0.063750 0.060917 3.386454 0.199203 1.000000 0.000000 10.535137 0.094920 ... 0.994440 3 0.08427 36.055777 1.086764 2.489836 6.14596 0.401633 0.998424 0.998174
20 21GLY 20 0.077143 0.090351 3.984064 0.796813 1.000000 0.000000 9.719108 0.102890 ... 0.991340 3 0.08427 36.055777 1.086764 2.489836 6.14596 0.401633 0.998424 0.998174
300 301ILE 300 0.086809 0.071014 13.745020 2.589641 1.000000 0.000000 9.262659 0.107960 ... 0.998140 3 0.08427 36.055777 1.086764 2.489836 6.14596 0.401633 0.998424 0.998174
296 297PHE 296 0.098095 0.099696 17.131474 6.772908 1.008333 0.008333 3.107937 0.321757 ... 0.997092 3 0.08427 36.055777 1.086764 2.489836 6.14596 0.401633 0.998424 0.998174

8 rows × 22 columns

Analyze bound poses

Once the binding sites are defined, we can proceed to analyze bound poses, e.g. check the most representative bound poses for binding sites and check bound pose clusters for binding sites etc.

[21]:
pose_pool, pose_rmsd = li.analyze_bound_poses(binding_site_id=None, pose_format="pdb",
                                             n_top_poses=3, n_clusters='auto') # None means analysis for all residues
../_images/tutorials_0-application-walk-through_43_1.png

analyze_bound_poses() will rate the bound poses for each binding site based on probability density calculated from the simulation trajectories and write out a couple of top rated poses (controled by n_poses). It will also cluster the bound poses for each binding site, for which the number of clusters can be determined by a density based cluster DBSCAN or provided by users via the parameter n_clusters. So for each binding site, it will generate two directories (if n_clusters is not 0), one for top-rated poses (in the directory of ‘BSid{num}_rank’) and one for all the clustered poses (in the directory of ‘BSid{num}_clusters’). All these directories are then put under the “Bound_Poses_{lipid}” directory if the parameter save_dir is None, i.e. not given. An overview of the directories organisation is shown below:

f48443b8c303410c9f79966b4f28a324

The written poses store the lipid pose and the recetpor conformation that this pose was bound to. By default, the pose files are saved in the Gromacs ‘gro’ format but can be changed to, e.g. ‘pdb’ format by the parameter pose_format. For the top-ranked poses, the pose named ‘top1’ is the one with highest scores, i.e. the most representative bound pose from the simulations; and ‘top2’ is the second, and so on. For coarse-grained simulations, these top-ranked poses can be quite similar to each other due to the smoothened potentials. These poses can be checked by such molecular visualization softwares as VMD and PyMol etc. Below shows the top rated bound cholesterol pose at binding site No 3 of A2a. This cholesterol binding site have been seen in many of the Class A GPCR x-ray structures:

11fe9c1ee8fb44abaf865b4ae4dc77fc

The lipid bound poses for each binding site will be returned. The RMSD of bound poses to the binding site average will be returned and plotted in a violin plot.

42c2f80c1f95473899c4e1151bc4f516

Similar to other analysis functions, analyze_bound_poses can also select a couple of binding sites for analysis via binding_site_id.

The returned pose_pool is a dict object that stores bound poses for binding sites, with the binding site id as the dictionary keys and the bound pose coordinates as the corresponding values. pose_rmsd is a pandas.DataFrame that stores RMSD for the bound poses (in unit of nm).

[22]:
pose_rmsd.head()
[22]:
Binding Site 0 Binding Site 1 Binding Site 2 Binding Site 3 Binding Site 4 Binding Site 5 Binding Site 6 Binding Site 7 Binding Site 8 Binding Site 9 Binding Site 10 Binding Site 11 Binding Site 12 Binding Site 13 Binding Site 14 Binding Site 15
0 0.223008 0.192302 0.288550 0.487145 0.325898 0.216980 0.417243 0.598373 0.303224 0.260776 0.174348 0.681689 0.269166 0.122740 0.362086 0.569712
1 0.214176 0.374907 0.317757 1.028707 0.492244 0.234929 0.465488 0.313493 0.312007 0.245380 0.189006 0.454326 0.267336 0.193313 0.836323 0.692334
2 0.279867 0.690856 0.540312 0.383474 0.372813 0.216115 0.463774 0.152496 0.780332 0.550248 0.127422 0.375738 0.265146 0.129976 0.524896 0.978642
3 0.199653 0.377980 0.362978 0.379433 0.406318 0.261471 0.273237 0.332318 0.325981 0.206775 0.196669 0.272414 0.179198 0.272044 0.273617 0.529334
4 0.288850 0.348632 0.273372 0.251446 0.244760 0.240792 0.359739 0.161057 0.385767 0.587389 0.326057 0.516921 0.297746 0.280991 0.701244 0.343846

Calculate surface area

We can also calculate the surface area for binding site, obtaining the values as a function of time. print_data=True generates a timeseries plot and a violinplot.

[23]:
surface_area = li.compute_surface_area(plot_data=True)
../_images/tutorials_0-application-walk-through_48_1.png
../_images/tutorials_0-application-walk-through_48_2.png

The returned surface_area is pandas.DataFrame object that stores the time evolution of the surface area for each binding site. The reported values are in the unit of nm^2.

[24]:
surface_area.head()
[24]:
Binding Site 0 Binding Site 1 Binding Site 2 Binding Site 3 Binding Site 4 Binding Site 5 Binding Site 6 Binding Site 7 Binding Site 8 Binding Site 9 Binding Site 10 Binding Site 11 Binding Site 12 Binding Site 13 Binding Site 14 Binding Site 15 Time
0 0 0 5.180549 4.713027 6.588768 5.615480 6.622356 7.665939 7.261710 3.359309 9.046471 4.861176 3.672477 8.362637 3.414401 6.126667 8.513744 8.631628 3.00
1 5.207631 4.389207 6.678460 5.471372 6.412016 8.460323 7.108385 3.279337 8.903983 4.817450 3.687601 8.675516 3.511277 5.917765 9.195548 8.689010 3.02
2 5.548610 4.465664 5.872028 5.773362 6.685437 7.553239 6.973439 3.484172 9.458707 4.827971 3.402123 9.300999 3.251616 5.497607 9.364325 7.785925 3.04
3 5.607398 4.589885 6.560425 5.169632 6.339952 8.063607 7.462493 3.413392 9.532347 4.830526 3.462173 8.392797 3.873588 5.904237 8.515190 7.945544 3.06
4 5.024073 4.214376 6.399018 6.042726 6.304193 9.341050 7.435935 3.429047 9.438010 4.907884 3.322324 8.536524 3.349148 5.717446 9.065802 7.539258 3.08

Similar to other binding site analysis functions, compute_surface_area allows for selection of binding site for calculation via binding_site_id.

The calculation of surface area needs definition of the radii of atoms in the system. The function uses mdtraj’s atom radii and defines the radii of MARTINI 2 beads. New radii definition can be provided via radii.

Plots and save data

The class object LipidInteraction provides a couple of assisting functions to plot and save data. These figures plot various interaction metrics as a function of the protein residue index. These figures help to point out the contact hotspots.

[25]:
for item in ["Residence Time", "Duration", "Occupancy", "Lipid Count"]:
    li.plot(item=item) # plot values as a function of time
Creating new director: /Users/wlsong/PyLipID/PyLipID-dev/docs/tutorials/Interaction_CHOL/Figure_CHOL
../_images/tutorials_0-application-walk-through_54_1.png
../_images/tutorials_0-application-walk-through_54_2.png
../_images/tutorials_0-application-walk-through_54_3.png
../_images/tutorials_0-application-walk-through_54_4.png

PyLipID also plots interactions in logo figures generated by logomakek. This kind of figure can directly point out the amino acids that make the contacts. It could be helpful to understand the chemical nature of lipid interactions.

[26]:
li.plot_logo(item="Residence Time")
../_images/tutorials_0-application-walk-through_56_0.png

As mentioned above, the calculated binding site information can be saved in a text file. The order of the data can be controled by the sort_residue parameter:

[27]:
li.write_site_info(sort_residue="Residence Time") # write binding site information in a txt file

To assist understanding of the interaction from a structural perspective, PyLipID can also save the interaction information in the B factor column of the protein coordinate in the pdb format. The following example will save the ‘Residence Time’ of each residue in a pdb file “Coordinate_CHOL_Residence_Time.pdf” in the directory of “Coordinate_{lipid}”:

[28]:
li.save_coordinate(item="Residence Time") # write protein coordinate in pdb format with the b factor column
                                          # showing 'Residence Time' values
Creating new director: /Users/wlsong/PyLipID/PyLipID-dev/docs/tutorials/Interaction_CHOL/Coordinate_CHOL

The coordinate file can be viewed in such visualization softwares as VMD and PyMol, and the structure can be coloured based on B factors to obtain the following images:

f81f96183489458e89c49fbd186fb909

Shown above are the A2aR structure in the coarse-grained model. To make the secondary and teriary structure clearer, only the backbone beads were shown whereas the side beads were hidden. A ‘white to red’ color map was used to show the residence time for each residue in VMD. The interaction hotspots and their locations can be easily analyzed through such a presentation.