0-Application walk thourgh¶
In this notebook, we introduce the basics of PyLipID via a walk-through application. We keep the details to minimum and invite users to refer to other specific notebooks in which you can find more details, exercise and theory.
In this tutorial, we show the PyLipID workflow of checking the lipid interactions with a Class A GPCR, Adenosine A2a receptor from coarse-grained simulations with a complex membrane bilayer that contains 7 lipid species.
The simulation ensemble reported in the original paper contained 10 repeats, but here we took 2 of the repeats for demonstration and saved in two directories run1
and run2
, both of which contains a simulation trajectory protein_lipids.xtc
and a corresponding topology coordinate protein_lipids.gro
. The two simulation trajectories have the same topology but different initial configurations. The trajectories were saved with a 20 ns timestep. The trajectories are available for
download here.
In this tutorial, we will first check cholesterol interactions with receptor residues and then interactions with binding sites. We will also illustrate some of the plotting funtions and ways to save the interaction data in various formats that assist different analysis.
For those of you who are impatient, PyLipID documentation website has provided a no-brainer script that runs all PyLipID analysis in one run. The script can be downloaded here.
[1]:
%matplotlib inline
import pylipid
from pylipid.api import LipidInteraction
print(pylipid.__version__) # make sure pylipid version is later than 1.4
1.5.10
Load data¶
We start by initializing the main class LipidInteraction
.
Trajectories and topology coordinates of the two repeats were saved in the directory ../../tests/data
. Thus we first create a trajfile_list
and topfile_list
. Then, we load the trajfile_list
and topfile_list
to LipidInteraction
together with a couple of calculation settings, including what cutoffs to use, which lipid to check, which lipid atoms are used to define the interactions and what time unit the generated data use etc. You can also specify which directory the
generated data should be saved at. But here we will use the default value, i.e. the current working directory.
[2]:
trajfile_list = ["./traj_data/A2a/run1/protein_lipids.xtc", "./traj_data/A2a/run2/protein_lipids.xtc"]
topfile_list = ["./traj_data/A2a/run1/protein_lipids.gro", "./traj_data/A2a/run2/protein_lipids.gro"]
lipid = "CHOL"
cutoffs = [0.475, 0.8] # use of dual-cutoff
nprot = 1 # num. of protein copies in the system. if the simulation system has N copies of receptors,
# "nprot=N" will report interactions averaged from the N copies, but "nprot=1"
# will ask pylipid to report interaction data for each copy.
timeunit = 'us' # micro-second
save_dir = None # if None, pylipid data will be saved at current working directory.
# initialize
li = LipidInteraction(trajfile_list, topfile_list=topfile_list, cutoffs=cutoffs, lipid=lipid,
nprot=nprot, save_dir=save_dir)
Interaction with residues¶
PyLipID provides functions to calculate the durations, occupancy (percentage of frames in which lipids formed contacts) and lipid count (the average number of surrounding lipids) of lipid interactions with protein residues. In addition, PyLipID can calculate interaction residence time, which is calculated from 1/koff. The function collect_residue_contacts()
is required before any of the calculation of interactions with residues. This function will create a index of contacting lipids for
each residue at each frame.
[3]:
li.collect_residue_contacts()
COLLECT INTERACTIONS FROM TRAJECTORIES: 100%|██████████| 2/2 [00:26<00:00, 13.20s/it]
After this calculation, a pandas.DataFrame is created as an attribute, named dataset
, of the main class, which is accessible through li.dataset
. It currently contains only two columns, Residue
and Residue ID
, but will be updated each time after a calculation is carried out to the class object LipidInteraction
, i.e. what li
refers to here.
[4]:
print(li.dataset)
Residue Residue ID
0 1ILE 0
1 2MET 1
2 3GLY 2
3 4SER 3
4 5SER 4
.. ... ...
297 298ARG 297
298 299LYS 298
299 300ILE 299
300 301ILE 300
301 302ARG 301
[302 rows x 2 columns]
Durations, occupancy and lipid count¶
After collect_residue_contacts()
, we can calculate the lipid interactions in a couple of metrics, using different functions. Each function has a parameter of residue_id
which allows for calculation of specified residues via residue ID. Calculation will run for all residues if residue_id
is set to the default value of None
. Let’s check the interaction durations for all residues:
[5]:
durations = li.compute_residue_duration()
CALCULATE DURATION PER RESIDUE: 100%|██████████| 302/302 [00:00<00:00, 306.00it/s]
The returned durations
is a list with a length of the number of residues. The n-th list is the durations of lipid interactions with the n-th residue from the trajectories. For example, let’s check the durations of the first residue. It shows a list of two list, corresponding to durations from the two trajectories.
[6]:
print(len(li.residue_list), len(li.trajfile_list))
print(len(durations), len(durations[0]))
print(durations[0])
302 2
302 2
[[0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.06, 0.06, 0.08, 0.1, 0.12, 0.16], [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.04, 0.06, 0.1, 0.1, 0.12, 0.12]]
Alternatively, if we are only interested in checking the interactions with first residue, we can just do the calculation for this residue:
[7]:
durations_of_first_residue = li.compute_residue_duration(residue_id=0)
print(durations_of_first_residue)
CALCULATE DURATION PER RESIDUE: 100%|██████████| 1/1 [00:00<00:00, 198.17it/s]
[[0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.06, 0.06, 0.08, 0.1, 0.12, 0.16], [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.04, 0.06, 0.1, 0.1, 0.12, 0.12]]
Similary, we can check the occupancy and lipid count as follow, the returned data has the same structure as compute_residue_duration()
[8]:
occupancies = li.compute_residue_occupancy()
lipidcounts = li.compute_residue_lipidcount()
print(len(occupancies), len(occupancies[0]))
print(len(lipidcounts), len(lipidcounts[0]))
CALCULATE OCCUPANCY: 100%|██████████| 302/302 [00:00<00:00, 13443.71it/s]
CALCULATE RESIDUE LIPIDCOUNT: 100%|██████████| 302/302 [00:00<00:00, 10329.96it/s]
302 2
302 2
After each of the above calculation, the pandas.DataFrame that is stored as dataset
of the main class, is updated to include the average values for each residue. We can have a look at the dataset
by using head()
to see the columns have expanded to include “Duration”, “Occupancy”, “Lipid Count”:
[9]:
li.dataset.head(10) # show 10 lines
[9]:
Residue | Residue ID | Duration | Duration std | Occupancy | Occupancy std | Lipid Count | Lipid Count std | |
---|---|---|---|---|---|---|---|---|
0 | 1ILE | 0 | 0.048108 | 0.037333 | 8.366534 | 1.195219 | 1.000000 | 0.000000 |
1 | 2MET | 1 | 0.035000 | 0.016583 | 1.792829 | 0.996016 | 1.000000 | 0.000000 |
2 | 3GLY | 2 | 0.020000 | 0.020000 | 0.199203 | 0.199203 | 0.500000 | 0.500000 |
3 | 4SER | 3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
4 | 5SER | 4 | 0.080000 | 0.178232 | 7.569721 | 3.187251 | 1.000000 | 0.000000 |
5 | 6VAL | 5 | 0.072340 | 0.068486 | 12.749004 | 0.000000 | 1.015625 | 0.015625 |
6 | 7TYR | 6 | 0.099259 | 0.149664 | 31.274900 | 5.776892 | 1.021505 | 0.021505 |
7 | 8ILE | 7 | 0.187368 | 0.236128 | 6.374502 | 1.992032 | 1.000000 | 0.000000 |
8 | 9THR | 8 | 0.074085 | 0.104188 | 21.314741 | 3.784861 | 1.038600 | 0.006854 |
9 | 10VAL | 9 | 0.105246 | 0.146742 | 19.521912 | 0.398406 | 1.051250 | 0.011250 |
[10]:
print(li.dataset["Duration"])
0 0.048108
1 0.035000
2 0.020000
3 0.000000
4 0.080000
...
297 0.063750
298 0.010000
299 0.111765
300 0.086809
301 0.041538
Name: Duration, Length: 302, dtype: float64
Residence time and koff¶
Residence time provides useful insights into the dynamical behavior of the bound lipids, which due to their interaction with the protein, is no longer diffusive. Indeed, we often see both prolonged interactions and quick diffusive contacts of lipids at the protein surface. PyLipID therefore uses bi-exponentials for residence time calculation to account for the long and short decays of lipid relaxation. Residence time is calculated as 1 over koff, which is calculated from a normalized survival
function using the durations. plot_data
parameter allows for plotting of the koff data, which provides a more visually straight-forward way to check lipid interactions and understand the data.
[11]:
koffs, res_times = li.compute_residue_koff(plot_data=True, fig_close=True)
# fig_close=True will close all the figures. As it can generate substantial amount of figures
# (one for each residue), leaving all the figures open can consume considerable amount of memory.
# li.compute_residue_koff(residue_id=[10,15,25,35])
# li.compute_residue_koff(residue_id=10)
The koff figure looks like this:
The left panel plots the interactions durations in the sorted order and the right panel plots the normalised survival rates of these interactions (the purple doted line) and the bi-exponential curve fitted to the survival rates (red broken line). To check the sampling quality, the interaction durations are bootstrapped and the fitted curve of the bootstraped interaction durations are plotted in gray lines. Insufficient sampling of interactions would give large deviation to the bootstrapped koffs.
Again, the dataset
is updated to include information calculated from koff calculation. Added columns include the average value of each residue of “Residence Time”, “Koff”, “R Squared”, “Residence Time Bootstrap avg”, and “R Sqaured Bootstrap”. “R Sqaured” is the r_sqaured values of the curve fitting to the normarlised survival function. It indicates the quality of the curve fitting, hence indicates the reliability of the calculated koff and residence time. “Residence Time Bootstrap avg” and
“R Squared Bootstrap avg” are the averaged residence time and r_squared from bootstrapping respectively.
[12]:
li.dataset.head(10)
[12]:
Residue | Residue ID | Duration | Duration std | Occupancy | Occupancy std | Lipid Count | Lipid Count std | Koff | Residence Time | R Squared | Koff Bootstrap avg | R Squared Bootstrap avg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1ILE | 0 | 0.048108 | 0.037333 | 8.366534 | 1.195219 | 1.000000 | 0.000000 | 11.804224 | 0.084715 | 0.998837 | 17.188663 | 0.998392 |
1 | 2MET | 1 | 0.035000 | 0.016583 | 1.792829 | 0.996016 | 1.000000 | 0.000000 | 30.345855 | 0.032953 | 0.997823 | 32.201659 | 0.996950 |
2 | 3GLY | 2 | 0.020000 | 0.020000 | 0.199203 | 0.199203 | 0.500000 | 0.500000 | 29.872911 | 0.033475 | 0.975950 | 26.885620 | NaN |
3 | 4SER | 3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
4 | 5SER | 4 | 0.080000 | 0.178232 | 7.569721 | 3.187251 | 1.000000 | 0.000000 | 1.790196 | 0.558598 | 0.960367 | 7.793917 | 0.975108 |
5 | 6VAL | 5 | 0.072340 | 0.068486 | 12.749004 | 0.000000 | 1.015625 | 0.015625 | 13.032816 | 0.076729 | 0.999307 | 10.955595 | 0.998862 |
6 | 7TYR | 6 | 0.099259 | 0.149664 | 31.274900 | 5.776892 | 1.021505 | 0.021505 | 4.313793 | 0.231815 | 0.996524 | 4.922636 | 0.994342 |
7 | 8ILE | 7 | 0.187368 | 0.236128 | 6.374502 | 1.992032 | 1.000000 | 0.000000 | 2.335581 | 0.428159 | 0.992678 | 2.604414 | 0.993542 |
8 | 9THR | 8 | 0.074085 | 0.104188 | 21.314741 | 3.784861 | 1.038600 | 0.006854 | 11.245471 | 0.088925 | 0.924258 | 6.244023 | 0.992139 |
9 | 10VAL | 9 | 0.105246 | 0.146742 | 19.521912 | 0.398406 | 1.051250 | 0.011250 | 4.454339 | 0.224500 | 0.997426 | 3.898776 | 0.995095 |
Interaction with binding sites¶
Before calculation of interactions with binding sites, we need to calculate where the binding sites are. PyLipID calculates the lipid interaction correlations between the protein residues and finds clusters of residues that bind to the same lipid molecule at the same time. This is carried out by computing the community structures of the interaction network of the protein residues.
[13]:
node_list, modularity = li.compute_binding_nodes(threshold=4, print_data=False)
# threshold=4 decides that binding sites should contain at least 4 residues. This is particularly
# helpful when itneraction samplings are not sufficient, in which case false positively correlation
# is likely to happen among 2 or 3 residues.
#
# print_data=True will print the residues for each binding site. It can be quite verbose.
This function returned two data, first is the node_list
which is a list of binding site lists. Each binding site list contains the IDs of residues in that binding site. Modularity is the network modularity, which is taken from the Louvain’s method of calculating network community structures. Modularity is an indication of the quality of community structures, i.e. binding sites in our cases. The value of modularity
is between 1 and -1. The closer to 1, the more distinctive the community
structures, hence the more trust-worthy the calculated binding sites. Modularity is dependent on the quality of interaction network, which is ultimately dependent on such factors as the sampling of lipid interactions and choice of interaction cutoffs.
[14]:
print(node_list)
print()
print(modularity)
[[0, 1, 2, 4, 5, 8, 64], [6, 9, 10, 12, 13, 265, 268, 272, 279], [7, 11, 14, 15, 18, 53, 54, 57, 58, 61, 298], [16, 20, 282, 286, 293, 296, 297, 300], [19, 22, 23, 26, 30, 299, 301], [25, 28, 29, 40, 43, 44, 47, 48, 50, 51, 55, 80, 119, 122, 123, 126], [41, 90, 93, 94, 97, 98, 101, 112, 113, 121, 125, 128, 181, 182, 186], [59, 62, 67, 73, 77], [72, 76, 127, 130, 131, 132, 134, 135, 137, 138, 140, 173, 177], [100, 104, 105, 185, 189, 192, 193], [116, 117, 120, 124], [176, 179, 180, 183, 184, 187, 188, 239, 240, 243, 244, 251, 255, 256], [191, 194, 198, 230, 233, 236, 237], [195, 196, 199, 202, 203], [227, 231, 234, 235, 238, 242, 273, 276, 280, 283, 284, 287, 288], [241, 245, 248, 249, 252, 259, 262, 263, 266, 267, 269, 270]]
0.8311593806596966
Afer this calculation, the dataset
had a new column Binding Site ID
, which gives the residues belonging to the same binding site a binding site ID, whereas labels residues that do not belong to any binding site as ‘-1’. This binding site ID will be an invariable and be accessed through all the binding site calculation.
[15]:
li.dataset.head(10)
[15]:
Residue | Residue ID | Duration | Duration std | Occupancy | Occupancy std | Lipid Count | Lipid Count std | Koff | Residence Time | R Squared | Koff Bootstrap avg | R Squared Bootstrap avg | Binding Site ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1ILE | 0 | 0.048108 | 0.037333 | 8.366534 | 1.195219 | 1.000000 | 0.000000 | 11.804224 | 0.084715 | 0.998837 | 17.188663 | 0.998392 | 0 |
1 | 2MET | 1 | 0.035000 | 0.016583 | 1.792829 | 0.996016 | 1.000000 | 0.000000 | 30.345855 | 0.032953 | 0.997823 | 32.201659 | 0.996950 | 0 |
2 | 3GLY | 2 | 0.020000 | 0.020000 | 0.199203 | 0.199203 | 0.500000 | 0.500000 | 29.872911 | 0.033475 | 0.975950 | 26.885620 | NaN | 0 |
3 | 4SER | 3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1 |
4 | 5SER | 4 | 0.080000 | 0.178232 | 7.569721 | 3.187251 | 1.000000 | 0.000000 | 1.790196 | 0.558598 | 0.960367 | 7.793917 | 0.975108 | 0 |
5 | 6VAL | 5 | 0.072340 | 0.068486 | 12.749004 | 0.000000 | 1.015625 | 0.015625 | 13.032816 | 0.076729 | 0.999307 | 10.955595 | 0.998862 | 0 |
6 | 7TYR | 6 | 0.099259 | 0.149664 | 31.274900 | 5.776892 | 1.021505 | 0.021505 | 4.313793 | 0.231815 | 0.996524 | 4.922636 | 0.994342 | 1 |
7 | 8ILE | 7 | 0.187368 | 0.236128 | 6.374502 | 1.992032 | 1.000000 | 0.000000 | 2.335581 | 0.428159 | 0.992678 | 2.604414 | 0.993542 | 2 |
8 | 9THR | 8 | 0.074085 | 0.104188 | 21.314741 | 3.784861 | 1.038600 | 0.006854 | 11.245471 | 0.088925 | 0.924258 | 6.244023 | 0.992139 | 0 |
9 | 10VAL | 9 | 0.105246 | 0.146742 | 19.521912 | 0.398406 | 1.051250 | 0.011250 | 4.454339 | 0.224500 | 0.997426 | 3.898776 | 0.995095 | 1 |
Therefore, to check residues of a binding site and look at their interaction properties, we can use this Binding Site ID
:
[16]:
mask = (li.dataset["Binding Site ID"] == 0) # check binding site NO. 0
li.dataset[mask]
[16]:
Residue | Residue ID | Duration | Duration std | Occupancy | Occupancy std | Lipid Count | Lipid Count std | Koff | Residence Time | R Squared | Koff Bootstrap avg | R Squared Bootstrap avg | Binding Site ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1ILE | 0 | 0.048108 | 0.037333 | 8.366534 | 1.195219 | 1.000000 | 0.000000 | 11.804224 | 0.084715 | 0.998837 | 17.188663 | 0.998392 | 0 |
1 | 2MET | 1 | 0.035000 | 0.016583 | 1.792829 | 0.996016 | 1.000000 | 0.000000 | 30.345855 | 0.032953 | 0.997823 | 32.201659 | 0.996950 | 0 |
2 | 3GLY | 2 | 0.020000 | 0.020000 | 0.199203 | 0.199203 | 0.500000 | 0.500000 | 29.872911 | 0.033475 | 0.975950 | 26.885620 | NaN | 0 |
4 | 5SER | 4 | 0.080000 | 0.178232 | 7.569721 | 3.187251 | 1.000000 | 0.000000 | 1.790196 | 0.558598 | 0.960367 | 7.793917 | 0.975108 | 0 |
5 | 6VAL | 5 | 0.072340 | 0.068486 | 12.749004 | 0.000000 | 1.015625 | 0.015625 | 13.032816 | 0.076729 | 0.999307 | 10.955595 | 0.998862 | 0 |
8 | 9THR | 8 | 0.074085 | 0.104188 | 21.314741 | 3.784861 | 1.038600 | 0.006854 | 11.245471 | 0.088925 | 0.924258 | 6.244023 | 0.992139 | 0 |
64 | 65SER | 64 | 0.180000 | 0.180000 | 0.199203 | 0.199203 | 0.500000 | 0.500000 | 3.454593 | 0.289470 | 0.973915 | 2.418215 | NaN | 0 |
Durations, occupancy and lipid count¶
Similar to interaction with residues, interactions with binding sites can also be checked in a couple of metrics. The same syntax is used.
[17]:
durations = li.compute_site_duration()
occupancies = li.compute_site_occupancy()
lipidcounts = li.compute_site_lipidcount()
print(len(node_list), len(li.trajfile_list))
print(len(durations), len(durations[0]))
print(len(occupancies), len(occupancies[0]))
print(len(lipidcounts), len(lipidcounts[0]))
CALCULATE DURATION PER BINDING SITE: 100%|██████████| 16/16 [00:00<00:00, 49.09it/s]
CALCULATE OCCUPANCY PER BINDING SITE: 100%|██████████| 16/16 [00:00<00:00, 134.74it/s]
CALCULATE LIPIDCOUNT PER BINDING SITE: 100%|██████████| 16/16 [00:00<00:00, 132.91it/s]
16 2
16 2
16 2
16 2
Similar to the calculation for residues, the returned list contains calculated values for each binding site from the two trajectories.
Residence Time and koff¶
Similar to that for residue, residence time and koff can be calculated for binding sites, via the function compute_site_koff
[18]:
koff_BS, res_time_BS = li.compute_site_koff(plot_data=True)
The returned values are koff and residence time for each binding site. plot_data=True
will generate the koff figure shown above for each binding site. Now the dataset
is expanded to include information for binding site residence time and koff. Residues from the same binding site will record the same binding site interaction data:
[19]:
li.dataset[li.dataset["Binding Site ID"] == 2]
[19]:
Residue | Residue ID | Duration | Duration std | Occupancy | Occupancy std | Lipid Count | Lipid Count std | Koff | Residence Time | ... | R Squared Bootstrap avg | Binding Site ID | Binding Site Duration | Binding Site Occupancy | Binding Site Lipid Count | Binding Site Koff | Binding Site Koff Bootstrap avg | Binding Site Residence Time | Binding Site R Squared | Binding Site R Squared Bootstrap avg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 8ILE | 7 | 0.187368 | 0.236128 | 6.374502 | 1.992032 | 1.000000 | 0.000000 | 2.335581 | 0.428159 | ... | 0.993542 | 2 | 0.160196 | 73.904382 | 1.26079 | 2.358624 | 1.492173 | 0.423976 | 0.998354 | 0.995259 |
11 | 12LEU | 11 | 0.257778 | 0.351371 | 23.306773 | 1.792829 | 1.059524 | 0.003968 | 2.295437 | 0.435647 | ... | 0.995768 | 2 | 0.160196 | 73.904382 | 1.26079 | 2.358624 | 1.492173 | 0.423976 | 0.998354 | 0.995259 |
14 | 15ALA | 14 | 0.223333 | 0.182361 | 1.195219 | 0.000000 | 1.000000 | 0.000000 | 3.246453 | 0.308028 | ... | 0.985841 | 2 | 0.160196 | 73.904382 | 1.26079 | 2.358624 | 1.492173 | 0.423976 | 0.998354 | 0.995259 |
15 | 16VAL | 15 | 0.128621 | 0.182931 | 19.920319 | 1.593625 | 1.040258 | 0.003221 | 1.211126 | 0.825678 | ... | 0.997457 | 2 | 0.160196 | 73.904382 | 1.26079 | 2.358624 | 1.492173 | 0.423976 | 0.998354 | 0.995259 |
18 | 19ILE | 18 | 0.172105 | 0.172141 | 12.350598 | 1.992032 | 1.038462 | 0.038462 | 3.809314 | 0.262514 | ... | 0.996998 | 2 | 0.160196 | 73.904382 | 1.26079 | 2.358624 | 1.492173 | 0.423976 | 0.998354 | 0.995259 |
53 | 54GLY | 53 | 0.200000 | 0.161864 | 0.796813 | 0.796813 | 0.500000 | 0.500000 | 3.723346 | 0.268576 | ... | 0.984969 | 2 | 0.160196 | 73.904382 | 1.26079 | 2.358624 | 1.492173 | 0.423976 | 0.998354 | 0.995259 |
54 | 55VAL | 54 | 0.184444 | 0.235132 | 22.310757 | 1.992032 | 1.026197 | 0.006590 | 2.649627 | 0.377412 | ... | 0.997034 | 2 | 0.160196 | 73.904382 | 1.26079 | 2.358624 | 1.492173 | 0.423976 | 0.998354 | 0.995259 |
57 | 58ILE | 57 | 0.383158 | 0.329674 | 24.302789 | 1.992032 | 1.008929 | 0.008929 | 1.316120 | 0.759809 | ... | 0.995668 | 2 | 0.160196 | 73.904382 | 1.26079 | 2.358624 | 1.492173 | 0.423976 | 0.998354 | 0.995259 |
58 | 59PRO | 58 | 0.296429 | 0.301996 | 16.932271 | 2.589641 | 1.037982 | 0.017574 | 2.093627 | 0.477640 | ... | 0.994831 | 2 | 0.160196 | 73.904382 | 1.26079 | 2.358624 | 1.492173 | 0.423976 | 0.998354 | 0.995259 |
61 | 62ILE | 61 | 0.212941 | 0.269635 | 6.772908 | 1.992032 | 1.000000 | 0.000000 | 2.067640 | 0.483643 | ... | 0.991763 | 2 | 0.160196 | 73.904382 | 1.26079 | 2.358624 | 1.492173 | 0.423976 | 0.998354 | 0.995259 |
298 | 299LYS | 298 | 0.010000 | 0.010000 | 0.199203 | 0.199203 | 0.500000 | 0.500000 | 0.000000 | 0.000000 | ... | NaN | 2 | 0.160196 | 73.904382 | 1.26079 | 2.358624 | 1.492173 | 0.423976 | 0.998354 | 0.995259 |
11 rows × 22 columns
LipidInteraction
also provides a assisting function binding_site
to show binding site information and show some stats and returns the pandas.DataFrame containing residues of the specified binding site.
[20]:
df = li.binding_site(binding_site_id=3) # check the binding site No 3
df.head(10)
# Binding site 3
Binding Site Residence Time: 0.402 us
Binding Site Koff: 2.490 R squared: 0.9984
Binding Site Duration: 0.084 us
Binding Site Occupancy: 36.056 %
Binding Site Lipid Count: 1.087
Pos. Charge: 2/8
Neg. Charge: 0/8
Polar: 0/8
Hydrophobic: 4/8
Special: 2/8
Residue Res ID Res. Time (us) Duration (us) Occupancy (%) Lipid Count Koff R Squared
297PHE 296 0.322 0.098 17.131 1.008 3.108 0.998
301ILE 300 0.108 0.087 13.745 1.000 9.263 0.999
21GLY 20 0.103 0.077 3.984 1.000 9.719 0.990
298ARG 297 0.095 0.064 3.386 1.000 10.535 0.997
17LEU 16 0.052 0.057 13.147 1.014 19.296 0.999
287ALA 286 0.047 0.030 0.797 0.500 21.240 0.993
294ARG 293 0.033 0.025 0.598 0.500 30.652 0.996
283PRO 282 0.030 0.032 0.996 1.000 32.883 0.994
[20]:
Residue | Residue ID | Duration | Duration std | Occupancy | Occupancy std | Lipid Count | Lipid Count std | Koff | Residence Time | ... | R Squared Bootstrap avg | Binding Site ID | Binding Site Duration | Binding Site Occupancy | Binding Site Lipid Count | Binding Site Koff | Binding Site Koff Bootstrap avg | Binding Site Residence Time | Binding Site R Squared | Binding Site R Squared Bootstrap avg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
282 | 283PRO | 282 | 0.032000 | 0.024000 | 0.996016 | 0.597610 | 1.000000 | 0.000000 | 32.882928 | 0.030411 | ... | 0.693587 | 3 | 0.08427 | 36.055777 | 1.086764 | 2.489836 | 6.14596 | 0.401633 | 0.998424 | 0.998174 |
293 | 294ARG | 293 | 0.025000 | 0.021794 | 0.597610 | 0.597610 | 0.500000 | 0.500000 | 30.652487 | 0.032624 | ... | 0.794795 | 3 | 0.08427 | 36.055777 | 1.086764 | 2.489836 | 6.14596 | 0.401633 | 0.998424 | 0.998174 |
286 | 287ALA | 286 | 0.030000 | 0.030000 | 0.796813 | 0.796813 | 0.500000 | 0.500000 | 21.240404 | 0.047080 | ... | 0.492717 | 3 | 0.08427 | 36.055777 | 1.086764 | 2.489836 | 6.14596 | 0.401633 | 0.998424 | 0.998174 |
16 | 17LEU | 16 | 0.056923 | 0.047940 | 13.147410 | 0.796813 | 1.014286 | 0.014286 | 19.295543 | 0.051825 | ... | 0.999266 | 3 | 0.08427 | 36.055777 | 1.086764 | 2.489836 | 6.14596 | 0.401633 | 0.998424 | 0.998174 |
297 | 298ARG | 297 | 0.063750 | 0.060917 | 3.386454 | 0.199203 | 1.000000 | 0.000000 | 10.535137 | 0.094920 | ... | 0.994440 | 3 | 0.08427 | 36.055777 | 1.086764 | 2.489836 | 6.14596 | 0.401633 | 0.998424 | 0.998174 |
20 | 21GLY | 20 | 0.077143 | 0.090351 | 3.984064 | 0.796813 | 1.000000 | 0.000000 | 9.719108 | 0.102890 | ... | 0.991340 | 3 | 0.08427 | 36.055777 | 1.086764 | 2.489836 | 6.14596 | 0.401633 | 0.998424 | 0.998174 |
300 | 301ILE | 300 | 0.086809 | 0.071014 | 13.745020 | 2.589641 | 1.000000 | 0.000000 | 9.262659 | 0.107960 | ... | 0.998140 | 3 | 0.08427 | 36.055777 | 1.086764 | 2.489836 | 6.14596 | 0.401633 | 0.998424 | 0.998174 |
296 | 297PHE | 296 | 0.098095 | 0.099696 | 17.131474 | 6.772908 | 1.008333 | 0.008333 | 3.107937 | 0.321757 | ... | 0.997092 | 3 | 0.08427 | 36.055777 | 1.086764 | 2.489836 | 6.14596 | 0.401633 | 0.998424 | 0.998174 |
8 rows × 22 columns
Analyze bound poses¶
Once the binding sites are defined, we can proceed to analyze bound poses, e.g. check the most representative bound poses for binding sites and check bound pose clusters for binding sites etc.
[21]:
pose_pool, pose_rmsd = li.analyze_bound_poses(binding_site_id=None, pose_format="pdb",
n_top_poses=3, n_clusters='auto') # None means analysis for all residues
analyze_bound_poses()
will rate the bound poses for each binding site based on probability density calculated from the simulation trajectories and write out a couple of top rated poses (controled by n_poses
). It will also cluster the bound poses for each binding site, for which the number of clusters can be determined by a density based cluster DBSCAN
or provided by users via the parameter n_clusters
. So for each binding site, it will generate two directories (if n_clusters
is not 0
), one for top-rated poses (in the directory of ‘BSid{num}_rank’) and one for all the clustered poses (in the directory of ‘BSid{num}_clusters’). All these directories are then put under the “Bound_Poses_{lipid}” directory if the parameter save_dir
is None, i.e. not given. An overview of the directories organisation is shown below:
The written poses store the lipid pose and the recetpor conformation that this pose was bound to. By default, the pose files are saved in the Gromacs ‘gro’ format but can be changed to, e.g. ‘pdb’ format by the parameter pose_format
. For the top-ranked poses, the pose named ‘top1’ is the one with highest scores, i.e. the most representative bound pose from the simulations; and ‘top2’ is the second, and so on. For coarse-grained simulations, these top-ranked poses can be quite similar to each
other due to the smoothened potentials. These poses can be checked by such molecular visualization softwares as VMD and PyMol etc. Below shows the top rated bound cholesterol pose at binding site No 3 of A2a. This cholesterol binding site have been seen in many of the Class A GPCR x-ray structures:
The lipid bound poses for each binding site will be returned. The RMSD of bound poses to the binding site average will be returned and plotted in a violin plot.
Similar to other analysis functions, analyze_bound_poses
can also select a couple of binding sites for analysis via binding_site_id
.
The returned pose_pool
is a dict
object that stores bound poses for binding sites, with the binding site id as the dictionary keys and the bound pose coordinates as the corresponding values. pose_rmsd
is a pandas.DataFrame that stores RMSD for the bound poses (in unit of nm).
[22]:
pose_rmsd.head()
[22]:
Binding Site 0 | Binding Site 1 | Binding Site 2 | Binding Site 3 | Binding Site 4 | Binding Site 5 | Binding Site 6 | Binding Site 7 | Binding Site 8 | Binding Site 9 | Binding Site 10 | Binding Site 11 | Binding Site 12 | Binding Site 13 | Binding Site 14 | Binding Site 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.223008 | 0.192302 | 0.288550 | 0.487145 | 0.325898 | 0.216980 | 0.417243 | 0.598373 | 0.303224 | 0.260776 | 0.174348 | 0.681689 | 0.269166 | 0.122740 | 0.362086 | 0.569712 |
1 | 0.214176 | 0.374907 | 0.317757 | 1.028707 | 0.492244 | 0.234929 | 0.465488 | 0.313493 | 0.312007 | 0.245380 | 0.189006 | 0.454326 | 0.267336 | 0.193313 | 0.836323 | 0.692334 |
2 | 0.279867 | 0.690856 | 0.540312 | 0.383474 | 0.372813 | 0.216115 | 0.463774 | 0.152496 | 0.780332 | 0.550248 | 0.127422 | 0.375738 | 0.265146 | 0.129976 | 0.524896 | 0.978642 |
3 | 0.199653 | 0.377980 | 0.362978 | 0.379433 | 0.406318 | 0.261471 | 0.273237 | 0.332318 | 0.325981 | 0.206775 | 0.196669 | 0.272414 | 0.179198 | 0.272044 | 0.273617 | 0.529334 |
4 | 0.288850 | 0.348632 | 0.273372 | 0.251446 | 0.244760 | 0.240792 | 0.359739 | 0.161057 | 0.385767 | 0.587389 | 0.326057 | 0.516921 | 0.297746 | 0.280991 | 0.701244 | 0.343846 |
Calculate surface area¶
We can also calculate the surface area for binding site, obtaining the values as a function of time. print_data=True
generates a timeseries plot and a violinplot.
[23]:
surface_area = li.compute_surface_area(plot_data=True)
The returned surface_area
is pandas.DataFrame object that stores the time evolution of the surface area for each binding site. The reported values are in the unit of nm^2.
[24]:
surface_area.head()
[24]:
Binding Site 0 | Binding Site 1 | Binding Site 2 | Binding Site 3 | Binding Site 4 | Binding Site 5 | Binding Site 6 | Binding Site 7 | Binding Site 8 | Binding Site 9 | Binding Site 10 | Binding Site 11 | Binding Site 12 | Binding Site 13 | Binding Site 14 | Binding Site 15 | Time | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 5.180549 | 4.713027 | 6.588768 | 5.615480 | 6.622356 | 7.665939 | 7.261710 | 3.359309 | 9.046471 | 4.861176 | 3.672477 | 8.362637 | 3.414401 | 6.126667 | 8.513744 | 8.631628 | 3.00 |
1 | 5.207631 | 4.389207 | 6.678460 | 5.471372 | 6.412016 | 8.460323 | 7.108385 | 3.279337 | 8.903983 | 4.817450 | 3.687601 | 8.675516 | 3.511277 | 5.917765 | 9.195548 | 8.689010 | 3.02 | ||
2 | 5.548610 | 4.465664 | 5.872028 | 5.773362 | 6.685437 | 7.553239 | 6.973439 | 3.484172 | 9.458707 | 4.827971 | 3.402123 | 9.300999 | 3.251616 | 5.497607 | 9.364325 | 7.785925 | 3.04 | ||
3 | 5.607398 | 4.589885 | 6.560425 | 5.169632 | 6.339952 | 8.063607 | 7.462493 | 3.413392 | 9.532347 | 4.830526 | 3.462173 | 8.392797 | 3.873588 | 5.904237 | 8.515190 | 7.945544 | 3.06 | ||
4 | 5.024073 | 4.214376 | 6.399018 | 6.042726 | 6.304193 | 9.341050 | 7.435935 | 3.429047 | 9.438010 | 4.907884 | 3.322324 | 8.536524 | 3.349148 | 5.717446 | 9.065802 | 7.539258 | 3.08 |
Similar to other binding site analysis functions, compute_surface_area
allows for selection of binding site for calculation via binding_site_id
.
The calculation of surface area needs definition of the radii of atoms in the system. The function uses mdtraj
’s atom radii and defines the radii of MARTINI 2 beads. New radii definition can be provided via radii
.
Plots and save data¶
The class object LipidInteraction
provides a couple of assisting functions to plot and save data. These figures plot various interaction metrics as a function of the protein residue index. These figures help to point out the contact hotspots.
[25]:
for item in ["Residence Time", "Duration", "Occupancy", "Lipid Count"]:
li.plot(item=item) # plot values as a function of time
Creating new director: /Users/wlsong/PyLipID/PyLipID-dev/docs/tutorials/Interaction_CHOL/Figure_CHOL
PyLipID also plots interactions in logo figures generated by logomakek. This kind of figure can directly point out the amino acids that make the contacts. It could be helpful to understand the chemical nature of lipid interactions.
[26]:
li.plot_logo(item="Residence Time")
As mentioned above, the calculated binding site information can be saved in a text file. The order of the data can be controled by the sort_residue
parameter:
[27]:
li.write_site_info(sort_residue="Residence Time") # write binding site information in a txt file
To assist understanding of the interaction from a structural perspective, PyLipID can also save the interaction information in the B factor column of the protein coordinate in the pdb format. The following example will save the ‘Residence Time’ of each residue in a pdb file “Coordinate_CHOL_Residence_Time.pdf” in the directory of “Coordinate_{lipid}”:
[28]:
li.save_coordinate(item="Residence Time") # write protein coordinate in pdb format with the b factor column
# showing 'Residence Time' values
Creating new director: /Users/wlsong/PyLipID/PyLipID-dev/docs/tutorials/Interaction_CHOL/Coordinate_CHOL
The coordinate file can be viewed in such visualization softwares as VMD and PyMol, and the structure can be coloured based on B factors to obtain the following images:
Shown above are the A2aR structure in the coarse-grained model. To make the secondary and teriary structure clearer, only the backbone beads were shown whereas the side beads were hidden. A ‘white to red’ color map was used to show the residence time for each residue in VMD. The interaction hotspots and their locations can be easily analyzed through such a presentation.