0-Application walk thourgh¶

In this notebook, we introduce the basics of PyLipID via a walk-through application. We keep the details to minimum and invite users to refer to other specific notebooks in which you can find more details, exercise and theory.

In this tutorial, we show the PyLipID workflow of checking the lipid interactions with a Class A GPCR, Adenosine A2a receptor from coarse-grained simulations with a complex membrane bilayer that contains 7 lipid species.

807072de8a7a421cb9fa526f7f778144

The simulation ensemble reported in the original paper contained 10 repeats, but here we took 2 of the repeats for demonstration and saved in two directories run1 and run2, both of which contains a simulation trajectory protein_lipids.xtc and a corresponding topology coordinate protein_lipids.gro. The two simulation trajectories have the same topology but different initial configurations. The trajectories were saved with a 20 ns timestep. The trajectories are available for download here.

In this tutorial, we will first check cholesterol interactions with receptor residues and then interactions with binding sites. We will also illustrate some of the plotting funtions and ways to save the interaction data in various formats that assist different analysis.

For those of you who are impatient, PyLipID documentation website has provided a no-brainer script that runs all PyLipID analysis in one run. The script can be downloaded here.

[1]:

%matplotlib inline
import pylipid
from pylipid.api import LipidInteraction
print(pylipid.__version__) # make sure pylipid version is later than 1.4

1.5.10

Load data¶

We start by initializing the main class LipidInteraction.

Trajectories and topology coordinates of the two repeats were saved in the directory ../../tests/data. Thus we first create a trajfile_list and topfile_list. Then, we load the trajfile_list and topfile_list to LipidInteraction together with a couple of calculation settings, including what cutoffs to use, which lipid to check, which lipid atoms are used to define the interactions and what time unit the generated data use etc. You can also specify which directory the generated data should be saved at. But here we will use the default value, i.e. the current working directory.

[2]:

trajfile_list = ["./traj_data/A2a/run1/protein_lipids.xtc", "./traj_data/A2a/run2/protein_lipids.xtc"]
topfile_list = ["./traj_data/A2a/run1/protein_lipids.gro", "./traj_data/A2a/run2/protein_lipids.gro"]
lipid = "CHOL"
cutoffs = [0.475, 0.8] # use of dual-cutoff
nprot = 1  # num. of protein copies in the system. if the simulation system has N copies of receptors,
           # "nprot=N" will report interactions averaged from the N copies, but "nprot=1"
           # will ask pylipid to report interaction data for each copy.
timeunit = 'us' # micro-second
save_dir = None  # if None, pylipid data will be saved at current working directory.

# initialize
li = LipidInteraction(trajfile_list, topfile_list=topfile_list, cutoffs=cutoffs, lipid=lipid,
                      nprot=nprot, save_dir=save_dir)

Interaction with residues¶

PyLipID provides functions to calculate the durations, occupancy (percentage of frames in which lipids formed contacts) and lipid count (the average number of surrounding lipids) of lipid interactions with protein residues. In addition, PyLipID can calculate interaction residence time, which is calculated from 1/koff. The function collect_residue_contacts() is required before any of the calculation of interactions with residues. This function will create a index of contacting lipids for each residue at each frame.

[3]:

li.collect_residue_contacts()

COLLECT INTERACTIONS FROM TRAJECTORIES: 100%|██████████| 2/2 [00:26<00:00, 13.20s/it]

After this calculation, a pandas.DataFrame is created as an attribute, named dataset, of the main class, which is accessible through li.dataset. It currently contains only two columns, Residue and Residue ID, but will be updated each time after a calculation is carried out to the class object LipidInteraction, i.e. what li refers to here.

[4]:

print(li.dataset)

    Residue  Residue ID
0      1ILE           0
1      2MET           1
2      3GLY           2
3      4SER           3
4      5SER           4
..      ...         ...
297  298ARG         297
298  299LYS         298
299  300ILE         299
300  301ILE         300
301  302ARG         301

[302 rows x 2 columns]

Durations, occupancy and lipid count¶

After collect_residue_contacts(), we can calculate the lipid interactions in a couple of metrics, using different functions. Each function has a parameter of residue_id which allows for calculation of specified residues via residue ID. Calculation will run for all residues if residue_id is set to the default value of None. Let’s check the interaction durations for all residues:

[5]:

durations = li.compute_residue_duration()

CALCULATE DURATION PER RESIDUE: 100%|██████████| 302/302 [00:00<00:00, 306.00it/s]

The returned durations is a list with a length of the number of residues. The n-th list is the durations of lipid interactions with the n-th residue from the trajectories. For example, let’s check the durations of the first residue. It shows a list of two list, corresponding to durations from the two trajectories.

[6]:

print(len(li.residue_list), len(li.trajfile_list))
print(len(durations), len(durations[0]))
print(durations[0])

302 2
302 2
[[0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.06, 0.06, 0.08, 0.1, 0.12, 0.16], [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.04, 0.06, 0.1, 0.1, 0.12, 0.12]]

Alternatively, if we are only interested in checking the interactions with first residue, we can just do the calculation for this residue:

[7]:

durations_of_first_residue = li.compute_residue_duration(residue_id=0)
print(durations_of_first_residue)

CALCULATE DURATION PER RESIDUE: 100%|██████████| 1/1 [00:00<00:00, 198.17it/s]

[[0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.06, 0.06, 0.08, 0.1, 0.12, 0.16], [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04, 0.04, 0.04, 0.06, 0.1, 0.1, 0.12, 0.12]]

Similary, we can check the occupancy and lipid count as follow, the returned data has the same structure as compute_residue_duration()

[8]:

occupancies = li.compute_residue_occupancy()
lipidcounts = li.compute_residue_lipidcount()
print(len(occupancies), len(occupancies[0]))
print(len(lipidcounts), len(lipidcounts[0]))

CALCULATE OCCUPANCY: 100%|██████████| 302/302 [00:00<00:00, 13443.71it/s]
CALCULATE RESIDUE LIPIDCOUNT: 100%|██████████| 302/302 [00:00<00:00, 10329.96it/s]

302 2
302 2

After each of the above calculation, the pandas.DataFrame that is stored as dataset of the main class, is updated to include the average values for each residue. We can have a look at the dataset by using head() to see the columns have expanded to include “Duration”, “Occupancy”, “Lipid Count”:

[9]:

li.dataset.head(10) # show 10 lines

[9]:

	Residue	Residue ID	Duration	Duration std	Occupancy	Occupancy std	Lipid Count	Lipid Count std
0	1ILE	0	0.048108	0.037333	8.366534	1.195219	1.000000	0.000000
1	2MET	1	0.035000	0.016583	1.792829	0.996016	1.000000	0.000000
2	3GLY	2	0.020000	0.020000	0.199203	0.199203	0.500000	0.500000
3	4SER	3	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
4	5SER	4	0.080000	0.178232	7.569721	3.187251	1.000000	0.000000
5	6VAL	5	0.072340	0.068486	12.749004	0.000000	1.015625	0.015625
6	7TYR	6	0.099259	0.149664	31.274900	5.776892	1.021505	0.021505
7	8ILE	7	0.187368	0.236128	6.374502	1.992032	1.000000	0.000000
8	9THR	8	0.074085	0.104188	21.314741	3.784861	1.038600	0.006854
9	10VAL	9	0.105246	0.146742	19.521912	0.398406	1.051250	0.011250

[10]:

print(li.dataset["Duration"])

0      0.048108
1      0.035000
2      0.020000
3      0.000000
4      0.080000
         ...
297    0.063750
298    0.010000
299    0.111765
300    0.086809
301    0.041538
Name: Duration, Length: 302, dtype: float64

Residence time and koff¶

Residence time provides useful insights into the dynamical behavior of the bound lipids, which due to their interaction with the protein, is no longer diffusive. Indeed, we often see both prolonged interactions and quick diffusive contacts of lipids at the protein surface. PyLipID therefore uses bi-exponentials for residence time calculation to account for the long and short decays of lipid relaxation. Residence time is calculated as 1 over koff, which is calculated from a normalized survival function using the durations. plot_data parameter allows for plotting of the koff data, which provides a more visually straight-forward way to check lipid interactions and understand the data.

[11]:

koffs, res_times = li.compute_residue_koff(plot_data=True, fig_close=True)
# fig_close=True will close all the figures. As it can generate substantial amount of figures
# (one for each residue), leaving all the figures open can consume considerable amount of memory.

# li.compute_residue_koff(residue_id=[10,15,25,35])
# li.compute_residue_koff(residue_id=10)

The koff figure looks like this:

0cf2ce9fe8cc4c1fb1a9770ee23c1ded

The left panel plots the interactions durations in the sorted order and the right panel plots the normalised survival rates of these interactions (the purple doted line) and the bi-exponential curve fitted to the survival rates (red broken line). To check the sampling quality, the interaction durations are bootstrapped and the fitted curve of the bootstraped interaction durations are plotted in gray lines. Insufficient sampling of interactions would give large deviation to the bootstrapped koffs.

Again, the dataset is updated to include information calculated from koff calculation. Added columns include the average value of each residue of “Residence Time”, “Koff”, “R Squared”, “Residence Time Bootstrap avg”, and “R Sqaured Bootstrap”. “R Sqaured” is the r_sqaured values of the curve fitting to the normarlised survival function. It indicates the quality of the curve fitting, hence indicates the reliability of the calculated koff and residence time. “Residence Time Bootstrap avg” and “R Squared Bootstrap avg” are the averaged residence time and r_squared from bootstrapping respectively.

[12]:

li.dataset.head(10)

[12]:

	Residue	Residue ID	Duration	Duration std	Occupancy	Occupancy std	Lipid Count	Lipid Count std	Koff	Residence Time	R Squared	Koff Bootstrap avg	R Squared Bootstrap avg
0	1ILE	0	0.048108	0.037333	8.366534	1.195219	1.000000	0.000000	11.804224	0.084715	0.998837	17.188663	0.998392
1	2MET	1	0.035000	0.016583	1.792829	0.996016	1.000000	0.000000	30.345855	0.032953	0.997823	32.201659	0.996950
2	3GLY	2	0.020000	0.020000	0.199203	0.199203	0.500000	0.500000	29.872911	0.033475	0.975950	26.885620	NaN
3	4SER	3	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
4	5SER	4	0.080000	0.178232	7.569721	3.187251	1.000000	0.000000	1.790196	0.558598	0.960367	7.793917	0.975108
5	6VAL	5	0.072340	0.068486	12.749004	0.000000	1.015625	0.015625	13.032816	0.076729	0.999307	10.955595	0.998862
6	7TYR	6	0.099259	0.149664	31.274900	5.776892	1.021505	0.021505	4.313793	0.231815	0.996524	4.922636	0.994342
7	8ILE	7	0.187368	0.236128	6.374502	1.992032	1.000000	0.000000	2.335581	0.428159	0.992678	2.604414	0.993542
8	9THR	8	0.074085	0.104188	21.314741	3.784861	1.038600	0.006854	11.245471	0.088925	0.924258	6.244023	0.992139
9	10VAL	9	0.105246	0.146742	19.521912	0.398406	1.051250	0.011250	4.454339	0.224500	0.997426	3.898776	0.995095

Interaction with binding sites¶

Before calculation of interactions with binding sites, we need to calculate where the binding sites are. PyLipID calculates the lipid interaction correlations between the protein residues and finds clusters of residues that bind to the same lipid molecule at the same time. This is carried out by computing the community structures of the interaction network of the protein residues.

[13]:

node_list, modularity = li.compute_binding_nodes(threshold=4, print_data=False)

# threshold=4 decides that binding sites should contain at least 4 residues. This is particularly
# helpful when itneraction samplings are not sufficient, in which case false positively correlation
# is likely to happen among 2 or 3 residues.
#
# print_data=True will print the residues for each binding site. It can be quite verbose.

This function returned two data, first is the node_list which is a list of binding site lists. Each binding site list contains the IDs of residues in that binding site. Modularity is the network modularity, which is taken from the Louvain’s method of calculating network community structures. Modularity is an indication of the quality of community structures, i.e. binding sites in our cases. The value of modularity is between 1 and -1. The closer to 1, the more distinctive the community structures, hence the more trust-worthy the calculated binding sites. Modularity is dependent on the quality of interaction network, which is ultimately dependent on such factors as the sampling of lipid interactions and choice of interaction cutoffs.

[14]:

print(node_list)
print()
print(modularity)

[[0, 1, 2, 4, 5, 8, 64], [6, 9, 10, 12, 13, 265, 268, 272, 279], [7, 11, 14, 15, 18, 53, 54, 57, 58, 61, 298], [16, 20, 282, 286, 293, 296, 297, 300], [19, 22, 23, 26, 30, 299, 301], [25, 28, 29, 40, 43, 44, 47, 48, 50, 51, 55, 80, 119, 122, 123, 126], [41, 90, 93, 94, 97, 98, 101, 112, 113, 121, 125, 128, 181, 182, 186], [59, 62, 67, 73, 77], [72, 76, 127, 130, 131, 132, 134, 135, 137, 138, 140, 173, 177], [100, 104, 105, 185, 189, 192, 193], [116, 117, 120, 124], [176, 179, 180, 183, 184, 187, 188, 239, 240, 243, 244, 251, 255, 256], [191, 194, 198, 230, 233, 236, 237], [195, 196, 199, 202, 203], [227, 231, 234, 235, 238, 242, 273, 276, 280, 283, 284, 287, 288], [241, 245, 248, 249, 252, 259, 262, 263, 266, 267, 269, 270]]

0.8311593806596966

Afer this calculation, the dataset had a new column Binding Site ID, which gives the residues belonging to the same binding site a binding site ID, whereas labels residues that do not belong to any binding site as ‘-1’. This binding site ID will be an invariable and be accessed through all the binding site calculation.

[15]:

li.dataset.head(10)

[15]:

	Residue	Residue ID	Duration	Duration std	Occupancy	Occupancy std	Lipid Count	Lipid Count std	Koff	Residence Time	R Squared	Koff Bootstrap avg	R Squared Bootstrap avg	Binding Site ID
0	1ILE	0	0.048108	0.037333	8.366534	1.195219	1.000000	0.000000	11.804224	0.084715	0.998837	17.188663	0.998392	0
1	2MET	1	0.035000	0.016583	1.792829	0.996016	1.000000	0.000000	30.345855	0.032953	0.997823	32.201659	0.996950	0
2	3GLY	2	0.020000	0.020000	0.199203	0.199203	0.500000	0.500000	29.872911	0.033475	0.975950	26.885620	NaN	0
3	4SER	3	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-1
4	5SER	4	0.080000	0.178232	7.569721	3.187251	1.000000	0.000000	1.790196	0.558598	0.960367	7.793917	0.975108	0
5	6VAL	5	0.072340	0.068486	12.749004	0.000000	1.015625	0.015625	13.032816	0.076729	0.999307	10.955595	0.998862	0
6	7TYR	6	0.099259	0.149664	31.274900	5.776892	1.021505	0.021505	4.313793	0.231815	0.996524	4.922636	0.994342	1
7	8ILE	7	0.187368	0.236128	6.374502	1.992032	1.000000	0.000000	2.335581	0.428159	0.992678	2.604414	0.993542	2
8	9THR	8	0.074085	0.104188	21.314741	3.784861	1.038600	0.006854	11.245471	0.088925	0.924258	6.244023	0.992139	0
9	10VAL	9	0.105246	0.146742	19.521912	0.398406	1.051250	0.011250	4.454339	0.224500	0.997426	3.898776	0.995095	1

Therefore, to check residues of a binding site and look at their interaction properties, we can use this Binding Site ID:

[16]:

mask = (li.dataset["Binding Site ID"] == 0) # check binding site NO. 0
li.dataset[mask]

[16]:

	Residue	Residue ID	Duration	Duration std	Occupancy	Occupancy std	Lipid Count	Lipid Count std	Koff	Residence Time	R Squared	Koff Bootstrap avg	R Squared Bootstrap avg
0	1ILE	0	0.048108	0.037333	8.366534	1.195219	1.000000	0.000000	11.804224	0.084715	0.998837	17.188663	0.998392
1	2MET	1	0.035000	0.016583	1.792829	0.996016	1.000000	0.000000	30.345855	0.032953	0.997823	32.201659	0.996950
2	3GLY	2	0.020000	0.020000	0.199203	0.199203	0.500000	0.500000	29.872911	0.033475	0.975950	26.885620	NaN
4	5SER	4	0.080000	0.178232	7.569721	3.187251	1.000000	0.000000	1.790196	0.558598	0.960367	7.793917	0.975108
5	6VAL	5	0.072340	0.068486	12.749004	0.000000	1.015625	0.015625	13.032816	0.076729	0.999307	10.955595	0.998862
8	9THR	8	0.074085	0.104188	21.314741	3.784861	1.038600	0.006854	11.245471	0.088925	0.924258	6.244023	0.992139
64	65SER	64	0.180000	0.180000	0.199203	0.199203	0.500000	0.500000	3.454593	0.289470	0.973915	2.418215	NaN

Durations, occupancy and lipid count¶

Similar to interaction with residues, interactions with binding sites can also be checked in a couple of metrics. The same syntax is used.

[17]:

durations = li.compute_site_duration()
occupancies = li.compute_site_occupancy()
lipidcounts = li.compute_site_lipidcount()
print(len(node_list), len(li.trajfile_list))
print(len(durations), len(durations[0]))
print(len(occupancies), len(occupancies[0]))
print(len(lipidcounts), len(lipidcounts[0]))

CALCULATE DURATION PER BINDING SITE: 100%|██████████| 16/16 [00:00<00:00, 49.09it/s]
CALCULATE OCCUPANCY PER BINDING SITE: 100%|██████████| 16/16 [00:00<00:00, 134.74it/s]
CALCULATE LIPIDCOUNT PER BINDING SITE: 100%|██████████| 16/16 [00:00<00:00, 132.91it/s]

Similar to the calculation for residues, the returned list contains calculated values for each binding site from the two trajectories.

Residence Time and koff¶

Similar to that for residue, residence time and koff can be calculated for binding sites, via the function compute_site_koff

[18]:

koff_BS, res_time_BS = li.compute_site_koff(plot_data=True)

The returned values are koff and residence time for each binding site. plot_data=True will generate the koff figure shown above for each binding site. Now the dataset is expanded to include information for binding site residence time and koff. Residues from the same binding site will record the same binding site interaction data:

[19]:

li.dataset[li.dataset["Binding Site ID"] == 2]

[19]:

	Residue	Residue ID	Duration	Duration std	Occupancy	Occupancy std	Lipid Count	Lipid Count std	Koff	Residence Time	...	R Squared Bootstrap avg	Binding Site ID	Binding Site Duration	Binding Site Occupancy	Binding Site Lipid Count	Binding Site Koff	Binding Site Koff Bootstrap avg	Binding Site Residence Time	Binding Site R Squared	Binding Site R Squared Bootstrap avg
7	8ILE	7	0.187368	0.236128	6.374502	1.992032	1.000000	0.000000	2.335581	0.428159	...	0.993542	2	0.160196	73.904382	1.26079	2.358624	1.492173	0.423976	0.998354	0.995259
11	12LEU	11	0.257778	0.351371	23.306773	1.792829	1.059524	0.003968	2.295437	0.435647	...	0.995768	2	0.160196	73.904382	1.26079	2.358624	1.492173	0.423976	0.998354	0.995259
14	15ALA	14	0.223333	0.182361	1.195219	0.000000	1.000000	0.000000	3.246453	0.308028	...	0.985841	2	0.160196	73.904382	1.26079	2.358624	1.492173	0.423976	0.998354	0.995259
15	16VAL	15	0.128621	0.182931	19.920319	1.593625	1.040258	0.003221	1.211126	0.825678	...	0.997457	2	0.160196	73.904382	1.26079	2.358624	1.492173	0.423976	0.998354	0.995259
18	19ILE	18	0.172105	0.172141	12.350598	1.992032	1.038462	0.038462	3.809314	0.262514	...	0.996998	2	0.160196	73.904382	1.26079	2.358624	1.492173	0.423976	0.998354	0.995259
53	54GLY	53	0.200000	0.161864	0.796813	0.796813	0.500000	0.500000	3.723346	0.268576	...	0.984969	2	0.160196	73.904382	1.26079	2.358624	1.492173	0.423976	0.998354	0.995259
54	55VAL	54	0.184444	0.235132	22.310757	1.992032	1.026197	0.006590	2.649627	0.377412	...	0.997034	2	0.160196	73.904382	1.26079	2.358624	1.492173	0.423976	0.998354	0.995259
57	58ILE	57	0.383158	0.329674	24.302789	1.992032	1.008929	0.008929	1.316120	0.759809	...	0.995668	2	0.160196	73.904382	1.26079	2.358624	1.492173	0.423976	0.998354	0.995259
58	59PRO	58	0.296429	0.301996	16.932271	2.589641	1.037982	0.017574	2.093627	0.477640	...	0.994831	2	0.160196	73.904382	1.26079	2.358624	1.492173	0.423976	0.998354	0.995259
61	62ILE	61	0.212941	0.269635	6.772908	1.992032	1.000000	0.000000	2.067640	0.483643	...	0.991763	2	0.160196	73.904382	1.26079	2.358624	1.492173	0.423976	0.998354	0.995259
298	299LYS	298	0.010000	0.010000	0.199203	0.199203	0.500000	0.500000	0.000000	0.000000	...	NaN	2	0.160196	73.904382	1.26079	2.358624	1.492173	0.423976	0.998354	0.995259

11 rows × 22 columns

LipidInteraction also provides a assisting function binding_site to show binding site information and show some stats and returns the pandas.DataFrame containing residues of the specified binding site.

[20]:

df = li.binding_site(binding_site_id=3) # check the binding site No 3
df.head(10)

# Binding site 3
 Binding Site Residence Time:       0.402 us
 Binding Site Koff:                 2.490  R squared:  0.9984
 Binding Site Duration:             0.084 us
 Binding Site Occupancy:           36.056 %
 Binding Site Lipid Count:          1.087
 Pos. Charge:        2/8
 Neg. Charge:        0/8
 Polar:              0/8
 Hydrophobic:        4/8
 Special:            2/8
 Residue Res ID  Res. Time (us)  Duration (us)   Occupancy (%) Lipid Count  Koff  R Squared
 297PHE    296       0.322           0.098          17.131        1.008     3.108   0.998
 301ILE    300       0.108           0.087          13.745        1.000     9.263   0.999
  21GLY    20        0.103           0.077           3.984        1.000     9.719   0.990
 298ARG    297       0.095           0.064           3.386        1.000    10.535   0.997
  17LEU    16        0.052           0.057          13.147        1.014    19.296   0.999
 287ALA    286       0.047           0.030           0.797        0.500    21.240   0.993
 294ARG    293       0.033           0.025           0.598        0.500    30.652   0.996
 283PRO    282       0.030           0.032           0.996        1.000    32.883   0.994

[20]:

	Residue	Residue ID	Duration	Duration std	Occupancy	Occupancy std	Lipid Count	Lipid Count std	Koff	Residence Time	...	R Squared Bootstrap avg	Binding Site ID	Binding Site Duration	Binding Site Occupancy	Binding Site Lipid Count	Binding Site Koff	Binding Site Koff Bootstrap avg	Binding Site Residence Time	Binding Site R Squared	Binding Site R Squared Bootstrap avg
282	283PRO	282	0.032000	0.024000	0.996016	0.597610	1.000000	0.000000	32.882928	0.030411	...	0.693587	3	0.08427	36.055777	1.086764	2.489836	6.14596	0.401633	0.998424	0.998174
293	294ARG	293	0.025000	0.021794	0.597610	0.597610	0.500000	0.500000	30.652487	0.032624	...	0.794795	3	0.08427	36.055777	1.086764	2.489836	6.14596	0.401633	0.998424	0.998174
286	287ALA	286	0.030000	0.030000	0.796813	0.796813	0.500000	0.500000	21.240404	0.047080	...	0.492717	3	0.08427	36.055777	1.086764	2.489836	6.14596	0.401633	0.998424	0.998174
16	17LEU	16	0.056923	0.047940	13.147410	0.796813	1.014286	0.014286	19.295543	0.051825	...	0.999266	3	0.08427	36.055777	1.086764	2.489836	6.14596	0.401633	0.998424	0.998174
297	298ARG	297	0.063750	0.060917	3.386454	0.199203	1.000000	0.000000	10.535137	0.094920	...	0.994440	3	0.08427	36.055777	1.086764	2.489836	6.14596	0.401633	0.998424	0.998174
20	21GLY	20	0.077143	0.090351	3.984064	0.796813	1.000000	0.000000	9.719108	0.102890	...	0.991340	3	0.08427	36.055777	1.086764	2.489836	6.14596	0.401633	0.998424	0.998174
300	301ILE	300	0.086809	0.071014	13.745020	2.589641	1.000000	0.000000	9.262659	0.107960	...	0.998140	3	0.08427	36.055777	1.086764	2.489836	6.14596	0.401633	0.998424	0.998174
296	297PHE	296	0.098095	0.099696	17.131474	6.772908	1.008333	0.008333	3.107937	0.321757	...	0.997092	3	0.08427	36.055777	1.086764	2.489836	6.14596	0.401633	0.998424	0.998174

8 rows × 22 columns

Analyze bound poses¶

Once the binding sites are defined, we can proceed to analyze bound poses, e.g. check the most representative bound poses for binding sites and check bound pose clusters for binding sites etc.

[21]:

pose_pool, pose_rmsd = li.analyze_bound_poses(binding_site_id=None, pose_format="pdb",
                                             n_top_poses=3, n_clusters='auto') # None means analysis for all residues

../_images/tutorials_0-application-walk-through_43_1.png

analyze_bound_poses() will rate the bound poses for each binding site based on probability density calculated from the simulation trajectories and write out a couple of top rated poses (controled by n_poses). It will also cluster the bound poses for each binding site, for which the number of clusters can be determined by a density based cluster DBSCAN or provided by users via the parameter n_clusters. So for each binding site, it will generate two directories (if n_clusters is not 0), one for top-rated poses (in the directory of ‘BSid{num}_rank’) and one for all the clustered poses (in the directory of ‘BSid{num}_clusters’). All these directories are then put under the “Bound_Poses_{lipid}” directory if the parameter save_dir is None, i.e. not given. An overview of the directories organisation is shown below:

f48443b8c303410c9f79966b4f28a324

The written poses store the lipid pose and the recetpor conformation that this pose was bound to. By default, the pose files are saved in the Gromacs ‘gro’ format but can be changed to, e.g. ‘pdb’ format by the parameter pose_format. For the top-ranked poses, the pose named ‘top1’ is the one with highest scores, i.e. the most representative bound pose from the simulations; and ‘top2’ is the second, and so on. For coarse-grained simulations, these top-ranked poses can be quite similar to each other due to the smoothened potentials. These poses can be checked by such molecular visualization softwares as VMD and PyMol etc. Below shows the top rated bound cholesterol pose at binding site No 3 of A2a. This cholesterol binding site have been seen in many of the Class A GPCR x-ray structures:

11fe9c1ee8fb44abaf865b4ae4dc77fc

The lipid bound poses for each binding site will be returned. The RMSD of bound poses to the binding site average will be returned and plotted in a violin plot.

42c2f80c1f95473899c4e1151bc4f516

Similar to other analysis functions, analyze_bound_poses can also select a couple of binding sites for analysis via binding_site_id.

The returned pose_pool is a dict object that stores bound poses for binding sites, with the binding site id as the dictionary keys and the bound pose coordinates as the corresponding values. pose_rmsd is a pandas.DataFrame that stores RMSD for the bound poses (in unit of nm).

[22]:

pose_rmsd.head()

[22]:

	Binding Site 0	Binding Site 1	Binding Site 2	Binding Site 3	Binding Site 4	Binding Site 5	Binding Site 6	Binding Site 7	Binding Site 8	Binding Site 9	Binding Site 10	Binding Site 11	Binding Site 12	Binding Site 13	Binding Site 14	Binding Site 15
0	0.223008	0.192302	0.288550	0.487145	0.325898	0.216980	0.417243	0.598373	0.303224	0.260776	0.174348	0.681689	0.269166	0.122740	0.362086	0.569712
1	0.214176	0.374907	0.317757	1.028707	0.492244	0.234929	0.465488	0.313493	0.312007	0.245380	0.189006	0.454326	0.267336	0.193313	0.836323	0.692334
2	0.279867	0.690856	0.540312	0.383474	0.372813	0.216115	0.463774	0.152496	0.780332	0.550248	0.127422	0.375738	0.265146	0.129976	0.524896	0.978642
3	0.199653	0.377980	0.362978	0.379433	0.406318	0.261471	0.273237	0.332318	0.325981	0.206775	0.196669	0.272414	0.179198	0.272044	0.273617	0.529334
4	0.288850	0.348632	0.273372	0.251446	0.244760	0.240792	0.359739	0.161057	0.385767	0.587389	0.326057	0.516921	0.297746	0.280991	0.701244	0.343846

Calculate surface area¶

We can also calculate the surface area for binding site, obtaining the values as a function of time. print_data=True generates a timeseries plot and a violinplot.

[23]:

surface_area = li.compute_surface_area(plot_data=True)

../_images/tutorials_0-application-walk-through_48_1.png

../_images/tutorials_0-application-walk-through_48_2.png

The returned surface_area is pandas.DataFrame object that stores the time evolution of the surface area for each binding site. The reported values are in the unit of nm^2.

[24]:

surface_area.head()

[24]:

			Binding Site 0	Binding Site 1	Binding Site 2	Binding Site 3	Binding Site 4	Binding Site 5	Binding Site 6	Binding Site 7	Binding Site 8	Binding Site 9	Binding Site 10	Binding Site 11	Binding Site 12	Binding Site 13	Binding Site 14	Binding Site 15	Time
0	0	0	5.180549	4.713027	6.588768	5.615480	6.622356	7.665939	7.261710	3.359309	9.046471	4.861176	3.672477	8.362637	3.414401	6.126667	8.513744	8.631628	3.00
		1	5.207631	4.389207	6.678460	5.471372	6.412016	8.460323	7.108385	3.279337	8.903983	4.817450	3.687601	8.675516	3.511277	5.917765	9.195548	8.689010	3.02
		2	5.548610	4.465664	5.872028	5.773362	6.685437	7.553239	6.973439	3.484172	9.458707	4.827971	3.402123	9.300999	3.251616	5.497607	9.364325	7.785925	3.04
		3	5.607398	4.589885	6.560425	5.169632	6.339952	8.063607	7.462493	3.413392	9.532347	4.830526	3.462173	8.392797	3.873588	5.904237	8.515190	7.945544	3.06
		4	5.024073	4.214376	6.399018	6.042726	6.304193	9.341050	7.435935	3.429047	9.438010	4.907884	3.322324	8.536524	3.349148	5.717446	9.065802	7.539258	3.08

Similar to other binding site analysis functions, compute_surface_area allows for selection of binding site for calculation via binding_site_id.

The calculation of surface area needs definition of the radii of atoms in the system. The function uses mdtraj’s atom radii and defines the radii of MARTINI 2 beads. New radii definition can be provided via radii.

Plots and save data¶

The class object LipidInteraction provides a couple of assisting functions to plot and save data. These figures plot various interaction metrics as a function of the protein residue index. These figures help to point out the contact hotspots.

[25]:

for item in ["Residence Time", "Duration", "Occupancy", "Lipid Count"]:
    li.plot(item=item) # plot values as a function of time

Creating new director: /Users/wlsong/PyLipID/PyLipID-dev/docs/tutorials/Interaction_CHOL/Figure_CHOL

../_images/tutorials_0-application-walk-through_54_1.png

../_images/tutorials_0-application-walk-through_54_2.png

../_images/tutorials_0-application-walk-through_54_3.png

../_images/tutorials_0-application-walk-through_54_4.png

PyLipID also plots interactions in logo figures generated by logomakek. This kind of figure can directly point out the amino acids that make the contacts. It could be helpful to understand the chemical nature of lipid interactions.

[26]:

li.plot_logo(item="Residence Time")

../_images/tutorials_0-application-walk-through_56_0.png

As mentioned above, the calculated binding site information can be saved in a text file. The order of the data can be controled by the sort_residue parameter:

[27]:

li.write_site_info(sort_residue="Residence Time") # write binding site information in a txt file

To assist understanding of the interaction from a structural perspective, PyLipID can also save the interaction information in the B factor column of the protein coordinate in the pdb format. The following example will save the ‘Residence Time’ of each residue in a pdb file “Coordinate_CHOL_Residence_Time.pdf” in the directory of “Coordinate_{lipid}”:

[28]:

li.save_coordinate(item="Residence Time") # write protein coordinate in pdb format with the b factor column
                                          # showing 'Residence Time' values

Creating new director: /Users/wlsong/PyLipID/PyLipID-dev/docs/tutorials/Interaction_CHOL/Coordinate_CHOL

The coordinate file can be viewed in such visualization softwares as VMD and PyMol, and the structure can be coloured based on B factors to obtain the following images:

f81f96183489458e89c49fbd186fb909

Shown above are the A2aR structure in the coarse-grained model. To make the secondary and teriary structure clearer, only the backbone beads were shown whereas the side beads were hidden. A ‘white to red’ color map was used to show the residence time for each residue in VMD. The interaction hotspots and their locations can be easily analyzed through such a presentation.