pylipid.api.LipidInteraction.analyze_bound_poses¶

LipidInteraction.analyze_bound_poses(binding_site_id=None, n_top_poses=3, n_clusters='auto', pose_format='gro', score_weights=None, kde_bw=0.15, pca_component=0.9, plot_rmsd=True, save_dir=None, eps=None, min_samples=None, metric='euclidean', fig_close=False, fig_format='pdf', num_cpus=None)[source]¶

Analyze bound poses for binding sites.

This function can find representative bound poses, cluster the bound poses and calculate pose RMSD for binding sites.

If n_top_poses is an integer larger than 0, this method will find the representative bound poses for the specified binding sites. To do so, it evaluates all the bound poses in a binding site using a density-based scoring function and ranks the poses using based on the scores. The scoring function is defined as:

\[\text { score }=\sum_{i} W_{i} \cdot \hat{f}_{i, H}(D)\]

where \(W_{i}\) is the weight given to atom i of the lipid molecule, H is the bandwidth and \(\hat{f}_{i, H}(D)\) is a multivariate kernel density etimation of the position of atom i in the specified binding site. \(\hat{f}_{i, H}(D)\) is calculated from all the bound lipid poses in that binding site. The position of atom i is a p-variant vector, \(\left[D_{i 1}, D_{i 2}, \ldots, D_{i p}\right]\) where \(D_{i p}\) is the minimum distance to the residue p of the binding site. The multivariant kernel density is estimated by KDEMultivariate provided by Statsmodels. Higher weights can be given to e.g. the headgroup atoms of phospholipids, to generate better defined binding poses, but all lipid atoms are weighted equally by default. The use of relative positions of lipid atoms in their binding site makes the analysis independent of the conformational changes in the rest part of the protein.

If n_clusters is given an integer larger than 0, this method will cluster the lipid bound poses in the specified binding site using KMeans provided by scikit. The KMeans cluster separates the samples into n clusters of equal variances, via minimizing the inertia, which is defined as:

\[\sum_{i=0}^{n} \min _{u_{i} \in C}\left(\left\|x_{i}-u_{i}\right\|^{2}\right)\]

where \(u_{i}\) is the centroid of cluster i. KMeans scales well with large dataset but performs poorly with clusters of varying sizes and density, which are often the case for lipid poses in a binding site.

If n_clusters is set to auto, this method will cluster the bound poses using a density-based cluster DBSCAN provided by scikit. DBSCAN finds clusters of core samples of high density. A sample point is a core sample if at least min_samples points are within distance \(\varepsilon\) of it. A cluster is defined as a set of sample points that are mutually density-connected and density-reachable, i.e. there is a path \(\left\langle p_{1}, p_{2}, \ldots, p_{n}\right\rangle\) where each \(p_{i+1}\) is within distance \(\varepsilon\) of \(p_{i}\) for any two p in the two. The values of min_samples and \(\varepsilon\) determine the performance of this cluster. If None, min_samples takes the value of 2 * ndim. If \(\varepsilon\) is None, it is set as the value at the knee of the k-distance plot.

For writing out the cluster poses, this method will randomly select one pose from each cluster in the case of using KMeans or one from the core samples of each cluster when DBSCAN is used, and writes the selected lipid pose with the protein conformation to which it binds using MDTraj, in the provided pose format.

The root mean square deviation (RMSD) of a lipid bound pose in a binding site is calculated from the relative position of the pose in the binding site compared to the average position of the bound poses. Thus, the pose RMSD is defined as:

\[RMSD=\sqrt{\frac{\sum_{i=0}^{N} \sum_{j=0}^{M}\left(D_{i j}-\bar{D}_{j}\right)^{2}}{N}}\]

where \(D_{i j}\) is the distance of atom i to the residue j in the binding site; \(\bar{D}_{j}\) is the average distance of atom i from all bound poses in the binding site to residue j; N is the number of atoms in the lipid molecule and M is the number of residues in the binding site.

Parameters

binding_site_id (int or list of int, default=None) – The binding site ID used in PyLipID. This ID is the index in the binding site node list that is calculated by the method compute_binding_nodes. The ID of the N-th binding site is (N-1). If None, the contact duration of all binding sites are calculated.
n_top_poses (int, default=3) – Number of representative bound poses written out for the selected binding site.
n_clusters (int or 'auto') – Number of clusters to form for bound poses of the selected binding site. If n_clusters is set to ‘auto’, the density-based clusterer DBSCAN will be used. If ``n_clusters` is given a non-zero integer, KMeans is used.
pose_format (str, default="gro") – The coordinate format the representative poses and clsutered poses are saved with. Support the formats that are included in MDtraj.save().
score_weights (dict or None, default=None) – The weights given to atoms in the scoring function for finding the representative bound poses. It should in the format of a Python dictionary {atom name: weight}. The atom name should be consisten with the topology. By default, all atoms in the lipid molecule are weighted equally.
kde_bw (scalar, default=0.15) –
The bandwidth for the Gaussian kernel. Used in the density estimation of the lipid atom coordinates in the binding site. Used by the function KDEMultivariate .
pca_component (int, float or ‘mle’, default=0.90) – The PCA used to decrease the dimensions of lipid atom coordinates. The coordinate of a lipid atom in the binding site is expressed as a distance vector of the minimum distances to the residues in that binding site, i.e. \([D_{i 1}, D_{i 2}, .., D_{i p}]\). This can be in high dimensions. Hence, PCA is used on this distance vector prior to calculation of the density. This PCA is carried out by PCA in sci-kit.
plot_rmsd (bool, default=True) – Plot the binding site RMSDs in a violinplot.
eps (float or None, default=None) –
The minimum neighbour distance used by DBSCAN if None, the value will determined from as the elbow point of the sorted minimum neighbour distance of all the data points.
min_samples (int or None, default=None) –
The minimum number of samples to be considered as core samples used by DBSCAN . If None, the value will be automatically determined based on the size of data.
metric (str, default='euclidean') –
The metric used to calculated neighbour distance used by DBSCAN . Default is the Euclidean distance.
fig_close (bool, default=False) – This parameter control whether to close the plotted figures using plt.close(). It can save memory if many figures are generated.
fig_format (str, default="pdf") – Figure format. Support formats included in matplotlib.pyplot.
num_cpus (int or None default=None) – The number of CPUs used for the tasks of ranking the poses and clustering poses. Python multiprocessing deployed by p_tqdm is used to speed up these calculations.
save_dir (str or None, default=None) – The root directory for saving the pose analysis results. If None, the root directory set at the initiation of LipidInteraction will be used. The representative poses and clustered poses will be stored in the directory of Bound_Poses_{lipid} under the root directory.

Returns

pose_pool (dict) – Coordinates of the bound poses for the selected binding sites stored in a python dictionary {binding_site_id: pose coordinates}. The poses coordinates include lipid coordinates and those of the receptor at the time the pose was bound. The pose coordinates are stored in a mdtraj.Trajectory object.
rmsd_data (pandas.DataFrame) – Bound poses RMSDs are stored by columns with column name of binding site id.