Projects

The main focus of research in my lab is the development of machine learning methods for problems in bioinformatics. Our specialty is the creative design of kernel methods for problems ranging from prediction of protein function and protein interactions to prediction of alternative splicing.

Prediction of protein interfaces with graph convolutional networks

Proteins perform their functions by interacting with other proteins. It is therefore important to understand the complex network of interactions between proteins and, at a finer level, to know the interfaces through which those interactions occur. To address the problem of predicting protein interfaces, we are developing deep neural networks that use protein 3-d structure.

_static/prot_interface_graph.png

In this work we represent a protein structure as a graph whose nodes are the amino acids and whose edges are determined by proximity in the structure. Our goal is to apply convolutional neural networks to this data.
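As a rough illustration of this representation, the sketch below builds such a graph from one 3-d coordinate per residue. The function name `residue_graph`, the toy coordinates, and the 8 angstrom cutoff are assumptions made for the example, not the settings used in our models.

```python
import math


def residue_graph(coords, cutoff=8.0):
    """Link every pair of residues whose coordinates lie within `cutoff`.

    coords: list of (x, y, z) tuples, one representative point per residue.
    Returns a dict mapping each residue index to a list of neighbor indices.
    The cutoff value is an illustrative assumption.
    """
    n = len(coords)
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                # Proximity is symmetric, so the edge is stored in both directions.
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


# Toy coordinates (hypothetical): three nearby residues and one distant residue.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
graph = residue_graph(coords)
print(graph)
```

The all-pairs loop is quadratic in the number of residues, which is acceptable for a single protein chain; a spatial index would be the natural replacement for very large structures.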

Convolutional neural networks have become one of the primary tools in computer vision. Their power comes from their ability to learn features with increasing levels of abstraction that are invariant to various transformations of the image (e.g. translation). We envisioned a similar approach for protein 3-d structures, which required us to replace convolution over a regular grid with convolution over a graph structure. We are currently designing custom convolutional operators specifically tailored for this task. Our first results on this problem were published at last year’s NIPS conference:


We are currently extending this work to the related tasks of ranking protein docking solutions and protein structure predictions.
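To make the idea of convolving over a graph rather than a regular grid concrete, here is a minimal sketch of one generic graph convolution layer. It combines each node's own features with the mean of its neighbors' features and applies a ReLU. This is one common formulation chosen for illustration, not the custom operators we are designing; the names `graph_conv`, `W_center`, and `W_neigh` and all numeric values are hypothetical.

```python
def mat_vec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def graph_conv(features, neighbors, W_center, W_neigh, bias):
    """One illustrative graph convolution layer (mean aggregation + ReLU).

    features:  list of per-node feature vectors
    neighbors: dict mapping node index -> list of neighbor indices
    """
    out = []
    for i, x in enumerate(features):
        z = mat_vec(W_center, x)
        nbrs = neighbors[i]
        if nbrs:
            # Averaging makes the result independent of neighbor ordering,
            # since a graph has no canonical ordering of a node's neighbors.
            mean = [sum(features[j][k] for j in nbrs) / len(nbrs)
                    for k in range(len(x))]
            z = [a + b for a, b in zip(z, mat_vec(W_neigh, mean))]
        out.append([max(0.0, a + b) for a, b in zip(z, bias)])
    return out


# Toy graph (hypothetical): node 0 is connected to nodes 1 and 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
W_center = [[1.0, 0.0], [0.0, 1.0]]
W_neigh = [[0.5, 0.0], [0.0, 0.5]]
bias = [0.0, 0.0]

hidden = graph_conv(features, neighbors, W_center, W_neigh, bias)
print(hidden)
```

The same weights are shared across all nodes, which plays the role that weight sharing across image positions plays in a grid convolution.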

This project is funded by a grant from the NSF ABI program (award #1564840).

Our proof of concept for the feasibility of partner-specific prediction of interfaces from 3-d structure used SVMs and resulted in a method called PAIRpred:

My earlier work in the area of interface and interaction prediction includes prediction of calmodulin binding sites and genome-wide prediction of interaction networks in yeast and human.

Alternative splicing in plants

Splicing is the process whereby segments of a gene's transcript called introns are removed and the remaining segments (exons) are joined to form the mature mRNA. A given gene can be spliced in multiple ways, a phenomenon called alternative splicing. Whereas it is well studied in animals, alternative splicing in plants is not as well understood, and the differences in genome architecture between plants and animals lead to differences in alternative splicing. We are working on this in collaboration with A.S.N. Reddy of the Biology Department. Our approach is to computationally search for genomic features that are predictive of alternative splicing, namely elements that serve as splicing enhancers and suppressors, and to test their biological relevance to the process.
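The mechanics of splicing can be sketched in a few lines. The example below removes intron intervals from a toy transcript and joins what remains; choosing different intervals yields different mature mRNAs, which is the essence of alternative splicing. The sequence, the coordinates, and the function name `splice` are made up for illustration.

```python
def splice(pre_mrna, introns):
    """Remove intron intervals and join the remaining segments.

    introns: list of non-overlapping (start, end) pairs, 0-based, end-exclusive.
    """
    pieces = []
    pos = 0
    for start, end in sorted(introns):
        pieces.append(pre_mrna[pos:start])  # keep the segment before the intron
        pos = end                           # skip over the intron
    pieces.append(pre_mrna[pos:])           # keep the final segment
    return "".join(pieces)


# Toy transcript (hypothetical): exons in upper case, introns in lower case.
pre_mrna = "AUGG" + "guccag" + "CCAA" + "guuuag" + "UUGA"

fully_spliced = splice(pre_mrna, [(4, 10), (14, 20)])  # both introns removed
exon_skipped = splice(pre_mrna, [(4, 20)])             # middle exon removed too
intron_retained = splice(pre_mrna, [(4, 10)])          # second intron kept

print(fully_spliced)
print(exon_skipped)
print(intron_retained)
```

All three products come from the same transcript; only the choice of removed intervals differs.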

A second avenue we are pursuing is to leverage next-generation sequencing data to predict alternative splicing events and improve genome annotation. The noisy nature of these data makes this a challenging task.

_images/AT1G02205.png

The above figure shows a model created by our SpliceGrapher tool.


This project is funded by NSF and DOE.

Protein function prediction

_images/go.png

Although protein function prediction has been studied for over twenty years, the standard method remains annotation transfer. The difficulty in applying state-of-the-art machine learning methods is that proteins can have multiple functions, and that the system of keywords used to describe protein function, the Gene Ontology (GO), has a complex hierarchical structure. This provides genome annotators with a rich vocabulary with which to describe protein function, but makes standard classification approaches a poor fit. There are therefore significant opportunities to develop new classification methods that treat function prediction as a hierarchical classification problem.
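One consequence of the hierarchy is that a protein annotated with a specific term is implicitly annotated with all of that term's ancestors, so a valid prediction is a set of labels that is closed under the parent relation. The sketch below illustrates this with a toy hierarchy. The term names are borrowed from GO, but the parent links are simplified for the example and are not the real ontology; the function name `with_ancestors` is hypothetical.

```python
def with_ancestors(terms, parents):
    """Return the given terms together with all of their ancestors.

    parents: dict mapping a term to the list of its direct parent terms.
    """
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            # A term can have several parents, so the structure is a
            # directed acyclic graph rather than a simple tree.
            stack.extend(parents.get(term, ()))
    return closed


# Toy hierarchy (simplified, NOT the real GO structure).
parents = {
    "binding": ["molecular_function"],
    "catalytic activity": ["molecular_function"],
    "protein binding": ["binding"],
    "kinase activity": ["catalytic activity"],
    "calmodulin binding": ["protein binding"],
}

# A protein with two specific functions receives six labels in total.
labels = with_ancestors({"calmodulin binding", "kinase activity"}, parents)
print(sorted(labels))
```

A classifier that predicts each term independently can output label sets that violate this closure, which is one reason to treat the problem as hierarchical classification.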

Our approach uses a structured SVM, which can model both the multi-label nature and the hierarchical structure of this learning problem. The method we developed, GOstruct, has shown state-of-the-art performance on several benchmarks.


This project was funded by an NSF grant from the ABI program (award #0965768).