Research in my lab focuses on developing machine learning methods for problems in bioinformatics. Our lab pioneered the use of graph convolutional networks for the analysis of protein structures. We are also applying deep learning techniques to extract signals from genomic sequences.

Prediction of protein interfaces with graph convolutional networks

Proteins perform their function by interacting with other proteins. Understanding the complex network of interactions between proteins and, at a finer level, knowing the interfaces through which those interactions occur are therefore highly important. To predict protein interfaces, we are developing deep neural networks that use protein 3D structure.


Convolutional neural networks have become one of the primary tools in computer vision. Their power comes from their ability to learn features with increasing levels of abstraction that are invariant to various transformations of the image (e.g. translation and rotation). We envisioned a similar approach for protein 3D structures. However, instead of representing a protein as a 3D image, we chose to represent its structure as a graph whose nodes are atoms or amino acids and whose edges are determined by proximity in the protein structure. This required us to replace the standard form of convolution, which operates over a regular grid, with convolution over a graph structure. Our first results using this technique were published at the NIPS conference.


We are currently extending this work to the related tasks of prioritizing protein docking solutions and protein structure predictions.
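
As an illustration of the graph representation and graph convolution described above, here is a minimal sketch in plain Python. It is not the architecture from our publication: the k-nearest-neighbor rule for edges, the mean aggregation over neighbors, and all names (`build_neighbor_graph`, `graph_conv`, `w_center`, `w_neigh`) are assumptions chosen for brevity.

```python
import math


def build_neighbor_graph(coords, k):
    """Connect each node to its k nearest neighbors by Euclidean distance."""
    neighbors = []
    for i, ci in enumerate(coords):
        dists = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def matvec(weights, x):
    """Multiply a (d_out x d_in) weight matrix, given as nested lists, by x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def graph_conv(features, neighbors, w_center, w_neigh, bias):
    """One graph convolution layer (illustrative form, assumed here).

    Each node's output is ReLU(W_center x_i + mean_j(W_neigh x_j) + bias),
    where j ranges over the neighbors of node i.
    """
    out = []
    for i, x in enumerate(features):
        center = matvec(w_center, x)
        agg = [0.0] * len(bias)
        for j in neighbors[i]:
            for d, v in enumerate(matvec(w_neigh, features[j])):
                agg[d] += v / len(neighbors[i])
        out.append([max(0.0, c + a + b) for c, a, b in zip(center, agg, bias)])
    return out
```

Because the neighbor contributions are averaged, the output does not depend on the order in which neighbors are listed, and because edges are derived from pairwise distances, the graph is unchanged when the whole structure is rotated or translated. Stacking several such layers lets each node draw on information from progressively larger neighborhoods.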

This project was funded by a grant from the NSF ABI program (award #1564840).


Our proof of concept for partner-specific prediction of interfaces from 3D structure used support vector machines (SVMs) and resulted in a method called PAIRpred.

Earlier work on interface and interaction prediction includes prediction of calmodulin binding sites and genome-wide prediction of interaction networks in yeast and human.

Deep learning in genomics

Genes, which contain the information that codes for mRNA and proteins, are constantly turned on and off in response to the needs of the cell and the organism they are part of. The molecular switches that control genes are proteins that bind DNA. These DNA-binding proteins (transcription factors) recognize signals in the DNA. The complexity of the genome and of the molecular processes that control its expression is one of the major challenges facing biologists. We are using deep learning techniques to discover these signals and to build models that help unravel various biological processes (see the discussion of alternative splicing below). Recent work in this area includes a comparative study of deep learning architectures for discovering signals in DNA and RNA.

We also designed a deep learning algorithm that enables the discovery of interactions between regulatory features.
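
To give a feel for how a convolutional layer can pick up a signal in DNA, the sketch below one-hot encodes a sequence and slides a single filter along it. It is an illustration only, not one of the algorithms mentioned above: the function names are hypothetical, and the filter is set by hand to match a short motif, whereas in a trained network the filter weights would be learned from data.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator vectors (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d_scan(encoded, filt):
    """Slide a (width x 4) filter along the sequence; return one score per window."""
    width = len(filt)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * filt[offset][channel]
        scores.append(score)
    return scores


def best_match(seq, filt):
    """Return (position, score) of the highest-scoring window."""
    scores = conv1d_scan(one_hot(seq), filt)
    pos = max(range(len(scores)), key=scores.__getitem__)
    return pos, scores[pos]
```

With the hand-set filter `one_hot("TATA")`, each window's score is simply the number of positions that match the motif, so `best_match("GGCTATAAGC", one_hot("TATA"))` returns `(3, 4.0)`. A learned filter plays the same role but with real-valued weights, which lets it tolerate variation at some positions more than at others.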

Alternative splicing

Splicing is the process whereby parts of a gene called introns are removed and the RNA is spliced back together to form the mature mRNA. A given gene can be spliced in multiple ways, a phenomenon called alternative splicing. Whereas it is well studied in animals, alternative splicing in plants is not as well understood, and the differences in genome architecture between plants and animals lead to differences in alternative splicing. We are working on this problem in collaboration with A.S.N. Reddy of the Biology Department. Our approach is to computationally search for genomic features that are predictive of alternative splicing, namely elements that serve as splicing enhancers and suppressors, and to test their biological relevance to the process.


The above figure shows a model created by our SpliceGrapher tool.
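
The definition of splicing given above can be made concrete with a short sketch. This is a toy model, not part of SpliceGrapher: the function name `splice` and the representation of introns as half-open `(start, end)` index pairs are assumptions made for illustration.

```python
def splice(pre_mrna, introns):
    """Remove introns, given as (start, end) half-open intervals, and join the rest."""
    mature = []
    cursor = 0
    for start, end in sorted(introns):
        # Reject intervals that are empty, out of range, or overlap an earlier one.
        if start < cursor or end > len(pre_mrna) or start >= end:
            raise ValueError(f"invalid or overlapping intron: {(start, end)}")
        mature.append(pre_mrna[cursor:start])
        cursor = end
    mature.append(pre_mrna[cursor:])
    return "".join(mature)
```

Taking the toy transcript `"AAAAgtccagCCCCgtttagGGGG"`, with exons in upper case and introns in lower case, removing both introns with `[(4, 10), (14, 20)]` yields `"AAAACCCCGGGG"`. Treating the whole span `[(4, 20)]` as a single intron skips the middle exon and yields `"AAAAGGGG"`, and removing only `[(4, 10)]` retains the second intron. The same transcript therefore gives rise to different mature mRNAs, which is what alternative splicing refers to.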


This project was funded by NSF and DOE.


Protein function prediction


Protein function prediction is an ongoing area of research in the lab. The difficulty in applying state-of-the-art machine learning methods to this problem is that proteins can have multiple functions, and that the system of keywords used to describe protein function, the Gene Ontology (GO), has a complex hierarchical structure. This provides genome annotators with a rich vocabulary for describing protein function, but it makes standard machine learning approaches a poor fit. We addressed protein function prediction as a hierarchical multi-label classification problem and designed custom algorithms based on the structured SVM, which can model the hierarchical, multi-label structure of this learning problem. The method we developed, GOstruct, has shown state-of-the-art performance on several benchmarks.
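
To show what makes the label space hierarchical, the sketch below closes a set of annotations under a term hierarchy (annotating a protein with a term implies all of that term's ancestors) and scores a prediction against the true labels with an F1 computed on the closed sets. This is not GOstruct's implementation: the function names are hypothetical, the hierarchy is assumed to be given as a dictionary mapping each term to its parents, and the F1 is just one simple hierarchy-aware way to compare label sets.

```python
def ancestors(term, parents):
    """Return all ancestors of a term in a DAG given as {term: [parent, ...]}."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def propagate(labels, parents):
    """Close a label set under the hierarchy: a term implies all its ancestors."""
    closed = set(labels)
    for term in labels:
        closed |= ancestors(term, parents)
    return closed


def hierarchical_f1(true_labels, predicted_labels, parents):
    """F1 between the ancestor-closed true and predicted label sets."""
    t = propagate(true_labels, parents)
    p = propagate(predicted_labels, parents)
    overlap = len(t & p)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(t)
    return 2 * precision * recall / (precision + recall)
```

With a toy hierarchy of made-up term names such as `{"binding": ["root"], "dna_binding": ["binding"]}`, predicting `binding` for a protein whose true annotation is `dna_binding` scores about 0.8 rather than 0, because the prediction is correct but less specific. A flat multi-label classifier that treats each term independently has no notion of this partial credit, and it can also output label sets that violate the hierarchy.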


This project was funded by an NSF grant from the ABI program (award #0965768).