The main focus of research in my lab is on the development of machine learning methods for problems in bioinformatics. Our lab pioneered the use of graph convolutional networks for the analysis of protein structures. We are also applying deep learning techniques for extracting signals from genomic sequences.

Deep learning in genomics

Genes, which contain the information that codes for mRNA and proteins, are constantly turned on and off in response to the needs of the cell and the organism that they are a part of. The molecular switches that control genes are proteins that bind DNA. These DNA-binding proteins (transcription factors) recognize signals in the DNA. The complexity of the genome and of the molecular processes that control its expression is one of the major challenges facing biologists. We are using deep learning techniques to discover these signals and to build models that can help unravel various biological processes (see the discussion of alternative splicing below). Here are some examples of recent work in this area:

We designed a deep learning algorithm that allows for the discovery of interactions between regulatory features by leveraging self-attention:

We used deep learning to understand the process of alternative splicing:

A follow-up paper posits that large scale chromatin models are the foundation models of genomics:

Our earlier work is a comparative study of deep learning architectures for discovering signals in DNA and RNA:
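As a rough illustration of how a convolutional model can pick up a signal in DNA, the sketch below one-hot encodes a sequence and slides a single filter along it, which is the basic operation of a first convolutional layer. It is a minimal example under stated assumptions, not code from any of the papers above: the function names, the example sequence, and the "TATA" filter are all hypothetical, and a trained network would learn its filter weights from data rather than have them set by hand.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases (e.g. N) stay all-zero."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded

def scan(encoded, motif_filter):
    """Slide a (width, 4) filter along the sequence and return one score per window."""
    width = motif_filter.shape[0]
    n_windows = encoded.shape[0] - width + 1
    return np.array([np.sum(encoded[i:i + width] * motif_filter) for i in range(n_windows)])

# Hypothetical hand-set filter that responds to the made-up motif "TATA".
tata_filter = one_hot("TATA")
scores = scan(one_hot("GGTATACC"), tata_filter)
best_position = int(np.argmax(scores))  # start of the best-matching window
```

In a real model many such filters are learned jointly, and later layers combine their outputs.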

Analyzing protein structures with graph neural networks

My lab has been studying protein-protein interactions and other aspects of protein function. Proteins perform their function by interacting with other proteins. Understanding the complex network of interactions between proteins and, at a finer level, determining the interfaces through which those interactions occur are therefore highly important. To address the problem of predicting interfaces, we introduced the concept of graph convolution to the analysis of protein 3D structures.


This was inspired by the success of convolutional networks in computer vision. Their power comes from their ability to learn features at increasing levels of abstraction that are invariant to various transformations of the image (e.g. translation and rotation). We envisioned a similar approach for protein 3D structures. However, instead of representing a protein as a 3D image, we chose to represent its structure as a graph whose nodes represent atoms or amino acids, with connections determined by proximity in the protein structure. This required us to replace the standard form of convolution, which operates over a regular grid, with convolution over a graph structure. Our first results using this technique were published at the NIPS conference:
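The idea of building a proximity graph and convolving over it can be sketched as follows. This is a simplified illustration, not the operator from our publication: the distance cutoff, the averaging over neighbors, the function names, and the random weights are all assumptions chosen to keep the example short.

```python
import numpy as np

def proximity_graph(coords, cutoff):
    """Adjacency matrix linking nodes whose coordinates lie within `cutoff` of each other."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    adjacency = (dists <= cutoff).astype(np.float64)
    np.fill_diagonal(adjacency, 0.0)  # the node itself is handled by a separate weight matrix
    return adjacency

def graph_conv(features, adjacency, w_center, w_neighbor, bias):
    """One layer: combine each node's own features with the mean of its neighbors' features."""
    degree = adjacency.sum(axis=1, keepdims=True)
    neighbor_mean = adjacency @ features / np.maximum(degree, 1.0)  # isolated nodes get zeros
    return np.maximum(features @ w_center + neighbor_mean @ w_neighbor + bias, 0.0)  # ReLU

# Hypothetical toy input: four residues on a line, the last one far from the rest.
rng = np.random.default_rng(0)
coords = np.array([[0.0, 0, 0], [3.0, 0, 0], [6.0, 0, 0], [30.0, 0, 0]])
features = rng.normal(size=(4, 5))   # 5 made-up input features per residue
adjacency = proximity_graph(coords, cutoff=4.0)
hidden = graph_conv(features, adjacency,
                    rng.normal(size=(5, 8)), rng.normal(size=(5, 8)), np.zeros(8))
```

Because the neighbor features are averaged, the result does not depend on the order in which neighbors are listed, which a graph requires and a regular image grid does not.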


We have recently applied this approach to the problem of assessing the quality of predicted protein structures, introducing a novel loss function inspired by the SVM regression epsilon-insensitive loss.
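For reference, the standard epsilon-insensitive loss of SVM regression ignores errors smaller than a tolerance epsilon and penalizes larger errors linearly. The sketch below implements that standard loss only; it is not the novel loss function mentioned above, and the function name, the epsilon value, and the example scores are illustrative assumptions.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Mean of max(0, |y_true - y_pred| - epsilon) over all examples."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(np.mean(np.maximum(residual - epsilon, 0.0)))

# Hypothetical quality scores: only the second prediction is off by more than epsilon.
loss = epsilon_insensitive_loss([0.80, 0.50, 0.20], [0.85, 0.90, 0.20], epsilon=0.1)
```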

This project was funded by a grant from the NSF ABI program (award # 1564840).


Our proof-of-concept for the feasibility of partner-specific prediction of interfaces from 3D structure used SVMs and resulted in a method called PAIRpred:

Earlier work in the area of interface and interaction prediction includes prediction of calmodulin binding sites and genome-wide prediction of interaction networks in yeast and human:

Deep learning tools for basecalling nanopore RNA sequencing data

Oxford Nanopore Technologies sequencing devices are capable of directly sequencing long-read RNA as is, with the potential to detect modified bases without special sample preparation. However, to date no tool is available that would enable easy access to the epitranscriptome.

In a preliminary study towards this goal, we concentrated on improving RNA basecalling accuracy. We designed a novel basecalling architecture that achieves state-of-the-art performance, improving on the accuracy of the commercial basecaller from Oxford Nanopore Technologies. This basecaller, called RODAN, is freely available on GitHub.

We are currently completing work on the detection of post-transcriptional RNA modifications in nanopore sequencing data, with a focus on the detection of methylated adenosines, known as m6A. Our neural network, codenamed Mothra, is the first of its kind, as it is capable of simultaneously basecalling and discerning modifications at read-level resolution. Its architecture is based on the one developed for RODAN and uses attention layers to pinpoint modified bases within a sequencing read. Our work will facilitate research into the detection of m6A while also furthering progress in the detection of other post-transcriptional modifications.
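To give a sense of what an attention layer computes, here is a generic scaled dot-product self-attention sketch. It is not the Mothra architecture: the feature dimensions, the random weights, and all names are assumptions made for illustration. Each position in a read receives a weight distribution over every other position, and it is weights of this kind that let a model focus on specific positions in the signal.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_query, w_key, w_value):
    """Scaled dot-product self-attention over a (positions, features) array."""
    queries, keys, values = x @ w_query, x @ w_key, x @ w_value
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)  # row i: how much position i attends to each position
    return weights @ values, weights

# Hypothetical input: 12 positions along a read, 16 made-up features per position.
rng = np.random.default_rng(0)
signal_features = rng.normal(size=(12, 16))
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
context, weights = self_attention(signal_features, w_q, w_k, w_v)
```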

This project was funded by an NSF EAGER grant (award # 1949036).

Alternative splicing

Splicing is the process whereby parts of a gene called introns are removed, and the RNA is spliced back together to form the mature mRNA. A given gene can be spliced in multiple ways, a phenomenon called alternative splicing. Whereas it is well studied in animals, alternative splicing in plants is not as well understood, and the differences in genome architecture between plants and animals lead to differences in alternative splicing. We are working on this in collaboration with A.S.N. Reddy of the Biology Department. Our approach is to computationally search for genomic features that are predictive of alternative splicing, namely elements that serve as splicing enhancers and suppressors, and to test their biological relevance to the process.


The above figure shows a model created by our SpliceGrapher tool.


This project was funded by NSF and DOE.


Protein function prediction


Protein function prediction is an ongoing area of research in the lab. The difficulty in applying state-of-the-art machine learning methods to this problem is that proteins can have multiple functions, and that the system of keywords used to describe protein function, the Gene Ontology (GO), has a complex hierarchical structure. This provides genome annotators with a rich vocabulary with which to describe protein function, but it makes standard machine learning approaches sub-optimal. We addressed protein function prediction as a hierarchical multi-label classification problem and designed custom algorithms based on the so-called structured SVM, which is able to fully model the complexity of this learning problem. The method we developed, GOstruct, has shown state-of-the-art performance on several benchmarks.
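The sketch below shows why the hierarchy matters for multi-label prediction: an annotation with a term implies all of its ancestors, so a prediction that lands on a parent of the true term should cost less than one in an unrelated branch. The hierarchy here is a made-up toy, not the real Gene Ontology, and the F1-based loss is an illustrative choice of hierarchy-aware loss rather than a description of GOstruct itself.

```python
# Toy hierarchy (hypothetical, not the real Gene Ontology): term -> parent terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "protein_binding": ["binding"],
    "catalysis": ["root"],
    "kinase": ["catalysis"],
    "protein_kinase": ["kinase", "protein_binding"],  # a term may have several parents
}

def with_ancestors(terms):
    """Expand a set of terms so that every ancestor of an annotated term is included."""
    closed, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed

def hierarchical_f1_loss(true_terms, predicted_terms):
    """1 - F1 computed on ancestor-expanded label sets, so near misses cost less."""
    truth, predicted = with_ancestors(true_terms), with_ancestors(predicted_terms)
    overlap = len(truth & predicted)
    if overlap == 0:
        return 1.0
    precision, recall = overlap / len(predicted), overlap / len(truth)
    return 1.0 - 2 * precision * recall / (precision + recall)

near_miss = hierarchical_f1_loss({"protein_binding"}, {"binding"})  # parent of the true term
far_miss = hierarchical_f1_loss({"protein_binding"}, {"kinase"})    # different branch
```

A flat classifier that treats each term independently would score both mistakes the same; a structured method can use a loss like this to prefer predictions that are consistent with the hierarchy.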


This project was funded by an NSF grant from the ABI program (award #0965768).