Research in my lab focuses on developing machine learning methods for problems in bioinformatics. Our specialty is the creative design of kernel methods for problems ranging from prediction of protein function and protein interactions to prediction of alternative splicing.

Alternative splicing in plants

Splicing is the process whereby parts of a gene called introns are removed from the transcribed RNA, and the remaining segments are joined to form the mature mRNA. A given gene can be spliced in multiple ways, a phenomenon called alternative splicing. Whereas it is well studied in animals, alternative splicing in plants is not as well understood, and the differences in genome architecture between plants and animals lead to differences in alternative splicing. We are working on this in collaboration with A.S.N. Reddy of the Biology Department. Our approach is to computationally search for genomic features that are predictive of alternative splicing, namely elements that serve as splicing enhancers and suppressors, and to test their biological relevance to the process.
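As an illustration of what a computational search for predictive sequence features can start from, the sketch below counts short subsequences (k-mers) in a DNA sequence and compares two sequences by the dot product of their counts, which is the basic idea behind a spectrum kernel. This is a minimal sketch and not necessarily the representation used in our work; the function names, the choice of k, and the example sequences are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_spectrum(seq, k=3, alphabet="ACGT"):
    """Return counts of every possible k-mer over `alphabet`, in a fixed order.

    Illustrative helper (hypothetical name): k-mers containing characters
    outside the alphabet, such as N, are simply ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get("".join(p), 0) for p in product(alphabet, repeat=k)]


def spectrum_kernel(a, b, k=3):
    """Dot product of the k-mer count vectors of two sequences."""
    fa = kmer_spectrum(a, k)
    fb = kmer_spectrum(b, k)
    return sum(x * y for x, y in zip(fa, fb))


if __name__ == "__main__":
    # Made-up sequences standing in for regions flanking a splice site.
    region_a = "GTAAGTTTCTCTCTCAG"
    region_b = "GTATGTTTCTTTCTCAG"
    print(spectrum_kernel(region_a, region_b, k=3))
```

K-mers that receive large weights in a classifier trained on such features are candidates for enhancer or suppressor elements, which would then need to be tested experimentally.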

A second avenue we are pursuing is to leverage next-generation sequencing data to predict alternative splicing events and improve genome annotation. The noisy nature of these data makes this a challenging task.


The above figure shows a model created by our SpliceGrapher tool.


This project is funded by NSF and DOE.

Protein-protein interactions and their binding sites

Proteins perform their functions by interacting with other proteins. Therefore, understanding the complex network of interactions among an organism’s proteins is important for understanding their roles. My work in this area includes genome-wide prediction of interaction networks in yeast and human, and more recently prediction at a finer scale in order to determine the interfaces at which proteins interact.


Some of my older work in the area:

Protein function prediction


Although the problem has been studied for over twenty years, the standard method for protein function prediction remains annotation transfer, in which a protein is assigned the annotations of similar proteins that have already been annotated. The difficulty in applying state-of-the-art machine learning methods is that proteins can have multiple functions, and that the system of keywords used to describe protein function, the Gene Ontology (GO), has a complex hierarchical structure. This provides genome annotators with a rich vocabulary with which to describe protein function, but makes standard classification approaches suboptimal. Therefore, there are significant opportunities to develop new classification methods that treat function prediction as a hierarchical classification problem.
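To make the hierarchical, multi-label nature of the problem concrete, the sketch below represents a toy ontology as a map from each term to its parents, expands a set of annotations to include all ancestor terms, and scores a prediction with a loss based on the overlap of the expanded sets (one minus a hierarchical F1). The term names and the toy hierarchy are made up for illustration and are not real GO terms, and the particular loss is an assumption chosen for simplicity rather than a description of any specific published method.

```python
def ancestor_closure(terms, parents):
    """Return `terms` together with all of their ancestors in the hierarchy.

    `parents` maps a term to an iterable of its direct parents; a term with
    no entry is treated as a root. A term may have several parents.
    """
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def hierarchical_f1_loss(true_terms, predicted_terms, parents):
    """One minus the F1 score computed on ancestor-expanded label sets."""
    y = ancestor_closure(true_terms, parents)
    y_hat = ancestor_closure(predicted_terms, parents)
    if not y and not y_hat:
        return 0.0  # nothing to predict and nothing predicted
    overlap = len(y & y_hat)
    if overlap == 0:
        return 1.0
    precision = overlap / len(y_hat)
    recall = overlap / len(y)
    return 1.0 - 2.0 * precision * recall / (precision + recall)


# Toy hierarchy with hypothetical term names (not real GO identifiers).
TOY_PARENTS = {
    "binding": ["root"],
    "catalytic": ["root"],
    "kinase": ["catalytic"],
    "atp_binding": ["binding"],
}

if __name__ == "__main__":
    truth = {"kinase"}
    for guess in ({"kinase"}, {"catalytic"}, {"atp_binding"}):
        print(sorted(guess), hierarchical_f1_loss(truth, guess, TOY_PARENTS))
```

Under this loss, predicting a parent of the correct term is penalized less than predicting a term in an unrelated branch, which a flat per-label error would not capture.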

Our approach uses the so-called structured SVM, which is able to fully model the complexity of this learning problem. The method we developed, GOstruct, has shown state-of-the-art performance on several benchmarks.


This project was funded by NSF grant ABI 0965768/0965616.