The main focus of research in my lab is the development of machine learning methods for problems in bioinformatics. Our specialty is the creative design of kernel methods for problems ranging from prediction of protein function and protein interactions to prediction of alternative splicing.
Although protein function prediction has been studied for over twenty years, the standard method remains annotation transfer. State-of-the-art machine learning methods are difficult to apply because proteins can have multiple functions and because the system of keywords used to describe protein function, the Gene Ontology (GO), has a complex hierarchical structure. This structure gives genome annotators a rich vocabulary for describing protein function, but it makes standard classification approaches sub-optimal. There are therefore significant opportunities to develop new classification methods that treat function prediction as a hierarchical classification problem.
We are addressing this problem with a recent development in machine learning: kernel methods for structured output spaces. Our recent work in this area, the GOstruct method, is showing great promise.
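To make the hierarchical view concrete, here is a minimal Python sketch of how a set of predicted terms can be compared with the true annotations once both are expanded to include their ancestors in the hierarchy. It is an illustration only and not the GOstruct implementation: the term names and the `PARENTS` table are hypothetical toy data rather than real GO identifiers, and the F1-style score is one common choice assumed here for the example.

```python
# Toy hierarchy (hypothetical term names, not real GO identifiers).
# Each term maps to its direct parents; a term may have several parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "protein_binding": ["binding"],
    "kinase_activity": ["catalysis"],
    "kinase_binding": ["protein_binding"],
}


def with_ancestors(terms):
    """Return the given terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed


def hierarchical_f1(true_terms, predicted_terms):
    """F1 score computed on ancestor-expanded term sets.

    A prediction that is too general (an ancestor of the true term)
    receives partial credit instead of counting as a complete miss.
    """
    true_closed = with_ancestors(true_terms)
    pred_closed = with_ancestors(predicted_terms)
    overlap = len(true_closed & pred_closed)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_closed)
    recall = overlap / len(true_closed)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Predicting the parent of the true term scores well, but below 1.
    print(hierarchical_f1({"kinase_binding"}, {"protein_binding"}))
    # Predicting a term in a different branch shares only the root.
    print(hierarchical_f1({"kinase_binding"}, {"kinase_activity"}))
```

A flat classifier would treat both predictions above as equally wrong; scoring on the expanded sets is one simple way to reward predictions that are close in the hierarchy.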
This project is funded by NSF grant ABI 0965768/0965616.
Splicing is the process whereby parts of a gene called introns are removed and the remaining RNA is joined back together to form the mature mRNA. A given gene can be spliced in multiple ways, a phenomenon called alternative splicing. Whereas it has been well studied in animals, alternative splicing in plants is not as well understood, and the differences in genome architecture between plants and animals lead to differences in alternative splicing. We are working on this in collaboration with A.S.N. Reddy of the Biology Department. Our approach is to search computationally for genomic features that are predictive of alternative splicing, namely elements that serve as splicing enhancers and suppressors, and then to test their biological relevance to the process.
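As a small illustration of the definitions above, the following Python sketch removes introns from a toy pre-mRNA string and shows how skipping an exon yields a second transcript from the same gene. The sequences, the coordinates, and the `splice` helper are made up for this example; it does not reflect how our tools model splicing.

```python
def splice(pre_mrna, introns):
    """Remove introns and join the remaining exons.

    `introns` is a list of (start, end) half-open intervals on `pre_mrna`.
    The intervals are assumed not to overlap.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons)


# Toy gene: three exons separated by two introns (made-up sequences).
EXON1, EXON2, EXON3 = "AUGGCC", "UUUCCC", "GGGUAA"
INTRON1, INTRON2 = "GUAAGUAG", "GUCCCAG"
PRE_MRNA = EXON1 + INTRON1 + EXON2 + INTRON2 + EXON3

# Intron coordinates derived from the segment lengths.
intron1 = (len(EXON1), len(EXON1) + len(INTRON1))
intron2 = (intron1[1] + len(EXON2), intron1[1] + len(EXON2) + len(INTRON2))

# Removing both introns keeps all three exons.
full_transcript = splice(PRE_MRNA, [intron1, intron2])

# Exon skipping: everything from the start of intron 1 to the end of
# intron 2 is removed, so exon 2 is left out of the mature mRNA.
skipped_transcript = splice(PRE_MRNA, [(intron1[0], intron2[1])])

if __name__ == "__main__":
    print(full_transcript)
    print(skipped_transcript)
```

Exon skipping is only one kind of alternative splicing event; it is used here because it is the simplest to express as a change in which intervals are removed.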
A second avenue we are pursuing is to leverage next-generation sequencing data to predict alternative splicing events and improve genome annotation. The noisy nature of these data makes this a challenging task.
The above figure shows a model created by our SpliceGrapher tool.
This project is funded by NSF grant DBI 0743097.
Proteins perform their function by interacting with other proteins. Understanding the complex network of interactions between an organism’s proteins is therefore important for understanding the role each protein plays. Even with the advent of high-throughput experimental methods for elucidating interactions, the interaction networks of even well-studied model organisms are only sparsely known. My work in this area includes genome-wide prediction of interaction networks in yeast and human. More recently, my lab has been focusing on the interactions of specific proteins such as calmodulin, which is highly conserved in all eukaryotes and interacts with a large number of proteins in each organism. This targeted approach allows us to tailor our predictors to the known properties of the protein in question.
This research is carried out by Fayyaz Afsar and Michael Hamilton, in collaboration with A.S.N. Reddy’s lab in the Biology Department at CSU.
Some of my older work in the area: