My research interests are broadly in Big Data for the sciences, with an emphasis on predictive analytics, storage, retrieval, scientific data management, and metadata.



Galileo is a distributed geospatial data storage system for efficient access to time-series geospatial datasets. Visit Galileo project page
[Effort with the Distributed Systems Group]


Glean focuses on performing analytics at scale over Big Data.
Visit Glean project page
[Effort with the Distributed Systems Group]


Mendel is a genomic data storage system that supports fast and efficient search for genomic sequence alignment.

GeoLens
GeoLens is an engine for real-time, interactive visual analytics over large datasets. Visit GeoLens project page



The Atmospheric Data Discovery Network Service (ADDS) enables discovery of observational, binary datasets managed at multiple data hosting services. Datasets are packaged and published by organizations at regular time intervals. ADDS parses binary datasets to generate metadata, which is used to allow random access to specific portions of the published data. ADDS provides programmable query interfaces to automate discovery mechanisms. It supports BUFR, the binary format that is the World Meteorological Organization’s standard for observational data, and netCDF, a format often used to encode the outputs of simulation models. This research is based on a collaborative effort with CIRA at Colorado State University and UCAR (University Corporation for Atmospheric Research) in Boulder.
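
The idea of using generated metadata for random access can be illustrated with a minimal Python sketch. It is not the ADDS implementation and does not parse BUFR or netCDF; it assumes a hypothetical toy format in which each record has a 4-byte station identifier and a 4-byte payload length followed by the payload. The names `HEADER`, `build_index`, `read_record`, and `make_sample`, and the sample station identifiers, are all illustrative assumptions.

```python
import io
import struct

# Hypothetical record header (an assumption, not BUFR or netCDF):
# 4-byte ASCII station id followed by a big-endian 4-byte payload length.
HEADER = struct.Struct(">4sI")


def build_index(stream):
    """Scan the stream once and record where each record's payload lives."""
    index = []
    while True:
        header_start = stream.tell()
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of data (or a truncated trailing header)
        station, length = HEADER.unpack(header)
        index.append({
            "station": station.decode("ascii"),
            "offset": header_start + HEADER.size,  # first byte of the payload
            "length": length,
        })
        stream.seek(length, io.SEEK_CUR)  # skip the payload without reading it
    return index


def read_record(stream, entry):
    """Jump straight to one payload using its index entry."""
    stream.seek(entry["offset"])
    return stream.read(entry["length"])


def make_sample():
    """Build a small in-memory dataset with made-up records."""
    records = [
        (b"KDEN", b"temp=271.3"),
        (b"KBOU", b"temp=269.8"),
        (b"KFNL", b"temp=270.1"),
    ]
    return b"".join(HEADER.pack(s, len(p)) + p for s, p in records)


if __name__ == "__main__":
    data = io.BytesIO(make_sample())
    idx = build_index(data)
    wanted = next(e for e in idx if e["station"] == "KBOU")
    print(read_record(data, wanted))
```

Once the index exists, a query for a single station touches only that record's bytes rather than rescanning the whole file, which is the property the metadata is meant to provide.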

Swarm is a meta-scheduling framework designed to alleviate inefficiencies in the batch queues normally used in high-throughput computing systems that are part of Grid environments. Swarm interoperated with both the Condor and Globus frameworks commonly found in these environments, and could manage executions of millions of jobs with uneven workloads. Swarm was successfully used for managing large-scale genome sequencing tasks, some of which had very long running times. Based on the expected workloads, jobs were scheduled either on smaller, local clusters or on highly parallelized Grid clusters.
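
Workload-based placement of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Swarm's actual policy: the `Job` class, the `route_jobs` function, and the one-hour threshold are hypothetical, and the sketch does not interact with Condor or Globus.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    expected_hours: float  # assumed runtime estimate supplied with the job


def route_jobs(jobs, threshold_hours=1.0):
    """Send long jobs to the Grid and short ones to a local cluster.

    The fixed threshold is an assumption made for illustration only.
    """
    placement = {"local": [], "grid": []}
    # Consider the longest jobs first so they are queued earliest.
    for job in sorted(jobs, key=lambda j: j.expected_hours, reverse=True):
        target = "grid" if job.expected_hours >= threshold_hours else "local"
        placement[target].append(job.name)
    return placement


if __name__ == "__main__":
    jobs = [Job("align-a", 0.2), Job("assemble-b", 12.0), Job("align-c", 3.0)]
    print(route_jobs(jobs))
```

A real meta-scheduler would also weigh queue wait times and resource availability; the sketch shows only the routing decision based on expected workload.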

LEAD (Linked Environments for Atmospheric Discovery)
The NSF-funded LEAD project for tornado prediction makes research resources such as atmospheric data from observational devices, forecasting models, and analyses available to researchers and students. As part of this effort, I designed the MyLead data cataloging system, which provides programmable data cataloging features for the input, output, and intermediate data required during execution of large scientific workflows for tornado prediction. Since these workflows span multiple organizations, data accesses often cross administrative boundaries and trust issues need to be resolved. I devised the TrustCell model, which established end-to-end trust relationships prior to data accesses. The system relied on hierarchical trust relationships constructed from local and global trust associations to provide a measure of trustworthiness associated with data accesses.

The Carousel project focused on developing an environment for supporting ubiquitous access to real-time collaborative applications in Grid settings. Supported devices included portable devices, such as 3G smartphones and 802.11b-equipped PDAs, as well as conventional desktop PCs. As part of this project, I designed a data pipelining architecture that was formally verified using Petri nets. I also developed a protocol for reliable communications between these pervasive devices in wireless settings.