Shrideep Pallickara [Research Themes]




 

Data riches getting you to frown?
    Mine them right, to turn that upside down


Managing Voluminous Spatiotemporal Data: Reap What You Mine

Our Galileo system can store and perform data mining over a trillion files. Each file we consider has thousands of observations, and each observation encapsulates a vector of 50-100 features. We have incorporated support for over 20 data formats, including netCDF, HDF, XML, CSV, GRIB, BUFR, DMSP, NEXRAD, and SIGMET. Our benchmarks have been performed over Petascale datasets.

We support an expressive query syntax that supplements traditional SQL queries with a richer syntax that is aligned with the characteristics of spatiotemporal data. This includes support for:

  • Radial, proximity, and geometry constrained queries with proximity-based relevance ranking
  • Fuzzy queries that dynamically and incrementally relax query constraints
  • Approximate queries that relax accuracy requirements in favor of timeliness, and probabilistic queries
  • Analytic queries for hypothesis testing, significance evaluations, and kernel density estimations
  • Point queries or continuous queries, i.e., queries evaluated periodically as data is ingested
  • A scale-out architecture that enables the incremental assimilation of nodes in the system
  • Support for a tunable replication framework
  • Support for journaling 
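
As a rough illustration of two of the query types listed above (a radial query with proximity-based ranking, and a fuzzy query that incrementally relaxes its constraint), here is a minimal Python sketch. It is not Galileo's query engine or syntax: the function names, the dictionary layout of an observation, and the radius-doubling relaxation policy are all assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def radial_query(observations, center, radius_km):
    """Return (distance, observation) pairs within radius_km, nearest first."""
    lat0, lon0 = center
    hits = []
    for obs in observations:
        d = haversine_km(lat0, lon0, obs["lat"], obs["lon"])
        if d <= radius_km:
            hits.append((d, obs))
    hits.sort(key=lambda pair: pair[0])  # proximity-based relevance ranking
    return hits


def fuzzy_radial_query(observations, center, radius_km, min_results,
                       growth=2.0, max_radius_km=1000.0):
    """Hypothetical relaxation policy: widen the radius until enough hits exist."""
    r = radius_km
    while True:
        hits = radial_query(observations, center, r)
        if len(hits) >= min_results or r >= max_radius_km:
            return r, hits
        r = min(r * growth, max_radius_km)


if __name__ == "__main__":
    obs = [
        {"id": "a", "lat": 0.0, "lon": 0.1},
        {"id": "b", "lat": 0.0, "lon": 1.0},
        {"id": "c", "lat": 0.0, "lon": 5.0},
    ]
    radius, hits = fuzzy_radial_query(obs, (0.0, 0.0), 50.0, min_results=2)
    print(radius, [o["id"] for _, o in hits])
```

A production system would evaluate such queries against an index rather than scanning every observation; the linear scan here only keeps the example short.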


 


A sketch in time saves space
     Allowing discovery to cut to the chase


Sketching: Draw on Understanding the Data
My recent research has focused on three of Big Data’s qualitative metrics: volume, velocity, and variety. A key challenge here is that subsequent analysis (be it for exploration, visualization, or analytics) is I/O bound, meaning that most of the time is spent reading from or writing to disk or the network. Why is this bad? I/O is like having to walk around with a 500-pound backpack - imagine walking around, swimming, or hiking with that!

A collection of I/O-bound processes is subject to the I/O dilemma: the more that processes try to do, the less that gets done. The I/O, both network and disk, that we perform occurs in shared settings. Network links over which data is transferred and the disks on which the data are stored are shared across users and applications. Sharing results in contention, i.e., concurrent accesses to the I/O subsystem interfere with each other. This contention in the I/O subsystem further increases completion times for I/O-bound tasks, but worse, causes the overall throughput (the amount of I/O performed per unit of time) to plummet as well. Delays inhibit exploration, which, in turn, precludes insights.

Can't I sample or compress? Two popular strategies to cope with data volumes, binary compression and sampling, are not suited for voluminous data settings. Binary compression, which tends to be iterative and computationally expensive, cannot keep pace with data arrivals. Sampling reduces data volumes by discarding observations - the greater the volumes, the greater the discard. Sampling also adversely impacts fidelity and representativeness in the multifeatured observational space. Missing subtle, critical variations (not known a priori) in the feature space or occurrence frequencies is inevitable - the baby is likely to be thrown out with the bathwater.
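
To put a rough number on that risk, the following Python sketch computes how likely it is that a rare pattern disappears entirely from a sample. It assumes uniform, independent sampling, which is an assumption for illustration only; the function name `miss_probability` is hypothetical.

```python
def miss_probability(sample_rate: float, occurrences: int) -> float:
    """Probability that all `occurrences` of a rare pattern are discarded
    when each observation is kept independently with probability `sample_rate`."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be in [0, 1]")
    return (1.0 - sample_rate) ** occurrences


if __name__ == "__main__":
    # Keeping 1 in 1000 observations, a pattern that occurs 100 times
    # is lost entirely about 90% of the time.
    print(round(miss_probability(0.001, 100), 3))  # 0.905
```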

The Long and Short of Synopsis: We have designed an innovative sketching algorithm, Synopsis, specifically for spatiotemporal data. We are able to achieve 1000-fold compaction rates while preserving statistical representativeness at diverse spatiotemporal scopes.

The Synopsis algorithm is the current state-of-the-art for sketching spatiotemporal data and both ESRI (creators of ArcGIS) and Google Earth Engine are interested in incorporating support for these sketches as first-class data types in their software. The sketching algorithm makes a single pass on all records within a dataset and incrementally updates the sketch. Once constructed, it is the sketch, not the dataset, that is consulted for queries and analytic operations.

The Synopsis sketch provides a 1000-fold compaction rate and supports several operations that operate directly on the sketch. The sketch can be queried using general-purpose (e.g., relational) queries. The sketch can be combined with other sketches, or portions thereof, to build custom information spaces. A multiplicity of sketches from diverse domains (e.g., ecology, hydrology) and varying spatiotemporal scopes can be combined to create a federated information space. The sketch can also be used to construct statistically representative datasets.
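
The properties described here (a single pass over the records, incremental updates, querying the summary instead of the dataset, and combining sketches) can be illustrated with a toy summary structure. The Python sketch below is not the Synopsis algorithm: it swaps in per-cell running statistics (Welford's online update, plus the standard pairwise merge for means and variances), and the class names, the fixed latitude/longitude grid, and the hourly time window are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count, mean, and sum of squared deviations for one feature."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's single-pass update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> None:
        # Pairwise combination of two partial summaries.
        if other.n == 0:
            return
        if self.n == 0:
            self.n, self.mean, self.m2 = other.n, other.mean, other.m2
            return
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


class ToySketch:
    """Hypothetical summary keyed by (lat cell, lon cell, time window)."""

    def __init__(self, cell_deg: float = 1.0, window_s: int = 3600):
        self.cell_deg = cell_deg
        self.window_s = window_s
        self.bins = {}

    def _key(self, lat, lon, ts):
        return (math.floor(lat / self.cell_deg),
                math.floor(lon / self.cell_deg),
                int(ts // self.window_s))

    def ingest(self, lat, lon, ts, features):
        # Each record is seen once and then discarded.
        cell = self.bins.setdefault(self._key(lat, lon, ts), {})
        for name, value in features.items():
            cell.setdefault(name, RunningStats()).update(value)

    def merge(self, other: "ToySketch") -> None:
        # Assumes both sketches use the same cell size and window length.
        for key, cell in other.bins.items():
            mine = self.bins.setdefault(key, {})
            for name, stats in cell.items():
                mine.setdefault(name, RunningStats()).merge(stats)

    def query(self, lat, lon, ts, feature):
        return self.bins.get(self._key(lat, lon, ts), {}).get(feature)


if __name__ == "__main__":
    a, b = ToySketch(), ToySketch()
    for t in (1.0, 2.0):
        a.ingest(40.5, -105.1, 10, {"temp": t})
    for t in (3.0, 4.0):
        b.ingest(40.6, -105.2, 20, {"temp": t})
    a.merge(b)
    stats = a.query(40.5, -105.1, 0, "temp")
    print(stats.n, stats.mean, stats.variance)
```

Merging two sketches gives the same count, mean, and variance as ingesting all records into one sketch, which is what makes this kind of summary composable across datasets.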

What's new? The Gutenberg suite of refinements to Synopsis supports two key features: versioning and preferential memory residency. Support for versioning in Synopsis allows concurrent ingestion and generation of tree paths, while ensuring that each set of updates has a monotonically increasing version number associated with it. A key advantage of versioning is reproducibility. Query evaluations will retrieve only portions of the sketch that satisfy the versioning criteria. Preferential memory residency allows portions of the sketch to be selectively held in memory; in extreme cases, almost the entire sketch may be disk resident. Given that disk capacities are 3 orders of magnitude greater than memory capacities, on-disk storage and preferential memory residency of sketches substantially reduce infrastructure requirements.
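
The versioning idea (each set of updates receives a monotonically increasing version number, and a query sees only the updates that satisfy its version criterion) can be sketched in a few lines of Python. This is a minimal illustration and not the Gutenberg implementation: the class name `VersionedSketch`, the flat update log, and the per-key sum used as the "query" are all hypothetical.

```python
import threading


class VersionedSketch:
    """Toy append-only log in which every batch of updates gets one version."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_version = 1
        self._updates = []  # (version, key, value), versions in ascending order

    def apply_batch(self, batch):
        """Record a set of (key, value) updates under a single new version."""
        with self._lock:
            version = self._next_version
            self._next_version += 1
            self._updates.extend((version, k, v) for k, v in batch)
        return version

    def snapshot(self, max_version):
        """Per-key totals using only updates with version <= max_version."""
        totals = {}
        with self._lock:
            for version, key, value in self._updates:
                if version > max_version:
                    break  # the log is ordered by version
                totals[key] = totals.get(key, 0) + value
        return totals


if __name__ == "__main__":
    sketch = VersionedSketch()
    v1 = sketch.apply_batch([("a", 1), ("b", 2)])
    v2 = sketch.apply_batch([("a", 10)])
    print(sketch.snapshot(v1), sketch.snapshot(v2))
```

Re-running a query with the same `max_version` returns the same answer even while newer batches are being ingested, which is the reproducibility property noted above.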

 


If processing lags the rate of arrival?
     Imperil, you will, your process’ survival


Orchestration: Swinging a Stick in the Air Does Not a Symphony Make

I have explored issues in the distributed, autonomous orchestration of computational workloads in shared clusters. Our target workloads include stream processing and stochastic discrete event simulations. The nature of these workloads poses unique challenges: computations are irregular, workloads are highly variable, and optimal scheduling is intractable. The clusters are shared across applications and users.

Data streams occur naturally in several observational settings and often need to be processed with a low latency. Streams (1) may have no preset lifetimes or data production rates; (2) can be produced periodically, intermittently, continuously, or some combination of these; and (3) can be voluminous. As we increasingly rely on sensors for decision making, challenges arise in ensuring that streams are processed in a timely fashion despite the data volumes and heterogeneity of the resources involved in stream processing.

Processing these data streams at scale over a limited number of resources is challenging. Processing a single stream is easy; so is processing multiple streams if there is an unlimited number of available resources. Each stream packet generated by the sensors does not involve a lot of processing (100s of microseconds), but there could be a very large number of these packets with the streams being generated stochastically. In such settings, the number of available resources is two orders of magnitude less than the number of computations. Orchestrating stream processing computations is an instance of the resource-constrained scheduling problem – given a set of tasks, a set of resources, and a performance measure, the objective is to assign tasks to resources such that performance is maximized. Depending on the precise formulation of the problem, some have characterized this as NP-Complete and some as NP-Hard.

At the core of our online scheduling algorithm is a data structure called prediction rings that encapsulates a set of footprint vectors [1]. Each element in the vector represents the expected resource utilization of a stream computation for a particular time granularity. Resource utilizations in the near future are captured at a fine-grained level while the resource utilizations further into the future are captured at a coarse-grained level. Prediction rings are populated based on expected stream arrival rates that are forecast using time-series analysis. We use prediction rings to compute normalized interference scores that quantify the interference (performance penalty) that each computation will be subject to.
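
The idea of keeping fine-grained forecasts for the near future and coarse-grained forecasts further out can be sketched as follows. This Python example is an illustration under stated assumptions, not the prediction-ring data structure or the scoring function from [1]: the slot widths, the averaging rule, and the particular definition of the interference score (the average fraction of machine capacity already claimed by resident computations) are all hypothetical.

```python
def footprint_vector(per_second_forecast, slot_widths=(1, 1, 2, 4, 8)):
    """Collapse a per-second utilization forecast into slots whose widths
    grow with distance into the future (hypothetical widths)."""
    vector, start = [], 0
    for width in slot_widths:
        window = per_second_forecast[start:start + width]
        if not window:
            break
        vector.append(sum(window) / len(window))
        start += width
    return vector


def interference_score(resident_vectors, capacity=1.0):
    """Hypothetical normalized score in [0, 1]: mean over slots of the
    capacity fraction used by computations already on the machine.
    Assumes all footprint vectors have the same length."""
    if not resident_vectors:
        return 0.0
    per_slot = [min(1.0, sum(slot) / capacity) for slot in zip(*resident_vectors)]
    return sum(per_slot) / len(per_slot)


def least_interfering(machines):
    """Pick the machine name whose residents yield the lowest score."""
    return min(machines, key=lambda name: interference_score(machines[name]))


if __name__ == "__main__":
    forecast = [0.1] * 2 + [0.2] * 2 + [0.4] * 4 + [0.8] * 8
    print(footprint_vector(forecast))
    machines = {
        "m1": [[0.5, 0.5], [0.4, 0.6]],
        "m2": [[0.1, 0.2]],
    }
    print(least_interfering(machines))
```

In this toy version, a migration decision simply selects the machine with the lowest score; it omits the safeguards against oscillation and the handling of stateful computations that a real scheduler needs.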

Our distributed algorithm for stream scheduling focuses on reducing interference between stream processing computations despite variability in processing workloads and system conditions. This is achieved through a series of proactive, continuous, and incremental scheduling decisions where computations are migrated to machines with less interference. These targeted migrations reduce resource utilization imbalances within the cluster while alleviating performance hotspots, both of which improve performance. Our migration algorithm handles both stateful and stateless computations while ensuring the correctness of stream processing. We include mechanisms to counteract oscillations and frequent, inefficient migrations. Our distributed algorithm achieves high throughput at near real time, and is currently the state-of-the-art for single-stage stream processing.



 


A surfeit of alternatives
    Conspire to muddle decisions
Let consequences inform,
     and rank, choices
To leave this labyrinth unscathed

Decision Support Tools: Inform What You Do With All That You Know
We have developed real-time decision support tools (Cadence, Sonata, Symphony) for spatiotemporal phenomena that build richer, layered insights into phenomena. Supporting this capability involves several innovations:

  • Leveraging a mix of fine-grained and coarse-grained spatially explicit models that are temporally coupled to understand complex spatiotemporal phenomena
  • Effective orchestration of model fitting operations based on ensemble methods, deep learning, Bayesian classifiers, logistic regression, network models, etc.
  • Sensitivity analyses of large parameter spaces involving ~2500 dimensions - individual parameters we consider can be numeric, Boolean, or one of 28 different types of probability density functions
  • Adaptability to concept drifts as phenomena evolve over geographical extents and chronological time ranges, and
  • High-throughput, real-time anomaly detection over continually arriving spatiotemporal data streams with anomalies detected over a multidimensional observational space while accounting for concept drifts and feature space evolutions.

These tools are used to inform planning for disease incursions. Cadence was demoed to congressional staffers during DHS Day @ Capitol Hill. We also had the opportunity to demonstrate the Decision Support Tool to the Under Secretary of the DHS, Dr. Reginald Brothers.



 


If you anticipate,
    you can precompute
Interactivity becomes viable
    and explorations beckon
If you render just in time,
       and only what you must


Visualization of Petascale Datasets: A Picture's Worth a Lot More if You Render it Faster

Our tools (GeoLens, Sonata, and Aperture) facilitate concurrent rendering of millions of observations, each of which may encapsulate a vector of features. To ensure interactivity, a novel feature that we support is anticipatory server-side creation of visual artifacts based on a user’s navigational patterns, which preserves responsiveness and the user’s chain of thought. We also make innovative use of metadata, server-side processing, and cache management; exploit spatial and temporal continuity of data access patterns; and account for perceptual limits when rendering content.






 


Be it a Pi at the edges
    Or a node in the clouds
Apportion your workloads right
    So that they may take flight


Tools of The Trade: From the Center Out to the Edges Like Ripples in a Pond

The proliferation of clouds and resource-constrained devices, such as the Raspberry Pi, offers unique opportunities. Clouds offer the promise of elastic scaling for workloads while edge devices are at the forefront for processing data harvested in continuous sensing environments.

Into the Clouds: To Warmer Climes, Faster
Scaling applications to Infrastructure-as-a-Service (IaaS) clouds to achieve efficient system throughput is complex and involves much more than simply increasing the number of allotted virtual machines (VMs). This involves determining the impact of provisioning variation and the composition/dispersion of components comprising an application over a distributed collection of machines. Using brute force performance testing to determine optimal placements is not feasible. The total number of possible component compositions for an application is given by the Bell number, which grows faster than exponentially with the number of components. My research has involved development of performance models that allow a designer to efficiently arrive at a set of component placements that are likely to have good performance. We used multivariate adaptive regression splines to build performance models using application profiling data to predict performance of component deployments.
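
To see how quickly the space of component compositions grows, the short Python sketch below computes Bell numbers with the standard Bell triangle recurrence. The Bell number B_n counts the ways to partition n items into non-empty groups, which corresponds here to grouping n application components onto VMs; the function name is an assumption for this example.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max] using the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry to its left and the one above-left.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


if __name__ == "__main__":
    for n, b in enumerate(bell_numbers(10)):
        print(n, b)
```

With only 10 components there are already 115975 possible groupings, which is why exhaustive performance testing of placements is impractical.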

More Isn’t Necessarily Better: Doing More with Less
Public clouds offer an increasing array of virtual machine types with qualitatively defined CPU, disk, and network I/O capabilities. Determining cost-effective application deployments requires selecting both the quantity and type of VM resources for hosting Service Oriented Architecture (SOA) workloads of interest. Hosting decisions must utilize sufficient infrastructure to meet service level objectives and cope with service demand. To address these challenges, our methodology identifies infrastructure configuration alternatives that provide equivalent performance, allowing the most economical infrastructure to be chosen. Our methodology supports characterization of workload requirements and prediction of the number of VMs of a given type necessary to host workloads, while ensuring that equivalent performance is achieved. Given these alternatives, the most economical can be chosen for SOA hosting. Infrastructure costs can be calculated based on fixed or spot market prices.
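
The final selection step (given alternatives with equivalent performance, choose the cheapest) can be sketched in Python as below. This is a simplified illustration, not the methodology itself: it replaces the workload characterization and VM-count prediction with a plain throughput division, and the VM type names, per-VM throughputs, and hourly prices are made-up values for the example.

```python
import math


def vms_needed(demand_rps, per_vm_rps):
    """Stand-in for a performance model: VMs required to serve the demand."""
    return math.ceil(demand_rps / per_vm_rps)


def cheapest_configuration(demand_rps, vm_types):
    """Return (hourly_cost, vm_type, count) for the most economical option."""
    options = []
    for name, spec in vm_types.items():
        count = vms_needed(demand_rps, spec["rps"])
        options.append((count * spec["hourly_price"], name, count))
    return min(options)


if __name__ == "__main__":
    # Hypothetical VM catalog; prices could be fixed or spot-market quotes.
    catalog = {
        "small": {"rps": 120, "hourly_price": 0.10},
        "medium": {"rps": 260, "hourly_price": 0.20},
        "large": {"rps": 500, "hourly_price": 0.45},
    }
    cost, vm_type, count = cheapest_configuration(1000, catalog)
    print(vm_type, count, round(cost, 2))
```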

Life at the Edges: Observation Rich, but Resource Constrained
IoT settings and data harvesting have been fueled by the proliferation of miniaturized sensors and cheap edge devices. Edge nodes are the first to “see” the data, and often a lot of it. The data must be stored, processed, or transferred to private or public clouds. The constrained CPU, memory, and battery capacities of edge nodes curtail what they can process, store, or transfer – all prerequisite to analytics.

We have investigated bridging fog and cloud domains to support analysis through federated query evaluations. Hermes is a two-tiered storage and analysis framework consisting of cloud nodes and fog nodes. While cloud nodes are responsible for heavyweight analytics and long-term storage, fog nodes are tasked with processing incoming sensor data. As a result, fog nodes provide low-latency access to recent observations, with cloud nodes managing historical data and coordinating activities across the system. This separation of concerns reduces network traffic. Additionally, certain types of near-term analysis can be conducted entirely on the fog nodes, which reduces the amount of necessary cloud resources. Cloud and fog components support the same set of query operations but diverge in how they handle storage. Fog nodes sample from the incoming data streams and hold high-resolution data for shorter periods of time, whereas cloud nodes coordinate long-term storage.



 


But storage is more than squirreling data away
     There are patterns that await discovery
 I/O amplification pitfalls that need circumvention
     And analytic jobs that require scaling


File Systems: Save the Day

My research targets the micro/macroscopic aspects of distributed file systems design. At the individual machine level this has targeted efficiency of disk scheduling algorithms and contention. At the distributed scales this has involved metadata management, query support, preservation of timeliness and throughput, and overlay design. Ongoing research in this area has focused on designing file systems that facilitate high-performance training of ensemble-based data fitting algorithms such as random forests and gradient boosting. File systems that our efforts interoperate with include ext3/4, btrfs, NTFS, F2FS, ZFS, Google BigQuery, HDFS, and HBase.
 
  Galileo: Fast storage and retrieval of voluminous multidimensional time-series data. Galileo supports approximate, analytical, fuzzy, t-test, and significance evaluation queries over Petascale datasets encompassing roughly a trillion files and a quadrillion observations.
  Minerva: This effort targets efficiency of file systems in virtualized cloud environments. Minerva proactively alleviates disk contention, amortizes I/O costs, and selectively prioritizes VMs based on access patterns and durations.  
  Gossamer: This effort explores foundational issues in file systems design for data generated in continuous sensing environments. In particular, we are seeking to explore how distributed file systems can cope with situations where data arrival rates outpace write throughputs and disk capacities.  
  Concerto: This is a distributed file system designed specifically to simplify construction of analytical models using ensemble methods such as random forests and gradient boosting. In particular, the objective of this effort is to facilitate dispersion and data accesses in multidimensional data spaces to preserve accuracy while significantly reducing training times.
 
 

 
   














 