Pradyumna Narayana
Colorado State University


2012 -
Graduate Research Assistant
Working with Prof. Bruce A. Draper
Colorado State University Vision Lab, Fort Collins, CO
2014 - 2018 (Expected July 2018)
Colorado State University: PhD
Thesis: Gesture Recognition
Data Scientist Intern
Salesforce, Seattle, WA
2011 - 2014
Colorado State University: MS
Thesis: Consistent Hidden Markov Models
Software Engineering Intern
Seagate Technologies, Boulder, CO
2007 - 2011
Jawaharlal Nehru Technological University: BTech


Continuous Gesture Recognition through Selective Temporal Fusion
Pradyumna Narayana, J. Ross Beveridge, Bruce A. Draper
European Conference on Computer Vision (ECCV) 2018 (Under Review)
Gestures are a common form of human communication and are important for human-computer interfaces (HCI). However, HCI systems need to recognize gestures in continuous streams of data. This paper presents a new architecture called S-FOANet that recognizes gestures in continuous data streams without first pre-segmenting the videos into single gesture clips. When applied to the 2017 ChaLearn ConGD dataset, S-FOANet achieves a mean Jaccard Index of 0.7740 compared to the previous best result of 0.6103. This paper also presents the first continuous data stream results for the NVIDIA dataset. Perhaps more importantly, using results from both datasets this paper shows that the best temporal fusion strategies in multi-channel networks depend on the modality (RGB vs depth vs flow field) and target (global vs left hand vs right hand) of the channel. S-FOANet achieves optimum performance using Gaussian pooling for global channels, LSTMs for focused (left hand or right hand) flow field channels, and late pooling for focused RGB and depth channels.
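As a rough illustration of the metric quoted above, the sketch below computes a frame-level Jaccard index per gesture class and averages it over classes and then over sequences. The function names, the use of 0 as a "no gesture" label, and the choice to average over every class present in either labeling are assumptions made for this example; the official ChaLearn ConGD scoring may normalize differently.

```python
def sequence_jaccard(truth, pred, background=0):
    """Mean per-class, frame-level Jaccard index for one video sequence.

    truth and pred are equal-length lists of per-frame gesture labels;
    `background` marks frames with no gesture (an assumed convention).
    """
    if len(truth) != len(pred):
        raise ValueError("label sequences must have the same length")
    classes = (set(truth) | set(pred)) - {background}
    if not classes:
        return 1.0  # assumption: two gesture-free labelings agree perfectly
    scores = []
    for c in classes:
        true_frames = {i for i, label in enumerate(truth) if label == c}
        pred_frames = {i for i, label in enumerate(pred) if label == c}
        # Overlap of the frame sets divided by their union.
        scores.append(len(true_frames & pred_frames) / len(true_frames | pred_frames))
    return sum(scores) / len(scores)


def mean_jaccard(pairs):
    """Average the per-sequence score over a non-empty list of (truth, pred) pairs."""
    return sum(sequence_jaccard(t, p) for t, p in pairs) / len(pairs)
```

Because the score is computed per frame, a prediction that finds the right gesture but misplaces its start or end frame is penalized, which is what makes the metric suitable for unsegmented streams.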

Gesture Recognition: Focus on the Hands
Pradyumna Narayana, J. Ross Beveridge, Bruce A. Draper
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018
Gestures are a common form of human communication and are important for human-computer interfaces (HCI). Recent approaches to gesture recognition use deep learning methods, including multi-channel methods. We show that when spatial channels are focused on the hands, gesture recognition improves significantly, particularly when the channels are fused using a sparse network. Using this technique, we improve performance on the ChaLearn IsoGD dataset from a previous best of 67.71% to 82.07%, and on the NVIDIA dynamic hand gesture dataset from 83.8% to 91.28%.

Interacting Hidden Markov Models for Video Understanding
Pradyumna Narayana, J. Ross Beveridge, Bruce A. Draper
International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI) 2018 (Minor Revision)
People, cars and other moving objects in videos generate time series data that can be labeled in many ways. For example, classifiers can label motion tracks according to the object type, the action being performed, or the trajectory of the motion. These labels can be generated for every frame as long as the object stays in view, so object tracks can be modeled as Markov processes with multiple noisy observation streams. A challenge in video recognition is to recover the true state of the track (i.e., its class, action and trajectory) using Markov models without (a) counterfactually assuming that the streams are independent or (b) creating a fully-coupled Hidden Markov Model with an infeasibly large state space. This paper introduces a new method for labeling sequences of hidden states. The method exploits external consistency constraints among streams without modeling complex joint distributions between them. For example, common-sense semantics suggest that "trees can't walk". This is an example of an external constraint between an object label ("tree") and an action label ("walk"). The key to exploiting external constraints is a new variation of the Viterbi algorithm we call the Viterbi-Segre (VS) algorithm. VS restricts the solution spaces of factorized HMMs to marginal distributions that are compatible with joint distributions satisfying sets of external constraints. Experiments on synthetic data show that Viterbi-Segre does a better job of estimating true states given observations than the traditional Viterbi algorithm applied to (a) factorized HMMs, (b) fully-coupled HMMs, or (c) partially-coupled HMMs that model pair-wise dependencies. We then show that VS outperforms factorized and pair-wise HMMs on real video data sets for which fully-coupled HMMs cannot feasibly be trained.
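For readers unfamiliar with the baseline, the following is a minimal sketch of the traditional Viterbi algorithm for a single discrete HMM, which is the method the abstract compares against. It is not the Viterbi-Segre algorithm and does not handle external constraints; the function name, the list-based parameter layout, and the integer encoding of states and observations are assumptions made for this example.

```python
import math


def viterbi(start, trans, emit, observations):
    """Most likely hidden-state path for a single discrete HMM.

    start[i]    : P(state_0 = i)
    trans[i][j] : P(state_t = j | state_{t-1} = i)
    emit[i][o]  : P(observation = o | state = i)
    observations: list of integer observation symbols
    """
    def log(p):
        # Work in log space to avoid underflow; zero probability maps to -inf.
        return math.log(p) if p > 0 else float("-inf")

    if not observations:
        return []
    n = len(start)
    score = [log(start[i]) + log(emit[i][observations[0]]) for i in range(n)]
    backpointers = []
    for obs in observations[1:]:
        prev, score, pointers = score, [], []
        for j in range(n):
            # Best predecessor for state j at this time step.
            best = max(range(n), key=lambda i: prev[i] + log(trans[i][j]))
            pointers.append(best)
            score.append(prev[best] + log(trans[best][j]) + log(emit[j][obs]))
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

In a factorized setup, a decoder like this would be run independently on each label stream (object type, action, trajectory), which is exactly where inconsistent combinations such as a walking tree can arise.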

Cooperating with Avatars Through Gesture, Language and Action
Pradyumna Narayana, Nikhil Krishnaswamy, Isaac Wang, Rahul Bangar, Dhruva Patil, Gururaj Mulay, Kyeongmin Rim, Ross Beveridge, Jaime Ruiz, James Pustejovsky, Bruce Draper
Intelligent Systems Conference (IntelliSys) 2018
Advances in artificial intelligence are fundamentally changing how we relate to machines. We used to treat computers as tools, but now we expect them to be agents, and increasingly our instinct is to treat them like peers. This paper is an exploration of peer-to-peer communication between people and machines. Two ideas are central to the approach explored here: shared perception, in which people work together in a shared environment, and much of the information that passes between them is contextual and derived from perception; and visually grounded reasoning, in which actions are considered feasible if they can be visualized and/or simulated in 3D. We explore shared perception and visually grounded reasoning in the context of blocks world, which serves as a surrogate for cooperative tasks where the partners share a workspace. We begin with elicitation studies observing pairs of people working together in blocks world and noting the gestures they use. These gestures are grouped into three categories: social, deictic, and iconic gestures. We then build a prototype system in which people are paired with avatars in a simulated blocks world. We find that when participants can see but not hear each other, all three gesture types are necessary, but that when the participants can speak to each other the social and deictic gestures remain important while the iconic gestures become less so. We also find that ambiguities flip the conversational lead, in that the partner previously receiving information takes the lead in order to resolve the ambiguity.

Motion Segmentation via Generalized Curvatures
Robert T. Arn, Pradyumna Narayana, Tegan Emerson, Bruce A. Draper, Michael Kirby, Chris Peterson
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 2018
New depth sensors, like the Microsoft Kinect, produce streams of human pose data. These discrete pose streams can be viewed as noisy samples of an underlying continuous ideal curve that describes a trajectory through high-dimensional pose space. This paper introduces a technique for generalized curvature analysis (GCA) that determines features along the trajectory which can be used to characterize change and segment motion. Tools are developed for approximating generalized curvatures at mean points along a curve in terms of the singular values of local mean-centered data balls. The features of the GCA algorithm are illustrated on both synthetic and real examples, including data collected from a Kinect II sensor. We also applied GCA to the Carnegie Mellon University Motion Capture (MoCap) database. Given that GCA scales linearly with the length of the time series, we are able to analyze large data sets without downsampling. It is demonstrated that the generalized curvature approximations can be used to segment pose streams into motions and transitions between motions. The GCA algorithm can identify 94.2% of the transitions between motions without knowing the set of possible motions in advance, even though the subjects do not stop or pause between motions.

EASEL: Easy Automatic Segmentation Event Labeler
Isaac Wang, Pradyumna Narayana, Jesse Smith, Bruce Draper, Ross Beveridge, Jaime Ruiz
International Conference on Intelligent User Interfaces (IUI) 2018
Video annotation is a vital part of research examining gestural and multimodal interaction as well as computer vision, machine learning, and interface design. However, annotation is a difficult, time-consuming task that requires high cognitive effort. Existing tools for labeling and annotation still require users to manually label most of the data, limiting the tools’ helpfulness. In this paper, we present the Easy Automatic Segmentation Event Labeler (EASEL), a tool supporting gesture analysis. EASEL streamlines the annotation process by introducing assisted annotation, using automatic gesture segmentation and recognition to automatically annotate gestures. To evaluate the efficacy of assisted annotation, we conducted a user study with 24 participants and found that assisted annotation decreased the time needed to annotate videos with no difference in accuracy compared with manual annotation. The results of our study demonstrate the benefit of adding computational intelligence to video and audio annotation tasks.

Exploring the Use of Gesture in Collaborative Tasks
Isaac Wang, Pradyumna Narayana, Dhruva Patil, Gururaj Mulay, Rahul Bangar, Bruce Draper, Ross Beveridge, Jaime Ruiz
Conference on Human Factors in Computing Systems (CHI) 2017
Personal assistants such as Siri have changed the way people interact with computers by introducing virtual assistants that collaborate with humans through natural speech-based interfaces. However, relying on speech alone as the medium of communication can be a limitation; non-verbal aspects of communication also play a vital role in natural human discourse. Thus, it is necessary to identify the use of gesture and other non-verbal aspects in order to apply them towards the development of computer systems. We conducted an exploratory study to identify how humans use gesture and speech to communicate when solving collaborative tasks. We highlight differences in gesturing strategies in the presence/absence of speech and also show that the inclusion of gesture with speech resulted in faster task completion times than with speech alone. Based on these results, we present implications for the design of gestural and multimodal interactions.

Communicating and Acting: Understanding Gestures in Simulation Semantics
Nikhil Krishnaswamy, Pradyumna Narayana, Isaac Wang, Kyeongmin Rim, Rahul Bangar, Dhruva Patil, Gururaj Mulay, Ross Beveridge, Jaime Ruiz, Bruce Draper, James Pustejovsky
International Conference on Computational Semantics (IWCS) 2017
In this paper, we introduce an architecture for multimodal communication between humans and computers engaged in a shared task. We describe a representative dialogue between an artificial agent and a human that will be demonstrated live during the presentation. This assumes a multimodal environment and semantics for facilitating communication and interaction with a computational agent. To this end, we have created an embodied 3D simulation environment enabling both the generation and interpretation of multiple modalities, including: language, gesture, and the visualization of objects moving and agents performing actions. Objects are encoded with rich semantic typing and action affordances, while actions themselves are encoded as multimodal expressions (programs), allowing for contextually salient inferences and decisions in the environment.

Creating Common Ground through Multimodal Simulations
James Pustejovsky, Nikhil Krishnaswamy, Bruce Draper, Pradyumna Narayana, Rahul Bangar
International Conference on Computational Semantics (IWCS) workshop on Foundations of Situated and Multimodal Communication 2017
The demand for more sophisticated human-computer interactions is rapidly increasing, as users become more accustomed to conversation-like interactions with their devices. In this paper, we examine this changing landscape in the context of human-machine interaction in a shared workspace to achieve a common goal. In our prototype system, people and avatars cooperate to build blocks world structures through the interaction of language, gesture, vision, and action. This provides a platform to study computational issues involved in multimodal communication. In order to establish elements of the common ground in discourse between speakers, we have created an embodied 3D simulation, enabling both the generation and interpretation of multiple modalities, including: language, gesture, and the visualization of objects moving and agents acting in their environment. The simulation is built on the modeling language VoxML, which encodes objects with rich semantic typing and action affordances, and actions themselves as multimodal programs, enabling contextually salient inferences and decisions in the environment. We illustrate this with a walk-through of multimodal communication in a shared task.

EGGNOG: A Continuous, Multi-modal Data Set of Naturally Occurring Gestures with Ground Truth Labels
Isaac Wang, Mohtadi Ben Fraj, Pradyumna Narayana, Dhruva Patil, Gururaj Mulay, Rahul Bangar, J. Ross Beveridge, Bruce A. Draper, Jaime Ruiz
International Conference on Automatic Face & Gesture Recognition (FG) 2017
People communicate through words and gestures, but current voice-based computer interfaces such as Siri exploit only words. This is a shame: human-computer interfaces would be more natural if they incorporated gestures as well as words. To support this goal, we present a new dataset of naturally occurring gestures made by people working collaboratively on blocks world tasks. The dataset, called EGGNOG, contains over 8 hours of RGB video, depth video, and Kinect v2 body position data of 40 subjects. The data has been semi-automatically segmented into 24,503 movements, each of which has been labeled according to (1) its physical motion and (2) the intent of the participant. We believe this dataset will stimulate research into natural and gestural human-computer interfaces.