Mind's Eye

First, An Example

This video shows an example of our system detecting and recognizing a 'pickup' event in the Mind's Eye Year 2 evaluation data. In particular, the man in the red jacket picks up a briefcase. The system also finds a 'stop' event (which is correct: the man in the black jacket had been moving) and an 'arrive' event (which is debatable: did the man in the black jacket arrive at his destination, or simply stop because he was meeting the man in red?). More videos of other events appear below.

Program Context

From Sept. 2010 through Spring 2013, we were part of DARPA's Mind's Eye program. The goal of the program was to work toward an intelligent camera for persistent surveillance. The performance tasks were (1) to detect and recognize 48 English-language verbs whenever they occur in unstructured videos provided by DARPA, and (2) to generate brief English-language descriptions of those events. (There were also tasks for gap filling and anomaly detection, but we focused on verb recognition and event description.) The Year 1 and Year 2 data sets can be accessed and downloaded from www.visint.org.

Our emphasis was on learning to recognize and describe events through unsupervised learning and selective guidance. Other groups in the Mind's Eye program made extensive use of supervised training to learn object appearances and/or actions. This is effective but inordinately expensive: raw video is a limitless and free resource, but the process of adding ground-truth labels is time-consuming and costly (not to mention really, really tedious). One alternative is to use crowdsourcing to label videos, but this still has costs. Our approach was instead to learn to recognize events by analyzing raw (unlabeled) videos.

We used unsupervised learning to generate models of common views of objects (a.k.a. appearances) as well as common actions, where an action is a motion by a single actor over a brief period of time (typically 1 or 2 seconds). These learned models can then be matched to novel images. Unfortunately, object and action models learned without supervision carry no semantic labels ("object view #17 did action #8" is not a very interesting description). We therefore used selective guidance to map terms learned without supervision to known semantic terms with a minimum of human effort. Once this mapping is established, abstract events are constructed from combinations of actors and actions.
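To make the shape of this pipeline concrete, here is a minimal sketch, not our actual implementation: it assumes k-means (via scikit-learn) as the unsupervised clustering step, uses random vectors as stand-ins for per-actor motion descriptors, and invents the cluster ids, verb assignments, and helper names (cluster_to_verb, describe) purely for illustration.

```python
# Illustrative sketch only: cluster unlabeled motion descriptors into
# "action" codewords, then attach verb labels to a few clusters by hand
# (the selective-guidance step). Descriptor extraction is faked with
# random vectors; a real system would compute them from tracked actors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Pretend each row is a motion descriptor for one actor over ~1-2 seconds.
unlabeled_descriptors = rng.normal(size=(5000, 64))

# Unsupervised step: discover common actions as clusters (no labels used).
action_model = KMeans(n_clusters=32, n_init=10, random_state=0)
action_model.fit(unlabeled_descriptors)

# Selective guidance: a human inspects a handful of clusters and names them.
# The cluster ids and verbs below are made up for this example.
cluster_to_verb = {3: "pick up", 8: "carry", 17: "stop", 21: "approach"}

def describe(actor_id, descriptor):
    """Map a new actor's motion descriptor to a verb, if one is known."""
    cluster = int(action_model.predict(descriptor.reshape(1, -1))[0])
    verb = cluster_to_verb.get(cluster)
    if verb is None:
        return f"actor {actor_id} did unlabeled action #{cluster}"
    return f"actor {actor_id} performs '{verb}'"

# A novel observation (again faked) is matched against the learned model.
print(describe(actor_id=1, descriptor=rng.normal(size=64)))
```

The point of the sketch is the division of labor: the clustering step sees only raw descriptors, and the small hand-built mapping is the only place human effort enters before events are composed from actors and their labeled actions.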

More Examples

The following short videos were created as part of the Year 2 evaluation for the Mind's Eye program. We trained (without supervision) on the Mind's Eye Year 2 training set, and then applied the trained system to the sequestered Year 2 evaluation set. The videos below show selected results.

This video shows an example of a detected mutual 'approach' event.

This video shows an example of a detected 'dig' event.

This video shows an example of a detected 'turn' event (but we missed that she is carrying a bowl of fruit).

This video shows an example of a detected 'carry' event. (The carried object is a gun, and he is also walking, as you will see.)

Finally, another detected 'approach' event in a different setting. This time one person approaches another (rather than a mutual approach).