Overview

As has become the pattern for this class, this fifth assignment builds upon the previous one. In the previous assignment your team constructed a real-time tracking algorithm that works in conjunction with a live camera. Now that you have a system that will pay attention to an object presented to it, it's time to extend that system so that it can be taught the proper names for a handful of objects and then label those objects when they are presented to it again.

Also, as is consistent with the goals of a graduate course, as we are now approaching the end of the semester you're being presented with a more open-ended assignment. Exactly how you carry out the teaching and then labeling of known objects is very much up to you. That said, a baseline approach in broad outline is presented below. But first, let's review a few aspects of your tracker in order to think through what may make it a more fluid and reliable component of a larger recognition system.

Tracking

At this point, we've all benefited greatly from seeing eight different approaches to initialization of a David Bolme style tracker. Perhaps the single most important thing for you to consider as you move forward with Assignment 5 is that you are not being required to maintain the correlation filter tracking capability. If you so choose, you may use a built-in tracker that is part of OpenCV. Whatever you choose to do, for your own sanity as you start working with the system and multiple objects, you're going to want to streamline the initialization phase. One simple approach is to register a single mouse click on the (approximate) center of the object of interest and then initiate tracking of that unknown object. I emphasize unknown object because, before you can even start working on the code to acquire an internally stored model of an acquired object, your system will need to be able to reliably track not-yet-named objects.
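The click-to-initialize idea above can be sketched as a small helper. This is one illustrative design, not a required one: the box arithmetic is plain Python so it stands alone, and the commented-out wiring shows where it would plug into OpenCV's mouse-callback mechanism (`cv2.setMouseCallback`). The window name, state dictionary, and the fixed 64-pixel box size are all assumptions for the sketch.

```python
def box_from_click(x, y, frame_w, frame_h, box_size=64):
    """Return an (x0, y0, w, h) tracking window centered on the clicked
    point, clamped so the whole window stays inside the frame."""
    half = box_size // 2
    x0 = min(max(x - half, 0), frame_w - box_size)
    y0 = min(max(y - half, 0), frame_h - box_size)
    return (x0, y0, box_size, box_size)

# In the live system (not runnable here without a camera window):
#   def on_mouse(event, x, y, flags, state):
#       if event == cv2.EVENT_LBUTTONDOWN:
#           state["box"] = box_from_click(x, y, FRAME_W, FRAME_H)
#   cv2.setMouseCallback("preview", on_mouse, state)
```

Clamping matters more than it looks: a click near the image border would otherwise hand the tracker a partially empty window on its very first frame.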

Baseline Approach

In essence, there are two key and related capabilities you must develop. First, for a single frame of video, based upon the tracker's estimated position of an object, you must extract a feature vector. The term feature vector is used very broadly here; for example, it may be an entire set of feature vectors. By default, you could certainly use the feature vectors that OpenCV computes at SURF interest points. To say just a little bit more, you could simply take the SURF features that fall within the ROI defined by the tracker.
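One way to structure that first capability is a function that crops the tracker's ROI out of the frame and hands just that patch to a feature detector. A minimal sketch, assuming the detector follows OpenCV's Feature2D interface (`detectAndCompute`), e.g. `cv2.xfeatures2d.SURF_create()` from opencv-contrib, or `cv2.ORB_create()` as a freely available stand-in; passing the detector in as a parameter keeps the cropping logic testable on its own:

```python
import numpy as np

def descriptors_in_roi(frame, box, detector):
    """Crop the tracker's ROI out of `frame` (a numpy image array) and
    run a feature detector on just that patch.

    `box` is (x, y, w, h) in pixel coordinates; `detector` is anything
    with OpenCV's Feature2D interface, e.g. cv2.ORB_create() or, with
    opencv-contrib installed, cv2.xfeatures2d.SURF_create().
    Returns the descriptor matrix (one row per keypoint), or None if
    nothing was detected in the patch."""
    x, y, w, h = box
    patch = frame[y:y + h, x:x + w]
    _keypoints, descriptors = detector.detectAndCompute(patch, None)
    return descriptors
```

Detecting on the cropped patch, rather than masking the full frame, also keeps the per-frame cost proportional to the ROI size, which helps the system stay real-time.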

Second, you need a mechanism for taking a new feature vector and retrieving a best-guess label, i.e. a single English word or phrase ("angry bird"). The default assumption going into this assignment is that you will use the nearest-neighbor functionality built into OpenCV to carry out this task. As you begin playing with this code, you will discover a lot of details, and that is by design. Here's just one: is it better to display labels independently for each new frame of video, or is it better to "smooth" the labels by looking over a set of perhaps 10 frames and only displaying the label that is most consistently selected?
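The smoothing option can be implemented independently of whatever classifier produces the per-frame guesses (OpenCV's `cv2.ml.KNearest_create()` or anything else). A hypothetical helper, not an OpenCV facility: feed in the raw label chosen for each frame, and it reports the majority label over roughly the last 10 frames.

```python
from collections import Counter, deque

class LabelSmoother:
    """Majority-vote smoothing over the last `window` per-frame labels."""

    def __init__(self, window=10):
        # deque(maxlen=...) silently drops the oldest entry when full
        self.recent = deque(maxlen=window)

    def update(self, frame_label):
        """Record this frame's raw label and return the label that has
        been chosen most often over the recent window."""
        self.recent.append(frame_label)
        label, _count = Counter(self.recent).most_common(1)[0]
        return label
```

A windowed vote like this trades a few frames of latency for far less on-screen flicker when the per-frame classifier is noisy; tuning the window size is exactly the kind of detail the assignment expects you to explore.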

Here's another detail upon which there is not much flexibility: so that you and others can see your system working in action, you will write the English label associated with the tracked object on the live video feed being displayed back to us, the viewers.

What has yet to be described in the baseline approach is the process of teaching the system a new object. This can be kept simple from a user-interface standpoint: click on a new object, track it for some small number of seconds, and then pop up a box asking the user to type in the name. Of course, in terms of implementation, there's a lot more that will have to be done. You will have to decide how to store the features that represent that newly introduced, named object.
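One illustrative storage design (a sketch, not the required one): keep every descriptor gathered during the teaching interval alongside its user-typed label, then classify a new object by assigning each of its descriptors to the nearest stored descriptor and letting the descriptors vote. All the class and method names here are made up for the sketch; the distance metric assumed is plain Euclidean.

```python
import numpy as np

class ObjectMemory:
    """In-memory store of (label, descriptor-row) pairs."""

    def __init__(self):
        self.labels = []   # one label per stored descriptor row
        self.rows = []     # stored descriptor rows (1-D float arrays)

    def teach(self, label, descriptors):
        """Store every descriptor row gathered while tracking the
        newly named object under the user-typed label."""
        for row in np.asarray(descriptors, dtype=np.float32):
            self.labels.append(label)
            self.rows.append(row)

    def classify(self, descriptors):
        """Each query descriptor votes for the label of its nearest
        stored descriptor; return the label with the most votes."""
        bank = np.stack(self.rows)
        votes = {}
        for q in np.asarray(descriptors, dtype=np.float32):
            nearest = int(np.argmin(np.linalg.norm(bank - q, axis=1)))
            lbl = self.labels[nearest]
            votes[lbl] = votes.get(lbl, 0) + 1
        return max(votes, key=votes.get)
```

Voting over many descriptors, rather than trusting a single nearest match, gives some robustness when a few background features sneak into the tracker's ROI during teaching.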

For this assignment, you will not be required to maintain knowledge of named objects between runs of your system. To put this in very practical terms, you are not being required to write to disk the feature representations of your named objects. Having said that this is not a requirement, you may still want to implement this additional functionality, because you may find it makes working with your system more productive and enjoyable.

Teams and Grading

You are expected to work in pairs - teams - for this project. Unless an alternative is worked out between yourselves and approved by me in advance, I strongly suggest you keep your partner from Assignment 3. The grading will be done in the same fashion as for Assignment 3: in face-to-face interviews with the instructor.

To provide a little more detail about what you should expect in presenting your system for grading: you should come with four objects that you can use to show off the capabilities of your system. What you do with these four objects will depend a bit on how you build your system. At a minimum, you'll be asked to demonstrate your system acquiring the name for a newly presented object; in this context, "newly presented" means it will be one of your four objects, but you will not have set up your system in advance of the meeting to recognize that object. You will then be asked to present all four objects to the system in turn and show the automatic labeling of the four known objects. Finally, there will be a fifth object that your system will have to be taught to name, and that object will be provided for you at the time of the meeting. In other words, in all likelihood it will not be an object that you have ever shown your system before.