May 4, 2015

Example image from the PASCAL VOC challenge.

by Benjamin Recchie

Greg Shakhnarovich, an assistant professor at the Toyota Technological Institute at Chicago and a part-time assistant professor of computer science at the University of Chicago, is searching for the holy grail. Not the holy grail of Arthurian legend, but the holy grail of computer vision: getting a computer to “know what is where by looking,” an enormously complex task. It’s not enough to tell a computer what a table looks like; it has to be able to recognize that table from all angles and in different lighting conditions, to say nothing of identifying a different table. But it’s not an insoluble problem either: in a talk given as part of the Research Computing Center’s Visualization Speaker Series, he described how far he (and the rest of the computer vision community) had gotten.

Shakhnarovich has been working with trained artificial neural networks—algorithms that can learn over time. He works on image segmentation, a technique in which a program attempts to break down an image into sections. To demonstrate, he showed a picture of people on bicycles racing: an algorithm had to classify everything in the image as “bicycle,” “person,” or background.

The most important benchmark against which image segmentation algorithms are measured is the PASCAL Visual Object Classes (VOC) challenge, built on a data set of thousands of real-world images, each containing one or more of twenty common objects: a person, bird, cat, cow, dog, horse, sheep, airplane, bicycle, boat, bus, car, motorbike, train, bottle, chair, dining table, potted plant, sofa, or TV. The algorithms must pick out these objects and distinguish them from each other and from irrelevant background.

Shakhnarovich’s algorithms start with a “superpixel,” an aggregation of neighboring pixels that is smaller than the actual regions in the image. This is done for convenience; labeling 1 million pixels is much more difficult than labeling 500 superpixels. The program then tries to classify each superpixel on its own, using what it has learned about the shape, color, texture, and other patterns that identify each standard category. (Buses, for example, tend to have big, round, dark wheels beneath them; horses might be brown or black but not orange or blue.) It then makes its best guess as to which object the superpixel belongs to.
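To make the idea of labeling superpixels rather than pixels concrete, here is a minimal Python sketch. It is not Shakhnarovich's method: the square grid cells stand in for real superpixels, the only feature is mean color, and the nearest-prototype rule stands in for a trained neural network. All function names and the prototype colors are hypothetical, chosen only for illustration.

```python
import numpy as np

def grid_superpixels(height, width, cell):
    """Assign each pixel to a square cell (a crude stand-in for real superpixels)."""
    rows = np.arange(height) // cell
    cols = np.arange(width) // cell
    cells_per_row = -(-width // cell)  # ceiling division
    return rows[:, None] * cells_per_row + cols[None, :]

def mean_color_per_superpixel(image, labels):
    """Average the color channels of the pixels inside each superpixel."""
    n = labels.max() + 1
    flat = labels.ravel()
    counts = np.bincount(flat, minlength=n)
    sums = np.stack(
        [np.bincount(flat, weights=image[..., c].ravel(), minlength=n)
         for c in range(image.shape[-1])],
        axis=1,
    )
    return sums / counts[:, None]

def classify_superpixels(features, prototypes):
    """Give each superpixel the index of the closest prototype color.

    Assumption: a nearest-prototype rule replaces the learned classifier.
    """
    dists = np.linalg.norm(features[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Toy 4x4 RGB image: dark top half, bright bottom half.
image = np.zeros((4, 4, 3))
image[2:, :, :] = 1.0

labels = grid_superpixels(4, 4, cell=2)          # 16 pixels -> 4 superpixels
features = mean_color_per_superpixel(image, labels)
prototypes = np.array([[0.0, 0.0, 0.0],          # hypothetical "dark" class
                       [1.0, 1.0, 1.0]])         # hypothetical "bright" class
print(classify_superpixels(features, prototypes))
```

The point of the sketch is the reduction in work: the classifier is called once per superpixel (four times here) instead of once per pixel (sixteen times).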

Next, he combines the superpixels with a “zoom-out,” which depends on the fact that most superpixels are likely to be next to another superpixel of the same thing, unless there’s a clear difference. The results for each superpixel are compared to the results for the surrounding superpixels and reevaluated accordingly. Thus, even if the first pass suggests an individual superpixel is a TV, if the surrounding superpixels are more likely to be a sheep, then the whole area is reclassified as a sheep. This process is repeated several times, until the whole image has been evaluated.
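The re-evaluation step described above can be sketched in a few lines of Python. This follows the article's description only and is a simplification, not the published algorithm: the blending weight, the number of rounds, the adjacency list, and the scores are all made-up values, and `smooth_scores` is a hypothetical name.

```python
import numpy as np

def smooth_scores(scores, neighbors, weight=0.5, rounds=3):
    """Blend each superpixel's class scores with the mean of its neighbors' scores.

    scores:    array of shape (num_superpixels, num_classes)
    neighbors: dict mapping a superpixel index to a list of adjacent indices
    Returns the winning class index for every superpixel.
    """
    scores = np.asarray(scores, dtype=float)
    for _ in range(rounds):
        updated = scores.copy()
        for i, nbrs in neighbors.items():
            if nbrs:
                context = scores[list(nbrs)].mean(axis=0)
                updated[i] = (1 - weight) * scores[i] + weight * context
        scores = updated
    return scores.argmax(axis=1)

# Classes: 0 = sheep, 1 = TV. Superpixel 0 looks like a TV on its own,
# but it is surrounded by four superpixels that look like sheep.
scores = [[0.4, 0.6],
          [0.9, 0.1],
          [0.9, 0.1],
          [0.9, 0.1],
          [0.9, 0.1]]
neighbors = {0: [1, 2, 3, 4],
             1: [0, 2, 4],
             2: [0, 1, 3],
             3: [0, 2, 4],
             4: [0, 1, 3]}

print(smooth_scores(scores, neighbors, rounds=0))  # first-pass labels only
print(smooth_scores(scores, neighbors))            # labels after re-evaluation
```

With no smoothing rounds, superpixel 0 keeps its first-pass label of TV; after blending with its neighbors, it is relabeled as sheep along with the rest of the area.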

Thus trained, the neural networks are capable of high accuracy, and it seems to climb higher every day. There have been rapid advances in the field of image recognition in just the last few months, Shakhnarovich said. In November of 2014, his neural network achieved a then-record 58.2% accuracy rate on the PASCAL VOC data. But competing teams of researchers, using even more sophisticated neural networks, quickly improved upon that. As of March 2015, the bar had been raised to 67.7%, a gain of almost ten percentage points in mere months.

Clearly, researchers are on the right path. The success of neural networks similar to Shakhnarovich’s on the PASCAL VOC challenge points the way towards solving more complex problems, such as recognizing individual humans, or at least more than just the 20 categories of objects in the VOC catalog. Someday very soon, a computer might finally be able to look around, and not just see, but understand.