Our paper “Pooling in image representation: the visual codeword point of view” has been published in the May issue of the journal Computer Vision and Image Understanding (CVIU). The paper is available at the publisher’s site (DOI: 10.1016/j.cviu.2012.09.007). The latest preprint is also available on my publications page.
In this paper, we explore and extend the bags-of-visual-words formalism. We propose a new pooling function that preserves information about the distances between the low-level descriptors of the image and the codewords in the visual dictionary. That density-based approach yields more compact representations than the parametric approaches (based upon moments of multidimensional Gaussians) commonly found in the literature. Here’s the abstract:
In this work, we propose BossaNova, a novel representation for content-based concept detection in images and videos, which enriches the Bag-of-Words model. Relying on the quantization of highly discriminant local descriptors by a codebook, and the aggregation of those quantized descriptors into a single pooled feature vector, the Bag-of-Words model has emerged as the most promising approach for concept detection on visual documents. BossaNova enhances that representation by keeping a histogram of distances between the descriptors found in the image and those in the codebook, preserving thus important information about the distribution of the local descriptors around each codeword. Contrarily to other approaches found in the literature, the non-parametric histogram representation is compact and simple to compute. BossaNova compares well with the state-of-the-art in several standard datasets: MIRFLICKR, ImageCLEF 2011, PASCAL VOC 2007 and 15-Scenes, even without using complex combinations of different local descriptors. It also complements well the cutting-edge Fisher Vector descriptors, showing even better results when employed in combination with them. BossaNova also shows good results in the challenging real-world application of pornography detection.
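To make the core idea of the abstract a bit more concrete, here is a rough Python sketch of distance-histogram pooling: for each codeword, it builds a histogram of the distances from that codeword to the local descriptors of one image, then concatenates the histograms. This is a simplified illustration and not the implementation from the paper. The function name `distance_histogram_pooling`, the parameters `n_bins`, `d_min` and `d_max`, and the use of a single fixed Euclidean distance range for every codeword are all assumptions made for the example.

```python
import numpy as np


def distance_histogram_pooling(descriptors, codebook, n_bins=4, d_min=0.0, d_max=2.0):
    """Pool local descriptors into per-codeword histograms of distances.

    descriptors: array of shape (n_descriptors, dim), one image's local descriptors.
    codebook:    array of shape (n_codewords, dim), the visual dictionary.
    Returns a vector of length n_codewords * n_bins.

    Hypothetical simplification: one fixed range [d_min, d_max] is shared
    by all codewords, and distances outside it are simply ignored.
    """
    # Pairwise Euclidean distances, shape (n_descriptors, n_codewords).
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)

    # Shared bin edges for every codeword.
    edges = np.linspace(d_min, d_max, n_bins + 1)

    n_descriptors = descriptors.shape[0]
    pooled = np.empty((codebook.shape[0], n_bins))
    for m in range(codebook.shape[0]):
        counts, _ = np.histogram(dists[:, m], bins=edges)
        # Normalize by the number of descriptors in the image.
        pooled[m] = counts / n_descriptors

    return pooled.ravel()


# Toy usage with random data standing in for real descriptors and a learned codebook.
rng = np.random.default_rng(0)
toy_descriptors = rng.normal(size=(50, 8))
toy_codebook = rng.normal(size=(3, 8))
toy_vector = distance_histogram_pooling(toy_descriptors, toy_codebook)
print(toy_vector.shape)
```

The resulting vector has one block of `n_bins` values per codeword, so its length grows linearly with the dictionary size and with the number of bins. In a real pipeline the codebook would come from clustering training descriptors, and the distance range would need to be chosen with care; both are left out of this sketch.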
I’d like to shoot a small video later this month, exploring the bags-of-words model in general and this work in particular. Let’s hope I can find the time!