Paper accepted in VLDB

My colleague George Teodoro and I, together with a handful of collaborators, had a paper accepted at the prestigious VLDB Journal, “Approximate Similarity Search for Online Multimedia Services on Distributed CPU–GPU Platforms”. The paper considerably extends our previous work on parallel similarity search engines, incorporating the use of accelerators (GPUs) and needed scheduler algorithms to make the use of those accelerators efficient for online applications.

Here’s the abstract:

Similarity search in high-dimensional spaces is a pivotal operation for several database applications, including online content-based multimedia services. With the increasing popularity of multimedia applications, these services are facing new challenges regarding (i) the very large and growing volumes of data to be indexed/searched and (ii) the necessity of reducing the response times as observed by end-users. In addition, the nature of the interactions between users and online services creates fluctuating query request rates throughout execution, which requires a similarity search engine to adapt to better use the computation platform and minimize response times. In this work, we address these challenges with Hypercurves, a flexible framework for answering approximate k-nearest neighbor (kNN) queries for very large multimedia databases. Hypercurves executes in hybrid CPU–GPU environments and isable to attain massive query processing rates through the cooperative use of these devices. Hypercurves also changes its CPU–GPU task partitioning dynamically according to the observed load, aiming for optimal response times. In our empirical evaluation, dynamic task partitioning reduced query response times by approximately 50% compared to the best static task partition. Due to a probabilistic proof of equivalence to the sequential kNN algorithm, the CPU–GPU execution of Hypercurves in distributed (multi-node) environments can be aggressively optimized, attaining superlinear scalability while still guaranteeing, with high probability, results at least as good as those from the sequential algorithm.

The paper is yet to appear, but the last preprint before publisher corrections is available at my publications page.

Update 31/07 : The publisher’s version is already online, with DOI 10.1007/s00778-013-0329-7.

Talk at LIP6 Lab, UPMC, Paris

I’m giving my “Scalability Issues in Multimedia Information Retrieval” talk at the LIP6, UPMC, this Tuesday, February 19th, at 13h. I thank my colleague and former advisor Prof. Matthieu Cord for this opportunity.

Talk at the I3S Lab, Université de Nice, Sophia-Antipolis

As part of my visit to the I3S Lab, I’m giving a talk on February 11th :

Title: Scalability Issues in Multimedia Information Retrieval
Where: I3S conference room (level 0)
When: Monday, February 11th, at 14h00

The Millennium marked a turning point for textual Information Retrieval, a moment when Search Engines and Social Networks changed our relationship to World Wide Web: gigantic corpora of knowledge suddenly felt friendly, accessible and manageable. Ten years later, the same phenomenon is happening for complex non-textual data, including multimedia. The challenge is how to provide intuitive, convenient, fast services for those data, in collections whose size and growing rate is so big, that our intuitions fail to grasp.

Two issues have dominated the scientific discourse when we aim at that goal: our ability to represent multimedia information in a way that allows answering the high-level queries posed by the users, and our ability to process those queries fast.

In this talk, I will focus on the latter issue, examining similarity search in high-dimensional spaces, a pivotal operation found a variety of database applications — including Multimedia Information Retrieval. Similarity search is conceptually very simple: find the objects in the dataset that are similar to the query, i.e., those that are close to the query according to some notion of distance. However, due to the infamous “curse of the dimensionality”, performing it fast is challenging from both the theoretical and the practical point-of-view.

I have selected for this talk Hypercurves, my latest research endeavor, which is a distributed technique aimed at hybrid CPU–GPU environments. Hypercurves’ goal is to employ throughput-oriented GPUs to keep answer times optimal, under several load regimens. The parallelization also poses interesting theoretical questions of how much can we optimize the parallelization of approximate k-nearest neighbors, if we relax the equivalence to the sequential algorithm from exact to probabilistic.

The talk will be in English. I thank my colleague and friend Prof. Frédéric Precioso, for this opportunity.

Upcoming Talk: kNN Search and CBIR

I’ve been invited by Prof. Francisco Pelaez and Prof. Camila Barione of the Centre of Mathematics Computing and Cognition of the Federal University of ABC to give my talk on kNN Search and CBIR (Content Based Image Retrieval).

I will discuss my past work, showing the three methods I’ve proposed during my thesis on high-dimensional multimedia indexing on large databases. But I also discuss some of my new research pursuits, related to the use of very discriminant local descriptors, like SIFT, on complex semantic queries, which require generalisation.

The talk, in Portuguese, will be on Tuesday May 26, at the Block B, room A801 of the Federal University of the ABC, which is located at the Rua Santa Adélia, 166, Santo André — SP, Brazil, CEP 09210-170. Their phone number is +55 11 4996-3166.

If you have a Google Calendar, you can save the date by clicking below:

Upcoming Talk: Three New Methods for kNN Search

Prof. Ricardo Torres has invited me to the Institute of Computing of the State University of Campinas, where I am giving a talk on the work I’ve done on my thesis. I will explore the challenges of kNN search (also known as k nearest neighbours search, or simply similarity search) and discuss the three original methods I’ve proposed: the 3-way trees, which are based on the traditional KD-Tree with the addition of redundant overlapping nodes; the projection KD-Forests, my first attempt of using an index composed of multiple moderate-dimensional sub-indexes; and finally the Multicurves, an index based on the use of multiple moderate-dimensional space-filling curves, which has several nice properties like ease of implementation, dynamicity (tolerance to insertions and deletions without performance degradation) and avoidance of random accesses (thus making secondary-memory implementation easier).

The talk will be in Portuguese.

Upcoming Talk: High-Dimensional Indexing and CBIR

I’m giving a talk on the workshop organised by the Digital Image Processing Centre — NPDI for the French-Brazilian project CAPES-COFECUB on Interactive and Content-Based Multimedia Information Analysis for Digital Video Applications.

My talk, entitled “Indexing High-Dimensional Data – Application to CBIR” explores my recent work on multimedia indexing, k nearest neighbours search (kNN search) and image identification.  Here’s the abstract:

“The troubles of multimedia information retrieval start at its most elementary operation: matching the high-dimensional feature vectors used to describe the data. In this talk, we will discuss how recent innovative methods are taming the infamous ‘curse of dimensionality’ and how they can be used in CBIR. The author will discuss his recent contributions to the advance of the state-of-art and his current research endeavours.”

My talk will be on Wednesday, April 8, at 14h30. It will take place at the UFMG Pampulha Campus, on the ICEx building, in room 2077. Registration to the workshop is free.

EDIT 14/4: The presentation, with narrative is available on my (brand new) talks and courses page.

End of Winter Break

After the break, back-to-work under the snow in one of the harsher European winters seen of recent.

I am glad that I’ve got my seasonal flu before the break, since I have so much work to finish before coming back to Brazil in January 30. I intend to submit soon a journal article about multicurves (my high-dimensional indexing method based on space-filling curves), and I have some other results which I would like to submit to one of the conferences in the “first wave of deadlines” (ICDAR, January 22; ICIP, January 30; ACM KDD, February 2). And, of course, I have to accomplish my final role on the EROS 3D project, preparing the end-of-project report to be submitted later this semester.

Interestingly, an unusual lot of Brazilian colleagues have chosen to spend their Summer break in Europe (thus exchanging heat waves and tropical storms for snow and glacial winds), including two former teachers and two former classmates.  This unusual density of temporary “exilés” has been fortunate, in the sense that it gives the opportunity of many encounters, discussions, and… future cooperation!

Paper on CIKM ’08

I’ve presented the paper “High-Dimensional Descriptor Indexing for Large Multimedia Databases” in the Session 5C : Information Retrieval — Multilingual and Multimedia of the 17th ACM Conference on Information and Knowledge Management — CIKM 2008. This paper presents the third (and most interesting, in my opinion) method for multidimentional indexing proposed in my Ph.D. thesis.

Here’s the abstract:

“In this paper we address the subject of large multimedia database indexing for content-based retrieval. We introduce multicurves, a new scheme for indexing high-dimensional descriptors. This technique, based on the simultaneous use of moderate-dimensional space-filling curves, has as main advantages the ability to handle high-dimensional data (100 dimensions and over), to allow the easy maintenance of the indexes (inclusion and deletion of data), and to adapt well to secondary storage, thus providing scalability to huge databases (millions, or even thousands of millions of descriptors). We use multicurves to perform the approximate k nearest neighbors search with a very good compromise between precision and speed. The evaluation of multicurves, carried out on large databases, demonstrates that the strategy compares well to other up-to-date k nearest neighbor search strategies. We also test multicurves on the real-world application of image identification for cultural institutions. In this application, which requires the fast search of a large amount of local descriptors, multicurves allows a dramatic speed-up in comparison to the brute-force strategy of sequential search, without any noticeable precision loss.”

The paper and some related material can be found in my publications page.