Paper at SISAP’2014 on large-scale LSH for general metric data

We’ve got a paper — Large-Scale Distributed Locality-Sensitive Hashing for General Metric Data — accepted at the upcoming International Conference on Similarity Search and Applications (SISAP’2014).

I’ll be presenting the paper next week : if you’re planning to be there, please, come to say hi ! My session will be on Wednesday afternoon (October 30th at 14h).

The paper is part of an ongoing cooperation with my colleague Prof. George Teodoro (University of Brasilia), with my former M.Sc. student Eliezer Silva, and Petrobras researcher Thiago Teixeira. Here’s the abstract :

Locality-Sensitive Hashing (LSH) is extremely competitive for similarity search, but works under the assumption of uniform access cost to the data, and for just a handful of dissimilarities for which locality-sensitive families are available. In this work we propose Parallel Voronoi LSH, an approach that addresses those two limitations of LSH: it makes LSH efficient for distributed- memory architectures, and it works for very general dissimilarities (in particular, it works for all metric dissimilarities). Each hash table of Voronoi LSH works by selecting a sample of the dataset to be used as seeds of a Voronoi diagram. The Voronoi cells are then used to hash the data. Because Voronoi diagrams depend only on the distance, the technique is very general. Implementing LSH in distributed-memory systems is very challenging because it lacks referential locality in its access to the data: if care is not taken, excessive message-passing ruins the index performance. Therefore, another important contribution of this work is the parallel design needed to allow the scalability of the index, which we evaluate in a dataset of a thousand million multimedia features.

The fullpaper is available at the conference proceedings (LNCS 8821) and will be on open access from October 20 to November 21, 2014. The last preprint is also available on my publications page.

Program Committee Member at SISAP 2014

Similarity search is my favorite research subject. But it was only two years ago that I became aware of SISAP. I remember, at that point, that taking a look at the former editions, I got the impression SISAP was exactly the kind of event you would want to attend : small, cozy, and all the cool people were there.

So I didn’t hesitate to accept when my dear colleague Prof. Agma Traina, who’ll be Chair at the Program Committee this year, invited me as a member of the Committee. I hope I’ll be able to attend, but even if that doesn’t happen, I am glad to take some part at this interesting community.

The call for papers has already launched, with deadline for abstract submissions on May 12th. If you work on similarity search, related topics or applications, please consider sending a contribution. The event will take place in Los Cabos, Mexico, on October, 29–31.

Round-table at Orand

I offered a conversational version of  my “Scalability Issues in Multimedia Information Retrieval” talk at Orand this afternoon, May 28th. At the occasion, Felipe Ramírez, researcher of Orand, also presented his M.Sc. dissertation, “An ensemble of one-class domain descriptors”. It was a very interesting scientific exchange.

Prof. Eduardo Valle round-table at Orand, Santiago, Chile

Presentation of my “Scalability Issues…” talk at Orand. This was an informal, conversational version, interspersed with discussion. (Photo by Paola Cornejo, edited by EV)

I thank my colleague Juan Manuel Barrios, director of research of Orand, for the invitation and organization of this round-table.

Orand's Research Team and Prof. Eduardo Valle : left to right Juan Manuel Barrios (Director of Research), Felipe Ramírez, Eduardo Valle, and José Manuel Saavedra

Orand’s Research Team and I : left to right, Juan Manuel Barrios (Director of Research), Felipe Ramírez, yours truly, and José Manuel Saavedra. (Photo by Paola Cornejo, edited by EV)

Talk at the DCC Universidad de Chile

I’m giving my “Scalability Issues in Multimedia Information Retrieval” talk at the DCC, Universidad de Chile, this Wednesday, May 29th, at noon. I thank my colleague  Prof. Benjamin Bustos for this opportunity.

The talk will be in English, but the follow-up questions may be asked either in English or Spanish, so don’t be shy.

Talk at the I3S Lab, Université de Nice, Sophia-Antipolis

As part of my visit to the I3S Lab, I’m giving a talk on February 11th :

Title: Scalability Issues in Multimedia Information Retrieval
Where: I3S conference room (level 0)
When: Monday, February 11th, at 14h00

The Millennium marked a turning point for textual Information Retrieval, a moment when Search Engines and Social Networks changed our relationship to World Wide Web: gigantic corpora of knowledge suddenly felt friendly, accessible and manageable. Ten years later, the same phenomenon is happening for complex non-textual data, including multimedia. The challenge is how to provide intuitive, convenient, fast services for those data, in collections whose size and growing rate is so big, that our intuitions fail to grasp.

Two issues have dominated the scientific discourse when we aim at that goal: our ability to represent multimedia information in a way that allows answering the high-level queries posed by the users, and our ability to process those queries fast.

In this talk, I will focus on the latter issue, examining similarity search in high-dimensional spaces, a pivotal operation found a variety of database applications — including Multimedia Information Retrieval. Similarity search is conceptually very simple: find the objects in the dataset that are similar to the query, i.e., those that are close to the query according to some notion of distance. However, due to the infamous “curse of the dimensionality”, performing it fast is challenging from both the theoretical and the practical point-of-view.

I have selected for this talk Hypercurves, my latest research endeavor, which is a distributed technique aimed at hybrid CPU–GPU environments. Hypercurves’ goal is to employ throughput-oriented GPUs to keep answer times optimal, under several load regimens. The parallelization also poses interesting theoretical questions of how much can we optimize the parallelization of approximate k-nearest neighbors, if we relax the equivalence to the sequential algorithm from exact to probabilistic.

The talk will be in English. I thank my colleague and friend Prof. Frédéric Precioso, for this opportunity.

Paper Accepted at ICPR 2010

Our paper, “MONORAIL: A Disk-Friendly Index for Huge Descriptor Databases” was accepted at the upcoming IAPR Internation Conference on Pattern Recognition — ICPR 2010. Here is the abstract:

We propose MONORAIL, an indexing scheme for very large multimedia descriptor databases. Our index is based on the Hilbert curve, which is able to map the high-dimensional space of those descriptors to a single dimension. Instead of using several curves to mitigate boundary effects, we use a single curve with several surrogate points for each descriptor. Thus, we are able to reduce the random accesses to the bare minimum. In a rigorous empirical comparison with another method based on multiple surrogates, ours shows a significant improvement, due to our careful choice of the surrogate points.

I am particularly proud of this paper, not only because of the method itself, but also because of the experimental design we propose for the validation. I have been studying for more than a year the topics of Design of Experiments, statistical tests and validation. This is the first of a crop of publications that are employing those rigorous evaluation tools, which, though commonplace in other fields, are still seldom used in Computer Sciences.


Tutorial Accepted on SBBD 2009

My tutorial Similarity Search and Indexing for High-Dimensional Data has been accepted on SBBD 2009 (The Brazilian Symposium on Databases).  Here’s the abstract:

Searching by similarity is a critical operation on many systems, and thus has attracted the attention of many disciplines in Computer Sciences, including Computational Geometry, Machine Learning, Multimedia and, of course, Databases. To perform efficiently, similarity search requires the support of indexing, which suffers from the infamous “curse of the dimensionality”. In this tutorial we will introduce the challenges of indexing and searching high-dimensional data, and present the most recent tools available to “tame the curse”. At the end, the audience will have a good grasp of the current state of the art, the most promising research trends and the challenges still faced by the technology.

The tutorials, as I understand, are open to all participants on the conference. Mine will be held on Wednesday, October 7th from 14h40 to 18h20, with a 20′ coffee-break. If you use Google calendar, you can save the date by clicking on the button below.

* * *

I’ve unintentionally let an awful lot of of time pass since my last post — the move to Campinas (and to UNICAMP) has been wonderful, but also laborious. I thought that after moving across countries three times, moving across states would be a piece of cake, but it seems that, no matter the distance, moving is always a lot of hassle!

EDIT 11/11/09: The tutorial presentation, for the moment without narrative, is available on my talks and courses page.

Upcoming Talk: kNN Search and CBIR

I’ve been invited by Prof. Francisco Pelaez and Prof. Camila Barione of the Centre of Mathematics Computing and Cognition of the Federal University of ABC to give my talk on kNN Search and CBIR (Content Based Image Retrieval).

I will discuss my past work, showing the three methods I’ve proposed during my thesis on high-dimensional multimedia indexing on large databases. But I also discuss some of my new research pursuits, related to the use of very discriminant local descriptors, like SIFT, on complex semantic queries, which require generalisation.

The talk, in Portuguese, will be on Tuesday May 26, at the Block B, room A801 of the Federal University of the ABC, which is located at the Rua Santa Adélia, 166, Santo André — SP, Brazil, CEP 09210-170. Their phone number is +55 11 4996-3166.

If you have a Google Calendar, you can save the date by clicking below:

Upcoming Talk: Three New Methods for kNN Search

Prof. Ricardo Torres has invited me to the Institute of Computing of the State University of Campinas, where I am giving a talk on the work I’ve done on my thesis. I will explore the challenges of kNN search (also known as k nearest neighbours search, or simply similarity search) and discuss the three original methods I’ve proposed: the 3-way trees, which are based on the traditional KD-Tree with the addition of redundant overlapping nodes; the projection KD-Forests, my first attempt of using an index composed of multiple moderate-dimensional sub-indexes; and finally the Multicurves, an index based on the use of multiple moderate-dimensional space-filling curves, which has several nice properties like ease of implementation, dynamicity (tolerance to insertions and deletions without performance degradation) and avoidance of random accesses (thus making secondary-memory implementation easier).

The talk will be in Portuguese.