Paper Accepted at WebSci 2014 on Image Popularity in Social Networks

We had a paper accepted at the ACM Web Science Conference 2014 (WebSci 2014): a study of how well visual attributes can predict the popularity of images on Pinterest (measured as the number of repins). Not very surprisingly, we found that social attributes are more predictive of popularity than automatically extracted visual attributes. However, for heavily followed users, visual attributes account for a considerable fraction of the deviation from the expected behavior, after we factor out the most predictive social attribute (number of followers). This is shown in the featured image of this blog entry.

Here’s the full abstract of the paper:

Little is known on how visual content affects the popularity on social networks, despite images being now ubiquitous on the Web, and currently accounting for a considerable fraction of all content shared. Existing art on image sharing focuses mainly on non-visual attributes. In this work we take a complementary approach, and investigate resharing from a mainly visual perspective. Two sets of visual features are proposed, encoding both aesthetical properties (brightness, contrast, sharpness, etc.), and semantical content (concepts represented by the images). We collected data from a large image-sharing service (Pinterest) and evaluated the predictive power of different features on popularity (number of reshares). We found that visual properties have low predictive power compared to that of social cues. However, after factoring out social influence, visual features show considerable predictive power, especially for images with higher exposure, with over 3:1 accuracy odds when classifying highly exposed images between very popular and unpopular.

Accuracy of binary classification between very popular and very unpopular (middle case excluded) measured by number of repins after subtracting the influence of number of board followers. Error bars are standard deviations.
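To make "subtracting the influence of number of board followers" a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the abstract does not say how the social influence was factored out, so the log-log least-squares fit, the function names and the toy numbers below are all hypothetical, not the method or data of the paper. The last helper shows how an accuracy converts into odds, reading "3:1 accuracy odds" as correct-to-incorrect odds, i.e. 75% accuracy.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_popularity(followers, repins):
    """Log-repins minus the value expected from log-followers alone.

    Assumption: a log-log linear fit stands in for "the influence of the
    number of followers"; the paper may have used a different model.
    """
    xs = [math.log1p(f) for f in followers]
    ys = [math.log1p(r) for r in repins]
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def accuracy_to_odds(accuracy):
    """Odds of a correct versus an incorrect classification."""
    return accuracy / (1.0 - accuracy)


# Toy, made-up numbers purely for illustration (not data from the paper).
followers = [10, 100, 1000, 10000, 100000, 1000000]
repins = [1, 2, 30, 40, 900, 1100]
residuals = residual_popularity(followers, repins)
```

Images with a positive residual did better than their follower count alone would suggest, and those with a negative residual did worse. A classifier would then try to tell the two extremes apart from visual features only.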


The paper was a cooperation between my post-doc Dr. Sandra Avila and me, from the RECOD Lab here at the State University of Campinas, and students Luam Totti and Felipe Costa and Profs. Wagner Meira Jr. and Virgílio Almeida, from InWeb (National Institute of Science and Technology for the Web) at the Federal University of Minas Gerais.

The full text is available on my publications page.

In accordance with our policy of improving the reproducibility of our published results, both the data and the code of the paper are available. Due to the restrictions of FigShare, the dataset is in a fragmented zipped SQL dump. I thank Luam Totti very much for agreeing to put in the effort to make that possible.


Paper published at JASIST

Prof. Jacques Wainer and I had our paper “What happens to computer science research after it is published? Tracking CS research lines” issued for early view in the Journal of the American Society for Information Science and Technology (JASIST) (DOI: 10.1002/asi.22818). The last preprint, before the publishers’ corrections, is also available on my publications page. Here’s the abstract:

Are computer science papers extended after they are published? We have surveyed 200 computer science publications, 100 journal articles, and 100 conference papers, using self-citations to identify potential and actual continuations. We are interested in determining the proportion of papers that do indeed continue, how and when the continuation takes place, and whether any distinctions are found between the journal and conference populations. Despite the implicit assumption of a research line behind each paper, manifest in the ubiquitous “future research” notes that close many of them, we find that more than 70% of the papers are never continued.

In this paper we try to shed light on that “early stopping” phenomenon. Why do so many CS papers stay in the “first idea” phase? Does this interact with the atypical value that CS attributes to conferences? Is there a correlation (positive or negative) between any “quality” metric of the work and the probability of a continuation popping up?

My colleague and friend Jacques is a specialist in scientometrics and in social networks of cooperating scientists, which he analyses through webs of publications, co-publications, citations and co-citations. We spent about a year discussing the best statistical tools to tackle such complex phenomena, and then trying to translate the results back into socially meaningful conclusions. I’ll let you judge how well we have succeeded in the latter effort.

Goodbye, Google+

When Google+ started, I had the feeling it could become the social network for “the rest of us”, who don’t have a Facebook account for a reason. Those of us who want to share ideas, interesting stuff, without flinging open our personal lives online, without bothering our friends with endless invites to play some silly minigame.

It seems, though, that just as the service starts to get critical mass, Google has decided to make it “Orkut 2.0”, i.e., to turn it into yet another classical social network. Thanks, but no, thanks.

While the service had been getting slowly but progressively more intrusive over the last few months, I was under the impression that it was just bothering me, and that if I was careful enough with all the “new terms of service”, and “discover new features”, and “new settings” screens, I could keep enjoying it. But then my sister came to me and said:

“Could you just stop sending me those Google+ invites? It’s annoying.”

“Say what?!”

I found out that for weeks Google+ had been sending her e-mails every other day, telling her how much I wanted to keep in touch with her, and how much she was missing out on my updates by not joining the service. In my name. This, of course, is a huge breach of trust, and I don’t think I overreacted by deleting my profile immediately. Call it ragequit, if you want, but it’s over.

If you have a Google+ account, ask around: you might be, unknowingly, annoying friends, family, or worse, colleagues and clients. If you had the same experience as me, I’d love to hear from you. I’d be reassured to know I wasn’t the victim of a freaky bug or something.

What is worthy and what is puffy?

The only other scientist in the family is my cousin Laila, and as I try to navigate my way around this “web two dot oh of science”, I can hear in my mind her thoughtful advice: “before you dive into something, you should check how deep the water is”. Wise words: after all, people often suffer the metaphorical broken neck after investing energy into “the next big thing”, which later turns out to be a shallow pond.

Of course, it’s often impossible to know precisely the depth of the waters, especially in cases where much depends on the contributions of the user community. What I ask myself is: when can we start to foresee whether a new technology will tend to succeed or fail for a given purpose?

Take Google’s Orkut, for example. When I first heard about it, my colleagues were talking about a professional network, something like what LinkedIn is today, but it quickly became clear that it would not work for this purpose, and at least from this point of view, it failed. But as a social “for fun” network (like MySpace) it became a major success in Brazil.

For the last few months I’ve been trying a lot of stuff (web 2.0, web 1.0 or not web at all) which I think could be useful for my research team:

Google Tools

I’ve recently consolidated all my mailboxes into a single Gmail account and I am finding the service incredibly convenient. The Calendar is also fantastic, though I can’t believe they’ve left out such an essential item as the Task List. As for Google Docs, I find that they are much too rudimentary, even for lowbrow daily use.

Microsoft Groove

I liked the concept of Groove very much (and the little video demo is really really seductive), but I quickly bumped into the harsh reality that in the academic world, few people like to use Microsoft Office. What’s the use of adopting a communication platform and then having no one to communicate with?

I considered starting a major evangelising campaign (helped by the fact that MS Office is very cheap for students nowadays), but I gave up when I realised that Groove wasn’t available for Mac (trying to evangelise someone into MS Office is hard but possible, but trying to evangelise someone out of a Mac is just a waste of time) and after two or three “sorry, the service is experiencing some problems” messages. When evangelising people into a new product, it’s crucial that it works perfectly.


ThinkFree

I’ve been really captivated by ThinkFree. Feature-wise, it is far behind Office 2007 (or LaTeX, for that matter). But it’s way beyond Google Docs, and since it offers a decent (but by no means perfect) conversion to and from MS Office, and very good web integration, it really deserves a second look. The fact that it is available for three major platforms (Windows, Mac and Linux), which helps to avoid religious wars in the lab, is also a major selling point. I’ll be trying it for the next few weeks and will keep you updated.


CiteULike

I had great expectations for CiteULike, and I still find it a neat idea: putting your citation database online, sharing it with other researchers, and even creating groups of interest to share and discuss the papers. However, it seems that most users just use it as an online reference manager, and there is not much “two dot oh” synergy (discussion, active sharing, blogging, feedback…) happening in the groups I’ve visited so far. But I’ll keep my eye on it.


Zotero

My (silly) prejudice against applications implemented in the form of Firefox extensions has been largely dispelled by Zotero, a fantastic reference manager. Migrating the medium-sized database (about 160 entries) I had created in Endnote for my thesis was somewhat tricky, because of my extensive use of custom fields, but after half an hour of adapting Endnote’s Export Output Styles, everything went well. Now, I want to see how well Zotero will work in cooperation with word processors like MS Word, LyX and ThinkFree Write. I also want to see if Zotero + CiteULike can work well together.


LinkedIn

This one seems to be all the rage in the States. After creating my profile on LinkedIn, I was surprised to see that all my past classmates at UFMG, Brazil were already there. But I am still a little sceptical about the usefulness of the concept. I recognise the importance of “networking” but I am more doubtful about putting it online. I suppose that networking is a zero-sum game: if everyone is connected to everyone, the global effect is to level the playing field.

The “get introduced to” feature seems interesting, but do people use it?

As its own advertisement says, this one is “a Facebook for academics”. The idea is not bad, but the current implementation is dreadful: the Flash interface is slow, everything is organised in a rigid hierarchical way (what happens to labs shared by two universities?), and the information model favours painstaking “tree-like” browsing instead of direct search. After some hassle I was able to put my profile in, and the site seems to be gaining momentum. Maybe we can hope for a major UI improvement?