I was wondering when people actually uploaded all their stuff, so I started to grab the uploads per minute on a regular basis some time ago (see here). It seems that people upload 4,682 images per minute on average. This has remained more or less stable since the last experiment, which gave an average of 4,602. Now I have enough data, 1,938 samples, for a first shot at the question: when do people upload their stuff?
It seems that people concentrate their uploads between 3 pm and 11 pm (CET) and that there is not much going on around 8-10 am (CET). That looks reasonable to me if the typical Flickr user is American or European 🙂
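A profile like this can be derived by averaging the samples per hour of day. The following is only a minimal sketch: the input format (one line per sample, "HH:MM count", with times already in CET) and the function name hourly_profile are assumptions for illustration, not the format actually used for the data above.

```shell
#!/usr/bin/env bash
# Read lines of the (assumed) form "HH:MM count" on stdin and print the
# mean count for every hour of day that has at least one sample.
hourly_profile() {
  awk '
    NF >= 2 {
      split($1, t, ":")        # t[1] holds the hour part
      h = t[1] + 0             # force a number, so "09" becomes 9
      sum[h] += $2
      cnt[h]++
    }
    END {
      for (h = 0; h < 24; h++)
        if (cnt[h] > 0) printf "%02d %.0f\n", h, sum[h] / cnt[h]
    }
  '
}
```

Feeding it a file of samples, for instance with hourly_profile < samples.txt (a hypothetical file name), prints one line per hour, which is enough to spot the busy and the quiet times of day.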
I grabbed the number of photos uploaded in the last minute every 5 minutes from February 25th 2010, 1 pm to March 5th 2010, 2.30 pm (CET). This resulted in 1,703 samples. That is fewer than the time span would suggest, because timeouts and network problems occurred in between. Nevertheless I found:
- Average uploads per minute: 4,602
- Sample minimum and maximum: 1,852 and 8,849
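Figures like these are quick to compute once the samples sit in a text file. Here is a small sketch, assuming one upload count per line; the function name summarize_samples and the file layout are made up for illustration. Empty lines, as a failed poll might leave behind, are skipped.

```shell
#!/usr/bin/env bash
# Print the number of samples, their mean, minimum and maximum.
# Input: one number per line on stdin (assumed format).
summarize_samples() {
  awk '
    NF == 0 { next }                 # skip empty lines
    {
      v = $1 + 0                     # treat the field as a number
      if (n == 0) { min = v; max = v }
      n++
      sum += v
      if (v < min) min = v
      if (v > max) max = v
    }
    END {
      if (n > 0) printf "n=%d avg=%.0f min=%d max=%d\n", n, sum / n, min, max
    }
  '
}
```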
This is interesting, as many research papers (including some of my own) talk about upload rates of 6,000-8,000 photos per minute, which definitely does not reflect the average value. This might be a good reminder that systems used by real people change over time and that we have to take a look at the numbers periodically.
The contribution by Christoph Kofler and me, titled “An exploratory study on the explicitness of user intentions in digital photo retrieval”, has been accepted for publication and presentation at the I-Know ’09. Here is the abstract (the full paper will follow as soon as we have prepared the camera-ready version):
Search queries are typically interpreted as specification of information need of a user. Typically the search query is either interpreted as is or based on the context of a user, being for instance a user profile, his/her previously undertaken searches or any other background information. The actual intent of the user – the goal s/he wants to achieve with information retrieval – is an important part of a user’s context. In this paper we present the results of an exploratory study on the interplay between the goals of users and their search behavior in multimedia retrieval.
This work has been supported by the SOMA project.
I’m currently testing a new implementation of an approximate search index for content-based image retrieval. The performance tests in particular have become interesting, as I didn’t have access to a really big data set. So what to do?
I have programmed a lot of spiders and grabbers before, so I knew that there is a lot of data available on Flickr 🙂 But I was still searching for an easy way. Now here is my approach (using bash, of course):
wget -q -O - 'http://api.flickr.com/services/feeds/photos_public.gne?format=atom' | grep -o '.............static.*m.jpg' | wget -i -
Why should this work?
- The first wget command gets a list of recent photos as an Atom feed.
- The grep command picks out all the medium-sized pictures (suffix “m.jpg”).
- The run of dots and the “static” are just a nice trick to get the right ones, the real image content: each dot matches one arbitrary character in front of “static”, i.e. the start of the image URL.
- Finally the second wget downloads the images from the server.
Issuing this command one should get ~ 25 photos in one go. Using a bash loop or a cron job you can of course get a lot more in an unattended way 🙂
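The bash loop mentioned above could look roughly like the following. Treat it as a sketch under assumptions: the function names (extract_image_urls, grab_once, grab_loop), the target directory and the pause between rounds are all invented for illustration; only the feed URL and the grep pattern are taken from the one-liner. The sort -u and the -nc (no-clobber) flag are additions to avoid fetching the same photo twice.

```shell
#!/usr/bin/env bash
# Feed URL as used in the one-liner above.
FEED_URL='http://api.flickr.com/services/feeds/photos_public.gne?format=atom'

# Read a feed on stdin and print the medium-sized image URLs, one per line.
# Same pattern as in the one-liner; sort -u drops duplicate URLs.
extract_image_urls() {
  grep -o '.............static.*m.jpg' | sort -u
}

# Fetch the feed once and download every listed image into directory $1.
grab_once() {
  wget -q -O - "$FEED_URL" | extract_image_urls | wget -q -nc -P "$1" -i -
}

# Run grab_once $2 times, sleeping $3 seconds between rounds.
grab_loop() {
  local dir=$1 rounds=$2 pause=$3 i
  mkdir -p "$dir"
  for ((i = 0; i < rounds; i++)); do
    grab_once "$dir"
    sleep "$pause"
  done
}
```

Calling grab_loop ./photos 100 60, for example, would make 100 rounds one minute apart. Nothing runs until one of the functions is called, so the block can also be sourced from another script or from a cron job wrapper.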
While writing a scientific paper on tag recommendation I checked, just out of curiosity, the share of images tagged by their uploaders on Flickr. I found out that 4 out of 5 images are untagged and that less than 15% of images have 2 or more tags.
My method and detailed results: In general one would need a random sample for such an investigation, but a truly random sample is hard to obtain without access to the database. Therefore I just grabbed 20,004 images from the RSS feed for recent uploads and counted the number of tagged images. As it was easy enough, I also computed the confidence intervals:
- In my sample 3,650 images were tagged with at least one tag, which makes p1=18.25%.
- At a confidence level of 99%, p1 lies in [16.84%, 19.66%].
- That leaves more than 4 out of 5 images untagged.
- Also in my sample 2,628 images were tagged with at least two tags, which makes p2=13.14%.
- At a confidence level of 99%, p2 lies in [11.9%, 14.37%].
- That means that less than 15% of the images have more than one tag.
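For readers who want to compute such an interval themselves, here is one common way to do it: the normal-approximation (Wald) interval for a proportion. This is a sketch only. The method behind the intervals quoted above is not stated, so this function may not reproduce them exactly, and the name proportion_interval is invented for illustration.

```shell
#!/usr/bin/env bash
# Print the lower and upper bound (in percent) of a normal-approximation
# confidence interval for the proportion k/n.
#   $1 = k, number of "hits" (e.g. tagged images)
#   $2 = n, sample size
#   $3 = z, standard normal quantile (about 2.576 for a two-sided 99% interval)
proportion_interval() {
  awk -v k="$1" -v n="$2" -v z="$3" 'BEGIN {
    p = k / n
    half = z * sqrt(p * (1 - p) / n)   # half-width of the interval
    printf "%.2f %.2f\n", 100 * (p - half), 100 * (p + half)
  }'
}
```

With the counts from the sample, the call would be proportion_interval 3650 20004 2.576 for images with at least one tag and proportion_interval 2628 20004 2.576 for images with at least two.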