Large image data sets with LIRE – some new numbers

People have lately asked whether LIRE can do more than linear search, and I always answered: Yes, it should … but you know, I never tried. Finally I got around to indexing the MIR-FLICKR data set and some of my Flickr-crawled photos and ended up with an index of 1,443,613 images. I used CEDD as the main feature and a hashing algorithm to put multiple hashes per image into Lucene, where they are interpreted as words. By tuning similarity, employing a Boolean query, and adding a re-rank step I ended up with a pretty decent approximate retrieval scheme, which is much faster and does not lose too many images on the way, which means the method has an acceptable recall. The image below shows the numbers along with a sample query. Linear search took more than a minute, while the hashing-based approach did (nearly) the same thing in less than a second. Note that this is just a sequential, straightforward approach, so no performance optimization has been done. Also, the hashing approach has not yet been investigated in detail, i.e. there are some parameters that still need tuning … but let’s say it’s a step in the right direction.

[Image: Results-CEDD-Hashing]
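For readers who want to see the idea in code, here is a minimal sketch of the scheme described above: several hash "words" per image in an inverted index, a Boolean OR query to collect candidates, and a re-rank step using the real feature distance. It is not LIRE's code. The hash family (sign of random projections), the plain dictionary standing in for Lucene, the Euclidean distance standing in for the CEDD distance measure, and every name and parameter below are assumptions made for illustration.

```python
import random
from collections import defaultdict


def make_projections(dim, num_bits, num_tables, seed=42):
    """Random hyperplanes: num_tables groups of num_bits vectors (assumed hash family)."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(num_bits)]
            for _ in range(num_tables)]


def hash_words(vec, projections):
    """Turn a feature vector into one text 'word' per hash table."""
    words = []
    for table_id, table in enumerate(projections):
        # Each bit records on which side of a random hyperplane the vector lies.
        bits = "".join(
            "1" if sum(w * x for w, x in zip(plane, vec)) >= 0 else "0"
            for plane in table
        )
        words.append(f"t{table_id}_{bits}")
    return words


def distance(a, b):
    """Euclidean distance, a stand-in for the real feature's distance measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class HashedIndex:
    """Hypothetical toy index; a dict of posting sets plays the role of Lucene."""

    def __init__(self, dim, num_bits=8, num_tables=10):
        self.projections = make_projections(dim, num_bits, num_tables)
        self.postings = defaultdict(set)  # word -> ids of images containing it
        self.vectors = []                 # image id -> feature vector

    def add(self, vec):
        doc_id = len(self.vectors)
        self.vectors.append(list(vec))
        for word in hash_words(vec, self.projections):
            self.postings[word].add(doc_id)
        return doc_id

    def search(self, query, k=10, num_candidates=100):
        # Boolean OR over the query's words; score = number of shared words.
        votes = defaultdict(int)
        for word in hash_words(query, self.projections):
            for doc_id in self.postings.get(word, ()):
                votes[doc_id] += 1
        candidates = sorted(votes, key=lambda d: (-votes[d], d))[:num_candidates]
        # Re-rank the short candidate list with the real distance.
        candidates.sort(key=lambda d: distance(query, self.vectors[d]))
        return candidates[:k]

    def linear_search(self, query, k=10):
        # Baseline: compare the query against every indexed vector.
        ids = sorted(range(len(self.vectors)),
                     key=lambda d: distance(query, self.vectors[d]))
        return ids[:k]


if __name__ == "__main__":
    rng = random.Random(0)
    index = HashedIndex(dim=16)
    for _ in range(2000):
        index.add([rng.gauss(0.0, 1.0) for _ in range(16)])
    query = index.vectors[0]
    approx = index.search(query, k=10)
    exact = index.linear_search(query, k=10)
    print("recall@10:", len(set(approx) & set(exact)) / 10)
```

In this sketch, `num_candidates` is the knob that trades speed for recall: only that many images are compared with the real distance, so a larger value gets closer to the linear search result at a higher cost. The number of bits per word and the number of tables are further parameters of the kind that would need tuning.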

3 thoughts on “Large image data sets with LIRE – some new numbers”

  1. Mathias Lux Post author

    Currently it’s rather complicated. There is a test class at http://goo.gl/D7Vfr which shows the overall approach. However, I’ll try to integrate DocumentBuilders and the like in the next few days and will post as soon as it is done.

  2. jackyma

    I have millions of images to index. It took me days to index them using Lire. Is there any method to accelerate it (such as multi-threaded indexing in Lire)?

    Thanks
