Monthly Archives: March 2013

Large image data sets with LIRE - some new numbers

People have lately asked whether LIRE can do more than linear search, and I always answered: yes, it should … but you know, I never tried. Finally I got around to indexing the MIR-FLICKR data set together with some of my Flickr-crawled photos and ended up with an index of 1,443,613 images. I used CEDD as the main feature and a hashing algorithm to put multiple hashes per image into Lucene, where they are interpreted as words. By tuning the similarity, employing a Boolean query, and adding a re-ranking step I ended up with a pretty decent approximate retrieval scheme, which is much faster and does not lose too many images on the way, which means the method has an acceptable recall. The image below shows the numbers along with a sample query. Linear search took more than a minute, while the hashing-based approach did (nearly) the same thing in less than a second. Note that this is just a sequential, straightforward approach, so no performance optimization has been done. Also, the hashing approach has not yet been investigated in detail, i.e. there are some parameters that still need some tuning … but let’s say it’s a step in the right direction.

[Figure: Results-CEDD-Hashing]
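To make the idea of "hashes as words, Boolean query, then re-rank" more concrete, here is a minimal sketch in plain Python. It is not LIRE's implementation: the random-hyperplane hash functions, the L1 distance used for re-ranking, and all names and parameters (make_hash_functions, num_hashes, bits_per_hash, num_candidates) are assumptions chosen for illustration, and an in-memory dictionary stands in for the Lucene index.

```python
import random
from collections import Counter, defaultdict


def make_hash_functions(dim, num_hashes, bits_per_hash, seed=42):
    """Create num_hashes groups of random hyperplanes (illustrative choice)."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(bits_per_hash)]
            for _ in range(num_hashes)]


def hash_terms(vector, hash_functions):
    """Turn one feature vector into several 'words', e.g. 'h3_101101'."""
    terms = []
    for i, planes in enumerate(hash_functions):
        bits = "".join(
            "1" if sum(p * v for p, v in zip(plane, vector)) >= 0 else "0"
            for plane in planes)
        terms.append(f"h{i}_{bits}")
    return terms


def build_index(vectors, hash_functions):
    """Inverted index: hash term -> set of image ids (stand-in for Lucene)."""
    index = defaultdict(set)
    for doc_id, vec in enumerate(vectors):
        for term in hash_terms(vec, hash_functions):
            index[term].add(doc_id)
    return index


def l1_distance(a, b):
    """Assumed re-ranking distance; the post does not name the one used."""
    return sum(abs(x - y) for x, y in zip(a, b))


def search(query, vectors, index, hash_functions,
           num_candidates=50, num_results=10):
    # Step 1: Boolean OR over the query's hash terms; count matching terms.
    votes = Counter()
    for term in hash_terms(query, hash_functions):
        for doc_id in index.get(term, ()):
            votes[doc_id] += 1
    # Step 2: keep only the candidates that share the most terms.
    candidates = [doc_id for doc_id, _ in votes.most_common(num_candidates)]
    # Step 3: re-rank the small candidate set with the real feature distance.
    candidates.sort(key=lambda d: l1_distance(query, vectors[d]))
    return candidates[:num_results]
```

The speed-up comes from step 3 touching only num_candidates feature vectors instead of the whole collection; the price is that a similar image sharing no hash term with the query is never considered, which is where recall is lost.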

Updates on LIRE (SVN rev 39)

LIRE is not a sleeping beauty, so there’s something going on in the SVN. I recently checked in updates to Lucene (now 4.2) and Commons Math (now 3.1.1). I also removed some remaining deprecation issues left over from Lucene 3.x.

The most notable addition, however, is the Extractor / Indexor class pair. They are command line applications that let you extract global image features from images, put them into an intermediate data file and then, with the help of Indexor, write them to an index. All images are referenced relative to the intermediate data file, so this approach can be used to preprocess a whole lot of images from different computers on a network file system. Extractor also takes a file list of images as input (one image per line) and can therefore easily be run in parallel: just split your global file list into n smaller, non-overlapping ones and run n Extractor instances. As the extraction part is the slow one, this should allow for a significant speed-up if used in parallel.
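Splitting the global file list can be scripted. The following is a small sketch, not part of LIRE; the function name split_file_list, the output names list_0.txt, list_1.txt, … and the UTF-8 encoding are assumptions (a list created by redirecting a Windows command may use a different encoding).

```python
from pathlib import Path


def split_file_list(list_file, n, out_dir):
    """Split a one-image-per-line list into n non-overlapping part files."""
    text = Path(list_file).read_text(encoding="utf-8")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    for i in range(n):
        # Round-robin assignment: every line lands in exactly one part.
        chunk = lines[i::n]
        part = out_dir / f"list_{i}.txt"
        part.write_text("".join(line + "\n" for line in chunk),
                        encoding="utf-8")
        parts.append(part)
    return parts
```

Calling split_file_list("list.txt", 4, "parts") would write four part files, each of which can then be handed to its own Extractor instance as input file.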

Extractor is run with

$> Extractor -i <infile> -o <outfile> -c <configfile>
  • <infile> gives the images, one per line. Use “dir /s /b *.jpg > list.txt” to create a compatible list on Windows.
  • <outfile> gives the location and name of the intermediate data file. Note: it has to be in a folder that is a parent of all images!
  • <configfile> gives the list of features as a Java Properties file. The supported features are listed at the end of this post. The properties file looks like:
    feature.1=net.semanticmetadata.lire.imageanalysis.CEDD
    feature.2=net.semanticmetadata.lire.imageanalysis.FCTH
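On systems without the Windows dir command, a compatible input list (absolute paths, one image per line) can be produced with a few lines of Python. This is only a sketch; the function name write_image_list and the default set of extensions are assumptions, not something LIRE prescribes.

```python
from pathlib import Path


def write_image_list(root, list_file, extensions=(".jpg", ".jpeg")):
    """Recursively list image files below root, one absolute path per line."""
    root = Path(root).resolve()
    paths = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions)
    Path(list_file).write_text("".join(f"{p}\n" for p in paths),
                               encoding="utf-8")
    return len(paths)
```

The return value is the number of images written, which is handy for a quick sanity check before starting a long extraction run.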

Indexor is run with

$> Indexor -i <input-file> -l <index-directory>
  • <input-file> is the output file of Extractor, the intermediate data file.
  • <index-directory> is the directory of the index to which the images will be added (appended, not overwritten).

Features supported by Extractor:

  • net.semanticmetadata.lire.imageanalysis.CEDD
  • net.semanticmetadata.lire.imageanalysis.FCTH
  • net.semanticmetadata.lire.imageanalysis.OpponentHistogram
  • net.semanticmetadata.lire.imageanalysis.JointHistogram
  • net.semanticmetadata.lire.imageanalysis.AutoColorCorrelogram
  • net.semanticmetadata.lire.imageanalysis.ColorLayout
  • net.semanticmetadata.lire.imageanalysis.EdgeHistogram
  • net.semanticmetadata.lire.imageanalysis.Gabor
  • net.semanticmetadata.lire.imageanalysis.JCD
  • net.semanticmetadata.lire.imageanalysis.JpegCoefficientHistogram
  • net.semanticmetadata.lire.imageanalysis.ScalableColor
  • net.semanticmetadata.lire.imageanalysis.SimpleColorHistogram
  • net.semanticmetadata.lire.imageanalysis.Tamura
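
A config file for any subset of these features can be generated rather than typed by hand. The sketch below follows the feature.1=…, feature.2=… pattern shown earlier; that the keys must be numbered consecutively from 1 is an assumption based on that example, and write_extractor_config is a hypothetical helper name.

```python
PACKAGE = "net.semanticmetadata.lire.imageanalysis."


def write_extractor_config(feature_classes, config_file):
    """Write a Java Properties file with one feature.N=<class> line each."""
    # Assumption: keys are numbered consecutively, starting at 1.
    lines = [f"feature.{i}={name}"
             for i, name in enumerate(feature_classes, start=1)]
    # ASCII is a safe subset of the encoding Java uses for .properties files.
    with open(config_file, "w", encoding="ascii") as f:
        f.write("\n".join(lines) + "\n")
    return lines
```

For example, write_extractor_config([PACKAGE + "CEDD", PACKAGE + "FCTH"], "extractor.properties") writes the two-line file from the example above.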