Tag Archives: tricks

Dealing with images Java can’t handle out of the box

A frequently asked question on the mailing list is: Lire cannot handle my images, what can I do? In most cases it turns out that Java can’t read those images, so the indexing routine can’t create a pixel array from the file. Java is unfortunately limited in its ability to handle images, but there are two basic workarounds.

(1) You can convert all images to a format that Java can handle. Just use ImageMagick or a similar tool to batch process your images and convert them all to RGB color JPEGs. This is a non-Java approach and definitely the faster one.

(2) You can circumvent the ImageIO.read(..) method by using ImageJ. ImageJ provides the ImagePlus class, which supports loading and decoding of various formats and is much more error resilient than the pure Java SDK method. This approach does not increase speed, however; if anything, it is slower.

You can find a code example showing how to do this in the wiki.
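
Before picking one of the two workarounds, it can help to find out which files Java actually fails on. The following is a minimal sketch and not part of Lire: the class name ImageReadCheck and the method isReadableByJava are hypothetical, and it relies only on the fact that ImageIO.read(..) returns null when no registered reader understands the format and throws for many damaged or unsupported files.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of Lire: reports files the plain Java SDK cannot decode.
final class ImageReadCheck {

    /**
     * Returns true if ImageIO can decode the file into a BufferedImage.
     * ImageIO.read(..) returns null when no registered reader matches the
     * format, and throws for missing, damaged or unsupported files.
     */
    static boolean isReadableByJava(File file) {
        try {
            BufferedImage image = ImageIO.read(file);
            return image != null;
        } catch (IOException | RuntimeException e) {
            // Treat any decoding failure as "Java cannot handle this image".
            return false;
        }
    }

    public static void main(String[] args) {
        // Each argument is a path to an image file; print the ones that need a workaround.
        for (String path : args) {
            if (!isReadableByJava(new File(path))) {
                System.out.println("needs conversion: " + path);
            }
        }
    }
}
```

Files reported by this check are the candidates for batch conversion or for loading through ImageJ instead.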

Searching with Lire in big datasets

Having received several complaints about the slowness of Lire when searching in 100k+ documents, I took the time to write a small how-to explaining approaches for searching in (relatively) big data sets.

Lire can create indexes with lots of different features (descriptors, like RGB color histograms or CEDD). While this gives flexibility at search time, as we can select the feature when we create a query, the index tends to get bigger and bigger and searches take longer and longer.

With a data set of 121,379 images, the index created with the features selected by default in Lire Demo has a size of 14.3 GB on disk. In contrast, an index storing just the CEDD feature along with the image identifier has a size of 29 MB.

Due to the size of the index, linear search also gets slower. For the index stripped down to the CEDD feature and the identifier, searching takes roughly 0.33 seconds (on an AMD quad-core computer with 4 GB RAM and Java 1.7), while searching the big index takes 7 minutes and 3 seconds.

So if you want to index and search big data sets (more than 100,000 images, for instance), I recommend that you

  • select which features you need,
  • create the index with a minimum set of features,
  • possibly split the index per feature and select the index on the fly instead of the feature, and
  • optionally load the index into RAM.
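
The idea of splitting the index per feature and picking the index at query time can be sketched as follows. This is only an illustration under stated assumptions: the class FeatureIndexSelector, the method indexFor and the directory layout index-<feature> are hypothetical and not part of Lire. Opening the selected directory with Lucene is left out.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lire: maps a feature name to its own small index directory.
final class FeatureIndexSelector {
    private final Map<String, Path> indexByFeature = new LinkedHashMap<>();

    FeatureIndexSelector(Path baseDir, String... features) {
        for (String feature : features) {
            String key = feature.toLowerCase(Locale.ROOT);
            // Assumed layout: one index directory per feature, e.g. <baseDir>/index-cedd
            indexByFeature.put(key, baseDir.resolve("index-" + key));
        }
    }

    /** Returns the directory of the index that stores only the given feature. */
    Path indexFor(String feature) {
        Path path = indexByFeature.get(feature.toLowerCase(Locale.ROOT));
        if (path == null) {
            throw new IllegalArgumentException("No index built for feature: " + feature);
        }
        return path;
    }
}
```

With a selector built for, say, "CEDD" and "ColorHistogram", a query using CEDD would open only the directory returned by indexFor("CEDD"), so each search touches one small index instead of the large combined one.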

For more on loading the index into RAM and the option to use local features, read on in the developer wiki.