The Solr plugin itself is fully functional for Solr 4.4 and the source is available at https://bitbucket.org/dermotte/liresolr. There is a markdown document, README.md, explaining what can be done with the plugin and how to actually install it. Basically, it can do content-based search and content-based re-ranking of text searches, and it brings along a custom field implementation and sub-linear search based on hashing.
The new LIRE web demo is based on Apache Solr and features an index of the MIRFLICKR data set. The new architecture allows for extremely fast retrieval. Moreover, there’s a new walk-through video with some short peeks behind the screen. The source of the plugin will be released in the near future.
The beta update features (i) improvements on local feature handling, i.e. stronger quantization of local feature histograms and several bug fixes, (ii) critical bug fixes for CEDD and JCD, which were not thread safe, and (iii) improvements on the ParallelExtractor and Indexor classes as well as the intermediate binary format.
I’ve just uploaded LIRE 0.9.4 beta to the Google Code downloads page. This is an intermediate release that reflects several changes within the SVN trunk. Basically, I put it online as there are many, many bugs solved in this one and it’s performing much, much faster than the 0.9.3 release. If you want to get the latest version, I’d recommend sticking to the SVN. However, I’m currently changing a lot of feature serialization methods, so there’s no guarantee that an index created with 0.9.4 beta will work with any newer version. Note also that the release does not work with older indexes.
Major changes include, but are not limited to:
New features: PHOG, local binary patterns and binary patterns pyramid
Parallel indexing: a producer-consumer based indexing application that makes heavy use of available CPU cores. On a current Intel Core i7 or a considerably large Intel Xeon system, it is able to reduce feature extraction to a marginal overhead on top of disk I/O.
Intermediate byte based feature data files: a new way to extract features in a distributed way
In-memory cached ImageSearcher: as long as there is enough memory all linear searching is done in memory without much disk I/O (cp. class GenericFastImageSearcher and set caching to true)
Approximate indexing based on hashing: tests with 1.5 million images led to search times < 300 ms (cp. GenericDocumentBuilder with hashing set to true and BitSamplingImageSearcher)
Footprint of many global descriptors has been significantly reduced. Examples: EdgeHistogram 40 bytes, ColorLayout 504 bytes, FCTH 96 bytes, …
New unit test for benchmarking features on the UCID data set.
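To illustrate the general idea behind approximate search based on hashing, here is a minimal sketch of random-projection sign hashing, a common locality-sensitive hashing scheme. It is an assumption chosen for illustration and not necessarily what BitSamplingImageSearcher does internally; the class SignHashSketch and all of its methods are hypothetical names. Similar feature vectors tend to share most hash bits, so candidates can be pre-selected by comparing short hashes instead of full descriptors.

```java
import java.util.Random;

/** Hypothetical illustration of hashing for approximate search, not LIRE code. */
final class SignHashSketch {
    // One random direction per hash bit.
    private final double[][] projections;

    SignHashSketch(int dimensions, int bits, long seed) {
        if (bits < 1 || bits > 32) {
            throw new IllegalArgumentException("bits must be in [1, 32]");
        }
        Random rnd = new Random(seed);
        projections = new double[bits][dimensions];
        for (double[] row : projections) {
            for (int d = 0; d < dimensions; d++) {
                row[d] = rnd.nextGaussian();
            }
        }
    }

    /** Sets bit b to 1 if the feature lies on the positive side of projection b. */
    int hash(double[] feature) {
        int h = 0;
        for (int b = 0; b < projections.length; b++) {
            double dot = 0;
            for (int d = 0; d < feature.length; d++) {
                dot += projections[b][d] * feature[d];
            }
            if (dot > 0) {
                h |= (1 << b);
            }
        }
        return h;
    }

    /** Number of differing hash bits; small values indicate likely similar features. */
    static int hammingDistance(int h1, int h2) {
        return Integer.bitCount(h1 ^ h2);
    }
}
```

In such a scheme the hashes would be stored next to the descriptors, a query would first collect the documents with the smallest Hamming distance, and only those candidates would be re-ranked with the actual descriptor distance.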
The ACM Multimedia Systems conference (http://www.mmsys.org) provides a forum for researchers, engineers, and scientists to present and share their latest research findings in multimedia systems. While research about specific aspects of multimedia systems is regularly published in the various proceedings and transactions of the networking, operating system, real-time system, and database communities, MMSys aims to cut across these domains in the context of multimedia data types. This provides a unique opportunity to view the intersections and interplay of the various approaches and solutions developed across these domains to deal with multimedia data types. Furthermore, MMSys provides an avenue for communicating research that addresses multimedia systems holistically.
As an integral part of the conference since 2012, the Dataset Track provides an opportunity for researchers and practitioners to make their work available (and citable) to the multimedia community. MMSys encourages and recognizes dataset sharing, and seeks contributions in all areas of multimedia (not limited to MM systems). Authors publishing datasets will benefit by increasing the public awareness of their effort in collecting the datasets.
In particular, authors of datasets accepted for publication will receive:
Dataset hosting from MMSys for at least 5 years
Citable publication of the dataset description in the proceedings published by ACM
15 minutes of oral presentation time at the MMSys 2014 Dataset Track
All submissions will be peer-reviewed by at least two members of the technical program committee of MMSys 2014. Datasets will be evaluated by the committee on the basis of the collection methodology and the value of the dataset as a resource for the research community.
Authors interested in submitting a dataset should
(A) Make their data available by providing a public URL for download
(B) Write a short paper describing:
motivation for data collection and intended use of the data set,
the format of the data collected,
the methodology used to collect the dataset, and
basic characterizing statistics from the dataset.
Papers should be at most 6 pages long (in PDF format) prepared in the ACM style and written in English.
Data set paper submission deadline: November 11, 2013
While I knew that performance did not skyrocket with Lucene 4.0, I finally got around to finding out why. Unfortunately, the field compression technique applied in Lucene 4.x compresses each and every stored field … and decompresses it upon access. This adds a considerable overhead when reading the index in a linear way, which is exactly one of the main methods of LIRE.
The image shows a screen shot of the CPU sampler in VisualVM. 58.7% of the CPU time goes to the LZ4 decompression routine. That’s quite a lot and makes a huge difference for search. If anyone has a workaround of some sort, I’d be happy.
Update (2013-07-03): With the great help of the people from the lucene-user list I found at least a speed-up. In the current SVN version, there is a novel LireCustomCodec for stored fields, which speeds up decompression a lot. Moreover, there is now an in-memory caching approach implemented in the GenericFastImageSearcher class, which is turned off by default, but speeds up search (as a trade-off for memory and init time) by holding image features in memory. It has been tested with up to 1.5M images.
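The caching idea can be sketched roughly as follows. This is only an illustration under stated assumptions and not the code of GenericFastImageSearcher: the class CachedLinearSearcher, the use of plain double arrays as features and the L1 distance are all hypothetical choices. The point is that features are read (and decompressed) once at init time, and every later query is a linear scan over memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration of in-memory cached linear search, not LIRE code. */
final class CachedLinearSearcher {
    private final List<double[]> cache = new ArrayList<>();

    /** Copies all features once; this is the memory and init time trade-off. */
    CachedLinearSearcher(List<double[]> featuresFromIndex) {
        for (double[] feature : featuresFromIndex) {
            cache.add(feature.clone());
        }
    }

    /** L1 distance, assumed here as a simple stand-in for a descriptor's metric. */
    static double l1(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /** Returns the positions of the k nearest cached features, closest first. */
    List<Integer> search(double[] query, int k) {
        List<Integer> ids = new ArrayList<>();
        if (k <= 0) {
            return ids;
        }
        // Max-heap on distance: the worst hit kept so far is always on top.
        PriorityQueue<double[]> heap =
                new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> -e[1]));
        for (int id = 0; id < cache.size(); id++) {
            double distance = l1(query, cache.get(id));
            if (heap.size() < k) {
                heap.add(new double[]{id, distance});
            } else if (distance < heap.peek()[1]) {
                heap.poll();
                heap.add(new double[]{id, distance});
            }
        }
        // The heap yields the worst hit first, so prepend to get closest first.
        while (!heap.isEmpty()) {
            ids.add(0, (int) heap.poll()[0]);
        }
        return ids;
    }
}
```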
The current LireDemo 0.9.4 beta release features a new indexing routine, which is much faster than the old one. It’s based on the producer-consumer principle and makes, hopefully, optimal use of I/O and up to 8 cores of a system. Moreover, the new PHOG feature implementation is included and you can give it a try. Furthermore, JCD, FCTH and CEDD got a more compact representation of their descriptors and use much less storage space now. Several small changes include parameter tuning on several descriptors and so on. All the changes have been documented in the CHANGES.txt file in the SVN.
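For readers not familiar with the producer-consumer principle, here is a minimal sketch, assuming one producer that feeds file paths into a bounded queue and several consumers that take paths from it and extract features. It is not the LireDemo indexing routine; ProducerConsumerSketch, indexAll and the placeholder extract method are hypothetical names, and extract only fakes the expensive work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical illustration of producer-consumer indexing, not LIRE code. */
final class ProducerConsumerSketch {

    /** Placeholder for feature extraction (assumption: the CPU-heavy step). */
    static String extract(String path) {
        return path + ":" + path.length();
    }

    static List<String> indexAll(List<String> paths, int numConsumers) {
        // Bounded queue: the producer blocks when the consumers fall behind.
        BlockingQueue<Optional<String>> queue = new ArrayBlockingQueue<>(64);
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();

        Thread producer = new Thread(() -> {
            try {
                for (String path : paths) {
                    queue.put(Optional.of(path));
                }
                // One empty marker per consumer signals the end of the input.
                for (int i = 0; i < numConsumers; i++) {
                    queue.put(Optional.empty());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        List<Thread> consumers = new ArrayList<>();
        for (int i = 0; i < numConsumers; i++) {
            consumers.add(new Thread(() -> {
                try {
                    while (true) {
                        Optional<String> item = queue.take();
                        if (!item.isPresent()) {
                            break; // end marker received
                        }
                        results.add(extract(item.get()));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }

        producer.start();
        for (Thread consumer : consumers) {
            consumer.start();
        }
        try {
            producer.join();
            for (Thread consumer : consumers) {
                consumer.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return new ArrayList<>(results);
    }
}
```

The bounded queue is the main design choice in this sketch: it keeps memory use flat and lets disk reading and extraction overlap, so whichever side is slower determines the throughput.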
The ACM Multimedia Open-Source Software Competition celebrates the invaluable contribution of researchers and software developers who advance the field by providing the community with implementations of codecs, middleware, frameworks, toolkits, libraries, applications, and other multimedia software. This year will be the sixth year the competition is run as part of the ACM Multimedia program.
To qualify, software must be provided with source code and licensed in such a manner that it can be used free of charge in academic and research settings. For the competition, the software will be built from the sources. All source code, license, installation instructions and other documentation must be available on a public web page. Dependencies on non-open source third-party software are discouraged (with the exception of operating systems and commonly found commercial packages available free of charge). To encourage more diverse participation, previous years’ non-winning entries are welcome to re-submit for the 2013 competition. Student-led efforts are particularly encouraged.
Authors are highly encouraged to prepare as much documentation as possible, including examples of how the provided software might be used, download statistics or other public usage information, etc. Entries will be peer-reviewed to select entries for inclusion in the conference program as well as an overall winning entry, to be recognized formally at ACM Multimedia 2013. The criteria for judging all submissions include broad applicability and potential impact, novelty, technical depth, demo suitability, and other miscellaneous factors (e.g., maturity, popularity, student-led, no dependence on closed source, etc.).
Authors of the winning entry, and possibly additional selected entries, will be invited to demonstrate their software as part of the conference program. In addition, accepted overview papers will be included in the conference proceedings.
In the current SVN version, three global features have been revisited in terms of serialization. This was necessary as the index of the web demo with 300k images already exceeded 1.5 GB.
This significant reduction in space leads to (i) smaller indexes, (ii) reduced I/O time, and (iii) therefore, to faster search.
How was this done? Basically, it’s clever organization of bytes. In the case of JCD, the histogram has 168 entries, each of which fits into 4 bits (a value in [0, 15]), so basically half a byte. Therefore, you can stuff 2 of these values into one byte, but you have to take care of the fact that Java’s bit-wise operators work on ints and that bytes are signed. So the trick is to combine both values into an integer in [0, 2^8-1] and then subtract 128 to get it into the byte range [-128, 127]. The inverse is done for reading. The rest is common bit shifting.
The code can be seen either in the JCD.java file in the SVN, or in the snippet at pastebin.com for your convenience.
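As a rough illustration of the packing trick, and explicitly not the code from JCD.java, here is a minimal sketch. It assumes that every value already lies in [0, 15]; the class NibblePacking and its methods pack and unpack are hypothetical names. Two values are combined into one integer in [0, 255], which is then shifted by 128 into the signed byte range.

```java
import java.util.Arrays;

/** Hypothetical illustration of packing two 4-bit values per byte, not the JCD.java code. */
final class NibblePacking {

    /** Packs values in [0, 15] two per byte; an odd-length input is padded with 0. */
    static byte[] pack(int[] values) {
        byte[] packed = new byte[(values.length + 1) / 2];
        for (int i = 0; i < packed.length; i++) {
            int high = values[2 * i];
            int low = (2 * i + 1 < values.length) ? values[2 * i + 1] : 0;
            if (high < 0 || high > 15 || low < 0 || low > 15) {
                throw new IllegalArgumentException("values must be in [0, 15]");
            }
            int combined = (high << 4) | low;    // integer in [0, 255]
            packed[i] = (byte) (combined - 128); // shift into [-128, 127]
        }
        return packed;
    }

    /** Inverse of pack; length is the number of values that were packed. */
    static int[] unpack(byte[] packed, int length) {
        int[] values = new int[length];
        for (int i = 0; i < length; i++) {
            int combined = packed[i / 2] + 128;  // back to [0, 255]
            values[i] = (i % 2 == 0) ? (combined >> 4) : (combined & 0x0F);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] histogram = {0, 15, 7, 3, 9};
        byte[] packed = pack(histogram);
        int[] restored = unpack(packed, histogram.length);
        System.out.println(packed.length + " bytes, round trip ok: "
                + Arrays.equals(histogram, restored));
    }
}
```

With this layout a histogram of 168 entries needs 84 bytes instead of 168 when one byte per entry is used.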