Open Source Search with Lucene & Solr

If you have ever needed to add full-text indexing or search to one of your projects, chances are you are familiar with Apache Lucene or one of its many derivatives, such as PyLucene, Lucene.NET, Ferret (Ruby), or Lucy (C port). With a history dating back to the early 2000s, Lucene has become one of the most feature-complete information retrieval (IR) libraries, with extensive support for dozens of tokenizers, analyzers, query parsers, and scoring algorithms.
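At its core, an IR library like Lucene is built around an analyzer pipeline feeding an inverted index. A toy sketch in Python (not Lucene's actual implementation, just the underlying idea):

```python
import re
from collections import defaultdict

def analyze(text):
    """A toy analyzer: lowercase and split on non-word characters -
    roughly what Lucene's StandardAnalyzer does for simple English text."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_index(docs):
    """Inverted index: map each term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {1: "Apache Lucene is a search library",
        2: "Solr wraps Lucene as an HTTP server"}
index = build_index(docs)
print(sorted(index["lucene"]))  # both documents mention "lucene" -> [1, 2]
```

Real analyzers add stemming, stop words, and language-specific tokenization on top of this basic shape, but every query ultimately resolves to lookups in a structure like `index`.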

Having recently attended Lucene Revolution in Boston, I can say it was clear from the presentations, as well as the hallway track, that the project has found its way into applications spanning the whole gamut: embedded use, desktop and enterprise search, and even large-scale distributed deployments. Best of all, the number of projects around it, as well as the planned core improvements, continues to impress - if you are looking for an open source search solution, Lucene is definitely worth a close look.

Lucene in the wild: Salesforce, LinkedIn, Twitter, et al.

Salesforce, LinkedIn, and most recently Twitter are all examples of large-scale Lucene deployments that many of us use on a day-to-day basis. Salesforce started with Lucene back in 2002 and today manages an 8TB+ index (~20 billion documents). Their cluster consists of roughly 16 machines, each of which hosts many small (sharded) Lucene indexes. Currently, they handle ~4,000 queries per second (qps) and provide an incremental indexing model where new user data becomes searchable within ~3 minutes.

LinkedIn is another Lucene power user, servicing 350+ million queries per week. Their "people search" product currently peaks at ~200 qps on a single node while providing consistent sub-100ms latency. To deliver this real-time experience, the LinkedIn team wrote their own sorting framework and implemented their own faceted search engine (Bobo) and real-time indexing system (Zoie).

Twitter is a recent Lucene convert - prior to the switch, their search API was powered by (gasp!) MySQL. Michael Busch gave a great talk ("A Billion Queries Per Day") describing the challenges and the changes they had to make to the internals of Lucene to support their query load, which currently peaks at 1 billion queries per day (~12,000 qps). Not surprisingly, their entire index is held in memory and holds up to a billion of the most recent tweets - about a week's worth of data. The good news is, many of their optimizations should find their way upstream into Lucene core within the next year.

Of course, the above is anything but a complete list. iTunes is another notable user, said to be handling up to 800 queries/sec, and at PostRank we are currently managing a 1TB+ index (growing at ~40GB a week) with over 1.2 billion documents.

Lucene + HTTP: Solr Server

If Lucene is a low-level IR toolkit, then Solr is the full-featured HTTP search server that wraps the Lucene library and adds a number of features on top: additional query parsers, HTTP caching, search faceting, highlighting, and many others. Best of all, once you bring up the Solr server, you can speak to it directly via its REST XML/JSON APIs - no need to write any Java code or use Java clients to access your Lucene indexes. Solr and Lucene began as independent projects, but just this past year the two teams decided to merge their efforts - all around, great news for both communities. If you haven't already, definitely take Solr for a spin.
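To make the REST interaction concrete, here is a small sketch of building a Solr select URL and reading back the JSON it returns. The core name "articles" and the host are assumptions for illustration; the `/select` handler and the `q`, `rows`, and `wt=json` parameters are standard Solr (the sample response below is hand-written, not captured from a live server):

```python
import json
from urllib.parse import urlencode

# Assumes a local Solr instance with a core named "articles".
SOLR_URL = "http://localhost:8983/solr/articles/select"

def solr_query_url(q, rows=10):
    """Build a Solr select URL asking for JSON output."""
    return SOLR_URL + "?" + urlencode({"q": q, "rows": rows, "wt": "json"})

# A trimmed-down example of the JSON a Solr select handler returns.
sample_response = json.loads("""
{"responseHeader": {"status": 0, "QTime": 4},
 "response": {"numFound": 2, "start": 0,
              "docs": [{"id": "1", "title": "Lucene in Action"},
                       {"id": "2", "title": "Solr 101"}]}}
""")

hits = sample_response["response"]["docs"]
print(solr_query_url("title:lucene"))
print([doc["title"] for doc in hits])
```

Any HTTP client in any language can do the same - that is the whole point of fronting Lucene with Solr.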

Real-time Search with Lucene

Real-time search was a big theme at Lucene Revolution. Unlike many other IR toolkits, Lucene has always supported incremental index updates, but making new documents visible to incoming requests unfortunately required an fsync (flushing new documents from memory to disk) and a reopen of the "index reader". Needless to say, fsyncs are expensive, hence the recommended patterns have been: commit after N seconds, or commit after adding N documents. These strategies work in most cases, but become major bottlenecks when you are trying to index a high-velocity stream of updates.
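The two classic commit strategies can be sketched as a single batching policy: flush when either the pending-document count or the elapsed time crosses a threshold. A minimal sketch, where `commit_fn` stands in for Lucene's `IndexWriter.commit()` (the class and its parameters are made up for illustration):

```python
import time

class BatchCommitter:
    """Commit after max_docs pending documents or max_age seconds,
    whichever comes first - the common Lucene batching pattern."""
    def __init__(self, commit_fn, max_docs=1000, max_age=5.0, clock=time.monotonic):
        self.commit_fn = commit_fn
        self.max_docs = max_docs
        self.max_age = max_age
        self.clock = clock
        self.pending = 0
        self.last_commit = clock()

    def add(self, doc):
        self.pending += 1
        too_many = self.pending >= self.max_docs
        too_old = self.clock() - self.last_commit >= self.max_age
        if too_many or too_old:
            self.commit_fn()          # the expensive fsync happens here
            self.pending = 0
            self.last_commit = self.clock()

commits = []
c = BatchCommitter(lambda: commits.append("commit"), max_docs=3, max_age=60)
for i in range(7):
    c.add({"id": i})
print(len(commits))  # 7 docs with max_docs=3 -> 2 commits, 1 doc still pending
```

The trade-off is visible in the last line: whatever is still pending is invisible to searchers until the next flush, which is exactly the latency that NRT work tries to eliminate.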

To address this issue, the core team has been working on the "Near Realtime" (NRT) branch, where the in-memory index becomes searchable as soon as it is written to - this removes the need to wait for an fsync. However, even with the NRT branch in place, the periodic fsync and index reader reopen can remain major bottlenecks, as both operations can be CPU and memory intensive.

An alternative to NRT is Zoie, a project developed and maintained by the LinkedIn team. Similar to the NRT branch, it makes all updates immediately visible to incoming queries, but it also implements an incremental indexing and flush model that removes the need for large, explicit commits. Zoie is currently deployed in production at LinkedIn as a standalone service (it comes with an HTTP server), but there is also some work underway to integrate it with Solr.

Last but not least, real-time search is obviously a big deal for Twitter. At the moment, it takes approximately 10 seconds for a tweet to appear in the search results from the time you publish it to the service - that's pretty fast. To achieve this, all Lucene indexes are maintained in memory in many small segments (up to 16 million tweets per segment) and are heavily optimized for Twitter's small document structure. Once again, take a look at Michael's presentation for more details.
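The many-small-segments idea is easy to illustrate: append into an open in-memory segment, seal it when it fills up, and scan segments newest-first at query time. A toy sketch (real segments hold millions of tweets, not three, and Twitter's internals differ substantially):

```python
class SegmentedBuffer:
    """Append-only in-memory segments capped at a fixed size;
    a full segment is sealed and a fresh one started."""
    def __init__(self, segment_size=3):
        self.segment_size = segment_size
        self.sealed = []    # full, immutable segments
        self.current = []   # the open segment receiving writes

    def add(self, doc):
        self.current.append(doc)
        if len(self.current) >= self.segment_size:
            self.sealed.append(self.current)
            self.current = []

    def search(self, predicate):
        # Newest first: scan the open segment, then sealed ones in reverse.
        for segment in [self.current] + self.sealed[::-1]:
            for doc in reversed(segment):
                if predicate(doc):
                    yield doc

buf = SegmentedBuffer()
for i in range(7):
    buf.add({"id": i, "text": f"tweet {i}"})
print([d["id"] for d in buf.search(lambda d: d["id"] % 2 == 0)])
# newest-first matches: [6, 4, 2, 0]
```

Sealing segments keeps most of the data immutable, which is what makes lock-free concurrent reads practical while a single writer appends to the open segment.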

Distributed Search

Out of the box, Lucene does not provide any support for distributed indexes - your application can open multiple index readers, but all of that has to be coordinated manually. Solr, on the other hand, allows hosting and querying multiple "cores" (indexes), and adds support for distributed queries, where results are gathered from multiple servers, appropriately scored, and then returned to the user. However, even with Solr, managing cores, shuffling data between them (rebalancing), and failover logic are left to the application.
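The aggregation step of a distributed query is essentially a top-k merge: each shard returns its hits sorted by score, and the coordinator merges them into a single ranked list. A minimal sketch (the data and function names are made up; real coordinators also deduplicate and normalize scores):

```python
import heapq

def merge_shard_results(shard_results, k=3):
    """Merge per-shard hit lists (each already sorted by descending
    score) into a global top-k, coordinator-style."""
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(merged)[:k]

shard_a = [(0.9, "doc-a1"), (0.4, "doc-a2")]
shard_b = [(0.7, "doc-b1"), (0.6, "doc-b2"), (0.1, "doc-b3")]
print(merge_shard_results([shard_a, shard_b]))
# top 3 across shards: [(0.9, 'doc-a1'), (0.7, 'doc-b1'), (0.6, 'doc-b2')]
```

Because each shard list is pre-sorted, the merge never needs to materialize more than k results per shard - which is why distributed queries scale with the number of shards rather than the total index size.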

SolrCloud attempts to address some of these issues by embedding a ZooKeeper instance to simplify the creation and management of distributed Solr clusters. Specifically, the aim is to provide central configuration, discovery, load balancing, and failover. The project is still in incubation, but a few people at Lucene Revolution reported already having it running in production! Promising.

SolrCloud also has competition in the Katta and ElasticSearch projects. Of the two, ElasticSearch appears to have the greater momentum: it implements a distributed, RESTful search engine built on top of Lucene, speaks native JSON, supports automated master election and failover, index replication, and atomic operations (no need to commit), and looks like a solid overall competitor to SolrCloud.
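"Speaks native JSON" means both documents and queries are plain JSON bodies sent over HTTP. A minimal sketch of building an ElasticSearch-style query body (the index name "tweets" and the endpoint are assumptions for illustration; the `match` query shape is ElasticSearch's query DSL):

```python
import json

# This body would be POSTed to an endpoint like
# http://localhost:9200/tweets/_search (index name made up here).
query = {
    "query": {"match": {"text": "lucene"}},  # full-text match on one field
    "size": 10,                              # return at most 10 hits
}
body = json.dumps(query)
print(body)
```

Compared to Solr's URL-parameter style, composing queries as plain data structures makes it easy to build, inspect, and unit-test them in any language with a JSON library.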

Solr, Lucene and NoSQL

Rather than running Lucene or Solr standalone, you can also embed either within other applications. For example, Lucandra aims to implement a distributed Lucene index directly on top of Cassandra. Jake Luciani, the lead developer of the project, has recently joined the Riptano team as a full-time developer, so do not be surprised if Cassandra soon supports a Lucene-powered IR toolkit as one of its features!

At the same time, Lily aims to transparently integrate Solr with HBase to allow for a much more flexible query and indexing model over your HBase datasets. Unlike Lucandra, Lily does not use HBase as an index store (see HBasene for that); instead, it runs standalone, albeit tightly integrated, Solr servers for flexible indexing and query support.

In summary ...

Lucene may not be at the top of your high-profile open-source radar, but it definitely has the momentum and the developer community behind it to deserve a spot there. The projects listed above are only a small sample of all the great work being done by the community. To get started, check out some of the great books on Lucene and Solr, or dive into one of the online tutorials. Finally, kudos to Lucid Imagination for organizing Lucene Revolution this year, and to all the speakers for their great presentations!

French translation by Frédéric Faure (architect at Ysance)

Ilya Grigorik is a web performance engineer at Google, co-chair of the W3C Web Performance working group, and author of High Performance Browser Networking (O'Reilly) - follow him on Twitter and Google+.