Avoiding my Lucene TooManyClauses exceptions

Before I start, I should point out that I’m not a Lucene expert. This post isn’t a definitive “you should do things this way” commandment from a Lucene mage. Think of it more as “I had this problem, and this seemed to work for me. I’m sharing it in case it helps you, too”.

I’m using Lucene to implement searches. Recently, as my Lucene index grew (a lot), I started getting a lot of these errors when I tried to do a search:

org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:163)
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:154)
    at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:63)
    at org.apache.lucene.search.WildcardQuery.rewrite(WildcardQuery.java:54)
    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:383)
    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:383)
    at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:162)
    at org.apache.lucene.search.Query.weight(Query.java:94)
    at org.apache.lucene.search.Searcher.createWeight(Searcher.java:185)
    at org.apache.lucene.search.Searcher.search(Searcher.java:86)

I’m guessing that TooManyClauses is a common problem for people getting going with Lucene.

It’s mentioned in the FAQ, and there are a few StackOverflow threads around about it.

But I couldn’t find a straightforward “you need to follow these steps to fix it” post anywhere, so I’ll add my experience here.

Why does this exception happen?

To borrow the example from the Lucene FAQ, in some situations Lucene expands your search query before running it.

If you have an index with:

     car
     cars
     cat
     catalogue
     caterpillar
     delta
     doctor
     dogs
     dominos

And you search for: ca*
It will essentially be expanded to “car OR cars OR cat OR catalogue OR caterpillar” (removing the wildcard).

And if you search for: *
It will be expanded to: “car OR cars OR cat OR catalogue OR caterpillar OR delta OR doctor OR dogs OR dominos”.
A query with nine clauses.

As the index grows, the number of clauses in these wildcard queries increases. If it increases to the point where search queries end up with more than the default 1024 clauses in them, then searches start throwing TooManyClauses exceptions.
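
To make the arithmetic concrete, here is a small plain-Java sketch of the idea. It doesn't use Lucene at all, and it isn't how Lucene actually implements the expansion: the class name and the expand and checkClauseCount methods are made up for illustration. The only things it borrows from the above are the example terms and the default limit of 1024.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a toy model of prefix expansion, not Lucene's real code.
final class WildcardExpansionSketch {

    // The default limit quoted in the exception message above.
    static final int MAX_CLAUSE_COUNT = 1024;

    // Returns every indexed term that starts with the given prefix.
    // An empty prefix plays the role of a bare "*" and matches every term.
    static List<String> expand(TreeSet<String> indexedTerms, String prefix) {
        List<String> clauses = new ArrayList<>();
        for (String term : indexedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            clauses.add(term);
        }
        return clauses;
    }

    // Mimics the limit check: one clause per expanded term.
    static void checkClauseCount(List<String> clauses) {
        if (clauses.size() > MAX_CLAUSE_COUNT) {
            throw new IllegalStateException(
                "too many clauses: " + clauses.size() + " > " + MAX_CLAUSE_COUNT);
        }
    }
}
```

With the nine example terms, expand(terms, "ca") gives the five "ca" terms and expand(terms, "") gives all nine. Once the term set holds more than 1024 entries, passing the result of expand(terms, "") to checkClauseCount throws, which is the toy equivalent of the exception above.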

Why was I getting this exception?

I had some Java code that was taking user input, and using it to build a Lucene query. The queries where I was ending up with TooManyClauses exceptions were where I was trying to get all items except a few named entities.

For example:
-id:12345 -id:23456 -id:34567 -id:45678 *

In other words, “get me everything except the items with IDs 12345, 23456, 34567 and 45678”

I was building this using:

String queryString = "-id:12345 -id:23456 -id:34567 -id:45678 *";
Query luceneQuery = queryParser.parse(queryString);
TopDocs hits = indexSearcher.search(luceneQuery, null, start, sort);

This worked fine with reasonably small indices. However, because of the wildcard behaviour described above, this is a terribly inefficient way to do this.

The wildcard * was being expanded to every term in the index, making for a search query that started with “-id:12345 -id:23456 -id:34567 -id:45678” and essentially continued with the thousands and thousands of terms in the rest of the index.

Avoiding the TooManyClauses exception

The Lucene FAQ outlines the two main approaches to avoid the TooManyClauses exception.

  • Increase the maximum-clause-count limit beyond the default limit
    I didn’t think this was a fix, as it would only postpone the problem until the number of documents covered by my index increased a bit more. And it comes with a memory requirement overhead anyway.
  • Write a filter to replace the part of the query that causes the exception

A better way to do my query

Using a wildcard query to get all documents results in a very long query being generated. I found a better way to get all the documents: the MatchAllDocsQuery.

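import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// notice that I don't include the wildcard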
String queryString = "-id:12345 -id:23456 -id:34567 -id:45678";

BooleanQuery luceneQuery = new BooleanQuery();

// create a match-all-documents query and add it
MatchAllDocsQuery madQuery = new MatchAllDocsQuery();
luceneQuery.add(madQuery, Occur.SHOULD);

// parse the clauses that identify the exceptions
Query filterQuery = queryParser.parse(queryString);

if (filterQuery instanceof BooleanQuery){
    // if the parsed query contains multiple clauses, we add each 
    //  one of these to our overall query individually
    //  (without specifying the Occur property - therefore 
    //  keeping whatever was specified in the query string)
    BooleanQuery parsedBooleanQuery = (BooleanQuery)filterQuery;
    for (BooleanClause clause : parsedBooleanQuery.getClauses()) {
        luceneQuery.add(clause);
    }
}
else {
    // if the parsed query contains just a single clause, we add
    //  it to our overall query as a required clause
    luceneQuery.add(filterQuery, Occur.MUST);
}

TopDocs hits = indexSearcher.search(luceneQuery, null, start, sort);

In other words, we build a query that starts with a clause specifying match-all-documents, and then add the clauses that identify the exceptions.

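It seems to be much more efficient than relying on a WildcardQuery to do this: the number of clauses now depends only on how many exceptions I list, not on how many terms are in the index.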
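
A rough way to picture the difference, in plain Java again with no Lucene involved (ExclusionQuerySketch, clauseCount and search are hypothetical names): the query is one match-all clause plus one clause per excluded ID, and the index can grow as much as it likes without that number changing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustration only: models "match everything except these IDs".
final class ExclusionQuerySketch {

    // One match-all clause plus one prohibited clause per excluded ID.
    static int clauseCount(Set<String> excludedIds) {
        return 1 + excludedIds.size();
    }

    // Walks every document ID and keeps the ones that are not excluded.
    static List<String> search(List<String> allDocIds, Set<String> excludedIds) {
        List<String> hits = new ArrayList<>();
        for (String id : allDocIds) {
            if (!excludedIds.contains(id)) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

For the four IDs in my example, clauseCount comes to five, whether the index holds ten documents or ten million.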

Other ways to get the error

This was one particular way to end up with a TooManyClauses exception. There are many others, and other approaches will be appropriate for them. This was a way to avoid a problem with a particularly badly written query in the first place. I didn’t need to write a filter in this instance, but it’d be interesting to see an example of what that would look like.
