{"id":2081,"date":"2012-03-20T21:57:17","date_gmt":"2012-03-20T21:57:17","guid":{"rendered":"http:\/\/dalelane.co.uk\/blog\/?p=2081"},"modified":"2012-03-20T21:59:37","modified_gmt":"2012-03-20T21:59:37","slug":"avoiding-my-lucene-toomanyclauses-exceptions","status":"publish","type":"post","link":"https:\/\/dalelane.co.uk\/blog\/?p=2081","title":{"rendered":"Avoiding my Lucene TooManyClauses exceptions"},"content":{"rendered":"<p><em>Before I start, I should point out that I&#8217;m not a Lucene expert. This post isn&#8217;t a definitive &#8220;you should do things this way&#8221; commandment from a Lucene mage. Think of it more as &#8220;I had this problem, and this seemed to work for me. I&#8217;m sharing it in case it helps you, too&#8221;.<\/em><\/p>\n<p>I&#8217;m using <a href=\"http:\/\/lucene.apache.org\/\">Lucene<\/a> to implement searches. Recently, as my Lucene index has grown (a lot), I was getting a lot of these errors when I tried to do a search:<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.7em; font-size: 1em; overflow: auto;\">org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024\r\n    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:163)\r\n    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:154)\r\n    at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:63)\r\n    at org.apache.lucene.search.WildcardQuery.rewrite(WildcardQuery.java:54)\r\n    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:383)\r\n    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:383)\r\n    at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:162)\r\n    at org.apache.lucene.search.Query.weight(Query.java:94)\r\n    at org.apache.lucene.search.Searcher.createWeight(Searcher.java:185)\r\n    at org.apache.lucene.search.Searcher.search(Searcher.java:86)<\/pre>\n<p>I&#8217;m guessing that 
<code>TooManyClauses<\/code> is a common problem for people getting going with Lucene. <\/p>\n<p>It&#8217;s <a href=\"http:\/\/wiki.apache.org\/lucene-java\/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F\">mentioned in the FAQ<\/a>, and there are a <a href=\"http:\/\/stackoverflow.com\/search?q=Lucene+TooManyClauses\">few StackOverflow threads<\/a> around about it. <\/p>\n<p>But I couldn&#8217;t find a straightforward &#8220;you need to follow these steps to fix it&#8221; post anywhere, so I&#8217;ll add my experience here. <\/p>\n<p><!--more--><strong>Why does this exception happen?<\/strong><\/p>\n<p>To borrow the example from the Lucene FAQ, in some situations Lucene expands your search query before running it.<\/p>\n<p>If you have an index with:<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.7em; font-size: 1.1em; overflow: auto;\">     car\r\n     cars\r\n     cat\r\n     catalogue\r\n     caterpillar\r\n     delta\r\n     doctor\r\n     dogs\r\n     dominos<\/pre>\n<p>And you search for: <code>ca*<\/code><br \/>\nIt will essentially be expanded to &#8220;<code>car OR cars OR cat OR catalogue OR caterpillar<\/code>&#8221; (removing the wildcard). <\/p>\n<p>And if you search for: <code>*<\/code><br \/>\nIt will be expanded to: &#8220;<code>car OR cars OR cat OR catalogue OR caterpillar OR delta OR doctor OR dogs OR dominos<\/code>&#8220;.<br \/>\nA query with nine clauses. <\/p>\n<p>As the index grows, the number of clauses in these wildcard queries increases. If it increases to the point where search queries end up with more than the default 1024 clauses in them, then searches start throwing <code>TooManyClauses<\/code> exceptions. <\/p>\n<p><strong>Why was I getting this exception?<\/strong><\/p>\n<p>I had some Java code that was taking user input, and using it to build a Lucene query. 
The queries where I was ending up with <code>TooManyClauses<\/code> exceptions were where I was trying to get all items except a few named entities. <\/p>\n<p>For example:<br \/>\n<code>-id:12345 -id:23456 -id:34567 -id:45678 *<\/code><\/p>\n<p>In other words, &#8220;get me everything except the items with IDs 12345, 23456, 34567 and 45678&#8221;<\/p>\n<p>I was building this using:<\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.7em; font-size: 1.1em; overflow: auto;\">String queryString = \"-id:12345 -id:23456 -id:34567 -id:45678 *\";\r\nQuery luceneQuery = queryParser.parse(queryString);\r\nTopDocs hits = indexSearcher.search(luceneQuery, null, start, sort);<\/pre>\n<p>This worked fine with reasonably small indices. However, because of the wildcard behaviour described above, this is a terribly inefficient way to do this. <\/p>\n<p>The wildcard <code>*<\/code> was being expanded to every term in the index, making for a search query that started with &#8220;<code>-id:12345 -id:23456 -id:34567 -id:45678<\/code>&#8221; and essentially continued with the thousands and thousands of terms in the rest of the index. <\/p>\n<p><strong>Avoiding the TooManyClauses exception<\/strong><\/p>\n<p>The <a href=\"http:\/\/wiki.apache.org\/lucene-java\/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F\">Lucene FAQ<\/a> outlines the two main approaches to avoid the TooManyClauses exception. <\/p>\n<ul>\n<li>Increase the maximum-clause-count limit beyond the default limit <br \/>I didn&#8217;t think this was a fix, as it would only postpone the problem until the number of documents covered by my index increased a bit more. And it comes with a memory requirement overhead anyway.\n<\/li>\n<li>Write a filter to replace the part of the query that causes the exception<\/li>\n<\/ul>\n<p><strong>A better way to do my query<\/strong><\/p>\n<p>Using a wildcard query to get all documents results in a very long query being generated. 
I found a better way to get all the documents: the <a href=\"http:\/\/lucene.apache.org\/core\/old_versioned_docs\/versions\/2_9_0\/api\/all\/org\/apache\/lucene\/search\/MatchAllDocsQuery.html\">MatchAllDocsQuery<\/a>. <\/p>\n<pre style=\"border: thin solid silver; background-color: #eeeeee; padding: 0.7em; font-size: 1.1em; overflow: auto;\">\/\/ notice that I don't include the wildcard\r\nString queryString = \"-id:12345 -id:23456 -id:34567 -id:45678\";\r\n\r\nBooleanQuery luceneQuery = new BooleanQuery();\r\n\r\n\/\/ create a default match-all-documents query and add it \r\nMatchAllDocsQuery madQuery = new MatchAllDocsQuery();\r\nluceneQuery.add(madQuery, Occur.SHOULD);\r\n\r\n\/\/ parse the exception clauses\r\nQuery filterQuery = queryParser.parse(queryString);\r\n\r\nif (filterQuery instanceof BooleanQuery){\r\n    \/\/ if the parsed query contains multiple clauses, we add each \r\n    \/\/  one of these to our overall query individually\r\n    \/\/  (without specifying the Occur property - therefore \r\n    \/\/  keeping whatever was specified in the query string)\r\n    BooleanQuery parsedBooleanQuery = (BooleanQuery)filterQuery;\r\n    for (BooleanClause clause : parsedBooleanQuery.getClauses()) {\r\n        luceneQuery.add(clause);\r\n    }\r\n}\r\nelse {\r\n    \/\/ if the parsed query contains just a single clause, we add\r\n    \/\/  it to our overall query                    \r\n    luceneQuery.add(filterQuery, Occur.MUST);\r\n}\r\n\r\nTopDocs hits = indexSearcher.search(luceneQuery, null, start, sort);<\/pre>\n<p>In other words, we build a query that starts with a clause specifying match-all-documents, and then add the clauses that identify the exceptions. <\/p>\n<p>It seems to be much more efficient than relying on a <code>WildcardQuery<\/code> to do this. <\/p>\n<p><strong>Other ways to get the error<\/strong><\/p>\n<p>This was one particular way to end up with a <code>TooManyClauses<\/code> exception. 
There are many others, and other approaches will be appropriate for them. This was a way to avoid a problem with a particularly badly written query in the first place. I didn&#8217;t need to write a filter in this instance, but it&#8217;d be interesting to see an example of what that would look like. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Before I start, I should point out that I&#8217;m not a Lucene expert. This post isn&#8217;t a definitive &#8220;you should do things this way&#8221; commandment from a Lucene mage. Think of it more as &#8220;I had this problem, and this seemed to work for me. I&#8217;m sharing it in case it helps you, too&#8221;. I&#8217;m [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[519,521,520],"class_list":["post-2081","post","type-post","status-publish","format-standard","hentry","category-code","tag-lucene","tag-maxclausecount","tag-toomanyclauses"],"_links":{"self":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2081","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2081"}],"version-history":[{"count":0,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=\/wp\/v2\/posts\/2081\/revisions"}],"wp:attachment":[{"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2081"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fc
ategories&post=2081"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dalelane.co.uk\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2081"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}