Wednesday, April 29, 2009

Couchdb-lucene Sorting


To sort on a field in couchdb-lucene (or Lucene proper for that matter), the field cannot be analyzed / tokenized. If you have a recipe with a title of "Chocolate Chip Pancakes", the title field will be indexed with three tokens: "chocolate", "chip" and "pancakes". That way each term can be found in the index and readily associated back to the original recipe / document.
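
To make the tokenizing concrete, here is a toy inverted index in Ruby. It is only a sketch of the idea, not how Lucene actually analyzes or stores terms; the `tokenize` and `build_index` helpers and the document IDs are made up for illustration.

```ruby
# Toy inverted index: maps each lowercased token to the IDs of the
# documents that contain it. Illustration only, not Lucene's real analyzer.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

def build_index(docs, field)
  index = Hash.new { |hash, token| hash[token] = [] }
  docs.each do |id, doc|
    # uniq so a document is listed once per token, even if the token repeats
    tokenize(doc[field]).uniq.each { |token| index[token] << id }
  end
  index
end

# Hypothetical documents, keyed by ID
docs = {
  "pancakes" => { "title" => "Chocolate Chip Pancakes" },
  "cookies"  => { "title" => "Chocolate Chip Cookies" }
}

index = build_index(docs, "title")
index["chip"]      # => ["pancakes", "cookies"]
index["pancakes"]  # => ["pancakes"]
```

Each token points back to its documents, which is all a search needs. Nothing in the index records the title as a whole.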

Inverted indexes work well for searching, but not so well for sorting. Which token would be used for sorting? "Chocolate" because it was the first term? "Chip" because it comes first alphabetically? Lucene handles this by simply refusing to sort on analyzed fields. It does still support sorting, just not on fields indexed this way.

It does so by indexing fields as not-analyzed (without tokenizing), so that the whole value is kept as a single term. Couchdb-lucene supports this feature via a "not_analyzed" argument to the Document's field method. To get this working with the title and date fields, I need to add this to the lucene design document:
ret.field('sort_title', doc['title'], 'yes', 'not_analyzed');
ret.field('sort_date', doc['date'], 'yes', 'not_analyzed');
After re-indexing (which I do by removing the lucene directory from the couchdb directory) and then searching my development database, I get results back with a "sort_order" attribute:
cstrom@jaynestown:~/repos/couchdb-lucene$ curl http://localhost:5984/eee/_fti?q=ingredient:salt\&sort=sort_date
{"q":"+_db:eee +ingredient:salt",
"etag":"120f4ad1fc3",
"skip":0,
"limit":25,
"total_rows":7,
"search_duration":0,
"fetch_duration":14,
"sort_order":[{"field":"sort_date","reverse":false,"type":"string"},
{"reverse":false,"type":"doc"}],
"rows":[{"_id":"2002-01-13-hollandaise_sauce",
...}
To reverse the sort order in couchdb-lucene, you need to prepend a backslash to the field being sorted on (doubled on the command line so that the shell passes a single literal backslash through):
cstrom@jaynestown:~/repos/couchdb-lucene$ curl http://localhost:5984/eee/_fti?q=ingredient:salt\&sort=\\sort_date
{"q":"+_db:eee +ingredient:salt",
"etag":"120f4ad1fc3",
"skip":0,
"limit":25,
"total_rows":7,
"search_duration":0,
"fetch_duration":14,
"sort_order":[{"field":"sort_date","reverse":true,"type":"string"},
{"reverse":false,"type":"doc"}],
"rows":[{"_id":"2008-07-21-spinach",
...}
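The same request could be built from Ruby, where there is no shell to escape the backslash from. This is only a sketch: `fti_path` is a hypothetical helper, and it assumes that couchdb-lucene accepts the percent-encoded backslash (`%5C`) that `URI.encode_www_form` produces.

```ruby
require "uri"

# Builds the path and query string for a couchdb-lucene search (hypothetical helper).
# A leading backslash on the sort field asks for the reverse order.
def fti_path(db, query, sort_field = nil, reverse = false)
  params = { "q" => query }
  # "\\" in Ruby source is a single literal backslash
  params["sort"] = (reverse ? "\\" : "") + sort_field if sort_field
  "/#{db}/_fti?" + URI.encode_www_form(params)
end

fti_path("eee", "ingredient:salt", "sort_date", true)
# => "/eee/_fti?q=ingredient%3Asalt&sort=%5Csort_date"
```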
What this will mean for my Sinatra app is that I need to pass sort parameters to couchdb-lucene, but do not need to store them as instance variables. The view and helper can pull the sort information directly from the result set.
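
A minimal sketch of how a helper might read the sort back out of the result set. The `current_sort` name is hypothetical, and it assumes that the first "sort_order" entry with a "field" key is the one that was requested.

```ruby
require "json"

# Returns the primary sort field and direction from a couchdb-lucene result,
# or nil when the result carries no field-based sort (hypothetical helper).
def current_sort(result)
  primary = (result["sort_order"] || []).find { |entry| entry.key?("field") }
  return nil unless primary
  { :field => primary["field"], :reverse => primary["reverse"] }
end

# Trimmed copy of the reversed-sort response shown above
body = '{"sort_order":[{"field":"sort_date","reverse":true,"type":"string"},' \
       '{"reverse":false,"type":"doc"}]}'

sort = current_sort(JSON.parse(body))
sort[:field]    # => "sort_date"
sort[:reverse]  # => true
```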

I will work on that tomorrow. For today, I update the lucene design document (as described above) and implement this scenario step:
When /^I click the "([^\"]*)" column header$/ do |link|
  click_link("sort-by-#{link.downcase}")
end
by adding an ID attribute to the search results column heading in the Haml template:
%th
  %a{:href => "/recipes/search?q=foo&sort=name", :id => "sort-by-name"} Name

(commit)
