Kevin Nelson Marshall

I'm still putting together the finishing touches on a simple little widget that I've been hinting around at for a few days now. Actually the widget part of the work is done, but there's a back end web site that needs to go along with it before I can really start to use the widget around here (and share it with the rest of the world).

So I was trying to tie up the loose ends today on that when I hit a small snag. One of the things I'm doing is making the data the widget collects searchable on the back end web site. I'm doing this with Ruby Ferret which is really just a Ruby port of Lucene.

For those of you that are familiar with Ferret or Lucene, you know that in order to do an update you actually need to delete the existing document from the index and then add a new one back into the index. That is, there is no direct 'update' process or command.

Generally this is not a problem, because if you are going to make changes to documents in your index you might as well just re-index the whole document to make sure you've got the latest and greatest.

However, in my case, I had a slightly different issue than is normal.

Basically my documents have two important fields: content and score. As you can probably guess, the content field holds the majority of the data and is what the actual 'searches' are going to be going against. Meanwhile, the score field is a calculated value that, in theory, will change quite often.

The basic idea of the search will be to show the most relevant documents based on your search criteria with the highest 'score' (descending order).

NOTE: Terminology is a little confusing here if you have any experience with search applications because the score I'm referring to is my own calculation outside of the document score that the search program itself generates -- what I really want to do is apply an additional calculation on the fly to the document score and my calculated score to produce the true 'sorting' score.
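To make the two kinds of score concrete, here is a minimal sketch of that on-the-fly blend. The `Hit` struct, the sample numbers, and especially the formula (a plain product) are assumptions for illustration only; the actual calculation could be anything that mixes the two values.

```ruby
# Hypothetical search hits: :relevance stands in for the document score the
# search engine generates, :my_score for my own separately calculated score.
Hit = Struct.new(:key, :relevance, :my_score)

# Assumed blend: a plain product of the two scores. This formula is a
# placeholder, not the one actually used on the site.
def sorting_score(hit)
  hit.relevance * hit.my_score
end

# Order hits by the blended score, highest first (descending order).
def rank(hits)
  hits.sort_by { |hit| -sorting_score(hit) }
end

hits = [
  Hit.new("a", 0.9, 2.0),
  Hit.new("b", 0.5, 10.0),
  Hit.new("c", 0.7, 1.0)
]

ranked_keys = rank(hits).map(&:key)
```

With these made-up numbers, "b" ends up ahead of "a" even though "a" is the better text match, which is exactly the effect I'm after.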

Since the score of any given content can change often, I needed a way to have the index reflect the changes as often as possible. The easiest way to do that is, of course, to run updates on the index as needed.

So you see the problem?

In my system I need to do frequent updates to a field within the document, but I don't really want to do a complete re-index because the larger field (content) is not really going to change often (at least not for our interests). And from everything I knew about Lucene and Ferret, this was exactly the type of thing it couldn't do nicely...

So I spent a while looking into my other full text search options like Sphinx Search and ht://Dig (among others).

The problem with each of these was twofold:

1. I didn't want to spend a lot of time installing these and playing around with them to get the performance/situation I required for such a little project.

2. None of them gave me confidence that they would be able to actually do anything different than what Ferret was making me do (it appears updating via delete/insert is the de facto standard!).

So eventually I went back to Ferret and dug around some more in the API, online comments, and such. From all of that research, I started to head down the path of having two indexes. One would just contain my content and be updated very rarely, and the other would contain my scores and be updated all the time.

This would solve my update issue no problem, but it introduced larger headaches.

First, I needed to connect a specific link between a given 'content' and 'score'. I could do that with some type of special key no problem. It means an extra field and some key management, but really not a big deal.

The bigger deal was the second issue - which reveals itself during search. How the heck do I actually search the 'content' field of one index but sort by the 'score' field of the other (tying them all together by this new key field we magically created moments ago)?

Normally when I hit problems like this, my brain kicks into overdrive and I start mapping out all these complex workflows, lookup tables, and calculations I can use to accomplish the task. A task, I remind you, that in the end is usually something as simple as displaying a list of results (something a general user has no appreciation for the actual work involved in generating -- but that's another post for another day!).

Today I was lucky though.

You see I'm not just a compulsive programmer who has trouble sleeping when a problem is nagging me, I'm also a LAZY programmer who NEVER wants to do more work than he really has to.

NOTE: I've come to believe that this combination of compulsive and lazy is key to being a good developer - you'll work non-stop to solve a problem that is bothering you, but you'll do it in the most efficient way possible so as not to waste too much energy or time!

Anyway, the idea of all the work it was going to take just to IMAGINE the solution to my problem forced me to rethink the problem itself. I knew I was forced to do updates via delete/insert. And I knew that I wanted to be able to update one field a lot. What I really wanted to avoid was having to pull in the data for the 'content' field every time I just wanted to update the 'score'.

The solution I came up with turned out to be so simple it's crazy.

You see, the way you delete documents from an index is to call a delete statement using the document's (internal) id. And the way to get that document's id is to do a search for the document using some unique key you know about the document. So you search for a document, you get its ID, and then you call the delete command passing it that ID.

What I realized was that at the same time that I was grabbing the ID, I could just grab the content as well, throw it into a variable, delete the document, and then pass the variable along with the updated score to the new document I was adding in its place! So I was deleting and inserting, but the only thing that was really changing was my 'score' field. Best of all, I wouldn't need to do any additional external processing for the 'content' fields just to update the 'score'!
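Here is a rough sketch of that flow. To keep it runnable on its own it does not use Ferret at all: `TinyIndex` is a hypothetical in-memory stand-in that, like Ferret, only knows how to add, look up, and delete. The `:key` field and the `update_score` helper are also names I made up for the example. In a real index this trick only works if the content field is actually stored in the index (not just indexed), since otherwise there is nothing to read back.

```ruby
# TinyIndex is a hypothetical in-memory stand-in for a search index.
# Like Ferret or Lucene it has no update: only add, lookup, and delete.
class TinyIndex
  def initialize
    @docs = {}    # internal id => document hash
    @next_id = 0
  end

  # Add a document and return its internal id.
  def add(doc)
    id = @next_id
    @next_id += 1
    @docs[id] = doc.dup
    id
  end

  # Find the internal id of the document whose :key field matches.
  def find_id(key)
    pair = @docs.find { |_id, doc| doc[:key] == key }
    pair && pair.first
  end

  # Return the stored document for an internal id.
  def fetch(id)
    @docs[id]
  end

  def delete(id)
    @docs.delete(id)
  end

  def size
    @docs.size
  end
end

# "Update" only the score: look up the id, stash the stored content in a
# variable, delete the old document, then add a replacement that reuses
# the content and carries the new score. Returns the new internal id.
def update_score(index, key, new_score)
  id = index.find_id(key)
  return nil if id.nil?

  content = index.fetch(id)[:content]
  index.delete(id)
  index.add({ key: key, content: content, score: new_score })
end

index = TinyIndex.new
index.add({ key: "post-1", content: "Ferret is a Ruby port of Lucene", score: 1 })
new_id = update_score(index, "post-1", 42)
updated = index.fetch(new_id)
```

The document still gets deleted and re-added, so the internal id changes, but the content never has to be rebuilt from whatever external source produced it in the first place.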

Most of you are probably saying 'DUH' to yourselves right now because this solution seemed obvious to you from the start - but hey I never said I was smarter than you now did I?

I freely admit that I often take the long road to the simple solution thanks to my less than stellar intelligence ... but hey I'm just happy if, in the end, I can get to that simple solution one way or another.

So happy in fact, that I thought I would share that long journey to today's simple solution with you!

posted by Kevin Marshall on 2008-01-25 00:00:00+00
