Monday, June 7, 2010

500kbs of Magic!

We are all helplessly in love with our dearest pal, the internet, but what is the internet without a search engine?
I always wondered about the mysterious ingenuity behind the simple search engines that magically pulled out anything I wanted, from a plethora of online data, in a fraction of a second.
Lately I have had the pleasure of working in a search team, where I discovered this beautiful piece of software called Lucene, and I have been in awe of it ever since. It is like a priceless gem, given away for nothing, for the benefit of all (yes, it is open source).

What is Lucene?
By definition it is "a free/open source information retrieval library", or in plain English, it is the heart of a search engine. It has a beautiful way of indexing and searching data to return highly relevant documents in the blink of an eye. All it takes is a few lines of code, and you are good to go. A lot of the search you see in services is actually built on Lucene: Yahoo, Facebook, FedEx, Amazon, LinkedIn... the list is endless (not Google, of course!).
It doesn't take much to build a search with it, but that is just the start. Diving deeper opens up a whole new world of possibilities and the power to do amazing things! It is beautiful to see how such complex problems are solved so elegantly.

Let me discuss one seemingly mammoth task that was solved effortlessly with it. My friend, who knew I had some knowledge of Lucene, presented me with a problem. He wanted me to take a news article online and collect other articles on the net that covered the same news.
Now, I can't even begin to think how I would compare millions of articles and find the ones that talk about exactly the same news as this one. No other article would use the same words to describe the same news as mine, and most articles would partially match my article because common English words exist in all of them. It is very complicated to establish a relationship between articles that convey the same news but may use an entirely different set of words to do so. But guess what, a little bit of exploration into Lucene solved this in a blink!
To give an overview, Lucene turns every indexed article into a kind of vector in a space with one dimension per term (so thousands of dimensions, not three), and uses complex scoring algorithms to establish relationships between them. The mechanics are beautiful, but I won't go into them, because anyone interested can google it (and I'm not entirely certain of it all yet :P). The end result was simply beautiful. It was very exciting to see our code go through hundreds of articles and return the ones that talked about the same news with amazing accuracy.
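
For the curious, here is a toy sketch in Python of the "articles as vectors" idea. To be clear, this is not Lucene's code, not its actual scoring formula, and not the code from the project described above. It is a bare-bones TF-IDF weighting with cosine similarity, and every function name and sample article in it is made up for illustration. The point it tries to show is that words appearing in every article get zero weight, so the common English words stop counting as a match.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} vector."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # A term found in every document gets weight log(n / n) = 0.
        vectors.append({t: count * math.log(n / df[t]) for t, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, docs):
    """Rank every other document by similarity to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Made-up sample articles, purely for illustration.
    articles = [
        "The city council approved the new stadium budget on Monday",
        "Council members voted to approve the stadium budget for the city",
        "The weather on Monday was sunny and the beach was crowded",
    ]
    for score, index in most_similar(0, articles):
        print(f"{score:.3f}  {articles[index]}")
```

A real engine does far more than this (stemming, stop words, length normalization, and an index so it never compares every pair of articles), which is exactly the part Lucene takes off your hands.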

Now, in case you are wondering why I am trying to scare away the few readers I do have with this technical mumbo jumbo... don't worry, I am not!
This post was more to express my elation on finding this glorious new toy, and to express gratitude to the good old folks in the open source community, than to talk about search engines. My posts shall not get any more technical :)

Tributes to Doug Cutting, creator of Lucene. Long live the open source community!