Thursday, June 19, 2008

Lucene in PHP?

Oh, man... I'm as happy as a clam. I discovered that the Zend Framework includes a 100% PHP implementation of Lucene. While it's not function-for-method identical, the majority of the public interface is the same.

To "test it out", I'm porting the old Caryatid/Pilaster library off of DBA and onto a pure Lucene backend. In essence, I'm turning Lucene into a DocumentDB (albeit one far simpler than Rhizome).

Originally, I was going to replicate the Rhizome structure of separating the document repository from the index. But then somebody inadvertently threw down the gauntlet.

While working on this, I found the following quote in the Zend Framework documentation:

It should be clear that Zend_Search_Lucene as well as any other Lucene implementation does not comprise a "database".

Indexes should not be used for data storage. They do not provide partial backup/restore functionality, journaling, logging, transactions and many other features associated with database management systems.

This got me thinking... why not store the entire document in the index instead of just the term fields? Of the list above, most of the "missing features" are either missing in a standard file system or are provided to the index by a standard file system. And with some tools (like, say, backup and restore functionality), the index could function in a way not at all dissimilar from the DBA database. Keep in mind, DBA is just a dumbed-down API for using BDB, GDBM, and their ilk.
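
To make the idea concrete, here is a rough sketch of what "storing the entire document in the index" might look like with Zend_Search_Lucene. This is not the Pilaster code. The index path, field names, and document ID are made up for illustration, and it assumes Zend Framework 1.x is on the include path (newer 1.x releases may also need an autoloader). The point is that stored field types such as Keyword and Text keep the original value alongside the index terms, so a hit can hand back the full text.

```php
<?php
// Sketch only: assumes Zend Framework 1.x is on the include path.
// The path, field names, and ID below are hypothetical.
require_once 'Zend/Search/Lucene.php';

// create() starts a fresh index in the given directory.
$index = Zend_Search_Lucene::create('/tmp/pilaster-demo-index');

$doc = new Zend_Search_Lucene_Document();
// Keyword: stored and indexed as one untokenized term (acts as a primary key).
$doc->addField(Zend_Search_Lucene_Field::Keyword('docid', 'post-42'));
// Text: tokenized, indexed, AND stored, so the index holds the full value.
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Lucene in PHP?'));
$doc->addField(Zend_Search_Lucene_Field::Text('body', 'The whole article text goes here.'));
$index->addDocument($doc);
$index->commit();

// Look the document up by key with a term query, which bypasses the analyzer
// (the default analyzer would otherwise split or drop parts of 'post-42').
$term  = new Zend_Search_Lucene_Index_Term('post-42', 'docid');
$query = new Zend_Search_Lucene_Search_Query_Term($term);

foreach ($index->find($query) as $hit) {
    // Stored fields come straight back from the index; no separate repository.
    echo $hit->getDocument()->getFieldValue('body'), "\n";
}
```

Fields added with UnStored would be searchable but not retrievable, which is the "index only" arrangement the documentation has in mind. Using stored fields for everything is what turns the index into the document repository.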

Adding full backup-and-restore was a five-minute job. I haven't tried incrementals yet. But as of 10pm last night, all of my tests were passing, and the library appeared to be working. I'll probably release it as PilasterPHP 1.0... but I want to do some real-world testing first.
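
For a sense of why full backup-and-restore can be that quick: a Lucene index on disk is a directory of files, so one plausible approach is simply to copy that directory. The sketch below is an illustration under that assumption, not the actual Pilaster implementation. The function names are hypothetical, it treats the index as a flat directory on the local file system, and it assumes nothing is writing to the index while the copy runs.

```php
<?php
// Sketch only: full backup/restore of an index by copying its files.
// Function names are hypothetical; assumes a flat directory and no
// concurrent writers during the copy.

function copyIndexDir($from, $to)
{
    if (!is_dir($from)) {
        throw new RuntimeException("Not a directory: $from");
    }
    if (!is_dir($to) && !mkdir($to, 0777, true)) {
        throw new RuntimeException("Cannot create: $to");
    }
    $copied = 0;
    foreach (scandir($from) as $name) {
        $src = $from . DIRECTORY_SEPARATOR . $name;
        if (!is_file($src)) {
            continue; // skips '.', '..', and any subdirectories
        }
        if (!copy($src, $to . DIRECTORY_SEPARATOR . $name)) {
            throw new RuntimeException("Copy failed: $src");
        }
        $copied++;
    }
    return $copied; // number of files copied
}

function backupIndex($indexDir, $backupDir)
{
    return copyIndexDir($indexDir, $backupDir);
}

function restoreIndex($backupDir, $indexDir)
{
    // Remove current files first so stale segments do not mix with the backup.
    $existing = glob($indexDir . DIRECTORY_SEPARATOR . '*');
    foreach ($existing ? $existing : array() as $old) {
        if (is_file($old)) {
            unlink($old);
        }
    }
    return copyIndexDir($backupDir, $indexDir);
}
```

Calling backupIndex('/path/to/index', '/path/to/backup') and later restoreIndex('/path/to/backup', '/path/to/index') would round-trip the whole index. Incrementals are a different problem, since they would need to know which segment files changed since the last copy.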

I'm sure this is not the most scalable solution. With index sizes capped at 2 GB, it'd be wise to store nothing other than small text documents (articles, blogs, API docs, and so on). But that's about the size of the needs of the Caryatid system (as well as most blog managers and low-end Web CMS systems).

So I'm giving it a try.