Log
Notes on Open Source Search Engines
I’ve been investigating various open source search engines and spiders to implement here, since I’ve got a mix of static and database-driven pages. The simplest thing to do would be to rely on the search engine built into Movable Type, but it wouldn’t index the content on the rest of the site. Here’s a summary of the options I’ve looked into. I’ll be updating this as I research more.
Swish-e
Swish-e is a spider and search engine written in C, so indexing is really fast. It writes its indexes to flat files rather than to a database.
Pros
- Fast
- Apparently still in active development. Code has been checked in within the last week.
- Can index formats other than HTML/XML/plain text such as PDF, gzip, etc.
- Can exclude parts of pages from index
Cons
- Perl API—no real PHP support (not the biggest deal, but it’s nice to use as few languages on one website as possible…)
- Indexing couldn’t be completed on the first run against my development site, I think because of non-ASCII characters
- Converts UTF-8 characters to Latin-1
- Can’t remove parts of index—must completely re-index
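To make the Latin-1 point concrete, here is a minimal Python sketch of what happens when UTF-8 text is forced into Latin-1. This is not Swish-e code: the function name `to_latin1` and the choice to substitute `?` for unmappable characters are assumptions for illustration, and Swish-e's actual handling may differ.

```python
def to_latin1(text: str) -> str:
    """Round-trip text through Latin-1, replacing unmappable characters with '?'.

    Illustrative only: the '?' substitution is an assumption, not Swish-e behavior.
    """
    return text.encode("latin-1", errors="replace").decode("latin-1")


# Accented Western European letters exist in Latin-1 and survive;
# Japanese text and curly quotes do not, so they are replaced.
samples = ["café", "naïve", "日本語", "“quoted”"]
for s in samples:
    print(s, "->", to_latin1(s))
```

So a site that is mostly English or Western European text loses little, while anything outside that range (including typographic quotes) would not be searchable as written.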
PhpDig
A PHP version of ht://Dig that uses MySQL to store the index.
Pros
- Written in PHP, so it will play nicely with a PHP site
- Fairly flexible in terms of controlling what is indexed on a given site—can isolate certain directories, control how many levels down it goes, etc.
- Pretty good reporting on search terms, common words, etc.
Cons
- Indexing is painfully slow and uses tons of CPU & memory
- Last release is from November 2005
- Admin interface is confusing at first and has a fairly steep learning curve
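The kind of control mentioned in the pros (isolating directories and limiting how many levels down the spider goes) can be sketched as a depth-limited crawl. This is a rough Python illustration, not PhpDig's implementation: it assumes "levels" means link hops from the start page, and the `crawl` function, its parameters, and the in-memory `site` map are all hypothetical.

```python
from collections import deque


def crawl(links, start, max_depth, excluded_prefixes=()):
    """Return pages reachable from start within max_depth link hops.

    links maps each page path to the paths it links to (hypothetical data).
    Paths starting with any of excluded_prefixes are never visited.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(page, []):
            if target in seen or target.startswith(tuple(excluded_prefixes)):
                continue
            seen.add(target)
            order.append(target)
            queue.append((target, depth + 1))
    return order


# Hypothetical site structure, for illustration only.
site = {
    "/": ["/blog/", "/admin/", "/about.html"],
    "/blog/": ["/blog/post-1.html", "/blog/post-2.html"],
    "/blog/post-1.html": ["/blog/post-1/comments.html"],
}

print(crawl(site, "/", max_depth=2, excluded_prefixes=("/admin/",)))
```

With a depth of 2 and `/admin/` excluded, the comments page three hops down is never reached and the admin directory is skipped entirely.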
The Search Engine Project
Another PHP/MySQL-based search engine and spider.
Pros
- Very configurable
- Fast
- Written in PHP
Cons
- Very complicated admin—so many options it takes a lot of effort to figure out what they all do
- Latest version needs code modifications right out of the box for indexing to work
- Getting it to actually spider your site can be tricky and time-consuming
- Last release was September 2005, no commits for more than 10 months
04/12/07 07:59PM Geekiness