A better search facility for Drupal

Drupal has a search facility, yet many people use Google to search Drupal-based Web sites because Google is better at finding the right Web page. What are the alternatives, and what are the costs, problems, and results of using them?

Facet/keyword/tag

Facet, keyword, and tag are common names for identifiers you can select during a search to speed up the search. They are all exactly the same. Tag is currently the common name for keywords added by users to comments, articles, and other classification systems. Tag used to be a name for what is now called a facet. Keyword searches have been around since the invention of electrons. In the very early days of computing, a three-book encyclopaedia was indexed electronically and the printed version of the index was twenty-two books. It was one of the early versions of combined keyword and phrase search indexing. Facet is the currently fashionable term for keyword.

Facet also implies cumulative addition of facets to refine searches, although not all facet searches provide effective cumulative additions, or any additions at all. "Search within search" and similar terms are a more accurate description of an effective cumulative search, where the results are refined from previous results.
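The cumulative idea can be sketched in a few lines of Python. This is only an illustration: the item list, the facet names, and the refine function are all made up for the example and do not come from any of the search products discussed here.

```python
# Hypothetical catalogue: each item carries a few facet values.
ITEMS = [
    {"id": 1, "type": "article", "topic": "search", "year": 2011},
    {"id": 2, "type": "article", "topic": "theming", "year": 2011},
    {"id": 3, "type": "comment", "topic": "search", "year": 2012},
    {"id": 4, "type": "article", "topic": "search", "year": 2012},
]


def refine(results, facet, value):
    """Narrow an existing result list by one more facet (search within search)."""
    return [item for item in results if item.get(facet) == value]


# Each step works on the previous results, not on the whole catalogue.
step1 = refine(ITEMS, "topic", "search")
step2 = refine(step1, "type", "article")
step3 = refine(step2, "year", 2012)

print([item["id"] for item in step1])  # [1, 3, 4]
print([item["id"] for item in step2])  # [1, 4]
print([item["id"] for item in step3])  # [4]
```

A facet search that restarts from the full catalogue at every step would not be cumulative in this sense.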

Alternatives

Drupal search

The Drupal built-in search has received no development for many years and is not designed for modern Web sites. The search works for some small Web sites and fails on sites the size of drupal.org. In fact it started failing on drupal.org several years ago, when drupal.org was far smaller.

Google

The Google search is effective for medium complexity searches and is better than the Drupal search for a site the size of drupal.org. The main disadvantages of the Google search are the lack of knowledge about your content and Google removing useful search options.

Apache Solr

Apache Solr is an indexing and search program written in Java, and it requires a separate everything to work, including Java and a server to run it on. Luckily there are other versions under development, including a PHP version.

The big advantage of Solr is called faceted searching. Faceted searching is a fancy version of the keyword selection you provide in shopping systems. Faceted search requires setup time for every facet and may return better search results from some facets. The biggest successes have a lot of time invested in creating the right facets for the content.

PHP Solr

Solr is a front end for Apache Lucene. Lucene is available in several languages. There is a PHP version for the Zend Framework. The PHP version has the advantage that you do not need a separate server.

Search words

Search words is a work in progress. Originally built before the Web, Search words moved to the Web when the Web was invented, then to PHP when PHP was invented, then to Drupal when..., well, it arrived in Drupal 4. I never published that version because I was not working on any sites that needed the advantages of Search words.

Search words invests a lot of processing up front to maximise search speed when your visitors search your site. Content words and phrases are connected to an id that connects to content. In the Drupal version the content is in nodes and the node id is used as the content identifier.
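The general technique of connecting words and phrases to a content id is usually called an inverted index. The Python sketch below shows the idea under stated assumptions: it is not the Search words code, and the sample nodes, the tokenize, build_index, and lookup functions, and the max_phrase_len parameter are all hypothetical. The parameter only illustrates that storing longer phrases costs more storage.

```python
import re
from collections import defaultdict

# Hypothetical content keyed by node id.
NODES = {
    1: "Drupal search is slow on large sites",
    2: "Solr needs a separate search server",
    3: "A large search index speeds up search",
}


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(nodes, max_phrase_len=2):
    """Map every word, and every phrase of up to max_phrase_len words, to node ids."""
    index = defaultdict(set)
    for nid, text in nodes.items():
        words = tokenize(text)
        for n in range(1, max_phrase_len + 1):
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])].add(nid)
    return index


# The expensive work happens once, up front.
INDEX = build_index(NODES)


def lookup(query):
    """A search is then a single dictionary lookup."""
    return sorted(INDEX.get(" ".join(tokenize(query)), set()))


print(lookup("search"))         # [1, 2, 3]
print(lookup("search server"))  # [2]
print(lookup("large search"))   # [3]
```

All the cost sits in build_index, which is the trade described above: heavy processing at indexing time in exchange for a cheap lookup when a visitor searches.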

Pet Search

Pet search is an alternative to Search words and is not currently under development. Pet search was a predecessor to Search words for specific projects and might be revived one day as a Search words Extra Light.

Requirements for a good search

Most search facilities throw away small words (a, if, or, my, and, the) and so miss important results. You may be offered a way to include the words in your search, but the inclusion is useless if the search database does not contain the original information. Search words is one search that searches both with and without the small words, returning the more accurate results first. Google recently removed their advanced search, which was the only way of using Google accurately.

Exact phrase searches should be ahead of regular searches but are missing from most search engines. Search words has exact phrase searching built in automatically and lets you specify a maximum degree of accuracy which will alter the amount of disk space used. The current Drupal version does not have that setting implemented. You can choose the storage type and that may have some limitations.

The search should perform the exact phrase first then search for approximations if there are no exact results. The approximations should be based on the way people type instead of quick cuts based on easy programming.
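As a rough Python sketch of "exact phrase first, approximations second", the example below tries the exact phrase and only falls back when nothing matches. The fallback accepts the query words in any order and tolerates two neighbouring letters being swapped, which is one common typing slip. The sample documents and every function name are assumptions made for the illustration, not part of any existing search module.

```python
import re

# Hypothetical documents keyed by id.
DOCS = {
    1: "Faceted search needs setup time",
    2: "Search the site with faceted navigation",
    3: "Theming guide",
}


def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def has_phrase(doc_words, query_words):
    """True when query_words appear in doc_words as one unbroken run."""
    n = len(query_words)
    return any(doc_words[i:i + n] == query_words
               for i in range(len(doc_words) - n + 1))


def transpositions(word):
    """Variants of word with two neighbouring letters swapped."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)}


def search(query):
    q = words(query)
    if not q:
        return ("none", [])
    # Pass 1: the exact phrase.
    exact = [nid for nid, text in DOCS.items() if has_phrase(words(text), q)]
    if exact:
        return ("exact", exact)
    # Pass 2: every query word, or a swapped-letter variant, in any order.
    approx = []
    for nid, text in DOCS.items():
        doc = set(words(text))
        if all(w in doc or transpositions(w) & doc for w in q):
            approx.append(nid)
    return ("approximate", approx)


print(search("faceted search"))  # ('exact', [1])
print(search("facetde saerch"))  # ('approximate', [1, 2])
```

A real approximation pass would need more than letter swaps, such as missed or doubled letters, but the ordering of the two passes is the point here.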

There is no accurate way to remove short words. Dictionary words should be removed before random words because the random words are often identifiers or terminology.
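One way to picture the difference is to remove a short word only when it is also a known dictionary word, so short identifiers survive. The Python sketch below assumes a tiny hand-made word list and a length limit of three characters. Both are placeholders for illustration, not a recommended configuration.

```python
# Hypothetical list of common short dictionary words.
COMMON_SHORT_WORDS = {"a", "an", "and", "if", "my", "of", "or", "the", "to"}


def removable(word, max_len=3):
    """A word is dropped only if it is short AND a known dictionary word."""
    return len(word) <= max_len and word in COMMON_SHORT_WORDS


def reduce_query(query):
    """Remove small dictionary words, keeping short identifiers and terminology."""
    return [w for w in query.lower().split() if not removable(w)]


print(reduce_query("the php api of d7"))  # ['php', 'api', 'd7']
```

A filter based on length alone would have thrown away php, api, and d7 along with the and of.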

Search for singular and plural. Search for the exact value entered first then for other variations. Search words includes definable variations. Google makes a guess at them based on common searches but knows nothing about your content. HTML lets you define some alternatives but does not let you specify what they are an alternative to.
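A minimal Python sketch of definable variations follows. The VARIATIONS table and the expand function are hypothetical and stand in for whatever a site owner would define. The only point illustrated is the ordering: the value as entered comes first, then the other forms.

```python
# Hypothetical, site-defined groups of equivalent word forms.
VARIATIONS = {
    "index": ["indexes", "indices"],
    "facet": ["facets"],
}


def expand(term):
    """Return the term as entered first, followed by its defined variations."""
    term = term.lower()
    ordered = [term]
    for base, variants in VARIATIONS.items():
        group = [base] + variants
        if term in group:
            ordered += [v for v in group if v != term]
    return ordered


print(expand("indices"))  # ['indices', 'index', 'indexes']
print(expand("facet"))    # ['facet', 'facets']
print(expand("solr"))     # ['solr']
```

The search would then run each form in that order and rank results for the entered value ahead of results for the variations.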

Common database limitations

Facets, keywords, and other approaches require some form of storage. There are some super tricky storage techniques out there but the big all time favourites for practical storage have accumulated, collectively, hundreds of years of experience. What can we tell from the experience?

String identifiers make up most searches. Common computer code uses strings measured by a length implemented as one byte, two bytes, four bytes, or eight bytes. The one byte length limits strings to 255 characters, sufficient for 99.8% of searches. The rest require something longer and the two byte length of 65535 is generally far larger than what is required for anything other than a copyright search. Even a copyright search is better performed as a series of smaller qualifying searches. Searches are generally limited by one of the byte lengths.

A single byte character cannot store all the characters for all the languages. UTF-8 and UTF-16 are character encodings aimed at storing every character for every language. They use more than one byte for many characters. A 255 byte string might contain 255 characters or it might contain fewer than 100 multibyte characters. You can understand the difficulty of predicting exactly what you can do with a 255 byte string.
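The arithmetic is easy to check in Python. A one byte length can count up to 2^8 - 1 = 255 and a two byte length up to 2^16 - 1 = 65535. The sketch below measures the same number of characters in UTF-8 bytes. The helper names and the sample characters are chosen only for the illustration.

```python
# Largest length a one byte or two byte length field can hold.
ONE_BYTE_MAX = 2 ** 8 - 1    # 255
TWO_BYTE_MAX = 2 ** 16 - 1   # 65535


def utf8_bytes(text):
    """Number of bytes the text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))


def fits(text, limit=ONE_BYTE_MAX):
    """True when the text fits in a string limited to `limit` bytes."""
    return utf8_bytes(text) <= limit


ascii_text = "a" * 255        # 1 byte per character
accented = "\u00e9" * 255     # e with acute accent, 2 bytes per character
cjk_85 = "\u8a9e" * 85        # a CJK character, 3 bytes per character
cjk_86 = "\u8a9e" * 86

print(utf8_bytes(ascii_text), fits(ascii_text))  # 255 True
print(utf8_bytes(accented), fits(accented))      # 510 False
print(utf8_bytes(cjk_85), fits(cjk_85))          # 255 True
print(utf8_bytes(cjk_86), fits(cjk_86))          # 258 False
```

The same 255 byte field holds 255 plain ASCII characters but only 85 of the three byte characters.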

Many databases and storage systems either do not use strings with two byte lengths or treat them as special items requiring extra storage. Yes, it is automatic, but there are extra overheads for every action. Yes, some storage software brags about not being a database, but the special software often ends up using exactly the same storage techniques for exactly the same reasons. Your search performance ends up being the same.

The extra server

The Solr approach is supposed to be faster. Note that you have to have a second server or second virtual private server. If you allocated twice the money to your current Web site, you would get more than twice the resources and everything on your web site would be faster, not just the search. The effect of using a special separate server hides the true comparison between the different approaches.

The separate extra server might be a big advantage when several of your sites share the search server. The same multiple site usage might be a severe disadvantage if you use a hosted service and the host company decides to share the one search engine server across 800 sites.

If you are stuck with external search software and you have your own server, you can run them side by side and move resources wherever they are needed. When you use two virtual private servers, you can adjust them to fit.

The indexing process might be the biggest resource user, and the indexing uses resources on your Web server to generate the content for indexing. Be careful to monitor the resources used for the indexing and push it out to the quietest time for your Web site.

Future developments

I looked at developing Search words for Drupal 7. There is no interest. The most vocal people on drupal.org, and at conferences, recommend Solr without hinting at the problems Solr creates for smaller sites. Many of those vocal people have a financial interest in pushing people toward hosting services that include Solr. It is hard to promote something against a strong wave of self interest. There are other markets for the underlying software.

Google is reducing its search service and moving toward a sales portal. Google search will be less effective for your content and more effective for the content of people who advertise through Google.

The Lucene implementations in languages other than Java appear to be dead in the water. They exist and are compatible but only with very old versions of Lucene, which makes them useless against the current versions of Solr. If you have more than five people in your information technology team, it is probably not a big overhead to have one person on board who knows enough to implement Apache Solr, the Tomcat server for Java, and Java. Put them in a case in the corner and away from sharp implements. For smaller projects, there are currently few alternatives, with the common shopping cart/product sales system as the one real area of innovation.

For many sites an old-fashioned, resource-hungry full text content search is a practical reality because modern servers can hold your whole Web content in memory and search the content using just one of the eight processors, of which perhaps only two are used for the Web site.
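A full text search over content held in memory needs no index at all. The Python sketch below assumes the whole site fits in a dictionary keyed by node id. The sample content and the function name are made up for the illustration.

```python
# Hypothetical site content held entirely in memory, keyed by node id.
CONTENT = {
    1: "Drupal search is slow on large sites",
    2: "Solr needs a separate search server",
    3: "Theming guide for small sites",
}

# Lower-case once so each search does not repeat the work.
LOWERED = {nid: text.lower() for nid, text in CONTENT.items()}


def full_text_search(query):
    """Scan every document for the query text; no index required."""
    q = query.lower()
    return [nid for nid, text in LOWERED.items() if q in text]


print(full_text_search("search"))        # [1, 2]
print(full_text_search("small sites"))   # [3]
```

Every search reads every document, which is the resource hunger mentioned above, but for a small site the scan can be cheap enough that the simplicity wins.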

Is there a need for a new search module?

One big question is the need for a new search module versus the work required to build it. The search part is easy. The user interface is the difficult bit. With all the Drupal focus on Solr, anyone proposing an alternative is likely to be shot down without discussion of the need for an alternative to Solr.

Solr fits a big project where you can have a separate server. An alternative would be database based and is unlikely to benefit the very small sites with no control over the database. The sites most likely to benefit are the ones big enough to have a VPS and some control over their database but without the people resources to run Solr. This is the market attacked by a number of commercial organisations influential in the Drupal world, and they generally offer Solr, giving them the financial incentive to discredit alternatives.

I am happy to work on my own module for my own use with no user interface. I would be happy to work on an alternative as part of a project to train people in Drupal module development. There is no incentive for me to work on a public module outside of helping people learn Drupal. I cannot see any incentive for anyone else to devote time to this type of project when there is so much media misinformation devoted to Solr.

Conclusion

Search software is not magic. Modern technology makes only one real difference: you can afford to chew through huge resources during the indexing process in order to make the search faster. The software is cheaper than the cost of the setup time. The biggest setup time may be configuring and testing your facets/keywords/tags for focused searches. Look at the configuration time and the ongoing testing time when estimating cost.