A better search facility for Drupal
Submitted by Peter on Fri, 2011-12-09 19:23
Drupal has a search facility, yet many people use Google to search Drupal-based Web sites because Google is better at finding the right Web page. What are the alternatives, their costs, their problems, and the results from using them?
Facet/keyword/tag
Facet, keyword, and tag are common names for identifiers you can select during a search to speed up the search. They all mean essentially the same thing. Tag is currently the common name for keywords added by users to comments, articles, and other classification systems. Tag used to be a name for what is now called a facet. Keyword searches have been around since the invention of electrons. In the very early days of computing, a three book encyclopaedia was indexed electronically and the printed version of the index was twenty two books. It was one of the early versions of combined keyword and phrase search indexing. Facet is the currently fashionable term for keyword.
Facet also implies cumulative addition of facets to refine searches, although not all facet searches provide effective cumulative additions or any sort of additions. "Search within search" and similar terms are a more accurate description of an effective cumulative search where the results are refined from previous results.
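A cumulative search can be sketched in a few lines. This is only an illustration: the items, the facet names, and the refine() function are made up for the example and do not come from any Drupal module. Each step searches within the results of the previous step.

```python
# Hypothetical content items, each carrying facet values.
ITEMS = [
    {"id": 1, "title": "Drupal 7 search", "facets": {"topic": "drupal", "year": "2011"}},
    {"id": 2, "title": "Drupal 6 themes", "facets": {"topic": "drupal", "year": "2009"}},
    {"id": 3, "title": "Linux RAID", "facets": {"topic": "linux", "year": "2011"}},
]


def refine(results, facet, value):
    """Keep only the results whose facet matches the chosen value."""
    return [item for item in results if item["facets"].get(facet) == value]


# Each refinement works on the previous results, not on the whole site.
step1 = refine(ITEMS, "topic", "drupal")
step2 = refine(step1, "year", "2011")

print([item["id"] for item in step1])
print([item["id"] for item in step2])
```

A facet search that starts again from the whole site for every selection is not cumulative in this sense.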
Alternatives
Drupal search
The Drupal built-in search has received no development for many years and is not designed for modern Web sites. The search works for some small Web sites and fails on sites the size of drupal.org. In fact it started failing on drupal.org several years ago when drupal.org was far smaller.
The Google search is effective for medium complexity searches and is better than the Drupal search for a site the size of drupal.org. The main disadvantages of the Google search are the lack of knowledge about your content and Google removing useful search options.
Apache Solr
Apache Solr is an indexing and search program written in Java and requires a separate server, and separate supporting software, to work. Luckily there are other versions under development, including a PHP version.
The big advantage of Solr is called faceted searching. Faceted searching is a fancy version of the keyword selection you provide in shopping systems. Faceted search requires setup time for every facet and may return better search results from some facets. The biggest successes have a lot of time invested in creating the right facets for the content.
PHP Solr
Solr is a front end for Apache Lucene. Lucene is available in several languages. There is a PHP version for the Zend Framework. The PHP version has the advantage that you do not need a separate server.
Search words
Search words is a work in progress. Originally built before the Web, Search words moved to the Web when the Web was invented, then to PHP when PHP was invented, then to Drupal when..., well, it arrived in Drupal 4. I never published that version because I was not working on any sites that needed the advantages of Search words.
Search words invests a lot of processing up front to maximise search speed when your visitors search your site. Content words and phrases are connected to an id that connects to content. In the Drupal version the content is in nodes and the node id is used as the content identifier.
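The general idea of connecting words and phrases to a content id can be sketched as follows. This is not the Search words code. The node texts, build_index(), lookup(), and the max_phrase_words setting are all assumptions made for the illustration. All the work happens when the index is built, so a search is a single lookup.

```python
from collections import defaultdict

# Hypothetical node id => node text.
NODES = {
    1: "Apache Solr is a search server",
    2: "Drupal search is built in",
    3: "A better search facility for Drupal",
}


def build_index(nodes, max_phrase_words=3):
    """Map every word and every short phrase to the ids of the nodes containing it."""
    index = defaultdict(set)
    for nid, text in nodes.items():
        words = text.lower().split()
        for n in range(1, max_phrase_words + 1):
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])].add(nid)
    return index


def lookup(index, query):
    """Return the sorted node ids stored for a word or phrase."""
    return sorted(index.get(query.lower(), set()))


INDEX = build_index(NODES)

print(lookup(INDEX, "search"))
print(lookup(INDEX, "drupal search"))
```

Raising max_phrase_words stores longer phrases and uses more space, which is the trade between accuracy and storage in miniature.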
Pet Search
Pet search is an alternative to Search words and is not currently under development. Pet search was a predecessor to Search words for specific projects and might be revived one day as a Search words Extra Light.
Requirements for a good search
Most search facilities throw away small words, such as "a", "if", "or", "my", "and", and "the", and so miss important results. You may be offered a way to include the words in your search but the inclusion is useless if the search database does not contain the original information. Search words is one search that both includes and automatically excludes small words, with the search returning the more accurate results first. Google recently removed their advanced search, which was the only way of using Google accurately.
Exact phrase searches should be ahead of regular searches but are missing from most search engines. Search words has exact phrase searching built in automatically and lets you specify a maximum degree of accuracy which will alter the amount of disk space used. The current Drupal version does not have that setting implemented. You can choose the storage type and that may have some limitations.
The search should perform the exact phrase search first, then search for approximations if there are no exact results. The approximations should be based on the way people type instead of quick cuts based on easy programming.
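A minimal sketch of the exact-first order, assuming a handful of made-up pages. The approximate step uses get_close_matches() from Python's standard difflib module purely as a stand-in. It compares letter sequences and knows nothing about how people actually type, so treat it as a placeholder for a better approximation.

```python
import difflib

# Hypothetical page id => page text.
PAGES = {
    1: "the who live in concert",
    2: "who is the best drummer",
    3: "concert tickets for sale",
}


def exact_phrase(pages, phrase):
    """Return ids of pages containing the words of the phrase in order, small words included."""
    wanted = phrase.lower().split()
    n = len(wanted)
    hits = []
    for pid, text in pages.items():
        words = text.lower().split()
        if any(words[i:i + n] == wanted for i in range(len(words) - n + 1)):
            hits.append(pid)
    return hits


def approximate(pages, phrase, cutoff=0.8):
    """Return ids of pages where every search word has a close match somewhere in the text."""
    hits = []
    for pid, text in pages.items():
        words = text.lower().split()
        if all(difflib.get_close_matches(w, words, n=1, cutoff=cutoff)
               for w in phrase.lower().split()):
            hits.append(pid)
    return hits


def search(pages, phrase):
    """Try the exact phrase first and fall back to approximations only when nothing matches."""
    return exact_phrase(pages, phrase) or approximate(pages, phrase)


print(search(PAGES, "the who"))
print(search(PAGES, "concert tickts"))
```

The first search only works because "the" was kept. A search that throws away small words cannot tell page 1 from page 2.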
There is no accurate way to remove short words. Dictionary words should be removed before random words because the random words are often identifiers or terminology.
Search for singular and plural. Search for the exact value entered first then for other variations. Search words includes definable variations. Google makes a guess at them based on common searches but knows nothing about your content. HTML lets you define some alternatives but does not let you specify what they are an alternative to.
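Definable variations can be as simple as a table you maintain for your own content. The table and the expand() function below are hypothetical and only show the order: the exact value entered comes first, then the other variations.

```python
# Hypothetical site-specific variations: base word => alternatives.
VARIATIONS = {
    "mouse": ["mice"],
    "index": ["indexes", "indices"],
}


def expand(term, variations=VARIATIONS):
    """Return the entered term first, followed by its defined variations."""
    term = term.lower()
    ordered = [term]
    for base, alternatives in variations.items():
        group = [base] + alternatives
        if term in group:
            ordered += [word for word in group if word != term]
    return ordered


print(expand("indices"))
print(expand("Solr"))
```

The search then runs each term in the returned order and ranks results for the entered value ahead of results for the variations.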
Common database limitations
Facets, keywords, and other approaches require some form of storage. There are some super tricky storage techniques out there but the big all time favourites for practical storage have accumulated, collectively, hundreds of years of experience. What can we tell from the experience?
String identifiers make up most searches. Common computer code uses strings measured by a length implemented as one byte, two bytes, four bytes, or eight bytes. The one byte length limits strings to 255 characters, sufficient for 99.8% of searches. The rest require something longer, and the two byte limit of 65,535 characters is generally far larger than what is required for anything other than a copyright search. Even a copyright search is better performed as a series of smaller qualifying searches. Searches are generally limited by one of the byte lengths.
A single byte character cannot store all the characters for all the languages. UTF-8 and UTF-16 are Unicode encodings aimed at storing every character for every language. UTF-8 uses one to four bytes per character and UTF-16 uses two or four bytes. A 255 byte string might contain 255 characters or it might contain fewer than 100 multibyte characters. You can understand the difficulty of predicting exactly what you can do with a 255 byte string.
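The difference between characters and bytes is easy to check. The sketch below assumes UTF-8 storage and a 255 byte limit. The function names are made up for the example.

```python
LIMIT = 255  # bytes available when the string length is stored in one byte


def utf8_length(text):
    """Return the number of bytes the text needs in UTF-8."""
    return len(text.encode("utf-8"))


def truncate_to_bytes(text, limit=LIMIT):
    """Cut the text to fit the byte limit without leaving half a character at the end."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


ascii_text = "a" * 255        # one byte per character
accented = "\u00e9" * 200     # e with acute accent, two bytes per character
kanji = "\u691c" * 86         # a Japanese character, three bytes per character

print(utf8_length(ascii_text), utf8_length(accented), utf8_length(kanji))
print(len(truncate_to_bytes(accented)), len(truncate_to_bytes(kanji)))
```

The same 255 byte field holds 255 plain English characters, 127 of the accented characters, or 85 of the Japanese characters.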
Many databases and storage systems either do not use strings with two byte lengths or treat them as special items requiring extra storage. Yes, it is automatic but there are extra overheads for every action. Yes, some storage software brags about not being a database but the special software often ends up using exactly the same storage techniques for exactly the same reasons. Your search performance ends up being the same.
The extra server
The Solr approach is supposed to be faster. Note that you have to have a second server or second virtual private server. If you allocated twice the money to your current Web site, you would get more than twice the resources and everything on your Web site would be faster, not just the search. The effect of using a special separate server hides the true comparison between the different approaches.
The separate extra server might be a big advantage when several of your sites share the search server. The same multiple site usage might be a severe disadvantage if you use a hosted service and the host company decides to share the one search engine server across 800 sites.
If you are stuck with external search software and you have your own server, you can run them side by side and move resources wherever they are needed. When you use two virtual private servers, you can adjust them to fit.
The indexing process might be the biggest resource user and the indexing uses resources on your Web server to generate the content for indexing. Be careful to monitor the resources used for the indexing and push it out to the quietest time for your Web site.
Future developments
I looked at developing Search words for Drupal 7. There is no interest. The most vocal people on drupal.org, and at conferences, recommend Solr without hinting at the problems Solr creates for smaller sites. Many of those vocal people have a financial interest in pushing people toward hosting services that include Solr. It is hard to promote something against a strong wave of self interest. There are other markets for the underlying software.
Google is reducing its search service and moving toward a sales portal. Google search will be less effective for your content and more effective for the content of people who advertise through Google.
The Lucene implementations in languages other than Java appear to be dead in the water. They exist and are compatible but only with very old versions of Lucene, which makes them useless against the current versions of Solr. If you have more than five people in your information technology team, it is probably not a big overhead to have one person on board who knows enough to implement Apache Solr, the Tomcat server for Java, and Java. Put them in a case in the corner and away from sharp implements. For smaller projects, there are currently few alternatives, with the common shopping cart/product sales system as the one real area of innovation.
For many sites an old-fashioned, resource hungry, full text content search is a practical reality because modern servers can hold your whole Web content in memory and search the content using just one of the eight processors, of which perhaps only two are used for the Web site.
Is there a need for a new search module?
One big question is the need for a new search module versus the work required to build it. The search part is easy. The user interface is the difficult bit. With all the Drupal focus on Solr, anyone proposing an alternative is likely to be shot down without discussion of the need for an alternative to Solr.
Solr fits a big project where you can have a separate server. An alternative would be database based and is unlikely to benefit the very small sites with no control over the database. The sites most likely to benefit are the ones big enough to have a VPS and some control over their database but not the people resources to run Solr. This is the market attacked by a number of commercial organisations influential in the Drupal world and they generally offer Solr, giving them the financial incentive to discredit alternatives.
I am happy to work on my own module for my own use with no user interface. I would be happy to work on an alternative as part of a project to train people in Drupal module development. There is no incentive for me to work on a public module outside of helping people learn Drupal. I cannot see any incentive for anyone else to devote time to this type of project when there is so much media misinformation devoted to Solr.
Conclusion
Search software is not magic. Modern technology makes only one real difference: you can afford to chew through huge resources during the indexing process in order to make the search faster. The software is cheaper than the cost of the setup time. The biggest setup time may be configuring and testing your facets/keywords/tags for focused searches. Look at the configuration time and the ongoing testing time when estimating cost.








