Searchspy Home Page

Searchspy is a simple Perl 5 script that processes a standard web server log file (referrer information required) and tells you what search queries people are using to find your web pages from search engines.

For instance, these lines from my access_log

  sea-ts1-p27.wolfenet.com - - [01/Feb/1998:05:10:11 -0500] "GET /people/nelson/movies/ HTTP/1.1" 200 114132 "http://www.metacrawler.com/crawler?general=%22rumble+in+the+bronx%22+stereotypes&target=&method=0&region=0&rpp=20&timeout=10&hpe=10"
  port54.soho.prodigy.net - - [01/Feb/1998:07:13:19 -0500] "GET /people/nelson/movies/ HTTP/1.0" 200 114132 "http://www.mckinley.com/magellan/look-magellan.dcg?dcgvar=true&search=%22sexual+anxiety%22&c=web&look=magellan"

give this output when passed through searchspy:

  /people/nelson/movies/ "rumble in the bronx" stereotypes
  /people/nelson/movies/ "sexual anxiety"

telling me in a very simple way that people found my movie page when they were looking for "sexual anxiety" or information about stereotypes in "Rumble in the Bronx". (Not a very good movie, by the way. See "Snake in the Eagle's Shadow" instead.)

Download and Docs

Feel free to download searchspy, run it, and play with it. The simplest way to run it is

  searchspy < access_log

although I often prefer to do

  searchspy < access_log | sort | uniq -c | sort -nr

to process the data somewhat. You may want to change the $searchURL variable in the script to customize it for your particular web pages. By default it processes all pages.

I wrote this because I was curious how well search engines worked in practice. The results are quite amusing as well as informative. I'm surprised at how on-target most search terms are. People find my movie pages with queries that are about movies, not about random other topics. I get few hits from random sex searches (maybe that says something about my pages!). Full text indexing works better than you might think.

I'm interested in any comments you may have. But this software is really a one-off hack, and I do not promise to maintain it in any meaningful way. If someone is really excited by this program, I may be interested in handing development over.

If you are looking for fancier logfile analysis, analog now also reports search engine queries in a similar way.

News

Feb 7 2000, noted that analog does something similar.
Jul 21 1999, revision 1.8. Fixed a problem - searchspy no longer assumes that the referrer URL is surrounded by ""s. Also added another entry to the engine map for savvysearch.com.
Jul 20 1999, revision 1.7. Added some more entries to the engine map, thanks to Keith Dawson.
Jan 5 1999, revision 1.6. Added some entries to the engine map: altavista.com is the main one.

How it works

Search engines usually encode queries in the URL of the results page, using the standard CGI format. That means that when a user clicks through to your page, the referrer information sent to your web server includes the search term. If you've got referrer logging turned on and the client isn't using a filter like Intermute, you can find out the search terms that led users to your pages.

This Perl script is a quick hack I wrote on my decrepit old Linux laptop while on an airplane. The basic idea is that any referrer with a ? in the URL is quite likely a search engine. So referrers with a ? in them are parsed to pull out the particular query term. Each engine uses a different field to hold the search term: HotBot uses MT, AltaVista uses q, etc. This field is extracted, processed into human readable form, and then reported.
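
As a rough illustration of that idea (in Python rather than the script's Perl, and not taken from searchspy itself), the sketch below pulls the requested path and the referrer out of a log line and keeps the referrer only if it contains a ?. The regular expression and the name split_log_line are assumptions made for this example; it assumes the referrer is the first field after the status code and byte count, with or without surrounding quotes.

```python
import re

# Assumed layout: ... "METHOD path PROTOCOL" status bytes "referrer" ...
# The quote before the referrer is optional, since some logs omit it.
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+)[^"]*" \d+ \S+ "?(?P<referrer>[^"\s]+)'
)

def split_log_line(line):
    """Return (path, referrer) if the referrer looks like a search, else None."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    referrer = match.group("referrer")
    if "?" not in referrer:
        return None  # no query string, so probably not a search engine
    return match.group("path"), referrer

line = (
    'port54.soho.prodigy.net - - [01/Feb/1998:07:13:19 -0500] '
    '"GET /people/nelson/movies/ HTTP/1.0" 200 114132 '
    '"http://www.mckinley.com/magellan/look-magellan.dcg'
    '?dcgvar=true&search=%22sexual+anxiety%22&c=web&look=magellan"'
)
print(split_log_line(line))
```

Lines with no referrer (logged as "-") or with a referrer that has no query string come back as None and would simply be skipped.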

Want to Hack the Script?

Feel free to play with the script - if you do something neat, email me and let me know. Most of the code is straightforward. Lines are read in and parsed (in an ugly, probably broken way). If the referring URL has a ? in it, then it is unpacked into a hash named %a and engine-specific parsing is done on it. Finally, the query term is passed through a filter to turn +s into spaces and the %xx hex encoding back into characters.
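
The unpacking and decoding steps might look roughly like this in Python. This is a sketch of the technique, not a translation of the script: the function names unpack_query and decode_term are invented for the example, and the dictionary only loosely stands in for the %a hash.

```python
import re

def unpack_query(referrer):
    """Split the part after the first ? into a dict of raw field -> raw value."""
    _, _, query = referrer.partition("?")
    fields = {}
    for pair in query.split("&"):
        name, _, value = pair.partition("=")
        if name:
            fields[name] = value
    return fields

def decode_term(raw):
    """Turn +s into spaces, then %xx hex escapes back into characters."""
    spaced = raw.replace("+", " ")
    # Assumes single-byte escapes; multi-byte encodings are not handled.
    return re.sub(r"%([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), spaced)

referrer = (
    "http://www.mckinley.com/magellan/look-magellan.dcg"
    "?dcgvar=true&search=%22sexual+anxiety%22&c=web&look=magellan"
)
fields = unpack_query(referrer)
print(decode_term(fields["search"]))  # prints "sexual anxiety", quotes included
```

The + replacement has to happen before the %xx step, otherwise an encoded plus sign (%2B) would be turned into a space. In practice, Python's standard urllib.parse.parse_qs does both steps at once; the manual version is shown only to mirror the description above.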

The engine-specific parsing is all encapsulated in the table %engineMap. The basic idea there is if the search engine name matches the left side of an entry then the result of the subroutine on the right side is the actual query term. The function applytable actually executes the table - this could be useful code for other applications.
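
A table-driven dispatch of this kind could be sketched as follows, again in Python and not copied from the script. The field names for HotBot (MT) and AltaVista (q) come from the description above, and those for Magellan (search) and MetaCrawler (general) from the sample log lines; the host patterns, the list-of-pairs layout, and the name apply_table are assumptions made for the example.

```python
import re

# Each entry: (pattern matched against the referrer host, function that
# picks the raw query term out of the unpacked fields).
ENGINE_MAP = [
    (re.compile(r"hotbot"), lambda f: f.get("MT")),
    (re.compile(r"altavista"), lambda f: f.get("q")),
    (re.compile(r"mckinley"), lambda f: f.get("search")),
    (re.compile(r"metacrawler"), lambda f: f.get("general")),
]

def apply_table(table, key, fields):
    """Run the function of the first entry whose pattern matches key."""
    for pattern, extract in table:
        if pattern.search(key):
            return extract(fields)
    return None  # unknown engine

fields = {"dcgvar": "true", "search": "%22sexual+anxiety%22", "c": "web"}
print(apply_table(ENGINE_MAP, "www.mckinley.com", fields))
print(apply_table(ENGINE_MAP, "www.example.com", fields))
```

Because this sketch keeps the table as a list, entries are tried in the order written and the first match wins, which is not necessarily how the script's own table behaves.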

Some suggested hacks:

Improve the logfile parsing.
Update and/or maintain the %engineMap table for new search engines. Turn $debug on to see the errors.
Fix it so the looksmart fixup isn't needed.
Rewrite applytable so that the order is significant.

Nelson Minar	Created: February 24, 1998
`<nelson@media.mit.edu>`	Updated: February 7, 2000