somewhat daily mutterings

Implementing the "Recent Googles" Sidebar Box with a Shell Script

The other day I turned on referer logging on my Apache instance. Almost immediately, I noticed Google searches that had led to my site. I thought to myself, "wouldn't it be cool if I could display a sidebar box on my site containing the last n Googles that had led to my site?" "Yes", myself replied.

gemcast, the weblogging software I wrote for this site, has a feature that looks for '.box' files in its content directories. When it finds a '.box' file, it creates a sidebar box for the page it's building. Simple. So, about a half-hour later I had a shell script that generates a list of the last ten Google searches, along with the extra content that makes the output suitable for a .box entry. Then I created a cron job to run this script every 15 minutes and send its output to my root gemcast content directory, so it would display on the "top" page.
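
As a rough illustration of the cron job described above, a crontab entry along these lines would do it. Both paths and the file name 'googles.box' are made up for the example, not the ones actually used on this site; only the 15-minute schedule comes from the description.

```
# Hypothetical crontab entry: run the script at minutes 0, 15, 30 and 45
# of every hour and overwrite the .box file in the root content directory.
# Both paths are placeholders.
0,15,30,45 * * * * /path/to/recent_googles.sh > /path/to/gemcast/content/googles.box
```

The explicit minute list is used instead of the '*/15' shorthand because not every cron implementation understands the shorthand.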

Here is the script:

1  echo "10 Recent Googles Leading to samoht.com"
2  grep '\.google.*search' /var/apache/logs/referer_log |
3    awk '{print $2}' |
4      sed 's/http.*q=/<li>/;s/%22/"/g;s/\&.*$//' |
5        uniq |
6          tail -10
7  echo "Generated on `date`"

For those of you who aren't Unix script hackers, a line-by-line explanation is in order:

  1. Print the title of the box.
  2. Search the apache referer log for lines containing Google search URLs.
  3. Use 'awk' to extract the 2nd field, which is the URL. Given that I'm just using awk to extract a field, I could have used 'cut' as well, but I'm much more familiar with awk's syntax.
  4. Use 'sed' to replace the first bit of the URL, up to the 'q=' query string, with HTML list item markup, replace '%22' with a '"' character, and eat off everything after the query parameter. The result is the encoded query parameter, prefixed with an '<li>' HTML element, like so: "<li>this+is+the+query".
  5. Strip out adjacent duplicates ('uniq' only collapses repeated lines that sit next to each other).
  6. Only show the 10 most recent queries.
  7. Print the time that the script was run.
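
To see what the pipeline in lines 2-6 does to real input, here is a small sketch that runs it against a made-up log. The sample lines, the `sample_log` variable, and the `recent_googles` function name are assumptions for the demonstration; the only thing taken from the description above is that the referring URL is the second field.

```shell
# Hypothetical sample log: field 1 is the requested page, field 2 the referer.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
/index.html http://www.google.com/search?q=%22gemcast%22+weblog&ie=UTF-8
/index.html http://www.google.com/search?q=%22gemcast%22+weblog&ie=UTF-8
/about.html http://www.example.com/links.html
/ruby.html http://www.google.com/search?hl=en&q=ruby+shell+script&btnG=Search
EOF

# The same pipeline as lines 2-6 of the script, reading the file named in $1.
recent_googles() {
  grep '\.google.*search' "$1" |
    awk '{print $2}' |
    sed 's/http.*q=/<li>/;s/%22/"/g;s/&.*$//' |
    uniq |
    tail -10
}

result=$(recent_googles "$sample_log")
printf '%s\n' "$result"
rm -f "$sample_log"
```

The non-Google referer is dropped by grep, and the two identical searches collapse into one because they sit on adjacent lines. The same search appearing again later in the log, after a different one, would show up a second time.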

Note that lines 2-6 constitute a pipeline -- each command's result is fed into the next command as input. In reality I could have used awk to do the entire script. However, that would have required me to write a much more sophisticated awk script. I'd rather string together Unix commands that each do a single job (or a few jobs) well. To me, it makes the script more obvious, and since I know the Unix commands pretty well, I was definitely done more quickly than if I'd had to hack out and debug an awk script. Also, if I were going to write an awk script that complicated, I might as well use Ruby.
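
For comparison, here is a rough sketch of what an awk-only version of lines 2-6 might look like. It is not something that ever ran on this site; the `recent_googles_awk` name and the sample log are invented for the example, and it assumes the same two-field log layout as the pipeline.

```shell
# Hypothetical sample log: field 1 is the requested page, field 2 the referer.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
/index.html http://www.google.com/search?q=%22gemcast%22+weblog&ie=UTF-8
/index.html http://www.google.com/search?q=%22gemcast%22+weblog&ie=UTF-8
/about.html http://www.example.com/links.html
/ruby.html http://www.google.com/search?hl=en&q=ruby+shell+script&btnG=Search
EOF

recent_googles_awk() {
  awk '
    /\.google.*search/ {               # what grep did
      q = $2                           # what awk did
      sub(/http.*q=/, "<li>", q)       # the three sed substitutions
      gsub(/%22/, "\"", q)
      sub(/&.*$/, "", q)
      if (n == 0 || q != last) {       # what uniq did: skip adjacent repeats
        lines[++n] = q
        last = q
      }
    }
    END {                              # what tail -10 did
      start = (n > 10) ? n - 9 : 1
      for (i = start; i <= n; i++) print lines[i]
    }
  ' "$1"
}

result_awk=$(recent_googles_awk "$sample_log")
printf '%s\n' "$result_awk"
rm -f "$sample_log"
```

It does the same job, but the work of uniq and tail now has to be spelled out by hand with an array and a counter, which is the extra sophistication mentioned above.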

Posted: Fri Apr 25 05:29:01 -0700 2003

Thanks for visiting! Send comments to Mike Thomas.
