Regular Expressions for Analytics

Web analytics tools use regular expressions in filters, goals, searches, and more. This article is a basic refresher.

Please use our free regex tester to test your own regular expressions.

What are Regular Expressions?

Regular expressions (also known as regex) are used to find specific patterns in a list. Regex can be used to find anything that matches a certain pattern. For example, you can find all keywords that start with the phrase "replace", all pages within a subdirectory, or all pages with a query string more than ten characters long.

Regular expressions provide a powerful and flexible way to describe what the pattern should look like, using a combination of letters, numbers, and special characters.

For example, typing html into the search box in the content reports will return all URLs that contain "html" anywhere in the path. The following pages would all be returned:

  • /index.html
  • /html-definitions.php
  • /search.php?q=html+vs+php
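The "matches anywhere" behavior can be sketched in a few lines of Python. This is only an illustration: Python's re module stands in for whatever regex engine your analytics tool uses, and the list of paths (including the extra "/about.php") is a hypothetical sample.

```python
import re

# Hypothetical sample of request paths; the first three come from the list above.
urls = [
    "/index.html",
    "/html-definitions.php",
    "/search.php?q=html+vs+php",
    "/about.php",
]

pattern = re.compile("html")

# search() looks for the pattern anywhere in the string, not only at the start.
matches = [u for u in urls if pattern.search(u)]
print(matches)
```

Only "/about.php" is left out, because it is the only path that does not contain "html".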

The Escape Character: Backslash

Regular expressions use a series of special characters that carry specific meanings. This is a thorough, but not complete, list of the special characters in regex that carry a non-literal meaning.

^ $ . ? [] () + \

As an example, the question mark means "make the previous character optional" in regex. We'll show an example of this in action later in this article.

But if you want to search for a literal question mark, you need to "escape" the regex interpretation of the question mark. You accomplish this by putting a backslash just before the question mark, like this:

\?

If you want to match the period character, escape it by adding a backslash before it. For example, \.html would match a dot followed by the string "html".

If you want to match a series of special characters in a row, just escape each one individually. To match "$?", you would type \$\?.

You can escape any special character with a backslash - even the backslash! \\

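If you're unsure whether a punctuation character is a special character or not, you can escape it without any negative consequences. Don't escape letters or digits, though: in many regex engines a backslash gives them a special meaning of their own (\d, for example, matches any digit).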
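Here is a small sketch of escaping in practice, using Python's re module as a stand-in for your tool's regex engine. The sample URL is made up for illustration.

```python
import re

url = "/search.php?q=html"  # hypothetical sample URL

# Escaped: "\." matches a literal period and "\?" a literal question mark.
literal = re.search(r"\.php\?q=", url)

# Unescaped: "?" makes the second "p" optional, so this pattern looks for
# ".phq=" or ".phpq=" and never sees the real question mark.
unescaped = re.search(r"\.php?q=", url)

# re.escape() adds the backslashes to a literal string for you.
escaped_text = re.escape("$?")

print(literal is not None, unescaped is None, escaped_text)
```

The escaped pattern finds ".php?q=" in the URL, while the unescaped one finds nothing.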

Anchors: Caret and Dollar

Regular expressions match the pattern you specify if they occur anywhere in the string--beginning, middle or end. There are anchors you can use in regex to specify that a pattern should only occur at the beginning or end. The anchor characters are:

^ $

Use the caret symbol (^) to anchor a pattern to the beginning. Use a dollar sign ($) to anchor a pattern to the end. You can use either or both in a single expression:

^/page will match "/pages.html", "/page/site.php" and "/page". It won't match "/site/page" or "/pag/es.html".

html$ will match "/index.html", "/content/site.html" and "/html", but not "/html/page.php", "/index.htm" or "/index.html?q=html+vs+php".

^car$ will only match "car" and ^$ will match only empty strings.

$/google.php^ won't match anything because it's bad regex - the caret should always be to the left of the dollar: ^/google\.php$ (with the period escaped so that it only matches a literal dot).
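The anchor examples above can be checked with a short Python sketch. The helper name matching is hypothetical, and Python's re module is only a stand-in for your analytics tool's regex engine.

```python
import re

def matching(pattern, candidates):
    """Return the candidates in which the pattern is found."""
    return [c for c in candidates if re.search(pattern, c)]

# ^ anchors the pattern to the beginning of the string.
starts = matching(r"^/page", [
    "/pages.html", "/page/site.php", "/page", "/site/page", "/pag/es.html",
])

# $ anchors the pattern to the end of the string.
ends = matching(r"html$", [
    "/index.html", "/content/site.html", "/html",
    "/html/page.php", "/index.htm", "/index.html?q=html+vs+php",
])

# Both anchors together force the whole string to match.
exact = matching(r"^car$", ["car", "cars", "scar"])

print(starts, ends, exact)
```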

Ranges of Characters

Regex can also be used to match ranges or combinations of characters. Square brackets allow you to specify a variety of characters that can appear in a certain position in the string.

For example, [eio] would match either "e", "i" or "o".

You can include a long list of characters in square brackets, but it's easier to match a range of characters with a hyphen. For example:

[a-z] will match any lowercase letter from a to z.

[a-zA-Z0-9] will match any lowercase letter, uppercase letter, or number.

[a-dX-Z] will match a, b, c, d, X, Y, or Z.

Square brackets look at each individual character, not whole words.

[word] matches a single occurrence of "w", "o", "r" or "d".

To match a string of characters, enclose them in parentheses and use a pipe (|) as an "or" character. For example, to match an instance of "cat" or "dog", you would type:

(cat)|(dog) OR (cat|dog).

Finally, use a period to match any character. It's like a wildcard for a single character:

car.s will match "carrs", "car?s", "car5s", etc.
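As a rough illustration of square brackets, the pipe, and the period, here is a Python sketch. It uses Python's re module as a stand-in for your tool's engine, and the test strings are made up.

```python
import re

# Square brackets match ONE character from the set.
single_chars = re.findall(r"[eio]", "regex is fun")

# A range inside brackets: a-d or X-Z.
in_range = [c for c in "adXZez" if re.fullmatch(r"[a-dX-Z]", c)]

# [word] matches a single "w", "o", "r" or "d", never the whole word.
whole_word = re.fullmatch(r"[word]", "word")

# Parentheses plus a pipe match whole alternatives.
animals = re.findall(r"(cat|dog)", "catalog of dogs")

# A period stands in for any single character.
wildcard = [s for s in ["carrs", "car?s", "car5s", "cars"]
            if re.fullmatch(r"car.s", s)]

print(single_chars, in_range, whole_word, animals, wildcard)
```

Note that "cars" is not matched by car.s: the period must consume exactly one character between "car" and "s".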

Repeating Patterns

With regex, you can even specify the number of times a pattern should occur.

A question mark after a character will match zero or one occurrence of the character. This makes the character optional:

aa?pple will match "aapple" or "apple".

A plus sign matches one or more occurrences.

a+ will match "a", "aa", "aaaaaaaaaa", etc.

An asterisk matches zero or more of the previous character. Combined with a period, ".*" is commonly used as a wildcard because it matches anything.

.* will match any string, including an empty one.

Curly brackets allow you to match a specific range of occurrences. You specify the minimum and maximum number of occurrences.

ca{3,5}t will match "caaat", "caaaat", "caaaaat", but not "cat" or "caaaaaaaaat".
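The repetition operators can be tried out with a short Python sketch. The helper name full is hypothetical; it uses re.fullmatch so that the whole string has to fit the pattern, which makes the effect of each quantifier easier to see.

```python
import re

def full(pattern, text):
    """True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? : zero or one of the previous character
optional = [s for s in ["apple", "aapple", "aaapple", "pple"] if full(r"aa?pple", s)]

# + : one or more of the previous character
one_or_more = [s for s in ["", "a", "aaaa"] if full(r"a+", s)]

# * : zero or more, so ".*" accepts even the empty string
anything = [s for s in ["", "anything at all"] if full(r".*", s)]

# {min,max} : a bounded number of occurrences
bounded = [s for s in ["cat", "caaat", "caaaat", "caaaaat", "caaaaaaaaat"]
           if full(r"ca{3,5}t", s)]

print(optional, one_or_more, anything, bounded)
```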

Next Steps

Use our free regex tester to test your own regular expressions.

Trouble Activating License

Urchin was canceled in 2012 and it's become extremely difficult to activate new Urchin installations. Plus, the newest version of Urchin (v7.2) is unaware of newer browsers & OS platforms, and the geolocation db is woefully outdated.

If you are having licensing issues with an active Urchin installation, you should consider migrating to Angelfish Software.


Here's how to

No Internet Connection


If you need to activate an Urchin License, but your server does not have an internet connection, you will need to contact Google to have a license manually generated for you.

  1. Log into your MySQL or Postgres Urchin database.
  2. Execute the following command: update uglobals set ucgl_serial='<YOUR SERIAL>';
  3. Execute the inspector from urchin/util and save its output to a file.

With Internet Connection


If your Urchin server is able to contact the Internet but you are having trouble activating an Urchin serial code, it's likely that the serial code is locked. Serial codes are single use only, so this typically comes up when:

  • The Urchin server crashed and Urchin is being reinstalled.
  • The Urchin server was decommissioned and Urchin is being installed on a new server.

Post-2012, Google was the only company that could reset Urchin serial codes and they said they'd reset them for 3 years. So if your Urchin installation doesn't work, the only way we know of that you can revive your data is to migrate to Angelfish Software.

Reset Urchin Admin Password

Occasionally, the password for the Administrator account (user name: admin) gets lost or forgotten. To reset the Urchin 6 Administrator password to 'urchinadmin' (without quotes), use the SQL statement below.

update uusers set ucus_password='USCX|f7a3ffae66ce965865eb4568e9a9271f' where ucus_name='admin';

After logging in with the new password, it may be changed in the admin settings section.

Remove a Stuck or Pending Task

Symptoms

  • A task is "stuck" - it doesn't appear to be doing anything more and fails to complete or error, OR
  • A task never changes from a "Pending" status to "Running", OR
  • A task is running fine, but needs to be stopped for one reason or another.

Clearing the task

  1. Stop the Urchin web services
    • /path/to/urchin/bin/urchinctl stop, OR
    • Start >> Program Files >> Urchin 6 >> Disable Urchin Services
  2. Log into the Urchin mysql database
    From Command Line:
    mysql -u <user> -p<Password> <urchin database>

  3. Remove the appropriate task from the uprofiles_queue table. In most cases, you'll want to clear out all tasks. Use the following command to do so:
    delete from uprofiles_queue;

  4. Navigate to /path/to/urchin/data/reports/<account>/<profile> and delete the lock.udb file if it exists. Be sure to replace <account> and <profile> with their respective values.
  5. Restart Urchin services
    • /path/to/urchin/bin/urchinctl start, OR
    • Start >> Program Files >> Urchin 6 >> Enable Urchin Services

Urchin Web Hosting

If you do not wish to host Urchin 6 yourself, there are other web hosting options available. You may wish to have Urchin hosted by a hosting provider and manage it remotely, use an Urchin service provided by a host, or use the Managed Urchin solution provided by Actual Metrics. Whatever you decide on, it's important to be certain your Urchin installation is allotted the proper resources and given the appropriate maintenance attention to ensure your data is always available when you need it.

Web Hosting Solutions

Many different managed hosting solutions are available from a number of hosting providers. In most cases, you may be given space to host the application, but will need to manage it yourself. Such solutions include:

Dedicated Hosting - A hosting solution where the client leases an entire server that is not shared with anyone else. Typically, the hosting provider then gives full control of the dedicated server to the client.
Shared Hosting - A solution where your site is installed on a separate partition on a server shared with other clients. Other clients do not have access to your partition, but all sites share the server's resources.

Some hosting companies offer Urchin as part of a hosting package. Once Urchin is set up in some type of managed hosting environment, it is then your responsibility to ensure it runs properly and provides reports on time.

How to Move Urchin Software to a New Server

In the event that Urchin needs to be moved from one server to another, all the configuration and report data needs to be migrated. Follow the steps below to ensure that everything gets moved so the new server works smoothly:

  1. Move Urchin Configuration Database
  2. Install Urchin on the new server
  3. Move report data
  4. Move custom files and templates
  5. Modify custom settings
  6. Verify new setup
  7. Request Urchin Serial Reset

1. Move Urchin Configuration Database

Back up your existing Urchin configuration database using one of the following methods:

MySQL
mysqldump -u {username} -p {database} > urchin_bak.sql


PostgreSQL
pg_dump -p {database port} -U {username} {database} > urchin_bak.sql

Next, create the Urchin database on the new server. Finally, import the configuration database to the new server.

MySQL
mysql -u {username} -p {database} < urchin_bak.sql


PostgreSQL
psql -h {host} -p {port} -U {username} {database} < urchin_bak.sql

2. Install Urchin on the new server

Make sure to uncheck "Delete and recreate Urchin specific database tables if they exist" during a Windows installation, or answer NO to "Would you like to initialize the configuration database during install?" for Linux installations.

3. Move Report Data

By default, the Urchin report data files are stored under urchin/data. If this is the case with your installation, copy all the contents from old/server/urchin/data to new/server/urchin/data. If your data files are stored in a custom location, you may copy them wherever you like, but be sure to follow step 5.

4. Move custom files and templates

Any customizations made in Urchin are generally kept in the urchin/lib/custom directory. Make a backup of this entire 'custom' directory and restore it in urchin/lib/custom on the new Urchin server.

5. Modify custom settings

If any changes were made to the following files, be sure to replicate the customizations on the new Urchin server:

  • /etc/session.conf
  • /etc/urchin.conf
  • /var/urchinwebd.conf

6. Verify new setup

Run urchin/util/inspector to check the new Urchin 6 setup. If you encounter any permission errors, run inspector -r.

7. Request Urchin Serial Reset

Once the serial has been reset, start the Urchin services (urchinctl start), navigate to the Urchin interface, and follow the steps to reactivate your license.

Urchin Software Discontinued

In January of 2012, Google announced development of Urchin Software would be discontinued. We were disappointed to hear the news, although we can't say we were overly surprised. Google's focus on Urchin dwindled in 2011, coinciding with the launch of a paid version of Google Analytics.

We've been contacted by a variety of customers since the announcement - the most popular questions and answers are below. Feel free to contact us if you'd like us to clarify anything.


Will my Urchin installation stop working?

The good news is that Urchin isn't going to turn into a pumpkin at midnight. If you're currently using Urchin, it will continue to work indefinitely. Google will also maintain the licensing server for the indefinite future. This means that even though the product won't be updated, it will keep running. And if you want to move to another product, you have ample time to make your selection.

Can I still get support for Urchin?

Unfortunately...it's been more than 6 years since Google canceled Urchin, so you're pretty much on your own if anything breaks.

Why did Google decide to retire Urchin?

Google began making efforts to put more wood behind fewer arrows in 2011, and Urchin seems to be a casualty of this initiative. We understand the decision but there's still a market for on-premises web analytics software.


Is there an alternative to Urchin?

Yes! Angelfish Software allows you to migrate config and report data from Urchin to Angelfish, and provides a bunch of new features like nested segments, broken link reports, bulk update utilities, date ranges as small as one second, and a long list of other improvements. Plus, Angelfish has an up-to-date list of browsers, platforms, and geolocation info.

Learn more about Angelfish Web Analytics Software

Differences between Urchin and Google Analytics

A few times a month, we're asked "why are my stats different between Urchin and Google Analytics?" If Urchin and GA are looking at the exact same data and there's a numbers discrepancy, 95% of the time it's caused by the visitor tracking method in use.

Urchin 6 uses two different visitor tracking methods: UTM and IP+UA. The UTM method is by far the most accurate -- you tag each page of your site with a javascript file, which then assigns a unique ID to every visitor to your site and creates a special gif request for each single pageview. But if you can't use cookies or tracking code, or have months/years of historical log files to process, IP+UA is your solution.

In order to understand IP+UA, have a look at a single hit in a web server log file (below):

68.166.201.241 [01/Jan/2005:12:43:20 -0800] "GET /milw.html HTTP/1.1" 200 678 "http://home.earthlink.net/~milwaukeeroadcoastdivisiondvd/id7.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

For the above hit, the fields (L to R) are:
client-ip, datetime, cs-request, sc-status, sc-bytes, cs-referral, cs-useragent
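To make the field layout concrete, here is a minimal Python sketch that splits the sample hit into the seven fields listed above. The regular expression and the underscore-style group names are assumptions written for this one log format; this is not how Urchin itself parses logs.

```python
import re

# One named group per field, in the left-to-right order given above.
HIT = re.compile(
    r'^(?P<client_ip>\S+) '
    r'\[(?P<datetime>[^\]]+)\] '
    r'"(?P<cs_request>[^"]*)" '
    r'(?P<sc_status>\d{3}) '
    r'(?P<sc_bytes>\d+|-) '
    r'"(?P<cs_referral>[^"]*)" '
    r'"(?P<cs_useragent>[^"]*)"$'
)

line = (
    '68.166.201.241 [01/Jan/2005:12:43:20 -0800] "GET /milw.html HTTP/1.1" 200 678 '
    '"http://home.earthlink.net/~milwaukeeroadcoastdivisiondvd/id7.html" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"'
)

fields = HIT.match(line).groupdict()

# IP+UA tracking only needs these two fields.
visitor_key = (fields["client_ip"], fields["cs_useragent"])
print(visitor_key)
```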

With the IP+UA tracking method, Urchin picks out the client IP address and user agent (client-ip and cs-useragent) from each hit in the log file and uses the info to calculate a visitor session. IP+UA is easily foiled by proxy servers, firewalls, NAT devices, and the like -- if you have 300 visitors from behind a corporate firewall who appear to use the same IP address and user agent, your numbers usually won't match up with the true stats. This means the IP+UA tracking method tends to overstate traffic by a wide margin. A ~30% increase (compared to UTM) is average, although we've seen it as high as 400%! That said, you can use an exclude filter to "trim the fat" and get rid of traffic from robots and spiders -- doing this usually helps bring the numbers more in line with UTM metrics.
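The idea behind IP+UA sessionization can be sketched as follows. This is a simplified illustration, not Urchin's actual algorithm: the function name count_sessions, the 30-minute inactivity timeout, and the sample hits (including the 10.0.0.7 address) are all assumptions.

```python
from datetime import datetime, timedelta

# Assumed inactivity window; not a documented Urchin value.
SESSION_TIMEOUT = timedelta(minutes=30)

def count_sessions(hits):
    """Count sessions from (client_ip, user_agent, timestamp) tuples."""
    last_seen = {}
    sessions = 0
    for ip, ua, ts in sorted(hits, key=lambda h: h[2]):
        key = (ip, ua)  # the only "identity" IP+UA tracking has
        previous = last_seen.get(key)
        if previous is None or ts - previous > SESSION_TIMEOUT:
            sessions += 1  # new visitor key, or the old session timed out
        last_seen[key] = ts
    return sessions

t0 = datetime(2005, 1, 1, 12, 43, 20)
ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
hits = [
    ("68.166.201.241", ua, t0),
    ("68.166.201.241", ua, t0 + timedelta(minutes=5)),  # same key, same session
    ("68.166.201.241", ua, t0 + timedelta(hours=2)),    # same key, new session
    ("10.0.0.7", ua, t0 + timedelta(minutes=1)),        # different IP, new session
]

session_count = count_sessions(hits)
print(session_count)
```

Because the key is only the IP address and user agent, two different people behind the same proxy with the same browser are indistinguishable to this method.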

The UTM tracking method is used by both Urchin and Google Analytics, but only Urchin can use the IP+UA method.

Why is visit count in Top Content different?

Urchin users often notice that the number of visits reported under
Marketing Optimization >> Unique Visitor Tracking >> Visits & Pageview Tracking
is different than the total number of visits in
Content Optimization >> Content Performance >> Top Content

This is because the two reports are actually reporting two separate metrics. The Visits & Pageview Tracking report is showing the total number of visits to your site, whereas the Top Content report is showing the total number of "visits" to each page on your site, which is equivalent to total unique pageviews.

Unique Pageview - the number of sessions which included the given page one or more times.

The description for the Top Content report states:

...This report shows how many visits and pageviews each page on your site received...

This is, in fact, accurate since unique pageviews are equivalent to the number of visits which included a page.

For Example:
Let's say a user visits your site and views the following pages, in succession, during their visit:

  1. /index
  2. /about
  3. /index
  4. /products
  5. /products/widget
  6. /products
  7. /index

The Visits & Pageview Tracking report would show 1 visit and 7 pageviews. Under Top Content, however, you'd see the following:

Page                     Visits  Pageviews
/index                   1       3
/products                1       2
/about                   1       1
/products/widget         1       1
Totals                   4       7

This Top Content report is showing that each page was viewed at least once in 1 visit. If one were to look at the totals row and think that there were 4 total visits to their site, they would be far off. For all intents and purposes, the visits column should always be thought of as Unique Pageviews. If you know how, I'd even recommend changing the column title.


Let's take this example one step further
A visitor comes to your site a second time and views the following pages:

  1. /index
  2. /products
  3. /products/widget
  4. /products/cool-widget
  5. /products/widget

Now, the Top Content report would look like this:

Page                     Visits  Pageviews
/index                   2       4
/products                2       3
/products/widget         2       3
/about                   1       1
/products/cool-widget    1       1
Totals                   8       12

Now, there are only 2 total visits to your site, but if one were to use the Top Content report as reference, they would see 8.
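The arithmetic behind both tables can be reproduced with a short Python sketch. It is only an illustration of the counting rule described above, not Urchin's implementation, and the variable names are made up. Each inner list is one visit from the example.

```python
from collections import Counter

visits = [
    ["/index", "/about", "/index", "/products",
     "/products/widget", "/products", "/index"],
    ["/index", "/products", "/products/widget",
     "/products/cool-widget", "/products/widget"],
]

pageviews = Counter()         # every view of a page
unique_pageviews = Counter()  # the "Visits" column in Top Content

for pages in visits:
    pageviews.update(pages)
    # A page counts once per visit, however many times it was viewed.
    unique_pageviews.update(set(pages))

total_visits = len(visits)

for page in sorted(pageviews, key=lambda p: (-unique_pageviews[p], p)):
    print(page, unique_pageviews[page], pageviews[page])
print("Totals", sum(unique_pageviews.values()), sum(pageviews.values()))
print("Actual visits:", total_visits)
```

Summing the "Visits" column gives 8, while the site only received 2 visits.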

About Urchin Software: アーチン

Prior to being acquired by Google, Urchin Software Corporation was a scrappy startup in San Diego that progressed from web hosting into software development.

Fast forward to today: the most popular web analytics tool in the world (Google Analytics) is Urchin, rebranded.


The Early Days

In 1996, Scott Crosby and Paul Muret founded a small web hosting company: Quantified Systems. Clients were billed by the amount of bandwidth a website used, so Paul cobbled together a program that would parse web server logs and figure out how much bandwidth was used per customer.

Over time, the log parser was updated and began to show things like pageviews, hits, and sessions. The log parser was also extremely fast, and the combination of speed and reporting features proved to be a hit with customers. Urchin Software was born.

After a few years and some early wins, Quantified Systems rebranded to Urchin Software Corporation. In the early 2000s the company hit some major bumps in the road, like a failed VC funding round in Sept 2001 and an expensive, unsuccessful attempt at international expansion (to Japan!). In spite of these setbacks, Urchin persevered and was acquired by Google in 2005.

Scott details the early days of Urchin in an article posted on medium.com - it's an engaging read:

Urchin Software Corp. - The unlikely origin story of Google Analytics.


Google Acquisition

In early 2005, Urchin Software Corp. offered 2 products:

  • Urchin Software - v5, downloadable, self-hosted software
  • Urchin On Demand - v6, SaaS offering, hosted by Urchin

The Urchin folks met a Google team at the Search Engine Strategies conference in 2004 - Google was interested in acquiring a web analytics company and Urchin was at the top of the list...after Omniture turned down Google's offer first. The deal went through in spring 2005, Urchin On Demand was released as "Urchin from Google" in November 2005, and it was rebranded as Google Analytics in 2006.

For a few years thereafter, Google Analytics received all the attention from development and marketing, while Urchin Software languished on the sidelines.

Finally, in 2008, Google released Urchin 6.4 - it was glorious. The new user interface looked great, the reports were useful, and legions of furious Urchin customers were satisfied.

In 2010, Google released Urchin 7 with even more features and an updated interface. The future was looking bright for Urchin. Unfortunately...


End of Days

In the summer of 2011, Google announced a new initiative: More Wood Behind Fewer Arrows

In other words, Google was going to cull its herd of products so it could focus on the best ones. Looking back, Urchin Software, Google Labs, Google Buzz, Google Reader, Google Health, and a handful of other products were casualties of this announcement.

Urchin 7 received a few more updates but Paul Muret himself wrote the final blog post:
The End of an Era for Urchin Software


Urchin Replacement

One of the fun things about Urchin is if you don't make any changes to the underlying OS or hardware, it will run forever. It's almost 2020 and we know of a handful of companies that still use Urchin 5!

But with each passing year, Urchin becomes more and more outdated.

If you still use Urchin, here are some reasons why you should replace it.