No Tagging, No ROI (Free Bulk Link Tagger Included)
Possibly the most important performance metric we all know is ROI, return on investment. Every online marketer will probably agree that understanding what your marketing spend is earning you is critical to managing marketing campaigns and budgets effectively.
What is often overlooked, however, is making sure campaigns are properly tagged and tracked. Without proper campaign tracking it becomes very difficult, and sometimes impossible, to measure ROI effectively (even at a high level).
A classic example is email campaigns. Did you know that if links are not properly tagged in your email communications, all clicks from email clients such as Outlook or Thunderbird (the most common method of viewing email) will be attributed to Direct traffic, while webmail clicks will be attributed to referral? The implications of this are huge, especially if you are investing a lot of time and money in your email channel.
Fortunately, Google has made it extremely simple to get around this by tagging your campaign links. You simply append at least three tags, called utm tags, to your links. Google even provides a tool called URL Builder to create these links.
Here is an example of a tagged link with the tag parameters in different colours:
UTM Tags, What They Mean:
utm_source (required)
This tag tells Google Analytics which site or source directed the traffic to your site. For example, this can be “april_newsletter_list22”, “twitter”, “our_blog”, “myspace” etc. You can call it whatever you like, but the key is to use the same name for the same source and to make the name relevant to the source. Best practice is simply to name the unique site where the link is placed. If you have the same email going to multiple lists, remember to indicate the list in your source.
utm_medium (required)
This tag defines what type of channel sent the traffic. You can name this medium whatever you want (banner, postcard, listing), but there are recommended naming conventions for certain channels. Again, consistency is key:
Email: email
Social media sites: social
Pay per click search: cpc
News/RSS feeds: feed
Website banners: banner
Offline ads: offline
utm_campaign (required)
This tag names the campaign the link is in aid of. Again, you can name this tag whatever you want, but be consistent in your naming conventions and in which links you tag.
utm_term (optional)
This tag is generally used for paid search campaigns to track the keyword which generated a click. It can also be used for all manner of other purposes, such as a blog author name or a page category on a site.
utm_content (optional)
This tag is generally used to distinguish between different variations of an ad or link. For example, you could have two ad banners running on a website, one with an image of a cat and another with an image of a dog. Using the utm_content tag you can see which ad variation works best for you.
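To make the tag definitions above concrete, here is a minimal Python sketch of how a tagged link can be assembled. The function name tag_link and the example.com address are made up for illustration; only the five utm parameter names come from the post.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_link(url, source, medium, campaign, term=None, content=None):
    """Append utm tags to a URL, keeping any query parameters already present."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    tags = {
        "utm_source": source,
        "utm_medium": medium,
        "utm_campaign": campaign,
        "utm_term": term,        # optional
        "utm_content": content,  # optional
    }
    # Only add the tags that were actually supplied.
    params += [(name, value) for name, value in tags.items() if value]
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical landing page and tag values.
print(tag_link("http://www.example.com/bikes",
               "april_newsletter_list22", "email", "christmas_promo"))
```

Going through urlencode rather than gluing strings together means that spaces or ampersands in a tag value cannot break the link.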
UTM Tags: The Bigger Picture
As an example, let’s look at a Christmas promotional campaign targeting bicycle sales. The campaign uses 3 channels over 5 different sites and mailers to increase sales. The diagram below shows how the structure of this campaign’s tags fits together. For all links the campaign tag will be the same, while the medium tag will be one of the three channels used. The only unique tag will be the source, which indicates where the link is located. With a structure like this you can easily view the return of your campaign, as well as drill down to see which channels and sites/mailers gave you the most return within the campaign.

REMEMBER: all campaign tags are CasE SensiTive, e.g. tag ≠ TAG
Automatic UTM Tagging
Unfortunately tagging often has to be done manually, but there are two big exceptions: Google AdWords and email. Since AdWords can be integrated with Google Analytics, it’s no surprise that AdWords has an Auto Tagging function to track your campaigns. A lesser-known fact is that most Email Service Providers can also auto tag all links in your emails. You should always check whether the tools you are currently using offer auto tagging, and make sure the automatic tagging produces tags in the format that works best for you.
Our Gift To You: Bulk UTM Tagging Tool

*This workbook has been extensively tested but the responsibility to make sure the built links work rests on the user. If you do however find any fault with the workbook, please let us know.
Google’s URL builder is definitely useful, but tagging only one link at a time can be a bit of a drag. So to help you tag and manage your campaigns more efficiently, we have created a special Bulk URL Tagging Excel sheet for free download. All you have to do is paste the required parameters into the sheet and your tagged links will be generated. You will find two sheets in the workbook. For simplicity the first sheet only covers the 3 required tags, but if you want to include the optional utm_term and utm_content tags you can find these options on the second sheet.
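For readers who prefer a script to a spreadsheet, the same bulk idea can be sketched in a few lines of Python. This is not the workbook’s logic, just an illustration under stated assumptions: the column names, the example.com links and the choice to lower-case every value (one way of avoiding the case-sensitivity trap) are all ours.

```python
import csv
import io
from urllib.parse import urlencode

REQUIRED = ("utm_source", "utm_medium", "utm_campaign")
OPTIONAL = ("utm_term", "utm_content")


def bulk_tag(rows):
    """Return one tagged link per row (a dict with a 'url' key plus utm columns)."""
    links = []
    for row in rows:
        missing = [name for name in REQUIRED if not (row.get(name) or "").strip()]
        if missing:
            raise ValueError(f"{row.get('url')}: missing {', '.join(missing)}")
        # Lower-case every value so 'Email' and 'email' are not reported separately.
        tags = {name: row[name].strip().lower()
                for name in REQUIRED + OPTIONAL
                if (row.get(name) or "").strip()}
        separator = "&" if "?" in row["url"] else "?"
        links.append(row["url"] + separator + urlencode(tags))
    return links


# Hypothetical input, standing in for rows pasted into a spreadsheet.
SHEET = """url,utm_source,utm_medium,utm_campaign,utm_content
http://www.example.com/bikes,December_Newsletter,Email,christmas_bikes,
http://www.example.com/bikes,cyclingsite.example,banner,christmas_bikes,cat_image
"""

for link in bulk_tag(csv.DictReader(io.StringIO(SHEET))):
    print(link)
```

Rows with a missing required tag raise an error rather than silently producing a half-tagged link, which is usually what you want before a campaign goes live.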
SMART Tip: UTM Tags and Link Shorteners
With all these tags, you may find that your links are getting a bit too long for your liking. This is particularly relevant for social media sharing, especially with services like Twitter. The good news is that if you use a link shortener, your tags will still be preserved when the visitor is directed to your site.
That’s all for this post, stay tuned for the next post where we look at some more advanced methods for tracking offline campaigns and overriding conversion attribution settings.
Web Analysis: The Forest For The Trees

Data represents people’s actions. *Image sourced from The Matrix.
All too often web analysis is confused with online analytics tools and the data they measure. For example, it will surprise many that Google Analytics is not in fact Web Analysis! Nor are website hits and time on site the bedrock of Web Analysis.
How has this happened? Well, one just has to look at the dashboard reports of any online analytics tool to get an idea. There you’ll find loads of data and ‘interesting’ indicators and graphs, all available at the click of a button. In fact there is so much data available, presented in such an appealing way, that it must mean something and must be important. And not knowing what it means or why it’s important makes it easy to put the data and the tools in the same box as analysis itself.
The fact of the matter, though, is that the whole of “web analysis” is not that new; only the “web” part is. As an example let’s take an ecommerce site. The purpose or goal of this site is fundamentally to sell goods and increase revenue. In fact its goal is essentially the same as that of an offline store.
Now let’s look a little closer at the task of product positioning in an offline store, in particular the task of arranging goods on a shelf. Over the years, much time and effort has been invested in optimising this seemingly simple task. Now imagine if, instead of running expensive real-world tests, marketers could measure exactly what goods consumers are looking at, picking up, and purchasing or not purchasing as they shop (and re-arrange the shelves in real time!).
Now let’s look at an online store. The end goal is exactly the same (sell goods). The shelf is now a screen. And guess what, all those key points of interaction can be recorded: where the consumer is looking (heatmaps), what they’re picking up (clicks) and finally what they purchase (ecommerce tracking). Just as in the offline store, the data still has to be tracked and interpreted, which is often trickier than it seems. But the point is that at the end of the day it’s not the “analysis” part which is new, just the “web” part and the data sets and methods of data collection it brings with it. This is a great simplification of course, but the principle is there. A web presence exists to accomplish goals, and web analysis is just a methodology for optimising online activities to achieve those goals.
So next time you look at web metrics, do try to remember that the data is ultimately just a (rough) etching of what real people did when their worlds interacted with yours, just like before the internet. And the task at hand has remained the same (although new data is now more readily available): to make sense of the data available in order to improve your world and subsequently your visitors’ worlds as well.
My e-commerce data is under reporting
In this post, I would like to highlight three fairly common implementation errors that may cause your e-commerce data to be inaccurate or incomplete.

1. Errors passing e-commerce values into your tracking script.
When passing product names and sales values into the e-commerce script, always follow the strict technical requirements closely. Two examples: you should never include currency symbols in the data you pass into the script, and you should always “escape” special characters that would otherwise break the JavaScript code. If your product names contain an apostrophe, for instance, ensure that it is “escaped” so that the apostrophe within the product name does not “break” the script and prevent the transaction from being captured.
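As an illustration of the two checks above, here is a small server-side sketch in Python. The helper names are hypothetical and it assumes prices use a period as the decimal separator; it is meant to show the idea, not to replace the requirements in Google’s documentation.

```python
import json
import re


def clean_amount(raw):
    """Strip currency symbols and thousands separators, e.g. 'R1,299.50' -> '1299.50'.

    Assumes a period is used as the decimal separator.
    """
    digits = re.sub(r"[^0-9.]", "", raw)
    return f"{float(digits):.2f}"


def js_string(text):
    """Return text as a double-quoted JavaScript string literal."""
    # json.dumps escapes double quotes, backslashes and control characters;
    # an apostrophe is harmless inside the double-quoted literal it produces.
    return json.dumps(text)


# Hypothetical product data as it might come out of a shop database.
name = "Rider's Choice 26\" Mountain Bike"
price = "R1,299.50"

print(js_string(name))
print(clean_amount(price))
```

Letting a serialiser produce the quoted string is generally safer than hand-escaping individual characters, because it also covers backslashes and line breaks that are easy to forget.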
2. You are using different versions of Google Analytics code.
Although this is not a common error, I have come across it. Most webmasters have migrated to the asynchronous version of the Google Analytics code, which was released in December 2009. If you have, you also need to update your e-commerce tracking to the asynchronous syntax. Here is some more information on asynchronous e-commerce tracking: http://code.google.com/apis/analytics/docs/tracking/gaTrackingEcommerce.html
3. You are using a different domain name or sub-domain to record the transactions.
Although you may be successfully tracking sales, if you do not implement cross-domain tracking correctly you will lose the original traffic source that generated the sale. End-to-end tracking is critical for measuring and optimising marketing campaigns. Here is some more information on Google Analytics multiple domain tracking: http://code.google.com/apis/analytics/docs/tracking/gaTrackingSite.html
There may also be other reasons for not successfully recording all transactions. It is always good to check what percent of transactions is being tracked in Google Analytics against your actual number of transactions.
If there is a significant difference between the conversions recorded in Google Analytics and your actual number of conversions, we highly recommend doing some end-to-end testing to ensure your Google Analytics implementation is set up correctly. For some simple debugging, we recommend the Google Chrome extension debug.js, which can be found here: https://chrome.google.com/webstore/detail/jnkmfdileelhofjcijamephohjechhna
Google Analytics Update to Sessions – The Impact on Metrics
There has been a recent change in the way Google Analytics calculates sessions: http://analytics.blogspot.com/2011/08/update-to-sessions-in-google-analytics.html
Some of our clients who monitor their sites with GA have seen large changes to their metrics, and inevitably questions about the perceived change in the performance of our marketing campaigns land on our desks.
The size of the effect depends on the typical user behaviour on the site, and the type of campaign strategy. We see the greatest change in sites where there is short interaction time and low page-depth interaction. The effects are also greater where the campaign is deep-linked to the site. This arises because the person searching is sent immediately to the most relevant page so his interaction in terms of navigation and time is lower.
We have observed some secondary effects of the change which affect the actual value of metrics, not just the average value per visit. As an example, consider the time-on-site metric for sites where a significant number of interactions are one page deep.
A typical scenario could be the following: a person comes to page A on the site after searching for the term “Dog”; he then goes back to the search engine, searches for the term “Cat”, lands on page B sixty seconds later and then leaves the site. Previously this would be considered one session (as the browser was not closed) and time-on-page A would be sixty seconds (page B would be zero seconds). With the new method of counting sessions, a new session starts when the person comes back on the search term “Cat”, and you end up with two visits, both with zero time-on-site. When you add all the time-on-site metrics and divide by the number of visits, not only do you get a smaller number (because of the larger number of sessions), but the actual total will be smaller as well. To illustrate, on one site we observed a 25% increase in session count but a 50% drop in time-per-session.
A similar effect can be seen if you use goal values to monitor some events. Only one goal is recorded per session, so if you have goals which can be achieved more than once in a session you might find that the actual number of reported goal completions has increased, as you now have more sessions.
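The “Dog”/“Cat” scenario can be replayed with a toy model. The Python sketch below is a deliberate simplification and not Google’s implementation: it only knows a 30-minute inactivity timeout and an optional “new session when the source changes” rule, and it measures time-on-site as the gap between the first and last hit of a session.

```python
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session


def sessionize(hits, split_on_source_change):
    """Group (timestamp, page, source) hits into sessions."""
    sessions = []
    for hit in hits:
        timestamp, _page, source = hit
        if sessions:
            last_timestamp, _last_page, last_source = sessions[-1][-1]
            timed_out = timestamp - last_timestamp > SESSION_TIMEOUT
            new_source = split_on_source_change and source != last_source
            if not (timed_out or new_source):
                sessions[-1].append(hit)
                continue
        sessions.append([hit])
    return sessions


def time_on_site(session):
    """Time between first and last hit; a single-hit session measures as zero."""
    return session[-1][0] - session[0][0]


# The scenario from the text: page A via "Dog", page B via "Cat" sixty seconds later.
hits = [(0, "page A", "search: dog"), (60, "page B", "search: cat")]

for label, split in (("old", False), ("new", True)):
    sessions = sessionize(hits, split)
    total = sum(time_on_site(s) for s in sessions)
    print(label, len(sessions), "session(s), total time", total, "s")
```

Under the old rule the two hits form one session with sixty seconds on site; under the new rule they become two single-hit sessions, so the total time drops to zero while the session count doubles.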
Google Visualization API and African Governance
“And now for something completely different”, as they say in Monty Python. In this post we diverge a bit from search to look at the graphics on offer through the Google Visualization API. It provides a nice tool for visualizing any multi-dimensional data set Hans Rosling-style, and it will be familiar to those of you who have used bubble charts in Google Analytics. Below is an example of such a graph created using the googleVis package in R.
I am passionate about African development, so when I was scouting for an example data set the Mo Ibrahim Foundation was an obvious place to start. The Mo Ibrahim Index measures the delivery of public goods and services to citizens by government and non-state entities. It uses indicators across four main categories (Safety and Rule of Law; Participation and Human Rights; Sustainable Economic Opportunity; and Human Development) as proxies for the quality of the processes and outcomes of governance. It is the most comprehensive collection of qualitative and quantitative data that assess governance in Africa, and it is funded and led by an African institution. This data is also very much in the spirit of the data you can view through the Google Public Data Explorer, where there is currently mostly European and US data. I thought that adding a bit of an African flavour would be nice.
In the initial chart setup I label the five African countries that obtained the five highest rankings according to the index. They are, in decreasing order, Mauritius, Seychelles, Botswana, Cape Verde and South Africa. The overall ranking is plotted against the Infrastructure Index, while the Public Management and Accountability & Corruption Indexes are represented by the colour and size of the dots, respectively. When the chart is played over the period 2002 to 2010, the following trends become clear for these 5 countries:
- Seychelles and Mauritius have made great strides in improving infrastructure over the last decade
- Botswana has lost considerable ground in terms of public management
- South Africa made good gains in infrastructure in the initial part of the period, but lost some momentum thereafter; the effect of the FIFA World Cup on infrastructure development in the 2011 index will be interesting to monitor
- It is also interesting to note that Gambia made good strides in the overall index, rising to a position as high as 9th; a deterioration in accountability and corruption seems to have reversed that progress in the later part of the period.
Enjoy spotting some of your own trends on African governance. Hans Rosling eat your heart out! Note that the chart may not display correctly on all mobile devices.
Below is the R code I used to generate the above chart.
library(ggplot2)
library(googleVis)
library(Hmisc)

# Data can be downloaded from www.moibrahimfoundation.org
MoIbrahim <- read.csv("MO Ibrahim Trends Summary.csv") # read data

# Generate Motion Chart
# Specify initial chart set-up via options parameter
# Setting initial state via state string obtained from Settings panel of the initialised chart
MoIbrahimMotion <- gvisMotionChart(MoIbrahim, idvar="COUNTRY", timevar="YEAR",
  options=list(state='{"orderedByX":false,"sizeOption":"6","dimensions":{"iconDimensions":["dim0"]},
    "orderedByY":false,"xAxisOption":"13","time":"2002","yZoomedDataMax":53,"iconType":"BUBBLE",
    "xLambda":1,"yZoomedDataMin":1,"xZoomedDataMin":0.26,"yZoomedIn":false,"xZoomedIn":false,
    "iconKeySettings":[{"LabelY":-173,"LabelX":22,"key":{"dim0":"Mauritius"}},{"LabelY":-169,
    "LabelX":-129,"key":{"dim0":"South Africa"}},{"key":{"dim0":"Seychelles"}},{"LabelY":-84,
    "LabelX":-165,"key":{"dim0":"Cape Verde"}},{"LabelY":-241,"LabelX":-18,"key":{"dim0":"Botswana"}}],
    "showTrails":false,"nonSelectedAlpha":0.4,"duration":{"multiplier":1,"timeUnit":"Y"},"yLambda":1,
    "yAxisOption":"3","xZoomedDataMax":78.01,"uniColorForNonSelected":false,"colorOption":"11",
    "playDuration":15000}'))
plot(MoIbrahimMotion)

# Create chart to embed in WordPress
# Copy the Source Code in this file to your web page/WordPress etc.
cat(MoIbrahimMotion$html$chart, file="temp.html")
Google Places and SERPs – a first look
On the 27th of October 2010 Google modified its search results pages (SERPs) for search queries that show an element of local intent. So far I’ve seen two different layouts: one that includes the address and logo of the local business (screenshot 1) and one that doesn’t (screenshot 2). I’m guessing the layout shown depends on the quality of the business listing that Google is displaying, as the screenshot Google uses in the official announcement (see here) looks even fancier.
So what’s the big deal? Well depending on your angle, there are different ways of looking at this:
If you’re into SEO you will probably not welcome this change, since it will become even harder to be seen by potential customers, no matter how high you rank. The opportunity to rank in the local listings is limited, and I doubt users will browse local listings as extensively as they do traditional search results (which isn’t a lot in any case).
As a local business owner you should get your business on the map ASAP, literally. Google is placing a lot of effort into local search and with Marissa Mayer at the helm we are bound to see many more changes. The prominence of these local results will wake many small businesses from their slumber and encourage them to create a listing in Google Places, vastly improving the information Google has on them.
But what’s the effect on search advertisers? Firstly, I’m surprised Google is displaying a map above the sponsored links in the right column. The immediate effect is that ads in that position become a lot less prominent, which will surely affect click-through rates, so I expect click volumes to drop quite a bit for those ads. I’m not too concerned about the impact on Quality Score (QS), as I’m sure Google will adjust for that when assessing ad performance. But I am slightly surprised that when you scroll down the page, the ads disappear underneath the map, making them invisible (and unclickable) to users (screenshot 3). Since this does seem to reduce the opportunity for Google to monetize search ads, I wonder if it is an oversight or an intentional feature. Then again, 3 days ago Google announced Boost (Beta), which basically enables advertisers to have paid ads on Google Maps (and local results, I imagine) (read more about Google Boost here).
Until now, business listings have always been free, but it seems those days will end very shortly. The big difference from AdWords is that with Boost, Google will determine which search queries are relevant to your business, i.e. it’s keywordless advertising. From the Boost announcement:
“our system automatically sets up your ad campaign – figuring out the relevant keywords that will trigger your ad to appear on Google and Google Maps, and how to get the most out of the budget you allotted”
It’s a clever way to get many more advertisers on board without troubling them with the complexity of PPC campaigns. But it does raise the question of control over one’s advertising budget.
Finally, as a Google user, I’m not sure I really needed the additional focus on local results. Local results have been appearing as part of universal search for years, and they’ve always been more than sufficient for me. My main comment for now is that these local results aren’t necessarily that relevant. In the first screenshot my search query ‘car hire cape town’ mainly showed small local car rental companies that I neither know nor necessarily trust. I do trust the big brands who have a global presence, but they don’t appear in the local results. The same is true for my other query, ‘flights sao paulo’: it returns a map and addresses of some airline-related businesses, but what I’m actually looking for is a website where I can book a flight. To be fair to Google, the local results did not appear when I modified my query slightly to ‘flights to sao paulo’, but I guess the lesson here is that just because I enter a location, I don’t necessarily want a local result.
As per usual, we’ll measure the quantitative impact of these changes over the coming weeks and report on the effects on paid search campaigns once we have more data.
The Paid Search Impact of Google Instant – Some Initial Data
The arrival of Google Instant created great excitement in the search community. There has been a lot of speculation about the impact that Google Instant will have on paid and natural search. Will it help or hurt the long tail? What will its impact be on conversion rates? How will click-through dynamics change on the page? Will big brands benefit from an increased traffic share? Some initial data started surfacing at SMX East and on some blogs. The consensus seems to be that there are no dramatic changes as yet. Here we present our own data on the paid search impact we have observed so far.
IMPRESSIONS AND AVERAGE CPC
We looked at data for a set of our US retail clients and compared metrics after the launch of Google Instant on September 8th, up to October 6th, against a period of similar length prior to the launch. In figure 1 we show the distribution of percentage changes in overall impressions and average CPCs before and after the launch of Google Instant at a campaign level. We focus on a reasonably large set of campaigns where the overall spend has stayed fairly stable over the period. In each boxplot the horizontal black line represents the median change for each metric. An earlier post explains how to interpret the boxplots in figure 2. For impressions there was a median 2.9% decrease after the launch of Google Instant, and for CPCs a median 3.2% increase. Given the variability in the data, it would be very premature to get too excited about the changes we see below. The data really suggests that nothing has changed significantly in terms of impression volumes and CPCs. A change may still come as people change their search behavior over a longer period of time and become familiar with Google Instant.

Figure 1: Changes in impressions and average CPCs before and after the launch of Google Instant.
THE EFFECT ON THE KEYWORD TAIL
Many pundits argued that the arrival of Google Instant would spell the end for the long tail. A searcher may start a search with the intention of using a long-tail search term but end up getting good enough results after typing only part of the originally intended query. Others argue that most people would finish typing their original query anyway. We analyzed a set of over 20,000 paid search keywords across several US advertisers to see whether there is any drop in the revenue contribution from tail terms. We compare the 3 weeks before Google Instant to the 3 weeks thereafter. There were 22,031 keywords with an impression in the 3 weeks before Google Instant. This dropped to 20,911 keywords after Google Instant. This is not significant, as this level of variation was not unexpected for such a large keyword set even before Google Instant. What we are more interested in is whether there has been a shift in the contribution to total revenue between head and tail terms. In figure 2 we rank all keywords into percentiles according to their traffic contribution, starting with the highest-traffic keywords on the left. As an example, we see that the top 10% of keywords in terms of traffic account for about 80% of total revenue. We can consider these as head terms. Total revenue refers to click date revenue only, as there will be very little data with full cookie revenue. If there had been a substantial shift in traffic towards the head keywords after Google Instant, we would have expected the red line in figure 2 to have shifted above the black curve on the left of the plot; however, it is clear that there has been very little change in the relative contribution of head terms and tail terms. This suggests that we are not yet seeing a dramatic decrease in the contribution of tail terms to total revenue.

Figure 2: Revenue contribution by keyword click percentiles.
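The curve in figure 2 is a cumulative revenue share by traffic percentile. A minimal Python sketch of that calculation follows; the keyword figures are invented for illustration and are not our client data, and the function name is hypothetical.

```python
def cumulative_revenue_share(keywords, percentiles=(10, 50, 100)):
    """keywords: list of (clicks, revenue) pairs.

    Returns {percentile: share of total revenue earned by the top
    percentile of keywords when ranked by clicks}.
    """
    ranked = sorted(keywords, key=lambda kw: kw[0], reverse=True)  # highest traffic first
    total = sum(revenue for _clicks, revenue in ranked)
    shares = {}
    for pct in percentiles:
        cutoff = max(1, round(len(ranked) * pct / 100))
        shares[pct] = sum(revenue for _clicks, revenue in ranked[:cutoff]) / total
    return shares


# Hypothetical keyword set: (clicks, revenue).
before = [(1000, 800), (120, 50), (90, 40), (60, 30), (40, 25),
          (30, 20), (20, 15), (10, 10), (5, 6), (2, 4)]

for pct, share in cumulative_revenue_share(before).items():
    print(f"top {pct}% of keywords: {share:.1%} of revenue")
```

Running the same function on the keyword sets before and after a change, and comparing the shares at each percentile, gives the two curves that are overlaid in a plot like figure 2.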
CHANGING CLICK-THROUGH DYNAMICS
An interesting potential impact of Google Instant is how it will affect the click-through dynamics on the first and subsequent pages of search results. There are some theories suggesting that relatively more clicks may occur on higher positions than before. The data for a group of our US clients in figure 3 suggests that there may have been a subtle shift in click-through rates after Google Instant. We plot the square root of the click-through rate in figure 3, as it results in better visualization given the heavily skewed click-through rate distributions. We focus on a subset of keywords with at least 1000 impressions in each of the 4-week periods before and after the launch of Google Instant. The median click-through rate (represented by the black dots) seems to have shifted slightly higher at the top positions and slightly lower at the lower positions.
Subsequently, we fit a logistic regression model that enables us to model the click-through rate as a function of position and match type before and after Google Instant. The regression fits are shown in figure 4 for 2 match types. They also suggest that there has been a subtle increase in click-through rates after Google Instant at the higher positions and a subtle decrease at lower positions. In order to evaluate the statistical significance of these changes in the light of the inherent variability in the data, we fitted another regression model for three position ranges: top (positions 1-2), middle (positions 3-6) and bottom (positions 7-12) instead of individual positions. We then compute click-through odds ratios (and their 95% confidence intervals) comparing the odds of a click before and after Google Instant for a specific position range. For a refresher on click-through odds refer to this earlier post. The results are summarized in table 1 below. If there has not been a significant change in the odds of a click, we would expect the confidence interval to include 1. Table 1 suggests that the odds of a click on the top positions have increased by about 6.6% after Google Instant, while the odds of a click at the lower positions have decreased by about 13.1%. The change for the middle positions is much smaller and barely significant, as reflected by the fact that the corresponding confidence interval almost includes 1. The lower positions here also include some page 2 positions, which may suggest that Google Instant is reducing the importance of page 2 results as well. The relatively thin data on second page positions prevent confident inference about the effect on page 2 at this stage.
Figure 3: A comparison of click-through rates for top and bottom positions before and after Google Instant.

Figure 4: Modeled click-through rates by position pre- and post-Google Instant

Table 1: Click-through odds ratios pre- and post-Google Instant from logistic regression fits
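To show what an odds ratio and its confidence interval look like in practice, here is a simplified Python sketch. It is not the model behind table 1: it ignores match type and works directly from a 2x2 table of clicks and non-clicks for one position range, which is what a logistic regression with a single before/after predictor reduces to. The click and impression counts are made up.

```python
import math


def odds_ratio_ci(clicks_before, impressions_before,
                  clicks_after, impressions_after, z=1.96):
    """Odds ratio (after vs before) with a Wald 95% confidence interval."""
    a, b = clicks_after, impressions_after - clicks_after      # clicked / not clicked, after
    c, d = clicks_before, impressions_before - clicks_before   # clicked / not clicked, before
    ratio = (a / b) / (c / d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(odds ratio)
    low = math.exp(math.log(ratio) - z * se_log)
    high = math.exp(math.log(ratio) + z * se_log)
    return ratio, low, high


# Hypothetical counts for one position range.
ratio, low, high = odds_ratio_ci(5000, 100000, 5300, 100000)
print(f"odds ratio {ratio:.3f}, 95% CI {low:.3f} to {high:.3f}")
print("significant change" if low > 1 or high < 1 else "no significant change")
```

The final check mirrors the rule of thumb in the text: the change is only treated as significant when the interval excludes 1.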
SUMMARY
Below we summarize our findings:
- There does not seem to be a significant shift in overall impressions and CPCs at this stage.
- Our data does not provide any evidence that we should start preparing for the funeral of the long tail of paid search just yet.
- A subtle shift seems to be happening in terms of click-through dynamics on the first page (and possibly the second page). There seems to be a slight increase in click-through for higher positions and slight decrease for lower positions, resulting in relatively more traffic from higher positions than before. If this results in higher competition amongst advertisers for higher positions in order to maintain volume, it is certainly not something Google will be too unhappy about. The best strategy remains to bid traffic to its estimated value rather than focusing too much on position.
- More time is needed to see how Google Instant affects longer-term search behaviour, as people become more familiar with it.
- Google Instant could also affect conversion rates, which was not investigated here. We will investigate this once we have gathered some more conversion data.
A request to Google: enable us to pay a fair price for search partner traffic
We would like to add our voice to those who have been pleading with Google for some time to give us the ability to differentiate bids by search partner on Google’s syndicate network. Industry leaders such as George Mitchie have in the past presented data illustrating that the quality of traffic coming from the Google search engine itself is significantly higher than that of traffic coming from search partners. Here we present some of our own data to reinforce the fact that the traffic we get via search partners is (with some exceptions) of lower quality than the traffic we get directly from Google’s search engine. Additionally, we will show that there are geographical differences in the value of traffic and that the value of brand and non-brand traffic can potentially be quite different for search partners. We do not dispute the additional value of traffic from the syndication partners; we just want the ability to pay the right price for that traffic based on its inherent value.
The proportion of Google search traffic that comes from syndicate partners varies quite a bit from client to client. On average, we see about 20-25% of traffic coming from search partners for US clients, while it tends to be closer to 10% for Australasian clients. Data for collections of US and Australasian clients show that search partner traffic is generally of considerably lower quality. In figures 1 and 2 we sort referring domains by click volume across a set of US and Australasian clients, respectively. We then calculate the ratio of each domain’s average conversion rate to the overall average conversion rate for the client. This comparative conversion rate data reveals the differences between Google and other referring domains. The color of the bar labels distinguishes between 3 groups of domains: red (conversion rate at least 10% below overall conversion rate), green (conversion rate at least 10% above overall conversion rate) and black (conversion rate within 10% of overall conversion rate).

Figure 1: Relative traffic value by search partner for a selection of US clients

Figure 2: Relative traffic value by search partner for a selection of AUS clients
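The relative conversion rates and colour groups shown in figures 1 and 2 can be computed along the following lines. This Python sketch uses invented domains and numbers, and it assumes the overall conversion rate is pooled across all domains (total conversions divided by total clicks).

```python
def classify_domains(stats, band=0.10):
    """stats: {domain: (clicks, conversions)}. Returns {domain: (ratio, colour)}."""
    total_clicks = sum(clicks for clicks, _conv in stats.values())
    total_conversions = sum(conv for _clicks, conv in stats.values())
    overall_rate = total_conversions / total_clicks  # pooled rate (an assumption)
    result = {}
    for domain, (clicks, conversions) in stats.items():
        ratio = (conversions / clicks) / overall_rate
        if ratio >= 1 + band:
            colour = "green"   # at least 10% above the overall rate
        elif ratio <= 1 - band:
            colour = "red"     # at least 10% below the overall rate
        else:
            colour = "black"   # within 10% of the overall rate
        result[domain] = (ratio, colour)
    return result


# Hypothetical referring domains: (clicks, conversions).
stats = {
    "google.com": (8000, 400),
    "partner-a.example": (1500, 45),
    "partner-b.example": (500, 26),
}

# Sort by click volume, as in the figures.
for domain, (ratio, colour) in sorted(classify_domains(stats).items(),
                                      key=lambda item: stats[item[0]][0],
                                      reverse=True):
    print(f"{domain}: {ratio:.2f} ({colour})")
```

A ratio of 1.00 means the domain converts exactly at the overall rate, so the bars in a chart like this can be read directly as “how much better or worse than average”.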
In the US and Australasia most search partners bring in traffic that is well below Google.com traffic in quality. There are some differences in relative traffic value between the two regions, notably for Amazon traffic. Out of interest we repeated the above analysis after first excluding all brand traffic. The results show that the relative traffic value between search partners changes considerably when we do this. In figure 3 we show data for the set of Australasian clients above after excluding all brand traffic. eBay is an example where the exclusion of brand traffic has increased the relative value of its traffic considerably. Volumes do become quite thin, so the inherent statistical variability should be kept in mind when drawing inferences.
It is clear that the quality of traffic from most search partners is quite a bit lower than that of direct search traffic from Google. The ability to bid differently by search partner would therefore add quite a bit of value. AdWords advertisers can already bid differently for the Google domain and the rest of the network by splitting every campaign into a Google-only version and an exact copy that targets Google + search partners. The bids for the Google-only campaigns are higher because the conversion rates are higher, hence on Google.com only the Google-only campaigns are in play. That means that the Google + search partner campaigns actually only serve ads on the syndication network, and the bids can be lowered for that traffic. This is a rough workaround and not ideal. The ability to bid differently by domain would be much better.

Figure 3: Relative non-brand traffic value by search partner for a selection of AUS clients
Can Google queries help predict economic activity?
In Bill Tancer’s book Click, he gives some examples of how near real-time Internet data provides a time advantage over traditional leading economic indicators. These indicators are typically only available with a time lag: the data for a particular month is generally released about halfway through the next month. I found this concept quite interesting when I read the book a year ago, but I never really pursued it analytically until I recently discovered a nice interface for querying Google Trends data from within R, the leading freely available open-source statistical software package. The R package RGoogleTrends (developed under the Omegahat Project) provides a very useful tool to extract and analyze Google query data efficiently. The documentation for this package states that its development was inspired by a blog post by Google’s chief economist, Hal Varian, which was published on Google’s Research Blog. The authors of that post illustrate some simple forecasting methods and encourage readers to undertake their own analyses. By their own admission it is possible to build more sophisticated forecasting methods. We decided to take up the challenge, because at Clicks2Customers we are always keen for an analytical challenge, especially if it comes from the mighty Google.
R has a wide range of sophisticated time series packages, which we decided to put to the test to see if the incorporation of query data can indeed improve the estimation and forecasting of leading economic indicators. In this post we focus on the monthly home sales data that are released by the US Census Bureau and the US Department of Housing and Urban Development at the end of each month, and which were used in Google’s study. In order to make our results comparable to those of Google, we use the same January 2004 to July 2008 time window.
Our aims are two-fold:
- Verify that a more sophisticated time series modeling approach improves accuracy compared to Google’s relatively simple models
- Verify that the inclusion of query data in models improves the accuracy of estimates
In figures 1 and 2 we show the raw and seasonally adjusted home sales data downloaded from the US Census Bureau. As in the Google study, we start our modeling process on the seasonally adjusted sales figures. This is to aid a comparison of our results with those of Google, although the seasonal component can easily be modeled directly. Google Trends provides an index of the volume of Google queries by geographic location and category. The query index of a search term reflects the query volume for that term in a given geographical region divided by the total number of queries in that region at a point in time. This index is then normalized relative to January 1, 2004. The index at a later date therefore reflects a percentage deviation from January 1, 2004. Google Trends data is also available on a category and sub-category level. Figure 3 shows the search index data for the ‘Real Estate’ category and 5 of its sub-categories: Real Estate Agencies, Home Financing, Home Inspections & Appraisal, Property Management, and Rental Listings & Referrals.

Figure 1: Raw Home Sales Data

Figure 2: Seasonally adjusted home sales data

Figure 3: Google query volumes for the Real Estate category and 5 of its sub-categories
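To make the query index described above more concrete, here is a minimal sketch of the normalization. It is our own reading of the description rather than Google's actual implementation: the function name and the input numbers are hypothetical, and we assume the first observation plays the role of the January 1, 2004 baseline.

```python
def query_index(term_queries, total_queries):
    """Percentage deviation of a term's query share from its first value.

    term_queries[i]  : number of queries for the term in period i
    total_queries[i] : total number of queries in the region in period i
    Assumption: period 0 is the baseline date (January 1, 2004 in the post).
    """
    shares = [term / total for term, total in zip(term_queries, total_queries)]
    baseline = shares[0]
    return [100.0 * (share / baseline - 1.0) for share in shares]


# Hypothetical counts: the share moves from 5% to 6% and then to 4.5%.
print(query_index([50, 60, 45], [1000, 1000, 1000]))
```

With these made-up counts the index starts at zero, rises to roughly +20 and then drops to roughly -10, which is how a percentage deviation from the baseline would read.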
The Google study fits simple auto-regressive models using standard linear model fitting functions. A closer investigation of these models shows that they do not adequately model the correlation structure in the data. We will follow a more classical time series approach based on autoregressive integrated moving average (ARIMA) models. In our study we first model the house sales data on its own, in order to establish a performance benchmark. Thereafter, we incorporate query data in the models to test whether its inclusion can improve the prediction of house sales data. We evaluate the different models by making a series of one-month ahead predictions and computing the prediction error, summarized as the mean absolute error (MAE), as defined in the Google study. Each forecast uses only the information available up to the time the forecast is made, which is one week into the month in question.
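The evaluation scheme can be sketched as a rolling one-step ahead loop. This is only an illustration under our own assumptions: the function names are hypothetical, the error for each month is taken as the absolute relative error (the Google study's exact definition is not reproduced here), and the naive "last value" forecaster merely stands in for a real model.

```python
def rolling_mae(series, forecaster, start):
    """Mean absolute one-step ahead prediction error.

    For each t >= start, the forecaster sees only series[:t] and predicts
    series[t]. Errors are measured relative to the actual value (assumption).
    """
    errors = []
    for t in range(start, len(series)):
        history = series[:t]            # no look-ahead beyond time t-1
        prediction = forecaster(history)
        errors.append(abs(prediction - series[t]) / series[t])
    return sum(errors) / len(errors)


def last_value(history):
    """Naive stand-in forecaster: predict that the last value repeats."""
    return history[-1]


# Hypothetical monthly figures, for illustration only.
sales = [100.0, 110.0, 99.0, 99.0]
print(f"MAE: {100 * rolling_mae(sales, last_value, start=1):.2f}%")
```

Any model can be plugged in as the forecaster, as long as it only uses the history it is given.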
The simplest time series model that is closest to the null model (Model 0) presented by Google is an ARIMA(1,1,0) model. The difference is that our model takes a lag-1 difference of the log-transformed data to reduce it to a stationary series, which is a necessary prerequisite. This model provides a reasonable fit to the data and gives a prediction error of 6.03%, which is lower than the 6.91% of Google’s null model. There is some suggestion in the data that a higher-order auto-regressive model may provide a better fit. We found that an ARIMA(7,2,0) model does result in an improved fit and a significantly reduced prediction error of 4.04%. Both of these models already outperform Google’s more advanced model (Model 1), which has a prediction error of 6.08% even though it incorporates query data and house prices.
Next we take it up a notch by incorporating the above Google query data and fitting a multivariate time series model. We use the query data in the first week of each month. We experimented with different combinations of the above query indices and found that the Property Management query index gives the lowest prediction error of 3.7%. The model we fit is a vector auto-regressive model with a lag of 3, using the R package dse. The monthly 1-step ahead prediction errors for the above models are plotted in figure 4.

Figure 4: US home sales data 1 step ahead prediction errors
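As a rough illustration of the ARIMA(1,1,0) structure described above, the sketch below takes a lag-1 difference of the log-transformed series, fits an AR(1) coefficient to the differences and produces a one-step ahead forecast. It is not the code behind our results: the function name is hypothetical, the coefficient is estimated by simple least squares without an intercept (rather than the maximum likelihood fit a statistical package would typically use), and the input numbers are made up.

```python
import math


def arima110_forecast(series):
    """One-step ahead forecast from an ARIMA(1,1,0)-style model on logs.

    Model (assumption): z_t = phi * z_{t-1} + e_t, where z_t is the lag-1
    difference of log(series). Needs at least three observations.
    """
    logs = [math.log(y) for y in series]
    z = [b - a for a, b in zip(logs, logs[1:])]   # differenced log series

    # Least-squares estimate of phi from pairs (z_{t-1}, z_t).
    numerator = sum(cur * prev for prev, cur in zip(z, z[1:]))
    denominator = sum(prev * prev for prev in z[:-1])
    phi = numerator / denominator if denominator else 0.0

    # Forecast the next log difference, then undo the differencing and log.
    return math.exp(logs[-1] + phi * z[-1])


# Hypothetical series growing by 10% per period.
print(arima110_forecast([100.0, 110.0, 121.0, 133.1]))
```

For a series that grows at a constant rate, the fitted coefficient is close to 1 and the forecast simply continues the growth, which is a handy sanity check on the differencing and back-transformation steps.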
Let us return to our stated aims. It seems like we have verified both aims, namely that a more formal time series approach improves considerably on the models presented by Google and that the inclusion of query data has the potential to further improve the 1-step ahead prediction in the case of the house sales data. Our best performing model improves about 39% on the best model presented in the Google study in terms of 1 step ahead prediction accuracy (without incorporating the house price data used by Google yet). There seems to be potential in using Google query data in forecasting economic data.
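The 39% figure follows directly from the prediction errors quoted earlier: 6.08% for Google's Model 1 and 3.7% for our best model. A small helper (the function name is our own) makes the arithmetic explicit:

```python
def relative_improvement(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Google's Model 1 (6.08%) versus our best model (3.7%).
print(round(100 * relative_improvement(6.08, 3.7), 1))  # 39.1
```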
This is a single example, and a proper study will have to apply a more sophisticated modeling approach to a much wider range of data sets. The Google study also illustrates the use of Google Trends data in predicting travel visits. In their example they use data from the Hong Kong Tourism Board. We intend to perform a similar study using monthly tourism data released by Statistics South Africa in conjunction with Google Trends data for the period building up to the 2010 FIFA World Cup. This should make for an interesting case study for the use of Google Trends data. Keep an eye on our blog for the results sometime in the future!
getstats – promoting the understanding of statistics
Data is becoming more and more important in every sphere of society. This is underlined by companies like Google that have it as their mission to organize the world’s information and make it universally accessible and useful. Major consulting firms are acknowledging data-driven decision making as an emerging global trend. This trend is not limited to the business world. We are increasingly being exposed to statistics and data in our everyday lives.
The Royal Statistical Society is launching its 10 year campaign for statistical literacy on World Statistics Day: 20/10/2010. The vision for the campaign, known to its friends as getstats, is “a society in which our lives and choices are enriched by an understanding of statistics”. Please visit http://www.getstats.org.uk for more information and to show your support. As a company operating in a data-driven industry, we are proud to support this global initiative.



