Sampling can drastically reduce the accuracy of data queried from Google Analytics, but in many cases it can be eliminated by using the API.
Google Analytics is clearly an enterprise-capable analytics platform. As websites with higher and higher traffic take advantage of its features, Google has put limits on how much data can be included in a query. If a query requires more than this, Google Analytics applies sampling (called fast access mode in the Google Analytics user interface), which ensures that the user interface and API respond quickly.
Sampling provides an estimate based on a sample of visits in cases where Google has not pre-aggregated datasets. The most common cause of sampling is the use of advanced segments in a query.
In many cases, sampling only slightly affects the accuracy of the result, but in many others it makes it impossible to get valid data.
The good news is that Google has provided a fantastic API, which lets tools like Analytics Canvas provide a solution.
Generally, sampling kicks in at 500,000 visits in the query period. We can eliminate sampling by breaking a query up into multiple queries by time period, so that no individual query triggers sampling.
For example, imagine a site that gets 20,000 visits a day. Any query of more than a month in duration will trigger sampling, since 30 days already amounts to about 600,000 visits. However, a 7-day query involves only 140,000 visits, so the API will return exact data.
Analytics Canvas lets you automate this. We call it query partitioning.
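To make the idea concrete, here is a minimal sketch of the partitioning arithmetic in Python. It is not how Analytics Canvas is implemented; the function names, the example dates, and the choice to stay strictly under the 500,000-visit rule of thumb are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Rule-of-thumb threshold from the article; the real limit may differ.
SAMPLING_THRESHOLD = 500_000


def max_days_without_sampling(visits_per_day, threshold=SAMPLING_THRESHOLD):
    """Largest whole number of days whose total visits stay under the threshold."""
    return max(1, (threshold - 1) // visits_per_day)


def partition(start, end, days_per_query):
    """Split the inclusive date range [start, end] into consecutive windows."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days_per_query - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


if __name__ == "__main__":
    # Hypothetical site with 20,000 visits a day, queried in 7-day windows.
    print(max_days_without_sampling(20_000))
    for window_start, window_end in partition(date(2013, 1, 1), date(2013, 1, 31), 7):
        print(window_start, "to", window_end)
```

Each window would then be sent to the API as its own query. Shorter windows mean more queries but a larger safety margin on days with traffic spikes.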
Let’s look at an example from one of our clients. We ran a query on a site for a period with almost 9 million visits. We’re looking for a specific source, so we have defined an advanced segment. This results in very heavy sampling: in fact, less than 6% of the visits are actually included in the sample.
Let’s compare the sampled data with the unsampled data that we can get using Analytics Canvas. In this case, all we need to do is specify that we want to break the query up into 15-day increments to ensure that exact data is retrieved. Analytics Canvas then automatically aggregates the results of all the different queries.
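The aggregation step can be pictured roughly as follows. This is only a sketch under assumed data shapes: each partial result is taken to be a list of (page, pageviews) rows, which is a hypothetical layout and not the actual API response format or the Analytics Canvas internals.

```python
from collections import Counter


def aggregate(partial_results):
    """Sum an additive metric (here, pageviews) per page across all partitions."""
    totals = Counter()
    for rows in partial_results:
        for page, pageviews in rows:
            totals[page] += pageviews
    return dict(totals)


if __name__ == "__main__":
    # Made-up rows from two partitions, for illustration only.
    first_window = [("/home", 10), ("/pricing", 5)]
    second_window = [("/home", 3), ("/contact", 1)]
    print(aggregate([first_window, second_window]))
```

Summing works here because pageviews and visits are additive across time periods.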
Here is how sampled vs unsampled data sets compare:
We can see that analysis of the sampled data alone could lead to very different conclusions, and a significant amount of detail is lost in the sampling process. While the number of pageviews is off by 8%, sampling allocates those pageviews among about 400 pages when, in fact, over two thousand different pages were viewed.
One thing that partitioned queries cannot do is report unique visitors or other unique-type metrics. Sampling is avoided by breaking up the time period and running multiple queries, so if a metric like visitors is included, the unique visitors for each period are added up. That is not the same as the unique visitors for a month, for example, because a visitor who returns in several periods is counted once in each.
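A toy example shows why unique counts cannot simply be summed. The visitor names and the three periods below are invented for illustration; the point is only that adding per-period unique counts overstates the true figure whenever someone returns.

```python
def sum_of_period_uniques(visitors_by_period):
    """What a partitioned query effectively reports: per-period uniques added up."""
    return sum(len(visitors) for visitors in visitors_by_period.values())


def true_uniques(visitors_by_period):
    """Distinct visitors across the whole range, which needs visitor-level data."""
    return len(set().union(*visitors_by_period.values()))


if __name__ == "__main__":
    # Hypothetical visitor IDs seen in each period.
    visitors_by_period = {
        "period 1": {"alice", "bob"},
        "period 2": {"bob", "carol"},
        "period 3": {"alice", "bob", "dave"},
    }
    print(sum_of_period_uniques(visitors_by_period))  # 7
    print(true_uniques(visitors_by_period))           # 4
```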
The only way to get that level of exactness is by getting at the raw data, and that means Google Analytics Premium.
If you have a high-volume site, I’d strongly suggest you consider Google Analytics Premium. While a number of our customers subscribe to Premium, at $150,000 a year it’s not for everyone. But for companies that do serious business online, it offers huge value.
So while there are limits on how much data can be included in a query, for both Premium and regular Google Analytics customers, Analytics Canvas can help you uncover those hidden insights.
This article was written by James Standen
James is the founder of nModal Solutions, the creator of the Analytics Canvas tools. nModal's vision is to bring an entire new class of visual, flexible tools to web analytics and social media analysis. You can find him on Google+.