The shady world of Google Analytics proxying

If you’re tracking me, at least be honest about it

27th September 2020 1,721 words

I was looking at Google’s new Eleventy starter, eleventy-high-performance-blog, the other day, when one specific feature jumped out at me:

Supports locally serving Google Analytics's JS and proxying it's hit requests to a Netlify proxy (other proxies could be easily added).

As someone who is all too familiar with the flaws in Google Analytics — it’s something I’ve written about before — this struck me as exceedingly clever, but also rather sneaky. To recap, I had three main issues with Google Analytics:

  1. It’s not accurate
  2. It’s bad for performance
  3. It’s bad for privacy

This technique of locally serving Google Analytics's JavaScript and proxying its hit requests has the potential to almost entirely solve the first two issues.

In this post, I’ll start with a brief explanation of CNAME cloaking, another method commonly used to conceal the nature of tracking JavaScript. I’ll then cover how Google Analytics proxying differs, and see how it improves upon other ways of loading Google Analytics. I’m also going to cover its privacy implications and explain why, despite its obvious benefits for site owners, I’d strongly advise against using it.

CNAME cloaking

While exploring methods to hide tracking JavaScript from ad-blockers, I found this post by Roger Comply, on the topic of Plausible Analytics (a service which I also covered in my previous post), which mentions CNAME cloaking — a technique which exploits CNAME records to make third-party resources seem like first-party resources.

A CNAME record is a type of DNS record which maps an 'alias' name to a true or 'canonical' domain name. Say your website lives on my-site.com and you load some tracking JavaScript from tracking-site.com. Like google-analytics.com, tracking-site.com is known to the developers of ad blockers as a tracking service, so any request to it is blocked. If you add a CNAME record for nothing-suspicious-here.my-site.com (a subdomain of your site, the alias), and point it to tracking-site.com (the canonical domain) you can load the tracking code from your alias domain, making it much harder for ad blockers to identify its true purpose.

The Scooby Doo 'Let’s See Who This Really Is' meme where Fred removes the ghost mask, overlaid with the text 'nothing-suspicious-here.my-site.com', to reveal the text 'tracking-site.com'

But this technique is not guaranteed to bypass ad blockers: since version 1.25.0, uBlock Origin has been able to 'CNAME-uncloak' network requests to determine the canonical domain and block it if appropriate.
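To make the uncloaking idea concrete, here is a minimal sketch of the logic, not uBlock Origin's actual implementation. It assumes the CNAME records are already available in an in-memory map (a real blocker has to ask the browser's DNS resolver), and the names `resolveChain`, `isListed` and `shouldBlock` are hypothetical.

```typescript
// Hypothetical DNS data: alias -> canonical name, as a CNAME record defines it.
type CnameRecords = Map<string, string>;

// Collect every name visited while following CNAME records from a hostname.
// maxHops guards against loops in misconfigured records.
function resolveChain(host: string, records: CnameRecords, maxHops = 10): string[] {
  const chain = [host];
  let current = host;
  for (let hop = 0; hop < maxHops; hop++) {
    const target = records.get(current);
    if (target === undefined) break;
    chain.push(target);
    current = target;
  }
  return chain;
}

// True if the host itself, or one of its parent domains, is on the blocklist.
function isListed(host: string, blocklist: Set<string>): boolean {
  const labels = host.split(".");
  for (let i = 0; i < labels.length - 1; i++) {
    if (blocklist.has(labels.slice(i).join("."))) return true;
  }
  return false;
}

// Block the request if any name in the CNAME chain is a known tracker.
function shouldBlock(host: string, records: CnameRecords, blocklist: Set<string>): boolean {
  return resolveChain(host, records).some((name) => isListed(name, blocklist));
}

const records: CnameRecords = new Map([
  ["nothing-suspicious-here.my-site.com", "tracking-site.com"],
]);
const blocklist = new Set(["tracking-site.com"]);

console.log(shouldBlock("nothing-suspicious-here.my-site.com", records, blocklist));
console.log(shouldBlock("www.my-site.com", records, blocklist));
```

The alias looks like a first-party hostname, but once the chain is followed it ends at a listed domain, so the first call reports that the request should be blocked while the second does not.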

What is Google Analytics Proxying?

The basic idea behind Google Analytics proxying is to divert any traffic that would normally go directly between the browser and Google through an intermediary. This intermediary, known as a reverse proxy, sends data to and from Google, hiding the data's true destination/source from the browser — if you open up the Network tab in your browser's developer tools, instead of google-analytics.com, you'll see the URL of your proxy.

A diagram demonstrating reverse proxying, showing from left to right: an illustration of a desktop monitor, labelled 'web browser'; a server labelled 'proxy' and another server labelled 'Google'. Between each item, there are arrows pointing back and forth, representing the flow of data.

When setting up a Google Analytics proxy, there are two distinct types of request which would usually go directly to Google, which we want our server to handle:

  1. Loading the analytics.js library
  2. Sending pageviews and other tracking events

Self-hosting analytics.js

The first requirement of this technique is hosting analytics.js somewhere other than Google. You could use the method from this article by Stefano Chiodino whereby your proxy loads the most up-to-date analytics.js library directly from Google, rewriting any hardcoded URLs on the fly. However, there is a simpler approach: copy analytics.js from google-analytics.com to your own server, editing any hardcoded URLs to point to your proxy instead of google-analytics.com. You won’t receive any updates to the analytics.js library, but unless you want to use the newest features, it’s unlikely to cause any noticeable problems.

A screenshot from the Network tab in Firefox developer tools, showing the contents of a file named cached.js which is a copy of analytics.js

eleventy-high-performance-blog uses this second approach: this file is named 'cached.js' but when you open it up, you’ll see a modified version of analytics.js with any references to Google URLs replaced.
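The URL rewriting step could be sketched roughly as below. This is an illustration under assumptions, not the script eleventy-high-performance-blog uses: the function name `rewriteAnalyticsUrls` and the proxy address `https://my-site.com/proxy` are hypothetical, and the real analytics.js may assemble some URLs in ways a simple pattern like this would miss.

```typescript
// Replace references to google-analytics.com in a script's source with a proxy base URL.
// Matches http://, https:// and protocol-relative (//) references, with or without "www.".
// Other subdomains are deliberately left alone in this sketch.
function rewriteAnalyticsUrls(source: string, proxyBase: string): string {
  const googleHost = /(?:https?:)?\/\/(?:www\.)?google-analytics\.com/g;
  // A replacer function avoids "$" in proxyBase being treated as a special pattern.
  return source.replace(googleHost, () => proxyBase);
}

// Made-up fragment standing in for the contents of analytics.js.
const original = 'var endpoint = "https://www.google-analytics.com/collect";';
const rewritten = rewriteAnalyticsUrls(original, "https://my-site.com/proxy");

console.log(rewritten);
```

You would run something like this once, when copying the file to your own server, and then check the result by hand for any Google URLs that slipped through.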

Proxying tracking requests

The second task, tracking events, is a little more complex. Here’s an article on the technique from 2017 which suggests using a Node.js server to act as your proxy, but for sites built with Static Site Generators (SSGs), this adds a layer of complexity and potentially an extra cost. eleventy-high-performance-blog uses Netlify functions instead. These are a way of running backend logic without needing to maintain a server, which is why they’re commonly called 'serverless' functions. The code for Netlify functions lives in your site’s git repository and gets deployed automatically along with your site. There's the added benefit that you get a quota of 125k free function calls per month on Netlify’s free tier. The function, which you can view here, is triggered by any request to yoursite.com/.netlify/functions/ga.

This function does more than just forward the request directly to google-analytics.com. Google Analytics determines a lot of information from the origin of the request, not just its content, so when forwarding these requests you need to add a few extra parameters to the URL.
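As a rough sketch of what such a function might do (not the starter's actual code), the example below copies the incoming hit's query string and adds the visitor's IP address and user agent using the Measurement Protocol's `uip` and `ua` override parameters. Which parameters the real function adds is my assumption, as are the header names, the `ProxyEvent` shape and the helper names. It only handles hits sent in the query string, and the code that actually sends the request is passed in so the example stays self-contained.

```typescript
// Simplified, assumed shape of the event a serverless function receives.
interface ProxyEvent {
  queryStringParameters: Record<string, string> | null;
  headers: Record<string, string | undefined>;
}

const GA_COLLECT = "https://www.google-analytics.com/collect";

// Build the URL to forward to Google, overriding IP and user agent so the hit
// is attributed to the visitor rather than to the proxy server.
function buildForwardUrl(event: ProxyEvent): string {
  const params = new URLSearchParams(event.queryStringParameters ?? {});
  // Assumption: the first entry of x-forwarded-for is the visitor's address.
  const ip = (event.headers["x-forwarded-for"] ?? "").split(",")[0].trim();
  const userAgent = event.headers["user-agent"] ?? "";
  if (ip) params.set("uip", ip);
  if (userAgent) params.set("ua", userAgent);
  return `${GA_COLLECT}?${params.toString()}`;
}

// The sender is injected; in a deployed function it would make the HTTP request
// and resolve to the response's status code.
type Sender = (url: string) => Promise<number>;

function makeHandler(send: Sender) {
  return async (event: ProxyEvent): Promise<{ statusCode: number; body: string }> => {
    const statusCode = await send(buildForwardUrl(event));
    return { statusCode, body: "" };
  };
}

// Made-up pageview hit, using addresses reserved for documentation.
const exampleEvent: ProxyEvent = {
  queryStringParameters: { v: "1", t: "pageview", dp: "/blog/" },
  headers: { "x-forwarded-for": "203.0.113.7, 10.0.0.1", "user-agent": "ExampleBrowser/1.0" },
};

console.log(buildForwardUrl(exampleEvent));
```

Without the overrides, every hit would appear to come from the proxy's own IP address and user agent, which would make location and browser reports meaningless.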

The end result is that your Google Analytics dashboard should show exactly the same data it would have if you weren't using a proxy, with one important difference: you’re likely to see more visitors and more pageviews. This brings me to the first key difference between this technique and non-proxied analytics: accuracy.

How does it differ from regular Google Analytics?

Accuracy

One of the biggest causes of inaccurate analytics data, if not the biggest, is the tracking protection built into browsers and browser add-ons. When I use my usual ad blocker, uBlock Origin, both analytics.js served from Google and any requests to google-analytics.com are blocked automatically. What does this mean for visitor and pageview numbers? These users are effectively invisible to client-side trackers, leaving a big gap in analytics data sets; but when Google Analytics is served through a proxy, all of these requests get through.

To find a figure for the size of this gap, you can start by looking at ad blocker installs as a proportion of web users. This 2020 survey by AudienceProject shows that 36% of respondents in the UK said they use an ad blocker (down from 47% in a previous survey). Unfortunately, there are ad blockers which don’t even block all ads, let alone tracking JavaScript. Things are further complicated by the fact that many ad blockers which don’t block analytics by default have an option to enable blocking. For example, to block analytics trackers in the most widely-used ad blocker, Adblock Plus, you have to navigate to the extension options and tick the 'Block additional tracking' box, something I’d expect most users to ignore.

A screenshot of the settings page for Adblock Plus, showing an unchecked checkbox labelled 'Block additional tracking'

The website of rrreGAIN, a company whose entire business model seems to be based on CNAME cloaking and proxying Google Analytics, says "30% of your web traffic is invisible", with the caveat that this could range "from 5% to as high as 60%", depending on the type of traffic your site attracts. But I’d be more inclined to believe this article from December 2017, which compared client-side hits (which can be blocked) with server-side hits (which can’t be blocked). It found that only 8% of users were blocking Google Analytics.
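The arithmetic behind that kind of comparison is simple enough to sketch. The hit counts below are made up for illustration and are not taken from the article; the function names are hypothetical too.

```typescript
// Share of hits that never reached the client-side tracker, given counts of the
// same traffic measured on the server (unblockable) and in the browser (blockable).
function blockedShare(serverHits: number, clientHits: number): number {
  if (serverHits <= 0) throw new RangeError("serverHits must be positive");
  return (serverHits - clientHits) / serverHits;
}

// Estimate the real number of visits behind a figure reported by a blockable tracker.
function estimateActualVisits(reportedVisits: number, share: number): number {
  if (share < 0 || share >= 1) throw new RangeError("share must be in [0, 1)");
  return reportedVisits / (1 - share);
}

// Illustrative numbers only: 1,000 server-side hits against 920 client-side hits.
const share = blockedShare(1000, 920);
const actual = Math.round(estimateActualVisits(920, share));

console.log(`${(share * 100).toFixed(1)}% blocked, about ${actual} actual visits`);
```

With these numbers the blocked share comes out at 8%, so a dashboard reporting 920 visits would be hiding roughly 80 more.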

Performance

If you’re loading analytics.js from Google, there’s only so much you can do to improve performance, but hosting it somewhere you control can help to reduce its performance cost.

Privacy

First of all, as with any collection of personal data, you’ve got the law to think about. “I’m not collecting any personal data!” you might say, but under privacy laws like the European Union’s General Data Protection Regulation (GDPR), IP addresses can be personal data, and you need either a good reason for collecting them[1], or explicit consent from your website visitors. Google Analytics gives you the option to anonymise IP addresses, only storing part of the IP address, which may be enough to be GDPR-compliant[2] (I’m not a lawyer, in case you hadn’t guessed).
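To illustrate what storing only part of an IP address can look like, here is a small sketch that zeroes the final octet of an IPv4 address, which is how I understand Google's anonymisation to treat IPv4. The function name `anonymizeIPv4` is hypothetical, this is not Google's code, and IPv6 is left out entirely.

```typescript
// Zero the last octet of an IPv4 address, e.g. before logging or forwarding it.
function anonymizeIPv4(ip: string): string {
  const octets = ip.split(".");
  const valid =
    octets.length === 4 &&
    octets.every((octet) => /^\d{1,3}$/.test(octet) && Number(octet) <= 255);
  if (!valid) throw new Error(`Not an IPv4 address: ${ip}`);
  return [...octets.slice(0, 3), "0"].join(".");
}

// 203.0.113.0/24 is a range reserved for documentation.
console.log(anonymizeIPv4("203.0.113.77")); // prints "203.0.113.0"
```

The truncated address still narrows a visitor down to a block of 256 addresses, which is part of why it only *may* be enough.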

That said, your Google Analytics proxy doesn’t collect any more data than regular Google Analytics; in fact, because you’re using your own domain, there’s no opportunity for Google to set third-party cookies, although unless you’re running ads, Google Analytics itself only uses first-party cookies anyway.

Above all, the best argument not to disguise your tracking code as a first-party resource is that it’s fundamentally dishonest; and this kind of deception is far more widespread than you might think. The fact that so-called 'privacy-friendly' analytics services like Plausible offer CNAME cloaking shows just how normalised this type of behaviour has become. Isn’t privacy about respecting people’s right to choose how much they share? Perhaps it’s time site owners stopped inventing new tracking methods for the sake of a single-digit improvement to the accuracy of analytics data, and instead thought about why people choose to block trackers in the first place.


  1. What is personal data?, the Information Commissioner’s Office

  2. Google Analytics IP Anonymization and GDPR Compliance, Data Driven