3 Crucial Tips for Optimizing Google Analytics with Microsoft Power BI

Background

It’s estimated that 55% of all the world’s websites have Google Analytics [GA] installed on them, a number that dwarfs the next-largest competitor, Adobe. Due to its sheer size, scale, ease of use, and robust APIs, pretty much everyone supports it, including Microsoft Power BI. Power BI’s integration with GA is longstanding, pretty solid, and largely dependable. However, there are still potential issues with slow user experience [UX] and data accuracy arising from the nature of the data (high cardinality, long text fields, GA sampling limits). But all is not lost; this article will showcase three tips for restoring performance and dramatically increasing accuracy via some easy steps. It is worth noting that although this article is geared toward the free version of GA, the tips apply to the Premium [360] version and to other data sources as well.

So, without further ado – here are three tips for speeding up Power BI and making your data more accurate:

Tip #1 – Speed up Refresh time and improve your data quality by fracturing your GA query

When loading GA data into Power BI, the temptation is to simply select the entire date range for your property. This is a huge mistake for several reasons: 1) it reloads all of the [unchanging] historical data every time you refresh, 2) if you have a lot of traffic, your data quality will suffer hugely due to sampling, and 3) the sampling will decrease resolution and prevent you from getting granular with the data.

To resolve this, I recommend breaking your GA data load down into bite-sized chunks: essentially, creating an array of identical queries, each covering a given date range. With Power Query in Power BI you can then stitch together [Append] an unlimited number of queries into a larger dataset. The overall process is fairly quick, provided you don’t abuse calculated columns in the output dataset (more on that later). Converting from a single GA query to an array of queries is pretty straightforward, although you will need to move your measures and calculated columns to the new [roll-up] table after implementing.
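To make that concrete, here is a minimal sketch of what one date-bounded chunk query might look like in Power Query M. The account, property, and view IDs, the query names, and the dimension/metric choices below are placeholders rather than real values; your own connector-generated navigation steps will differ, and the only part that changes from chunk to chunk is the date filter at the end.

    // Hypothetical chunk query, e.g. "GA Sessions 2023-01" (one month of data).
    // All IDs below are placeholders; keep the navigation steps the Google Analytics
    // connector generates for your own account, property, and view.
    let
        Source = GoogleAnalytics.Accounts(),
        Account = Source{[Id = "12345678"]}[Data],
        Property = Account{[Id = "UA-12345678-1"]}[Data],
        View = Property{[Id = "98765432"]}[Data],
        // Keep the dimension/metric list as short as your reporting allows.
        Added = Cube.Transform(View, {
            {Cube.AddAndExpandDimensionColumn, "ga:date", {"ga:date"}, {"Date"}},
            {Cube.AddAndExpandDimensionColumn, "ga:pagePath", {"ga:pagePath"}, {"Page"}},
            {Cube.AddAndExpandDimensionColumn, "ga:sourceMedium", {"ga:sourceMedium"}, {"Source"}},
            {Cube.AddMeasureColumn, "Sessions", "ga:sessions"}
        }),
        // The only step that changes per chunk: a non-overlapping date range.
        Filtered = Table.SelectRows(Added,
            each [Date] >= #date(2023, 1, 1) and [Date] <= #date(2023, 1, 31))
    in
        Filtered

Each copy of this query gets its own date range; the copies then become the inputs to the append step in #2 below.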

  1. Take your existing baseline query and make copies of it, one for each period you want to capture, adjusting the date filter on each copy. You will need to decide on the frequency based on your overall traffic and the number of dimensions/metrics in your query. I don’t want to get ‘into the weeds’ with representative sample percentages and target frequencies because they vary depending on your data. What I will say is: pick a frequency you feel gives you adequate resolution. Generally, that’s once a month for a site with strong traffic and maybe every 3-6 months for a small site. Lastly, if you have user-level data in GA that you need to expose, you will need to remove all sampling and may have to go to daily pulls.
  2. Create a new query using the “Append Queries as New…” option and load the data once. You will need to select the “Three or more tables” option and include all of the new fractured tables. This will run (briefly) and you will notice a new query listed; it will have one step with “= Table.Combine({..” syntax in it (see the sketch after this list). By the way, don’t use Merge, as that will slow the process down because it scrubs every record on insert. Just make sure you’re selecting date ranges (in #1 above) that DON’T OVERLAP.
  3. After you have performed your initial data load, you won’t need to re-fetch the historical data; doing so is just a waste of time and CPU. Power Query includes an option to exclude a query from future report refreshes: go to each of your historical queries (e.g. last month, the month before, and so on), right-click, and UNCHECK the “Include in Report Refresh” option. Power BI will remember the original data but won’t bother asking GA for updates.
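For reference, the append query that “Append Queries as New” generates in step 2 is simply a Table.Combine over the chunk queries. A minimal sketch, with made-up query names:

    // Hypothetical roll-up query produced by "Append Queries as New".
    // The chunk names are placeholders; list every non-overlapping chunk here.
    let
        Combined = Table.Combine({
            #"GA Sessions 2023-01",
            #"GA Sessions 2023-02",
            #"GA Sessions 2023-03"
        })
    in
        Combined

You normally won’t write this by hand – the UI emits it for you – but this combined query is the [roll-up] table your measures and calculated columns should live against once the load is fractured.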

NOTE: A few follow-up points on fragmenting your data: fragmenting will increase the physical data size. Although Power BI’s compressed, columnar data store is very quick, the sheer cardinality of the text data (especially dimensions like Page and Source/Medium) can choke the Power BI/SSAS engine. But don’t fret just yet; we have the solution coming up next…

 

Tip #2 – Simplify your data before it’s in the database

Do your reports take many seconds to refresh whenever you interact with them? Are you intimately familiar with the spinning dots? Most users will experience a considerable degradation of performance in Power BI after layering in all seven dimensions and metrics, calculated columns, and measures. You must first understand that the columnar database in Power BI likes its data tall and skinny (i.e. lots of rows, but minimalist on columns) to be as productive as possible. This is in contrast to a traditional SQL database, which prefers the opposite. It also benefits Power BI not to have large text fields full of unique values. So, how do we address this?

  1. It’s inevitable that we will need to pull in the [Page] or [Page Title] dimensions when querying GA. Part of the problem is all the ‘crap’ in the query string. We see it as superfluous junk, but Power BI sees every variation as something unique. The best way to address this is to scrub these fields down to just their relevant data DURING THE IMPORT, and NOT to add additional calculated columns later. If you have campaign or tracking codes in the query string that you need to interrogate, see if there is a way to simplify them at data-load time. A typical example is the same page appearing with a boatload of unique query strings generated by some Chinese bot farm we don’t care about; bringing that data into Power BI is a crime, and you should simplify as much as possible during the load. Here, I would split the Page column at the ‘?’ and then delete the second half of the split. Since there is also a lot of cardinality in the column carrying that same bot traffic, I would build a filter (find/replace) for the primary offenders (again, the Chinese bot farms) – see the sketch after this list.
  2. Since most Power BI reports are constantly evolving and changing, routinely audit your tables for unused calculated columns and remove them. They make the data fat and slow, and they hamper your load times. I keep a document with all the functions I’ve ever created in case I need to re-add one at some point.
  3. Determine whether a calculated column would be better suited as a measure. I will admit that when I was first learning DAX, it was less challenging to use calculated columns to process data, but doing so slows the system down.
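As a sketch of point 1, here is roughly what the scrubbing could look like as additional Power Query steps during the import. The query name, the “Page” and “Source” column names, and the bot hostname are assumptions carried over from the earlier sketches, not values from a real property:

    // Hypothetical clean-up steps applied during import, before the data reaches the model.
    let
        RollUp = #"GA Sessions Combined",   // placeholder name for the appended roll-up query
        // Equivalent to splitting the Page column at "?" and discarding the second half:
        NoQueryString = Table.TransformColumns(RollUp,
            {{"Page", each Text.BeforeDelimiter(_, "?"), type text}}),
        // Drop rows from known junk referrers instead of storing their unique values.
        NoBots = Table.SelectRows(NoQueryString,
            each not Text.Contains([Source], "bad-bot-farm.example"))
    in
        NoBots

And for point 3, the typical swap is from a stored calculated column to a measure that is evaluated only at query time. A minimal DAX sketch, assuming a roll-up table named 'GA Sessions' with hypothetical Pageviews and Sessions columns:

    -- Instead of a stored calculated column such as:
    --   Pages per Session (column) = DIVIDE ( 'GA Sessions'[Pageviews], 'GA Sessions'[Sessions] )
    -- define a measure, which costs no storage and is computed only when a visual asks for it:
    Pages per Session :=
        DIVIDE ( SUM ( 'GA Sessions'[Pageviews] ), SUM ( 'GA Sessions'[Sessions] ) )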

 

Tip #3 – Keep it simple [and fast] with Calculated Tables

Another way to approach this is to use a calculated table based on your loaded data. You can limit the size (columns, etc.) much more effectively at this point. A calculated table is an independent table built from the original data in one or more tables. The major benefit is that you can SUMMARIZE the data here (i.e. get rid of all the useless detail you don’t care about for the query). Sure, it will use additional memory – but nothing good in life is free.
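A minimal DAX sketch of such a calculated table, assuming the appended roll-up table is named 'GA Sessions' with Date, Page, and Sessions columns (the same placeholder names used in the earlier sketches):

    -- Hypothetical calculated table: summarize down to just the grain the report needs.
    Page Summary =
    SUMMARIZECOLUMNS (
        'GA Sessions'[Date],
        'GA Sessions'[Page],
        "Total Sessions", SUM ( 'GA Sessions'[Sessions] )
    )

Because the summarized table drops the high-cardinality columns, visuals built against it scan far less data, while the detail table can remain in the model (or be hidden) for occasional drill-down.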


 

Final Notes

I think you will be very pleased with the outcome for your reporting. The important thing is to get all the cutting and prep work done before Power BI has to crunch the data. Failing to do so will lead to a reporting experience that gets slower and slower with each passing month.

If you have any questions – feel free to email me or post a comment. Best of luck and I hope this was useful.

Article by David

Your host and an experienced analyst dedicated to evangelizing the quantifiable, data-driven measurement of business with an emphasis on the customer relationship.
