RegEx Formulas for Cleaning Up Messy UTMs in Data Studio

Key takeaways:

Regular expression formulas in Google Data Studio (GDS) can save marketers from messy campaign reporting by combining splintered data and moving values between dimensions.

RegEx in GDS can help with reporting disparities that happen due to inconsistency in naming, which usually happens because of conflicting letter cases, dashes, or just by being unintentional about the naming.

The main formulas we recommend are those for forcing lowercase, replacing strings, removing strings, or setting a new value when a condition is met.

UTM.io includes a ‘Rules’ feature that uses RegEx to prevent users from creating bad UTMs in the first place. This enforces UTM conventions across large teams making it easy for anyone in the organization to build UTM parameters.

Why RegEx Formulas Are Important When Reporting on Links with UTMs
3 of the Ways You Can Use RegEx Formulas to Clean Up Data from UTMs
Over to you

Why RegEx Formulas Are Important When Reporting on Links with UTMs

Regular expressions, or RegEx, are used to search through a text string. They work by grouping values based on common patterns, then correcting or extracting information from those patterns.

When you use RegEx for data passed by UTMs, your reporting can be clean and reliable even if the UTMs are messy.

UTMs get messy. It’s common for marketers to build UTMs without following the best practices for tagging, for instance when you have a UTM tool to simplify the workflow and make it more consistent. Sometimes, UTMs get so messy that the reports become splintered, hard to read, or inaccurate.

Instead of tossing all your hard work and the reports you worked on, you can clean it all up with RegEx in Data Studio. We’ll show you how.

3 of the Ways You Can Use RegEx Formulas to Clean Up Data from UTMs

To start illustrating how to use RegEx for UTMs, let’s look at this raw list of UTMs used on our landing page:

The UTM tags in the table are inconsistent and will split data into multiple data sets. That’s because the naming convention and formats are lacking or vary — different letter cases, a mix of dashes and other symbols, different naming for social ads, different naming for the newsletter channel. As a result, it’ll be difficult to gauge how the landing page is performing.
Using the following RegEx for UTMs, this — and all other Data Studio reports — can be cleaned up to get more accurate data.

1. Group Splintered Values

In the table below, you can see an example of how messy the report got because of the UTMs above. For example, LinkedIn showed up twice as: ‘LinkedIn’ and ‘linkedin.’ Paid social popped up as four separate dimensions: ‘paid-social‘, ‘social-cpc’, ‘Sponsored20%content’, and ‘ppc’. “Klaviyo” was always in utm_source but the rest of the source parameter was written in a few different ways. As a result, the data was splintered into incoherent reporting shown in this table:

Reporting table with data splintered into 5 sources and 5 medium tags. Full data

Paid LinkedIn is in four different source / medium rows when it should only be in one. Klaviyo is in three when it should only be in one. This data needs to be cleaned up, or your analysis will not lead to actionable insight. This RegEx formula will help you do that:

Lower(CASE WHEN (REGEXP_CONTAINS( Lower(Source ), ‘linkedin|facebook|pinterest|instagram’) and REGEXP_CONTAINS(Lower( Medium), ‘ppc|sponsored|paid|cpc’)) THEN Concat(Source, ‘ / ‘,’paid-social’) WHEN (REGEXP_CONTAINS(Lower( Source) , ‘weekly_analytics|weekly-analytics’) ) THEN Concat(‘klaviyo’, ‘ / ‘,Medium) ELSE Concat(Source,’ / ‘, Medium) END)

Let’s break it down.

WHEN (REGEXP_CONTAINS( Lower(Source ), ‘linkedin|facebook|pinterest|instagram’) and REGEXP_CONTAINS(Lower( Medium), ‘ppc|sponsored|paid|cpc’)) THEN Concat(Source, ‘ / ‘,’paid-social’)

This statement takes various values and consolidates them into one — ‘paid-social’ in lowercase. The formula picks the source tag with values such as ‘linkedIn’, ‘facebook’, ‘pinterest’ and ‘instagram’ in source and medium tags such as ‘ppc’, ‘sponsored’, ‘paid’ and ‘cpc’, and groups them together as ‘paid-social’. Now you’ll truly know which traffic came from paid social.

WHEN (REGEXP_CONTAINS(Lower( Source) , ‘weekly_analytics|weekly-analytics’) ) THEN Concat(‘klaviyo’, ‘ / ‘,Medium)

This statement takes anything with ‘weekly_analytics’ or ‘weekly-analytics’ in the source and ensures that it gets attributed to lowercase ‘klaviyo’. We want “klaviyo” in source because it’s the platform the traffic is coming from. “weekly-analytics” can either be in source after klaviyo, or it can be in the campaign parameter instead. We choose to keep it cleaner and simpler and move “weekly-analytics” out of the source parameter. Having it there led to splintered data. The inconsistency made the information difficult to parse out. You’d have to manually do math to add it all up, and that’s the opposite of good reporting.

Get nuanced and actionable insights

Learn more about how to use UTMs and Google Data Studio to get clean data

Watch the GDS God Mode webinar

Also, we do want the words “weekly analytics” in the campaign parameter because then we could use a campaign with that name when we promote the newsletter on other channels, too. More on that below.
As a result, we get rewarded by a Source / Medium table with clean data. All the 256 sessions that came from paid LinkedIn are now in one row instead of four. All the 209 sessions that came from Klaviyo emails are now in one row instead of three:

Splintered data turned into a coherent set. Full data

2. Make Sure the Formatting Is Consistent, Move Values between Parameters

Consistency in UTM formatting is challenging without a premium UTM builder because of the volume of people building your UTM tags, and the people’s variety in experience. This formatting issue frequently shows in the campaign parameter:

Messy campaign table as a result of messy UTMs. Full data

The marketing team building the UTM tags for these campaigns didn’t have a consistent UTM naming convention. The resulting messy campaign rows in the report would be a big hurdle to attributing campaign ROI to the right places. So, you know it, RegEx formulas come to the rescue:

Lower (REGEXP_REPLACE(Campaign,’%20|_’,’-‘))

This formula looks for “%20” and underscores in the campaign name, and changes them to dashes. ‘Lower’ forces everything into lowercase, as UTMs are case-sensitive. So that’s one important change we made to follow UTM best practices.
The “retirement-investments” campaign will look clean now that everything will be lowercase. You’ll get just one row and a total of 96 sessions. If you didn’t clean this up and went with two different campaigns, you might incorrectly think that there are two campaigns, one with 63 and one with 33 sessions. You could also use RegEx to group “Retirement” together with “retirement-investments”, but we’re not 100% sure they belong together, so it’s OK to leave it alone.
In a similar way, “weekly-analytics” is now in fewer rows:
After making “%20” and underscores into dashes, we also moved the issue number around. It’s useful to have it in the campaign name instead of the source parameter. You might be promoting the newsletter issue 154 on a variety of channels and it’ll be great to use the issue number across those channels. Also, the source parameter should have much less variety than the campaign parameter, so having the issue number in source would prevent you from building optimal UTMs.
For example, in the raw list of links with UTMs at the top of this article, you could see a source tag “wEEkly-analytics-48-klaviyo”.
RegEx saves the day again. You can use a formula to identify a number in the source tag and move it to the end of the campaign tag instead:

CASE WHEN REGEXP_CONTAINS(Source, ‘\\d’) Then CONCAT(Campaign a,’-‘,REGEXP_EXTRACT(Source,’\\d+’)) else Campaign a End

Thanks to the two formulas that circumvented the messy UTMs, you get a real report that you can use to make real decisions:

Data in the right column has been cleaned up and explains itself better. Full data

3. Enforce consistent UTM naming

UTM consistency will determine the accuracy of your data. But this isn’t always easy, especially in large organizations.
UTM.io solves this dilemma using the ‘Rules’ feature. This feature makes it easy to fix any problem before it happens. It uses RegEx to ensure consistent UTM naming across teams.
With this UTM.io feature, people don’t have to expend too much energy building and naming UTM’s because everything they need to know about correct UTM naming is already built into the system. They’ll know what they’re allowed to do, which parameters they can use, and what names to use for each parameter.

Here’s how easy it is to create rules in UTM.io

Over to you
Here is a list of additional resources you can use to learn more about RegEx formulas and how you can use them to clean up your data and have more accurate reports.

RegEx UTM Stack Overflow Thread — how to extract a UTM from a link

Data Studio Google Support Thread — how to extract UTM tags for use as dimensions

Guide & Tool for Removing UTMs and Accurate Attribution — how to make sure your website visitors don’t copy & paste UTMs that are not valid for them anymore

Build UTMs in a premium builder that allows RegEx

Try RegEx in the Rules feature of UTM.io, become amazing at preventing tagging inconsistencices

Try UTM.io for free