5 Steps to Clean Data

How to clean your data as part of the data mining process

I am flush with data. I’ve got data from all over the galaxy concerning just about everything I could ever need to do my job. The problem was that I was suffering from a systemic issue and didn’t even know it. All it took was one update and an outdated lookup table, and my entire database was off, spawning impossible records that threatened to rip a hole in the space-time continuum (sorta). I had contacts fantastically coded with the incorrect country and continent. On top of this, my lookup table didn’t have a complete list of recognized countries. Though there are 195 countries (as of this writing), we had only 185 in the table, and the United States had fallen off the map entirely.

The database was suffering from a classic data mistake: put garbage in and you get garbage out (GIGO). This is exactly why Customer Data Platforms (CDPs) are on the rise. Companies that receive data from multiple sources and currently recode it to fit a Data Management Platform (DMP) are finding that this just doesn’t work anymore. There is simply too much data. In this day and age, not using a CDP creates the sort of quantum loophole in which a contact who lives in Canada suddenly gets classified as being from China (yeah, that happened).

A CDP creates a unified customer database that is well suited to companies that pull data from multiple sources. This software is designed to match customers across different sources to create an optimized “Golden Record.” DMPs fall short because they are designed to build audiences from the compiled data, not to create a master record for each customer. This means that in a DMP, customer data can be overwritten multiple times in a single day if it arrives from multiple sources. The discarded, “tossed” data is lost forever.
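To make the Golden Record idea concrete, here is a minimal sketch in Python. It is not how any particular CDP works; the field names, the three sources, the choice of email as the match key, and the source trust ranking are all assumptions made up for illustration. The point is only the contrast with overwriting: conflicting values are resolved by a rule, and every value that was seen is kept in a history rather than tossed.

```python
from collections import defaultdict

# Hypothetical records from three sources; field names are assumptions.
records = [
    {"source": "webform", "email": "Ann@Example.com", "country": "Canada", "phone": None},
    {"source": "events", "email": "ann@example.com", "country": None, "phone": "555-0100"},
    {"source": "vendor", "email": "ann@example.com ", "country": "China", "phone": None},
]

# Assumed trust ranking: the lower number wins when sources disagree.
SOURCE_PRIORITY = {"webform": 0, "events": 1, "vendor": 2}


def match_key(record):
    """Normalize the email so the same person matches across sources."""
    return record["email"].strip().lower()


def build_golden_records(records):
    """Group records by match key and merge each group into one master record."""
    grouped = defaultdict(list)
    for record in records:
        grouped[match_key(record)].append(record)

    golden = {}
    for key, group in grouped.items():
        # Most trusted source first, so its values are the ones that stick.
        group.sort(key=lambda r: SOURCE_PRIORITY[r["source"]])
        merged = {"email": key, "history": []}
        for record in group:
            for field in ("country", "phone"):
                value = record[field]
                if value is None:
                    continue
                if field not in merged:
                    merged[field] = value
                # Keep every observed value instead of discarding it.
                merged["history"].append((record["source"], field, value))
        golden[key] = merged
    return golden


golden = build_golden_records(records)
print(golden["ann@example.com"])
```

In this sketch the contact stays in Canada because the web form outranks the vendor file, while the vendor's "China" value remains in the history where someone can review it.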

After the harrowing experience of venturing into the unknown void of a time-sucking black hole to identify the problem, fix it and implement a new process for over 2 million records, I decided I needed some steps to make sure this otherworldly nightmare never recurred on my watch. These are the steps I developed:

Preventing Catastrophic GIGO Scenarios:

  1. Understand all your data sources (Quality control)
  2. Develop a cleansing, normalization and standardization process (Create and know the rules)
  3. Plan how data will be integrated into your DMP (Implement the rules)
  4. Regularly maintain and review your normalization rules (Quality control)
  5. Invest in a CDP (More resources to help get the job done)
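Steps 2 through 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the process I actually run: the tiny lookup table, the alias rules and the function names are all hypothetical. It shows the three habits that matter: normalize with explicit rules, reject what the rules can't place instead of guessing, and regularly check the lookup table itself for gaps.

```python
# Hypothetical lookup table mapping country to continent (deliberately tiny).
COUNTRY_LOOKUP = {
    "Canada": "North America",
    "China": "Asia",
    "United States": "North America",
}

# Assumed normalization rules: known aliases mapped to the canonical name.
COUNTRY_ALIASES = {
    "usa": "United States",
    "u.s.": "United States",
    "united states of america": "United States",
    "prc": "China",
}


def normalize_country(raw):
    """Return the canonical country name, or None if no rule applies."""
    if raw is None:
        return None
    cleaned = " ".join(raw.split()).lower()  # trim and collapse whitespace
    canonical_by_lower = {name.lower(): name for name in COUNTRY_LOOKUP}
    if cleaned in canonical_by_lower:
        return canonical_by_lower[cleaned]
    return COUNTRY_ALIASES.get(cleaned)


def clean_contacts(contacts):
    """Split contacts into cleaned records and records needing human review."""
    clean, rejected = [], []
    for contact in contacts:
        country = normalize_country(contact.get("country"))
        if country is None:
            # Never guess a country; park the record for review instead.
            rejected.append(contact)
            continue
        clean.append({**contact, "country": country,
                      "continent": COUNTRY_LOOKUP[country]})
    return clean, rejected


def missing_from_lookup(required):
    """Maintenance check: which required countries does the table lack?"""
    return sorted(set(required) - set(COUNTRY_LOOKUP))


contacts = [
    {"name": "A", "country": " canada "},
    {"name": "B", "country": "USA"},
    {"name": "C", "country": "Atlantis"},
]
clean, rejected = clean_contacts(contacts)
print(clean)
print(rejected)
print(missing_from_lookup(["Canada", "United States", "Mexico"]))
```

Running a check like `missing_from_lookup` on a schedule is the kind of review that would have caught a table with 185 countries and no United States before it touched 2 million records.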

Even though we live in a world of automation, big data tells us we can’t just set it and forget it.

Lindsey Griffith is a current student in the Helzberg School of Management’s Master of Science in Business Intelligence and Analytics program. Lindsey currently works as Database Marketing Manager at Aviation Week Network, an Informa PLC Company.

Lindsey Griffith, MS-BIA '20
