Checks, Checks, Checks

and more data quality checks…

In episode 13 of The Data Coach Podcast, I went over the importance of data quality and a few of the things you should be checking to validate that your data is of good quality.

In this post, I’ll be going over some systematic ways to validate data. Let’s start off with different ways to implement data quality checks.

To check before or check after?

This is an interesting dilemma that you run into: “Should I validate the data after it’s entered into the system, or should I stop it at the gate?”

This depends on whether you have control of the data in the first place. My rule of thumb is that if your source system can validate the data as it is being entered, then THAT is the best-quality data. Stopping data at the gate prevents garbage from entering the system in the first place. One caveat is that you may be leaving out some data that you really want. In that case, you would do the “data scrubbing” or cleanup after you have the data in your system.

Ask yourself these questions in order to decide which route to take:

  • Do I want to see the data that is not being entered correctly? If yes, then you want to do the quality checks after you have the data in your system.

  • What is the impact of my system receiving data that doesn’t pass a data quality check? If bad data impacts your business or process, then you want to do quality checks before that data is entered into your system.

Who does the checking?

This depends on your answer to the previous question. If your data needs to be stopped at the “gate”, then your application developers need to create validations at the point where the data is collected, before it is stored in your systems. This could mean things like validating forms or application event messages.
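
For instance, a gate check on a form or event payload might look like the rough sketch below. This is only an illustration, assuming a hypothetical order payload with email and quantity fields; the field names and rules are mine, not a prescribed standard.

```python
# A minimal gate-style validation sketch: reject bad records before they are stored.
# The payload fields ("email", "quantity") and rules here are hypothetical examples.
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_at_the_gate(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record can be stored."""
    errors = []
    if not isinstance(payload.get("email"), str) or not EMAIL_PATTERN.match(payload["email"]):
        errors.append("email is missing or malformed")
    if not isinstance(payload.get("quantity"), int) or payload["quantity"] < 0:
        errors.append("quantity must be a non-negative integer")
    return errors

record = {"email": "buyer@example.com", "quantity": 3}
problems = validate_at_the_gate(record)
if problems:
    print("Rejected at the gate:", problems)  # garbage never reaches the system
else:
    print("Accepted, safe to store.")
```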

If your data needs to be validated post-ingestion into your system, then a data engineering team would be responsible for cleaning or scrubbing that data using automated methods.
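
As a rough illustration, post-ingestion scrubbing might look something like the pandas sketch below. The orders data and its sku/quantity columns are assumptions for the example; the point is that rows failing the rules get quarantined rather than silently dropped.

```python
# A sketch of automated post-ingestion scrubbing with pandas.
# The "orders" data and its columns are hypothetical.
import pandas as pd

orders = pd.DataFrame({
    "sku": [" ABC-1 ", "abc-1", None, "XYZ-9"],
    "quantity": [2, 2, 5, -1],
})

# Normalize what can safely be cleaned in place.
orders["sku"] = orders["sku"].str.strip().str.upper()

# Quarantine rows that still fail basic rules so nothing is silently lost.
bad = orders["sku"].isna() | (orders["quantity"] < 0)
quarantine = orders[bad]
clean = orders[~bad]

print(f"{len(clean)} clean rows, {len(quarantine)} quarantined for review")
```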

Basic Checks

There are a few basic checks that I begin with when it comes to data quality (a short code sketch of these checks follows the list):

  • Uniqueness

    • First, define what a unique record means for the data set.

    • Then, remove all duplicates or store the duplicates in another table so you have an audit trail of potential issues with how you are capturing data.

  • Completeness

    • Establish the range of records that you expect for the data set. You can do this programmatically by storing daily row counts and comparing against the average over a given timeframe to determine whether you have the complete data set.

  • Consistency

    • Ensure that there are standards in your naming conventions so that an ID is not mistaken for a number or the two treated as interchangeable. If I call a data point a SKU, it should be called a SKU in all places and not confused with Product ID.

  • Accuracy

    • Compare the data that you have with data from either a source system or another related data set. For example, if you are loading data from a file, validate that the number of rows loaded into your system matches the line count in the file.

  • Timeliness

    • Run a simple check to determine when the data was last updated. If you expect the data to be refreshed every hour, then validate that it was updated within the last hour.
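
To make these building blocks concrete, here is a hedged sketch of how the checks above could be automated with pandas. The table, column names (sku, order_date, updated_at), and thresholds are assumptions for illustration, not a prescribed implementation.

```python
# A sketch of the five basic checks using pandas; names and thresholds are illustrative.
from datetime import timedelta
import pandas as pd

def run_basic_checks(df: pd.DataFrame, expected_daily_avg: float, source_line_count: int) -> dict:
    results = {}

    # Uniqueness: define the unique key, then set aside duplicates for auditing.
    duplicates = df[df.duplicated(subset=["sku", "order_date"], keep="first")]
    results["uniqueness"] = duplicates.empty
    results["duplicate_rows"] = len(duplicates)  # in practice, write `duplicates` to an audit table

    # Completeness: compare the row count against the recent average, with some tolerance.
    results["completeness"] = len(df) >= 0.9 * expected_daily_avg

    # Consistency: the field should exist under one agreed name (SKU, not Product ID).
    results["consistency"] = "sku" in df.columns and "product_id" not in df.columns

    # Accuracy: rows loaded should match the line count of the source file.
    results["accuracy"] = len(df) == source_line_count

    # Timeliness: the newest record should fall within the expected refresh window.
    last_update = pd.to_datetime(df["updated_at"], utc=True).max()
    results["timeliness"] = pd.Timestamp.now(tz="UTC") - last_update <= timedelta(hours=1)

    return results

# Example usage with a tiny in-memory sample (values are made up).
sample = pd.DataFrame({
    "sku": ["ABC-1", "ABC-1", "XYZ-9"],
    "order_date": ["2024-05-01", "2024-05-01", "2024-05-01"],
    "updated_at": ["2024-05-01T10:00:00Z", "2024-05-01T10:05:00Z", "2024-05-01T10:10:00Z"],
})
print(run_basic_checks(sample, expected_daily_avg=3, source_line_count=3))
```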

These are just building blocks. Data quality goes beyond these basic checks. Layering in your business domain and logic is critical to ensuring that your data is of value and use to your business. You could be validating whether certain values for inventory numbers are acceptable, or whether you should only allow data from certain IP addresses. The important takeaway is that if you are not doing data quality checks today, start small and work your way up to a robust data quality framework. After all, we don’t want to deal with garbage in/garbage out scenarios.
