When data goes bad

Data is not like food with an expiration date, but we've all encountered bad data. Whether it is an incorrect bill from the doctor or cable company or a bad report at work, we've all experienced it one way or another. Now, what happens in organizations when their data is in an unreliable state? Or when, in the worst case, you don't even know where the data is? This post is not about the issues people encounter with data discovery. What I'm going to focus on are the tactics you can start using to implement some level of data governance.

Now, every time I've been in an organization that has “bad data”, I proceed with caution. The reason is that it's easy to just say “oh, data is bad” as a blanket statement. When faced with that statement, I like to ask a follow-up question:

Is the data bad or is it just inaccessible?

Those are two different things. Is it that you don't know where the data lives? Or is it that there are multiple sources of that data? Take your sales data: maybe you don't know how many sales you have made this month. That could be because data is coming in from one system and from another system, and the two numbers are different. Therefore, you don't know which one to trust.
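To make the two-systems scenario concrete, here is a minimal sketch of a reconciliation check. The source names (`billing`, `crm`), the monthly totals, and the `tolerance` parameter are all made up for illustration; the idea is simply to list the months where two sources disagree or where one of them has no number at all.

```python
from decimal import Decimal


def reconcile(source_a, source_b, tolerance=Decimal("0.01")):
    """Return (month, a_value, b_value) for every month where the sources differ.

    A month missing from one source is reported with None for that source.
    """
    mismatches = []
    for month in sorted(source_a.keys() | source_b.keys()):
        a = source_a.get(month)
        b = source_b.get(month)
        if a is None or b is None or abs(a - b) > tolerance:
            mismatches.append((month, a, b))
    return mismatches


# Hypothetical monthly sales totals from two systems.
billing = {"2024-01": Decimal("1200.00"), "2024-02": Decimal("980.50")}
crm = {
    "2024-01": Decimal("1200.00"),
    "2024-02": Decimal("1010.50"),
    "2024-03": Decimal("450.00"),
}

mismatches = reconcile(billing, crm)
for month, a, b in mismatches:
    print(f"{month}: billing={a} crm={b}")
```

A report like this does not tell you which system is right, only where to start looking.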

There's another form of bad data that I can think of, which I'll classify as incorrect data. It's not bad because there are multiple sources of data. It's bad because the actual source of that data (for example, your production platform) is sending incorrect data.

After it has been established that the data is bad and this is not an issue of inaccessibility, I then ask:

How do you know it's bad?

Phase 1: Discovery

There are a few things to take into account when trying to figure out “bad data”. For starters, what is your baseline? How would you know that your data is bad? Is there some other data set telling you that these numbers shouldn't be what they are? Is it your financial data? Is it an invoice that you're receiving from someone else? Is it contracts that you have in place? You need to start off with a baseline. You need to understand what your true source of data is. Ask yourself what your system of record (your audit trail) is and how you are confirming that the data coming in through a system is correct. This discovery phase may include researching application logs, or metrics captured in real time by another application that stores this information in a way that can't be modified.

In essence, you need to understand the state of your data. Do you know what that is? Once you know it, you can determine if it's good or bad. You can't classify your information if you don't know what it is, how it gets there, and what happens after. What usually happens in organizations is that you hear through the grapevine, from multiple people across the organization, that data is bad. But is it really bad? If you hear people saying that data is bad, encourage them to provide examples. This is not because we don't trust them; it's so that we don't reinvent the wheel. If you know that the information being captured is incorrect, then share those details, because it helps in the discovery phase. A starting point is better than none. It points the data engineering team, the analytics team, or the software engineering team in the right direction.

This first step, which I call the discovery phase, may take some time, because you need to understand the areas where data is bad. There is no definitive time box that you can put on these cases because every case is different, every system is different, every organization is different, and sometimes priorities change. But if your priority is to clean the data, then push for that discovery phase.

Phase 2: Backlog and priority

After discovery, you may have a laundry list of issues and tasks that need to get done. Which one should you focus on first? What I have seen to be very successful is laying everything out. Ensure that your findings are documented and detailed. Include the following in these findings:

  • What is the impact of this data being wrong? For example, is it that your revenue metrics are incorrect?

  • What is the level of severity? You need it in order to understand the magnitude of the “bad data”.

Without knowing the severity or impact, we can spend months working on the wrong issue. By understanding the details behind the issues, you can make decisions about which tasks to go after first.
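One way to picture the documented findings is as a small backlog that can be sorted. The sketch below is only an illustration: the `DataIssue` fields, the 1 to 5 severity scale, and the tie-break on the number of affected teams are assumptions, not a standard. Your own scoring will depend on what impact means in your organization.

```python
from dataclasses import dataclass


@dataclass
class DataIssue:
    title: str
    impact: str          # what goes wrong because of this issue
    severity: int        # hypothetical scale: 1 (cosmetic) to 5 (critical)
    affected_teams: int  # hypothetical tie-breaker: how many teams rely on this data


def prioritize(issues):
    """Order issues by severity, then by how many teams are affected."""
    return sorted(issues, key=lambda i: (-i.severity, -i.affected_teams, i.title))


# Made-up findings from a discovery phase.
backlog = [
    DataIssue("Duplicate customer rows", "Inflated customer counts", 3, 2),
    DataIssue("Revenue totals differ by system", "Revenue metrics are incorrect", 5, 4),
    DataIssue("Inconsistent date formats", "Manual cleanup in reports", 2, 1),
    DataIssue("Missing order timestamps", "Cannot compute monthly sales", 5, 2),
]

ordered = prioritize(backlog)
for rank, issue in enumerate(ordered, start=1):
    print(f"{rank}. [{issue.severity}] {issue.title}: {issue.impact}")
```

Even a rough ordering like this makes it easier to explain to stakeholders why one issue is being worked on before another.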

Phase 3: Execute

After you have your list of issues and their priority, execute. Once you have prioritized, let your stakeholders know what your plan of attack is. This provides transparency and shows that you are listening to the issues. It sends a message: “Thank you for the information that you've given me. Thank you for providing all these insights. I am now able to pinpoint the actual issues. I am now able to call them out, and as you can see, this is what the plan looks like”. Make sure that you provide updates along the way.

Preventing bad data

This is a light form of data governance, because in reality data governance is one of those things that is easy when done early on. I understand that that's not always the case. Most of the time, you end up dealing with data issues after the fact (after the systems have been built). Data was not top of mind when people were building their systems or when the company was being built, and that's understandable. Every company has its own issues and priorities for what it needs to do to scale and grow. Most often, organizations deal with data governance issues after they've built out their data warehouses or data systems. That's why it's hard. It's probably the hardest thing to do, because when things are added after the fact, regardless of whether it's data or not, it's harder to justify and takes a whole lot longer to do. You now have to go back, audit what is going on in your system, and then say, okay, these are the things that we need to do.

The hardest part about implementing data governance is getting buy-in. If you are planning to focus on data governance, just know that it is going to take time. The person or team that focuses on that task will come in, do discovery work, document everything that you have with regards to data, and recommend new policies, workflows and processes. For some, this may be bothersome, but what you get out of it is better communication and understanding of where data lives, who owns it, and how it's used.

As I said earlier, data goes bad when organizations run so fast that data is not the first thing on their mind. This is not necessarily the fault of people not thinking through their ideas. They just want to go really fast and look at the numbers later. But if you want to have a data-driven organization, data needs to be at the core of the decisions that are being made. Whether these decisions are made at the CEO or CTO level, or by an individual engineering or business team, making decisions based on data must be top of mind.

This sounds very utopian, but the reality is that data goes bad, or is bad to begin with, when an organization stops thinking about data, when data is seen not as an asset for the organization but as a cost. When nobody cares about data, nobody wants to do anything about it. They just sit there and ask, why would we even capitalize on that?

So, how do you prevent data from going bad? How do you prevent it from going from bad to worse? Think of it this way: in your personal life, when something bad happens, what do you do? You learn from it! Every time you learn from your mistakes, you learn the root of the issue. The same thing applies to data going bad. Go through a root cause analysis and hold a retrospective. This will allow you to learn from the failures and understand what can be done so that the issue doesn't occur again. You may be wondering who really executes on this. If you have someone or a team in your company that's responsible for data integrity, maybe your data engineering team, then that's where these changes need to be placed. Systemically, the data needs to be validated to ensure it's reliable. If you don't have such a team, raise that as a concern in your organization.
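As a rough illustration of what validating data in the system can look like, here is a minimal sketch of a few checks run against incoming records. The record shape (`order_id`, `amount`) and the three rules are hypothetical; in practice the rules would come out of your root cause analysis, so that each past failure becomes a check.

```python
def validate_records(records):
    """Return a list of (row_index, problem) pairs for records that fail a check."""
    errors = []
    seen_ids = set()
    for index, record in enumerate(records):
        order_id = record.get("order_id")
        amount = record.get("amount")

        # Rule 1: every record needs an identifier, and identifiers must be unique.
        if order_id is None:
            errors.append((index, "missing order_id"))
        elif order_id in seen_ids:
            errors.append((index, f"duplicate order_id {order_id}"))
        else:
            seen_ids.add(order_id)

        # Rule 2: amount must be a number. Rule 3: amount must not be negative.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            errors.append((index, "amount is missing or not a number"))
        elif amount < 0:
            errors.append((index, "amount is negative"))
    return errors


# Made-up records, some of them deliberately broken.
records = [
    {"order_id": "A1", "amount": 25.0},
    {"order_id": "A1", "amount": 30.0},
    {"order_id": "A2", "amount": -5.0},
    {"order_id": None, "amount": "12"},
]

errors = validate_records(records)
for index, problem in errors:
    print(f"row {index}: {problem}")
```

Checks like these are most useful when they run automatically wherever the data enters your systems, so problems surface before they reach a report.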

Be proactive

Now, think about data in your organization. What would you label it as? Is the data at your company bad? If so, how do you know that? If you know that, then have you told the people that are responsible for that data that it needs to be fixed? If you don’t know who is responsible for it, have you spoken with your manager about it?

The reason I raise these questions is that so often what we hear from people are complaints. It's very simple and easy to say, “Oh, it's not me. You know, I'm looking at the data. But it's, you know, it's not my data”. It's so easy for us to say, I'm not going to do anything about it. But is that the right way? Is that the mindset we should have when we're striving to be a data-driven company? No!

What we should be doing is proactively raising these concerns, providing this information and saying:

“I'm seeing this discrepancy in this data. This is how I know what's wrong. This is where I'm sourcing it from. Can you validate? Can you confirm that my approach is the right one?”

The more people who raise these concerns and issues, the cleaner the data is going to be long term.

I like to think about bad data like a big pile of dirty laundry. You don't like it, you complain about it, everyone else complains about it, but unless you start going piece by piece, you're just going to be paralyzed or overwhelmed by the task at hand. So let's not freeze. Instead of looking at everything and saying “this is a big mess, I can't fix this, no one can ever fix these issues”, let's take a proactive approach and start breaking it apart into pieces. You know where you ultimately want to be as an organization with respect to data. If you don't, then go back to this post about why it's important to have data-driven cultures: link.

Now it is up to us, myself included, to be able to make these changes and drive change in organizations; to fix these issues and not just sit back and continue to spread the same complaints of “data is bad”. Take a proactive approach and be part of the solution. Be that person that helps your organization attack the small or large data issues.

Data governance is hard, but it's only hard if we choose to sit back and do nothing about it.
