Data Catalog Todd Goldman April 4, 2017

How data management is like cleaning up after my teenage daughters

Do you (or your organization) waste too much time searching for data and not enough time actually using it? Me too. In fact, by my own rough estimate, about 9 out of 10 of the Chief Data Officers (CDOs) I speak with complain about this very problem. But when you consider the way most companies manage their data, this shouldn’t really come as much of a shock.


In fact, the way companies manage data is quite similar to the way my teenage daughters deal with their clothing. Like many teenagers, they lead active lives and, as a result, often change clothes on the go and leave them scattered in various locations: a pair of shorts in the car on the way from school to track practice, jackets on the steps going upstairs when they get home, dirty clothes in the bathroom, and so on. So, in my brilliance as a father, I decided to take their distributed mess from the car, the stairs, the bathroom or wherever else they may have left it and consolidate it on their beds.


The idea was that if it was all collocated on their beds, they would put the dirty clothes in the laundry, fold the clean clothes and put them in their drawers. That way, when they needed a specific article of clothing (a shirt, a sweater, their underwear), they could just go to the appropriate drawer to get it. Makes sense, right?


WRONG! In reality, what happened was less idyllic: we went from having a distributed mess to having a collocated mess. Either way, it was still a mess. This is similar to organizations that take data distributed across their entire business or institution and collocate it all in either a data warehouse or a Hadoop data lake. Regardless of which they choose, all they succeed in doing is collocating their mess. This means that anytime there is a data-oriented project, the data scientists, business analysts and data analysts spend more time looking for the data they need and less time actually analyzing it. Because their data has no organization and there is no catalog they can easily search, they end up wading through the pile of data or asking someone else if they know where to find something. While this is somewhat manageable when you are just talking about a pile of clothes, it is not at all manageable when you are talking about thousands to tens of thousands (and sometimes hundreds of thousands) of data attributes.


How much time is actually wasted locating data in data projects? According to Boris Evelson of Forrester Research, “A good rule of thumb … is to assume that 80% of the effort is going to center around data integration activities… a similar 80% effort within data integration just to identify and profile data sources.” (Boost Your Business Insights By Converging Big Data And BI by Boris Evelson, March 25, 2015). In other words, 80% of 80%, or 64%, of the effort for a data-oriented project is spent just identifying and profiling the data you need to do the project.


For me, the collocation of clothing approach was a failure with my daughters. What ultimately worked was a polite conversation with them about their responsibility, as part of our household, to chip in and help out so we had a comfortable place to live. In my case, that actually worked! Then again, it could just be that my daughters matured and grew out of their habits on their own, and I was simply lucky in my timing. Alas, hoping that your company will grow out of it will not work.


What does work when it comes to data is putting in the effort to implement a data catalog: a common repository that holds a taxonomy of your data assets as well as links to where those assets are located, whether they are in a data lake, in a data warehouse or distributed across many data sources. The challenge is how you collect that catalog of information and keep it up to date. But that will be a topic for another blog post.
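
To make the idea of a catalog a little more concrete, here is a minimal sketch in Python of a common repository that stores a taxonomy for each data asset along with a link to where it lives. Everything in it is an assumption made for illustration: the CatalogEntry and DataCatalog names, the fields, and the example locations are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One data asset: what it is, how it is classified, and where it lives."""
    name: str
    description: str
    taxonomy: List[str]  # hypothetical classification path, e.g. ["Customer", "Contact"]
    location: str        # link to the physical asset (lake, warehouse, source system)


class DataCatalog:
    """A single searchable repository of catalog entries (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Later registrations with the same name overwrite earlier ones.
        self._entries[entry.name] = entry

    def search(self, term: str) -> List[CatalogEntry]:
        """Return entries whose name, description or taxonomy mentions the term."""
        term = term.lower()
        return [
            entry
            for entry in self._entries.values()
            if term in entry.name.lower()
            or term in entry.description.lower()
            or any(term in label.lower() for label in entry.taxonomy)
        ]


# Hypothetical assets living in different places, all findable from one catalog.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="customer_email",
    description="Primary email address for each customer",
    taxonomy=["Customer", "Contact"],
    location="hdfs://datalake/crm/customers.parquet#email",
))
catalog.register(CatalogEntry(
    name="order_total",
    description="Order value in USD",
    taxonomy=["Sales", "Orders"],
    location="warehouse.sales.orders.total_usd",
))

for hit in catalog.search("customer"):
    print(hit.name, "->", hit.location)
```

The point of the sketch is that the data itself never moves: the catalog only records what each asset is and where to find it, so an analyst searches one place instead of wading through the pile.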