Data Catalog Alex Gorelik August 23, 2017

Pulling Your Data Out of the Fog (Part 2 of 3)

In part one of this post, I discussed how most enterprise data is obscured. You can only see it when you are in close proximity or when it is so fresh that you remember exactly where it is. The problem: proximity and freshness work for a very small amount of data. Meanwhile, the variety, volume, and velocity of data coming in continue to grow. Organizations become overwhelmed trying to make sense of it all.

I talked about the competitive importance of being able to quickly discover, understand and utilize your data. Here, I will talk about how companies can lift the fog and keep it lifted so business users can more readily find critical data and convert it into actionable business intelligence on an ongoing basis.

The challenge: most organizations are relying on the streetlight method to find data. They’re searching for data only where they can see, where there is already a streetlight shining. That is not necessarily where the data is actually located. Instead, it’s often hidden in the shadows where there is no light. The result: they don’t know what data is even available. This is compounded by the data security catch-22: you can’t access data without justifying why you need it, but how can you know whether you need it if you don’t have access? This is where tribal knowledge traditionally comes in, but results are often spotty. People forget. People leave. People make mistakes.

Our solution to the tribal knowledge problem is to make it simple. Make automated tagging a part of the regular project workflow to kick-start the initial tagging of data. We establish credibility for the automation by curating automated results through SME review. Plus, SMEs can incorporate ratings and annotations, which, along with automatically discovered lineage, provide an understanding of data quality.
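To make the idea of automated tagging concrete, here is a minimal sketch of how an initial pass might work. It is purely illustrative and not our actual classifier: it uses two hypothetical regex rules (`email`, `us_phone`) to propose a tag for a column's values, scored by the fraction of values that match, so low-scoring proposals can be routed to SME review.

```python
import re

# Hypothetical rule set for illustration only; a real tagger would use
# many more signals (names, lineage, ML models) than simple patterns.
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "us_phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
}

def propose_tag(values):
    """Return (tag, confidence) for the best-matching pattern, or None.

    Confidence here is just the fraction of values matching the pattern.
    """
    best = None
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        confidence = hits / len(values) if values else 0.0
        if confidence > 0 and (best is None or confidence > best[1]):
            best = (tag, confidence)
    return best

# A column where two of three values look like email addresses would be
# proposed as "email" with confidence ~0.67 - a candidate for SME review.
column = ["a@example.com", "b@example.com", "not-an-email"]
print(propose_tag(column))
```

The point of the sketch is the shape of the output: every proposal carries a score, so the workflow can auto-accept high-confidence tags and queue the rest for human curation.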

SME review of the automated results provides a feedback loop that continually improves the accuracy of the automation through machine learning, but total accuracy is never assumed. Each automated tag is accompanied by a confidence percentage. The closer that figure is to 100, the greater the likelihood that data is what it’s supposed to be. But the human element is never removed from the equation. Data stewards or analysts can officially accept or reject a tag at any time. With this combination of data set properties (data quality and lineage), governance (curation and stewardship) and social validation (ratings, notes and reputation), we can establish trust when it comes to the classification of your data, which in turn supports tighter control over accessing and provisioning.
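The curation loop described above can be sketched as a small data structure. This is an assumption-laden illustration, not our product's API: a `Tag` carries the classifier's confidence, a steward's accept-or-reject decision updates its status, and each decision is retained as labeled feedback that a retraining job could consume.

```python
from dataclasses import dataclass, field

@dataclass
class Tag:
    # Hypothetical fields for illustration; real catalog tags would
    # carry more metadata (lineage, annotator, timestamps, etc.).
    column: str
    label: str
    confidence: float        # 0.0-1.0, from the automated classifier
    status: str = "proposed"  # proposed | accepted | rejected

@dataclass
class CurationQueue:
    tags: list = field(default_factory=list)
    # Each steward decision is kept as (label, confidence, verdict),
    # i.e. labeled training data for the feedback loop.
    feedback: list = field(default_factory=list)

    def review(self, tag, accept):
        """Record a steward's accept/reject decision on a proposed tag."""
        tag.status = "accepted" if accept else "rejected"
        self.feedback.append((tag.label, tag.confidence, tag.status))

# A steward confirms a high-confidence tag; the decision becomes feedback.
queue = CurationQueue()
t = Tag(column="cust_email", label="email", confidence=0.93)
queue.tags.append(t)
queue.review(t, accept=True)
```

The design choice worth noting is that rejection is not deletion: a rejected tag is just as valuable to the learner as an accepted one, which is why the human decision is never bypassed even at very high confidence.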

Next week: the last installment of this three-part post where I will take a deeper look at how our approach can be used to support data governance.