Data Catalog | Alex Gorelik | August 16, 2017

Pulling Your Data Out of the Fog

Uncovering the obscured data

Everyone’s heard of cloud computing. But for those of us from San Francisco, where we live much of our lives in low clouds, it’s fitting when we discover that much of our data is also lost in a fog.

Imagine you are a business analyst. For you, most of your organization’s data is obscured. You can only see it when you are in close proximity or when it is so fresh that you remember exactly where it is:

Proximity: Analysts can identify valuable data that’s within their immediate area of expertise, but the more removed that data is from their basic sphere of understanding, the less they understand the data and whether it’s of value.

Freshness: The more time your analysts spend studying the data for a specific project, the easier it is for them to see the data’s value. The fog lifts, but only for a while. Once the project is completed, the analyst’s memory of the data begins to recede. The data itself begins to change. The fog has returned.

The problem is that proximity and freshness work for a very small amount of data. Meanwhile, the variety, volume, and velocity of data coming in continues to grow. Organizations become overwhelmed trying to make sense of it all. As one of our customers recently put it, “We have 100 million fields of data. How can anyone find anything?” We agree.

Discover, understand, and use your data faster

As our CEO Alex Gorelik often says, “It’s like looking for a specific book you want at a flea market. You can look all you want, but it’s going to take a long time to find it.” You can invest all you want in faster data processing, faster analytics and faster response times. But if your organization can’t discover, understand and utilize your data fast enough—if you can’t quickly convert all that unstructured data into actionable business intelligence—you will have a tough time serving your customers, let alone competing with those who can efficiently capitalize on their data in today’s knowledge economy.

Companies need to lift the fog and keep it lifted so business users can more readily find critical data on an ongoing basis. The question is, “How?” And the answer is simple: The fog lifts when the sun shines in. And that’s exactly what an intelligent data catalog does. It shines a light and lifts the fog off your data.


Get out of the fog, and stay out

Here, I will talk about how companies can lift the fog and keep it lifted so business users can more readily find critical data and convert it into actionable business intelligence on an ongoing basis.

The challenge: most organizations rely on the streetlight method to find data. They search for data only where they can see, where a streetlight is already shining. That is not necessarily where the data is actually located. Instead, it’s often hidden in the shadows where there is no light. The result: they don’t know what data is even available. This is compounded by the data security catch-22:

You can’t access data without justifying why you need it, but how can you know whether you need it if you don’t have access?

This is where tribal knowledge traditionally comes in, but results are often spotty. People forget. People leave. People make mistakes.

Permanently retain your tribal knowledge

Our solution to the tribal knowledge problem is to make it simple. Make automated tagging a part of the regular project workflow to kick-start the initial tagging of data. We establish credibility for the automation by curating the automated results through subject matter expert (SME) review. Plus, SMEs can incorporate ratings and annotations, which, along with automatically discovered lineage, provide an understanding of data quality.
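To make that workflow a little more concrete, here is a minimal sketch in Python of what automated tag suggestions with confidence scores might look like. The names (FieldTag, suggest_tags) and the regex-based rules are purely illustrative stand-ins for machine-learning-driven tagging, not our actual API.

```python
# Minimal sketch of automated tag suggestions with confidence scores.
# FieldTag, suggest_tags, and the regex rules are illustrative stand-ins,
# not an actual product API.
from dataclasses import dataclass, field
import re

@dataclass
class FieldTag:
    field_name: str
    tag: str                    # e.g. "email", "ssn"
    confidence: float           # 0.0 - 1.0, from the classifier
    status: str = "suggested"   # suggested | accepted | rejected
    annotations: list = field(default_factory=list)  # SME notes, ratings

# Toy pattern-based rules standing in for the real machine learning model.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}

def suggest_tags(field_name, sample_values):
    """Scan sampled values and emit candidate tags with a rough confidence."""
    suggestions = []
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in sample_values if pattern.fullmatch(str(v)))
        if hits:
            suggestions.append(FieldTag(field_name, tag, hits / len(sample_values)))
    return suggestions
```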

SME review of the automated results provides a feedback loop that continually improves the accuracy of the automation through machine learning, but total accuracy is never assumed. Each automated tag is accompanied by a confidence percentage. The closer that figure is to 100, the greater the likelihood that the data is what it’s supposed to be. But the human element is never removed from the equation. Data stewards or analysts can officially accept or reject a tag at any time. With this combination of data set properties (data quality and lineage), governance (curation and stewardship), and social validation (ratings, notes, and reputation), we can establish trust in the classification of your data, which in turn supports tighter control over access and provisioning.
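Here is a sketch of that curation step, again with illustrative names only: the steward’s accept-or-reject decision is recorded as labeled feedback that a later model iteration could train on.

```python
# Sketch of the curation step: a steward accepts or rejects each suggested
# tag, and the decision is logged as labeled feedback for the next model
# iteration. Names are illustrative only.
def curate(suggestion, steward_decision, feedback_log):
    """steward_decision is 'accept' or 'reject'; the human has the final word."""
    assert steward_decision in ("accept", "reject")
    suggestion.status = "accepted" if steward_decision == "accept" else "rejected"
    feedback_log.append({             # becomes a training example later
        "field": suggestion.field_name,
        "tag": suggestion.tag,
        "model_confidence": suggestion.confidence,
        "label": suggestion.status,
    })
    return suggestion

# Example: an SME confirms that a sampled "contact" column really is email.
feedback = []
for s in suggest_tags("contact", ["a@example.com", "b@example.org"]):
    curate(s, "accept", feedback)
```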

Complete data governance is made possible

The challenge: Currently, there are tens of thousands of analysts, researchers, and data scientists all requiring access to data in order to do their jobs. But finding, provisioning and governing that data is not only difficult, it can be very expensive. Part of the reason is all the time users waste trying to navigate all those convoluted paths between the data seekers and the data sought. And after that, there’s never a guarantee that users are getting access to the data they need. The organization, after all, can’t blindly grant access to all users. It must protect sensitive business information and, in many cases, personally identifiable information. So how does the organization protect the data while granting access to those who need it?

Some employ a top-down approach. The admin finds the data and strips it of any personally identifiable information before providing access to any user requiring information in that particular data set.
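As a rough illustration of that top-down flow, the sketch below masks an assumed set of PII columns before the data is handed over. The column names and the hashing choice are examples, not a prescription.

```python
# Rough illustration of the top-down flow: mask assumed PII columns before
# anyone gets access. Column names and the hashing choice are examples only.
import hashlib

PII_COLUMNS = {"ssn", "email", "phone"}

def mask_value(value):
    """Replace a sensitive value with a short, irreversible token."""
    return hashlib.sha256(str(value).encode()).hexdigest()[:12]

def deidentify(rows):
    """Return a copy of the rows with values in PII columns masked."""
    return [
        {col: (mask_value(val) if col in PII_COLUMNS else val)
         for col, val in row.items()}
        for row in rows
    ]
```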

For others, there’s the agile/self-provisioning approach in which a catalog is built and populated only with the metadata. Then, the data can be “de-identified” and provisioned per each user’s request.
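The agile approach might look more like the following sketch: the catalog entry carries only metadata, and rows are fetched and de-identified (reusing the masking helper above) only once a request is approved. Dataset and field names are invented for illustration.

```python
# Sketch of the agile/self-provisioning approach: the catalog entry holds
# only metadata; rows are fetched and de-identified (via deidentify above)
# only after a request is approved. All names are invented.
catalog = {
    "crm.contacts": {
        "columns": ["name", "email", "region"],
        "tags": {"email": "pii"},
        "lineage": ["crm_export_2017_08"],
    }
}

def provision(dataset, requester, approved, load_rows):
    """Only approved requests ever touch the data; PII is masked on the way out."""
    if not approved:
        raise PermissionError(f"{requester} is not approved for {dataset}")
    rows = load_rows(dataset)   # the fetch happens only after approval
    return deidentify(rows)
```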

The problem with both approaches is this: it’s impractical to manually find, de-identify, and authorize access to data as it’s requested. You don’t want to wait six months to a year for someone to build a business glossary that maps back to the physical data. You want access to governed data now. Instead, so much data is trapped, sitting in quarantine because nobody has had the time to determine its level of sensitivity and access rights.

Here’s the catch: if you want to tap the value of all your data, it must be properly governed. And to be properly governed, you need to know what data you have. Only then can you know what data you need to secure. Otherwise, access to potentially valuable data is cut off until someone is able to look at it.

Connect the right data to the right people

Our approach quickly connects the right data to the right people by replacing the manual tagging of metadata with an automated process that rapidly classifies all your data assets, including new data as it’s created, while determining data lineage. When you know what info is in your data set and where it came from, you can immediately and automatically make it available to authorized users. There’s no “wait and see.”
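Pulling the pieces together, the sketch below shows the general shape of that flow: classify a newly landed data set (reusing the suggest_tags helper from earlier), record its lineage, and expose it only to users whose entitlements cover its sensitivity. The names and the entitlement rule are illustrative assumptions, not our product’s implementation.

```python
# Sketch of the end-to-end idea: classify a newly landed data set (reusing
# suggest_tags from earlier), record lineage, and expose it only to users
# whose entitlements cover its sensitivity. Illustrative, not an actual API.
def register_new_dataset(name, sample_rows, source, entitlements):
    tags = []
    for column in sample_rows[0]:
        values = [row[column] for row in sample_rows]
        tags.extend(suggest_tags(column, values))

    sensitive = any(t.tag in ("ssn", "email") for t in tags)
    sensitivity = "restricted" if sensitive else "general"

    catalog[name] = {
        "tags": {t.field_name: t.tag for t in tags},
        "lineage": [source],
        "sensitivity": sensitivity,
        # Users cleared for "restricted" see everything; others see general data.
        "visible_to": [user for user, clearance in entitlements.items()
                       if clearance == "restricted" or sensitivity == "general"],
    }
    return catalog[name]
```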

But that doesn’t mean we bypass the human element altogether. As I mentioned earlier, review of the automated results by subject matter experts provides a feedback loop that continually improves the accuracy of the automation through machine learning. The automated process is usually dead-on, but total accuracy is never assumed.

I hope you’ll continue to check back for more as we strive to provide all the information and technology you need to cut through the fog and put your data to work.