Data Catalog
Alex Gorelik
August 31, 2017

Pulling Your Data Out of the Fog (Part 3 of 3)

In my last post, I described how most organizations rely on the streetlight method to find data: they search only where they can see, not where the data might actually be. The result: they don’t even know what data is available. Tribal knowledge can help, but results are often spotty. People forget. People leave. People make mistakes.

But I also presented our solution to the problem: combine automated tagging with SME review to provide a feedback loop that continually improves the accuracy of the automation through machine learning. The result: the organization achieves greater understanding of, and trust in, its data.

Here I’d like to discuss how our approach can be used to support data governance.

Today, tens of thousands of analysts, researchers, and data scientists all require access to data in order to do their jobs. But finding, provisioning, and governing that data is not only difficult; it can be very expensive. Part of the reason is the time users waste navigating the convoluted paths between data seekers and the data they seek. And even then, there’s no guarantee users will get access to the data they need. The organization, after all, can’t blindly grant access to all users. It must protect sensitive business information and, in many cases, personally identifiable information. So how does an organization protect its data while granting access to those who need it?

Some organizations employ a top-down approach: an administrator finds the data and strips it of any personally identifiable information before granting access to any user who needs that particular data set. Others take an agile, self-provisioning approach: a catalog is built and populated only with metadata, and the data is then “de-identified” and provisioned per each user’s request.
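To make the de-identification step concrete, here’s a minimal sketch in Python. The column names, the PII list, and the hash-based masking rule are all hypothetical, purely for illustration; real de-identification policies are far more nuanced:

```python
import hashlib

# Hypothetical list of columns known to contain PII (assumption for illustration).
PII_COLUMNS = {"ssn", "email", "phone", "full_name"}

def de_identify(record: dict) -> dict:
    """Replace values in known PII columns with a truncated one-way hash."""
    cleaned = {}
    for column, value in record.items():
        if column in PII_COLUMNS:
            cleaned[column] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            cleaned[column] = value
    return cleaned

record = {"full_name": "Jane Doe", "email": "jane@example.com", "region": "West"}
print(de_identify(record))
# PII columns come back hashed; non-sensitive columns pass through untouched.
```

Notice the hidden cost in both approaches: someone must already know which columns hold PII before this step can run at all.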

The problem with both approaches is the same: it’s impractical to manually find, de-identify, and authorize access to data as it’s requested. You don’t want to wait six months to a year for someone to build a business glossary that maps back to the physical data. You want access to governed data now. Yet so much data sits trapped in quarantine because nobody has had the time to determine its sensitivity level and access rights.

Here’s the catch: if you want to tap the value of all your data, it must be properly governed. And to be properly governed, you need to know what data you have. Only then can you know what data you need to secure. Otherwise, access to potentially valuable data is cut off until someone has the time to look at it.

Our approach quickly connects the right data to the right people by replacing manual metadata tagging with an automated process that rapidly classifies all your data assets, including new data as it’s created, while also determining data lineage. When you know what information a data set contains and where it came from, you can immediately and automatically make it available to authorized users. There’s no “wait and see.”
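As a rough sketch of the idea, not our actual implementation, here’s what pattern-based tagging and tag-driven access might look like. The tag rules, the 80% match threshold, and the “pii_approved” role are all assumptions for illustration:

```python
import re

# Hypothetical pattern-based tagger: each rule maps a regex over sampled
# column values to a sensitivity tag. A real classifier would be far richer.
TAG_RULES = [
    ("SSN",   re.compile(r"^\d{3}-\d{2}-\d{4}$")),
    ("EMAIL", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
]

def tag_column(sample_values):
    """Return the first tag whose pattern matches most of the sampled values."""
    for tag, pattern in TAG_RULES:
        matches = sum(1 for v in sample_values if pattern.match(v))
        if matches / max(len(sample_values), 1) > 0.8:
            return tag
    return "UNCLASSIFIED"

def can_access(user_roles, column_tag):
    """Grant access automatically from tags, with no manual review in the loop."""
    sensitive = {"SSN", "EMAIL"}
    return column_tag not in sensitive or "pii_approved" in user_roles

print(tag_column(["123-45-6789", "987-65-4321"]))      # SSN
print(can_access({"analyst"}, "SSN"))                  # False
print(can_access({"analyst", "pii_approved"}, "SSN"))  # True
```

The point of the sketch is the sequencing: classification happens up front and automatically, so the access decision can be made the moment someone asks, rather than after a months-long manual review.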

But that doesn’t mean we bypass the human element altogether. As I mentioned in my last post, review of the automated results by subject matter experts provides a feedback loop that continually improves the accuracy of the automation through machine learning. The automated process is usually dead on, but its accuracy is never assumed.
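To picture the feedback loop, here’s a toy sketch in which each SME confirmation or correction nudges a per-tag confidence score. The simple update rule is a stand-in for real machine learning, and the tag names are hypothetical:

```python
from collections import defaultdict

# Toy SME feedback loop: every confirmed or corrected tag nudges the
# confidence the tagger assigns to that rule in the future.
confidence = defaultdict(lambda: 0.5)  # prior confidence per tag rule

def record_feedback(tag, sme_agreed: bool, rate: float = 0.1):
    """Move a rule's confidence toward 1.0 on confirmation, 0.0 on correction."""
    target = 1.0 if sme_agreed else 0.0
    confidence[tag] += rate * (target - confidence[tag])

# Simulated reviews: SMEs confirm most SSN tags, reject one EMAIL tag.
for agreed in [True, True, True, False, True]:
    record_feedback("SSN", agreed)
record_feedback("EMAIL", False)

print(round(confidence["SSN"], 3))    # rises above the 0.5 prior
print(round(confidence["EMAIL"], 3))  # falls below the 0.5 prior
```

Over time, rules the experts keep confirming are trusted more, and rules they keep correcting are trusted less, which is the essence of the feedback loop described above.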

I hope you’ll continue to check back for more as we strive to provide all the information and technology you need to cut through the fog and put your data to work.