Data Catalog

Top 4 reasons you need a data catalog

Posted on March 16th, 2017 | Todd Goldman

I’ve been reading a lot lately about what different industry analysts think of data catalogs (or information catalogs) and why you should have one. The data governance market is expected to double in size over the next five years, with data governance solutions such as data catalogs holding the largest market share, so from a market standpoint, the attention is no surprise. But most of the arguments I hear from analysts are built around a semi-technical argument that can be summed up as “good metadata is good, bad metadata is bad,” or what I like to think of as the “four legs good, two legs bad” argument.

Given that I spend my days steeped in data cataloging, I figure it is about time I share with you the top 4 reasons you need a data catalog.

1. Spend less time searching for data and more time using it to gain insight. According to Boris Evelson of Forrester Research, “A good rule of thumb … is to assume that 80% of the effort is going to center around data integration activities… and a similar 80% effort within data integration just to identify and profile data sources.” (Boost Your Business Insights By Converging Big Data And BI by Boris Evelson, March 25, 2015). In other words, 80% of 80%, or 64%, of the effort for a data-oriented project is spent just identifying which data you actually need to execute the project. This is because most organizations lack any real mechanism for organizing and tracking all of their data.


Over the years, they have lost track of where data is located, which results in a growing pile of data and makes it increasingly difficult to find the good stuff. This also means that business analysts spend an excessive amount of time (roughly 64% of project effort, by the rule of thumb above) simply hunting for the data they need to do their jobs. If they had a well-organized data catalog, that time could be spent actually doing advanced analytics, as opposed to collecting the components necessary to do so.
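The 64% figure is simply the two 80% shares from the rule of thumb multiplied together. Here is a minimal sketch of that arithmetic; the variable names are mine, not Forrester's:

```python
from fractions import Fraction

# Shares taken from the rule of thumb quoted above.
integration_share = Fraction(80, 100)  # share of project effort spent on data integration
discovery_share = Fraction(80, 100)    # share of integration effort spent identifying and profiling sources

# Share of the whole project spent just finding and profiling data.
finding_share = integration_share * discovery_share

print(f"{float(finding_share):.0%}")  # prints 64%
```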


2. Operationalize access control to allow for safe access to data through governance. Not only do organizations have a tough time finding data, but even when they do gather the desired pieces of data, they have a hard time properly controlling who has access to it. The result is that either (a) everyone has access, which is a problem in regulated environments, or (b) only a few highly trusted employees have access. A data catalog can help with this problem by automatically tagging data as sensitive and then passing those tags on to security infrastructure, such as Apache Sentry, Apache Ranger or other tools that enforce access control across a wider environment. The result is that rather than data sitting in quarantine (waiting for someone to review it before making it available to others), the validation process can be automated, or at least partially automated, speeding data through the process and making it available for use almost immediately.
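To make the tagging idea concrete, here is a rough sketch of how a catalog might flag sensitive columns from a sample of rows and turn those tags into a policy record. Everything in it is an assumption for illustration: the two regex rules, the 80% match threshold, and the `tag_columns` and `to_policy` helpers are hypothetical, and the policy dictionary is a generic stand-in rather than the actual format Apache Ranger or Apache Sentry expects.

```python
import re

# Hypothetical detection rules: tag name -> pattern a value must fully match.
SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def tag_columns(sample_rows, threshold=0.8):
    """Return {column: [tags]} for columns whose sampled values mostly match a rule."""
    tags = {}
    if not sample_rows:
        return tags
    for column in sample_rows[0]:
        values = [str(row[column]) for row in sample_rows if row.get(column) is not None]
        if not values:
            continue
        for tag, pattern in SENSITIVE_PATTERNS.items():
            hits = sum(1 for v in values if pattern.fullmatch(v))
            if hits / len(values) >= threshold:
                tags.setdefault(column, []).append(tag)
    return tags


def to_policy(dataset, tags, allowed_groups):
    """Build a generic, tool-neutral policy record (not a real Ranger or Sentry payload)."""
    return {
        "dataset": dataset,
        "restricted_columns": sorted(tags),
        "allowed_groups": list(allowed_groups),
    }


rows = [
    {"name": "Ann", "contact": "ann@example.com", "tax_id": "123-45-6789"},
    {"name": "Bob", "contact": "bob@example.com", "tax_id": "987-65-4321"},
]
tags = tag_columns(rows)
policy = to_policy("customers", tags, ["compliance"])
print(tags)
print(policy)
```

In this toy example the `contact` and `tax_id` columns get tagged and `name` does not. A real catalog would use far richer detection than two regular expressions, but the shape of the flow is the same: profile the data, attach tags, and hand the tags to whatever enforces access.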


3. Reduce the cost of data redundancy and hoarding. Simply put, most organizations are data hoarders. They keep three to five times more data around than they need. By properly cataloging their data, they can find the redundancies, eliminate them and save on storage, database licenses and the management of all that excess data. When all is said and done, it adds up to a lot of money.
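One simple way a catalog can surface redundant copies is to fingerprint each dataset's contents and group datasets that share a fingerprint. The sketch below assumes small in-memory datasets and hypothetical dataset names; it only catches exact copies (ignoring row order), not near-duplicates or subsets.

```python
import hashlib
from collections import defaultdict


def fingerprint(rows):
    """Hash a dataset's rows so that row order does not affect the result."""
    digest = hashlib.sha256()
    for line in sorted(repr(tuple(row)) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def find_redundant(datasets):
    """Group dataset names by fingerprint; return only groups with more than one member."""
    groups = defaultdict(list)
    for name, rows in datasets.items():
        groups[fingerprint(rows)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical catalog contents: two copies of the same orders data, plus one unrelated table.
datasets = {
    "warehouse.orders": [(1, "widget"), (2, "gadget")],
    "sandbox.orders_copy": [(2, "gadget"), (1, "widget")],
    "warehouse.customers": [(1, "Ann")],
}
duplicates = find_redundant(datasets)
print(duplicates)
```

Here the sandbox copy is reported alongside the warehouse original, which is exactly the kind of pair you would review before deleting one of them.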


4. Stay out of jail. With all of the data-related laws and regulations (and acronyms), like HIPAA, Basel, and GDPR, there is a big need to understand what data you have and where it comes from so you can properly manage the data lifecycle and control access to that data (per reason number 2). These laws are driven by different perspectives and use cases, but if we zoom out and take the macro view, they all come down to better governance of your data, with a focus on data lineage and access control. To have any chance of complying with the mandates, you need a catalog of what data you have, where it is located and the provenance of that data.


The bottom line is that there is now a much clearer linkage between the technical value of better metadata management and the business value. For those of you trying to explain to your business counterparts why metadata management is important, think about the four reasons supplied above, because at least one of them is likely to have implications for your organization.