Why Is a Data Catalog the Next Critical Big Data Platform Component?

Most industry analysts talk about data catalogs as if they were just a simple database of metadata with a search user interface on top. That UI is mainly associated with self-service analytics, helping data analysts and data scientists find the data they need to do their jobs. And while every good data catalog presents a user-friendly search UI that lets you filter using built-in or custom facets (see Figure 1), that only scratches the surface of how data catalogs can be used.

Figure 1: Search results with faceted filtering in a data catalog dashboard, drawing on crowd-sourced metadata

Data Catalogs as a Platform

An even more interesting use case is integrating a data catalog into an automated data flow that is part of a standard process your organization runs. For example, imagine you acquire a new data set from an external data supplier and want to put that data to use quickly. The challenge is that while you know you already have similar data in use, the terminology your organization uses to describe any given attribute of that data is likely to be different from the terminology your supplier uses.

In the old world, a data steward would be assigned to open the data file, look at each column, execute some SQL queries to get basic statistics on each column, and then manually map the columns to the business terms the organization typically uses. In my experience, this takes an hour per column on average.
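To make the manual step concrete, here is a minimal sketch of the kind of per-column profiling a steward would run by hand. It uses Python's built-in sqlite3 with a hypothetical customers table standing in for the supplier's file; the table, column, and values are illustrative assumptions, not part of any real catalog.

```python
import sqlite3
from collections import Counter

# An in-memory table standing in for the newly acquired supplier file
# (table name, column name, and values are all hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(34,), (27,), (34,), (51,)])

# The basic statistics a steward would gather for one column
col_min, col_max, col_avg = conn.execute(
    "SELECT MIN(age), MAX(age), AVG(age) FROM customers"
).fetchone()

# Value frequencies, another typical profiling statistic
freqs = Counter(v for (v,) in conn.execute("SELECT age FROM customers"))

print(col_min, col_max, col_avg, freqs.most_common(1))
```

Multiply this by every column in every new file, plus the judgment call of mapping each column to a business term, and the hour-per-column estimate is easy to believe.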

To speed this up, the data catalog can be used as part of an automated process that would look something like this:

  • Data is placed in a predefined location.
  • A script, running hourly or daily, checks whether new data has arrived in that location. If it has, the script automatically invokes the data catalog to profile and fingerprint the data.
  • The data catalog inspects the data in all the columns, generates profiling statistics (min, max, average, value frequencies, etc.) then automatically categorizes and tags the data.
  • Attributes where the automated tagging reaches a confidence score of 80% or higher are passed into the data lake automatically. Attributes where the tagging algorithm scores below 80% are reviewed by a data steward before being passed into the data lake.
  • Data that passes automatically through the “quarantine” process above is made available in the data lake. If the attribute was considered to be sensitive data, then the tag is noted in the security system, which masks or de-identifies the sensitive data automatically.
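The steps above can be sketched in a few lines of Python. The classifier, tag names, and confidence scores below are stand-ins I have invented for illustration; a real data catalog would supply its own profiling and fingerprinting API.

```python
# Minimal sketch of the automated on-boarding flow described above.
# SENSITIVE_TAGS, classify_column, and all data values are hypothetical.

SENSITIVE_TAGS = {"email", "ssn"}
CONFIDENCE_THRESHOLD = 0.80  # the 80% cutoff from the process above

def classify_column(name, values):
    """Toy stand-in for the catalog's tagging algorithm: returns a
    (tag, confidence) pair based on simple column-name matching."""
    known = {"email_addr": ("email", 0.95), "cust_age": ("age", 0.90)}
    return known.get(name, ("unknown", 0.40))

def onboard(dataset):
    """Route each column to the data lake or to steward review,
    masking columns tagged as sensitive."""
    lake, quarantine = {}, {}
    for name, values in dataset.items():
        tag, confidence = classify_column(name, values)
        if confidence < CONFIDENCE_THRESHOLD:
            quarantine[name] = values           # held for steward review
        elif tag in SENSITIVE_TAGS:
            lake[name] = ["***"] * len(values)  # mask sensitive data
        else:
            lake[name] = values                 # straight into the lake
    return lake, quarantine

lake, quarantine = onboard({
    "cust_age": [34, 27],
    "email_addr": ["a@x.com", "b@y.com"],
    "col_17": ["?", "?"],  # unrecognizable column -> quarantined
})
```

The design point is the threshold: automation handles the high-confidence majority of columns, and the steward's hour-per-column effort is spent only on the ambiguous remainder.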

Automating this data on-boarding process can save weeks to months in getting new data into an organization and put to use.

Why Is This Automation and Rapid Time to Use for New Data Important?

It is important because acting on new data more quickly than the competition can be the difference between being a market winner and a market loser. Look at companies like Amazon that have built their business around being faster to market with new products and services. They are able to do this because they were faster to put their data, and the analytics based on it, to work.

So as you build out your next-generation big data architecture, think about how integrating your data catalog and data cataloging processes into your data workflows in an automated manner can make your organization's use of data more agile. It may make the difference between being an also-ran and being number one.