Top 10 Capabilities to Look for in a Data Catalog (Part 2)

Posted on July 20th, 2017 | Todd Goldman

As in our Part 1 post, we interviewed our customers, who rely on data catalogs to execute important business processes and deliver offensive and defensive data intelligence, to find out which capabilities matter most when implementing a data catalog across an enterprise data fabric. Here’s the rest of our list:

 

  1. Open APIs for integration with a wide variety of tools

Data catalogs have to integrate with a wide variety of applications, ranging from existing business glossaries to data wrangling and business intelligence tools. In addition, the APIs need to support integrating the catalog with your own applications. Many of our customers call the data catalog’s profiling and tagging capabilities from within an automated data pipeline, and APIs are critical for making this possible.
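To make the pipeline scenario concrete, here is a minimal sketch in Python of a pipeline step that profiles a newly landed data set and tags it through a catalog API. Everything named here is an assumption for illustration: `CatalogClient`, `profile`, `add_tags`, `InMemoryCatalog` and the `crm_contacts` data set are hypothetical, and a real catalog's API will look different.

```python
from __future__ import annotations

from typing import Protocol


class CatalogClient(Protocol):
    """Hypothetical catalog API; a real vendor API will differ."""

    def profile(self, dataset: str) -> dict[str, str]: ...

    def add_tags(self, dataset: str, tags: list[str]) -> None: ...


class InMemoryCatalog:
    """Stand-in implementation so the sketch runs without a real catalog."""

    def __init__(self, profiles: dict[str, dict[str, str]]) -> None:
        self._profiles = profiles
        self.tags: dict[str, list[str]] = {}

    def profile(self, dataset: str) -> dict[str, str]:
        # Maps each column name to the semantic type detected for it.
        return self._profiles.get(dataset, {})

    def add_tags(self, dataset: str, tags: list[str]) -> None:
        self.tags.setdefault(dataset, []).extend(tags)


def catalog_step(catalog: CatalogClient, dataset: str) -> list[str]:
    """Pipeline step: profile a data set, then tag it with what was found."""
    detected = catalog.profile(dataset)
    tags = sorted(set(detected.values()))
    catalog.add_tags(dataset, tags)
    return tags


catalog = InMemoryCatalog(
    {"crm_contacts": {"mail": "email", "ph": "phone", "alt_mail": "email"}}
)
print(catalog_step(catalog, "crm_contacts"))  # ['email', 'phone']
```

Because the step depends only on the small `CatalogClient` interface, the same pipeline code could be pointed at a real catalog by writing one adapter class.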

 

  1. Scalable search

As more users get access to the catalog, and the catalog grows with more tags and metadata, catalog search has to scale as well. That is why it is crucial to choose a catalog built on a search engine, such as Apache Solr, that has been proven to scale.
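As a rough illustration of what a Solr-backed catalog search looks like, the Python sketch below builds a paginated request URL for Solr's standard `select` handler. The parameters `q`, `fq`, `start`, `rows` and `wt` are standard Solr query parameters; the host, the `catalog` collection name and the `tags` field are assumptions made for this example.

```python
from typing import Optional
from urllib.parse import urlencode


def solr_select_url(
    base_url: str,
    collection: str,
    text: str,
    tag: Optional[str] = None,
    page: int = 0,
    page_size: int = 20,
) -> str:
    """Build a paginated Solr select URL for a catalog search."""
    params = [
        ("q", text),                  # main query text
        ("start", page * page_size),  # offset of the first result on this page
        ("rows", page_size),          # number of results per page
        ("wt", "json"),               # ask for a JSON response
    ]
    if tag is not None:
        # Filter query; the "tags" field name is hypothetical.
        params.append(("fq", f'tags:"{tag}"'))
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


url = solr_select_url(
    "http://localhost:8983", "catalog", "customer", tag="pii", page=2
)
print(url)
```

Paging with `start` and `rows` keeps each response small no matter how large the catalog grows, and filter queries let a search be narrowed by tag.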

 

  1. Data catalog as a platform

A data catalog isn’t just another piece of middleware. It is the underlying platform for a wave of new metadata-based applications. New kinds of data governance, data rationalization and consent management applications will be built on top of the data catalog in the near future. If your data catalog vendor doesn’t have a vision for these kinds of apps, then you are talking to a follower, not a leader in the space.

 

  1. Data lineage

Being able to see at a glance where a data set came from and how it is used to generate other data sets, combined with the ability to quickly review those data sets, is key to understanding a data set and trusting it to do the job. Unfortunately, the entire lineage of a piece of data cannot be imported from just the tools that generate new data sets. While some lineage can be imported from ETL tools or from Hadoop metadata systems like Apache Atlas and Cloudera Navigator, there are always gaps in lineage chains that need to be filled. The data catalog should help fill these gaps by automatically discovering and suggesting missing lineage between data sets. With the gaps filled, the catalog can support search by the sources a data set was derived from (e.g., finding all customer data sets that were derived from salesforce.com, regardless of how long those lineage chains are or how many data sources they span).
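The search by source described above amounts to a graph traversal over lineage edges. The Python sketch below assumes lineage is available as simple (parent, child) pairs and uses a breadth-first search to collect everything downstream of a source. The data set names other than salesforce.com are made up for the example, and this is one plausible approach rather than how any particular catalog implements it.

```python
from collections import defaultdict, deque


def derived_from(edges, source):
    """Return every data set reachable downstream of `source`.

    `edges` is an iterable of (parent, child) lineage pairs.
    """
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)

    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            # Skip already visited nodes so cycles cannot loop forever.
            if child not in seen and child != source:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: two sources feeding a combined customer view.
lineage = [
    ("salesforce.com", "crm_raw"),
    ("crm_raw", "customers_clean"),
    ("customers_clean", "customer_360"),
    ("erp", "customer_360"),
    ("erp", "invoices"),
]
print(sorted(derived_from(lineage, "salesforce.com")))
# ['crm_raw', 'customer_360', 'customers_clean']
```

A single missing edge, such as `crm_raw` to `customers_clean`, would hide everything further down the chain from this search, which is why filling lineage gaps matters.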

 

As noted in the opening paragraph, data catalogs are a relatively new space and, as time progresses, we intend to update this list to keep it current. As of July 2017, we are confident that the information above represents state-of-the-art data catalog capabilities that any data-driven organization should require.