Data Catalog Overview
What is a data catalog?
The easiest way to understand a data catalog’s purpose is to look at Amazon.com. Amazon carries millions of different products and yet, as consumers, we can find almost anything very quickly. Beyond its advanced search capabilities, Amazon also provides detailed information on each product: reviews, a list of companion products purchased most often, instructional videos, the seller’s information, shipping times, and more.
Data catalogs work in the same way for your big data lakes, data warehouses, databases, and more. The purpose of a data catalog is to organize an organization’s thousands or millions of data sets so that users can search for specific data and understand its metadata, such as data lineage, uses, and how others perceive the data’s value (or lack thereof).
The goal of data catalogs is to move toward self-service, just like Amazon. In the past, technology limitations restricted access to certain information to specialists. For example, 20 years ago, if you needed a part for a washing machine, you had to call your local repairman, who would look up the part in a restricted catalog. Today, anyone can look up parts for a washing machine (even if it’s 20 years old).
Make Data More Accessible to More Users
Today, the pooling of company information into data lakes is creating enormous quantities of data. Meanwhile, more workers need access to specific data sets within their data lakes in order to make data-backed business decisions.
To make this data more accessible, there is a real need to make the data easily searchable, alongside its lineage, uses, and value to the organization. One of our customers recently said that before data catalogs, 80% of their employees’ time was spent finding data, 15% fixing the data, and 5% analyzing it. Their data catalog flipped these numbers, so that employees spent 80% of the time analyzing the data and only 5% finding it. Data catalogs are the core for data analysis.
Accelerate Time to Value and Reduce Operational Costs
The quantity of data can present a formidable challenge. Each data set needs to be properly discovered, tagged, and annotated with a description of its lineage, ratings from other users, and where it’s used. Here’s where technology helps. Waterline Data automates the cataloging of data sets, saving an immense amount of time not only in building the initial catalog but also in maintaining it as new data sets are created. A data steward or data curator can then review a data set’s schema and properties rather than discovering and tagging the data from scratch. Automated curation of this kind can accelerate the data’s time to value for business analytics and significantly reduce operational costs.
Improve Regulatory Compliance
Data catalogs give business teams a competitive edge in making data-based decisions, and they support a variety of other use cases. One example is improving regulatory compliance. New government regulations, such as GDPR and a variety of privacy acts, now require organizations to know where all of their prospects’, customers’, and employees’ data resides. Data catalogs can help provide that information and so improve regulatory compliance.
If, for example, each department within an organization were to manage personal data on its own, the result would be not only inefficient but also costly and prone to errors whenever information needs to be deleted or modified in multiple locations. Data catalogs also enable data rationalization, improving overall data accuracy for an organization.
Data catalogs are the core for any data management strategy.
Top 10 Capabilities to Look for in a Data Catalog
1. Automated population of the catalog
The hardest problem in making a data catalog valuable is getting the catalog populated with information. The specific information people look for is the set of tags that connect business terms to the actual attributes scattered around the organization.
But for the majority of businesses, there is simply too much data in their environment to be able to realistically tag the actual attributes with business terms by hand. Even crowdsourcing is not enough. Plus, manual tagging will miss the dark data that hasn’t been touched recently.
Simply put, data tagging needs to be done by AI and machine learning as data is profiled and objective metadata about the quality of the data is added to the catalog.
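To make the idea of profiling plus automated tagging concrete, here is a minimal sketch. It is not Waterline's algorithm: it swaps in hand-written regular expressions where a real catalog would use learned signatures, and the tag names, the `PATTERNS` table, and the `min_confidence` threshold are all assumptions made for illustration.

```python
import re

# Hypothetical business terms mapped to value patterns. A production
# catalog would learn these signatures instead of hard-coding them.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def profile_column(values):
    """Return objective quality metadata for one column of string values."""
    total = len(values)
    non_empty = [v for v in values if v not in (None, "")]
    return {
        "row_count": total,
        "null_fraction": (total - len(non_empty)) / total if total else 0.0,
        "distinct_count": len(set(non_empty)),
    }


def suggest_tags(values, min_confidence=0.8):
    """Suggest (tag, confidence) pairs for a column.

    Confidence is the share of non-empty values that fully match the
    tag's pattern; tags below min_confidence are dropped.
    """
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return []
    suggestions = []
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        confidence = hits / len(non_empty)
        if confidence >= min_confidence:
            suggestions.append((tag, confidence))
    return sorted(suggestions, key=lambda s: s[1], reverse=True)


column = ["ann@example.com", "bob@example.org", "", "carol@example.net"]
print(profile_column(column))
print(suggest_tags(column))
```

The profile (row count, null fraction, distinct count) is the objective metadata; the suggested tags are what a data steward would later accept or reject.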
2. Crowdsourced curation of tags with machine learning feedback
Computerized tagging by itself is not enough. Even with the world’s best tagging algorithms, human review is still necessary to catch machine errors.
Additionally, when a data steward accepts or rejects a tag, the machine learning should incorporate that advice to improve future automated tagging of data.
This feedback loop is critical to improving the accuracy of the automation.
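One simple way to picture this feedback loop is sketched below. It stands in for real machine learning with a smoothed per-tag acceptance rate that rescales the tagger's raw confidence; the `TagFeedback` class and its scaling rule are hypothetical, chosen only to show steward decisions flowing back into later suggestions.

```python
from collections import defaultdict


class TagFeedback:
    """Tracks steward decisions per tag and rescales raw confidence."""

    def __init__(self):
        self.accepted = defaultdict(int)
        self.rejected = defaultdict(int)

    def record(self, tag, accepted):
        """Store one accept (True) or reject (False) decision."""
        if accepted:
            self.accepted[tag] += 1
        else:
            self.rejected[tag] += 1

    def acceptance_rate(self, tag):
        # Laplace smoothing: a tag with no feedback starts at 0.5.
        a, r = self.accepted[tag], self.rejected[tag]
        return (a + 1) / (a + r + 2)

    def adjusted_confidence(self, tag, raw_confidence):
        # Scaled so a tag with a neutral rate (0.5) keeps its raw confidence.
        return min(1.0, raw_confidence * 2 * self.acceptance_rate(tag))


feedback = TagFeedback()
for decision in (False, False, False):
    feedback.record("us_zip_code", decision)
feedback.record("email", True)

print(feedback.adjusted_confidence("us_zip_code", 0.9))  # pushed down
print(feedback.adjusted_confidence("email", 0.9))        # pushed up, capped at 1.0
```

After three rejections, the same raw score for `us_zip_code` falls well below a typical suggestion threshold, so the tagger stops proposing a tag that stewards keep rejecting.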
3. Crowdsourced ratings and reviews
Context counts. Users may like some data sets more or less than others depending on the context of their job and the data.
Allowing users to rate the data with a five-star system, as well as add written comments, provides subjective information about datasets to augment more objective profiling data added during the automated tagging process.
Think of it as the equivalent of Yelp for your data!
4. Ability to ensure tagging and metadata freshness
It is one thing to profile and tag data once, but when new data is coming into your environment all the time, you need to be able to incrementally evaluate and tag new data as it arrives to keep your metadata fresh.
This is especially important with new tag-based security paradigms where security policies are based on metadata tags that can be discovered automatically by the catalog.
For example, each new dataset should be automatically scanned for sensitive data and tagged appropriately, so tag-based security policies can be applied.
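The incremental scan and the tag-based policy it feeds can be sketched as follows. Everything here is an illustrative assumption: the content fingerprint used to detect new or changed datasets, the single SSN-style pattern used to detect sensitive data, and the `pii` clearance name are not taken from any particular product.

```python
import hashlib
import re

# Assumed detector: values shaped like a US Social Security number.
SSN = re.compile(r"\d{3}-\d{2}-\d{4}")


def fingerprint(rows):
    """Hash a dataset's content so unchanged datasets can be skipped."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def scan_for_tags(rows):
    """Tag a dataset 'sensitive' if any value looks like an SSN."""
    for row in rows:
        if any(isinstance(v, str) and SSN.fullmatch(v) for v in row):
            return {"sensitive"}
    return set()


def refresh_catalog(catalog, datasets):
    """Scan only new or changed datasets; return the names rescanned."""
    rescanned = []
    for name, rows in datasets.items():
        fp = fingerprint(rows)
        entry = catalog.get(name)
        if entry is None or entry["fingerprint"] != fp:
            catalog[name] = {"fingerprint": fp, "tags": scan_for_tags(rows)}
            rescanned.append(name)
    return rescanned


def can_read(catalog, name, user_clearances):
    """Tag-based policy: sensitive data requires the 'pii' clearance."""
    tags = catalog[name]["tags"]
    return "sensitive" not in tags or "pii" in user_clearances


catalog = {}
datasets = {"orders": [("A1", "2017-07-01")]}
first_pass = refresh_catalog(catalog, datasets)

datasets["employees"] = [("Ann", "123-45-6789")]  # new data arrives
second_pass = refresh_catalog(catalog, datasets)

print(first_pass, second_pass)
print(can_read(catalog, "employees", set()))
print(can_read(catalog, "employees", {"pii"}))
```

The second pass rescans only `employees`, and the policy check never needs to know how the tag was discovered, which is what makes fresh tags a prerequisite for tag-based security.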
5. Enterprise scalability and scope
Only catalogs natively developed on big data technology like Spark and Solr can scale to process data in the entire enterprise. For example, one customer started using the smart data catalog by covering over four billion rows of data.
Because Waterline takes advantage of existing Hadoop, Spark and cloud infrastructure, it is able to scale out to deal with this large volume of data, profiling and completing initial automated tagging in about two hours on 10 nodes.
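As a back-of-the-envelope check, the figures above imply the following throughput. This assumes the two-hour run covers the four-billion-row data set mentioned earlier and treats "over four billion" as a lower bound, so the results are rough estimates rather than benchmark numbers.

```python
rows = 4_000_000_000  # "over four billion rows", taken as a lower bound
hours = 2             # "about two hours"
nodes = 10

rows_per_second = rows / (hours * 3600)
rows_per_second_per_node = rows_per_second / nodes

print(f"{rows_per_second:,.0f} rows/s overall")
print(f"{rows_per_second_per_node:,.0f} rows/s per node")
```

That works out to roughly 556,000 rows per second across the cluster, or about 56,000 rows per second per node.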
In addition to scaling to support large data volumes, a data catalog must support a wide variety of data sources in the enterprise, whether on premises, in the cloud or hybrid. A data catalog needs to document all datasets in any format, be it relational, semi-structured or unstructured.
6. Open APIs for integration with a wide variety of tools
Ranging from existing business glossaries to data wrangling and business intelligence tools, data catalogs have to be able to integrate with a wide variety of applications.
In addition, the APIs need to support the integration of the catalog with your own applications. Many of our customers integrate the data catalog profiling and tagging capabilities as part of an automated data pipeline, and APIs are critical for making this possible.
7. Scalable search
As more users get access to the catalog, and the catalog gets larger with more tags and metadata, the ability to search the catalog will need to scale as well. That is why it is crucial to choose a catalog that uses a search engine like Solr that has been proven to scale.
8. Data catalog as a platform
A data catalog isn’t just another piece of middleware. It is the underlying platform for a wave of new metadata-based applications. New kinds of data governance, data rationalization and consent management applications will be built on top of the data catalog in the near future. If your data catalog vendor doesn’t have a vision for these kinds of apps, then you are talking to a follower, not a leader in the space.
9. Data lineage
Being able to see at a glance where a data set came from and how it is used to generate other data sets, combined with the ability to quickly review those data sets, is key to understanding the data set and trusting it to do the job. Unfortunately, the entire lineage of a piece of data cannot be imported from just the tools that generate new data sets.
While some lineage can be imported from ETL tools or Hadoop systems like Apache Atlas and Cloudera Navigator, there are always gaps in lineage chains that need to be filled. The data catalog should assist in filling these gaps by automatically discovering and suggesting missing lineage between datasets.
By filling in the gaps, the catalog is able to provide search based on which sources a particular data set was derived from (e.g., ability to find all customer data sets that were derived from salesforce.com regardless of how long those lineage chains are and how many data sources they span).
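Searching by origin, regardless of chain length, amounts to a graph traversal over lineage edges. The sketch below assumes lineage is already available as (source, derived) pairs; the dataset names are invented for illustration.

```python
from collections import defaultdict, deque


def build_downstream_index(edges):
    """Index lineage edges, given as (source, derived) dataset pairs."""
    downstream = defaultdict(set)
    for source, derived in edges:
        downstream[source].add(derived)
    return downstream


def derived_from(downstream, origin):
    """Return every dataset reachable from origin, at any chain length."""
    seen = set()
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, ()):
            # The seen set also protects against cycles in the lineage.
            if child not in seen and child != origin:
                seen.add(child)
                queue.append(child)
    return seen


edges = [
    ("salesforce.accounts", "staging.customers"),
    ("staging.customers", "warehouse.customer_dim"),
    ("warehouse.customer_dim", "reports.churn"),
    ("erp.invoices", "warehouse.invoice_fact"),
]
index = build_downstream_index(edges)
print(sorted(derived_from(index, "salesforce.accounts")))
```

A single missing edge in the middle of the chain would hide everything downstream of it from this search, which is why filling lineage gaps matters.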
10. Integration with native security infrastructure
Data catalogs have to integrate with the native security infrastructure, rather than impose a new user management and authorization system across all the data assets. While the ability to provide a metadata-only view of the data assets is required, the ability to protect data access by respecting native authorization policies and authentication processes is just as critical.
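A minimal sketch of this separation: the catalog always serves metadata, but delegates the decision to reveal actual data to the system that owns it. The `CatalogEntry` shape, the `NATIVE_ACL` table, and the `native_can_read` function are hypothetical stand-ins for whatever authorization mechanism the source system really uses.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    tags: set = field(default_factory=set)
    sample_rows: list = field(default_factory=list)


def view_entry(entry, user, native_can_read):
    """Always return metadata; include data only if the native system
    that owns the dataset authorizes this user."""
    view = {"name": entry.name, "tags": sorted(entry.tags)}
    if native_can_read(user, entry.name):
        view["sample_rows"] = entry.sample_rows
    return view


# Stand-in for the source system's own access check (hypothetical).
NATIVE_ACL = {"hr.salaries": {"alice"}}


def native_can_read(user, dataset):
    return user in NATIVE_ACL.get(dataset, set())


entry = CatalogEntry("hr.salaries", {"sensitive"}, [("Ann", 100)])
print(view_entry(entry, "bob", native_can_read))
print(view_entry(entry, "alice", native_can_read))
```

Because the catalog keeps no permission table of its own, a change made in the native system takes effect in the catalog immediately.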
Data catalogs are a relatively new space and, as time progresses, we intend to update this list to keep it current. As of July 2017, we are confident that the information above represents state-of-the-art data catalog capabilities that any data-driven organization should require.