What is a Data Catalog?
The easiest way to understand a data catalog’s purpose is to look at Amazon.com. Amazon carries millions of different products and yet, as consumers, we can find almost anything very quickly. Beyond its advanced search capabilities, Amazon also provides detailed information on each product: reviews, a list of companion products purchased most often, instructional videos, the seller’s information, shipping times, and more.
Data catalogs work the same way, but for today’s corporate data lakes. Data catalogs organize thousands or millions of an organization’s data sets to help users search for specific data and understand its lineage, its uses, and how others perceive the data’s value (or lack thereof).
“Simplicity is the ultimate sophistication.” (Leonardo da Vinci)
Another correlation between Amazon and data catalogs is data democratization, or the movement toward self-service. In the past, technology limitations restricted access to certain information to specialists. As an example, 20 years ago, if you needed a part for a washing machine, you would have to call your local repairman or an appliance parts store that would look up the part in its restricted catalog. Today, everyone has the ability to find parts for a washing machine (even if it’s 20 years old).
Make Data More Accessible to More Users
Today, the pooling of company information into data lakes is creating enormous quantities of data. Meanwhile, more workers need access to specific data sets within their immense data lakes in order to do their job and make data-backed decisions.
To make this data more accessible, there is a real need to make the data easily searchable, alongside its lineage, uses, and value to the organization. One of our customers recently said that before data catalogs, 80% of their employees’ time was spent finding data, 15% fixing the data, and 5% analyzing it. Their data catalog flipped these numbers, so that employees spent 80% of the time analyzing the data and only 5% finding it.
Because of the quantity of data that needs to be organized, building a data catalog can present a formidable challenge. Each data set needs to be properly discovered, tagged, and annotated with a description of its lineage, ratings from other users, and where it’s used. Here’s where technology helps. Some data catalog software vendors (including Waterline Data) use machine learning and artificial intelligence to automate tagging, the display of lineage and users’ annotations, schema capture, and finally the cataloging of the data set itself. This saves an immense amount of time, not only in building the initial catalog but also in maintaining it as new data sets are created. Although a data steward or data curator is still needed to review new entries, reviewing a data set’s schema and properties takes much less time than creating them from scratch.
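To make the idea of automated tagging concrete, here is a minimal sketch of rule-based column classification. The rules, tag names, and threshold are illustrative assumptions; a real catalog such as Waterline's combines heuristics like these with trained, proprietary models.

```python
import re

# Hypothetical rule set: map business terms to patterns that match
# sampled column values. Illustrative only, not a vendor's actual rules.
TAG_RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$"),
    "zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def suggest_tags(column_values, threshold=0.8):
    """Suggest business-term tags for a column by sampling its values.

    A tag is suggested when at least `threshold` of the non-empty
    sampled values match that tag's pattern.
    """
    values = [v for v in column_values if v]
    if not values:
        return []
    tags = []
    for tag, pattern in TAG_RULES.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            tags.append(tag)
    return tags

# A column of sampled values is tagged "email"; empty cells are ignored.
print(suggest_tags(["ann@example.com", "bob@example.org", ""]))  # -> ['email']
```

This is exactly the kind of work that is hopeless to do by hand at data-lake scale but cheap to run during automated profiling.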
Improve Regulatory Compliance
Beyond data democratization, data catalogs have other uses. Under many new government regulations, such as GDPR or the California Consumer Privacy Act taking effect in 2020, organizations need to know where all of their prospects’, customers’, and employees’ data resides in order to comply. Allowing “tribal” data storage (each department owning its own data) is not only inefficient, it can be very costly when companies need to delete or change regulated personal data, because they may miss some of the hidden or tribal databases.
Data catalogs also virtually eliminate database duplication. In the past, divisions and departments often unintentionally duplicated existing databases because they didn’t know the data already existed. With today’s smart data catalogs, duplication is less of a concern because people can easily search for existing content.
Implementing data catalogs has some unique challenges, including:
- Educating your organization on the value of a single source for data
- Removing tribalism
- Striking a balance between giving your organization’s power users the functionality they need and democratizing your organization’s data so that less technically literate users can also leverage its power
Most organizations understand that the growth of data and the breadth of users are a fait accompli, and they are working on proofs of concept today.
If you would like a more detailed and technical understanding of data catalogs, read the Ultimate Guide to Data Catalogs.
Top 10 Capabilities to Look for in a Data Catalog
1. Automated population of the catalog
The hardest problem in making a data catalog valuable is getting it populated with information. The specific information people look for is the set of tags that connect business terms to the actual attributes scattered around the organization.
But for the majority of businesses, there is simply too much data in their environment to be able to realistically tag the actual attributes with business terms by hand. Even crowdsourcing is not enough. Plus, manual tagging will miss the dark data that hasn’t been touched recently.
Simply put, data tagging needs to be done by AI and machine learning as data is profiled and objective metadata about the quality of the data is added to the catalog.
2. Crowdsourced curation of tags with machine learning feedback
Computerized tagging by itself is not enough. Even with the world’s best tagging algorithms, human review is still necessary to catch machine errors.
Additionally, when a data steward accepts or rejects a tag, the machine learning should incorporate that advice to improve future automated tagging of data.
This feedback loop is critical to improving the accuracy of the automation.
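A toy version of that feedback loop can be sketched as follows. The class, the threshold arithmetic, and the tag names are assumptions for illustration; a production catalog would feed steward decisions back into model retraining rather than a simple counter.

```python
from collections import defaultdict

class TagFeedback:
    """Toy feedback loop: track steward accept/reject decisions per tag
    and raise the confidence required before that tag is auto-applied.
    Illustrative only; a real catalog would retrain its tagging model."""

    def __init__(self, base_threshold=0.5):
        self.base = base_threshold
        self.accepted = defaultdict(int)
        self.rejected = defaultdict(int)

    def record(self, tag, accepted):
        """Record a data steward's decision on a suggested tag."""
        if accepted:
            self.accepted[tag] += 1
        else:
            self.rejected[tag] += 1

    def threshold(self, tag):
        """More rejections -> a higher bar for auto-tagging (capped at 0.95)."""
        total = self.accepted[tag] + self.rejected[tag]
        if total == 0:
            return self.base
        reject_rate = self.rejected[tag] / total
        return min(0.95, self.base + 0.45 * reject_rate)

    def auto_apply(self, tag, model_confidence):
        """Apply the tag automatically only if the model clears the bar."""
        return model_confidence >= self.threshold(tag)
```

For example, a tag suggested at 0.6 confidence is auto-applied at first, but after a steward rejects it once, the same confidence no longer clears the raised threshold.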
3. Crowdsourced ratings and reviews
Context counts. Users may like some data sets more or less than others depending on the context of their job and the data.
Allowing users to rate the data with a five-star system, as well as add written comments, provides subjective information about datasets to augment more objective profiling data added during the automated tagging process.
Think of it as the equivalent to Yelp for your data!
4. Ability to ensure tagging and metadata freshness
It is one thing to profile and tag data once, but when new data is coming into your environment all the time, you need to be able to incrementally evaluate and tag new data as it arrives to keep your metadata fresh.
This is especially important with new tag-based security paradigms where security policies are based on metadata tags that can be discovered automatically by the catalog.
For example, each new dataset should be automatically scanned for sensitive data and tagged appropriately, so tag-based security policies can be applied.
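A minimal sketch of that incremental scan, under assumed names and a stub classifier, might look like this. The policy labels and the idea of a "scanned" set are illustrative; real tag-based security would hand these tags to a policy engine such as Apache Ranger.

```python
# Hypothetical incremental scanner: only datasets not seen before are
# profiled, and any dataset carrying a sensitive tag is mapped to a
# tag-based access policy. Names and policies are illustrative.
SENSITIVE_TAGS = {"ssn", "email", "credit_card"}

def scan_new_datasets(datasets, already_scanned, classify):
    """Scan only newly arrived datasets to keep metadata fresh.

    `datasets` maps dataset name -> sample rows; `classify` returns the
    tags found in a sample. Returns the updated scanned set and a
    {dataset: policy} mapping for the new arrivals.
    """
    policies = {}
    for name, sample in datasets.items():
        if name in already_scanned:
            continue  # metadata still fresh; skip re-profiling
        tags = classify(sample)
        policy = "restricted" if SENSITIVE_TAGS & set(tags) else "open"
        policies[name] = policy
        already_scanned.add(name)
    return already_scanned, policies
```

Run on a schedule or triggered by data landing, only the delta is profiled, so the catalog's tags (and the security policies keyed on them) never go stale.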
5. Enterprise scalability and scope
Only catalogs natively developed on big data technology such as Spark and Solr can scale to process data across the entire enterprise. For example, one customer’s initial deployment of the smart data catalog covered more than four billion rows of data.
Because Waterline takes advantage of existing Hadoop, Spark, and cloud infrastructure, it is able to scale out to handle this volume, profiling the data and completing initial automated tagging in about two hours on 10 nodes.
In addition to scaling to support large data volumes, a data catalog must support a wide variety of data sources in the enterprise, whether on premises, in the cloud or hybrid. A data catalog needs to document all datasets in any format, be it relational, semi-structured or unstructured.
6. Open APIs for integration with a wide variety of tools
Ranging from existing business glossaries to data wrangling and business intelligence tools, data catalogs have to be able to integrate with a wide variety of applications.
In addition, the APIs need to support the integration of the catalog with your own applications. Many of our customers integrate the data catalog profiling and tagging capabilities as part of an automated data pipeline, and APIs are critical for making this possible.
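As an illustration of the pipeline pattern, here is a sketch of a step that registers a newly landed dataset with a catalog over REST. The endpoint URL, payload shape, and bearer-token auth are assumptions for the example; Waterline's actual API paths and fields will differ, so consult the vendor's API reference.

```python
import json
import urllib.request

# Hypothetical catalog endpoint; not a real vendor URL.
CATALOG_URL = "https://catalog.example.com/api/v1"

def build_register_request(path, token):
    """Build a request that registers a newly landed dataset so the
    catalog can profile and tag it as part of an automated pipeline."""
    body = json.dumps({"path": path}).encode()
    return urllib.request.Request(
        f"{CATALOG_URL}/datasets",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

def register_dataset(path, token):
    """Send the registration request after each successful data load."""
    with urllib.request.urlopen(build_register_request(path, token)) as resp:
        return json.load(resp)
```

Wired into the last stage of an ingest job, a call like this keeps the catalog in lockstep with the pipeline instead of relying on periodic crawls alone.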
7. Scalable search
As more users get access to the catalog, and the catalog gets larger with more tags and metadata, the ability to search the catalog will need to scale as well. That is why it is crucial to choose a catalog that uses a search engine like Solr that has been proven to scale.
8. Data catalog as a platform
A data catalog isn’t just another piece of middleware. It is the underlying platform for a wave of new metadata-based applications. New kinds of data governance, data rationalization and consent management applications will be built on top of the data catalog in the near future. If your data catalog vendor doesn’t have a vision for these kinds of apps, then you are talking to a follower, not a leader in the space.
9. Data lineage
Being able to see at a glance where a data set came from and how it is used to generate other data sets, combined with the ability to quickly review those data sets, is key to understanding the data set and trusting it to do the job. Unfortunately, the entire lineage of a piece of data cannot be imported from just the tools that generate new data sets.
While some lineage can be imported from ETL tools or from Hadoop ecosystem tools such as Apache Atlas and Cloudera Navigator, there are always gaps in lineage chains that need to be filled. The data catalog should assist in filling these gaps by automatically discovering and suggesting missing lineage between datasets.
By filling in the gaps, the catalog is able to provide search based on which sources a particular data set was derived from (e.g., ability to find all customer data sets that were derived from salesforce.com regardless of how long those lineage chains are and how many data sources they span).
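That source-based search amounts to a graph traversal over lineage edges. The sketch below uses a made-up lineage graph (dataset names and edges are invented for illustration) to show how a catalog can answer "find everything derived from salesforce.com," however long the chains are.

```python
from collections import deque

# Toy lineage graph: dataset -> the datasets it was derived from.
# All names here are illustrative, not from any real deployment.
LINEAGE = {
    "crm_extract": ["salesforce.com"],
    "customer_master": ["crm_extract", "erp_customers"],
    "churn_features": ["customer_master", "weblogs"],
}

def derived_from(source, lineage):
    """Return every dataset whose lineage chain reaches `source`."""
    # Invert the edges: source -> datasets directly derived from it.
    children = {}
    for dataset, parents in lineage.items():
        for p in parents:
            children.setdefault(p, []).append(dataset)
    # Breadth-first walk downstream from the source.
    found, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found
```

Here `derived_from("salesforce.com", LINEAGE)` finds all three downstream datasets, even though only `crm_extract` references the source directly, which is exactly the multi-hop search described above.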
10. Integration with native security infrastructure
Data catalogs have to integrate with the native security infrastructure, rather than impose a new user management and authorization system across all the data assets. While the ability to provide a metadata-only view of the data assets is required, the ability to protect data access by respecting native authorization policies and authentication processes is just as critical.
Data catalogs are a relatively new space and, as time progresses, we intend to update this list to keep it current. As of July 2017, we are confident that the information above represents state-of-the-art data catalog capabilities that any data-driven organization should require.