Since the data cataloging space is relatively new, many organizations don’t know what to look for in a data catalog. To help shed some light, Waterline Data worked with its customers to develop this list of the critical capabilities to be aware of when implementing a data catalog across your enterprise data fabric:
- Automated population of the catalog
The hardest problem in making a data catalog valuable is getting the catalog populated with information. What people look for most are the tags that connect business terms to the actual attributes scattered around the organization. But for the majority of businesses, there is simply too much data in their environment to realistically tag the actual attributes with business terms by hand. Even crowdsourcing is not enough. Plus, manual tagging will miss the dark data that hasn’t been touched recently. Simply put, data tagging needs to be done by AI and machine learning as the data is profiled, with objective metadata about the quality of the data added to the catalog at the same time.
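To make "objective metadata about the quality of the data" a little more concrete, here is a minimal Python sketch of column profiling. The function name `profile_column` and the particular metrics are illustrative assumptions, not a description of how Waterline or any other catalog actually profiles data.

```python
def profile_column(values):
    """Return simple, objective quality metrics for one column of raw values."""
    total = len(values)
    # Treat None and blank strings as missing.
    non_null = [v for v in values if v is not None and str(v).strip() != ""]

    # Count how many of the present values parse as numbers.
    numeric = 0
    for v in non_null:
        try:
            float(v)
            numeric += 1
        except (TypeError, ValueError):
            pass

    return {
        "rows": total,
        "null_fraction": (total - len(non_null)) / total if total else 0.0,
        "distinct": len(set(map(str, non_null))),
        "numeric_fraction": numeric / len(non_null) if non_null else 0.0,
    }


# Hypothetical sample column: two of the five values are missing.
profile = profile_column(["10", "20", None, "20", ""])
```

Metrics like these require no human judgment, which is why they can be collected for every data set, including the dark data nobody has opened recently.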
- Crowdsourced curation of tags with machine learning feedback
Computerized tagging by itself is not enough. Even with the world’s best tagging algorithms, human review is still necessary to catch machine errors. Additionally, when a data steward accepts or rejects a tag, the machine learning should incorporate that advice to improve future automated tagging of data. This feedback loop is critical to improve the accuracy of the automation.
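The feedback loop can be pictured with a toy example. The sketch below is not a real machine learning model and is not Waterline's algorithm; it is a deliberately simple stand-in in which a steward's accept or reject nudges a confidence score up or down. The class name `FeedbackTagger`, the threshold, and the step size are all hypothetical.

```python
from collections import defaultdict


class FeedbackTagger:
    """Toy tag suggester whose confidence shifts with steward feedback."""

    def __init__(self, threshold=0.5, step=0.2):
        self.threshold = threshold  # minimum confidence needed to suggest a tag
        self.step = step            # how far one accept/reject moves confidence
        # (column-name token, tag) -> confidence in [0, 1]
        self.confidence = defaultdict(float)

    @staticmethod
    def tokens(column_name):
        # "cust_email" -> ["cust", "email"]
        return column_name.lower().replace("-", "_").split("_")

    def seed(self, token, tag, confidence):
        """Set a starting confidence, e.g. from an initial automated pass."""
        self.confidence[(token, tag)] = confidence

    def suggest(self, column_name):
        """Return {tag: confidence} for tags at or above the threshold."""
        toks = set(self.tokens(column_name))
        scores = {}
        for (token, tag), conf in self.confidence.items():
            if token in toks:
                scores[tag] = max(scores.get(tag, 0.0), conf)
        return {tag: s for tag, s in scores.items() if s >= self.threshold}

    def feedback(self, column_name, tag, accepted):
        """Fold a steward's accept/reject decision back into the scores."""
        delta = self.step if accepted else -self.step
        for token in self.tokens(column_name):
            key = (token, tag)
            self.confidence[key] = min(1.0, max(0.0, self.confidence[key] + delta))


tagger = FeedbackTagger()
tagger.seed("email", "Email Address", 0.6)
before = tagger.suggest("cust_email")   # the tag is suggested
tagger.feedback("cust_email", "Email Address", accepted=False)
after = tagger.suggest("cust_email")    # the rejection pushes it below threshold
```

A production system would retrain or reweight a real model rather than adjust a lookup table, but the shape is the same: every human decision becomes a training signal for the next automated pass.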
- Crowdsourced ratings and reviews
Context counts. Users may like some data sets more or less than others depending on the context of their job and the data. Allowing users to rate the data with a five-star system, as well as add written comments, provides subjective information about data sets to augment the more objective profiling data added during the automated tagging process. Think of it as the equivalent of Yelp for your data!
- Ability to ensure tagging and metadata freshness
It is one thing to profile and tag data once, but when new data is coming into your environment all the time, you need to be able to incrementally evaluate and tag new data as it arrives to keep your metadata fresh. This is especially important with new tag-based security paradigms where security policies are based on metadata tags that can be discovered automatically by the catalog. For example, each new data set should be automatically scanned for sensitive data and tagged appropriately, so tag-based security policies can be applied.
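As a rough illustration of incremental scanning feeding a tag-based policy, consider the Python sketch below. Everything in it is an assumption made for the example: the tag names (`PII.Email`, `PII.US_SSN`), the two regular expressions, the 80% match threshold, and the `IncrementalCatalog` class. Real sensitive-data detection needs far more robust rules than this.

```python
import re

# Hypothetical detectors: tag name -> pattern a whole value must match.
SENSITIVE_PATTERNS = {
    "PII.Email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "PII.US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def scan_dataset(columns, min_match=0.8):
    """columns maps column name -> list of sample string values.

    Returns {column: set of tags} for columns where at least `min_match`
    of the non-empty sample values match a sensitive pattern.
    """
    tags = {}
    for name, values in columns.items():
        sample = [v for v in values if v]
        if not sample:
            continue
        for tag, pattern in SENSITIVE_PATTERNS.items():
            hits = sum(1 for v in sample if pattern.match(v))
            if hits / len(sample) >= min_match:
                tags.setdefault(name, set()).add(tag)
    return tags


class IncrementalCatalog:
    """Remembers which data sets were scanned so only new ones are processed."""

    def __init__(self):
        self.tags = {}  # data set id -> {column: set of tags}

    def ingest(self, datasets):
        """Scan only data sets not yet in the catalog; return their ids."""
        new_ids = [ds_id for ds_id in datasets if ds_id not in self.tags]
        for ds_id in new_ids:
            self.tags[ds_id] = scan_dataset(datasets[ds_id])
        return new_ids

    def masked_columns(self, dataset_id, restricted_prefix="PII."):
        """A tag-based policy: mask any column carrying a restricted tag."""
        return sorted(
            col
            for col, col_tags in self.tags[dataset_id].items()
            if any(t.startswith(restricted_prefix) for t in col_tags)
        )


catalog = IncrementalCatalog()
contacts = {"email": ["a@x.com", "b@y.org"], "city": ["Oslo", "Lima"]}
staff = {"ssn": ["123-45-6789", "987-65-4321"], "team": ["ops", "data"]}

first_run = catalog.ingest({"crm/contacts": contacts})
# The second run sees both data sets but scans only the new arrival.
second_run = catalog.ingest({"crm/contacts": contacts, "hr/staff": staff})
```

The policy here never names a data set or column. It refers only to tags, so a newly arrived data set is covered as soon as the scan has tagged it.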
- Enterprise scalability and scope
Only catalogs natively developed on big data technology like Spark and Solr can scale to process data in the entire enterprise. For example, one Waterline customer started using the Waterline data catalog by covering over four billion rows of data. Because Waterline takes advantage of existing Hadoop, Spark and cloud infrastructure, it is able to scale out to deal with this large volume of data, profiling and completing initial automated tagging in about two hours on 10 nodes. In addition to scaling to support large data volumes, a data catalog must support a wide variety of data sources in the enterprise, whether on premises, in the cloud or hybrid. A data catalog needs to document all data sets in any format, be it relational, semi-structured or unstructured.
These are only a few of the considerations we uncovered from our customers. Stay tuned for the second part of our post, in which we will list the rest of the top capabilities to look for in a data catalog.