Data Catalog by Waterline: Product Details
Waterline Data Fingerprinting™
Automated population of your data catalog
Waterline Data Fingerprinting™ works by analyzing and profiling the data values in each data set. Waterline then uses that information to create a “fingerprint” for each column of data, and applies machine learning to automatically tag the fingerprints, match them to glossary terms and populate the data catalog. Users can then refine the matched terms, and resolve any that remain unmatched, through crowdsourcing.
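To make the idea of a column fingerprint concrete, here is a minimal sketch in Python. It is not Waterline's proprietary algorithm: it swaps in simple regular-expression rules where the product uses machine learning, and the glossary terms, patterns, function names and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical glossary: each term is described by a value pattern.
# A real catalog would learn these signatures rather than hard-code them.
GLOSSARY_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def fingerprint(values):
    """Profile one column: value count, distinct ratio and per-term match rates."""
    non_null = [str(v).strip() for v in values if v is not None and str(v).strip()]
    total = len(non_null)
    match_rates = {
        term: (sum(1 for v in non_null if pattern.match(v)) / total if total else 0.0)
        for term, pattern in GLOSSARY_PATTERNS.items()
    }
    return {
        "count": total,
        "distinct_ratio": len(set(non_null)) / total if total else 0.0,
        "match_rates": match_rates,
    }


def suggest_tag(fp, threshold=0.8):
    """Return the best-matching glossary term, or None if nothing clears the threshold."""
    term, rate = max(fp["match_rates"].items(), key=lambda kv: kv[1])
    return term if rate >= threshold else None
```

A column whose values mostly look like email addresses would be suggested the `email` term, while a column that matches nothing is left untagged for users to resolve. Raising the threshold trades fewer false positives for more unmatched columns.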
Overcoming technical challenges
Waterline fingerprints data while overcoming two difficult technical challenges. The first is the problem of generating too many false positive matches. We have addressed this by tuning our proprietary matching algorithms using years of experience with real customer data. The second is dealing with the massive amount of data that modern enterprises need to inventory. Waterline is the only enterprise data catalog designed from the ground up to run directly on Hadoop and Spark. As a result, it scales directly with our customers’ infrastructure, today and tomorrow.
Automated data fingerprinting at scale provides a huge data governance leap over approaches that use only crowdsourcing and allows Waterline customers to get value from their datasets in a matter of hours or days.
Curation, governance, ratings and reviews
We recognize that users don’t want to turn over the entire process to a machine, nor should they. Waterline uniquely combines at-scale automation with a curation and governance process that allows data stewards to accept or reject the automated tags, or even add their own tags to the data catalog. As stewards curate the data, machine learning algorithms improve the automated discovery process.
Combining automation & tribal knowledge
Waterline allows end users to provide ratings and reviews of data sets. Users can comment on aspects of the data that can’t be easily determined algorithmically, preserving and leveraging tribal knowledge. For example, a user might comment that a certain data set is good for HR but not for finance, or note that a specific data set is being used by data scientists as a sandbox and shouldn’t be used for other purposes. The combination of automation with data democratization provides the fastest and best approach to building and maintaining a data catalog.
The Waterline catalog allows business analysts to spend less time trying to find the data they need and more time doing actual analytics. Waterline search provides a wide variety of facets for users to narrow down their searches. Customers can also define their own facets by creating custom properties to tag data and then using those properties to filter later searches. Search results also show crowdsourced ratings and reviews, giving users yet another way to judge the quality of the data they are viewing.
Industry standard search technology
Waterline uses the Apache Solr engine to power the search of the catalog. Solr delivers scalability, high availability (HA) and disaster recovery (DR). Waterline also exposes a REST API for search to make it easy to integrate search queries and results directly into third-party applications like data wrangling and business intelligence tools.
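Because catalog search is powered by Solr, a faceted search can be pictured in terms of Solr's standard select parameters (`q`, `fq`, `facet`, `facet.field`). The sketch below only builds such a query URL. It does not describe Waterline's own REST search API, and the host, collection name and field names are hypothetical.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, filters=None, facet_fields=()):
    """Build a Solr select URL with filter queries and facet fields.

    `base_url` points at a Solr collection; all field names are assumptions.
    """
    params = [("q", text), ("wt", "json")]
    # Each facet a user has selected becomes a filter query (fq).
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    # Ask Solr to return value counts for the fields shown as facets.
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"


# Example: search for "customer", restricted to a hypothetical "domain" property.
url = build_solr_query(
    "http://localhost:8983/solr/catalog",
    "customer",
    filters={"domain": "HR"},
    facet_fields=["tag", "rating"],
)
```

A custom property defined by a customer would simply appear as one more field in `filters` or `facet_fields`.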
A data catalog isn’t complete without data lineage, which is critical to establishing user trust in data and is also required for regulatory compliance. Lineage lets users know where data comes from and where it is ultimately consumed. A better understanding of data provenance enhances governance, gives users more confidence and trust in downstream data, and helps answer acceptable-use questions.
Imported & derived lineage
Waterline uses both imported and derived lineage. Imported lineage comes directly from the metadata available from Apache Atlas, Cloudera Navigator or ETL tools (via our REST API). To fill in the gaps, Waterline derives lineage by analyzing the data values within a Hive table, SQL table or HDFS file and looking for similar signatures that point to potential lineage candidates. Using smart proprietary algorithms, the data catalog identifies the best candidates and uses time stamps to determine which tables are upstream and which are downstream in the data flow.
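The derived-lineage idea can be sketched as follows. This is an illustration only, not Waterline's proprietary algorithm: it assumes Jaccard similarity over distinct values as the "signature" comparison, a hypothetical 0.7 threshold, and a single creation time stamp per table.

```python
from datetime import datetime


def overlap(a, b):
    """Jaccard similarity between the distinct values of two columns."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def lineage_candidates(tables, threshold=0.7):
    """Return (upstream, downstream, score) for table pairs with similar values.

    `tables` maps a table name to {"values": [...], "created": datetime}.
    The older table in each pair is treated as the upstream source.
    """
    names = sorted(tables)
    found = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = overlap(tables[left]["values"], tables[right]["values"])
            if score >= threshold:
                upstream, downstream = sorted(
                    (left, right), key=lambda name: tables[name]["created"]
                )
                found.append((upstream, downstream, score))
    # Best candidates first.
    return sorted(found, key=lambda candidate: candidate[2], reverse=True)


tables = {
    "raw_orders": {"values": [1, 2, 3, 4], "created": datetime(2020, 1, 1)},
    "orders_clean": {"values": [1, 2, 3, 4], "created": datetime(2020, 1, 2)},
    "customers": {"values": [8, 9], "created": datetime(2020, 1, 1)},
}
candidates = lineage_candidates(tables)
```

Here `raw_orders` and `orders_clean` share the same values, so they form a candidate pair, and the earlier time stamp marks `raw_orders` as the upstream table. Comparing every pair is quadratic in the number of tables, so a production system would need a cheaper way to shortlist pairs.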
One consequence of the big data era is that there is just too much data coming into organizations to govern and keep track of manually. New data sources are introduced into organizations with increasing velocity. Often this new data lands in a quarantine zone to be reviewed and organized before it can be effectively used. Typically, to get data out of quarantine, a data steward must manually review it, classify it and then configure appropriate access controls based on user roles and domains. This process can take so long, if it is ever completed at all, that supposedly new data goes stale before it is put to use!
Tag based access control
Waterline significantly accelerates the governance process with tag-based access control. This approach intelligently and automatically tags new data sets as they arrive, so data stewards only have to review the suggestions. In addition, through integration via REST APIs, the tags are propagated directly into platform access control mechanisms (Apache Ranger or Cloudera Sentry), where access can be immediately managed based on existing roles and domains. Data clears quarantine much more quickly and new data is immediately put to use. Instead of being paralyzed by the flood of data, enterprises can harness big data to be agile and gain critical business insights.
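As a rough illustration of tag-based access control, the sketch below maps tags to the roles allowed to read data carrying them. It does not use the Apache Ranger or Cloudera Sentry APIs; the policy table, role names and function names are hypothetical.

```python
# Hypothetical policy: which roles may read data carrying each tag.
TAG_POLICIES = {
    "pii": {"data_steward", "compliance"},
    "finance": {"finance_analyst", "data_steward"},
}


def can_read(user_roles, dataset_tags, policies=TAG_POLICIES):
    """Allow access only if the user holds a permitted role for every governed tag."""
    roles = set(user_roles)
    for tag in dataset_tags:
        allowed = policies.get(tag)
        # Tags without a policy are treated as ungoverned and do not block access.
        if allowed is not None and not (roles & allowed):
            return False
    return True


def release_from_quarantine(dataset, approved_tags):
    """Attach the steward-approved tags and mark the data set as available."""
    return {**dataset, "tags": sorted(approved_tags), "quarantined": False}


feed = release_from_quarantine(
    {"name": "new_feed", "quarantined": True}, {"pii", "finance"}
)
```

Once a steward approves the suggested tags, access follows from existing role definitions: a `data_steward` can read `feed`, while a `finance_analyst` cannot because of the `pii` tag. No per-data-set access rules have to be written by hand.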
Easy Integration: Open architecture and ecosystem
To make a data catalog useful, it needs to be integrated into the rest of your ecosystem so your organization can take action more effectively and leverage the full value of your data assets.
Time to take a REST
Waterline Data Catalog has an extensive array of REST APIs including (but not limited to) REST interfaces for:
- Search – make data discovery easy by searching from your favorite data wrangling or business intelligence tools, or launch those tools directly from within the Waterline Smart Data Catalog
- Lineage – import or export data lineage from ETL tools or from Hadoop
- Business metadata – import or export business glossary terms and definitions
- Sensitive data tags – integrate directly with Cloudera Sentry or Apache Ranger to enable smart attribute-based access control
In addition, Waterline has a relational database plug-in architecture that makes it easy to add new relational databases if the one you care about isn’t already on our supported list.
Supported Data Sources
Waterline can inventory the following data sources, and additional relational data sources can be added via our RDBMS plug-in architecture:
- Amazon S3
- Azure Data Lake Storage
- Azure Blob Storage
If there is a data source you want us to fingerprint for your environment, let us know. It might already be on our roadmap, or we can add it.