Waterline Data Fingerprinting™ automates the population of your data catalog
Waterline Data Fingerprinting™ works by analyzing and profiling the data values in each data set. Waterline then uses that information to create a “fingerprint” for each column of data, and applies machine learning to tag those fingerprints automatically, match them to glossary terms and populate the data catalog. Users can then refine the matched terms, and resolve any that remain unmatched, through crowdsourcing.
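To make the idea of a column fingerprint concrete, here is a minimal sketch in Python. It is not Waterline's algorithm, which is not described here. It assumes a fingerprint made of just three statistics (null ratio, distinct ratio and the most common value "shape"), and `GLOSSARY_SHAPES` and the function names are hypothetical.

```python
import re
from collections import Counter

def shape(value: str) -> str:
    """Collapse a value into a pattern: every digit becomes 9, every letter becomes A."""
    digits_masked = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", digits_masked)

def fingerprint(values):
    """Profile one column: null ratio, distinct ratio and dominant value shape."""
    non_null = [v for v in values if v not in (None, "")]
    shapes = Counter(shape(v) for v in non_null)
    top_shape = shapes.most_common(1)[0][0] if shapes else ""
    return {
        "null_ratio": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct_ratio": len(set(non_null)) / len(non_null) if non_null else 0.0,
        "top_shape": top_shape,
    }

# Hypothetical glossary: term -> the value shape we expect to dominate the column.
GLOSSARY_SHAPES = {"us_ssn": "999-99-9999", "us_zip": "99999"}

def suggest_tag(fp):
    """Return the first glossary term whose expected shape matches, else None."""
    for term, expected in GLOSSARY_SHAPES.items():
        if fp["top_shape"] == expected:
            return term
    return None

column = ["123-45-6789", "987-65-4321", None, "555-12-3456"]
fp = fingerprint(column)
suggested = suggest_tag(fp)
```

A production system would combine many more signals than a single shape, but the flow is the same: profile the values, summarize them, then compare the summary to known terms.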
Minimize False Positives
Machine learning algorithms take advantage of our decades of historical experience with real customer data.
Waterline is the only data catalog designed from the ground up to scale directly on our customers’ existing Hadoop and Spark infrastructure.
Human reviewers accept or reject data tags, providing a feedback loop that continually improves the accuracy of the machine learning algorithms.
Industry- and use-case-specific tag libraries (insurance, GDPR, retail, etc.) further improve the accuracy of automated tagging.
Curation, Ratings and Reviews
We recognize that users don’t want to turn over the entire process to a machine, nor should they. Waterline uniquely combines at-scale automation with a human-driven curation process that allows data stewards to accept or reject the automated tags, or even add their own tags to the data catalog. As stewards curate the data, machine learning algorithms improve the automated discovery process.
Data stewards can quickly accept or reject machine-discovered tags through an easy-to-use interface.
End users can add their own subjective ratings on a 1-star to 5-star scale to provide more insight into available data sets.
Users can comment on aspects of the data that can’t be easily determined algorithmically, preserving and leveraging tribal knowledge.
Organizations can add their own tag property libraries, which are immediately integrated into the automated tagging process.
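The accept/reject feedback loop described above can be sketched as follows. This is an illustrative assumption, not Waterline's learning method: it keeps a smoothed acceptance rate per tag and uses it to scale a model's raw confidence. The class `TagFeedback` and its method names are hypothetical.

```python
from collections import defaultdict

class TagFeedback:
    """Tracks steward decisions per tag and derives a smoothed acceptance rate."""

    def __init__(self):
        self.accepted = defaultdict(int)
        self.rejected = defaultdict(int)

    def record(self, tag, accepted):
        """Store one steward decision for a suggested tag."""
        if accepted:
            self.accepted[tag] += 1
        else:
            self.rejected[tag] += 1

    def acceptance_rate(self, tag):
        """Laplace-smoothed rate, so a tag with no history starts at 0.5."""
        a, r = self.accepted[tag], self.rejected[tag]
        return (a + 1) / (a + r + 2)

    def adjusted_score(self, tag, model_score):
        """Scale the raw model confidence by how often stewards agreed before."""
        return model_score * self.acceptance_rate(tag)

feedback = TagFeedback()
for decision in (True, True, True, False):
    feedback.record("email", decision)
feedback.record("phone", False)
```

With this history, a suggestion for "email" keeps more of its raw confidence than one for "phone", so tags that stewards tend to reject surface less often.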
The Waterline data catalog lets business analysts spend less time trying to find the data they need and more time doing actual analytics. Waterline search provides a wide variety of facets for users to narrow down their searches. Search results also show crowd-sourced ratings and reviews, giving users another way to judge the quality of the data they are viewing.
Industry Standard Search
Waterline uses the Apache Solr engine, an emerging industry standard, to power catalog search.
Search results let you see which files or columns contain the term you are looking for and let you drill down to the profiling results at the attribute level.
Define your own facets by creating custom properties to tag data and then use those properties to later filter searches.
Integrate search queries and results directly into third-party applications, such as data wrangling and business intelligence tools, through our REST API.
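Because the catalog search is powered by Apache Solr, a faceted search ultimately resembles a standard Solr query. The sketch below builds such a query string using Solr's standard `q`, `fq`, `facet` and `facet.field` parameters. The field names (`tags`, `rating`, `source_type`), the host and the `catalog` collection are assumptions for illustration, and the Waterline REST API itself, whose endpoints are not documented here, may look quite different.

```python
from urllib.parse import urlencode

def build_facet_query(term, filters, facet_fields, rows=10):
    """Build a Solr-style query string with filter queries and facet fields."""
    params = [("q", term), ("rows", str(rows)), ("wt", "json"), ("facet", "true")]
    for field in facet_fields:
        # One facet.field parameter per facet to count.
        params.append(("facet.field", field))
    for field, value in filters.items():
        # Each fq narrows the result set without affecting relevance scoring.
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)

query = build_facet_query(
    "customer_id",
    filters={"source_type": "hive"},   # hypothetical custom property
    facet_fields=["tags", "rating"],   # hypothetical field names
)
# Hypothetical host and collection name.
url = "http://localhost:8983/solr/catalog/select?" + query
```

A custom property defined as a facet would simply appear as one more `facet.field` entry and, once selected by the user, as one more `fq` filter.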
A data catalog isn’t complete without data lineage, which is critical both for establishing user trust in data and for regulatory compliance. Lineage lets users know where data comes from and where it is ultimately consumed. A better understanding of data provenance gives users more confidence and trust in downstream data and helps answer acceptable-use questions.
Waterline Smart Data Catalog imports lineage directly from the metadata available from Apache Atlas, Cloudera Navigator or ETL tools (via our REST API).
Waterline derives lineage by analyzing the data values within a Hive table, SQL table or HDFS file and identifying similar signatures to select potential lineage candidates.
Lineage is presented at the table level and users can drill down to the column/attribute level to see exactly which columns map through data flows.
Stewards can review the lineage derived by the automated algorithms and accept or reject it.
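As a rough illustration of selecting lineage candidates from similar signatures, the sketch below compares columns by the overlap of their value sets. Jaccard similarity is swapped in here as a simple stand-in; the source does not say which similarity measure Waterline uses, and the column names and threshold are hypothetical.

```python
def jaccard(a, b):
    """Share of distinct values two columns have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def lineage_candidates(target, sources, threshold=0.5):
    """Return (source_name, score) pairs at or above the threshold, best first."""
    scored = [(name, jaccard(target, values)) for name, values in sources.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical columns keyed by "schema.table.column".
sources = {
    "raw.orders.customer_id": [101, 102, 103, 104],
    "raw.products.sku": [9001, 9002],
    "staging.orders.cust": [101, 102, 103],
}
target = [101, 102, 103, 104]
candidates = lineage_candidates(target, sources)
```

The output is only a ranked list of candidates; as the text notes, a steward still reviews each one and accepts or rejects it.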
Tag-Based Access Control
One consequence of the big data era is that there is just too much data coming into organizations to keep track of manually. New data sources are introduced into organizations with increasing velocity. Often this new data lands in a quarantine zone to be reviewed and organized before it can be effectively used. Typically, to get out of quarantine, a data steward must manually review the data, classify it and then configure appropriate access controls based on user roles and domains. This process can take so long (if it happens at all) that supposedly new data can get stale before it is ever put into use!
Sensitive Data Discovery
Sensitive data tag libraries can be created that automatically label a field as sensitive data so that access to it can be properly managed downstream.
Access Control Integration
Sensitive data tags are propagated directly into platform access control mechanisms (Apache Ranger or Cloudera Sentry) via REST APIs.
Roles and Domains
User access to data can be immediately controlled throughout your Hadoop data landscape based on existing roles and domains.
Search Based Access Control
Our implementation of Solr search uses sensitive data tags to prevent users without proper credentials from viewing sensitive data.
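The principle behind tag-based access control can be sketched in a few lines: a column is visible to a user only if the user's role is cleared for every sensitive tag on that column. This is a simplified illustration, not the Ranger or Sentry policy model; the tag names, roles and the `visible_columns` helper are all hypothetical.

```python
# Hypothetical set of tags that mark a column as sensitive.
SENSITIVE_TAGS = {"ssn", "credit_card"}

# Hypothetical mapping of role -> sensitive tags that role may view.
ROLE_CLEARANCE = {
    "steward": {"ssn", "credit_card"},
    "analyst": set(),
}

def visible_columns(columns, role):
    """Return the column names whose sensitive tags are all cleared for the role."""
    allowed = ROLE_CLEARANCE.get(role, set())
    visible = []
    for name, tags in columns.items():
        restricted = set(tags) & SENSITIVE_TAGS
        if restricted <= allowed:  # no sensitive tag outside the role's clearance
            visible.append(name)
    return visible

columns = {
    "customer_name": ["name"],
    "tax_id": ["ssn"],
    "order_total": [],
}
analyst_view = visible_columns(columns, "analyst")
steward_view = visible_columns(columns, "steward")
```

Because the decision depends only on tags, a newly discovered column inherits the right restrictions as soon as it is tagged, without anyone writing a per-column rule.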
The Data Catalog as a Platform
To make a data catalog useful, it needs to be a platform that can be integrated into the rest of your ecosystem so your organization can take action more effectively and leverage the full value of your data assets. For every user interface in the product, there is a corresponding REST API to integrate the data catalog into your existing data workflows:
Search the data catalog from your favorite data wrangling or business intelligence tool, or launch those tools directly from Waterline Smart Data Catalog
Import or export data lineage from ETL tools or from Hadoop
Import or export business glossary terms and definitions from existing data glossaries or CSV files
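As an example of the glossary import case, the sketch below reads business glossary terms from CSV text and serializes them to JSON, ready to send to a REST endpoint. The CSV column names (`term`, `definition`) and the payload layout are assumptions, and no Waterline endpoint is called because the API's URLs and schema are not documented here.

```python
import csv
import io
import json

def glossary_from_csv(text):
    """Parse CSV text with 'term' and 'definition' columns into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text))
    terms = []
    for row in reader:
        name = (row.get("term") or "").strip()
        if not name:
            continue  # skip rows that have no term name
        definition = (row.get("definition") or "").strip()
        terms.append({"term": name, "definition": definition})
    return terms

sample = (
    "term,definition\n"
    "Customer ID,Unique identifier for a customer\n"
    ",orphan row\n"
    "SSN,US Social Security number\n"
)
terms = glossary_from_csv(sample)

# Hypothetical payload layout for a glossary import request body.
payload = json.dumps({"terms": terms})
```

Validating rows before upload, as done here with the empty-term check, keeps malformed glossary entries out of the catalog.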
Database Plugin Architecture
Our RDBMS plug-in architecture makes it easy to quickly add an RDBMS that isn’t already on the list of supported databases.
Waterline can catalog all kinds of data (Hadoop, relational, cloud) while running natively in big data and cloud environments, so we scale naturally as your environment grows.
Waterline can inventory the following data sources and can add additional relational data sources via our RDBMS plug-in architecture. We also connect directly to cloud data storage, making it easy to run natively in your cloud environment.
If there is a data source you want us to fingerprint for your environment, let us know. It might already be on our roadmap or we can add it to the roadmap.