Smart Data Catalog by Waterline

The Only Data Catalog with Automated Data Fingerprinting

Waterline Data Fingerprinting™ automates the population of your data catalog

Waterline Data Fingerprinting™ works by analyzing the data values in each data set and profiling the data. Waterline then uses that information to create a “fingerprint” for each column of data, and uses machine learning to automatically match those fingerprints to glossary terms, tag the columns, and populate the data catalog. Users can then refine the matched terms, and resolve any that remain unmatched, through crowdsourcing.
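To make the idea concrete, here is a minimal sketch of column profiling and glossary matching. It is an illustration only, not Waterline's algorithm: the glossary patterns, the `Fingerprint` fields, and the 0.8 threshold are all assumptions chosen for the example, and a production fingerprint would draw on far richer features and a trained model.

```python
import re
from dataclasses import dataclass

# Hypothetical glossary: each term is described by a regex that its values
# are expected to match. These patterns are assumptions for illustration.
GLOSSARY_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


@dataclass
class Fingerprint:
    row_count: int          # all rows, including empty ones
    distinct_ratio: float   # distinct values / non-empty values
    match_share: dict       # glossary term -> share of non-empty values matching


def fingerprint_column(values):
    """Profile one column of string values into a simple fingerprint."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    total = len(non_empty)
    match_share = {
        term: (sum(1 for v in non_empty if pattern.fullmatch(v)) / total if total else 0.0)
        for term, pattern in GLOSSARY_PATTERNS.items()
    }
    distinct_ratio = len(set(non_empty)) / total if total else 0.0
    return Fingerprint(len(values), distinct_ratio, match_share)


def suggest_tag(fp, threshold=0.8):
    """Return (term, confidence) for the best match, or None if below threshold."""
    term, share = max(fp.match_share.items(), key=lambda item: item[1])
    return (term, share) if share >= threshold else None


column = ["a@example.com", "b@example.org", "", "c@example.net", "n/a"]
fp = fingerprint_column(column)
print(suggest_tag(fp, threshold=0.7))
```

Three of the four non-empty values look like email addresses, so the column is suggested as `email` with a confidence of 0.75. At the default threshold of 0.8 it would be left unmatched for a person to review.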

Minimize False Positives

Machine learning algorithms take advantage of our decades of historical experience with real customer data.

Scale Up

Waterline is the only data catalog designed from the ground up to scale directly on our customers’ existing Hadoop and Spark infrastructure.

Tag Accuracy

Human reviewers accept or reject data tags, providing a feedback loop that continually improves the accuracy of the machine learning algorithms.

Tag Packs

Industry- and use-case-specific tag libraries (insurance, GDPR, retail, etc.) further improve the accuracy of automated tagging.

Curation, Ratings and Reviews

We recognize that users don’t want to turn over the entire process to a machine, nor should they. Waterline uniquely combines at-scale automation with a human-driven curation process that allows data stewards to accept or reject the automated tags, or even add their own tags to the data catalog. As stewards curate the data, machine learning algorithms improve the automated discovery process.

  • Curation

    Data stewards can quickly accept or reject machine-discovered tags through an easy-to-use interface.

  • Ratings

    End users can add their own subjective ratings on a 1-star to 5-star scale to provide more insight into available data sets.

  • Reviews

    Users can comment on aspects of the data that can’t be easily determined algorithmically, preserving and leveraging tribal knowledge.

  • Custom Tags

    Organizations can add their own tag property libraries, which are immediately integrated into the automated tagging process.
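The accept/reject feedback loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Waterline's learning method: the `TagFeedback` class is hypothetical, and it stands in for model retraining by scaling a tag's raw confidence by its smoothed acceptance rate.

```python
from collections import defaultdict


class TagFeedback:
    """Hypothetical tracker of steward decisions, used to rescale confidences."""

    def __init__(self):
        self.accepted = defaultdict(int)
        self.rejected = defaultdict(int)

    def record(self, tag, accepted):
        """Store one steward decision for a machine-suggested tag."""
        if accepted:
            self.accepted[tag] += 1
        else:
            self.rejected[tag] += 1

    def acceptance_rate(self, tag):
        # Laplace smoothing: a tag with no reviews starts at 0.5, not 0 or 1.
        a, r = self.accepted[tag], self.rejected[tag]
        return (a + 1) / (a + r + 2)

    def adjusted_confidence(self, tag, raw_confidence):
        # Scale the raw score by how often stewards agreed with this tag.
        return raw_confidence * self.acceptance_rate(tag)


feedback = TagFeedback()
for _ in range(3):
    feedback.record("email", accepted=True)
    feedback.record("ssn", accepted=False)

print(feedback.acceptance_rate("email"), feedback.acceptance_rate("ssn"))
```

After three acceptances the `email` tag's rate rises to 0.8, while three rejections pull `ssn` down to 0.2, so future `ssn` suggestions need a much stronger raw score to surface.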


The Waterline data catalog lets business analysts spend less time finding the data they need and more time doing actual analytics. Waterline search provides a wide variety of facets that users can apply to narrow their searches. Search results also show crowd-sourced ratings and reviews, giving users another way to judge the quality of the data they are viewing.

  • Industry Standard Search

    Waterline uses Apache Solr, an emerging industry standard search engine, to power search of the catalog.

  • Drill Down

    Search results let you see which files or columns contain the term you are looking for and let you drill down to the profiling results at the attribute level.

  • Custom Facets

    Define your own facets by creating custom properties to tag data and then use those properties to later filter searches.


    Integrate search queries and results directly into third-party applications, such as data wrangling and business intelligence tools, through our REST API.
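As an illustration of faceted search, the sketch below builds a query string from standard Apache Solr parameters (`q`, `fq`, `facet`, `facet.field`, `wt`). Waterline's own REST endpoints are not documented on this page, so the base URL, the `catalog` core name, and the `tag`, `source_type`, and `rating` field names are all assumptions.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, term, facet_fields, filters=None):
    """Build a Solr-style select URL with facet counts and filter queries."""
    params = [("q", term), ("facet", "true"), ("wt", "json")]
    # One facet.field parameter per facet whose value counts we want back.
    params += [("facet.field", field) for field in facet_fields]
    # Each fq narrows the result set without affecting relevance scoring.
    params += [("fq", f"{field}:{value}") for field, value in (filters or {}).items()]
    return f"{base_url}/select?{urlencode(params)}"


url = build_facet_query(
    "http://localhost:8983/solr/catalog",   # hypothetical core
    "customer",
    facet_fields=["tag", "source_type"],    # hypothetical field names
    filters={"rating": "[4 TO 5]"},         # only data rated 4 stars or more
)
print(url)
```

A custom property defined as a facet would simply appear as one more `facet.field` entry, and as one more `fq` once the user clicks a value to filter on.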

Data Lineage

A data catalog isn’t complete without data lineage, which is critical both for establishing user trust in data and for complying with regulations. Lineage shows users where data comes from and where it is ultimately consumed. A better understanding of data provenance gives users more confidence in downstream data and helps answer acceptable-use questions.

Imported Lineage

Waterline Smart Data Catalog imports lineage directly from the metadata available from Apache Atlas, Cloudera Navigator or ETL tools (via our REST API).

Derived Lineage

Waterline derives lineage by analyzing the data values within a Hive table, SQL table, or file in HDFS and identifying similar signatures to select potential lineage candidates.
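One simple way to picture value-based lineage discovery is to measure how much of a target column's values are contained in each possible source column. The sketch below uses set containment as a stand-in for "similar signatures". The actual signature comparison Waterline uses is not described here, so the metric, the 0.9 threshold, and the column names are assumptions.

```python
def containment(child_values, parent_values):
    """Share of the child's distinct values that also appear in the parent."""
    child, parent = set(child_values), set(parent_values)
    return len(child & parent) / len(child) if child else 0.0


def lineage_candidates(target, sources, threshold=0.9):
    """Rank source columns whose values largely contain the target's values."""
    scored = [(name, containment(target, values)) for name, values in sources.items()]
    return sorted(
        [(name, score) for name, score in scored if score >= threshold],
        key=lambda item: item[1],
        reverse=True,
    )


# Hypothetical columns: the target was presumably derived from one of the sources.
sources = {
    "raw.orders.customer_id": [1, 2, 3, 4, 5],
    "raw.returns.customer_id": [2, 9],
}
target = [1, 2, 3]

print(lineage_candidates(target, sources))
```

Every value in the target appears in `raw.orders.customer_id`, so that column is proposed as a candidate, while `raw.returns.customer_id` covers only a third of the values and is dropped. Candidates like these are suggestions for a steward to confirm, not proof of a data flow.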

Drill Down

Lineage is presented at the table level and users can drill down to the column/attribute level to see exactly which columns map through data flows.

Lineage Curation

Stewards can review the lineage derived by the automated algorithms and accept or reject it.

Tag-Based Access Control

One consequence of the big data era is that too much data is coming into organizations to keep track of manually. New data sources are introduced with increasing velocity. Often this new data lands in a quarantine zone to be reviewed and organized before it can be used effectively. Typically, to release data from quarantine, a data steward must manually review it, classify it, and then configure appropriate access controls based on user roles and domains. This process can take so long (if it happens at all) that supposedly new data goes stale before it is ever put to use!

  • Sensitive Data Discovery

    You can create sensitive data tag libraries that automatically label a field as sensitive, so that access to it can be properly managed downstream.

  • Access Control Integration

    Sensitive data tags are propagated directly into platform access control mechanisms (Apache Ranger or Cloudera Sentry) via REST APIs.

  • Roles and Domains

    User access to data can be controlled immediately throughout your Hadoop data landscape based on existing roles and domains.

  • Search Based Access Control

    Our implementation of Solr search uses sensitive data tags to prevent users without proper credentials from viewing sensitive data.
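The sketch below shows the general shape of tag-based filtering: a field is visible only if the user holds a permitted role for every sensitive tag on it. The tag names, role names, and the `SENSITIVE_TAG_ROLES` mapping are hypothetical. In a real deployment the policy would live in the platform's access control layer (for example Apache Ranger) rather than in application code.

```python
# Hypothetical policy: sensitive tag -> roles allowed to see fields carrying it.
SENSITIVE_TAG_ROLES = {
    "ssn": {"compliance"},
    "salary": {"hr", "compliance"},
}


def visible_fields(fields, user_roles):
    """Return field names the user may see, given each field's set of tags."""
    visible = []
    for name, tags in fields.items():
        # Collect the role requirements for every sensitive tag on this field.
        required = [SENSITIVE_TAG_ROLES[t] for t in tags if t in SENSITIVE_TAG_ROLES]
        # The user needs at least one allowed role for each sensitive tag.
        if all(user_roles & allowed for allowed in required):
            visible.append(name)
    return visible


fields = {
    "employee.ssn": {"ssn"},
    "employee.salary": {"salary"},
    "employee.city": {"city"},
}

print(visible_fields(fields, {"hr"}))
```

A user with only the `hr` role sees the salary and city fields but not the SSN field. Because the decision is driven by tags rather than by per-table rules, newly discovered data is covered as soon as it is tagged.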

Easy Integration:
The Data Catalog as a Platform

To be useful, a data catalog needs to be a platform that integrates with the rest of your ecosystem, so your organization can act more effectively and realize the full value of its data assets. For every user interface in the product, there is a corresponding REST API for integrating the data catalog into your existing data workflows:

Search the data catalog from your favorite data wrangling or business intelligence tool, or launch those tools directly from Waterline Smart Data Catalog


Import or export data lineage from ETL tools or from Hadoop

Business Metadata

Import or export business glossary terms and definitions from existing data glossaries or CSV files
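As a small illustration of the CSV side of this, the sketch below reads glossary terms and definitions into a dictionary using Python's standard `csv` module. The `term` and `definition` column names are assumptions, and the file layout Waterline expects may differ.

```python
import csv
import io


def load_glossary(csv_text):
    """Parse CSV text with 'term' and 'definition' columns into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    glossary = {}
    for row in reader:
        term = (row.get("term") or "").strip()
        if not term:
            continue  # skip rows that have no term to attach a definition to
        glossary[term] = (row.get("definition") or "").strip()
    return glossary


sample = (
    "term,definition\n"
    "Customer ID,Unique identifier for a customer\n"
    ",orphan definition\n"
    "Churn, Customer who cancelled \n"
)

glossary = load_glossary(sample)
print(glossary)
```

Rows without a term are skipped and surrounding whitespace is trimmed, so the result contains only the `Customer ID` and `Churn` entries.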

Database Plugin Architecture

Our RDBMS plug-in architecture makes it easy to quickly add an RDBMS that isn’t already on the list of supported databases

Waterline can catalog all kinds of data (Hadoop, relational, cloud) while running natively on big data and cloud environments, so we scale naturally as your environment grows.

Waterline can inventory the following data sources, and additional relational data sources can be added via our RDBMS plug-in architecture. We also connect directly to cloud data storage, making it easy to run natively in your cloud environment.

If there is a data source you want us to fingerprint for your environment, let us know. It might already be on our roadmap or we can add it to the roadmap.

Want to learn more about Waterline Data Catalog?