Big Data Scott Whitney September 24, 2018

Waterline Data Launches on Microsoft Azure HDInsight at Ignite 2018

Today we have very exciting news from Microsoft Ignite! As Microsoft announced on their blog, the Waterline Data Catalog is now available on Microsoft Azure HDInsight. You can read more about it in today’s press release.

The Waterline Data Catalog delivers the fastest and most accurate big data discovery engine with the highest scalability in the industry. With Waterline Data on Azure HDInsight, petabyte-scale enterprises gain a competitive edge through fast, accurate discovery of governed business data that virtually anyone can call up, analyze, and quickly put to work for the business. Some of the world’s largest financial, food & beverage, and other companies in the Fortune 100 and Global Fortune 500 are already using Waterline Data on Azure HDInsight to convert data into business-driving insights faster.

Waterline on HDInsight: Fast, Accurate, Highly Scalable

Microsoft Azure HDInsight is a fully managed cloud service that allows organizations to process massive amounts of data for fast, easy, and cost-effective analytics. It supports open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, and R, and enables a broad range of scenarios such as ETL, data warehousing, machine learning, and IoT. But HDInsight can’t process data properly if the data hasn’t been accurately identified. Today’s large organizations often house thousands of datasets with millions of data fields, and these grow daily in both volume and complexity. Much of this data is unknown to users and thus functionally of no value.

Manually documenting this data isn’t an option, but automated data discovery solutions can sacrifice accuracy for speed, which only means the organization will be first to act on the wrong information. The Waterline Data Catalog was designed specifically to resolve the challenges of high-volume, high-variety data by creating a real-time, virtual view of the enterprise’s entire data estate, even as new data pours in. Waterline is by far the most scalable data catalog in the industry, able to profile and tag billions of rows of data while helping customers cut data processing time by up to 10X. We provide an easy-to-use “shop for data” interface that lets business analysts use everyday business terms to find the data sets they need for full self-service analytics, without ever having to rely on IT.

The Waterline Data Catalog complements HDInsight by feeding it with all the data in an organization’s data estate—and because this data has been discovered by Waterline, customers can quickly convert data into actionable intelligence with the confidence that the data is known, trusted and governed for both compliance and user access. In today’s data-driven economy, speed, accuracy and scalability are of the essence. With Waterline, high volumes of data are accurately identified and tagged faster.

Data is:

  • Governed and secured faster
  • Prepped for self-service faster
  • Recognized for value faster
  • Put to work for the business faster

Where the Magic Happens: Waterline’s Industry-Unique Data Fingerprinting™

The core value of the Waterline Data Catalog is its industry-unique Fingerprinting technique, an AI-based, machine-learning mechanism that automatically analyzes and profiles the values in each data set. Waterline’s data Fingerprinting rests on the notion that every column of data has certain qualities that give it a “signature” or “fingerprint” that we can (1) detect, (2) associate with a business term or label that describes this type of data, and (3) extend to other data that shares the same qualities.

Waterline essentially creates a fingerprint for each column of data and then automatically matches these fingerprints across data sets and assigns tags, which users can review and approve or correct. Machine learning algorithms improve the automated discovery process based on this user feedback. This provides the perfect balance of AI automation and human interaction in discovering, organizing and controlling data. Once tagged, the data is available via an intuitive search system. The entire data enterprise is now at the fingertips of virtually any user who has been assigned access, regardless of their technical expertise.
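To make the idea of matching fingerprints across data sets more concrete, here is a minimal sketch in Python. It is an illustration only and not Waterline’s actual algorithm, which this post does not describe in detail: it assumes a fingerprint is simply the distribution of character “shapes” in a column, and every name in it (shape, fingerprint, similarity, suggest_tag, the 0.8 threshold) is hypothetical.

```python
import re
from collections import Counter


def shape(value: str) -> str:
    """Reduce a value to its character shape: digits -> 9, letters -> A."""
    value = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", value)


def fingerprint(values) -> dict:
    """Hypothetical fingerprint: relative frequency of each value shape."""
    counts = Counter(shape(str(v)) for v in values)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def similarity(fp_a: dict, fp_b: dict) -> float:
    """Histogram intersection: 1.0 for identical profiles, 0.0 for disjoint."""
    return sum(min(p, fp_b.get(s, 0.0)) for s, p in fp_a.items())


def suggest_tag(column_values, tagged_fingerprints: dict, threshold: float = 0.8):
    """Suggest the best-matching known tag, or None if nothing is close enough.

    The suggestion is meant to be reviewed by a person, mirroring the
    approve-or-correct step described above.
    """
    fp = fingerprint(column_values)
    best_tag, best_score = None, 0.0
    for tag, reference in tagged_fingerprints.items():
        score = similarity(fp, reference)
        if score > best_score:
            best_tag, best_score = tag, score
    if best_score >= threshold:
        return best_tag, best_score
    return None, best_score


# Columns that a user has already tagged (made-up sample values).
known = {
    "phone": fingerprint(["555-123-4567", "555-987-6543"]),
    "zip": fingerprint(["94040", "10001"]),
}

# An untagged column from another data set gets a suggested tag.
suggestion = suggest_tag(["212-555-0100", "415-555-0199"], known)
```

In this sketch, a user who approves a suggestion could add the new column’s fingerprint to the known set, which is one simple way user feedback might feed back into later matches.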

Overview of Features and Benefits:

Waterline is the most comprehensive data catalog system available, covering the full breadth of required capabilities:

  • AI-Driven Automation: Waterline is the only system that uses machine learning to automatically discover, organize, tag and control data.
  • Customization and Control: Users can add tags and associated properties that are immediately included in the automated data discovery process. This can include rules to automatically label fields containing sensitive data for controlled access. Sensitive data tags can be loaded into access control systems (e.g., Apache Ranger, Cloudera Sentry) via REST APIs.
  • Search: Intuitive web interface designed for business users to search a catalog of trusted, curated data. Search results also show data profiles and ratings.
  • Scale: Waterline is the only data catalog designed from the ground up to scale using existing Hadoop and Spark infrastructures.
  • Accuracy: Only Waterline incorporates an industry-unique data Fingerprinting technique combined with human review for the highest level of accuracy.
  • Speed: With Waterline, faster data discovery means faster data analysis for faster decision-making, transforming any organization into an intelligent, nimble, data-driven enterprise.
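As a rough illustration of the rule-based labeling described under Customization and Control, the Python sketch below tags a field as sensitive when most of its sampled values match a pattern. The rule names, patterns, and 0.9 threshold are assumptions made for this example, not Waterline’s built-in rules, and no real Waterline, Apache Ranger, or Cloudera Sentry API call is shown.

```python
import re

# Hypothetical rules: tag name -> pattern a whole value must match.
RULES = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def sensitive_tags(fields: dict, min_match: float = 0.9) -> dict:
    """Return {field name: [tags]} for fields whose values mostly match a rule."""
    tagged = {}
    for name, values in fields.items():
        if not values:
            continue  # nothing to profile in an empty sample
        for tag, pattern in RULES.items():
            hits = sum(1 for v in values if pattern.fullmatch(str(v)))
            if hits / len(values) >= min_match:
                tagged.setdefault(name, []).append(tag)
    return tagged


# Made-up sample of field values from one data set.
sample = {
    "cust_id": ["123-45-6789", "987-65-4321"],
    "contact": ["a@b.com", "c@d.org"],
    "city": ["Austin", "Reno"],
}

result = sensitive_tags(sample)
```

A result like this could then be serialized and handed to an access control system through that system’s own REST API; the request format depends on the system and is not shown here.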

Getting Started is Easy!

To learn more about how Waterline Data on HDInsight brings tremendous value to anyone building on Microsoft’s public cloud, click here.

To receive a key for your Microsoft Azure/Waterline Data sandbox, click here, and within 24 hours we will contact you with everything you need to get started.