Why more Data Catalog implementations are moving to the cloud

Posted on June 26th, 2017 | Todd Goldman

An increasing number of our customers are moving from on-premises to cloud-based solutions for their data analytics needs. When we started Waterline Data in 2013, we were focused on big data running in on-premises Hadoop clusters. Big data required big computer clusters, and even in late 2015 most of the people looking at Hadoop and data catalogs were focused on on-premises deployments.

As time marched on, something changed radically. In late 2016, two requests started to come up frequently from customers. The first was to expand beyond Hadoop and handle relational data, which meant that Waterline needed to add support for traditional structured data sources like Oracle, Teradata and MySQL. We added this feature in late Q1 of 2017, and we continue to expand our list of supported relational data sources. The second request was for cloud support: customers no longer wanted to set up their own clusters, and very quickly there was a lot of interest in implementing data catalogs on top of Hadoop clusters based on Cloudera, Hortonworks and MapR running on AWS. Since then, interest in Amazon has increased even more and, based on specific customer requests, we have added support for Amazon EMR and Amazon Redshift.

Our customers still request support to discover and fingerprint data on premises, but the reality for most of them going forward is a mixed environment. That said, the ability to quickly spin up a Hadoop or Spark cluster in the cloud is extremely attractive. Why? Because anything that redirects resources from other activities to actual analytics is of high value. Just as implementing a data catalog with automated discovery and fingerprinting means that data analysts spend less time searching for data and more time doing analytics, provisioning the underlying data fabric in the cloud means less time spent on infrastructure and more on analysis. The result is that organizations looking to enable self-service data find a conceptual consistency in using the cloud to “self-service” provision the data grid (Hadoop or Spark), to implement “self-service” data analytics and data wrangling, and to enable “self-service” data discovery through a data catalog.

One other big advantage of the cloud from a data catalog perspective is that the compute capacity available for fingerprinting data makes it possible to discover and tag data much faster.

On-premises data isn’t going away any time soon. Still, the ease of use of the cloud and the simplicity of self-service are driving more companies to move to the cloud, a shift we here at Waterline are seeing not just in theory, but in practice.