Cloud Alex Gorelik December 11, 2017

The Cloud and a Post-Hadoop World

Nobody can deny the adoption of cloud computing. According to Forbes, cloud computing will become a $411B market by 2020. We all remember the debates in the late 2000s over whether cloud computing was just a fad or a “real thing”. A few years later, it was clear that the cloud was here to stay, and the debate shifted to whether a public cloud or a private cloud was the right way to go. Here we are, about to enter 2018, and companies have made significant progress in determining what’s right for them. The verdict is in. We’re in a new world, folks. Some applications work great on the public cloud. Some work great in a private cloud environment. Some will have to remain on premises. It looks like an on-premises + public cloud hybrid is here to stay!

The question is why? Doesn’t having multiple environments create more complexity? Rather than debate this endlessly, we thought we’d talk to the companies making this decision. So, we talked to two of our customers. One is a large restaurant chain that feeds 1% of the world’s population daily and the other is a large coffee company with over 26,000 stores worldwide. Like most enterprises, they have adopted a unified cloud architecture where analytics is done both in the cloud and on premises. Given their scale and complexity, Hadoop was the obvious analytics platform, but Hadoop is complicated. It takes expensive admins to manage it. Customers love the scale-out power of Hadoop but don’t want to deal with the complexities of managing the Hadoop cluster. This is where the cloud comes in. Imagine someone else managing the elasticity and administration of the Hadoop cluster, so you can simply focus on analyzing your data. What a novel idea! But that’s exactly the value proposition of having a data lake in the cloud.

But, while these customers enjoy the benefits of the cloud, including elastic compute and minimal administration, they also have a lot of data in legacy systems on premises. The data in these systems is valuable for analytics, but moving all that data with related applications and analytics to the cloud is not always immediately feasible given the cost and disruption. Many of these applications and analytics systems were developed using legacy technology that doesn’t work in the cloud and would need to be completely redesigned. Anyone who has experienced data migration can relate to this pain. It can be a long process that’s fraught with high cost, huge disruption to the business caused by planned application downtime, potential data loss, and so on. That’s why we still have so many mainframes around. Consequently, most enterprises leave legacy data on premises and adopt a cloud-first policy for new projects. So now, the restaurant chain can have data in Redshift on AWS and data in Oracle and Teradata in their data centers, all of which they have to analyze to make intelligent decisions. Similarly, the coffee company continues to leverage Oracle, Teradata and Cloudera environments on premises while building a new data lake using Power BI and ADLS on Microsoft Azure.

So, things should be more complicated, right? I mean, now analytics have to span yet another platform. How does one find data across the cloud, multiple clouds, on-premises data lakes, and RDBMSs? Well, it doesn’t have to be complicated. We at Waterline have been helping many Fortune 2000 companies across the globe automatically discover, understand and govern their data with a single solution. To tie this diverse and distributed data estate together, our customers are deploying the Waterline Smart Data Catalog to automatically:

  1. Create a virtual view of the data across all data sources
  2. Show business or semantic context
  3. Show governance and compliance context as well as technical context

Let’s examine these one at a time, starting with Waterline Smart Data Catalog. While many relational-centric product offerings like Alation or Informatica Enterprise Information Catalog claim support for Hadoop and Object Stores, you will see that the support is too primitive and narrow to be useful. For example, the other data catalogs do list all the files in an S3 bucket and let you search them by name and manually tag each file. But they don’t look inside! What if the labels are wrong or simply missing altogether? They also don’t do any of this at web scale. How is someone going to tag millions of files without really knowing what’s inside each file?

The list above shows the files in an S3 bucket that the user can browse in Waterline. It is pretty much what Alation, Amazon Glue or MS Azure Catalog can show you. Not much to go on.

Amazon Glue goes a little further and tries to parse each file to understand the schema and to group files with similar schemas into logical data sets. This supports the most common pattern in Big Data, where new partitions of large data sets are periodically loaded into Hadoop or an Object Store. This pattern is supported by, for example, partitioned Hive tables that specify a directory structure: when new files are added to that structure, they automatically become part of the Hive table. In the case of Glue, the goal is to mechanically dump data into the proprietary Redshift rather than to understand the data itself.


In this screenshot, Waterline has automatically determined that all the files in the /LocalFS/demo/Pass5/pub/claims folder tree are really partitions of the same virtual data set (e.g. a time series collection). It then automatically created this virtual data set, called it “claims”, and calculated statistics and tags across all the hundreds of files in a multilevel folder tree that has 1.) a separate folder for each year, 2.) a separate folder under that for each month, and 3.) a list of files or partitions that contain claims for that month. The following screenshot shows all the files for June 2015 located in /LocalFS/demo/Pass5/pub/claims/sftp/2015/jun. All this was done automatically without any human intervention!
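To make the year/month folder convention concrete, here is a minimal Python sketch that groups file paths into candidate virtual data sets by cutting each path at its first year-like folder. It is only an illustration under stated assumptions, not how Waterline does it: the function name `group_partitions`, the year pattern, and the example file names are hypothetical, and a real catalog would also compare the schemas of the files before treating them as partitions.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: a partition folder for a year is named with four digits, e.g. "2015".
YEAR_DIR = re.compile(r"^(19|20)\d{2}$")


def group_partitions(paths):
    """Map each candidate data set root to the files found under its year folders."""
    datasets = defaultdict(list)
    for path in paths:
        folders = PurePosixPath(path).parent.parts
        # The root is everything before the first year-like folder;
        # files with no year folder are grouped by their own parent folder.
        cut = next(
            (i for i, name in enumerate(folders) if YEAR_DIR.match(name)),
            len(folders),
        )
        root = str(PurePosixPath(*folders[:cut]))
        datasets[root].append(path)
    return dict(datasets)


# Hypothetical file names under the folder tree mentioned in the article.
example_paths = [
    "/LocalFS/demo/Pass5/pub/claims/sftp/2015/jun/part_01.csv",
    "/LocalFS/demo/Pass5/pub/claims/sftp/2015/jul/part_01.csv",
    "/LocalFS/demo/other/readme.txt",
]
for root, files in group_partitions(example_paths).items():
    print(root, len(files))
```

Grouping on the path alone is deliberately naive; it only shows why a consistent folder layout makes partitions discoverable without manual tagging.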

However, even that is not enough. To truly understand and be able to find data in Big Data environments, a catalog needs to introspect the files and their data. This is exactly what Waterline Smart Data Catalog does—it crawls Hadoop file systems and Object Stores like S3 and looks for new or changed files. It then parses each file to determine its schema. Then it profiles each file to collect statistics on each field:

  • How many distinct values are there
  • How many empty values or NULLs
  • What is the real type – a CSV or delimited file may be all text, but a specific field can consist of just numbers
  • What’s the largest and the smallest value
  • What are most frequent values and their counts

And so on.
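The statistics listed above can be sketched for a single field in a few lines of Python. This is a simplified illustration rather than Waterline's profiler: the function names `infer_type` and `profile_field` and the returned dictionary keys are assumptions made for the example, and values are taken to arrive as strings (or `None`), as they would from a CSV or delimited file.

```python
from collections import Counter


def infer_type(values):
    """Return 'integer', 'float', or 'text' for a list of non-empty strings."""
    for name, cast in (("integer", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "text"


def profile_field(raw_values, top_n=3):
    """Collect basic profile statistics for one field of a delimited file."""
    present = [v.strip() for v in raw_values if v is not None and v.strip() != ""]
    counts = Counter(present)
    kind = infer_type(present) if present else "unknown"
    if kind == "integer":
        ordered = sorted(present, key=int)    # compare as numbers, not as text
    elif kind == "float":
        ordered = sorted(present, key=float)
    else:
        ordered = sorted(present)
    return {
        "total": len(raw_values),
        "empty": len(raw_values) - len(present),   # empty strings and NULLs
        "distinct": len(counts),
        "type": kind,
        "min": ordered[0] if ordered else None,
        "max": ordered[-1] if ordered else None,
        "most_frequent": counts.most_common(top_n),
    }


profile = profile_field(["10", "2", "", None, "2"])
print(profile)
```

Note how the field is all text on disk, yet the inferred type is numeric, so the smallest and largest values are compared as numbers ("2" before "10") rather than alphabetically.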

In the screenshot above, Waterline Data shows the profile of the “ethn” field. It introspected the field to find that there are 93 unique values in the file, the most frequent being Alaska Native with 155 occurrences. Selectivity of 0.008 is a measure of the uniqueness of a field: it is the inverse of the average number of occurrences per value. In other words, with 93 distinct values in 12,255 records, each value occurs 12,255/93 ≈ 132 times on average and selectivity is 1/132 ≈ 0.008. If you are doing analytics based on ethnicity, this gives you a lot of insight into whether this data set is going to work for you, and it is all done automatically, pre-calculated and available immediately when you use the catalog.
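The selectivity arithmetic can be checked with a short Python sketch. The function name `selectivity` is an assumption made for illustration; the figures are the ones quoted above.

```python
def selectivity(distinct_values, total_records):
    """Inverse of the average number of occurrences per distinct value."""
    if distinct_values <= 0 or total_records <= 0:
        raise ValueError("counts must be positive")
    average_occurrences = total_records / distinct_values
    return 1 / average_occurrences


average = 12255 / 93            # about 131.8 occurrences per value
value = selectivity(93, 12255)  # about 0.0076
print(round(average), round(value, 3))
```

A selectivity close to 1 means almost every record has its own value (as with an ID field), while a value close to 0 means a small set of values is repeated many times.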

Based on that and other information that it collects, Waterline performs automated discovery to tag each field with a business term, so business analysts are able to find and understand data sets using familiar business terms.

In the screenshot above, each field is tagged with a business tag, which is basically a business glossary term. Some tags are white with dashed lines and confidence levels. Those tags were automatically inferred or suggested by the Waterline Data discovery engine.

Waterline learns: by double-clicking on a suggested tag, the analyst can approve it or reject it. Approved tags become solid blue and rejected tags disappear from the display. The discovery engine’s machine learning capabilities learn from the analysts’ acceptances and rejections to improve its suggestions, and everything it learns is applied to all the data sets across the enterprise. This ability to automatically tag data with business terms is unique to Waterline Data Catalog and key to its ability to catalog a large distributed data estate created by Big Data. To underscore the point, Waterline can discover business terms across files, tables, and different data sources!
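The review workflow described above (suggested tags with confidence levels, which an analyst approves or rejects) can be modeled with a small Python sketch. Everything here is hypothetical and for illustration only: the class name `FieldTags`, the tag names, and the confidence numbers are assumptions, and the sketch only records the analyst's decisions; it does not attempt to reproduce the machine learning that Waterline applies to them.

```python
from dataclasses import dataclass, field


@dataclass
class FieldTags:
    """Tags attached to one field: pending suggestions plus review outcomes."""
    suggested: dict = field(default_factory=dict)  # tag -> confidence (0..1)
    approved: set = field(default_factory=set)
    rejected: set = field(default_factory=set)

    def review(self, tag, accept):
        """Approve or reject a pending suggestion."""
        if tag not in self.suggested:
            raise KeyError(f"no pending suggestion for {tag!r}")
        del self.suggested[tag]
        (self.approved if accept else self.rejected).add(tag)

    def visible(self):
        """Tags an analyst would still see: approved first, then pending."""
        return sorted(self.approved) + sorted(self.suggested)


# Hypothetical suggestions for the "ethn" field.
tags = FieldTags(suggested={"Ethnicity": 0.82, "State": 0.40})
tags.review("Ethnicity", accept=True)   # becomes an approved tag
tags.review("State", accept=False)      # disappears from the display
print(tags.visible(), tags.rejected)
```

Keeping the rejected tags, rather than discarding them, is what would allow a learning component to use both acceptances and rejections as feedback.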

Oftentimes, before analysts can decide whether they can use a data set, they need to know where it came from. To facilitate that, another important and unique capability of Waterline Data is its ability to import the lineage of each data set or, where no lineage is available to import, to infer it automatically.

The above diagram shows lineage in Waterline where dashed lines indicate inferred lineage that the analysts can approve or reject (just like they do with tags) and solid lines indicate imported or approved lineage. This diagram can be navigated by users to drill into different parent and child data sets in order to explore where data came from and how it is being used.

A unified model for analytics brings together the best of both worlds. Unfortunately, the separation of data can create a challenge in having a unified method of governing data and consolidating it for analytics. In order for analytics to be useful or governance to be applicable, one must have a single and uniform way of finding, understanding and managing data.

This is only possible with a smart data catalog that provides immediate insight into the data by providing profiling statistics and business tags for each field, organizing data sets around common usage patterns like partitions, and providing lineage across all data sources. There is no way to do this manually: even a small database with a hundred tables, each having a hundred fields, would require ten thousand fields to be manually tagged. Just imagine trying to do it for any sizable company with millions of data sets. Therefore, all this information must be gathered and discovered automatically and kept up to date through continual incremental discovery. That is what makes a data catalog “smart”. Only with a Smart Data Catalog can one not only leverage a unified data platform model but also further simplify the discovery, management, and governance of that data.