We recently announced a new book written by our very own Waterline founder Alex Gorelik.
The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science, which guides IT executives and practitioners through all the stages of implementing and managing a modern data lake, has already earned high praise from some of data’s biggest leaders. Published by O’Reilly Media, it is the #1 New Release in the Data Warehousing category on Amazon.
Today we kick off a new blog series. Every month, we will delve into one of the chapters of the book to present an overview of the guidance gleaned from Alex’s 30-year career as well as from some of the world’s leading data-driven enterprises. Ready?
Chapter 1 serves as an introduction to data lakes—what they are, why we need them, and how they support the self-service analytics that is so critical for data-driven enterprises. Data lakes, for instance, help break down the data silos that hamper so many organizations.
What Data Lakes Need to Be Successful
The chapter also sets up some of the big deployment and management challenges that are addressed in subsequent chapters, such as security and governance.
After defining the several stages a data lake may go through before achieving maturity—data puddles and data ponds, for example—Alex jumps right into the platforms, kinds of data, and interfaces organizations must evaluate in order to create a successful data lake. Organizations must make their choices knowing what their needs are in terms of scale, cost, data variety, future-proofing and other considerations.
When it comes to determining what kind of data will be stored, Alex implores the reader to save as much data as possible in its native format to create a kind of piggy bank for later use. While enterprises tend to throw a lot of their data away, Alex argues that what may be trash to one user today may become gold to another user tomorrow.
A big point Alex makes midway through Chapter 1 is that broad adoption of the data lake comes down to equally broad access. This means knowing how to serve different types of users, each with different needs and skill sets, something he promises to discuss in detail later in the book.
Alex also talks about the importance of applying an Amazon-style approach to the data “shopping” experience. This requires embracing tools that offer a familiar, user-friendly interface, faceted and contextual search, and the ability to sort data assets based on specific criteria.
Building the Data Lake
After discussing the requirements and pitfalls to avoid for a successful data lake, Alex then takes on how to actually build one. He discusses the three different architectural options—on-premises, cloud and logical—and how a data lake should be organized into different data zones, such as the raw, production, work and sensitive zones, which determine how data is treated and interacted with. Governance, he writes, should apply only to the data that needs to be governed, in the way it needs to be governed, rather than to all data regardless of its location or purpose.
The remaining pages in the chapter mostly focus on how to set up the data lake for self-service analytics. This involves finding and understanding the data, provisioning the data, and then preparing the data. Alex discusses each step in detail.
Summary: In Chapter 1, we learn how selecting the right platform, loading it with the right data, organizing it in the right way and setting it up for self-service puts us on the road toward achieving a successful data lake.
Check in next month for our overview of The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science: Chapter 2, which explores how to accomplish these tasks.
In the meantime, you can try to win a free signed copy of Alex’s book by joining next week’s free webinar, an expert panel discussion on How DataOps is Adding Value to Data Lakes, hosted by Waterline Data, Streamsets and Trifacta. Register today!