Big Data

2017 Prediction #2: Orgs Realize Hadoop Wasn’t the Panacea They Thought

Posted on January 24th, 2017 | Todd Goldman

As I continue these blog posts about my predictions for 2017, my mind wanders back to 2014, when Cloudera had just taken a massive investment from Intel and Hortonworks had just gone public. Ah yes, I remember those good old days when Hadoop was going to eliminate the need for ETL, and data quality work would no longer be necessary because the sheer volume of data would push quality issues down to the noise level.

Tangential thought: OK, before I move on, I have to comment on that last point about data quality. First of all, I am not making this up; people really did say this about data quality, people who were supposedly super smart but clearly still didn't understand basic math. If 10% of your data was of poor quality, the fact that Hadoop lets you process more data doesn't change that: 10% of your data will still be bad, you will just have more bad records in absolute terms. And if that 10% of bad data was a problem for you before, it will still be a problem. Yes, there are some kinds of analysis where you can ignore data quality, for example analysis that looks at data sets in aggregate and cares more about trend lines than about specific actions tied to specific transactions. But for the many analyses where each transaction counts, that approach does not work.
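
To make that arithmetic concrete, here is a toy illustration in Python; the 10% error rate and the record counts are made up, but the point holds for any fixed error rate:

```python
# Toy illustration: more volume does not dilute bad data, it just produces
# more bad records. The 10% rate and record counts below are made up.
bad_fraction = 0.10  # fraction of records that fail basic quality checks

for total_records in (1_000_000, 100_000_000, 10_000_000_000):
    bad_records = int(total_records * bad_fraction)
    print(f"{total_records:>14,} records -> {bad_records:>14,} bad ones (still {bad_fraction:.0%})")
```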

Also, I might as well continue my rant and comment on the ETL point. Yes, ETL tools as we knew them would not function in a Hadoop environment. But data would still need to be extracted from source systems. It would still need to be loaded into a Hadoop cluster. And since data would still come in lots of different formats, it would need to be transformed so you could combine different data from different sources. So while the old tools wouldn't be needed, the same concepts would be. The proof, by the way, is all the data ingestion and data wrangling startups that effectively do what the old ETL vendors used to do. Now back to my main thread…
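
To see why the concepts survive even when the old tools don't, here is a minimal sketch of those same extract, load, and transform steps written against Spark's Python API. The file paths, formats, and column names are hypothetical, purely for illustration:

```python
# Minimal sketch of classic ETL concepts running inside a Hadoop/Spark cluster.
# Paths, formats, and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-concepts-still-apply").getOrCreate()

# "Extract"/"Load": pull data from two differently formatted sources into the cluster.
orders_csv = spark.read.option("header", "true").csv("hdfs:///landing/orders/*.csv")
customers_json = spark.read.json("hdfs:///landing/customers/*.json")

# "Transform": normalize types and keys so the sources can be combined.
orders = orders_csv.withColumn("order_total", F.col("order_total").cast("double"))
customers = customers_json.withColumnRenamed("cust_id", "customer_id")

# Combine the sources and write the result back out for downstream analysis.
enriched = orders.join(customers, on="customer_id", how="left")
enriched.write.mode("overwrite").parquet("hdfs:///warehouse/enriched_orders")
```

Different startups wrap these steps in friendlier interfaces, but extract, load, and transform are all still there.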

Three years ago, Hadoop had all the answers. It would straighten your spine, whiten your teeth, and ensure the country won the war. Database vendors would go out of business, ETL vendors would go out of business, data warehouse vendors would go out of business, and so on and so forth. And while Hadoop did hurt plenty of legacy vendors, as customers slowed some of their investments to wait for the yellow elephant to take care of everything, the reality is that none of the legacy companies disappeared. And they won't disappear any time soon.

Why not? The same reason that mainframes are still around. Old technology dies very, very slowly. And in the case of data, it dies even more slowly. Reworking legacy systems is incredibly hard. People who know how to manipulate COBOL copybooks are rare, and the philosophy is often, “If it ain’t broke, don’t fix it.” In addition, Hadoop isn’t the only new technology trend on the scene. What this means is that where there used to be a single center of data gravity around relational databases, there are now three centers of gravity pulling data investment in three directions: Relational, Hadoop/Spark, and Cloud.

This means that if you want to build a data lake that combines all of your data for newfangled analysis, you have to think beyond just Hadoop and Spark. You need to think about an Enterprise Data Lake that truly encompasses all of your data. It also means you will have to figure out:

When will you use each data technology type?

When will you move data and when will you leave data in place?

How will you catalog all of that data, so you spend more time using the data and less time searching for it? (A rough sketch of what a catalog entry might capture follows this list.)

How will you eliminate redundant data to keep costs under control?

How will you govern all of that data to stay out of regulatory hell?
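
On the cataloging question above, here is a rough sketch of the kind of metadata a catalog entry might capture across relational, Hadoop/Spark, and Cloud sources. The fields and example values are hypothetical, not any particular product's model:

```python
# Hypothetical sketch of a catalog entry for a heterogeneous data lake.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str          # business-friendly dataset name
    platform: str      # "relational", "hadoop/spark", or "cloud"
    location: str      # JDBC URL, HDFS path, or cloud object-store URI
    owner: str         # who to ask about this data
    schema: dict       # column name -> type
    tags: list = field(default_factory=list)  # e.g. "PII", retention class

# Two hypothetical entries for related data living in different places.
catalog = [
    CatalogEntry(
        name="enriched_orders",
        platform="hadoop/spark",
        location="hdfs:///warehouse/enriched_orders",
        owner="analytics-team",
        schema={"customer_id": "string", "order_total": "double"},
        tags=["PII", "finance"],
    ),
    CatalogEntry(
        name="customer_master",
        platform="relational",
        location="jdbc:oracle:thin:@//crm-db:1521/CRM",
        owner="crm-team",
        schema={"customer_id": "string", "email": "string"},
        tags=["PII"],
    ),
]

# Spending more time using data and less time searching for it starts with
# being able to answer simple questions, like "where does PII live?"
print([entry.location for entry in catalog if "PII" in entry.tags])
```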

Those are just a few of the challenges brought about by the new world of relational, Hadoop/Spark and Cloud. The bottom line, however, is that very few organizations will become 100% Hadoop. So get used to a heterogeneous data lake environment that encompasses Hadoop, Relational and Cloud. Because if you want to innovate using data, remember that innovation happens at the intersection of different disciplines. Those folks who can figure out how to master these different technologies to bring data together in useful ways will be the ones who create new insights and win in the long run.