
Machine Learning isn’t the Answer to Every Data Problem

Posted on November 14th, 2017 | Todd Goldman

“Machine Learning” is an extremely hot term these days. People are talking about machine learning in just about every context. It will make you more handsome, whiten your teeth and ensure your team wins the championship. And while we at Waterline Data are big proponents of machine learning in the context of data catalogs, we also must acknowledge that it can’t do everything.

What machine learning is good at is automatically examining the data values in a column and determining what that column contains, so it can be labeled and tagged. Is it a name? Is it an address? Is it a customer number? Is it a product ID? It can also profile the data and provide objective statistics about how complete and how consistent the data is.
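To make the idea concrete, here is a minimal sketch of tagging and profiling a single column. It is not how any particular catalog product works: it swaps in simple regular-expression rules where a real catalog would use a trained model, and the patterns, the `profile_column` name and the 80% threshold are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical value patterns, invented for this example only.
PATTERNS = {
    "customer_number": re.compile(r"C\d{6}"),
    "product_id": re.compile(r"[A-Z]{3}-\d{4}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[a-z]+"),
}


def profile_column(values, min_share=0.8):
    """Suggest a tag for a column and report how complete it is."""
    # Completeness: share of values that are neither None nor blank.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    completeness = len(present) / len(values) if values else 0.0

    # Count how many non-empty values fully match each pattern.
    counts = Counter()
    for v in present:
        for tag, pattern in PATTERNS.items():
            if pattern.fullmatch(str(v).strip()):
                counts[tag] += 1

    # Only suggest a tag when enough of the column agrees on it.
    tag, share = None, 0.0
    if present and counts:
        best, hits = counts.most_common(1)[0]
        share = hits / len(present)
        if share >= min_share:
            tag = best
    return {"tag": tag, "match_share": share, "completeness": completeness}


print(profile_column(["C123456", "C000001", None, "C999999", ""]))
```

The share of matching values doubles as a rough consistency score: a column where only two thirds of the values look like product IDs gets no tag at all, which is the kind of case a human curator would want to look at.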

Of course, most organizations have so much data that they can’t reasonably crowdsource the tagging of all of it. There has to be some sort of smart algorithm that does most of the heavy lifting, along with human curation that corrects the algorithm’s errors. That is where machine learning kicks in again: when a human corrects a machine-generated tag, the algorithm learns from the correction.
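The correction loop can be sketched in a few lines. This is a toy, memory-based learner rather than a real machine learning model, and the `CorrectableTagger` class, the `shape` helper and the sample values are hypothetical names and data chosen for illustration. It remembers the character "shape" of values a human has tagged and reuses that knowledge on columns it has not seen.

```python
from collections import Counter, defaultdict


def shape(value):
    """Reduce a value to its character shape, e.g. 'AB-12' -> 'AA-99'."""
    out = []
    for ch in str(value):
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


class CorrectableTagger:
    """Toy tagger that learns only from tags a human has confirmed."""

    def __init__(self):
        # Maps a value shape to a count of the tags humans gave it.
        self.votes = defaultdict(Counter)

    def learn(self, values, tag):
        """Record a human-supplied (or human-corrected) tag for a column."""
        for v in values:
            self.votes[shape(v)][tag] += 1

    def predict(self, values):
        """Return the most-voted tag for these values, or None if unknown."""
        tally = Counter()
        for v in values:
            tally.update(self.votes.get(shape(v), {}))
        return tally.most_common(1)[0][0] if tally else None


tagger = CorrectableTagger()
print(tagger.predict(["555-1234"]))          # nothing learned yet
tagger.learn(["555-1234", "555-9876"], "phone")  # a curator supplies the tag
print(tagger.predict(["123-4567"]))          # same shape, new column
```

The point of the sketch is the division of labor: the human supplies judgment once, and the algorithm applies it at a scale no team of curators could match.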

But again, there are many things that machine learning cannot do, at least not yet. And this is where ratings, reviews and comments by humans about data sets come into the picture. Machine learning isn’t very useful when it comes to understanding context or forming an opinion about how useful a particular data set was for a given purpose. This is where adding ratings and reviews to the automated tagging in a data catalog can help.

Was the data useful for the analysis I used it for? Perhaps it is good for HR but not for Finance? Or maybe someone has curated a cleaner data set derived from the one I am looking at right now, and has left a comment telling me to use that one instead. And while a machine learning algorithm can discover lineage and show the upstream sources and downstream consumers of a data set, it is much easier if someone has commented on the relative value of the various data sets in that data flow, so I can just read their comments.

That said, it is nice to have objective profiling and quality information from an automated discovery process to validate the human comments. So the automation and the ratings, reviews and comments go hand in hand, each reinforcing the value of the other.

The bottom line is that automation, algorithms and machine learning are necessary to deal with the scale of data in today’s world. But that doesn’t remove the added value of a human opinion and perspective at the right place and time.