Machine Learning isn’t the Answer to Every Data Problem

“Machine Learning” is an extremely hot term these days.  People are talking about machine learning for almost everything.  It will make you more handsome, whiten your teeth and ensure your team wins the championship.  And while we at Waterline Data are big proponents of machine learning in the context of data catalogs, we also acknowledge that it won’t do everything.

What machine learning is good at is automatically examining the values in a column of data and determining what that column contains so it can be labeled and tagged.  Is it a name?  Is it an address?  Is it a customer number?  Is it a product ID?  It can also profile the data and provide objective statistics about how complete or consistent the data is.
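
To make the idea of tagging and profiling a column concrete, here is a minimal sketch in Python. It is not how Waterline Data or any particular catalog does it: simple regular expressions stand in for a trained model, and the pattern names, the `profile_column` and `suggest_tag` functions, and the 0.8 threshold are all assumptions made for illustration.

```python
import re

# Hypothetical patterns standing in for a trained classifier.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
    "product_id": re.compile(r"^[A-Z]{2,4}-\d{3,}$"),
}


def _non_empty(values):
    """Return the values that are neither None nor blank, as stripped strings."""
    return [str(v).strip() for v in values if v is not None and str(v).strip() != ""]


def profile_column(values):
    """Compute simple, objective statistics about a column."""
    total = len(values)
    present = _non_empty(values)
    return {
        "count": total,
        "completeness": len(present) / total if total else 0.0,
        "distinct": len(set(present)),
    }


def suggest_tag(values, threshold=0.8):
    """Suggest a tag when enough non-empty values match one pattern.

    Returns (tag, share); tag is None when no pattern reaches the threshold.
    """
    present = _non_empty(values)
    if not present:
        return None, 0.0
    best_tag, best_share = None, 0.0
    for tag, pattern in PATTERNS.items():
        share = sum(1 for v in present if pattern.match(v)) / len(present)
        if share > best_share:
            best_tag, best_share = tag, share
    if best_share >= threshold:
        return best_tag, best_share
    return None, best_share


column = ["AB-1234", "XY-0001", None, "QQ-999", ""]
print(profile_column(column))
print(suggest_tag(column))
```

The suggested tag is only a suggestion. A column that falls below the threshold gets no tag at all, which is exactly where a human curator would step in.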

Of course, most organizations have so much data that they can’t reasonably crowdsource the tagging of all of it.  There has to be some sort of smart algorithm that does much of the heavy lifting, along with human curation that corrects the algorithm’s errors.  That is where machine learning kicks in again.  When a human corrects a machine-generated tagging error, machine learning does its thing and learns from the correction.
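
As a rough illustration of learning from a human correction, the sketch below remembers the "shape" of the values in each column a curator has tagged and reuses those shapes to tag new columns. This is a deliberately tiny stand-in for a real learning algorithm; the `CorrectableTagger` class, the `shape` helper, and the sample tags are hypothetical.

```python
from collections import Counter, defaultdict


def shape(value):
    """Collapse a value to a coarse pattern, e.g. 'AB-1234' becomes 'AA-9999'."""
    out = []
    for ch in str(value):
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


class CorrectableTagger:
    """Suggests tags from value shapes seen in previously tagged columns."""

    def __init__(self):
        # tag -> counts of the value shapes observed under that tag
        self.shape_counts = defaultdict(Counter)

    def learn(self, values, tag):
        """Record the shapes of a column known to carry this tag."""
        for v in values:
            self.shape_counts[tag][shape(v)] += 1

    def predict(self, values):
        """Return the tag whose known shapes cover the most values, or None."""
        shapes = [shape(v) for v in values]
        if not shapes:
            return None
        best_tag, best_score = None, 0.0
        for tag, counts in self.shape_counts.items():
            score = sum(1 for s in shapes if s in counts) / len(shapes)
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag

    def correct(self, values, right_tag):
        """A human supplies the right tag; fold the column into what is known."""
        self.learn(values, right_tag)


tagger = CorrectableTagger()
tagger.learn(["12345", "90210"], "zip")

unknown = ["C-00017", "C-00042"]
print(tagger.predict(unknown))           # nothing known matches these shapes yet
tagger.correct(unknown, "customer_number")
print(tagger.predict(["C-00311"]))       # the correction now carries over
```

The point of the sketch is the feedback loop: one correction by a curator improves the suggestions for every similar column that follows.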

However, there are many things that machine learning cannot do, at least not yet.  And this is where ratings, reviews, and comments by humans about datasets come into the picture.  Machine learning isn’t much help with context: it cannot form an opinion about how useful a particular dataset was for a given purpose.  Adding ratings and reviews to the automated tagging in a data catalog fills that gap.

Was the data useful for the analysis I used it for?  Perhaps it is good for HR but not for finance?  Or maybe someone curated a dataset well and left a comment telling me that, instead of the dataset I am looking at right now, I should use a different one that is derived from it.  And while a machine learning algorithm can discover lineage and show the upstream sources and downstream consumers of a dataset, it is much easier if someone has already commented on the relative value of the various datasets in that data flow and I can just read their comments.

That said, it is nice to have objective profiling and quality information from an automated discovery process to validate the human comments.  So the automation and the ratings, reviews, and comments go hand in hand, each reinforcing the value of the other.

The bottom line is that automation, algorithms and machine learning are necessary to deal with the scale of data in today’s world.  But that doesn’t remove the added value of a human opinion and perspective at the right place and time.