One of the most frequent questions we receive about the Waterline Data Catalog is “what is data fingerprinting and how does it work?” Data fingerprinting is the idea that a column of data has a signature, or fingerprint, and that by examining the values in a column we can identify what that data is and answer two questions: 1) What other columns share this same fingerprint? 2) What business term or label can be connected to this data?
To address the second question, connecting a business term to an unlabeled or mislabeled column of data, Waterline Data Fingerprinting already works well for many common business terms, but to match new terms accurately, the system has to be trained. For example, it knows what a first name, a last name, or a credit card number is, but it doesn’t know what a “Claim Number” is for ACME Insurance, because the format of a claim number is unique to ACME. However, once a knowledgeable business user or data steward tags just one column as a “Claim Number,” the system learns what a claim number looks like, and that tag or business term is propagated automatically to every other unlabeled column with the same characteristics, that is, the same fingerprint.
The reason tagging is so powerful is that you only have to tag a unique attribute once; the system then learns and propagates the tag automatically. This makes tagging large volumes of data, connecting the technical metadata to the business term, remarkably efficient. Populating a data catalog becomes much easier, and the data you need to do your job is findable sooner. Tagging also speeds up the identification and masking of sensitive data, as well as the elimination of redundant data.
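The propagation step can be sketched as follows. This is a toy illustration, not Waterline’s actual API: the fingerprint strings and function names are invented, and real fingerprint matching is approximate rather than an exact lookup.

```python
def propagate_tags(columns: dict, seed_tags: dict) -> dict:
    """Propagate business terms to every column sharing a fingerprint.

    columns:   column name -> fingerprint (hypothetical opaque string)
    seed_tags: column name -> business term supplied by a data steward
    """
    # Learn which fingerprint each steward-supplied term belongs to.
    fp_to_term = {columns[name]: term for name, term in seed_tags.items()}
    # Apply that term to every column with a matching fingerprint.
    return {name: fp_to_term[fp] for name, fp in columns.items() if fp in fp_to_term}

columns = {"CLM": "fp-7a2c", "C01": "fp-7a2c", "policy_id": "fp-91d0"}
print(propagate_tags(columns, {"CLM": "Claim Number"}))
# {'CLM': 'Claim Number', 'C01': 'Claim Number'}
```

Tagging “CLM” once is enough: “C01” inherits the term because it shares the same fingerprint, while “policy_id” is left alone.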
So how is the fingerprint itself formed? There are two main techniques for connecting a business term to a column of data. The first uses regular expressions. This works well for something like a credit card number, where we expect a fixed number of digits arranged in a predictable, repeatable structure.
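As a rough illustration (a deliberately simplified pattern, not Waterline’s production rule), a regular expression for 16-digit card numbers might look like this:

```python
import re

# Simplified: four groups of four digits, optionally separated by a
# space or dash. A production matcher would also validate the Luhn
# checksum and known issuer prefixes, and handle other card lengths.
CARD_PATTERN = re.compile(r"^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$")

def looks_like_card_number(value: str) -> bool:
    return bool(CARD_PATTERN.match(value.strip()))

column = ["4111-1111-1111-1111", "5500 0000 0000 0004", "not a card"]
matches = sum(looks_like_card_number(v) for v in column)
print(f"{matches}/{len(column)} values match")  # 2/3 values match
```

In practice a fingerprinting system would apply a pattern like this to a sample of the column and label the column only when the match rate clears some threshold.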
The second is what we call “value-based” fingerprinting, which is what the “Claim Number” example above would use. By combining basic profiling information, such as the column’s data type, its value distributions, and so on, in a specific way, we establish a mathematical formula that converts that profiling information into a unique identifier or fingerprint: the column signature mentioned earlier. We can then compare identifiers and find other columns that match. The trick is getting the signature algorithm just right to avoid both over- and under-matching.
This is the approach used for cities, for example. We simply take a list of cities, use it to create a signature, and whenever the system sees a column with a similar signature, it applies the same label.
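A toy version of this idea, assuming nothing about Waterline’s actual algorithm: treat a column’s “signature” as its set of normalized distinct values and compare signatures with Jaccard similarity. A real system would fold in data types, value distributions, and other profiling statistics rather than raw values alone.

```python
def signature(values):
    """Reduce a column to a set of normalized distinct values."""
    return {v.strip().lower() for v in values}

def similarity(sig_a, sig_b):
    """Jaccard similarity: overlap divided by total distinct values."""
    return len(sig_a & sig_b) / len(sig_a | sig_b)

# Signature learned from a reference list of cities.
city_signature = signature(["Paris", "Berlin", "Madrid", "Rome", "Lisbon"])

# A new, unlabeled column encountered during cataloging.
candidate = signature(["paris", "rome", "berlin", "oslo"])

score = similarity(city_signature, candidate)
print(round(score, 2))  # 0.5, enough overlap to suggest the "Cities" label
```

A catalog would then suggest the “Cities” label whenever the score clears a tuned threshold; setting that threshold too low over-matches and too high under-matches, which is exactly the balancing act described above.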
But what happens if your data set reflects a more granular understanding of cities? Perhaps you have separate columns for “French Cities”, “German Cities”, or “California Cities”. The first time Waterline Data Fingerprinting sees these columns, it will label them simply “Cities”, because that is all it knows. However, a user can then indicate that a column it just labeled isn’t merely “Cities” but more specifically “California Cities”. The machine learning algorithm updates itself, and from then on the system will correctly label other columns containing only Californian cities, based on the updated, more precise signature.
Value-based fingerprinting is very powerful because it isn’t fooled by poorly labeled data. It doesn’t care if a column is labeled “C01”, “CLM” or “Claim Number”.
However, it is important to note that Waterline will take advantage of good labels. In fact, accurate column labels are used to validate Waterline’s suggestions. So for two columns with exactly the same data, the one labeled “Claim Number” will get a higher confidence score than the one labeled “C01”. That is, the algorithm will identify both columns and tag them “Claim Number”, but it might rate the well-labeled column at 96% confidence and the generically labeled one at 92%. If, on the other hand, the column labels for first and last name are swapped, the algorithm trusts the data values and ignores the labels.
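One hypothetical way to model this, with the bonus weight invented purely to reproduce the 96%/92% example above: start from the score produced by matching the data values, then add a small bonus when the column label agrees with the suggested term.

```python
def label_bonus(column_name: str, term: str) -> float:
    """A small, invented bonus when the column label matches the term."""
    name = column_name.replace("_", " ").lower()
    return 0.04 if term.lower() in name else 0.0

def confidence(value_score: float, column_name: str, term: str) -> float:
    """Combine the value-based match score with the label bonus."""
    return min(1.0, value_score + label_bonus(column_name, term))

# Same data, so the same value-based score; only the labels differ.
print(round(confidence(0.92, "claim_number", "Claim Number"), 2))  # 0.96
print(round(confidence(0.92, "C01", "Claim Number"), 2))           # 0.92
```

The key design point is that the label only adjusts confidence and never overrides the values, which is why a swapped first-name/last-name pair still gets tagged correctly.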
Examining the actual data values, whether via regular expressions or value-based fingerprinting, is critical because data is often mislabeled. While only a small percentage of your data might be mislabeled, the only way to find out is to look at the values themselves; manual inspection doesn’t work when a column holds a million rows. Examining the data values also lets you automate the identification of sensitive data that needs to be secured, or even just cataloged, to comply with government regulations like the EU GDPR.
The important point in all of this is that, on the surface, data fingerprinting is a pretty simple concept. The devil, as usual, is in the details that let the system scale, keep it robust, and prevent both over- and under-matching. Hopefully this gives you enough information to understand the concept behind the magic of Waterline Data Fingerprinting.