Data Fingerprinting Part II: Automatically Inferring Data Lineage

Posted on September 26th, 2017 | Todd Goldman

In my earlier post, I wrote about how Waterline uses a combination of data profiling and machine learning in a process called data fingerprinting, which automatically tags columns of data with the appropriate business term regardless of the technical column name. An additional aspect of the process is to use that fingerprinting information to derive data lineage.

Why do we care about lineage when building a data catalog? The key reason is that lineage provides another perspective on data quality beyond what is found in the profiling information. While profiling provides basic statistics about a given column of data, lineage gives perspective on the upstream data source as well as downstream uses of the data. This allows a user of the catalog to decide whether to get closer to the source or instead use a version of the same data that has been further refined. Looking at the lineage helps users make those kinds of decisions quickly.

Now, in many cases you may already know the lineage from your ETL or HQL code, or because you are tracking lineage in Apache Atlas or Cloudera Navigator. In that case, there is no need to derive much of the lineage; all you have to do is import that information into the data catalog. However, developers very often hand-code data movement with scripts that don't integrate well with standard toolsets. In that case, there needs to be a way to augment the factual information you have about lineage. The question, then, is how does this work?

We begin with what we already know. Why? Because if we already know that Dataset A-prime is the child of Dataset A, that is one less combination of datasets we need to look at. Part of the challenge in solving this kind of problem is the sheer number of combinations to examine. Likewise, if we know that an entire database of thousands of tables was moved from Oracle into HDFS, just letting the lineage discovery algorithms know this happened makes it much easier for them to derive lineage down at the table and column level. In general, we want to start from the perspective of the child and look backwards upstream to identify the parent or parents.
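The pruning idea above can be sketched as follows. This is a minimal illustration, not Waterline's actual implementation: it assumes datasets are identified by name and that known parent/child relationships (imported from ETL metadata, Atlas, or Navigator) arrive as a set of pairs.

```python
from itertools import combinations

def candidate_pairs(datasets, known_lineage):
    """Yield only the dataset pairs that still need lineage analysis.

    known_lineage is a set of (parent, child) tuples already imported
    from trusted sources. Skipping known pairs shrinks the otherwise
    quadratic search space before any column comparison happens.
    """
    known = {frozenset(pair) for pair in known_lineage}
    for a, b in combinations(datasets, 2):
        if frozenset((a, b)) not in known:
            yield a, b

# "A" -> "A_prime" is already known, so only pairs involving "B" remain
pairs = list(candidate_pairs(["A", "A_prime", "B"], {("A", "A_prime")}))
```

The same trick scales up a level: a known database-to-HDFS migration eliminates whole groups of table pairs at once before column-level comparison begins.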

Then, once we begin the lineage discovery process, we start looking for vertical and horizontal copies of data. By vertical copy, we mean columns of data in one dataset that appear in a second dataset. By horizontal copy, we mean a subset of rows of data, such as rows where state = 'California', which allows us to find where a subset of data might have come from.
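The two checks can be sketched roughly as set comparisons. The threshold and helper names below are illustrative assumptions, not Waterline's actual algorithm, which would work from profiled fingerprints rather than raw values.

```python
def is_vertical_copy(child_col, parent_col, threshold=0.95):
    """Flag a child column as a likely vertical copy of a parent column
    when (nearly) all of its distinct values appear in the parent.
    The 0.95 threshold is an illustrative assumption."""
    child_vals = set(child_col)
    if not child_vals:
        return False
    overlap = len(child_vals & set(parent_col)) / len(child_vals)
    return overlap >= threshold

def is_horizontal_copy(child_rows, parent_rows):
    """Flag a child dataset as a likely horizontal copy (row subset)
    of the parent when every child row also occurs in the parent."""
    parent_set = {tuple(r) for r in parent_rows}
    return all(tuple(r) in parent_set for r in child_rows)

parent = [("Ann", "CA"), ("Bob", "NY"), ("Cho", "CA")]
child = [r for r in parent if r[1] == "CA"]   # the "state = California" slice
# is_horizontal_copy(child, parent) -> True
```

In practice these comparisons run over column fingerprints and samples rather than full datasets, which is what keeps the pairwise search tractable.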

Any hints that reduce the problem set significantly speed up the process. For example, for Hive and HDFS (when there is no lineage in Atlas or Navigator), we can use file timestamps as an optimization hint for which files are likely parents or children. This doesn't work when looking at data moving between Oracle and HDFS, but in this kind of exercise, data is generally moving into HDFS anyway.
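The timestamp hint amounts to a simple filter: a parent file must have been written no later than its child copy, so any candidate newer than the child can be skipped. A hedged sketch, where file paths and the name-to-mtime mapping are illustrative assumptions:

```python
def likely_parents(child_mtime, candidates):
    """Filter candidate parent files by modification time.

    candidates maps file path -> mtime (epoch seconds). Files written
    after the child cannot be its source, so they are pruned from the
    pairwise comparison. Names here are hypothetical examples.
    """
    return [path for path, mtime in candidates.items()
            if mtime <= child_mtime]

candidates = {
    "/data/raw/orders": 1_000,      # written before the child: keep
    "/data/reports/summary": 2_000, # written after the child: skip
}
likely_parents(1_500, candidates)  # -> ["/data/raw/orders"]
```

On a real HDFS deployment the mtimes would come from the NameNode's file metadata rather than a hand-built dictionary.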

This process can derive lineage in many cases. It can even make very good educated guesses at joins. What it can't do is find data that has been transformed. That said, most data doesn't get transformed and is moved as a direct copy anyway, so the process works quite well in most cases. Of course, it isn't perfect, which is why Waterline visually represents lineage as dashed lines (see Figure 1) when it is proposed by the algorithms.

Figure 1: Dashed lines indicate inferred lineage; solid lines indicate reviewed or imported lineage

Lineage appears as solid lines when it has either been imported from trusted sources like Atlas or an ETL tool, or when it has been approved by a data steward (see Figure 2).

Figure 2: Accept/Reject lineage dialog box for a data steward

The important point about inferred data lineage is that, while it isn't perfect as noted above, it is much better than attempting the same task by hand. In a scalability test, lineage was derived for 4,320 tables averaging 150 columns each in a little under three hours. Try doing that by hand!

In the context of building a data catalog that helps business users quickly find the data they need to do their jobs, inferred lineage is a useful tool for filling the gaps when fully documented lineage is unavailable.

Stay tuned for future posts that will continue to look at the strengths and, yes, shortcomings of data fingerprinting.