Video Data Catalog

The Essence of Data Catalogs

Analyst on the Essence of Data Catalogs

If we don’t know what data we have,
If we don’t know where to find that data, and
If we don’t have the metadata to manage it effectively,
Analytics is doomed to fail.
     – Dave Wells, Eckerson Group

In this 3-minute excerpt from a recent webcast, hear Dave Wells’s concise explanation of how a data catalog fits in the analytics ecosystem. Click below to view the video.

Video Transcript

Okay, data cataloging manages the inventory of datasets by collecting and maintaining the metadata that is the critical infrastructure of the analytics ecosystem. Let me repeat that: the critical understructure of the analytics ecosystem. If we don’t know what data we have, if we don’t know where to find that data, if we don’t have the metadata to manage it effectively, analytics is doomed to failure.

The data catalog connects people with data. It’s designed to help people find, evaluate, understand, and acquire the data that they need to do their jobs. This is especially important in the world of self-service analytics, where many of those line-of-business analysts don’t know where all the data resides or what datasets exist, and they spend a tremendous amount of time seeking data. So, a catalog is especially valuable to self-service data consumers.

IT organizations can’t possibly provide all of the data needed by the growing numbers of people who analyze data. So, we’re working today in a world where most self-service analysts are working blind. They don’t have visibility into the datasets that exist. They can’t see the contents of those datasets. They don’t fully understand the quality and the usefulness of each.

As a result, they spend too much time searching for data and working to understand it. Sometimes they misunderstand. Frequently they recreate datasets or variations of datasets that already exist. So, we create yet another problem with redundancy. And any data management professional knows that with redundancy comes inconsistency. And after all these years that we’ve spent integrating data, this takes us on a path toward disintegration.

Self-service analysts frequently don’t work with the best-fit datasets. They work with those that they can find, those that they already know about, or those that they learn about by asking co-workers and colleagues, finding datasets through tribal knowledge. This is just not an effective way to do it, and data cataloging sets that aside.

The CEO of a healthcare company where I was consulting said this to me. He said, “I don’t get any real data analysis. My analysts spend 80% of their time finding data, and another 15% fixing the data. That doesn’t leave much time for analysis.”

In this extreme case, the analysts were finding data by word-of-mouth, then transforming, blending, and analyzing that data with Excel. So, this gentleman, Mike Blackwood, was saying it takes 95% of the time just to get ready to analyze. A more commonly sized distribution of work is 80% of the time for finding and evaluating, and 20% for analysis. But, that’s still wrong. That’s still not the numbers we want.

Data cataloging can turn those numbers around: 20% finding and evaluating data, 80% analyzing. We turn those numbers around by eliminating the waste and the rework that’s inherent in the trial-and-error and tribal-knowledge processes for seeking data.
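
To make the percentages in the transcript concrete, here is a minimal Python sketch of the arithmetic. The 40-hour work week and the function name `analysis_hours` are assumptions added for illustration; only the percentages come from the transcript.

```python
def analysis_hours(prep_percent: int, week_hours: float = 40.0) -> float:
    """Return the hours left for analysis after finding and preparing data.

    prep_percent is the share of time spent finding, evaluating, and
    fixing data. The 40-hour week is an illustrative assumption.
    """
    if not 0 <= prep_percent <= 100:
        raise ValueError("prep_percent must be between 0 and 100")
    return week_hours * (100 - prep_percent) / 100


# Percentages quoted in the transcript.
scenarios = {
    "extreme case (80% finding + 15% fixing)": 80 + 15,
    "common split (80% finding and evaluating)": 80,
    "target with a catalog (20% finding and evaluating)": 20,
}

for label, prep_percent in scenarios.items():
    hours = analysis_hours(prep_percent)
    print(f"{label}: {hours:.0f} hours of analysis per week")
```

Under that assumption, moving from the common 80/20 split to the 20/80 target raises analysis time from 8 to 32 hours a week, a fourfold increase, while the extreme case leaves only 2 hours.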