In talking with prospective customers, I am finding there is a lot of confusion about what makes a data catalog. Lately, we’ve seen some companies consider taking their search technology and calling their search product a “data catalog”, which is not entirely accurate. This is happening purely because the enterprise search market is being overrun by the open source phenom Solr, and companies that were once in the enterprise search space are looking for a new home.
With that said, there is a place for search as part of a data catalog solution. But when you think “data catalog” you should think about an entire stack of technologies and capabilities that go well beyond just search. Below is a list of the key capabilities of a data catalog solution:
Discovery and Automated Tagging: The first issue with a data catalog is getting it populated, which you can’t just do by crowdsourcing – there is too much data. As a result, you need some level of automatic population of the data catalog as well as automatic profiling and tagging of data.
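To make automatic profiling and tagging concrete, here is a minimal Python sketch. It is an illustration only, not how any particular catalog product works: the tag names, the regular expressions, the 0.8 threshold, and the function names `profile_column` and `suggest_tags` are all assumptions made up for this example.

```python
import re

# Hypothetical tag rules: tag name -> regex that a value must fully match.
TAG_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_zip": re.compile(r"\d{5}(-\d{4})?"),
}


def profile_column(values):
    """Collect simple objective statistics about one column."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "null_fraction": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
    }


def suggest_tags(values, threshold=0.8):
    """Suggest each tag whose pattern matches at least `threshold` of the non-empty values."""
    non_null = [str(v) for v in values if v not in (None, "")]
    if not non_null:
        return {}
    suggestions = {}
    for tag, pattern in TAG_PATTERNS.items():
        hits = sum(1 for v in non_null if pattern.fullmatch(v))
        score = hits / len(non_null)
        if score >= threshold:
            suggestions[tag] = score
    return suggestions


# Example: a column that is mostly email addresses, with one empty value.
column = ["a@example.com", "b@example.org", "", "c@example.net"]
profile = profile_column(column)
tags = suggest_tags(column)
```

A real discovery engine would sample data at scale and use far richer signals than two regular expressions, but the shape is the same: statistics and suggested tags are computed from the data itself, with no person in the loop.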
Crowdsourcing: Just as you can’t count on crowdsourcing alone, you also can’t count on automated discovery alone; both are necessary to build a functioning data catalog. In particular, automated discovery speeds up the process, while crowdsourcing corrects and improves the automation and fills in the gaps that automation can’t. So as it turns out, a combination of approaches works best.
Ratings and Reviews: On top of the objective information about data that comes from the profiling and discovery process, it is important to have human commentary about how a data set might be good for one use case and not for another. Once again, you want to combine ratings and reviews with the objective information collected by profiling; the combination of automated profiling and human judgment works best.
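One way to picture the combination is as a merge step in which human feedback overrides or extends machine suggestions. The sketch below is a simplified assumption, not a description of any product: the function name `merge_tags`, the `True`/`False` feedback convention, and the `source` labels are all hypothetical.

```python
def merge_tags(automated, feedback):
    """Combine automated tag suggestions with human feedback.

    automated: dict of tag -> confidence score from automated discovery
    feedback:  dict of tag -> True (confirmed or added by a person)
               or False (rejected by a person)
    """
    merged = {}
    for tag, score in automated.items():
        verdict = feedback.get(tag)
        if verdict is False:
            continue  # a person rejected the automated suggestion
        source = "confirmed" if verdict else "automated"
        merged[tag] = {"source": source, "score": score}
    for tag, verdict in feedback.items():
        if verdict and tag not in automated:
            # a person filled a gap that automation missed
            merged[tag] = {"source": "crowdsourced", "score": None}
    return merged


# Example: automation suggests two tags; a person rejects one and adds another.
result = merge_tags(
    {"email": 0.95, "us_zip": 0.81},
    {"us_zip": False, "customer_contact": True},
)
```

In this example the rejected `us_zip` tag disappears, the untouched `email` suggestion stays marked as automated, and `customer_contact` appears only because a person added it.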
Integration Interfaces: Output from the catalog can be consumed by lots of other tools including, but not limited to: business glossaries, data wrangling, business intelligence, data security and search. At Waterline Data, we are seeing companies integrate the data catalog into their overall data processing flows and begin to tie the data catalog into their own applications. The number of apps that a data catalog can be integrated into is limited only by the imagination of our customer base.
Search: After the data catalog components mentioned above have been implemented, we arrive at search. Search is certainly a killer app that can run on top of a catalog, like the other tools mentioned. By itself, however, search isn’t a data catalog; it is only one feature of a data catalog.
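A toy example shows why search depends on the rest of the catalog. The sketch below assumes a hypothetical list of catalog entries, each with a `name` and a list of `tags`; the function `search_catalog` and the sample entries are invented for illustration.

```python
def search_catalog(entries, query):
    """Return names of entries whose name or tags contain every query term."""
    terms = query.lower().split()
    results = []
    for entry in entries:
        haystack = " ".join([entry["name"], *entry["tags"]]).lower()
        if all(term in haystack for term in terms):
            results.append(entry["name"])
    return results


# Hypothetical catalog entries; the tags are what discovery and crowdsourcing produced.
entries = [
    {"name": "crm_contacts", "tags": ["email", "customer_contact"]},
    {"name": "web_logs", "tags": ["ip_address"]},
]

matches = search_catalog(entries, "customer email")
```

The query finds `crm_contacts` only because its tags were already populated. Without the tagging work behind it, the same search would have nothing but table names to match against.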
For those out there who think “search” is going to be a cheap replacement for a full data catalog, consider carefully the definition of a data catalog and the components described in this blog post, all of which are crucial to building a catalog. Most importantly, don’t be sold on search being a catalog in itself.
And if you want an even more detailed evaluation of what makes a data catalog, here is a link to a white paper by Intelligent Business Strategies.