If 2018 was about anything, it was about preparing big data for its next big phase. After several years of being stalled, stymied and even charged with being one big bust, big data is about to live up to its hype. Many of the CDOs and other data professionals I’ve spoken with in recent months agree we’re on the cusp of something truly “big”. These are the same organizations that have continued to make big investments in data, learning from their mistakes, applying those lessons, and adopting the technologies that are allowing them to blow past many of their remaining obstacles. It’s this new optimism, investment and innovation that’s bringing about the actual execution everyone’s been expecting for some time now, and it’s the same trifecta of opportunity that’s behind my top five big data predictions for 2019:
#5: Now Proven, AI and ML Will Dig Deeper into the Enterprise
A couple of years ago, as we brought the first ever data catalog to market, I spoke with a general tech reporter about automated data cataloging. The reporter responded with some skepticism: “That sounds like one big mess just waiting to happen.” The reporter, understandably, feared that automating the classification of data would lead to a lot of mislabeled information and faulty analysis. Since then, automation driven by artificial intelligence (AI) and machine learning (ML) has come to be viewed as a primary enabler of big data in the enterprise. In 2018, we watched as the time-consuming, costly and labor-intensive manual processes that had been holding up big data initiatives within organizations began to melt away. Automation, AI and ML, proven now not just in terms of speed but also accuracy, are being applied to more and more business functions. This fits into a broader trend: instead of hard-coding business processes and operations into software and then adjusting people and physical operations to match those predefined, rigid processes, organizations are dynamically adapting their business processes and operations to physical realities and historical learnings.
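To make the cataloging concern concrete, here is a deliberately simplified sketch of automated column classification. The pattern set and the `classify_column` helper are hypothetical illustrations; real catalogs combine ML models, data fingerprinting and human curation rather than a handful of regular expressions:

```python
import re

# Hypothetical, highly simplified tag patterns -- a real catalog uses
# far richer signals than value-shape regexes.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "zip_code": re.compile(r"^\d{5}$"),
}

def classify_column(values, threshold=0.8):
    """Tag a column with the first label that matches most of its values."""
    for label, pattern in PATTERNS.items():
        hits = sum(bool(pattern.match(v)) for v in values)
        if hits / len(values) >= threshold:
            return label
    return "unknown"

print(classify_column(["ann@example.com", "bo@example.org", "c@d.io"]))
```

The threshold is what keeps the reporter’s fear in check: a column is tagged only when the overwhelming majority of its values fit the pattern, and anything ambiguous is left as "unknown" for a human to review.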
For example, universities measure historical admission and acceptance trends to determine who is likely to accept admission and how much scholarships would affect their decision. Alternative credit risk analysis is performed to determine the creditworthiness of first-time or low-income borrowers. Customer churn is predicted from sentiment analysis of social media. Transportation routes are dynamically calculated based on real-time traffic and weather information. Key to all of these applications is the ability to create good, stable models, and the key to building good, stable models is being able to find the right data and create the right features. Imagine trying to predict the price of a house without knowing how big it is. We may try to glean its size from the size of the lot, but it is very unlikely that we will be able to build a stable predictive model. In 2019, you will see greater reliance on catalogs by AI and ML teams in order to find and understand the data needed to build those models.
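The house-price example can be sketched numerically. The code below is a toy simulation with synthetic data and made-up coefficients, not any real model: fitting price against lot size alone, a noisy proxy for house size, yields a visibly weaker fit than using the house size itself.

```python
import random
import statistics

random.seed(0)
n = 200
sqft = [random.uniform(800, 3500) for _ in range(n)]        # the right feature
lot = [s * random.uniform(1.5, 6.0) for s in sqft]          # noisy proxy for size
price = [50_000 + 150 * s + random.gauss(0, 20_000) for s in sqft]

def r_squared(x, y):
    """R^2 of a one-feature least-squares fit of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    alpha = my - beta * mx
    ss_res = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

print(r_squared(sqft, price))   # strong fit: size explains most of the price
print(r_squared(lot, price))    # much weaker fit from the proxy alone
```

The gap between the two fits is the whole point: without the right feature, no amount of modeling sophistication rescues the proxy, which is why finding the right data in a catalog comes first.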
#4: Say Hello to Hybrid Environments
Last year, I predicted that broad adoption of the cloud would finally force object stores to be hardened and properly governed, and that the new standards would require data governance that is cloud, location and platform agnostic. In 2019, you will see more organizations that are now comfortable with the cloud growing a hybrid, heterogeneous data estate that includes multiple fit-for-purpose big data, relational and NoSQL data stores, both on-premises and in the cloud. With a hybrid model in place, applications that work best on the public cloud can reside there; those that need to remain on-premises can do so. While this seems like it would create greater complexity, in 2019 you will see more and more solutions that abstract this complexity through location and compute transparency. They will provide consistent access, management and governance, which organizations will take advantage of to transparently move data to the appropriate storage tiers, elastically spin up required compute nodes, and in general provide consistent, governed and managed data access and usage. From file systems like MapR’s data fabric that create a single namespace to catalogs like Waterline Data that provide a single interface to all the data in the enterprise, end users will be increasingly shielded from the complexity of hybrid architectures while getting the full benefits of the fit-for-purpose, elastic solutions those architectures offer.
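As a rough sketch of what location transparency means, consider a resolver that hides the storage tier behind a logical URI. The backends, scheme names and paths here are invented placeholders for illustration, not any vendor’s API:

```python
# Toy in-memory stand-ins for an on-premises store and a cloud store.
ON_PREM = {"hr/employees.csv": b"id,name\n1,Ann\n"}
CLOUD = {"web/clickstream.json": b'{"clicks": 42}\n'}

# Dispatch table: the caller never sees which tier serves the data.
BACKENDS = {"onprem": ON_PREM.get, "cloud": CLOUD.get}

def read(uri):
    """Resolve 'scheme://path' against the right backend transparently."""
    scheme, _, path = uri.partition("://")
    data = BACKENDS[scheme](path)
    if data is None:
        raise FileNotFoundError(uri)
    return data

print(read("onprem://hr/employees.csv"))
```

Because callers depend only on the logical name, an administrator can move a data set between tiers by updating the dispatch table, without touching any consuming application.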
#3: It’s the Data Lake’s Great Return
Organizations have traditionally focused on the mechanics of creating and hydrating data lakes, frequently creating data swamps instead; 2019 will bring renewed focus to data lake adoption. I was recently at a Gartner conference where an analyst asked a large audience how many have a data lake, and almost every hand went up. He then asked how many are on a second-generation data lake, and about half the hands went up. This is very similar to what we experienced with data warehousing, where the initial generation of data warehouses was often misguided and lacked adoption, but taught organizations what was really required to create value and achieve broad adoption. I believe we are at the same stage with data lakes, and in 2019 the focus will turn from the mechanics of the data lake to making the data in the lakes findable, usable and governed, at scale and in an automated manner, powered by the new spate of AI-driven data catalogs and governance solutions. Even new data lakes will get rolled out in a much more deliberate manner, with clear initial use cases and usage and governance policies. We will also see more and more data lakes being built in, or migrated to, the cloud to take advantage of managed infrastructure, elastic storage and compute, and rich ecosystems. Additionally, we will see the beginning of adoption of virtual data lakes that span multiple systems: instead of building, maintaining and paying for large hydrated data lakes, organizations will create a catalog that makes all data look like it is in a data lake, but hydrate it on demand.
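The on-demand hydration idea can be sketched in a few lines. `VirtualCatalog` below is a hypothetical illustration, not a real product API: the catalog records where each data set lives, and data is fetched only when someone actually asks for it.

```python
class VirtualCatalog:
    """Toy sketch of a virtual data lake: register cheaply, hydrate lazily."""

    def __init__(self):
        self._entries = {}   # name -> loader function (where the data lives)
        self._cache = {}     # name -> hydrated data

    def register(self, name, loader):
        """Catalog a data set by name without copying it into the lake."""
        self._entries[name] = loader

    def get(self, name):
        """Hydrate on first access, then serve from the cache."""
        if name not in self._cache:
            self._cache[name] = self._entries[name]()
        return self._cache[name]

catalog = VirtualCatalog()
# Registration is metadata-only; the (made-up) loader runs only on access.
catalog.register("sales", lambda: [("2019-01", 1200), ("2019-02", 1500)])
print(catalog.get("sales"))
```

The design choice is the same trade the article describes: storage and transfer costs are paid per query instead of up front, at the price of first-access latency.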
#2: Big Data Becomes Little Data
No, organizations won’t be dumping their stockpiles of data wholesale, but they will in limited scope. With greater visibility into the data they have will come opportunities to rationalize and consolidate, yielding significant savings in storage costs and even more accurate analytics, since organizations will know which data is corrupted and can be jettisoned. But “becoming little” also speaks to large volumes of data that used to choke the organization now becoming manageable enough to put to use, thanks to the automation of key processes like cataloging.
#1: Explainability Will Emerge as Key AI Requirement
As more and more business (and government) is run using AI and ML algorithms, there will be more focus on transparency and explainability. Why was a mortgage denied? Can a bank prove that none of the legally protected demographic attributes (like race, gender and so forth) were used to make the decision or to train the model that made the decision? Using catalogs to find appropriate data sets and document their lineage and quality is the first step toward such transparency and explainability. If we do not know where data came from or what it means, we will not be able to explain the model or ensure its proper and legal operation.
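As a minimal illustration of one piece of that puzzle, here is a hypothetical pre-training check that a protected attribute never appears in a model’s feature set. Real explainability requires far more than this (lineage tracking, proxy-variable detection, model-level explanations); this sketch covers only the most direct case.

```python
# Hypothetical list of protected attributes -- the real set is defined
# by applicable law and the bank's compliance policy, not by this code.
PROTECTED = {"race", "gender", "age", "religion"}

def audit_features(features):
    """Reject a feature set that directly includes a protected attribute."""
    violations = PROTECTED & {f.lower() for f in features}
    if violations:
        raise ValueError(f"Protected attributes in feature set: {sorted(violations)}")
    return features

audit_features(["income", "debt_ratio", "payment_history"])  # passes
```

A check like this is cheap to run at training time, but it is only defensible if the catalog can also document where each feature came from; a column named `zip_code` can smuggle a protected attribute in through the back door, which is exactly why lineage matters.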