Sensitive Data Discovery
Don't Let an Incident Become a Breach
How can you find a needle if the hay isn’t even stacked? Amid all the noise around protecting an enterprise from regulatory compliance failure, most enterprises are still stuck in the barn at step one: discovery. While enterprises work to determine where data is located, what it contains, and how it can help, complications in discovery can result in regulatory non-compliance. Ignorance makes no enterprise immune from General Data Protection Regulation (GDPR) or California Consumer Privacy Act (CCPA) violations, customer backlash, or loss of shareholder value in the event of a breach or loss. Will your hay feed the future of your enterprise or cause an unnecessary (and distracting) barn fire?
Considering how ubiquitous the lack of control over sensitive data is, it’s no wonder that sensitive data discovery tools are growing in popularity. Many large enterprises have billions of columns with missing, inconsistent, or obscure descriptions of their data.
For the enterprise petabyte data ecosystem, discovering and tagging sensitive data is a monumental undertaking:
- Goldman Sachs has more than 2.5 billion columns of data
- Fannie Mae receives more than 10 million data sets per day
- HSBC has 50 different data lakes, each containing more than a petabyte
- Kaiser has an estimated 4.1 billion columns of data
If all of that data is not accurately classified and tagged, then how can enterprises know whether it contains sensitive information? And if an enterprise doesn’t have an automated process for tagging data intake, then how much is the problem compounded each day?
Sensitive Data Discovery Management
Regulations Civilize the Data Wild West
Embrace a highly regulated future. GDPR and CCPA are just the beginning; other countries and states will come with their own regulations, and compliance will not be optional. All organizations must adopt the mindset that customers own their data and have the right to have it deleted or returned to them. To continue developing insight based on consumer data, enterprises must protect all customer personally identifiable information (PII), protected health information (PHI), cardholder data (CD), and more, spanning data as varied as buying preferences, intellectual property, credit card information, music preferences, shirt sizes, salary information, and zip codes.
Consequences for non-compliance can be severe. Although CCPA caps fines at $7,500 per violation, the total can quickly add up to millions of dollars when violations span petabytes of data. And Google’s 50 million euro fine for GDPR infractions has many wondering not if, but when, similar regulation will be instituted nationally in the U.S.
The regulatory fines themselves are a small fraction of the total cost of non-compliance. The majority of non-compliance costs involve loss of customer trust, bad press, depressed stock valuations, and diversion from the organization’s focus. When the outside world sees one non-compliance issue, it questions whether there are more, and whether the issue is indicative of larger, more wholesale internal problems.
The Core of the Discovery Problem
You collect or buy sensitive data because you need it to run your business. From identifying customer trends to marketing to prospective customers, advancing product development, and monitoring payment history, enterprises need data to make smarter decisions, improve the customer experience, and develop the next big thing. None of that will be possible if the data is treated irresponsibly, beginning with not knowing what data is housed within the organization or where it’s located.
There’s a wide operational gap that’s causing this problem. Business and data stewards express their needs and requirements in terms that don’t align to technical metadata in databases and data lakes on-prem and in the cloud. This flawed process results in manually tagged data that cannot be identified because metadata is obscure, inconsistent, or missing altogether. Manual tagging is already fraught with inefficiencies, and at a petabyte volume it’s disastrous, causing operational inefficiency, lost productivity, and unreliable search results.
What are Sensitive Data Discovery Tools?
Two processes are essential to protecting sensitive data: discovery and classification. Enterprises need sensitive data discovery tools that automate both processes, combining AI and machine learning, which automatically discover and tag data with business terms, with user-driven curation and training.
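To make the classification step concrete, here is a minimal sketch of rule-based discovery: sample a column’s values and tag the column when most values match a known sensitive-data pattern. The patterns, function name, and threshold below are illustrative assumptions, not the logic of any actual product; real tools layer machine learning and context on top of simple pattern matching.

```python
import re

# Illustrative patterns for a few common sensitive-data types
# (hypothetical examples; real detectors are far richer).
PII_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "credit_card": re.compile(r"^\d{4}([ -]?\d{4}){3}$"),
}

def classify_column(sample_values, threshold=0.8):
    """Tag a column with a sensitive-data type if most sampled values
    match that type's pattern; return None if nothing matches well."""
    best_tag, best_ratio = None, 0.0
    for tag, pattern in PII_PATTERNS.items():
        matches = sum(1 for v in sample_values if pattern.match(v))
        ratio = matches / len(sample_values) if sample_values else 0.0
        if ratio >= threshold and ratio > best_ratio:
            best_tag, best_ratio = tag, ratio
    return best_tag

classify_column(["123-45-6789", "987-65-4321"])  # → "ssn"
```

Sampling values rather than scanning every row is what makes an approach like this feasible at petabyte scale, at the cost of some statistical uncertainty.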
As part of Waterline Data’s data catalog product, a sensitive data discovery tool named Aristotle collects hundreds of data points about each data field and saves them as a “fingerprint” in the Waterline data catalog. This unique fingerprint is an approximation of the values contained in each field and can easily be used to find similar fields. Aristotle then applies machine learning to cluster and group similar fields, often grouping thousands of similar fields together. After it learns the business contents of one field from existing metadata or a business user, it propagates that label across all the other fields.
Users then passively and continually train the data catalog as they go about their normal tasks, such as searching for data. If they see an incorrect or missing tag, they can easily correct it. Aristotle’s machine learning learns from this action and continuously improves its ability to recognize and accurately tag data. Users also add value to the data catalog as they rate, comment, and collaborate. The end result is a very efficient method for tagging and cataloging all data.
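The fingerprint-cluster-propagate flow described above can be sketched in a simplified form: summarize each field’s values as a few statistics, treat fields with close statistics as similar, and copy a confirmed label to every similar unlabeled field. The features, similarity test, and function names below are illustrative assumptions, not Aristotle’s actual (proprietary) algorithm.

```python
from collections import Counter

def fingerprint(values):
    """Build a crude fingerprint approximating a field's contents:
    average value length plus character-class mix."""
    lengths = [len(v) for v in values]
    classes = Counter()
    for v in values:
        for ch in v:
            if ch.isdigit():
                classes["digit"] += 1
            elif ch.isalpha():
                classes["alpha"] += 1
            else:
                classes["other"] += 1
    total = sum(classes.values()) or 1
    return {
        "avg_len": sum(lengths) / len(lengths),
        "digit_frac": classes["digit"] / total,
        "alpha_frac": classes["alpha"] / total,
    }

def similar(fp_a, fp_b, tol=0.1):
    """Treat two fingerprints as similar if every feature is within
    `tol` (lengths compared on a normalized scale)."""
    len_gap = abs(fp_a["avg_len"] - fp_b["avg_len"]) / max(fp_a["avg_len"], fp_b["avg_len"])
    return (len_gap <= tol
            and abs(fp_a["digit_frac"] - fp_b["digit_frac"]) <= tol
            and abs(fp_a["alpha_frac"] - fp_b["alpha_frac"]) <= tol)

def propagate_label(labeled_fp, label, unlabeled):
    """Copy a confirmed label to every similar unlabeled field."""
    return {name: label for name, fp in unlabeled.items() if similar(labeled_fp, fp)}
```

Once a user confirms that one SSN-shaped field is indeed an SSN, `propagate_label` tags every field with a matching fingerprint in one pass, which is how a single correction can label thousands of similar fields across tables and lakes.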
Make Hay While the Sun Shines
Using sensitive data discovery tools to identify and classify data helps enterprises achieve three primary objectives:
- Do more with the data you already have (after you know what the data is).
- Secure the data to meet regulatory demands and prospect/customer/shareholder expectations.
- Be ready for the future, as more state/region and global regulation is coming. It’s inevitable.
Enterprises that engage sensitive data discovery tools now to gain control of existing and incoming data will have a competitive advantage over those that get bogged down by non-compliance in the very near future.