Data Governance Compliance with Waterline Data
The European Union (EU) published the General Data Protection Regulation (GDPR) in May 2016, and it went into effect on May 25, 2018. Almost immediately we started to see rulings and fines, from smaller cases such as the one involving ICANN's internet domain registration system, to a €50M fine levied against Google and, more recently, a proposed £183M fine against British Airways following a data breach. Data privacy regulations have been around for a long time, but suddenly the regulations, and the regulators, have teeth, and data governance software is more important than ever before.
GDPR applies to the processing and protection of the personal data of all data subjects, including customers, employees, prospects, and any other people about whom we collect data. The regulation applies to organizations collecting, retaining, and processing information about data subjects who reside in the European Union or Special Member State territories. Non-compliance with GDPR may result in huge fines, which can reach €20M or 4 percent of an organization’s worldwide revenues, whichever is higher.
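As a rough illustration of that ceiling, the arithmetic can be sketched in a few lines of Python. The function name is hypothetical, and the figure it returns is only the upper bound described above, not a prediction of any actual penalty.

```python
def gdpr_max_fine_eur(worldwide_revenue_eur: float) -> float:
    """Upper bound described above: EUR 20M or 4% of revenue, whichever is greater."""
    four_percent = worldwide_revenue_eur * 4 / 100
    return max(20_000_000.0, four_percent)


# For EUR 100M in revenue, 4% is EUR 4M, so the EUR 20M figure is the larger one.
print(gdpr_max_fine_eur(100_000_000))
# For EUR 2B in revenue, 4% is EUR 80M, which exceeds EUR 20M.
print(gdpr_max_fine_eur(2_000_000_000))
```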
In addition to GDPR, CCPA and similar data privacy initiatives and other regulations have been enacted, ranging from data residency laws to industry-specific regulations. No one expects the pace of regulation to slow down. A strong data governance program, including the use of a scalable data governance software platform, is a pivotal part of the landscape for organizations wishing to protect their brand reputation and bottom line. The traditional data governance disciplines of data ownership, metadata management, data quality management, and model governance are critical to GDPR compliance.
Comparing GDPR and CCPA
According to PwC, CCPA is the beginning of “America’s GDPR.” Similar to GDPR, the CCPA will require organizations to focus on user data and provide transparency in how they’re collecting, sharing, and using such data. Certain CCPA requirements overlap with the existing GDPR individual rights requirements, which may give GDPR-ready organizations a head start on building a capability around user-data handling practices.
Over the past decade, Waterline Data has worked with a wide variety of data governance software and management tools that help manage artifacts such as policies, data taxonomy, data owners, critical data elements, and data standards that are equally applied to GDPR, and recently CCPA. However, most current data governance software platforms are useful only at the governance process and policy level. They tell you what kinds of data are sensitive, or what combinations might be considered sensitive, and they provide guidance to data security professionals about whether data should be masked, de-identified, or completely locked down and restricted.
Data governance and management software can tell you what kinds of data should be considered to be sensitive (e.g., national ID, first name, and last name) and what policies to apply when you see that data. However, it assumes you already know where to find it.
At the other end of the spectrum is software at the data security and storage level for masking and de-identifying data. These software platforms help you successfully obfuscate or even lock down access to sensitive data when you see it. They also assume you already know what it looks like and where to find it.
The Tough Questions
- Where is the data protected by GDPR or CCPA located in the first place?
- Where did the protected data come from and where is it going? What is its provenance?
- How do we automate the process of identifying, reporting, and then controlling access to GDPR-regulated data as new data is constantly entering our environment?
- How do we fulfill our obligations under GDPR or CCPA to report the uses to which data has been put, and to erase or correct data in flight upon demand of the subject?
Two Critical Components Need to be Addressed and Automated:
- Identifying personal data elements and their location across all data stores spanning the organization
- Establishing a practice of data-driven decision making on access control, and record keeping of who is using what data, when, and for what purpose
The Technical Challenge
Keeping data that is covered under GDPR and CCPA in a governable state is much harder than it sounds. In fact, most organizations don’t even have the baseline infrastructure to properly support data privacy initiatives.
The result is that very often the solution to governing data is to lock it all down, put it into quarantine, and limit access. This approach has the unfortunate side effect of treating all data as if it is sensitive. People who have legitimate need to see some data, but not all data, have to go through a centralized group to process their requests. Unfortunately, these centralized groups are often understaffed, so access to data is slow and frustrating, with the result that the concept of “Data Governance” now has a bad reputation.
This approach also fails to support GDPR or CCPA because part of the regulations includes “the right to be forgotten,” requiring companies to be able to present whatever information is being kept about an individual as well as provide the ability to remove most of this data upon request. Other regulations may require you to keep some of the data, perhaps to balance your books, or to support accident investigations, or to meet potential discovery requirements. Therefore, just protecting data from unauthorized access is not enough to comply with GDPR.
Fortunately, data governance software solutions are now available that can automate much of this process.
Identify Personal Data Elements and Their Lineage
“The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay. . . .” GDPR Article 17 deals with the right to erasure (a.k.a. “The Right to Be Forgotten”): an individual’s right to have their personal data erased once their “personal data are no longer necessary in relation to the purposes for which they were collected or otherwise processed.”
“…the controller and the processor shall implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk.” GDPR Article 32 deals with the security of data processing, applying that risk-based standard to this same personal data. In simple terms, GDPR expects organizations to secure personal information about their customers and consumers.
With both of these articles, there is an assumption that the organization already knows where all the personal data is located across the vast mounds of data that exist.
How can you “forget” someone if you don’t know where their data is located in the first place?
The challenge is that most organizations have no process for tracking and documenting all their data and data flows. Many organizations will have kept track of their most critical systems but lack a comprehensive catalog of all their data, including development, test, production, data warehouse, and backup systems.
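To make the dependency on an inventory concrete, here is a minimal Python sketch of an erasure request driven by a catalog. Everything in it is an assumption for illustration: the table names, the `CATALOG` mapping, and the in-memory `stores` dictionary stand in for real systems, and the sketch ignores the retention obligations other regulations may impose.

```python
from typing import Dict, List

# Hypothetical inventory: tables known to hold personal data, mapped to the
# column that carries the data subject's identifier.
CATALOG: Dict[str, str] = {
    "crm.customers": "email",
    "warehouse.orders": "customer_email",
}


def erase_subject(stores: Dict[str, List[dict]], catalog: Dict[str, str],
                  subject_id: str) -> Dict[str, int]:
    """Delete rows tied to subject_id in every cataloged table.

    Returns the number of rows erased per cataloged table.
    """
    erased: Dict[str, int] = {}
    for table, id_column in catalog.items():
        rows = stores.get(table, [])
        kept = [row for row in rows if row.get(id_column) != subject_id]
        erased[table] = len(rows) - len(kept)
        if table in stores:
            stores[table] = kept
    return erased


stores = {
    "crm.customers": [{"email": "ana@example.com"}, {"email": "bo@example.com"}],
    "warehouse.orders": [{"customer_email": "ana@example.com", "total": 10}],
    # Not in the catalog, so the erasure request never reaches it.
    "test.customers_copy": [{"email": "ana@example.com"}],
}
print(erase_subject(stores, CATALOG, "ana@example.com"))
print(stores["test.customers_copy"])
```

The uncataloged test copy still holds the subject's record after the request runs, which is the gap an incomplete inventory leaves behind.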
In addition, GDPR Article 30 requires organizations to maintain a record of processing activities, including a description of the categories of personal data; the categories of recipients of personal data, including those in third countries or international organizations; and transfers of personal data to a third country or an international organization. The record-keeping requirements also extend to so-called processors who process data on behalf of an organization. To comply with this article, data privacy teams need to strengthen metadata management and data lineage capabilities using a machine learning-based data governance software platform.
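One way to picture such a record is as a simple data structure. The Python sketch below is an assumption about shape only: the class and field names are hypothetical, and they cover just the elements listed above rather than the full text of Article 30.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProcessingRecord:
    """Hypothetical record of one processing activity."""
    activity: str
    data_categories: List[str]
    recipient_categories: List[str]
    third_country_transfers: List[str] = field(default_factory=list)
    processor: Optional[str] = None  # set when a processor acts on the organization's behalf


def records_with_transfers(records: List[ProcessingRecord]) -> List[ProcessingRecord]:
    """Return the activities that send personal data to a third country."""
    return [r for r in records if r.third_country_transfers]


records = [
    ProcessingRecord("payroll", ["name", "bank account"], ["payroll provider"],
                     processor="Example Payroll Ltd"),
    ProcessingRecord("support tickets", ["name", "email"], ["support vendor"],
                     third_country_transfers=["US"]),
]
print([r.activity for r in records_with_transfers(records)])
```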
Masking Sensitive Data
“The principles of data protection should apply to any information concerning an identified or identifiable natural person.” GDPR Recital 26 addresses the kind of data to be protected.
To comply with GDPR, the data protection officer must establish controls to appropriately mask or encrypt sensitive personal data. The data masking standards need to ensure that data cannot be reconstructed when multiple fields are combined.
The challenge, once again, is knowing where the personal data is located and what standalone data and data combinations will be considered sensitive. The second challenge is then programming data masking or de-identification tools to do this work.
Most of these data governance tools are programmed manually, either by a data steward who personally identifies the specific attributes to be masked column by column or by using a “broad brush” approach in which all data in each data source is protected.
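The manual, column-by-column approach can be sketched as follows. This is an illustration under stated assumptions, not how any particular masking product works: the field names, the key handling, and the generalization rules are all hypothetical. Direct identifiers are replaced with a keyed hash (HMAC-SHA256 from the Python standard library), and quasi-identifiers are coarsened so that combining fields reveals less. Keyed hashing is pseudonymization rather than anonymization, so the output generally still needs to be treated as personal data.

```python
import hashlib
import hmac

# Assumption: in practice the key would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical list of columns a data steward has marked as direct identifiers.
DIRECT_IDENTIFIERS = ("national_id", "first_name", "last_name")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a value with a truncated keyed hash (same input, same token)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_record(record: dict) -> dict:
    """Return a masked copy of the record; the input is left unchanged."""
    masked = dict(record)
    for name in DIRECT_IDENTIFIERS:
        if name in masked:
            masked[name] = pseudonymize(str(masked[name]))
    if "birth_date" in masked:  # "YYYY-MM-DD" is reduced to the year
        masked["birth_date"] = str(masked["birth_date"])[:4]
    if "postcode" in masked:  # keep only a coarse prefix
        masked["postcode"] = str(masked["postcode"])[:3] + "***"
    return masked


row = {"national_id": "123-45-6789", "first_name": "Ana", "last_name": "Ruiz",
       "birth_date": "1980-04-12", "postcode": "94105", "plan": "gold"}
print(mask_record(row))
```

Every new column has to be added to this list by hand, which is exactly the maintenance burden described above.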
Determining what should be masked or de-identified is especially problematic because new data arrives in organizations on a regular basis, and that data must be evaluated to determine whether it should be masked or de-identified. Most organizations don’t even have the baseline infrastructure to properly support data governance initiatives as data volumes and variety continue to scale up. Enterprises are adopting tag-based security policies, such as those implemented by the Apache® Ranger project. Instead of defining access control for each specific field, table, or file by name, access control is defined for any field, table, or file tagged with a specific tag.
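The idea behind tag-based policies can be shown with a small Python sketch. It is not the Apache Ranger API or policy format; the asset names, tags, roles, and the deny-by-default rule for unknown assets are all assumptions chosen for illustration.

```python
from typing import Dict, Set

# Hypothetical tags attached to assets, for example by a data catalog.
ASSET_TAGS: Dict[str, Set[str]] = {
    "crm.customers.email": {"PII"},
    "crm.customers.country": set(),
    "hr.payroll.salary": {"PII", "FINANCIAL"},
}

# Policy is written once per tag, not once per field, table, or file.
TAG_POLICY: Dict[str, Set[str]] = {
    "PII": {"privacy_officer", "support_lead"},
    "FINANCIAL": {"privacy_officer", "finance"},
}


def can_access(role: str, asset: str,
               asset_tags: Dict[str, Set[str]] = ASSET_TAGS,
               tag_policy: Dict[str, Set[str]] = TAG_POLICY) -> bool:
    """Allow access only if the role is permitted for every tag on the asset."""
    tags = asset_tags.get(asset)
    if tags is None:
        return False  # assumption: assets that were never evaluated are denied
    return all(role in tag_policy.get(tag, set()) for tag in tags)


print(can_access("support_lead", "crm.customers.email"))
print(can_access("support_lead", "hr.payroll.salary"))

# A newly arrived column only needs a tag; no policy change is required.
ASSET_TAGS["web.signups.email"] = {"PII"}
print(can_access("finance", "web.signups.email"))
```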
Finding Sensitive Data
Waterline Data Governance software can discover, search, and surface critical data to an organization, connecting the business and governance processes built around GDPR, CCPA, and other privacy rules to the actual data that needs to be governed. It automatically “fingerprints” data at scale by analyzing source data. If a business glossary exists, it organizes the data to that nomenclature using machine learning to automatically tag and match data fingerprints (Figure 1). It then matches the unmatched terms and creates new missing terms through crowdsourcing. This is the process that automatically builds an inventory of all your sensitive data and answers the first question:
“Where is the data that is protected by GDPR located in the first place?” (Figure 1)
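The source does not describe how Waterline's fingerprinting works internally, so the following Python sketch substitutes a much simpler technique, value-pattern matching with regular expressions, purely to show what tagging columns from their contents looks like. The patterns, tag names, and the 0.8 threshold are assumptions.

```python
import re
from typing import Dict, List

# Hypothetical value patterns; a real system would use far richer signals.
PATTERNS = {
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "US_SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def suggest_tags(columns: Dict[str, List[str]],
                 threshold: float = 0.8) -> Dict[str, List[str]]:
    """Suggest a tag for a column when enough of its non-empty values match."""
    suggestions: Dict[str, List[str]] = {}
    for name, values in columns.items():
        non_empty = [v for v in values if v]
        tags = []
        for tag, pattern in PATTERNS.items():
            hits = sum(1 for v in non_empty if pattern.fullmatch(v))
            if non_empty and hits / len(non_empty) >= threshold:
                tags.append(tag)
        suggestions[name] = tags
    return suggestions


sample = {
    "col_a": ["ana@example.com", "bo@example.org", ""],
    "col_b": ["123-45-6789", "987-65-4321"],
    "notes": ["called twice", "ana@example.com"],
}
print(suggest_tags(sample))
```

Note that the column names play no part in the decision: `col_a` is tagged because of what it contains, which is why content-based tagging can find sensitive data in poorly labeled sources.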
As part of this fingerprinting process, Waterline lets data stewards review and curate the automatic suggestions. Waterline also profiles data, providing data quality metrics that include min, max, cardinality, and selectivity as well as presenting statistics such as data type, number of null values, mean, standard deviation (when appropriate), and total number of rows (Figure 2).
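For readers unfamiliar with these metrics, a minimal Python sketch of column profiling follows. It is not Waterline's implementation. In particular, the definition of selectivity used here (distinct non-null values divided by total rows) and the use of the sample standard deviation are assumptions.

```python
import statistics
from typing import List, Optional


def profile_column(values: List[Optional[float]]) -> dict:
    """Compute basic profile statistics for one numeric column."""
    non_null = [v for v in values if v is not None]
    cardinality = len(set(non_null))  # number of distinct non-null values
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "cardinality": cardinality,
        # Assumed definition: distinct values relative to total rows.
        "selectivity": cardinality / len(values) if values else None,
        "mean": statistics.mean(non_null) if non_null else None,
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(non_null) if len(non_null) > 1 else None,
    }


print(profile_column([10, 20, 20, None, 30]))
```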
Waterline also captures data lineage (Figure 3), importing lineage from other systems and filling in the gaps automatically by using automated algorithms to infer lineage when it isn’t already documented, answering the second question:
“Where did the protected data come from and where is it going?” (Figure 3)
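Once lineage has been captured, answering that question amounts to walking a graph. The Python sketch below assumes lineage is available as simple (source, target) pairs with hypothetical dataset names; it does not show how lineage is imported or inferred.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]):
    """Index lineage edges in both directions."""
    upstream: Dict[str, Set[str]] = defaultdict(set)
    downstream: Dict[str, Set[str]] = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def reachable(start: str, neighbors: Dict[str, Set[str]]) -> Set[str]:
    """Breadth-first walk returning every dataset reachable from start."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


edges = [
    ("crm.customers", "staging.customers"),
    ("staging.customers", "warehouse.dim_customer"),
    ("warehouse.dim_customer", "reports.churn"),
]
upstream, downstream = build_graph(edges)
print(sorted(reachable("warehouse.dim_customer", upstream)))    # where it came from
print(sorted(reachable("warehouse.dim_customer", downstream)))  # where it is going
```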
Waterline protects sensitive data within its own interface, showing users only data values they can already see based on their permissions to the underlying file or table. More important, because Waterline automatically tags data as sensitive, those tags will be passed on to access control tools such as Apache Ranger or Cloudera® Sentry (or others via a published and supported REST API), which can then be used to make sure that only the people who should see the sensitive data can see it.
This answers the third question: How do we automate the process of identifying and controlling access to GDPR-regulated data as new data is constantly entering our environment?
Benefits of Waterline Data for GDPR, CCPA, and Other Privacy Rules
Automating the end-to-end data cataloging process with Waterline data governance software significantly reduces the cost of data governance while making it a process that can be used for competitive advantage:
- Significant increase in data inventory and lineage accuracy, reducing the risk of financial penalty (e.g., GDPR fine exposure of up to 4 percent of worldwide revenue)
- Significant reduction in time required to get data out of “quarantine” and into use
- Making it practical to tag and inventory data at scale, enabling organizations to spend more time using data and less time protecting and searching for it
- Agility and readiness for additional regulations
Complying with GDPR or CCPA doesn’t have to leave business professionals muttering under their breath. With Waterline Data, organizations that automate this process and get data into production more quickly will reap the benefits that come from being first to market with new information, products, and services. By automating the underlying process for inventorying, tagging, and curating data and data lineage, organizations can quickly pass new data through quarantine and safely put it into use, while complying with GDPR and CCPA. These same companies will turn the governance process from a liability into an asset that continuously delivers competitive advantage.