Tag Based Security – Another Key to Unlocking Self-Service Data

One of the key challenges in big data is data security. The relative lack of security in big data environments is a major issue that has kept many organizations from taking those environments into full production. In the Hadoop world, the community has responded with Apache Ranger and Apache Sentry, two technologies that can be used to secure access to data in your data lake. From their apache.org definitions:

Apache Ranger™ is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

Apache Sentry™ is a system for enforcing fine grained role based authorization to data and metadata stored on a Hadoop cluster.

While data security is still a relatively new concept in the Hadoop world, there are some interesting capabilities that this new world of Ranger and Sentry provides, which aren’t available in the legacy relational world, and that over time will make securing data much easier.

Traditional approaches are based mainly on assigning roles and domains to define the security paradigm. A person has a role, which gives them a certain level of access to data, and that role is then assigned to specific data domains. This means, however, that whenever a new set of data comes into an organization, the new data domain has to be evaluated to determine what data is in it and who should have access to it. After this evaluation is completed, security settings have to be put in place manually so that people with the designated roles can access the new data domain. The problem with this approach is that, as the variety and volume of data keep growing, manual assignment of security rights cannot keep pace.

To address this, the big data world has introduced the idea of tag-based security. The idea is that some data within a table might be sensitive, other data might be considered super sensitive, and still other data might be OK for anyone to look at. So, if users can tag specifically which fields are sensitive, they can assign access rights based on those tags, which means that authorization policies can be controlled at a much finer-grained level than before.
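To make the idea concrete, here is a minimal sketch of tag-based authorization in plain Python. It is not the Ranger or Sentry API; the tag names, roles, and column names are all hypothetical and chosen only to mirror the three sensitivity levels described above.

```python
# Hypothetical policy: each tag maps to the set of roles allowed to read it.
# None means the tag is open to any role.
TAG_POLICIES = {
    "PUBLIC": None,
    "SENSITIVE": {"analyst", "compliance"},
    "SUPER_SENSITIVE": {"compliance"},
}

# Hypothetical table: every column carries exactly one tag.
COLUMN_TAGS = {
    "order_id": "PUBLIC",
    "email": "SENSITIVE",
    "ssn": "SUPER_SENSITIVE",
}


def can_read(role, column):
    """Decide access from the column's tag, not from the column itself."""
    allowed_roles = TAG_POLICIES[COLUMN_TAGS[column]]
    return allowed_roles is None or role in allowed_roles


def visible_columns(role):
    """Return the columns a role may see, in table order."""
    return [column for column in COLUMN_TAGS if can_read(role, column)]
```

The point of the design is that the policy is written once per tag. Adding a new column only requires giving it a tag; no per-column access rule has to be written.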

The downside, of course, is that you have to tag each field to take advantage of this capability. But suppose we layer on the concept of data fingerprinting: the ability to automatically tag data with a business term, where some terms are considered sensitive and others are not sensitive at all. We then have a new concept, automated tag-based security, which would work as follows.

Imagine a new set of financial data comes into your organization, and a data catalog looks at the data in each column, evaluates it, and tags it with an understandable business term such as First Name, Last Name or Claim Number. The access rights for each tag are predetermined according to its sensitivity. First Name and Last Name are traditionally considered personally identifiable information (PII) and, as such, could belong to a “tag group” that is assigned a restricted security level. But let’s also assume that Claim Number is not restricted. This means that Apache Ranger and Sentry would restrict access to First Name and Last Name to people with the proper role, but allow anyone with domain rights to that data set to see the claim number regardless of their role.
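The scenario above can be sketched end to end in a few lines of Python. This is an illustration under stated assumptions, not how any particular data catalog, Ranger, or Sentry implements it: the lookup lists, the `CLM-` claim number format, the 80% match threshold, the `claims_adjuster` role, and the choice to deny access to unrecognized columns are all hypothetical.

```python
import re

# Hypothetical fingerprint rules: one predicate per business term.
FIRST_NAMES = {"alice", "bob", "carol"}
LAST_NAMES = {"smith", "jones", "nguyen"}
CLAIM_NUMBER = re.compile(r"CLM-\d{6}")  # assumed format, for illustration

FINGERPRINTS = {
    "First Name": lambda value: value.lower() in FIRST_NAMES,
    "Last Name": lambda value: value.lower() in LAST_NAMES,
    "Claim Number": lambda value: CLAIM_NUMBER.fullmatch(value) is not None,
}

# Hypothetical tag groups and the roles allowed to read each restricted group.
TAG_GROUPS = {"First Name": "PII", "Last Name": "PII"}
RESTRICTED_GROUPS = {"PII": {"claims_adjuster"}}


def fingerprint(values, threshold=0.8):
    """Return the first business term matching enough of the sampled values."""
    if not values:
        return None
    for term, matches in FINGERPRINTS.items():
        hits = sum(1 for value in values if matches(value))
        if hits / len(values) >= threshold:
            return term
    return None


def tag_dataset(columns):
    """Tag every column of a new data set from a sample of its values."""
    return {name: fingerprint(sample) for name, sample in columns.items()}


def can_read(role, term):
    """Assumes the caller already holds domain rights to the data set."""
    if term is None:
        return False  # unrecognized column: deny until someone reviews it
    group = TAG_GROUPS.get(term)
    return group is None or role in RESTRICTED_GROUPS[group]
```

With this in place, a newly arrived data set such as `{"col_a": ["Alice", "Bob"], "col_c": ["CLM-000123"]}` is tagged by `tag_dataset`, and `can_read` then hides the name column from roles outside the PII group while leaving the claim number visible.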

The result is that security rights can be set automatically, with no human intervention needed to configure access rights to data in the system. This is a big deal because new data sets are added, moved and reclassified on a regular basis, and keeping up with those changes manually is a growing challenge. As with many things these days, there is simply too much data to keep up with. In the old world there were two options. Data professionals could restrict data access to a very limited, trusted set of people, which meant that the idea of “self-service” data would get blocked or at least significantly slowed down. Or they could make new data sets available to everyone, which meant that sensitive data would get exposed to people who shouldn’t see it.

The advantage of this emerging approach, built on tools such as Sentry and Ranger, is that it automates the data classification, tagging and security process, making it possible to quickly deliver self-service data while automating data security and data governance. Clearly the big data world has much more work to do before it becomes truly mainstream. But the introduction of tag-based security is another big step towards the democratization of the use of data.