Rapid Controlled Access to Data with Starburst and Immuta

June 16, 2021

Pat Bates

Solutions Architect

Starburst

A growing number of enterprises are experiencing the benefits of the Starburst single point of access to all of their data that allows them to rapidly drive value from their data assets. A key challenge for data managers and data engineers who deliver this data to business analysts and consumers is how to control the access and privacy of data sets regardless of the data source. Frequently there is a tension between the two goals of rapid access and controlled access. Thanks to the latest integration between Starburst and Immuta, it is now possible to deliver on both:

Rapid access to all the data that the business needs, wherever it lives.
Controlled access to data assets while appropriately protecting sensitive data.

This blog post will examine this integration and how Immuta enhances the value that Starburst delivers through the techniques of:

Attribute-Based Access Control (ABAC) and how it’s different from Role-Based Access Control (RBAC).
Purpose-based access policies that take into account not just the role of the data requester, but also their purpose (this is key for regulatory compliance).
Automatic discovery of sensitive data fields and automatic policy assignment to accelerate the pace of data availability.

Immuta access and privacy policies further improve the value of Starburst to organizations by ensuring that:

Data access and sensitive data protection policies closely align with corporate policies.
Protection of sensitive data adheres closely to governing regulatory requirements such as CCPA, GDPR, and HIPAA.
The protection of the data is automatic and does not impose delays on data delivery to the business processes that depend on it.

Let’s start things off by examining the fundamental concepts that Immuta brings to bear.

Attribute-Based Access Control

Immuta expands the familiar concept of role-based access control (RBAC) by introducing the notion of attributes, which make access control both simpler and more powerful. Traditional RBAC has allowed for users to be grouped into common roles and then for an access policy to be applied to all members of that role. For example, all employees in the Human Resources department can be assigned the role “HR”, and then the role “HR” can be granted access to employee personal and employment information. Broad-based roles are useful, but they tend to be cumbersome to manage when finer grained control is required. For example, if there is some HR data that is considered more sensitive than others, such as the employee SSN or their current salary, then additional groups must be defined. A new role “HR Supervisors” might be created with a subset of members of role “HR” who are granted open access to those data fields, while the rest of the members of “HR” would have these fields masked or redacted. As more and more fine-grained access policies must be considered, the result is an ever growing collection of specialized roles, to the point that role-based access is almost indistinguishable from individual user-based access.

Immuta addresses this with Attribute-Based Access Controls (ABAC), which enhances RBAC with finer-grained specifications. It starts with basic role-based policies and extends them with attributes that refine the access rules for a user. Consider the following from the Immuta console for a user in the HR department:

As a member of the group “HR”, this user will be granted appropriate role-based access. But the inclusion of the various attributes, addressing aspects such as geographic location, professional or legal certifications, and various data segment authorizations, is what allows for the definition of fine-grained controls across arbitrarily complex attribute combinations. For example, the following policy that addresses PCI data can be defined simply:

Data fields tagged as Social Security Number are masked for all users except for those with unrestricted clearance.
Data fields tagged as Person Name are tokenized using a regular expression for all users except for those with unrestricted clearance.

To provide this level of granularity using only role-based access controls would require creating and managing a separate role for every unique attribute-value combination. Attribute-Based Access Controls, on the other hand, provide precise, granular control through a single intuitive directive.
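
The two rules above can be sketched in a few lines of Python. This is only an illustration of the idea, not Immuta's policy engine or API: the `User` class, the `clearance` attribute name, the tag strings in `COLUMN_TAGS`, and the `apply_policy` helper are all hypothetical, and a truncated hash stands in for the regular-expression tokenization.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)


# Hypothetical column tags, as they might be assigned in a data catalog.
COLUMN_TAGS = {
    "ssn": "Social Security Number",
    "last_name": "Person Name",
    "state": "Location",
}


def mask(value: str) -> str:
    """Replace every character with an asterisk."""
    return "*" * len(value)


def tokenize(value: str) -> str:
    """Stand-in for tokenization: a short, stable, non-reversible token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


def apply_policy(user: User, row: dict) -> dict:
    """Return a copy of row with tagged fields protected for this user."""
    # One attribute check replaces a dedicated role per exception.
    if user.attributes.get("clearance") == "unrestricted":
        return dict(row)
    protected = {}
    for column, value in row.items():
        tag = COLUMN_TAGS.get(column)
        if tag == "Social Security Number":
            protected[column] = mask(value)
        elif tag == "Person Name":
            protected[column] = tokenize(value)
        else:
            protected[column] = value
    return protected


row = {"ssn": "123-45-6789", "last_name": "Smith", "state": "MA"}
analyst = User("ana", {"HR"}, {"clearance": "standard"})
supervisor = User("sam", {"HR"}, {"clearance": "unrestricted"})

print(apply_policy(analyst, row))
print(apply_policy(supervisor, row))
```

Both users belong to the same group, "HR"; only the `clearance` attribute differs, so no additional role is needed to express the exception.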

Purpose-Based Access Controls

To further control the access to data, especially sensitive data that is governed by various regulations or statutes like HIPAA, CCPA or GDPR, Immuta provides for designation of the purpose behind a request for data. Purpose can be tied to a particular business activity or use case, such as fraud detection or resource actions, and the controls applied to the purpose can supersede or augment the role or attributes of a given user requesting access. For example, a data policy can be defined as follows:

This specifies that users will see a formatted token for credit card numbers, unless they are acting for the purpose of “Fraud Audit”. Only when they are operating for that purpose will they be allowed to see raw credit card numbers. The same user would see those same credit card numbers as masked tokens if they were operating under any other purpose. This kind of access control is not supported using basic RBAC — each role is perceived to always operate for the same purpose.
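
A minimal sketch of this behavior, assuming a hypothetical `view_cc_number` helper and a hash-based `format_token` function (neither is Immuta's implementation), might look like this. The only input that changes the result is the purpose the user is currently acting under.

```python
import hashlib

# Purposes allowed to see raw credit card numbers (taken from the example above).
RAW_PURPOSES = {"Fraud Audit"}


def format_token(cc_number: str) -> str:
    """Replace each digit with a hash-derived digit, keeping separators in place."""
    digest = hashlib.sha256(cc_number.encode("utf-8")).hexdigest()
    digits = [str(int(ch, 16) % 10) for ch in digest]
    out = []
    i = 0
    for ch in cc_number:
        if ch.isdigit():
            out.append(digits[i % len(digits)])
            i += 1
        else:
            out.append(ch)
    return "".join(out)


def view_cc_number(cc_number: str, purpose) -> str:
    """Return the raw number only when acting under an approved purpose."""
    if purpose in RAW_PURPOSES:
        return cc_number
    return format_token(cc_number)


card = "4111-1111-1111-1111"
print(view_cc_number(card, "Fraud Audit"))  # raw value
print(view_cc_number(card, "Marketing"))    # formatted token with the same shape
```

The same user calling `view_cc_number` under two different purposes gets two different views of the same value, which is the distinction a role alone cannot express.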

This particular purpose-based access policy approach speaks directly to legal language in various privacy regulations that require the notion of purpose when handling sensitive information. Thus Starburst with Immuta can readily support an enterprise’s ability to maintain compliance and reduce risk.

Automatic Detection of Sensitive Data

Attribute-based and purpose-based access control policies require identification and tagging of sensitive data fields (aka columns). Manual tagging of table columns is a tedious and error-prone task, but is required by many role-based access control technologies. Immuta simplifies this process through automated detection and tagging of sensitive fields at the time that data sources are added to the Immuta catalog.

Consider a single table, Customer. A review of the data dictionary reveals a number of tags that have been automatically applied to several fields:

Note:

Several fields are tagged as PII or PHI sensitive data.
Several fields are categorized as identifying location attributes such as Address and State.
Other fields are categorized as personal identifiers, such as first and last names.

By automatically applying a rich set of attribute tags to each data field, Immuta makes it incredibly easy to define policies that protect the access to those fields.
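
This post does not describe how Immuta detects sensitive fields, so the following is only a sketch of one common approach: matching column names and sampled values against known patterns. The tag names, the patterns, the `detect_tags` function, and the 80% threshold are all assumptions made for illustration.

```python
import re

# Hypothetical detectors: (tag, pattern matched against whole sample values).
VALUE_PATTERNS = [
    ("Social Security Number", re.compile(r"\d{3}-\d{2}-\d{4}")),
    ("Credit Card Number", re.compile(r"\d{4}([- ]?\d{4}){3}")),
    ("Postal Code", re.compile(r"\d{5}(-\d{4})?")),
]

# Hypothetical hints based on the column name alone.
NAME_HINTS = {
    "first_name": "Person Name",
    "last_name": "Person Name",
    "street": "Address",
    "state": "State",
}


def detect_tags(column: str, samples: list, threshold: float = 0.8) -> set:
    """Suggest tags for a column from its name and a sample of its values."""
    tags = set()
    if column in NAME_HINTS:
        tags.add(NAME_HINTS[column])
    values = [s for s in samples if s]
    for tag, pattern in VALUE_PATTERNS:
        if values:
            hits = sum(1 for v in values if pattern.fullmatch(v))
            # Tag the column only when most sampled values match.
            if hits / len(values) >= threshold:
                tags.add(tag)
    return tags


print(detect_tags("ssn", ["123-45-6789", "987-65-4321"]))
print(detect_tags("last_name", ["Smith", "Jones"]))
```

Running a step like this when a data source is registered means policies written against tags apply to new tables without manual labeling.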

Tying it All Together with Starburst

We now have in place:

Tags automatically assigned to data fields
Users assigned to groups
Attributes attached to users
Purposes associated with data projects

Immuta provides a rich set of capabilities to restrict or limit access to data fields and data sets, to apply automatic filtering and to protect sensitive data with a rich library of privacy enabling techniques. The full depth and breadth of these techniques is beyond the scope of this blog post, but more information can be found in https://documentation.immuta.com/2021.1/.

When users access data through Starburst that is protected by Immuta, they will direct all access through the Immuta catalog. The schemas in the Immuta catalog reflect the various data sources, and all tables are represented as Immuta views. For example, the catalog would appear as follows in a query editor:

When data sources are registered with Immuta, the data engineer gives a meaningful schema name for the tables, such as bb_accounts above (for the “burst_bank” schema in the “accounts” Starburst catalog). So instead of users accessing through Starburst in the typical nomenclature of “accounts.burst_bank.customer”, they would access through the Immuta views of “immuta.bb_accounts.customer”. All of this mapping occurs when the data source is first registered to Immuta.
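
The renaming can be pictured as a simple lookup. The sketch below uses only the names from the example above; the `SCHEMA_MAP` dictionary and the `to_immuta_name` function are hypothetical and are not part of Starburst or Immuta.

```python
# Hypothetical mapping recorded at registration time:
# (Starburst catalog, schema) -> schema name in the Immuta catalog.
SCHEMA_MAP = {
    ("accounts", "burst_bank"): "bb_accounts",
}


def to_immuta_name(qualified_name: str) -> str:
    """Translate catalog.schema.table into the equivalent Immuta view name."""
    catalog, schema, table = qualified_name.split(".")
    return f"immuta.{SCHEMA_MAP[(catalog, schema)]}.{table}"


print(to_immuta_name("accounts.burst_bank.customer"))
```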

All direct catalog connections have been disallowed by Starburst, and all traffic will flow through the Immuta connection.

Compare and contrast the catalog view for a non-Immuta deployment in which the user has explicit access to catalogs for each individual data source:

The following query illustrates the impact of Immuta. We will first observe the results of the query as run by a Business Analyst whose business function does not require any of the personally identifying fields or other sensitive information and generally only requires aggregates and summaries, not specific details of an individual customer.

For the business analyst, consider the following “real-world” query. This query illustrates both of the advantages of using Starburst and Immuta together:

Multiple data sources in a single SQL statement. As you view the query you can see multiple joins of different Immuta schemas that refer to distinct underlying data sources. This shows the power of using Starburst. These include joins on:
1. bb_customer_master_data: customer master data held in the data lake (on S3)
2. bb_accounts: accounts data warehouse on PostgreSQL
Multiple sensitive data fields including unique identifiers (name, credit card number, etc.) and private information (account balances, income, etc.). This shows the power of using Immuta — different users see the data protected differently depending on their attributes and purpose.

The query selects a number of fields (enumerated in the final SELECT clause) from two joined data sources.

-- lccp: the amount of the most recent payment for each credit card
with lccp as (
    -- mccp: the most recent payment date for each credit card
    with mccp as (
        select
            a.cc_number,
            max(ccp.payment_date) as max_date
        from
            immuta.bb_customer_master_data.credit_card_payment ccp
            inner join immuta.bb_accounts.account a on ccp.cc_number = a.cc_number
        group by a.cc_number
    )
    select
        mccp.cc_number,
        mccp.max_date as payment_date,
        ccp.payment_amount
    from
        mccp
        inner join immuta.bb_customer_master_data.credit_card_payment ccp
            on mccp.cc_number = ccp.cc_number
            and mccp.max_date = ccp.payment_date
)
select
    a.cc_number as "CC Number",
    c.last_name as "Last Name",
    c.first_name as "First Name",
    c.street as "Address",
    c.postcode as "ZIP",
    c.state as "State",
    trim(c.dob) as "DOB",
    c.estimated_income as "Income",
    c.fico as "Fico",
    lccp.payment_amount as "Monthly Payment"
from
    lccp
    inner join immuta.bb_accounts.account a on lccp.cc_number = a.cc_number
    inner join immuta.bb_customer_master_data.customer c on c.custkey = a.custkey
order by c.custkey desc

Note the query is combining data (federating) from multiple schemas backed by distinct data sources. The results returned to the user are different depending on the attributes and purpose of the user who issues the query.

The following results are produced for the business analyst:

All of the sensitive, uniquely identifying fields are masked, such as:

Credit card number
Last and first names
Address

Other fields that are useful for a business analyst to aggregate and group data have been generalized:

The last 3 digits of the ZIP code have been zeroed out.
The date of birth is generalized to only the year of birth.
Financials like income and monthly payment data have been rounded.
The FICO score has been rounded as well.
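
These generalizations are simple transformations. The sketch below shows one way to express them, assuming five-digit ZIP codes stored as strings and rounding steps chosen purely for illustration; the function names are hypothetical and the post does not state the precision Immuta applies.

```python
from datetime import date


def generalize_zip(zip_code: str) -> str:
    """Zero out the last three digits of a ZIP code."""
    return zip_code[:-3] + "000"


def generalize_dob(dob: date) -> int:
    """Keep only the year of birth."""
    return dob.year


def round_to(value: float, step: int) -> int:
    """Round a value to the nearest multiple of step."""
    return int(round(value / step) * step)


print(generalize_zip("02139"))
print(generalize_dob(date(1984, 7, 23)))
print(round_to(83421, 1000))  # assumed step for income
print(round_to(712, 10))      # assumed step for a FICO score
```

Values generalized this way can still be grouped and aggregated, which is what the business analyst needs, while no longer pinpointing an individual.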

If the exact same query is run by a fraud and risk investigator, they would see the following:

The data is very similar to the business analyst example in that several of the fields are still masked. Even the fraud investigator doesn’t need to know a person’s name or address to investigate anomalies or patterns in the data. They do however need specific details:

The credit card number is no longer masked.
The ZIP code is in its full form.
The date of birth is fully represented.
Financials like FICO score, income and payments are exact.

In Conclusion

For different kinds of business users running the same query, each is presented with the data at a level of protection and specificity that is consistent with their business function. This allows you to address the tension between the key challenges of providing fast and secure access for various data consumers in your organization. Starburst provides the single point of data access that allows such queries to rapidly combine / federate data across multiple data sources and Immuta provides the rich access and privacy policies to protect that data on demand.
