h2h2h2

12 minute read · January 21, 2020

Using Data-driven Permissions to Secure Your Data Lake

Mikhail Setkin

Mikhail Setkin · Data Lake Platform Owner, Raiffeisenbank

Row and column level security is an important functionality for the majority of Dremio users. To implement this requirement, Dremio provides the following SQL functions:

  • is_member(ldap_group)
  • query_user()
  • USER IN (list of user names).

Our documentation provides basic information on how to utilize these functions using examples with hard-coded users and groups. In this article, we cover this topic at depth with complete data-driven row and column level permissions.


The Scenario


This is a common example of data tables that contain sensitive data. Let’s take SALES_REGIONS defined below as an example of sensitive data that requires restricted view permissions.

ColumnDescription
REGIONNon-sensitive column
SALES_VOLUMESensitive column. Only marketing people from that region can see that data
SALES_CONTACTSensitive column. All marketing people can see that data.
RegionSales_VolumeSales_Contact
WA1,000,000.00Sam
CA1,100,000.00Jon

This scenario takes into consideration the following conditions:

  • People in the Marketing team can view the sensitive data only if it belongs to their region.
  • There are three LDAP groups: Marketing that lists all people in the Marketing team, CA_Org *and *WA_Org that lists people in the entire organization that belongs to the correspondent region.

What to avoid


The easiest way to implement row-level permissions, is to create a SQL statement that would allow users see only rows that they have access to.

SELECT REGION, SALES_VOLUME, SALES_CONTACT
FROM SALES_REGIONS
WHERE IS_MEMBER('Marketing') AND
( (REGION = 'WA' AND IS_MEMBER('WA_Org')) OR
  (REGION = 'CA' AND IS_MEMBER('CA_Org')) )

In order to provide a more comprehensive view to that data, we can implement column-level permission we can create the following SQL statement:

SELECT
    REGION,
    CASE    WHEN REGION = 'WA' AND IS_MEMBER('WA_Org') THEN SALES_VOLUME
            WHEN REGION = 'CA' AND IS_MEMBER('CA_Org') THEN SALES_VOLUME
            ELSE ' *** ' END SALES_VOLUME,
    SALES_CONTACT
FROM SALES_REGIONS
WHERE is_member('Marketing')

However, this SQL code contains hardcoded LDAP groups and this approach is not encouraged since it is error-prone and hard to maintain.


Implementing data-driven permissions


For this example, let’s take into consideration the table SALES_PERM that contains all required permission information:

ColumnDescription
TABLE_NAMETable in the scope of the permission record
COLUMN_NAMEColumn in the scope of the permission record
REGIONSensitive Data Key
LDAP_GROUPLdap group that has access to the sensitive data
TABLE_NAMECOLUMN_NAMEREGIONLDAP_GROUP
SALES_REGIONS  Marketing
SALES_REGIONSSALES_VOLUMECACA_Org
SALES_REGIONSSALES_VOLUMEWAWA_Org
SALES_REGIONSALES_CONTACT Marketing

The new row level permission query now looks like this:

SELECT R.REGION, R.SALES_VOLUME, R.SALES_CONTACT
FROM SALES_REGIONS R
JOIN SALES_PERM P ON
    P.TABLE_NAME = 'SALES_REGIONS' AND
    P.COLUMN_NAME = 'SALES_VOLUME' AND
    R.REGION = P.REGION AND
    IS_MEMBER(LDAP_GROUP)

The column level permission query would look like this:

SELECT
    R.REGION,
    CASE    WHEN P.REGION IS NOT NULL THEN R.SALES_VOLUME
            ELSE ' *** ' END SALES_VOLUME,
    R.SALES_CONTACT
FROM SALES_REGIONS R
JOIN SALES_PERM RP ON
    RP.TABLE_NAME = 'SALES_REGIONS' AND
    RP.COLUMN_NAME IS_NULL AND
    RP.REGION IS NULL AND
    IS_MEMBER(RP.LDAP_GROUP)
LEFT JOIN SALES_PERM CP ON
    CP.TABLE_NAME = 'SALES_REGIONS' AND
    CP.COLUMN_NAME = 'SALES_VOLUME' AND
    CP.REGION = R.REGION AND
    IS_MEMBER(CP.LDAP_GROUP)

This approach does not have any data hardcoded and is a much better approach from maintenance and security perspective.


Considerations and Best Practices


Designing a permission table is a challenging task and depends on the concrete requirements. So far, the following approaches have been considered and implemented:

As per the example above, a single table can contain row and column level permissions for all tables. While it may look like a simple design that would also be easy to maintain, the number of columns in this table may start growing quickly. This approach may work well when the number of sensitive columns is small and/or the column names are consistent across all tables and databases.

Another approach is to use separate permission tables for different tables and columns. This approach would allow for more flexibility but has the potential to increase maintenance overhead.

A great approach to consider is to utilize a single key/value table where key could be a table name or a combination of a table name and and a column name. Value would be a permission flag, such as Y/N. This table can be pivoted with views to make it easy to consume in the virtual data sets (VDSs).

Instead if making the permission flag a Y/N, we can support multiple values, such as:

  • 100 - no access, completely mask the data
  • 200 - partial access, show last 4 digits
  • 300 - full access, show data as is.

That approach would allow for various levels of security as well as easy potential CI/CD automation.

More normalized/advanced design can support additive and inverse LDAP groups, such as Marketing and WA_Org from the example above. This approach would most likely require LDAP_GROUPS table with many-to-many relationship to other permission table(s).


What to look for


Row duplicates: when following any of the design options above, there is always the risk of generating duplicate rows. It’s possible that a query user belongs to multiple LDAP groups listed in the permission table. When joining a data table, it might produce duplicate rows. Here are a couple of options to consider:

  1. Assign the DISTINCT clause. Depending on the data size, that option might be costly.
  2. The other, better option, is to prevent duplicates by ensuring that the permission table does not return more than one record. For example, when using a permission flag. We can use MIN function to support the “least permission” approach or MIN to support “most permission” approach.

This approach has been tested on sub-second queries with datasets as large as 50M records without any impact on performance. However, performance is still a significant design consideration. Improper design (using DISTINCT for example) can drastically increase execution time. It’s important to validate design from a performance perspective.


Conclusion


In this article, we walked through the steps of implementing security on your data lake using data-driven permissions using Dremio. We have shared a sample scenario that is very similar to what many enterprises face when they need to secure data. We have also talked about design considerations, best practices and things to keep an eye out for when performing this implementation.

We hope you find this information useful, to learn more about how you can enhance the security of your data lake using Dremio, checkout these resources:

Ready to Get Started?

Enable the business to create and consume data products powered by Apache Iceberg, accelerating AI and analytics initiatives and dramatically reducing costs.