March 23, 2020
How To Secure Your Data Lake
Director of Technical Marketing, Dremio
Data breaches have been increasing at an exponential rate; the business impact of these attacks has been estimated at over $6 trillion. I don’t know about you, but that sounds like a pretty hefty sum to me.
We all understand the value of cloud data lakes. They are easy to set up and maintain, and in addition to being virtually limitless, they allow you to run any engine on top of your data. Cloud data lakes are, by nature, the first place where data lands. Because of this, they become the most attractive target for cybercrime. For these reasons, organizations need to adopt especially stringent security controls for their analytical systems.
In this multi-part series, we will cover everything you need to know to implement data lake security, as well as best practices, the latest technologies, and how you can leverage security features using Dremio.
What Is a Cloud Data Lake?
A cloud data lake is a cloud-hosted centralized repository that allows you to store all your structured and unstructured data at any scale, typically using an object store such as Amazon S3 or Azure Data Lake Storage. Its placement in the cloud means it can be interacted with as needed, whether you need to process, analyze, or report on that data. A cloud data lake can hold all of an organization’s data, including data generated from internal and external actions and interactions.
“Data lake” is a broad term that’s traditionally associated with Hadoop-oriented storage. In such a scenario, an organization’s data is loaded into the Hadoop platform and then analyzed where it resides, on cluster nodes of commodity hardware. While traditional data lakes have been built on HDFS clusters on-premises, the current trend is to move data lakes to the cloud and maintain them with cloud service providers such as AWS, Azure, GCP, and others.
A data lake can include structured data from relational databases (rows and columns), semi-structured data such as CSV and JSON files, unstructured data such as documents, and binary data such as images or video. The primary utility of this shared storage is in providing a unified source for all of a company’s data; each of these data types can then be collectively transformed, analyzed, and more.
If you want to learn more about this technology, read this robust explainer that covers it in more detail.
Why Do We Need Data Lake Security?
Industries have developed standards and regulations to better protect data. The following are some examples:
- CCPA to enhance privacy rights and consumer protection for residents of the state of California
- FISMA to ensure the security of data in the federal government
- FERPA to protect the privacy of student education records
- GDPR for the protection of EU citizen data privacy
- HIPAA standards for managing healthcare information
- PCI DSS for managing cardholder information
- The Asia Pacific Cross-border Privacy Enforcement Arrangement (CPEA), creating a framework for regional cooperation in the enforcement of privacy laws
While each of these regulations is different and addresses a different domain, they share several requirements in common:
- Access control
- Auditing
- Data encryption
What Are Your Options?
Cloud vendors such as Azure and AWS offer several features that can help you implement security best practices on your cloud data lake. These built-in controls range from identity and access management to security monitoring.
User Authentication and Authorization
Rule of thumb numero uno: don’t talk to strangers. Well, at least that is what our parents used to tell us, and now we literally summon strangers through the internet and jump in their cars so that they (hopefully) will drive us to our destinations. But that is a whole different story.
Securing and governing user access ensures that your data is not simply open to the public; it also gives you the opportunity to identify who can access the data, what actions they can take once they have access, and much more. The entities that need to be authenticated include users, administrators, and systems (interacting with other systems through APIs).
In addition, governing and securing user rights allows you to control what actions an authenticated entity (user or admin) can take within the system. These actions are defined and managed as privileges. It is imperative to lay out a security plan before the data lake is created, as this provides an opportunity to define roles and privileges up front. Access settings such as read, write, and execute can all be easily granted or denied through the cloud’s management console.
AWS’s Identity and Access Management (IAM) and Microsoft’s Azure Active Directory (Azure AD) offer ways to implement user authentication and access control. Administrators can use these tools to create and manage users and groups, and to apply detailed permission rules that limit access to objects and resources within the cloud environment.
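As a rough illustration of the AWS side, the boto3 sketch below creates a group, attaches a managed policy to it, and adds a user to the group. The group and user names are hypothetical, and the policy ARN shown is AWS’s managed read-only S3 policy; adapt both to your environment.

```python
import boto3

iam = boto3.client("iam")

# Create a group for analysts and grant it read-only access to S3.
# "data-lake-analysts" and "jdoe" are hypothetical names.
iam.create_group(GroupName="data-lake-analysts")
iam.attach_group_policy(
    GroupName="data-lake-analysts",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

# Add a user to the group instead of granting permissions directly,
# so permissions stay manageable as the team grows.
iam.create_user(UserName="jdoe")
iam.add_user_to_group(GroupName="data-lake-analysts", UserName="jdoe")
```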
There are several security best practices that you should apply to make the most of these features:
- Follow the “least privilege” principle by selecting the smallest scope of permissions necessary
- Use security groups instead of individual users when creating access policies; granting permissions to groups rather than to single users minimizes the risk of users inadvertently obtaining unnecessary or excessive permissions
- Provide access through roles instead of creating individual sets of credentials; misplaced or compromised credentials can lead to security breaches
- Enable multi-factor authentication
- Implement expiration periods on access keys and enforce strong password policies (see the sketch after this list)
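To make the last point concrete, here is a minimal boto3 sketch that enforces an account-wide password policy with expiration. The specific values are illustrative assumptions, not recommendations.

```python
import boto3

iam = boto3.client("iam")

# Enforce a strong account-wide password policy with a maximum
# password age, so credentials expire and must be rotated.
iam.update_account_password_policy(
    MinimumPasswordLength=14,
    RequireSymbols=True,
    RequireNumbers=True,
    RequireUppercaseCharacters=True,
    RequireLowercaseCharacters=True,
    MaxPasswordAge=90,          # days before a password expires
    PasswordReusePrevention=5,  # disallow reusing recent passwords
)
```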
Managing Access to Resources
In addition to these practices, you should also implement security directly on your data. Follow these guidelines to help ensure the integrity of your data lake storage:
- Ensure that your resources (S3 buckets, ADLS objects, etc.) don’t have public read/write permissions enabled (the sketch after this list shows one way to lock this down)
- Enable audit logging, as this will help you track actions performed on data stores and help you identify possible security flaws
- Encrypt data in transit—this can be accomplished through HTTPS or SMB 3.0
- Leverage features such as Azure Disk Encryption to encrypt OS and data disks
- If needed, delegate access to data objects through the use of shared access signatures
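As one illustration of the first three guidelines on AWS, the boto3 sketch below blocks public access on a bucket, enables access logging, and attaches a bucket policy that rejects requests made without HTTPS. The bucket names are hypothetical, and the audit bucket must already be set up to receive logs.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake-bucket"  # hypothetical bucket name

# 1. Block all public access at the bucket level.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# 2. Enable server access logging to a separate audit bucket.
s3.put_bucket_logging(
    Bucket=BUCKET,
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-audit-log-bucket",  # hypothetical
            "TargetPrefix": f"{BUCKET}/",
        }
    },
)

# 3. Deny any request that does not arrive over HTTPS.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```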
AWS IAM policies are one example of how access to resources can be managed. A policy provides authorization to cloud services and resources: it defines who is allowed to access a resource, what actions they can perform on it, and under which conditions.
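For instance, the following hedged sketch attaches an inline policy to a group that grants read-only S3 actions on a single bucket, but only from a trusted network range. The group name, bucket name, and CIDR block are all hypothetical.

```python
import json
import boto3

iam = boto3.client("iam")

# Identity-based policy: WHO (the group it is attached to), WHAT
# (read-only S3 actions), and UNDER WHICH CONDITIONS (requests from
# a trusted corporate IP range). Names and the CIDR are hypothetical.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-data-lake-bucket",
            "arn:aws:s3:::my-data-lake-bucket/*",
        ],
        "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
    }],
}
iam.put_group_policy(
    GroupName="data-lake-analysts",
    PolicyName="restricted-data-lake-read",
    PolicyDocument=json.dumps(policy),
)
```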
Securing Metadata Access
Securing your data is only part of the story—securing metadata is just as important. Armed with metadata, an attacker can target users as well as applications within your organization and gain access to sensitive information. If you are working on AWS and manage your data lake with AWS Glue, you can add IAM-based policies for controlling access to metadata. Additionally, you can use resource-based policies managed by AWS Glue, which are similar to S3 bucket policies. Similarly, Azure Data Catalog allows you to specify who can access the data catalog and what operations (register, annotate, edit) they can perform on metadata in the catalog.
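On AWS, a Glue Data Catalog resource policy can be applied programmatically. The hedged boto3 sketch below allows a single role to read database and table metadata and nothing else; the account ID, region, and role name are hypothetical.

```python
import json
import boto3

glue = boto3.client("glue")

# Resource-based policy on the Glue Data Catalog: only the analyst
# role may read database and table metadata. The account ID, region,
# and role name are hypothetical placeholders.
catalog_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/analyst-role"},
        "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetTables"],
        "Resource": "arn:aws:glue:us-east-1:123456789012:*",
    }],
}
glue.put_resource_policy(PolicyInJson=json.dumps(catalog_policy))
```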
Data Lake Security with Dremio
Because Dremio connects directly to your data lake, security is a critical and essential feature. Dremio provides sophisticated and flexible security controls to ensure that data can be safely accessed from data sources across the enterprise. Dremio integrates seamlessly with the existing security controls of your enterprise architecture, such as authentication and authorization schemes based on AD/LDAP.
Single Sign-On and Azure AD
Starting in Dremio 3.3, we added Single Sign-On (SSO) support, which provides a flexible method to integrate Dremio with existing identity management systems and offers numerous advantages for an organization, including:
- User experience: With Single Sign-On, users can move between different systems without interruption, since they don’t have to log in to each system separately. SSO joins individual systems from a user’s perspective, and switching between applications is seamless.
- Security: With SSO, user credentials and access are governed directly from a central identity provider instead of by each individual system the user is trying to access. This consolidates and centralizes identity management, which significantly reduces administrative overhead.
Full support for Azure AD is included, drastically simplifying management, security, and administration. Simply configure Dremio to use Azure AD, and Dremio will automatically integrate with it for identity management and security. Additionally, Dremio supports the OAuth 2.0 standard and can be configured with most identity providers that implement it.
Personal Access Tokens
Building on SSO integration, Dremio now also supports Personal Access Tokens (PATs). PATs provide the ability to authenticate and log in to Dremio using SSO-configured tokens over ODBC, JDBC, and even Arrow Flight endpoints. They offer security features such as built-in expiration and on-demand revocation for flexible administration.
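As an example, a Python client using pyarrow’s Flight bindings might authenticate with a PAT roughly as follows. The hostname is hypothetical, 32010 is Dremio’s default Flight port, and the assumption that the token is supplied as the basic-auth password should be checked against your deployment’s documentation.

```python
from pyarrow import flight

# Hypothetical coordinator endpoint; 32010 is Dremio's default Flight port.
client = flight.FlightClient("grpc+tls://dremio.example.com:32010")

# Assumption: the PAT is passed as the basic-auth password; the server
# returns a bearer-token header to attach to subsequent calls.
bearer = client.authenticate_basic_token("jdoe", "<personal-access-token>")
options = flight.FlightCallOptions(headers=[bearer])

# Run a simple query over Flight using the authenticated call options.
descriptor = flight.FlightDescriptor.for_command('SELECT 1 AS "ok"')
info = client.get_flight_info(descriptor, options)
table = client.do_get(info.endpoints[0].ticket, options).read_all()
print(table)
```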
We have covered only the tip of the iceberg in terms of the features and best practices that you should keep in mind when implementing and securing your data lake. Stay tuned! In the next blog post, I will cover how Dremio handles sharing, permissions, row- and column-level access, and much more.