Building a Better Data Lake on Amazon S3 with Dremio
Head of Product & Partner Marketing, Dremio
January 14, 2021
When it comes to storing large datasets, cloud-based data lakes are where the action is. Amazon Simple Storage Service (S3) has emerged as a preferred data lake platform for good reason. S3 is secure, scalable, flexible and offers excellent performance. It is also attractive because users pay only for the amount and class of storage they need. S3-based object stores can store virtually any data. With optimized file formats such as Apache Avro and the column-oriented Apache Parquet, improved metadata management and SQL-oriented query tools, it is increasingly practical to run SQL queries directly against S3-resident data.
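To make this concrete, here is a minimal Python sketch of querying Parquet data in place on S3 with PyArrow; the bucket, prefix and column names are hypothetical placeholders, and credentials are assumed to come from the standard AWS credential chain.

```python
# Minimal sketch: scan a (hypothetical) Parquet dataset directly in S3 and
# filter it in place, with no copy into a separate analytics system.
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # credentials from the standard AWS chain

# Treat a Parquet prefix in the data lake as a queryable dataset
orders = ds.dataset("my-data-lake-bucket/warehouse/orders/", filesystem=s3, format="parquet")

# Column projection and the row filter are pushed down to the Parquet files
table = orders.to_table(
    columns=["order_id", "region", "amount"],
    filter=ds.field("region") == "EMEA",
)
print(table.num_rows)
```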
Traditional Data Lake Solutions Fall Short
Unfortunately, even with SQL query tools such as Hive and Presto, data lakes still fall short for many applications. This is especially true for business intelligence (BI) and decision support systems (DSS). There are two key limitations:
- Data lake query performance is far too slow to support popular reporting and analysis tools such as Tableau and Power BI
- Traditional data lake solutions often lack necessary data governance and security controls
To work around these limitations, users often find themselves extracting data subsets from the data lake and replicating them in a data warehouse. Extracting, transforming and loading datasets into a format where they can be queried efficiently usually requires corporate IT assistance. This slows time to insight, adds cost and frequently undermines the benefits the data lake was meant to deliver. It also complicates data governance, because sensitive data is replicated into ungoverned data extracts, cubes and aggregation tables.
Business analysts and data scientists struggle to find the right balance between investments in the data lake and the data warehouse. Data lakes are scalable and cost-effective but lack the query performance and data governance features of a data warehouse. Ideally, enterprises would like tools that can query and analyze data in S3 directly at interactive speed, without having to copy data into other systems or compromise on performance or security.
Cloud Data Lake Engines Offer a Better Alternative
Fortunately, a new breed of cloud data lake engine can help organizations avoid this trade-off for both transactional and non-transactional workloads. Enabled by new open source technologies, the Dremio cloud data lake engine delivers lightning-fast queries: up to a 100x improvement in BI query speed and up to a 4x improvement in ad hoc query speed when running directly against S3 object storage, with table metadata managed by catalogs such as AWS Glue or the Hive metastore.
The cloud data lake engine also provides a self-service semantic layer enabling data analysts and engineers to easily manage, curate and share virtual datasets. Datasets are exposed via standard interfaces, and access is managed via centralized data governance and security policies. The semantic layer implements data governance features similar to a full-featured data warehouse, including granular row- and column-based access controls, data masking, encryption, auditing and more. A cloud data lake engine can access data from multiple sources, including other data lakes, file storage and various relational and non-relational data stores, providing a unified view of data assets to data scientists and business analysts.
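As an illustration of how a curated virtual dataset might be consumed through a standard interface, the Python sketch below issues a SQL query over Apache Arrow Flight; the hostname, port, credentials and dataset path are hypothetical placeholders, and the details will differ in a real deployment.

```python
# Illustrative sketch: run a SQL query against a (hypothetical) virtual dataset
# over Apache Arrow Flight and read the result back as an Arrow table.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio.example.com:32010")  # placeholder endpoint

# Exchange username/password for a bearer token and attach it to later calls
token = client.authenticate_basic_token("analyst", "analyst-password")
options = flight.FlightCallOptions(headers=[token])

query = 'SELECT region, SUM(amount) AS revenue FROM "Sales"."orders_curated" GROUP BY region'
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# Stream the result and hand it off to pandas, a notebook or a BI tool
reader = client.do_get(info.endpoints[0].ticket, options)
print(reader.read_all().to_pandas())
```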
By using a cloud data lake engine, organizations can realize multiple benefits:
- Reduced cost by avoiding the need to extract data into separate data warehouses or aggregation tables to meet BI and data science application requirements
- Faster time to insight by avoiding reliance on corporate IT to implement ETL workflows and provide suitable data extracts
- Improved data security and governance with centralized data access controls regardless of the underlying data source
- Improved productivity and collaboration between BI and data science users with a common view of enterprise data
Dremio – Built for Cloud Data Lakes
With a flexible multi-engine architecture that scales from one to thousands of nodes, the Dremio cloud data lake engine takes advantage of the AWS cloud’s underlying elasticity. It maximizes concurrency and performance and dramatically reduces infrastructure costs by scaling engines based on workload. The engine deploys easily against Amazon S3 and works seamlessly with other AWS data management services such as Amazon RDS and Amazon Redshift, as well as other data sources. For enterprise users who want flexibility and portability, Dremio also runs on premises and across multiple clouds, and can be deployed using AWS CloudFormation, in Kubernetes pods, in Docker containers or in Apache Hadoop environments.
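As one hedged example of the CloudFormation route, the sketch below launches a stack with boto3; the template URL, stack name and parameter names are hypothetical placeholders, so consult the actual deployment template for the parameters it expects.

```python
# Illustrative sketch: launch a CloudFormation stack for the engine with boto3.
# Template URL, stack name and parameters below are hypothetical placeholders.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="dremio-data-lake-engine",
    TemplateURL="https://example.com/templates/dremio.yaml",  # placeholder template
    Parameters=[
        {"ParameterKey": "KeyName", "ParameterValue": "my-ec2-keypair"},
        {"ParameterKey": "VpcId", "ParameterValue": "vpc-0123456789abcdef0"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # stacks that create IAM roles need this
)

# Block until stack creation completes
cfn.get_waiter("stack_create_complete").wait(StackName="dremio-data-lake-engine")
```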
Learn More
Dremio’s cloud data lake engine can help organizations strike the right balance between data lake and data warehouse investments while reducing cost and complexity. It does this while helping enterprises avoid vendor lock-in and data duplication, and while letting users keep full control of their data. Download our free whitepaper Building a Modern Architecture for Interactive Analytics on Amazon S3 Using Dremio to learn how to get started, or visit Dremio.com.