GNARLY DATA WAVES EPISODE
Overview of Dremio’s Data Lakehouse
On our 1st episode of Gnarly Data Waves, Read Maloney provides an overview of getting started with Dremio's Data Lakehouse and showcases Dremio use cases and advantages.
A data lakehouse combines the performance, functionality, and governance of a data warehouse with the scalability and cost advantages of a data lake. With a data lakehouse, engines can access and manipulate data directly from data lake storage without copying data into expensive proprietary systems using ETL pipelines.
A data lakehouse is a new type of data platform architecture that is typically split into five key elements.
In a data lakehouse architecture, the data is stored in open formats like Parquet, ORC and Apache Iceberg, allowing multiple engines to work in unison on the same data. Therefore, data consumers can have faster and more direct access to the data.
A data lakehouse offers flexible and scalable solutions for data storage and management. By leveraging cloud-based object stores, open-source table formats, and query engines, data lakehouses provide organizations with the tools they need to store and manage large volumes of structured and unstructured data at a lower cost.
A data lakehouse offers storage where the data lands after ingestion from operational systems. Inexpensive object stores are available from the three major cloud providers: Amazon S3, Azure Blob Storage, and Google Cloud Storage. These services support storing any type of data and provide the required performance and security.
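To make this concrete, here is a minimal sketch (using the boto3 SDK; the bucket and object names are hypothetical) of an ingestion job landing a raw file in an Amazon S3 bucket. Azure Blob Storage and Google Cloud Storage offer equivalent SDK calls.

```python
# Minimal sketch: landing raw ingested data in cloud object storage with boto3.
# The bucket name and object key are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

# Upload a file produced by an ingestion job into the lake's landing zone.
s3.upload_file(
    Filename="orders_2024_06_01.json",        # local file from an operational system
    Bucket="example-lakehouse-landing",       # hypothetical bucket
    Key="raw/orders/2024/06/01/orders.json",  # hypothetical landing-zone path
)
```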
The next component is the file format, which determines how the actual data is stored, usually in columnar layouts that provide advantages in reading and sharing data between multiple systems. Common file formats include Apache Parquet, ORC, and Apache Arrow.
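For illustration, the following sketch uses PyArrow to write and read back a small Parquet file; the table and column names are made up for the example.

```python
# Minimal sketch: writing data in a columnar file format with PyArrow.
import pyarrow as pa
import pyarrow.parquet as pq

# A small batch of ingested records held in memory as an Arrow table.
orders = pa.table({
    "order_id": [1001, 1002, 1003],
    "customer": ["acme", "globex", "initech"],
    "amount": [49.90, 120.00, 15.25],
})

# Persist the batch as a Parquet file; the columnar layout lets engines
# read only the columns they need.
pq.write_table(orders, "orders.parquet")

# Any engine that understands Parquet can read the same file back.
print(pq.read_table("orders.parquet", columns=["order_id", "amount"]))
```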
The most important component is the table format, which organizes and manages the raw data files in data lake storage. Table formats abstract the complexity of the physical data structure and allow different engines to work simultaneously on the same data. Apache Iceberg, Apache Hudi, and Delta Lake are the three most popular table formats, and they are rapidly gaining enterprise adoption.
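As an illustrative sketch, the snippet below creates and writes an Apache Iceberg table through Spark SQL. It assumes a Spark session that has already been configured with the Iceberg runtime and a catalog named "lakehouse" backed by object storage; the catalog, schema, and table names are hypothetical.

```python
# Minimal sketch: creating and writing an Apache Iceberg table from Spark SQL.
# Assumes the Iceberg Spark runtime and a catalog named "lakehouse" are configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-example").getOrCreate()

# The table format tracks which data files belong to the table, so engines never
# have to list raw files in object storage themselves.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales.orders (
        order_id BIGINT,
        customer STRING,
        amount   DOUBLE
    ) USING iceberg
""")

spark.sql("INSERT INTO lakehouse.sales.orders VALUES (1001, 'acme', 49.90)")
```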
Query engines are responsible for processing the data and providing efficient read performance. Some engines have native connections to BI tools such as Tableau and Power BI, making it easy to report on the data directly. Engines such as Dremio Sonar and Apache Spark work with table formats like Apache Iceberg to enable a robust lakehouse architecture using common languages like SQL.
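To illustrate multiple engines working on the same data, the hedged sketch below reads the same hypothetical Iceberg table with PyIceberg rather than Spark; the catalog name and endpoint are placeholders.

```python
# Minimal sketch: a second engine/library reading the same Iceberg table via PyIceberg.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{"uri": "http://localhost:8181"},  # hypothetical REST catalog endpoint
)

orders = catalog.load_table("sales.orders")

# Scan the table and pull the result into an Arrow table for local analysis.
result = orders.scan().to_arrow()
print(result)
```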
The final component of a data lakehouse is the downstream applications interacting with the data. These include BI tools such as Tableau and Power BI, and machine learning frameworks like TensorFlow and PyTorch, making it easy for data analysts, data scientists, and ML engineers to access the data directly.
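As a small example of a downstream consumer, the sketch below loads a Parquet file into pandas and runs the kind of aggregation an analyst or BI tool might issue before building a report or a model; the file and column names are illustrative.

```python
# Minimal sketch: a downstream consumer exploring lakehouse data with pandas.
import pandas as pd

# Requires a Parquet engine such as pyarrow to be installed.
orders = pd.read_parquet("orders.parquet")

# A simple aggregation an analyst might run before reporting or feature engineering.
revenue_by_customer = orders.groupby("customer")["amount"].sum()
print(revenue_by_customer)
```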
A data lakehouse architecture blends the best of a data warehouse and data lake to support modern analytical workloads.
A data lake architecture does not enforce governance policies on the data; the quality of the data landing in the object store may not be suitable for deriving insights, leading to data swamp problems. A data lakehouse adopts best practices from the data warehouse to ensure proper governance and access control.
A data lakehouse supports ACID transactions, ensuring the same atomicity and data consistency guarantees as a data warehouse. This is critical when multiple read and write operations run concurrently in a production scenario.
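For example, the sketch below runs an atomic upsert with Spark SQL's MERGE INTO against the hypothetical Iceberg table from the earlier snippet (it assumes the Iceberg SQL extensions are enabled in the Spark session). The whole statement commits as a single snapshot, so concurrent readers see either the old data or the new data, never a partial write.

```python
# Minimal sketch: an atomic upsert on an Iceberg table using Spark SQL's MERGE INTO.
# Reuses the hypothetical "lakehouse.sales.orders" table and Spark session from above.
spark.sql("""
    MERGE INTO lakehouse.sales.orders AS t
    USING (SELECT 1001 AS order_id, 'acme' AS customer, 55.00 AS amount) AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET t.amount = s.amount
    WHEN NOT MATCHED THEN INSERT (order_id, customer, amount)
        VALUES (s.order_id, s.customer, s.amount)
""")
```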
A lakehouse architecture enforces schema on write, guaranteeing that new data respects the table's schema, and it supports schema evolution without side effects: as new use cases emerge, data types can change and new columns can be added.
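Continuing the earlier hypothetical example, the sketch below evolves the table's schema by adding a column; in Iceberg this is a metadata-only change, so existing data files are untouched and simply read the new column back as NULL.

```python
# Minimal sketch: schema evolution on the same hypothetical Iceberg table.
spark.sql("ALTER TABLE lakehouse.sales.orders ADD COLUMNS (discount DOUBLE)")

# New writes may include the new column; rows written before the change read it as NULL.
spark.sql("INSERT INTO lakehouse.sales.orders VALUES (1002, 'globex', 120.00, 0.10)")
```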
A data lakehouse mitigates the critical problems experienced with data warehouses and data lakes.
A data lakehouse supports diverse workloads, including SQL for business intelligence, data science, and near real-time analytics. Various BI and reporting tools (e.g., Tableau) have direct access to the data in a lakehouse without the need for complex and error-prone ETL processes.
With all the data stored in cost-effective cloud object storage, organizations don’t have to pay the hefty costs associated with data warehouses or the licensing costs for BI extracts. Keep your data in cloud object storage and reduce expensive ETL pipelines.
A lakehouse architecture separates compute and storage, which allows these components to scale independently to meet the needs of an organization or department.
The open nature of the data lakehouse architecture allows teams to use multiple engines on the same data, depending on the use case, and helps to avoid vendor lock-in. This is important for organizations with data infrastructure across multiple cloud providers.
Resources can be scaled up elastically based on the type of workload.
With a data lakehouse architecture, engines can access data directly from the data lake storage without copying data using ETL pipelines for reporting or moving data out for machine learning-based workloads. Make ETL optional and simplify your data pipelines.
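As an illustration of direct access, the sketch below uses PyArrow datasets to scan Parquet files straight out of an S3 path and aggregate them locally, with no intermediate copy; the bucket, path, and region are placeholders, and credentials are assumed to come from the environment.

```python
# Minimal sketch: reading Parquet files directly from object storage, no ETL copy.
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # assumes credentials from the environment

# Point a dataset at the hypothetical landing-zone prefix in the bucket.
orders = ds.dataset(
    "example-lakehouse-landing/raw/orders",
    filesystem=s3,
    format="parquet",
)

# Only the columns needed for this question are fetched from storage.
table = orders.to_table(columns=["customer", "amount"])
print(table.group_by("customer").aggregate([("amount", "sum")]))
```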
Data lakehouses are used by a wide variety of organizations, including enterprise data platform teams, government agencies, and educational institutions, to quickly analyze large amounts of data and make informed decisions.
CUSTOMER USE CASE
“Dremio bridges the data warehouse and the data lake, enabling NCR to derive more value between the two data sources.”
Ivan Alvarez | IT Vice President, Big Data and Analytics, NCR Corporation
CUSTOMER USE CASE
“Where it can be expressed simply in SQL, Dremio will not be beaten…It’s given us this enterprise-wide semantic layer facilitating us to have a data mesh architecture on the lake. And on top of that we have happy engineers…Been waiting for a tool like Dremio for the last 15 years of my data career.”
Achille Barbieri | Senior Project Manager, Enel
CUSTOMER USE CASE
“With Dremio, we are able to reach the source of truth of the data for analytical access, without data replication. This is a crucial aspect for us…We are among the first companies with this dimension and complexity to implement a data mesh architecture. Agile Lab and Dremio have been important for us to reach these results.”
Andy Kenna | SVP & Head of Data, RenaissanceRe
While first-generation on-premises data warehouses helped businesses derive historical insights from multiple data sources, they required a significant amount of sunk time and cost to manage the infrastructure. Cloud data warehouses addressed some of these problems. As the volume and variety of data increased, organizations on cloud data warehouses realized they needed a different solution that reduced the ETL costs associated with copying data.
To address the problems businesses were experiencing with data warehouses and to democratize data for all sorts of workloads, a different type of data platform emerged — the data lake. It all started with storing, managing, and processing a huge volume of data using the on-premises Hadoop ecosystem (e.g., HDFS for storage, Hive for processing). Eventually cloud data lakes emerged and gave teams the flexibility to store all types of data at a low cost and enable data science workloads on that data.
A SQL data lakehouse uses SQL commands to query cloud data lake storage, simplifying data access and governance for both BI and data science.
Open data lakehouse architectures speed insights and deliver self-service analytics capabilities.