11 minute read · February 1, 2024

Why Lakehouse, Why Now? What Is a Data Lakehouse and How to Get Started

Alex Merced · Senior Tech Evangelist, Dremio

The story of the data lakehouse is a tale of evolution, responding to the growing demands for more adept data processing. In this article, we delve into this journey and explore how each phase in data management's evolution contributed to the data lakehouse's rise. This solution promises to harmonize the strengths of its predecessors while addressing their shortcomings.

The Beginning: OLTP Databases and Their Limitations

The journey to the modern data lakehouse begins with traditional Online Transaction Processing (OLTP) databases. Initially, these databases were the backbone of operational workloads, handling myriad transactions. However, they encountered significant challenges when it came to analytical processing. The primary issue was that OLTP databases were optimized for transactional integrity and speed, not for complex analytical queries. As the volume and variety of data grew, these systems struggled to keep pace, prompting the development of more specialized solutions.

Emergence of Data Warehouses and OLAP Systems

Data warehouses and Online Analytical Processing (OLAP) systems were developed to address the shortcomings of OLTP databases in handling analytical workloads. These systems were explicitly designed for query and analysis, offering improved performance for analytical queries. They provided structured environments where data could be cleaned, transformed, and stored for business intelligence. However, these on-premises deployments came with their own challenges: they were costly to maintain, complex to operate, and difficult to scale. The coupling of storage and compute resources often led to inefficiencies, with organizations having to pay for more capacity than they needed.

Data Lakes: A Paradigm Shift

Hadoop and similar technologies offered a more affordable repository for structured and unstructured data, giving birth to the data lake concept. The critical advantage of data lakes was their ability to store vast amounts of raw data in native formats. This meant that only the data needed for analysis had to be processed and moved into data warehouses. However, directly using data lakes for analytics proved cumbersome and slow. They lacked the processing power and optimized structures of data warehouses, making them unsuitable as standalone analytical solutions.

The Move to the Cloud: Decoupling of Storage and Compute

The migration of data warehouses and data lakes to the cloud represented a significant advancement. Cloud deployments offered the much-needed decoupling of storage and compute resources. This separation meant organizations could scale storage and processing independently, increasing flexibility and cost-efficiency. Maintenance became minimal, but despite these improvements, running data warehouses in the cloud remained expensive, especially at petabyte scale. This was particularly evident as the volume of data continued to grow exponentially.

The Quest for an Alternative

The search for an alternative solution began with data warehouses becoming increasingly costly and data lakes lacking analytical capabilities. Organizations needed a system that combined the storage capabilities of data lakes with the analytical power of data warehouses, all while being cost-effective, scalable, and efficient. This quest laid the groundwork for the emergence of the data lakehouse — a new architecture that promised to address the shortcomings of its predecessors.

The Birth of the Data Lakehouse

The data lakehouse emerged as a unified solution to the challenges faced by traditional data warehouses and data lakes, combining the best features of its predecessors. This architecture offers the vast storage capabilities of data lakes alongside the powerful analytical processing of data warehouses.

Core Technologies Behind the Data Lakehouse

Object-storage solutions: Object storage services like Amazon S3, Azure Data Lake Storage (ADLS), and the S3-compatible MinIO provide scalable, secure, and cost-effective storage. They offer the foundational layer for storing vast amounts of structured and unstructured data in a data lakehouse.

Columnar storage with Parquet: The adoption of Apache Parquet, an open source, binary columnar storage format, revolutionized data storage. Parquet allows for efficient data compression and encoding schemes, reducing storage costs and enhancing query performance due to its columnar nature.
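
To make this concrete, here is a minimal sketch of writing and reading Parquet with the pyarrow library; the file name, columns, and data are purely illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table (the data here is purely illustrative)
table = pa.table({
    "order_id": [1001, 1002, 1003],
    "region": ["east", "west", "east"],
    "amount": [19.99, 5.49, 102.00],
})

# Write it as Parquet; the columnar layout plus compression keeps files small
pq.write_table(table, "orders.parquet", compression="snappy")

# Analytical reads can fetch only the columns a query actually needs
subset = pq.read_table("orders.parquet", columns=["region", "amount"])
print(subset.to_pandas())
```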

Table formats like Apache Iceberg: Open-source table formats such as Apache Iceberg play a pivotal role in data lakehouses. They enable the representation of large datasets as traditional tables, complete with ACID (atomicity, consistency, isolation, durability) transactions and time-travel capabilities. This feature brings the reliability and manageability of traditional databases to the scalability of data lakes.
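
As a rough illustration of what a table format provides, the sketch below uses the pyiceberg library to create a table, commit data, and read an older snapshot. The catalog endpoint, namespace, and table names are assumptions for this example; engines such as Spark or Dremio expose the same capabilities through SQL.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Connect to an Iceberg catalog; the REST endpoint here is an assumption
catalog = load_catalog("lakehouse", uri="http://localhost:8181")

# Create a namespace and a table from an Arrow schema (names are illustrative)
catalog.create_namespace("sales")
schema = pa.schema([("order_id", pa.int64()), ("amount", pa.float64())])
table = catalog.create_table("sales.orders", schema=schema)

# Each append is an atomic commit that produces a new snapshot
table.append(pa.table({"order_id": [1, 2], "amount": [19.99, 5.49]}))
table.append(pa.table({"order_id": [3], "amount": [102.00]}))

# Time travel: read the table as of the first snapshot
first_snapshot = table.snapshots()[0]
old_rows = table.scan(snapshot_id=first_snapshot.snapshot_id).to_arrow()
print(old_rows.num_rows)  # 2 rows, before the second append
```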

Catalogs for data management: Open source catalogs like Project Nessie facilitate data versioning and management, akin to git functionality for data. Nessie makes tables portable across various tools and environments, enhancing data governance and collaboration.
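
The sketch below hints at what that git-like workflow looks like from Spark. It assumes a SparkSession launched with the Iceberg and Nessie SQL extensions and a Nessie-backed catalog registered as "nessie"; the branch, namespace, and table names are illustrative.

```python
from pyspark.sql import SparkSession

# Assumes Spark was started with the Iceberg + Nessie SQL extensions and a
# Nessie-backed Iceberg catalog named "nessie"; names below are illustrative.
spark = SparkSession.builder.getOrCreate()

# Create an isolated branch, much like a git feature branch
spark.sql("CREATE BRANCH IF NOT EXISTS etl_jan IN nessie")

# Point the session at that branch and make changes in isolation
spark.sql("USE REFERENCE etl_jan IN nessie")
spark.sql("INSERT INTO nessie.sales.orders VALUES (1004, 47.50)")

# Once validated, merge the branch so consumers on main see the changes atomically
spark.sql("MERGE BRANCH etl_jan INTO main IN nessie")
```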

The data lakehouse platform: Platforms that integrate these technologies into a cohesive user experience and expose them through a single, unified access layer.

Dremio stands out as the premier data lakehouse platform, adeptly meeting all the requirements for creating a unified access layer and a comprehensive data lakehouse. It integrates the technologies above seamlessly and includes the necessary features to create a proper data lakehouse abstraction on top of your data lake.

Getting Started with a Data Lakehouse

Choose your storage: Select a cloud-based object storage solution that suits your scale and budget. If you can’t be in the cloud, consider on-prem object storage options like MinIO, OpenIO, ECS, StorageGRID, and more.
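
For example, a lakehouse bucket on an S3-compatible store can be provisioned with a few lines of boto3; the endpoint, credentials, and bucket name below are placeholders (omit endpoint_url when targeting Amazon S3 itself).

```python
import boto3

# S3-compatible client; the endpoint and credentials below are placeholders
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # e.g., a local MinIO server
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# Create a bucket to serve as the lakehouse storage layer and land a file in it
s3.create_bucket(Bucket="lakehouse")
s3.upload_file("orders.parquet", "lakehouse", "raw/orders.parquet")

# Data files are now plain objects that any engine can read
for obj in s3.list_objects_v2(Bucket="lakehouse").get("Contents", []):
    print(obj["Key"], obj["Size"])
```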

Implement a table format: Adopt a table format like Apache Iceberg to structure your data within the lakehouse.

Set up a data catalog: Implement a system like Nessie to manage your data assets efficiently. (This is already integrated into the Dremio Cloud lakehouse platform as its “Arctic Catalog,” which saves you the trouble of deploying and managing a catalog for your lakehouse tables.)

Integrate a data lakehouse platform: Get started with a Dremio Cloud or Dremio Software deployment. Refer to this tutorial to see all these pieces running in a small prototype on your laptop.

Begin integration: Connect your existing databases, data lakes, and data warehouses to your Dremio cluster and, as needed, convert data into Apache Iceberg tables tracked in an Arctic catalog; Dremio automatically optimizes and maintains these tables.
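
One way to drive this step programmatically is Dremio's Arrow Flight endpoint, shown in the sketch below with pyarrow. The host, port, credentials, and the source/catalog/table names are placeholders; Dremio also exposes JDBC, ODBC, and REST interfaces.

```python
from pyarrow import flight

# Connect to Dremio's Arrow Flight endpoint (host, port, and credentials
# below are placeholders for your own deployment)
client = flight.FlightClient("grpc+tcp://localhost:32010")
token = client.authenticate_basic_token("username", "password")
options = flight.FlightCallOptions(headers=[token])

def run(sql: str):
    """Submit a SQL statement to Dremio and return the result as an Arrow table."""
    info = client.get_flight_info(
        flight.FlightDescriptor.for_command(sql), options)
    return client.do_get(info.endpoints[0].ticket, options).read_all()

# Convert a connected source's table into an Iceberg table in the catalog
# (source, catalog, and table names are illustrative)
run("CREATE TABLE arctic.sales.orders AS SELECT * FROM postgres.public.orders")

# Query the new Iceberg table like any other table
print(run("SELECT COUNT(*) AS cnt FROM arctic.sales.orders"))
```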

Curate your data: Craft virtual data marts using Dremio’s semantic layer, and use Dremio’s reflections to avoid writing complex acceleration pipelines.
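
Continuing the same pattern, the sketch below creates a curated view in a semantic-layer space; the space, view, and column names are placeholders, and a reflection can then be enabled on the view to accelerate queries against it.

```python
from pyarrow import flight

# Same placeholder connection details as the previous sketch
client = flight.FlightClient("grpc+tcp://localhost:32010")
options = flight.FlightCallOptions(
    headers=[client.authenticate_basic_token("username", "password")])

# Expose a cleaned, business-friendly view in a semantic-layer space
# ("analytics" and the column names are illustrative)
sql = """
CREATE VIEW analytics.sales_by_region AS
SELECT region, SUM(amount) AS total_amount
FROM arctic.sales.orders
GROUP BY region
"""
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
client.do_get(info.endpoints[0].ticket, options).read_all()
```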

Implement governance and security: Ensure that data governance and security protocols are in place to protect and manage your data effectively. Dremio includes role-based access controls along with column- and row-level controls for fine-grained governance.

Train your team: Equip your team with the necessary knowledge and tools to leverage the full potential of your data lakehouse.

Conclusion

The path from OLTP databases to the modern data lakehouse is a testament to the relentless pursuit of more advanced data management solutions. Each phase of this journey, from the early days of OLTP databases to the advent of data warehouses, OLAP systems, and the transformative emergence of data lakes, has played a crucial role in shaping today's data-centric world. The data lakehouse, as the latest milestone in this evolution, embodies the collective strengths of its predecessors while addressing their limitations. It represents a unified, efficient, and scalable approach to data storage and analysis, promising to unlock new possibilities in data analytics. As we embrace the data lakehouse era, spearheaded by platforms like Dremio, we stand on the cusp of a new horizon in data management, poised to harness the full potential of our ever-growing data resources.

Create a Prototype Data Lakehouse on Your Laptop with this Tutorial
