GNARLY DATA WAVES EPISODE
Overview of Dremio’s Data Lakehouse
On our 1st episode of Gnarly Data Waves, Read Maloney provides an overview of getting started with Dremio's Data Lakehouse and showcases Dremio use cases and advantages.
A data lakehouse combines the performance, functionality, and governance of a data warehouse with the scalability and cost advantages of a data lake. With a data lakehouse, engines can access and manipulate data directly from data lake storage without copying data into expensive proprietary systems using ETL pipelines.
A data lakehouse is a new type of data platform architecture that is typically split into five key elements.
In a data lakehouse architecture, the data is stored in open formats like Parquet, ORC and Apache Iceberg, allowing multiple engines to work in unison on the same data. Therefore, data consumers can have faster and more direct access to the data.
A data lakehouse offers flexible and scalable solutions for data storage and management. By leveraging cloud-based object stores, open-source table formats, and query engines, data lakehouses provide organizations with the tools they need to store and manage large volumes of structured and unstructured data at a lower cost.
A data lakehouse offers storage where the data lands after ingestion from operational systems. Inexpensive object stores are available from the three major cloud providers: Amazon S3, Azure Blob Storage, and Google Cloud Storage. These services support storing any type of data and provide the required performance and security.
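To make this concrete, here is a minimal sketch (using the boto3 SDK; the bucket and object names are hypothetical) of an ingestion job landing a raw file in an Amazon S3 bucket. Azure Blob Storage and Google Cloud Storage offer equivalent SDK calls.

```python
# Minimal sketch: landing raw ingested data in cloud object storage with boto3.
# The bucket name and object key are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

# Upload a file produced by an ingestion job into the lake's landing zone.
s3.upload_file(
    Filename="orders_2024_06_01.json",        # local file from an operational system
    Bucket="example-lakehouse-landing",       # hypothetical bucket
    Key="raw/orders/2024/06/01/orders.json",  # hypothetical landing-zone path
)
```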
The next component is the file format, which determines how the actual data is stored, usually in columnar layouts that provide advantages in reading and sharing data between multiple systems. Common file formats include Apache Parquet, ORC, and Apache Arrow.
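For illustration, the following sketch uses PyArrow to write and read back a small Parquet file; the table and column names are made up for the example.

```python
# Minimal sketch: writing data in a columnar file format with PyArrow.
import pyarrow as pa
import pyarrow.parquet as pq

# A small batch of ingested records held in memory as an Arrow table.
orders = pa.table({
    "order_id": [1001, 1002, 1003],
    "customer": ["acme", "globex", "initech"],
    "amount": [49.90, 120.00, 15.25],
})

# Persist the batch as a Parquet file; the columnar layout lets engines
# read only the columns they need.
pq.write_table(orders, "orders.parquet")

# Any engine that understands Parquet can read the same file back.
print(pq.read_table("orders.parquet", columns=["order_id", "amount"]))
```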
The most important component is the table format, which organizes and manages the raw data files in data lake storage. Table formats abstract the complexity of the physical data structure and allow different engines to work simultaneously on the same data. Apache Iceberg, Apache Hudi, and Delta Lake are the three most popular table formats, and they are rapidly gaining enterprise adoption.
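As an illustrative sketch, the snippet below creates and writes an Apache Iceberg table through Spark SQL. It assumes a Spark session that has already been configured with the Iceberg runtime and a catalog named "lakehouse" backed by object storage; the catalog, schema, and table names are hypothetical.

```python
# Minimal sketch: creating and writing an Apache Iceberg table from Spark SQL.
# Assumes the Iceberg Spark runtime and a catalog named "lakehouse" are configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-example").getOrCreate()

# The table format tracks which data files belong to the table, so engines never
# have to list raw files in object storage themselves.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.sales.orders (
        order_id BIGINT,
        customer STRING,
        amount   DOUBLE
    ) USING iceberg
""")

spark.sql("INSERT INTO lakehouse.sales.orders VALUES (1001, 'acme', 49.90)")
```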
Query engines are responsible for processing the data and providing efficient read performance. Some engines have native connections to BI tools such as Tableau and Power BI, making it easy to report on the data directly. Engines such as Dremio Sonar and Apache Spark work with table formats like Apache Iceberg to enable a robust lakehouse architecture using common languages like SQL.
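To illustrate multiple engines working on the same data, the hedged sketch below reads the same hypothetical Iceberg table with PyIceberg rather than Spark; the catalog name and endpoint are placeholders.

```python
# Minimal sketch: a second engine/library reading the same Iceberg table via PyIceberg.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{"uri": "http://localhost:8181"},  # hypothetical REST catalog endpoint
)

orders = catalog.load_table("sales.orders")

# Scan the table and pull the result into an Arrow table for local analysis.
result = orders.scan().to_arrow()
print(result)
```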
The final component of a data lakehouse is the downstream applications interacting with the data. These include BI tools such as Tableau and Power BI, and machine learning frameworks like TensorFlow and PyTorch, making it easy for data analysts, data scientists, and ML engineers to access the data directly.
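As a small example of a downstream consumer, the sketch below loads a Parquet file into pandas and runs the kind of aggregation an analyst or BI tool might issue before building a report or a model; the file and column names are illustrative.

```python
# Minimal sketch: a downstream consumer exploring lakehouse data with pandas.
import pandas as pd

# Requires a Parquet engine such as pyarrow to be installed.
orders = pd.read_parquet("orders.parquet")

# A simple aggregation an analyst might run before reporting or feature engineering.
revenue_by_customer = orders.groupby("customer")["amount"].sum()
print(revenue_by_customer)
```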
A data lakehouse architecture blends the best of a data warehouse and data lake to support modern analytical workloads.
A data lake architecture does not enforce governance policies on the data; the quality of the data landing in the object store may not be suitable for deriving insights, leading to data swamp problems. A data lakehouse adopts best practices from the data warehouse to ensure proper governance and access control.
A data lakehouse supports ACID transactions, ensuring the same atomicity and data consistency guarantees as a data warehouse. This is critical when multiple read and write operations run concurrently in a production scenario.
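For example, the sketch below runs an atomic upsert with Spark SQL's MERGE INTO against the hypothetical Iceberg table from the earlier snippet (it assumes the Iceberg SQL extensions are enabled in the Spark session). The whole statement commits as a single snapshot, so concurrent readers see either the old data or the new data, never a partial write.

```python
# Minimal sketch: an atomic upsert on an Iceberg table using Spark SQL's MERGE INTO.
# Reuses the hypothetical "lakehouse.sales.orders" table and Spark session from above.
spark.sql("""
    MERGE INTO lakehouse.sales.orders AS t
    USING (SELECT 1001 AS order_id, 'acme' AS customer, 55.00 AS amount) AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET t.amount = s.amount
    WHEN NOT MATCHED THEN INSERT (order_id, customer, amount)
        VALUES (s.order_id, s.customer, s.amount)
""")
```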
A lakehouse architecture enforces schema on write, guaranteeing that new data respects the table's schema, and it supports schema evolution without side effects: as new use cases emerge, data types can change and new columns can be added.
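Continuing the earlier hypothetical example, the sketch below evolves the table's schema by adding a column; in Iceberg this is a metadata-only change, so existing data files are untouched and simply read the new column back as NULL.

```python
# Minimal sketch: schema evolution on the same hypothetical Iceberg table.
spark.sql("ALTER TABLE lakehouse.sales.orders ADD COLUMNS (discount DOUBLE)")

# New writes may include the new column; rows written before the change read it as NULL.
spark.sql("INSERT INTO lakehouse.sales.orders VALUES (1002, 'globex', 120.00, 0.10)")
```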
A data lakehouse mitigates the critical problems experienced with data warehouses and data lakes.
A data lakehouse supports diverse workloads, including SQL for business intelligence, data science, and near real-time analytics. Various BI and reporting tools (e.g., Tableau) have direct access to the data in a lakehouse without the need for complex and error-prone ETL processes.
With all the data stored in cost-effective cloud object storage, organizations don’t have to pay the hefty costs associated with data warehouses or the licensing costs for BI extracts. Keep your data in cloud object storage and reduce expensive ETL pipelines.
A lakehouse architecture separates compute and storage, which allows these components to scale independently to meet the needs of an organization or department.
The open nature of the data lakehouse architecture allows teams to use multiple engines on the same data, depending on the use case, and helps to avoid vendor lock-in. This is important for organizations with data infrastructure across multiple cloud providers.
Resources can be scaled up elastically based on the type of workload.
With a data lakehouse architecture, engines can access data directly from the data lake storage without copying data using ETL pipelines for reporting or moving data out for machine learning-based workloads. Make ETL optional and simplify your data pipelines.
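As an illustration of direct access, the sketch below uses PyArrow datasets to scan Parquet files straight out of an S3 path and aggregate them locally, with no intermediate copy; the bucket, path, and region are placeholders, and credentials are assumed to come from the environment.

```python
# Minimal sketch: reading Parquet files directly from object storage, no ETL copy.
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")  # assumes credentials from the environment

# Point a dataset at the hypothetical landing-zone prefix in the bucket.
orders = ds.dataset(
    "example-lakehouse-landing/raw/orders",
    filesystem=s3,
    format="parquet",
)

# Only the columns needed for this question are fetched from storage.
table = orders.to_table(columns=["customer", "amount"])
print(table.group_by("customer").aggregate([("amount", "sum")]))
```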
Data lakehouses are used by a wide variety of organizations, including enterprise data platform teams, government agencies, and educational institutions, to quickly analyze large amounts of data and make informed decisions.
CUSTOMER USE CASE
“Dremio bridges the data warehouse and the data lake, enabling NCR to derive more value between the two data sources.”
Ivan Alvarez | IT Vice President, Big Data and Analytics, NCR Corporation
CUSTOMER USE CASE
“Where it can be expressed simply in SQL, Dremio will not be beaten…It’s given us this enterprise-wide semantic layer facilitating us to have a data mesh architecture on the lake. And on top of that we have happy engineers…Been waiting for a tool like Dremio for the last 15 years of my data career.”
Achille Barbieri | Senior Project Manager, Enel
CUSTOMER USE CASE
“With Dremio, we are able to reach the source of truth of the data for analytical access, without data replication. This is a crucial aspect for us…We are among the first companies with this dimension and complexity to implement a data mesh architecture. Agile Lab and Dremio have been important for us to reach these results.”
Andy Kenna | SVP & Head of Data, RenaissanceRe
While first-generation on-premises data warehouses helped businesses derive historical insights from multiple data sources, they required a significant amount of sunk time and cost to manage the infrastructure. Cloud data warehouses addressed some of these problems. As the volume and variety of data increased, organizations on cloud data warehouses realized they needed a different solution that reduced the ETL costs associated with copying data.
To address the problems businesses were experiencing with data warehouses and to democratize data for all sorts of workloads, a different type of data platform emerged — the data lake. It all started with storing, managing, and processing a huge volume of data using the on-premises Hadoop ecosystem (e.g., HDFS for storage, Hive for processing). Eventually cloud data lakes emerged and gave teams the flexibility to store all types of data at a low cost and enable data science workloads on that data.
A SQL data lakehouse uses SQL commands to query cloud data lake storage, simplifying data access and governance for both BI and data science.
Open data lakehouse architectures speed insights and deliver self-service analytics capabilities.