
August 27, 2024

What’s New in Dremio 25.1: Improved Performance, Data Ingestion, and Federated Access for Apache Iceberg Lakehouses

Mark Shainman · Principal Product Marketing Manager

In today’s data-driven world, businesses face the constant challenge of managing and analyzing data across various environments—cloud, on-premises, and hybrid. With our latest release of Dremio 25.1, we continue to innovate and deliver features that enhance performance, streamline data ingestion, and improve federated query access. This release introduces improvements that collectively drive better performance, efficiency, and security in managing and querying data.

Improved Lakehouse Performance

Dremio 25.1 sets a new standard for lakehouse performance, reinforcing Dremio’s commitment to delivering the highest-performance Iceberg lakehouse platform on the market and its position as the premier lakehouse analytics platform for Apache Iceberg.

Reflection Enhancements

Dremio’s Reflections act as an optimized relational cache, accelerating queries for analytical workloads. Query performance improvements are driven by the optimized physical data structure and the strategic planning advantages provided by Apache Iceberg metadata. In version 25.1, numerous Reflection enhancements further improve the performance and management of Reflections.

With Live Reflections on Iceberg tables, Dremio ensures that any changes in underlying data structures automatically update the Reflections, keeping them current. Continuous refresh processes reduce manual intervention and management overhead, enhancing query performance by maintaining up-to-date Reflections. Additionally, version 25.1 introduces improved Reflection recommendations that take a holistic view of workloads, analyzing trends and patterns in queries over the past seven days. This enables the recommendation engine to suggest Reflections that accelerate a wide range of queries, optimizing performance across all workloads and ensuring the best return on investment.
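As a concrete illustration, the sketch below defines an aggregation Reflection using Dremio’s Reflection DDL; the table, Reflection name, and columns are hypothetical and shown only to make the idea tangible.

    -- Hypothetical table and columns; a sketch of an aggregation Reflection definition.
    ALTER TABLE sales.transactions
    CREATE AGGREGATE REFLECTION agg_sales_by_region
    USING
      DIMENSIONS (region, sale_date)
      MEASURES (amount (SUM, COUNT));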

Result Set Caching

Dremio’s result set caching mechanism significantly enhances the performance of analytical queries by storing the results of frequently accessed queries. This feature can deliver up to a 28x performance improvement for commonly used queries by eliminating the need to reprocess complex computations. The system immediately writes query result sets to distributed storage as Apache Arrow files, which are maintained centrally. If an incoming query matches a previously cached result set, the system returns the cached results, minimizing query response times and improving user productivity.
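As a simple illustration, a recurring dashboard query like the hypothetical one below is a natural candidate: its first execution computes and caches the result set, and identical subsequent runs can be served from the cache.

    -- Hypothetical query: the first run computes and caches the result set as Arrow files;
    -- identical later runs can be answered from the cache instead of being recomputed.
    SELECT region, SUM(amount) AS total_sales
    FROM sales.transactions
    WHERE sale_date >= DATE '2024-01-01'
    GROUP BY region;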

Merge-on-Read

In modern analytical environments, the ability to update data rapidly is crucial. Dremio’s new Merge-on-Read (MoR) capability allows for efficient data management by applying changes without the immediate rewriting of large datasets. Traditional data warehouses usually require direct modification of data files at the time of updates or deletes, which can be slow and resource-intensive for certain write operations. Dremio’s MoR approach, however, writes changes to log or delta files, allowing the system to operate smoothly even during updates. This method enhances performance, ensures data consistency, and supports scalability, making it a highly effective component for Iceberg lakehouse environments.
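In Apache Iceberg, merge-on-read behavior is governed by standard table properties (write.delete.mode, write.update.mode, and write.merge.mode). The sketch below uses a hypothetical table and assumes an ALTER TABLE ... SET TBLPROPERTIES statement to opt into merge-on-read for deletes, updates, and merges.

    -- Hypothetical table; property names come from the Iceberg table spec,
    -- and the SET TBLPROPERTIES syntax is assumed here for illustration.
    ALTER TABLE lakehouse.sales.orders SET TBLPROPERTIES (
      'write.delete.mode' = 'merge-on-read',
      'write.update.mode' = 'merge-on-read',
      'write.merge.mode'  = 'merge-on-read'
    );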

Automatic Iceberg Data Ingestion with Auto Ingest Pipelines

Dremio continues to innovate with its latest feature, Auto Ingest Pipelines for Iceberg tables. This cutting-edge functionality, available in both Dremio Enterprise Software and Dremio Cloud, simplifies data ingestion from Amazon S3 into Iceberg tables in lakehouse environments. Auto Ingest Pipelines automate and streamline the data ingestion process, reducing the complexity of managing data pipelines and ensuring that Iceberg tables are continuously updated with fresh data.

Auto Ingest Pipelines follow a cloud-native, event-driven architecture for notifications. For Amazon S3, Dremio leverages native S3 event notifications to load new files into Iceberg tables as they arrive, so changes in the source data are reflected in the Iceberg tables and the data available for analysis stays current and accurate. Additionally, Auto Ingest Pipelines enforce exactly-once write semantics, ensuring that each piece of data is written to storage exactly once, even in the presence of failures or retries. Dremio’s Auto Ingest Pipelines achieve this by checking the files to be loaded against the set of previously loaded files, ensuring that each file is loaded only once.
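A minimal sketch of what defining such a pipeline can look like, assuming a CREATE PIPE ... AS COPY INTO statement and using hypothetical pipe, bucket, and table names:

    -- Hypothetical names; the CREATE PIPE syntax is assumed here for illustration.
    CREATE PIPE orders_pipe
    AS COPY INTO lakehouse.sales.orders
    FROM '@s3_source/incoming/orders/'
    FILE_FORMAT 'parquet';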

By leveraging Auto Ingest Pipelines, companies can eliminate some of the operational costs associated with building and maintaining event-driven pipelines. By automating the ingestion process, organizations reduce the complexity and overhead of these tasks, leading to faster, more reliable data access and enhanced performance, ultimately improving total cost of ownership (TCO).

Accelerating Cross-Database Access Control and Workload Management with User Impersonation

Dremio understands that managing and querying data across diverse environments—cloud, on-premises, or hybrid—can be challenging. In this latest release, we have introduced significant improvements in query federation capabilities, simplifying data access and ensuring robust security and performance. A key new feature is user impersonation for Teradata, Oracle, and SQL Server, which enhances security, personalization, and performance in federated query execution.

Improved Access Controls

User impersonation is a mechanism that allows a query engine to execute queries across multiple data sources using the identity and permissions of the end user who initiated the query. This ensures that each data source enforces its own security policies and access controls based on the individual user’s credentials, maintaining consistent and secure access. Dremio passes the user’s credentials along to the underlying databases, ensuring that queries are executed with the appropriate permissions and access controls. This significantly enhances security by supporting more granular user permissions, improving access control, and enabling detailed tracking of user workloads.
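As a hypothetical example, if an analyst’s account on the source database holds only the grant below, a federated query submitted through Dremio under impersonation is limited to exactly that access:

    -- On the source database (e.g., Oracle), the analyst's own grant applies
    -- (hypothetical user and objects):
    GRANT SELECT ON hr.employees TO analyst_jane;

    -- In Dremio, a federated query against that source executes as analyst_jane,
    -- so it succeeds only for objects she is permitted to read
    -- (hypothetical source name):
    SELECT employee_id, department_id
    FROM oracle_prod.hr.employees;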

Leveraging Existing Database Native Workload Management Capabilities

User impersonation also allows organizations to more easily leverage existing database native workload management capabilities for queries coming from Dremio. By aligning queries with specific user profiles and resource groups, Dremio enables organizations to fully utilize the workload management features built into their databases, improving the performance and efficiency of federated queries.

For example, in Teradata environments, this feature allows companies to more easily leverage Teradata Active System Management (TASM) for queries mapped to a user coming from Dremio. In Oracle, it makes it possible to utilize Automatic Workload Management in Oracle Real Application Clusters (RAC) for Dremio federated queries, and it enables granular tracking of workload service levels, statistics, and metrics in Oracle's Automatic Workload Repository (AWR). In Microsoft environments, SQL Server users can benefit from enhanced workload personalization, leveraging SQL Server's Resource Governor and its classifier function to align queries with specific user profiles and workload groups. Being able to more easily leverage native Teradata, Oracle, and SQL Server workload management capabilities can improve overall performance for federated queries, leading to faster time to insight, decreased resource utilization, and reduced TCO.
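To make the SQL Server case concrete, here is a minimal sketch of a Resource Governor classifier function that routes impersonated Dremio sessions to a dedicated workload group. The login naming pattern and group name are hypothetical, and this is standard Resource Governor DDL run on the SQL Server side rather than in Dremio.

    -- Hypothetical workload group and login naming convention (run on SQL Server):
    CREATE WORKLOAD GROUP dremio_federated_group;
    GO

    -- Classifier function: route sessions from impersonated analyst logins
    -- to the dedicated workload group.
    CREATE FUNCTION dbo.rg_classify_dremio()
    RETURNS SYSNAME
    WITH SCHEMABINDING
    AS
    BEGIN
        IF SUSER_SNAME() LIKE N'analyst_%'
            RETURN N'dremio_federated_group';
        RETURN N'default';
    END;
    GO

    ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classify_dremio);
    ALTER RESOURCE GOVERNOR RECONFIGURE;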

Get Started Now!

All of these capabilities are available now! If you’re already a Dremio Self-Managed customer, it’s easy to upgrade. Visit our Support Portal to download the latest version. Dremio Cloud users can simply log in to get started. Not yet a Dremio user? Visit the Get Started page to find offerings for Dremio Cloud and Dremio Self-Managed.
