Today we are excited to announce the release of Dremio 4.7.
This month’s release delivers multiple performance- and scale-oriented features, including Arrow caching for data reflections, scale-out coordinator nodes, runtime filtering, and AWS Edition improvements. In total, this release includes 150 improvements!
Here are the highlights covered in this post:
- Arrow caching for data reflections
- Scale-out coordinator nodes
- Engine tagging for AWS Edition
- Disable cross source select
- Improved Helm charts for Kubernetes, AKS and EKS deployments
Arrow Caching for Data Reflections
In this month’s release, Dremio leverages Apache Arrow to accelerate data reflections. Arrow caching for data reflections lets users store reflections in Apache Arrow format when using the columnar cloud cache (C3).
Storing reflections in Arrow format significantly speeds up queries that read from reflections and increases overall system capacity. Because it is built on C3, Arrow caching inherits all the optimizations and benefits Dremio already provides there: it is fully automated and simple to use, so users do not need to spend time on complex administrative or maintenance tasks. To enable Arrow caching, users simply turn on the option for a data reflection and Dremio handles the acceleration behind the scenes.
Arrow caching for data reflections is most beneficial for workloads where data is pre-computed in a reflection and queries primarily look up those pre-computed results. In these scenarios, a large portion of query execution time is spent decompressing the reflection’s compressed Parquet files. With Arrow caching enabled, Dremio loads reflections directly into memory in Arrow format and delivers results faster and more efficiently; in some cases, query execution times drop by 5-10x.
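For reflections defined in SQL, the caching option appears as a clause in the reflection DDL. The sketch below is illustrative only: the dataset, reflection name, and fields are hypothetical, and the exact ARROW CACHE syntax should be confirmed against the Dremio 4.7 SQL reference.

```sql
-- Hypothetical dataset and field names, shown for illustration.
-- The ARROW CACHE clause asks Dremio to cache this reflection in Arrow
-- format via the columnar cloud cache (C3).
ALTER DATASET sales.transactions
  CREATE RAW REFLECTION transactions_raw
  USING DISPLAY (region, sale_date, amount)
  ARROW CACHE;
```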
Scale-Out Coordinator Nodes
Today Dremio can both scale the number of nodes within an execution engine and elastically increase the number of engines in a system, which allows it to scale resources as required to process the largest and most complex big data problems. In the past, however, while Dremio could scale execution resources for query processing, it supported only a single coordinator node, which users connected to and where SQL planning was performed.
Dremio 4.7 now provides the ability to deploy multiple coordinator nodes, increasing the concurrency, the number of users, and the number of queries that a single Dremio system can support. With multiple coordinator nodes, Dremio can scale to support thousands of users simultaneously. To learn more about how to leverage this feature, take a look at the Dremio 4.7 documentation.
Scale-out coordinator nodes mean Dremio can now scale every aspect of the system: the size of each execution engine, the number of engines, and the capacity of the coordinator services. In doing so, Dremio removes the remaining resource-based scaling limits and can support the largest and most complex big data problems.
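For Kubernetes deployments, secondary coordinators are typically sized and counted in the Helm chart’s values.yaml. The excerpt below is a minimal sketch assuming the dremio-cloud-tools chart; the key names and values are illustrative, so confirm them against the chart you deploy.

```yaml
# Illustrative values.yaml excerpt; verify keys against your chart version.
coordinator:
  cpu: 8
  memory: 16384   # MB
  count: 2        # secondary (scale-out) coordinators, in addition to the master
```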
Engine Tagging for AWS Edition
Recently, we announced Dremio AWS Edition, a production-grade, high-scale data lake engine that is highly optimized for AWS, eliminating spend on idle compute and reducing infrastructure compute costs by over 60%. Last month we improved the AWS experience further by introducing a new hourly pricing model, which lets users take advantage of Dremio’s enterprise edition features on an hourly basis.
Dremio 4.7 adds the ability to attach EC2 tags to engines. This new feature applies custom tags to all EC2 resources created when engines are dynamically deployed. Custom tags are useful for many purposes, including internal billing and automatic resource deletion for cost control.
Disable Cross Source Select
A key security requirement when analyzing data from multiple domains (clients, data zones, etc.) is to be able to prevent mixing data from these domains. For example, analysts who have access to sales data from Client A and sales data from Client B should not be able to link or join those two datasets.
Dremio 4.7 introduces a new feature that allows users to disable cross source select to prevent mixing data from different sources. When the feature is enabled, a single SQL statement can only access data from a single source, even when the query goes through virtual datasets (VDS). If a user tries to execute a query that spans multiple sources, Dremio alerts the user and prevents the query from being executed.
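As an illustration, with the feature enabled a statement like the hypothetical one below would be blocked, because it joins tables that live in two different sources (the source and table names here are made up):

```sql
-- Hypothetical sources "clientA" and "clientB". With cross source select
-- disabled, this statement is rejected because it touches two sources.
SELECT a.order_id, b.order_id
FROM clientA.sales.orders AS a
JOIN clientB.sales.orders AS b
  ON a.customer_ref = b.customer_ref;
```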
Improved Helm Charts for Kubernetes, AKS and EKS Deployments
Dremio 4.7 includes multiple improvements to the Helm charts used in Kubernetes, AKS and EKS deployments. These improvements include:
- Support for multiple engines
- Configuration options for Hive 2 and 3
- Ability to specify a static IP address when enabling load balancers
- Additional support for tolerations and node selectors
- Support for AWS EC2 metadata authentication for EKS (not K8s service account based)
- Improved memory usage calculations for coordinator and executor nodes
- Improved ADLS Gen1 distributed storage configuration
- Ability to specify additional configuration parameters for distributed storage via values.yaml (see the sketch below)
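To give a sense of how some of these options surface in the chart, here is a minimal, illustrative values.yaml excerpt covering node selectors, tolerations, and extra distributed storage properties. The key names reflect our understanding of the dremio-cloud-tools chart and may differ between chart versions, so treat them as assumptions and verify against the chart’s own values.yaml.

```yaml
# Illustrative excerpt only; verify key names against your chart's values.yaml.
coordinator:
  nodeSelector:
    workload: dremio             # hypothetical node label
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "dremio"
      effect: "NoSchedule"

distStorage:
  type: aws                      # S3-backed distributed storage
  aws:
    bucketName: my-dremio-bucket # hypothetical bucket
    path: "/dremio"
    authentication: metadata     # EC2 metadata authentication on EKS
    extraProperties: |           # additional core-site.xml properties
      <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.us-east-1.amazonaws.com</value>
      </property>
```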
Learn More
We are very excited about this release and its capabilities. For a complete list of additional new features, enhancements, changes and fixes, please review the release notes. As always, we look forward to your feedback. Please post any questions or comments on our community site.