4 minute read · December 14, 2023
Vectorized Reading of Parquet V2 Improves Performance Up To 75%
· Principal Product Manager, Dremio
· Manager, Software Engineering
· Senior Staff Software Engineer
We are thrilled to announce the release of an enhanced vectorized Parquet Reader in Dremio software version 24.3 and Dremio Cloud. This Dremio-exclusive reader improves query performance up to 75% for Parquet datasets encoded with the Parquet V2 encodings.
Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted by engines that write Parquet data, supports delta encodings. Dremio's vectorized Parquet reader did not previously support these encodings, so queries on such datasets ran more slowly. Now, in version 24.3 and Dremio Cloud, the Dremio SQL query engine reads these datasets with the vectorized reader and delivers best-in-class performance.
The Dremio vectorized Parquet reader now supports the following encodings in addition to PLAIN, RLE, and PLAIN_DICTIONARY:
- RLE_DICTIONARY
- DELTA_BINARY_PACKED
- DELTA_LENGTH_BYTE_ARRAY
- DELTA_BYTE_ARRAY
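To give a feel for why the delta encodings can shrink data, here is a minimal Python sketch of the two underlying ideas: storing differences between consecutive integers (the idea behind DELTA_BINARY_PACKED) and storing a shared-prefix length plus a suffix for each string (the idea behind DELTA_BYTE_ARRAY). It is a simplified illustration only. The real Parquet formats add block structure and bit-packing and operate on bytes, and the function names below are hypothetical, not part of Dremio or Parquet-MR.

```python
from typing import List, Tuple


def delta_encode(values: List[int]) -> Tuple[int, List[int]]:
    """Return the first value and the differences between consecutive values."""
    if not values:
        raise ValueError("values must not be empty")
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(first: int, deltas: List[int]) -> List[int]:
    """Rebuild the original values by accumulating the differences."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


def prefix_encode(strings: List[str]) -> List[Tuple[int, str]]:
    """Store, per string, the prefix length shared with the previous string and the suffix."""
    encoded = []
    previous = ""
    for s in strings:
        shared = 0
        limit = min(len(previous), len(s))
        while shared < limit and previous[shared] == s[shared]:
            shared += 1
        encoded.append((shared, s[shared:]))
        previous = s
    return encoded


def prefix_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild each string from the previous string's prefix and the stored suffix."""
    out = []
    previous = ""
    for shared, suffix in encoded:
        s = previous[:shared] + suffix
        out.append(s)
        previous = s
    return out


if __name__ == "__main__":
    # Hypothetical sample data: closely spaced timestamps and similar strings.
    timestamps = [1700000000, 1700000005, 1700000010, 1700000017]
    first, deltas = delta_encode(timestamps)
    print("bits for largest raw value:", max(timestamps).bit_length())
    print("bits for largest delta:", max(deltas).bit_length())
    assert delta_decode(first, deltas) == timestamps

    words = ["apple", "applet", "apply"]
    assert prefix_decode(prefix_encode(words)) == words
```

Sorted or slowly changing columns benefit most from this kind of encoding, because the differences and suffixes are much smaller than the raw values.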
Read Performance
In TPC-DS queries on a Parquet dataset written with V2 encodings, the new vectorized reader improved query performance by an average of 77% compared to the previous version. Previously, Dremio read Parquet datasets encoded in Parquet V2 with the Apache Parquet-MR row-wise reader.
NOTE: Dremio's vectorized reader already reads Parquet datasets encoded with Apache Parquet-MR writer version PARQUET_1_0, so this enhancement does not affect the performance of queries executed on such datasets.
Query Performance Improvements with 24.3
Parquet file type | Least Improvement | Highest Improvement | Average Improvement
With V2 encodings | 22.5% | 97.2% | 77.3%
Write Performance
For writing Parquet data, Dremio uses the Apache Parquet-MR Writer. Parquet-MR Writer version V2 reduced the storage footprint of TPC-DS data by an average of 25% compared to V1. A smaller storage footprint also lets more data fit in Dremio's proprietary Columnar Cloud Cache (C3). C3 enables the Dremio query engine to achieve NVMe-level I/O performance on S3/ADLS/GCS by leveraging the NVMe/SSD storage built into cloud compute instances, such as Amazon EC2 and Azure Virtual Machines.
Release 24.3 of Dremio continues to write Parquet V1 by default, because writing TPC-DS data with Parquet V2 instead of Parquet V1 degraded write performance by an average of 1.5% and query performance by an average of 6.5%. These query performance tests used the C3 cache to store data.
Storage Footprint & R/W Performance of Parquet V2 over V1 (Average)
Storage footprint | Write performance | Read Performance with C3
-24.8% | +1.5% | +6.4%
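The effect of a smaller footprint on cache capacity can be estimated with simple arithmetic. The Python sketch below is a rough illustration, assuming the reduction applies uniformly to all cached data; the function names and the 1000 GB cache size are hypothetical and not Dremio settings.

```python
def cache_capacity_multiplier(footprint_reduction_pct: float) -> float:
    """How much more data fits in the same cache after a given footprint reduction."""
    if not 0 <= footprint_reduction_pct < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 100.0 / (100.0 - footprint_reduction_pct)


def v1_equivalent_gb(cache_gb: float, footprint_reduction_pct: float) -> float:
    """Amount of data, measured at its V1 size, that fits in the cache when stored as V2."""
    return cache_gb * cache_capacity_multiplier(footprint_reduction_pct)


if __name__ == "__main__":
    reduction = 24.8  # average storage footprint reduction from the table above
    cache_gb = 1000.0  # hypothetical cache size, for illustration only
    print(f"capacity multiplier: {cache_capacity_multiplier(reduction):.2f}x")
    print(f"V1-equivalent data cached: {v1_equivalent_gb(cache_gb, reduction):.0f} GB")
```

Under these assumptions, a 24.8% smaller footprint lets the same cache hold roughly a third more data. Whether that offsets the measured write and read degradation depends on the workload.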
Guidance for Dremio Users:
Dremio users who query Parquet datasets encoded in Parquet V2 should upgrade to Dremio version 24.3 to benefit from these substantial performance improvements. (Dremio Cloud users can benefit from this capability now.)
Dremio writes data to Reflections and Iceberg tables in Parquet format. Writing with Parquet V2 reduced storage footprint by about 25% on average in the tests above and should also improve utilization of the Dremio-exclusive Columnar Cloud Cache (C3).
Users can enable Parquet V2 on write by setting the following configuration key:
ALTER SYSTEM SET "store.parquet.writer.version" = 'v2'