We are delighted to announce the release of new Intelligent Reflection Refresh. Now, Dremio will automatically select the fastest and most cost-effective Reflection refresh method based on table operations performed since the last refresh. Prior to this release, Dremio would refresh a Reflection anchored on an Apache Iceberg table from scratch; with this feature, Reflections in Dremio can keep up with frequent data ingestions and mutations happening on data lakehouse workloads at a significantly lower cost. This functionality is available for Reflections anchored on Iceberg tables. And, with Reflection refreshes that are up to 100x faster, customers can significantly decrease their compute expenses.

This capability is available now in Dremio Cloud and 24.2 release of Dremio software.

Incremental Reflection for Apache Iceberg

A Reflection is an optimized materialization of an “anchor” table or view that can be used to partially or wholly accelerate data processing. A Reflection may be aggregated, sorted and/or partitioned in various ways. The Dremio query engine automatically uses Reflections to accelerate queries by matching and substituting them in a query plan. When anchor tables are modified as a result of data ingestion or DML operations, the Reflection refresh procedure keeps the Reflections current.

Reflections are particularly useful for BI workloads because they address performance challenges for BI dashboards and reports (e.g., Tableau, Power BI, Looker). As a result, data teams no longer need to export data from the lakehouse into Tableau extracts or Power BI imports, or to wire up new ETL pipelines to precook tables in the lakehouse for BI use cases.

Dremio now includes a new Reflection refresh system that was built from the ground up for table formats such as Apache Iceberg. This new system delivers significant improvements in refresh performance by detecting transactions and incrementally updating the Reflections based on those transactions.

  • Snapshot-based incremental refresh. When Dremio detects that all operations
    performed on an anchor Iceberg table since the last refresh have been append-only,
    such as INSERT INTO, COPY INTO, the Reflection data is updated incrementally
    (including incremental aggregation calculations).
  • Partition-scoped incremental refresh. When Dremio detects DML operations that
    mutate data, such as UPDATE, DELETE, and MERGE, the Reflection data is refreshed
    only in the impacted partitions (as long as the Reflection and anchor table have common
    partition columns).

This blog demonstrates the two types of incremental Reflection refresh methods introduced in this release and also touches upon a new and exciting feature that enables Reflections to be partitioned using partition-transforms supported by Apache Iceberg.

Snapshot-based Incremental Reflection Refresh

Using Iceberg snapshots, Dremio can determine operations that were performed between the snapshot used for the previous refresh and the current snapshot. The Reflection refresh is incremental when the change is append-only. An OPTIMIZE TABLE operation which will not mutate actual data does not require a Reflection refresh and is not considered when choosing incremental refresh method.

You can read more about this feature in the documentation here.

Snapshot-based Incremental Reflection Refresh in Action

For demonstration purposes, we will create the S3.NYC_Taxis table and populate it with New York taxi data from January 1, 2013 to June 30, 2013. We want to get the total fares aggregated by pickup month and passenger count. We can write a SELECT query against the NYC_Taxis table, but the response time will vary depending on the size of the table. To achieve sub-second response times regardless of table size, we will create an aggregate view and a Reflection anchored on this view.

This Reflection will be built using the entire NYC_Taxis table for the first time: refresh from scratch. A Reflection will be available to accelerate queries as soon as it is refreshed.

CREATE TABLE S3.NYC_Taxis
PARTITION BY (month(pickup_datetime)) as
SELECT pickup_datetime, "passenger_count", fare_amount
FROM Samples."samples.dremio.com"."NYC-taxi-trips-iceberg"
WHERE pickup_datetime >= '2013-01-01' and pickup_datetime <= '2013-06-30';
CREATE OR REPLACE VIEW NYC_Taxis_view as
SELECT DATE_TRUNC('MONTH', pickup_datetime) as pickup_month, passenger_count,
SUM(fare_amount) as fare_amount
FROM S3.NYC_Taxis
GROUP BY 1, 2;
ALTER DATASET NYC_Taxis_view
CREATE RAW REFLECTION "fares_by_month"
USING DISPLAY (pickup_month, passenger_count, fare_amount)
PARTITION BY (pickup_month);
SELECT DATE_TRUNC('MONTH', pickup_datetime) as pickup_month, passenger_count,
SUM(fare_amount) as fare_amount
FROM S3.NYC_Taxis
GROUP BY 1, 2;
SELECT DATE_TRUNC('MONTH', pickup_datetime) as pickup_month, SUM(fare_amount) as fare_amount
FROM S3.NYC_Taxis
GROUP BY 1;

Two queries that aggregated data by month and month plus passenger_count against the NYC_Taxis table were accelerated by a Reflection created on a view NYC_Taxis_view and ran under a second. Thanks to Dremio's cutting-edge algebraic Reflection matching algorithm, a Reflection that was not anchored on S3.NYC_Taxis matched and accelerated two queries.

Now that we have demonstrated the value of Reflections, we will load more data into this table from July 1, 2013 to July 31,2013. Following the INSERT command, the Reflection refresh will now be incremental and will append data from July, 2013.

INSERT INTO S3."NYC_Taxis"
SELECT pickup_datetime, "passenger_count", fare_amount
FROM Samples."samples.dremio.com"."NYC-taxi-trips-iceberg"
WHERE pickup_datetime >= '2013-07-01' and pickup_datetime <= '2013-07-31';

For the purpose of this demonstration, we can validate that Dremio has used Snapshot-Based Incremental Refresh by examining the newly-added Refresh Decision section in the refresh job's job profile.

Partition-scoped Incremental Reflection Refresh

When both the Reflection and the anchor table are partitioned, and the partition of the anchor table is identical or contained in the Reflections partition for at least one partition column, Dremio performs a partition-scoped incremental refresh when the change is not append-only. In the following image, the table is partitioned by hour on a timestamp column, while the Reflection is partitioned by day on the same column. When a specific hour partition in an anchor table is modified, the day partition that contains the modified hour in the Reflection is recalculated.

You can read more about this feature in the documentation here.

Partition-scoped Reflection Refresh in Action

Now, we will delete data from the S3.NYC_Taxis table for the entire month of January 2013 and modify February 2023 records so that the fare_amount column contains the value 0.

DELETE FROM S3.NYC_Taxis
WHERE date_trunc('MONTH',"pickup_datetime") = '2013-01-01 00:00:00.000';
UPDATE S3.NYV_Taxis
SET "fare_amount" = 0
WHERE date_trunc('MONTH',"pickup_datetime") = '2013-02-01 00:00:00.000';

After updating, use the "Refresh Now" icon on the S3.NYC_Taxis anchor dataset to refresh the Reflection and confirm that above SELECT statements are accelerated and run under a second. In case of a Reflection on a view, Dremio will examine the expression used to compute the partition column and determine if it is identical or contains the partition of the anchor table. The Reflection refresh will be partition-scoped if the change in the anchor table is not append-only and the above condition is met. In our example, the refresh will reconstruct data for just the two partitions that have been updated.

You can verify that it is a partition-scoped incremental refresh by examining the newly introduced Refresh Decision section in the refresh job's job profile. Dremio is able to determine that the DATE_TRUNC('MONTH', pickup_datetime) expression used in the view definition is identical to the month transform used to partition the anchor table.

As a result, Dremio was able to refresh the Reflection incrementally by partition, despite the fact that the column names are different and the Reflection is not explicitly partitioned with a month transform.

For a full list of SQL functions that can be used to aggregate the data while maintaining compatibility with partition-scoped incremental refresh please refer to the documentation here.

Partition-Transforms in Reflections

We are pleased to announce that Dremio Reflections can now be partitioned using all partition-transforms supported by Apache Iceberg. This blog post, titled “Partition and File pruning for Dremio's Apache Iceberg-backed Reflections”, is an excellent resource for understanding how large Reflections should be designed using the “Partition by” and “Sort by” clauses to maximize query performance with partition and split pruning when used to accelerate queries.

The following aggregate Reflection is partitioned using a year transform on the pickup_month column and can refresh using partition-scoped incremental refresh as it contains the month transform used to partition the NYC_Taxis anchor table.

ALTER DATASET NYC_Taxis_view
CREATE AGGREGATE REFLECTION "passenger_by_month"
USING DIMENSIONS (pickup_month) MEASURES (passenger_count(MAX), passenger_count(MIN))
PARTITION BY (YEAR(pickup_month))

The user now has the option to select partition-transform when creating an aggregate Reflection using the Dremio user interface. In an upcoming release, Dremio will add automatic Reflection partition recommendations in the Dremio UI to assist users in choosing partition clauses for Reflections that will facilitate incremental refresh.

You can read more about the Partition-transform in Reflection feature in the documentation here.

Ready to Get Started?

Enable the business to create and consume data products powered by Apache Iceberg, accelerating AI and analytics initiatives and dramatically reducing costs.