June 22, 2022
Announcing DML Support for Apache Iceberg
Today we’re excited to announce DML (Data Manipulation Language) support for Apache Iceberg! This allows you to mutate your data directly in your data lake from Dremio via standard SQL DML commands like INSERT, UPDATE, MERGE INTO, and DELETE. Together with the existing SELECT support for Iceberg tables, you can run any SQL statement on data in your data lakehouse with Dremio.
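For instance, a standalone INSERT into an Iceberg table is just standard SQL. As a minimal sketch (the table names here are hypothetical, not from the example later in this post), you could append the contents of one table to another:
INSERT INTO lakehouse.incidents SELECT * FROM lakehouse.new_incidents;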
This capability is available now in Dremio Cloud and will soon be available in the upcoming 22.0 release of Dremio software.
The full power of SQL, on an open lakehouse
Once a capability only available in data warehouses and other databases, data mutations are now possible directly on the data lake. With the ability to do DML directly on the data lake, it’s now easier than ever to run your workloads on an open lakehouse without having to deal with expensive and proprietary cloud data warehouses. You no longer need to copy your data into a closed data warehousing system to have the full power of SQL at your fingertips.
This DML capability is enabled by Apache Iceberg, a high-performance table format that solves challenges with traditional tables in data lakes. It’s rapidly becoming an industry standard for managing data in data lakes, establishing the foundation of an open lakehouse. Iceberg brings the reliability and simplicity of database tables to the lakehouse, while also enabling multiple engines to work together on the same data in a safe and transactionally consistent manner.
DML on the lakehouse in action
To demonstrate DML capability in Dremio, we will create two Iceberg tables, SF_incidents2016 and SF_incidents2016_stage, using the SF_incidents2016.json dataset included in the sample source in Dremio.
For this demonstration, let’s say that some of the data for December 31, 2016 has changed since it was originally loaded into the production SF_incidents2016 table, and the new values for this day have been loaded into SF_incidents2016_stage. We want to merge the new changes into the production SF_incidents2016 table.
You can follow along with this scenario by running these two CREATE TABLE AS SELECT statements:
CREATE TABLE S3.SF_incidents2016 AS SELECT * FROM Samples."samples.dremio.com"."SF_incidents2016.json";
CREATE TABLE S3.SF_incidents2016_stage AS
SELECT
  IncidntNum, Category, Descript, DayOfWeek, "Date", "Time", PdDistrict,
  CASE
    WHEN IncidntNum IN (161061373, 161061420, 161061464) THEN 'ARREST, BOOKED'
    ELSE Resolution
  END AS Resolution,
  Address, X, Y, Location, PdId
FROM Samples."samples.dremio.com"."SF_incidents2016.json"
WHERE "Date" = '2016-12-31';
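If you want to sanity-check the setup before merging (an optional step, not part of the walkthrough itself), you can confirm that the stage table carries the new Resolution values for the three changed incidents:
SELECT IncidntNum, Resolution FROM S3.SF_incidents2016_stage WHERE IncidntNum IN (161061373, 161061420, 161061464);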
Now that the tables are set up, let’s merge the updated values of the data for December 31, 2016 from the stage table into the production table.
The following MERGE INTO command can be run in Dremio to update the values of Resolution in SF_incidents2016 for rows where there’s a matching IncidntNum and PdId in the stage table, and INSERT non-matching records into SF_incidents2016 from the stage table:
MERGE INTO S3.SF_incidents2016 a
USING S3.SF_incidents2016_stage b
ON a.IncidntNum = b.IncidntNum AND a.PdId = b.PdId
WHEN MATCHED THEN
  UPDATE SET Resolution = b.Resolution
WHEN NOT MATCHED THEN
  INSERT VALUES (b.IncidntNum, b.Category, b.Descript, b.DayOfWeek, b."Date", b."Time", b.PdDistrict, b.Resolution, b.Address, b.X, b.Y, b.Location, b.PdId);
Now we can see that the values for the updated incidents are in the production table by running the following statement:
SELECT * FROM S3.SF_incidents2016 WHERE IncidntNum IN (161061373, 161061420, 161061464);
Now, let’s say that the court decided to expunge someone’s record, so we need to delete the record of the incident. We can run the following DELETE statement to delete that single record in the SF_incidents2016 table:
DELETE FROM S3.SF_incidents2016 WHERE IncidntNum = 161061549;
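To confirm the delete worked (another optional check, not part of the walkthrough itself), the same kind of lookup should now return no rows:
SELECT * FROM S3.SF_incidents2016 WHERE IncidntNum = 161061549;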
Now, let’s say that moving forward the application writing this data will change the values it writes: when an incident is resolved by arrest and booking, it will write ARREST & BOOKED instead of ARREST, BOOKED. To ensure all downstream reports don’t see both ARREST, BOOKED and ARREST & BOOKED, we need to update all the previous records to reflect this new value format. The following UPDATE statement does that:
UPDATE S3.SF_incidents2016 SET Resolution = 'ARREST & BOOKED' WHERE Resolution = 'ARREST, BOOKED';
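As a final optional check, a count of rows still using the old value should come back as zero after the update:
SELECT COUNT(*) FROM S3.SF_incidents2016 WHERE Resolution = 'ARREST, BOOKED';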
Run all your SQL workloads on an open lakehouse today
Dremio offers the ability to change data directly on the lakehouse, so you no longer need to copy your data into an expensive and proprietary data warehousing system to have the full power of SQL at your fingertips.
You can read more about this feature in the Dremio documentation.
Try it out now on Dremio Cloud and please reach out with any feedback to [email protected]!