September 29, 2022
Apache Iceberg and the Right to Be Forgotten
Senior Tech Evangelist, Dremio
Time travel is a fantastic feature you get when you adopt a data lakehouse table format like Apache Iceberg, but it can sometimes run contrary to regulatory requirements such as the GDPR (General Data Protection Regulation) and the CCPA (California Consumer Privacy Act). The same is true of other systems that offer time travel, such as Snowflake.
This is an issue because regulations often require you to ensure that you no longer have access to protected consumer information. Typically you could comply by simply deleting the data in question, but in systems that provide time travel, deletes are soft: the data still exists and can be queried at a previous point in time.
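For instance, here is a minimal sketch of the problem, assuming Spark 3.3 or later with an Iceberg catalog named catalog and a hypothetical users table (the id and timestamp are placeholders):

-- Deleting a user's record only creates a new snapshot
DELETE FROM catalog.db.users WHERE id = 42;

-- The current snapshot no longer shows the row...
SELECT * FROM catalog.db.users WHERE id = 42;

-- ...but a time-travel query against an older snapshot still does
SELECT * FROM catalog.db.users TIMESTAMP AS OF '2022-06-01 00:00:00'
WHERE id = 42;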
The Apache Iceberg table format, however, offers several ways to manage tables so you can comply with these regulations and guarantee that data that needs to be forgotten truly is forgotten.
Method #1: Hard Deletes with Copy-on-Write
With Copy-on-Write, managing hard deletes is straightforward. It just takes one step:
- Expire snapshots older than X days.
Since the data files are rewritten on every update or delete, newer snapshots refer to files that no longer contain the deleted records. Expiring snapshots older than your desired retention period therefore cleans up the data files holding the deleted records, because no remaining snapshot references them.
-- if your table uses copy-on-write
ALTER TABLE catalog.db.users SET TBLPROPERTIES (
  'write.delete.mode'='copy-on-write',
  'write.update.mode'='copy-on-write',
  'write.merge.mode'='copy-on-write'
);

-- To hard delete files, just expire snapshots prior
-- to the deletion of records
CALL catalog.system.expire_snapshots(
  table => 'db.users',
  older_than => TIMESTAMP '2022-06-30 00:00:00.000',
  retain_last => 10
);
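After the expiration runs, one way to sanity-check the result is to query Iceberg's snapshots metadata table (a quick sketch against the same hypothetical users table):

-- Only the retained snapshots should appear here; anything missing can
-- no longer be reached with time travel, and data files referenced only
-- by expired snapshots have been physically removed
SELECT committed_at, snapshot_id, operation
FROM catalog.db.users.snapshots
ORDER BY committed_at;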
Method #2: Hard Deletes with Merge-on-Read
Merge-on-Read is a little more complicated: even snapshots created after records are deleted may still reference the data files containing those records, because the deletions are tracked in separate delete files that are reconciled at read time. Expiring snapshots alone may therefore not remove the data files you need gone. The proper flow to ensure hard deletes is:
- Run compaction to reconcile all delete files into new data files periodically.
- Expire snapshots older than X days.
Once all snapshots from before the compaction job have been expired, the original data files containing the deleted records are no longer referenced by any snapshot, so they can now be deleted.
-- if your table uses merge-on-read
ALTER TABLE catalog.db.users SET TBLPROPERTIES (
  'write.delete.mode'='merge-on-read',
  'write.update.mode'='merge-on-read',
  'write.merge.mode'='merge-on-read'
);

-- Run compaction to eliminate delete files
CALL catalog.system.rewrite_data_files(
  table => 'db.users',
  strategy => 'sort',
  sort_order => 'zorder(role, state)'
);

-- To hard delete files, expire all prior snapshots
CALL catalog.system.expire_snapshots(
  table => 'db.users',
  older_than => TIMESTAMP '2022-06-30 00:00:00.000',
  retain_last => 1
);
That said, if you run compaction regularly, you can settle on a cadence that deletes files when they need to be gone while still leaving some snapshots available for time travel.
For example, assume you run weekly compaction jobs and are required to hard delete data within 30 days of a deletion request. At the end of each month, you can expire all snapshots from before the most recent four compaction jobs, which still leaves you with roughly a month's worth of snapshots for time travel while hard deleting the files that must be gone.
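As a rough sketch of that monthly job (the table name, timestamp, and retain_last value are placeholders you would tune to your own retention policy):

-- Run at the end of each month: expire everything older than 30 days,
-- keeping at least the snapshots written since the recent compaction runs
CALL catalog.system.expire_snapshots(
  table => 'db.users',
  older_than => TIMESTAMP '2022-08-30 00:00:00.000',
  retain_last => 4
);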
Method #3: Use Key Rotation
Another option that can complement the previous methods or be used on its own is to control access to the files using key rotation.
This means using different keys to encrypt your files depending on when and where they are stored. For example, let's say you rotate the encryption key each month. If you want to eliminate access to data more than three months old, you can drop the older keys, rendering those older files unreadable even though they still physically exist.
Implementation differs depending on what vendor handles your file storage or how intricate you want the process to be. You may already be implementing some level of key rotation to comply with PCI DSS regulations.
Other Regulatory Best Practices
Here are some other considerations as you design your compliance plan:
- Don’t commingle protected data with unprotected data in the same table, so you don’t lose data unnecessarily when deleting a row of protected data (e.g., put email and phone numbers in a different table than data that denotes user application preferences or activity).
- Delete the data in batches. Instead of deleting as you go, set a flag on records that need to be deleted, then run a batch delete operation on all flagged records. The flag can take the form of an insert into a separate table that tracks delete requests, which also lets you record metadata about the deletes, such as the time of the request and the time of actual deletion. The benefit is that the deletes cluster into fewer snapshots, making it easier to predict which files will be cleaned up when expiring snapshots rather than having the deletes spread across many snapshots. One way this could look is sketched below.
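Here is a minimal sketch of that batching pattern in Spark SQL. The delete_requests table and its columns (user_id, requested_at, deleted_at) are hypothetical, as are the users table and the example values:

-- A separate Iceberg table that tracks right-to-be-forgotten requests
CREATE TABLE IF NOT EXISTS catalog.db.delete_requests (
  user_id      BIGINT,
  requested_at TIMESTAMP,
  deleted_at   TIMESTAMP
) USING iceberg;

-- Record a request instead of deleting immediately
INSERT INTO catalog.db.delete_requests
VALUES (42, TIMESTAMP '2022-09-15 10:00:00', NULL);

-- Periodic batch job: apply all pending requests in a single operation
-- so the deletes land in as few snapshots as possible
DELETE FROM catalog.db.users
WHERE id IN (SELECT user_id FROM catalog.db.delete_requests WHERE deleted_at IS NULL);

-- Mark the requests as fulfilled and record when the delete happened
UPDATE catalog.db.delete_requests
SET deleted_at = current_timestamp()
WHERE deleted_at IS NULL;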
Conclusion
Regulatory requirements can make data engineering more complex than it otherwise would be. Fortunately, whether you need hard deletes to occur on a Copy-on-Write or Merge-on-Read table or instead use a key rotation strategy, there are techniques to get all the benefits of time travel while still maintaining compliance.