October 29, 2024
Now in Private Preview: Dremio Lakehouse Catalog for Apache Iceberg
Principal Product Manager, Dremio
Virtually all companies understand the value of a data lakehouse architecture in supporting analytics workloads. Many of the world’s largest enterprises already use Dremio’s query engine and semantic layer as core building blocks for their own lakehouse architectures.
However, building a lakehouse architecture takes more than query engines to ingest, process, and analyze data. You also need a catalog that lets engines work with data in a safe, governed way. Although demand for Apache Iceberg catalogs has been unprecedented this year, some of the biggest hurdles companies face when building a lakehouse remain deploying, managing, and getting support for that catalog. In particular, customers who operate in regulated markets and need to run their own self-managed infrastructure have been limited to open-source catalogs that cannot provide the governance, table maintenance, and support their enterprise workloads require.
Today, we’re excited to bring the Dremio Lakehouse Catalog for Apache Iceberg into Dremio Software for the first time, making Dremio the easiest way for customers to build a data lakehouse on their own terms. This feature will be available in Private Preview in Dremio Software version 25.2.0.
Key Capabilities
In addition to Dremio’s query engine and semantic layer, customers now get a built-in lakehouse catalog with industry-standard authorization and automated table management when they deploy Dremio, delivering a full-stack lakehouse experience out of the box. Key features of the Dremio Lakehouse Catalog include:
- Openness and Interoperability: The Dremio Lakehouse Catalog is built on an open foundation and supports the Iceberg REST API, so you can read and write from it using any engine or framework compatible with the Iceberg REST API (see the connection sketch after this list).
- Industry-Standard Governance: Secure and track access to data using Role-Based Access Control (RBAC) privileges and a built-in commit log.
- Automated Table Maintenance: Automate data maintenance tasks like compaction and table vacuuming to optimize query performance and minimize storage costs.
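To make the interoperability point concrete, here is a minimal sketch of reading a table through the Iceberg REST API with PyIceberg. The endpoint URI, token, warehouse, and table names below are hypothetical placeholders rather than documented values; consult the Private Preview documentation for the actual connection details for your deployment.

```python
# Minimal sketch: connect to an Iceberg REST catalog with PyIceberg and
# read a sample of a table into Arrow. The URI, credential, warehouse,
# and table name are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "dremio",
    **{
        "type": "rest",
        "uri": "https://dremio.example.com/api/catalog",  # placeholder endpoint
        "token": "<personal-access-token>",               # placeholder credential
        "warehouse": "lakehouse",                         # placeholder warehouse name
    },
)

table = catalog.load_table("sales.orders")  # "namespace.table", placeholder name
sample = table.scan(limit=100).to_arrow()   # read a sample into a PyArrow table
print(sample.num_rows, sample.schema)
```

Because the catalog speaks the Iceberg REST protocol, the same connection pattern applies to Spark, Flink, or any other REST-compatible engine by supplying the equivalent catalog properties.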
Unify Governance for the Lakehouse
Access control mechanisms are fundamental to protecting sensitive information from unauthorized access, maintaining compliance with regulations, and ensuring that users have the appropriate level of access based on their roles and responsibilities within an organization.
The Dremio Lakehouse Catalog lets you regulate access to data with role-based access control (RBAC) policies that determine who can access specific objects and what actions they can perform on them. You can grant privileges on a per-user or per-role basis, and integrate with users and roles defined in an external identity provider.
For example, if you organize your data into Bronze, Silver, and Gold layers (with each layer representing an increasing level of refinement), you can implement RBAC policies that allow data analysts to read only from the Gold namespace (which contains refined, enriched views approved for reporting), while allowing data engineers to read and write data in all namespaces.
As another example, if you use folders/namespaces to organize data by department (e.g., Sales, Marketing, and Product), you can implement RBAC policies that allow Sales users to read only from the Sales folder, Marketing users to read only from the Marketing folder, and so on.
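To illustrate how such a policy surfaces to an engine outside Dremio, the sketch below assumes the catalog enforces these privileges for REST clients and that the analyst's token is authorized only for the Gold namespace. The namespace and table names are hypothetical, and the exact error returned for an unauthorized request depends on the server.

```python
# Sketch of server-side RBAC as seen by a REST client: the analyst token
# can read gold tables, while reads outside gold are rejected by the
# catalog. Names and the surfaced error are illustrative assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "dremio",
    **{
        "type": "rest",
        "uri": "https://dremio.example.com/api/catalog",  # placeholder endpoint
        "token": "<analyst-access-token>",                # token tied to the analyst role
    },
)

# Allowed: the analyst role holds read privileges on the gold namespace.
reporting = catalog.load_table("gold.customer_360").scan().to_arrow()

# Denied: bronze holds raw data the analyst role has no privileges on,
# so the catalog rejects the request (surfaced here as an exception).
try:
    catalog.load_table("bronze.raw_events")
except Exception as exc:
    print(f"access denied by catalog: {exc}")
```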
Automate Maintenance Operations for the Lakehouse
Query performance can degrade as you ingest and change data over time. For example, writing millions of small files during data ingestion jobs slows queries down because the engine has to open and read far more files when querying tables. You may also need to delete old data to meet compliance requirements.
The Dremio Lakehouse Catalog automates data maintenance tasks to optimize query performance and minimize storage costs. It eliminates the need to run manual data maintenance operations by automating the following tasks:
- Table optimization, which compacts small files into larger files.
- Table cleanup, which uses the VACUUM CATALOG command to delete expired snapshots and orphaned metadata files for Iceberg tables.
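To see why this matters, you can check how much maintenance a table has accumulated. The sketch below uses PyIceberg to count the data files a full scan would read and the snapshots kept in table history; the catalog and table names are placeholders, and connection properties are assumed to come from your local PyIceberg configuration.

```python
# Sketch: gauge how much maintenance a table needs by counting the data
# files a full scan would read and the snapshots retained in history.
# Catalog and table names are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("dremio")            # properties resolved from local PyIceberg config
table = catalog.load_table("sales.orders")  # placeholder table

scan_tasks = list(table.scan().plan_files())  # one task per data file to read
snapshots = table.history()                   # snapshot log entries

print(f"data files scanned by a full read: {len(scan_tasks)}")
print(f"snapshots in table history:        {len(snapshots)}")
# Many small files slow queries (compaction addresses this); a long
# snapshot history keeps old data files around (vacuuming removes them).
```

With the Dremio Lakehouse Catalog, both remedies run automatically, so these numbers stay manageable without scheduling maintenance jobs yourself.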
What's Next?
Many customers and prospects have been asking about this feature for a long time, so we’re excited to finally make this available to the world and greatly simplify the path to a lakehouse architecture, especially for those restricted to on-premises deployments. Over the next few months, we’ll be using your feedback to bolster the catalog and build in additional features as we move towards general availability.
To get access to the Private Preview, contact us or speak with your account team, and tune into the upcoming Gnarly Data Waves release webinar to learn more!