November 1, 2024
Integrating Databricks’ Unity Catalog with On-Prem Hive/HDFS using Dremio
Senior Tech Evangelist, Dremio
As organizations increasingly adopt hybrid data architectures, they often face challenges in accessing and analyzing data stored across cloud and on-premises environments. Databricks’ Unity Catalog offers a unified metastore that centralizes data management for cloud-based Delta Lake tables, enabling streamlined access to cloud data. At the same time, many companies retain valuable data in on-premises systems like Hive and HDFS, containing historical and sensitive information essential for in-depth analysis.
Integrating Unity Catalog with on-prem Hive/HDFS data allows organizations to bridge the cloud and on-prem data divide, enabling powerful cross-environment analytics. Using Dremio to connect these environments simplifies the process, providing a platform where cloud and on-prem data can be seamlessly joined and analyzed. In this blog, we explore the benefits of integrating Unity Catalog and Hive/HDFS data through Dremio, walk through the setup process, and highlight the value of using Dremio reflections to accelerate cross-environment queries and analytics.
Benefits of Integrating Unity Catalog and Hive/HDFS with Dremio
Bringing together data from Unity Catalog and Hive/HDFS in Dremio provides several advantages for data teams. From simplifying access and reducing operational overhead to enabling faster, deeper analytics, this integration unlocks new capabilities for hybrid data environments.
Unified Data Access
Integrating Unity Catalog and Hive/HDFS in Dremio eliminates data silos, allowing users to access cloud and on-prem data in one environment without duplicating or migrating data. When cloud data in Unity Catalog and on-prem data in Hive/HDFS can be queried together, teams can perform richer analyses that span historical and real-time data. For example, customer engagement data from Unity Catalog can be joined with transaction history from Hive, allowing teams to analyze customer behavior holistically. With Dremio’s virtual datasets, users can create custom views across Unity Catalog and Hive/HDFS, providing flexible, cross-environment data access that fuels powerful analytics and business intelligence.
Cost and Performance Optimization
Dremio’s reflections feature helps optimize query performance by creating materialized representations of frequently accessed data, enabling fast querying without high data transfer costs. By configuring reflections on cross-environment views, data teams can reduce the load on both Unity Catalog and Hive/HDFS sources, minimizing the need for repeated queries to each source. This reduces cloud data transfer costs and significantly accelerates query performance, making it feasible to analyze more extensive datasets and support more demanding analytical workloads across both cloud and on-premises data environments.
Setting Up Unity Catalog and Hive/HDFS Connections in Dremio
The first step in integrating cloud data from Unity Catalog with on-prem data in Hive/HDFS is configuring each source in Dremio. This setup provides seamless access to both environments, enabling hybrid data queries and analytics within a single platform. Below are the configuration steps for Unity Catalog and Hive/HDFS sources in Dremio.
Unity Catalog Setup
Connecting Databricks’ Unity Catalog to Dremio allows you to access Delta Lake tables through UniForm, which generates Iceberg metadata for those tables so that Iceberg-compatible clients such as Dremio can read them (read-only).
- Add Unity Catalog as a Source
- Navigate to the Datasets page in Dremio, go to the Sources panel, and click on the Add Source icon.
- In the Add Data Source dialog, select Unity Catalog under the Metastores section.
- General Configuration
- Name: Choose a unique name for the Unity Catalog source, like “UnityCatalog_CloudData” (no special characters).
- Catalog Name: Enter the name of the Unity Catalog you want to connect.
- Endpoint URI: Provide the catalog service URI (consult the Databricks documentation to locate the Unity Catalog URI).
- Authentication
- Choose your preferred method for authenticating with Unity Catalog:
- Dremio: Store the Unity Catalog password in plain text within Dremio.
- Azure Key Vault or AWS Secrets Manager: Provide the URI (Azure) or ARN (AWS) for secure password storage.
- Databricks Personal Access Token (PAT): Generate a PAT for Databricks and provide it in Dremio to enable access. Be sure to create the PAT according to your deployment type (AWS or Azure).
- Advanced Options
- Asynchronous Access: Dremio allows asynchronous access and local caching for faster query performance. This is enabled by default and can be adjusted as needed.
- Storage Authentication: For S3 or Azure storage, configure storage credentials to ensure Dremio can securely access the Delta Lake tables in Unity Catalog. Specify the necessary authentication properties, such as access keys for S3 or account keys for Azure Storage.
By completing these steps, you’ll enable access to Unity Catalog data in Dremio, allowing it to be combined with on-premises Hive/HDFS data.
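If you want a quick sanity check before moving on, you can query the new source directly from Dremio’s SQL editor. The following is only a sketch: it assumes the example source name suggested above (UnityCatalog_CloudData) and a hypothetical sales.customer_activity table; substitute your own catalog, schema, and table names.

```sql
-- Sanity-check the Unity Catalog source from Dremio's SQL editor.
-- "UnityCatalog_CloudData" is the example source name from above;
-- "sales"."customer_activity" is a placeholder table.
SELECT *
FROM "UnityCatalog_CloudData"."sales"."customer_activity"
LIMIT 10;
```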
Hive/HDFS Setup
Adding Hive and HDFS sources to Dremio allows you to incorporate on-prem data, facilitating cross-environment analytics alongside Unity Catalog data.
- Add Hive/HDFS as a Source
- Go to the Datasets page in Dremio, navigate to the Sources panel, and click on the Add Source icon.
- Under Metastores, select Hive for Hive data, or under Object Storage, select HDFS for Hadoop data.
- General Configuration for Hive or HDFS
- Name: Enter a name for the source, like “Hive_OnPremData” or “HDFS_Storage”.
- Connection Settings: Specify connection parameters, such as the Hive Metastore host and port for Hive, or the NameNode host and port for HDFS.
- Enable Impersonation (Optional)
- If desired, enable impersonation to allow Dremio to run queries on behalf of specific users, ensuring access control based on Hive or HDFS permissions. This feature can help manage secure access to sensitive on-prem data.
- Advanced Options and Metadata Settings
- Define caching options and reflection refresh policies to optimize access speed and ensure data remains current.
- Configure metadata refresh intervals, which help manage the frequency of updates for top-level metadata (such as databases and tables) and dataset details.
After completing these configuration steps, you’ll have both Unity Catalog and Hive/HDFS data sources connected in Dremio, enabling you to run queries and join data across cloud and on-prem environments. With these sources in place, you can create curated cross-environment views and use Dremio’s reflections to boost query performance.
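As with Unity Catalog, a simple query confirms the on-prem source is reachable through Dremio. The source, database, and table names below (Hive_OnPremData, default, transactions) are placeholders for illustration.

```sql
-- Confirm the Hive source responds by counting rows in a sample table.
-- All names here are placeholders.
SELECT COUNT(*) AS row_count
FROM "Hive_OnPremData"."default"."transactions";
```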
With Unity Catalog and Hive/HDFS sources configured in Dremio, the next step is to build virtual datasets that combine data from both environments. Dremio’s virtual datasets provide a flexible way to join cloud and on-prem data, allowing you to create logical views without physically moving or duplicating data. These views streamline data exploration and make it easier for users to access relevant information for analytics.
Virtual Datasets and Views in Dremio
Dremio’s virtual datasets act as SQL-based views that can combine data from Unity Catalog and Hive/HDFS sources into a unified, queryable format. These virtual datasets allow teams to perform seamless cross-environment joins and create reusable views for analytics.
- Creating Cross-Environment Views:
To create a virtual dataset that joins Unity Catalog and Hive/HDFS data, use Dremio’s SQL editor to write queries that perform JOIN operations across sources. For example, you might create a view that joins recent transaction data from Unity Catalog with customer details in Hive, providing a single, enriched dataset for customer behavior analysis (a SQL sketch of such a view follows this list).
- Flexibility of Virtual Datasets:
Dremio’s virtual datasets are dynamic and can be easily updated to reflect changing data needs. Analysts can add or remove fields from views as requirements evolve, enabling an agile approach to data exploration. Additionally, because virtual datasets are logical representations, any changes to the underlying data structure (e.g., table schema updates) don’t disrupt the view, keeping downstream applications and reports intact.
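To make this concrete, here is a minimal sketch of a cross-environment view. The space, source, table, and column names are all hypothetical; adapt them to your own datasets.

```sql
-- Hypothetical view joining cloud activity data (Unity Catalog)
-- with on-prem transaction history (Hive) on customer_id.
CREATE VIEW "AnalyticsSpace"."customer_360" AS
SELECT
  a.customer_id,
  a.last_activity_ts,    -- recent engagement from Unity Catalog
  t.order_id,
  t.order_total,         -- historical transactions from Hive
  t.order_date
FROM "UnityCatalog_CloudData"."sales"."customer_activity" a
JOIN "Hive_OnPremData"."default"."transactions" t
  ON a.customer_id = t.customer_id;
```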
Data Curation Strategy: Raw, Clean, and Gold Layers
To organize data effectively, Dremio enables the creation of curated layers that serve different stages of data processing. This structure makes it easier for teams to access and analyze data in a logical flow, from raw data to business-ready insights.
- Raw/Bronze Layer:
The raw layer contains unprocessed data directly from Unity Catalog and Hive/HDFS sources. This layer provides an untouched view of the data, allowing users to access the original records for auditing or exploratory purposes. Organizing data in the raw layer preserves the source structure, making it easy to refer back to the initial dataset as needed.
- Clean/Business/Silver Layer:
In the clean layer, data is transformed to harmonize fields across cloud and on-prem sources, ensuring consistency. This layer is where data cleaning and standardization take place. For example, date formats from Unity Catalog and Hive can be standardized, and naming conventions can be aligned. This intermediate step prepares the data for more complex transformations, making it easier to query and reducing the chances of inconsistencies in analysis.
- Gold/Application Layer:
The gold layer represents the curated, business-ready datasets that are optimized for reporting and analytics. This is where data is fully transformed, aggregated, and organized in a way that aligns with business objectives. In the gold layer, you might create views that aggregate sales by customer segments, combining data from both Unity Catalog and Hive to support strategic decisions. This layer is typically accelerated with Dremio’s reflections, which store materialized views to enhance query performance (the clean and gold layers are sketched in SQL after this list).
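Below is a minimal sketch of how the clean and gold layers might be expressed as Dremio views. It assumes the hypothetical sources from earlier plus a space named AnalyticsSpace with clean and gold folders already created; every table and column name is illustrative.

```sql
-- Clean/silver layer: harmonize types and naming across sources.
CREATE VIEW "AnalyticsSpace"."clean"."orders_std" AS
SELECT
  CAST(o.order_date AS DATE)   AS order_date,       -- align date formats
  UPPER(c.customer_segment)    AS customer_segment, -- align naming conventions
  o.order_total
FROM "Hive_OnPremData"."default"."orders" o
JOIN "UnityCatalog_CloudData"."crm"."customers" c
  ON o.customer_id = c.customer_id;

-- Gold layer: business-ready aggregate for reporting.
CREATE VIEW "AnalyticsSpace"."gold"."sales_by_segment" AS
SELECT
  customer_segment,
  DATE_TRUNC('MONTH', order_date) AS sales_month,
  SUM(order_total)                AS total_sales
FROM "AnalyticsSpace"."clean"."orders_std"
GROUP BY customer_segment, DATE_TRUNC('MONTH', order_date);
```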
Curating data into these layers not only streamlines data workflows but also provides clarity on data usage. Users can access the layer most relevant to their needs, whether it’s raw data for exploration, clean data for analysis, or gold data for reporting, making Dremio a powerful platform for hybrid data management. With these cross-environment views in place, the next step is to leverage Dremio’s reflections to optimize query performance and ensure efficient access to these curated datasets.
Accelerating Cross-Environment Queries with Dremio Reflections
Once you’ve created curated views across Unity Catalog and Hive/HDFS, using Dremio’s reflections can significantly enhance query performance. Reflections are materialized representations of frequently queried datasets that Dremio can automatically leverage to accelerate queries, making cross-environment analytics efficient and responsive.
Types of Reflections and Their Use Cases
Dremio provides two main types of reflections — raw and aggregation — each suited to different types of queries and analytics needs:
- Raw Reflections:
Raw reflections capture the underlying data of a table or view, optimizing queries that access a subset of columns or rows. This reflection type is beneficial for scenarios where users repeatedly query large datasets with filters. For example, if a dataset joins Unity Catalog’s sales data with Hive’s customer data, a raw reflection can speed up customer-specific queries, ensuring that frequently accessed columns and rows are readily available without having to scan the full source data (both reflection types are sketched in SQL after this list).
- Aggregation Reflections:
Aggregation reflections precompute summary metrics, such as averages or totals, across specific fields. These are ideal for BI-style queries involving GROUP BY statements or aggregation functions. For instance, if you’re analyzing monthly sales across regions, an aggregation reflection can store precomputed summaries, enabling faster performance for reporting and dashboards.
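Reflections are typically configured from a dataset’s Reflections tab, but Dremio also exposes SQL commands for defining them. The sketch below assumes a hypothetical view named AnalyticsSpace.sales_with_customers and illustrative column names; the reflection DDL can vary across Dremio versions, so check the syntax against your release before using it.

```sql
-- Raw reflection: keep the columns used by customer-specific queries
-- readily available, sorted for fast filtering on customer_id.
ALTER DATASET "AnalyticsSpace"."sales_with_customers"
CREATE RAW REFLECTION sales_customers_raw
USING
DISPLAY (customer_id, region, order_date, order_total)
LOCALSORT BY (customer_id);

-- Aggregation reflection: precompute sales totals by region and date
-- for BI-style GROUP BY queries.
ALTER DATASET "AnalyticsSpace"."sales_with_customers"
CREATE AGGREGATE REFLECTION sales_by_region_agg
USING
DIMENSIONS (region, order_date)
MEASURES (order_total (SUM, COUNT));
```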
Implementing Reflections for the Gold Layer
Reflections are especially useful for the gold layer, where business-ready datasets are frequently accessed for analytics and reporting. By applying reflections to gold-layer views, Dremio can serve up results quickly without repeatedly querying Unity Catalog or Hive/HDFS.
- Setting Up a Reflection:
To set up a reflection on a virtual dataset in Dremio, navigate to the Reflections tab for the dataset. Choose the reflection type (raw or aggregation) and configure the fields and filters based on your analysis needs. You can specify columns to be partitioned or sorted, which helps Dremio optimize queries that filter or sort data on these fields.
- Example Use Case:
Suppose you have a gold-layer view that combines Unity Catalog’s customer data with transactional data from Hive. This view might be used to analyze customer spending across different segments. Creating an aggregation reflection with “customer segment” as a dimension and “total spending” as a measure allows Dremio to bypass the need for on-the-fly calculations, accelerating reports that require these aggregated insights (a SQL sketch of this reflection follows the list).
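For the example above, an aggregation reflection on the gold-layer view might look like the following sketch. The view and column names are hypothetical, and the DDL should be verified against your Dremio version.

```sql
-- Aggregation reflection for the hypothetical gold-layer view:
-- "customer segment" as a dimension, "total spending" as a measure.
ALTER DATASET "AnalyticsSpace"."gold"."customer_spending"
CREATE AGGREGATE REFLECTION spending_by_segment_agg
USING
DIMENSIONS (customer_segment)
MEASURES (total_spending (SUM));
```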
Reflection Refresh Policies: Incremental and Live Refresh
If you materialize your unified data as Apache Iceberg tables and then build your gold layer on top of them, you gain a couple of additional refresh options (a CTAS sketch follows the list below):
- Incremental Refresh:
Incremental refreshes update only new or modified data in a reflection, making it an efficient choice for data sources that are appended regularly. For example, a transactional dataset that adds daily sales data can benefit from an incremental refresh, ensuring that the reflection remains current without reprocessing the entire dataset.
- Live Refresh:
With live refresh, reflections automatically sync with the underlying data in real time, which is ideal for views that power live dashboards or real-time reporting. For instance, if your organization has a dashboard that displays real-time revenue from Unity Catalog and Hive data, configuring a live refresh ensures that users always see the latest numbers without needing to manually update the view.
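One way to get the gold layer onto Iceberg is to materialize the unified view as a table with CTAS in a source where Dremio writes Iceberg output (recent Dremio versions default to Iceberg for CTAS in supported sources), then build the gold views and their reflections on that table. The paths below are placeholders, and the refresh policy itself is set in the reflection’s refresh settings.

```sql
-- Materialize the unified data as a table Dremio manages
-- (written as Apache Iceberg in supported sources/versions).
CREATE TABLE "lakehouse"."curated"."customer_360_tbl" AS
SELECT * FROM "AnalyticsSpace"."customer_360";

-- Gold-layer view built on the Iceberg table; reflections on this view
-- can then use incremental or live refresh policies.
CREATE VIEW "AnalyticsSpace"."gold"."customer_360_live" AS
SELECT * FROM "lakehouse"."curated"."customer_360_tbl";
```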
With live and incremental reflections, Dremio can deliver fast, accurate queries across combined Unity Catalog and Hive/HDFS data. Reflections improve query speed and reduce the workload on cloud and on-prem sources, making it easier to work with large datasets in a hybrid environment. With these optimizations, teams can focus on deriving insights without waiting for queries to complete, maximizing productivity and value from their data.
Practical Use Case Scenarios
Let's explore some real-world scenarios to better understand how integrating Unity Catalog with Hive/HDFS in Dremio enhances hybrid data workflows. These examples demonstrate how Dremio’s cross-environment capabilities and reflections streamline data processing, enabling faster, more comprehensive analysis across cloud and on-premises data.
Scenario 1: Comprehensive Customer Analysis
In an e-commerce setting, customer activity data may be stored in Unity Catalog, while historical transaction data resides in Hive on-premises. Combining these datasets enables a unified view of customer behavior across both recent and historical touchpoints.
- Objective:
Generate a 360-degree view of customer engagement by joining recent activity data from Unity Catalog with historical purchases in Hive.
- Solution with Dremio:
Use Dremio’s SQL editor to create a virtual dataset that joins Unity Catalog’s customer activity data with Hive’s transaction records by customer ID. This combined view can then be curated in the gold layer for easier access.
- Benefit of Reflections:
By applying a raw reflection on this view, queries that segment customer data by recent activity or purchasing history are accelerated, ensuring faster insights for marketing and sales teams without repetitive access to the full source data (a minimal SQL sketch follows this list).
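A minimal sketch of that reflection, assuming the hypothetical customer_360 view from earlier and illustrative column names (verify the reflection DDL against your Dremio version):

```sql
-- Raw reflection to accelerate segmentation queries on the combined view.
ALTER DATASET "AnalyticsSpace"."customer_360"
CREATE RAW REFLECTION customer_360_raw
USING
DISPLAY (customer_id, last_activity_ts, order_date, order_total)
LOCALSORT BY (customer_id);
```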
Scenario 2: Real-Time Revenue Tracking
Many businesses need to monitor sales performance in near real-time, combining recent transactions from Unity Catalog with historical revenue data in Hive for a comprehensive, up-to-the-minute view.
- Objective:
Create a real-time revenue dashboard that tracks sales by region, product, or time period, integrating current and historical data across environments.
- Solution with Dremio:
In Dremio, build a virtual dataset that joins recent sales transactions from Unity Catalog with Hive’s revenue history. This view can be structured in the gold layer to provide quick access to critical metrics.
- Benefit of Reflections:
Implementing an aggregation reflection on this view, with dimensions such as “region” and “product category” and a measure like “total sales,” precomputes revenue summaries. Configuring a live refresh policy ensures the dashboard reflects the latest data in near real-time, enabling the sales team to make informed decisions based on current performance (a SQL sketch follows this list).
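Sketched as SQL, with a hypothetical view and column names and DDL that should be checked against your Dremio version:

```sql
-- Aggregation reflection precomputing total sales by region and product
-- category; pair it with a live refresh policy for near real-time dashboards.
ALTER DATASET "AnalyticsSpace"."gold"."revenue_unified"
CREATE AGGREGATE REFLECTION revenue_by_region_agg
USING
DIMENSIONS (region, product_category)
MEASURES (total_sales (SUM));
```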
Scenario 3: Demand Forecasting for Inventory Management
Retailers and manufacturers often need to integrate inventory data stored on-premises in Hive with recent sales data in Unity Catalog to forecast demand and manage stock levels accurately.
- Objective:
Forecast product demand by combining recent sales data from Unity Catalog with historical inventory records in Hive, providing insights into demand patterns and inventory needs.
- Solution with Dremio:
Create a virtual dataset in Dremio that joins Hive’s historical inventory data with Unity Catalog’s sales records. Aggregate the data by product and date to identify demand trends over time.
- Benefit of Reflections:
Applying a raw reflection on this view, with partitions by “product_id” and “date,” enables fast retrieval of the most relevant data for forecasting. By setting an incremental refresh policy, the reflection stays current with minimal overhead, allowing teams to quickly update forecasts as new sales data flows in (sketched in SQL after this list).
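A sketch of that reflection, again with hypothetical view and column names and DDL that should be checked against your Dremio version:

```sql
-- Raw reflection partitioned by product_id and date for fast retrieval of
-- forecasting inputs; use an incremental refresh policy to pick up new sales.
ALTER DATASET "AnalyticsSpace"."demand_forecast_input"
CREATE RAW REFLECTION demand_forecast_raw
USING
DISPLAY (product_id, "date", units_sold, inventory_on_hand)
PARTITION BY (product_id, "date");
```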
These scenarios illustrate how Dremio’s reflections and virtual datasets make it easier to achieve performance and efficiency in a hybrid data setup. By joining Unity Catalog and Hive/HDFS data, Dremio empowers organizations to conduct high-value analyses that draw from both recent and historical data, accelerating the path to insights across diverse data sources.
Conclusion
Dremio’s integration with Unity Catalog and Hive/HDFS empowers organizations to harness the full potential of their hybrid data environments. By simplifying access, accelerating queries, and providing a robust platform for data curation, Dremio helps organizations build a unified, high-performance data architecture that supports faster, more informed business decisions.