63 minute read · September 15, 2024
Unifying Data Sources with Dremio to Power a Streamlit App
· Senior Tech Evangelist, Dremio
Businesses often face the daunting task of unifying data scattered across multiple platforms. Whether it's transactional data stored in PostgreSQL, customer preferences housed in MongoDB, or analytics data in a data warehouse like Snowflake, integrating these disparate sources into a cohesive dataset is a complex challenge. This fragmentation not only complicates data analytics but also hinders the development of robust applications that rely on consolidated information.
Overview of Data Unification Challenges
Complexity of Handling Multiple Data Sources
- Diverse Data Formats: Different databases and data stores often use varying data formats and schemas, making it difficult to standardize data for analysis.
- Multiple Connection Protocols: Each data source may require a unique connection setup, authentication method, and query language, increasing the complexity of the codebase.
- Data Silos: Isolated data sources can lead to data silos, where critical information is inaccessible to other systems or teams, impeding holistic analysis.
- Performance Bottlenecks: Joining data across platforms can be resource-intensive and slow, negatively impacting application performance and user experience.
The Need for Seamless Integration
To derive meaningful insights and make informed decisions, organizations require a unified view of their data. Seamless integration enables:
- Efficient Data Analytics: Streamlined access to all data sources accelerates analysis and reporting.
- Improved Decision-Making: A holistic view of data supports better strategic decisions.
- Enhanced Application Development: Simplified data management reduces development time and complexity, allowing for more focus on building features and user experiences.
Introducing Dremio and Streamlit
To address these challenges, we explore two powerful tools that simplify data unification and application development: Dremio and Streamlit.
Dremio: The Lakehouse Platform for Unified Analytics
What is Dremio?
Dremio is a data lakehouse platform that offers unified analytics by connecting to a wide array of data sources, all in one place. It bridges the gap between data lakes, data warehouses, and various databases, providing a single interface for querying and managing data.
Key Features:
- Unified Connectivity: Connects seamlessly with databases like PostgreSQL and MongoDB, data warehouses like Snowflake and Redshift, data lakes, and lakehouse catalogs like Nessie and Polaris.
- Powerful SQL Query Engine: Allows for federated queries across multiple data sources without the need to move or copy data.
- Semantic Layer: Enables the curation of business metrics and definitions to ensure consistency across the organization.
- Lakehouse Management: Automates optimization and cleanup tasks for data formats like Apache Iceberg, and supports catalog versioning for creating zero-copy environments and workload isolation.
Streamlit: Simplifying Data App Development
What is Streamlit?
Streamlit is an open-source Python library designed for creating intuitive and interactive web applications for machine learning and data science projects. It turns data scripts into shareable web apps in minutes.
Why Use Streamlit?
- Ease of Use: Build apps with minimal code using familiar Python syntax.
- Interactivity: Incorporate widgets to allow users to interact with data in real-time.
- Integration: Works seamlessly with popular data manipulation and visualization libraries like Pandas, NumPy, and Matplotlib.
- Rapid Development: Accelerates the development process, allowing data scientists and analysts to focus on insights rather than web development intricacies.
By leveraging Dremio's unified analytics capabilities and Streamlit's simplicity in app development, we can overcome the challenges of data unification. In the following sections, we'll delve into how to join data from MongoDB and PostgreSQL using Dremio and build a Streamlit app that taps into this unified dataset—all without the need to manage multiple database connections within our Python code.
Introduction to Streamlit
The ability to quickly prototype and deploy interactive applications is invaluable. Streamlit is a standout in this space, offering a straightforward way for data scientists and developers to turn data scripts into shareable web apps with minimal effort.
Key Features of Streamlit
- Ease of Use: With its intuitive API, you can create web apps by writing Python scripts as you normally would. Streamlit automatically handles the UI elements.
- Real-Time Interactivity: Streamlit apps update in real-time as users interact with widgets like sliders, text inputs, and buttons.
- Integration with Data Libraries: Seamlessly integrates with popular Python libraries such as Pandas, NumPy, Matplotlib, Plotly, and more.
- Rapid Prototyping: Ideal for quickly prototyping ideas and sharing them with others without extensive frontend development knowledge.
Why Use Streamlit?
Simplifies Development of Data-Driven Applications
Traditionally, building a web application requires knowledge of web frameworks and front-end development. Streamlit abstracts these complexities, allowing you to focus on the data and logic rather than the intricacies of web development.
- No Web Development Skills Required: Write apps entirely in Python without dealing with HTML, CSS, or JavaScript.
- Fast Development Cycle: Make changes to your script, save, and see the results immediately in the browser.
- Collaboration-Friendly: Easy to share your app with colleagues or deploy it for broader access.
Enhances Data Visualization and Interaction
Streamlit makes it effortless to visualize data and build interactive dashboards (see the short sketch after the list below).
- Dynamic Visualizations: Create charts and graphs that update based on user input.
- Customizable Widgets: Use built-in widgets to allow users to filter data, select options, and control parameters.
- Responsive Layouts: Automatically adjusts the layout for different screen sizes and devices.
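To make this concrete, here is a minimal, hypothetical sketch of a dynamic visualization: a slider widget controls how much data is plotted, and the chart re-renders whenever the value changes. The data is randomly generated purely for illustration.

```python
import numpy as np
import pandas as pd
import streamlit as st

# A widget whose value drives the data shown below
points = st.slider("Number of points", min_value=10, max_value=500, value=100)

# Generate illustrative data sized by the slider value
data = pd.DataFrame({
    "step": np.arange(points),
    "value": np.random.randn(points).cumsum(),
})

# Streamlit reruns the script on every interaction, so the chart updates automatically
st.line_chart(data.set_index("step"))
```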
Supports Machine Learning and AI Workflows
Streamlit is particularly well-suited for machine learning practitioners.
- Model Exploration: Build interfaces to adjust model parameters and observe outcomes in real-time.
- Data Inspection: Interactively explore datasets to gain insights and identify patterns.
- Deployment of Models: Share trained models with stakeholders through an accessible web app.
How Streamlit Works
Streamlit watches your Python script for changes and reruns it from top to bottom whenever it detects a modification or user interaction.
Basic Structure of a Streamlit App
Here's a simple example to illustrate how a Streamlit app is structured:
```python
import streamlit as st
import pandas as pd

# Title of the app
st.title('Simple Streamlit App')

# Load data
data = pd.read_csv('data.csv')

# Display data
st.dataframe(data)

# User input
number = st.slider('Select a number', min_value=1, max_value=100)

# Display output
st.write(f'The selected number is {number}')
```
When you run this script using `streamlit run app.py`, it opens in your web browser and provides an interactive interface based on your code.
Running a Streamlit App
To run a Streamlit app:
- Install Streamlit: If you haven't already, install Streamlit using `pip install streamlit`.
- Execute the Script: Run the command `streamlit run your_script.py` in your terminal.
- View in Browser: The app will automatically open in a new browser tab.
Deployment Options
- Local Deployment: Ideal for development and testing on your local machine.
- Cloud Deployment: Deploy your app on cloud platforms like Heroku, AWS, or Streamlit Sharing for broader access.
- Docker Containers: Use Docker to containerize your app for consistent deployment environments.
By incorporating Streamlit into your workflow, you can significantly enhance the way you interact with your data and models. It transforms static scripts into engaging, interactive applications, making it easier to communicate insights and collaborate with others.
In the context of our exercise, Streamlit will serve as the front-end interface for our unified dataset. By connecting to Dremio's unified data source, we'll build an interactive app that allows users to explore and analyze data joined from MongoDB and PostgreSQL—all without the complexity of managing multiple database connections within our code.
In the next section, we'll delve into a practical exercise where we simulate a real-world scenario: joining data from MongoDB and PostgreSQL using Dremio and building a Streamlit app to interact with this data.
The Exercise: Joining MongoDB and PostgreSQL Data for a Streamlit App
In this section, we'll delve into a practical exercise that demonstrates how Dremio simplifies the process of unifying data from multiple sources. We'll explore how to join data from MongoDB and PostgreSQL—two fundamentally different databases—and use this unified dataset to build an interactive Streamlit application. This exercise highlights the efficiency gained by using Dremio to handle data integration, allowing us to focus on developing the app without worrying about complex database connections and data merging logic in our code.
Scenario Overview
Imagine a scenario where you have customer data spread across two different databases:
- PostgreSQL: Contains core customer information such as customer IDs, names, emails, and order history.
- MongoDB: Stores customer preferences, loyalty statuses, and other semi-structured data.
Your goal is to build a Streamlit app that presents a comprehensive view of each customer by joining data from both databases. This unified view will enable users to:
- View Customer Profiles: See personal information alongside preferences and loyalty status.
- Analyze Purchasing Behavior: Correlate order history with customer preferences.
- Enhance Marketing Strategies: Identify high-value customers based on combined data insights.
Data Details
- PostgreSQL Tables:
  - customers: Contains `customer_id`, `customer_name`, `email`, and `created_at`.
  - orders: Includes `order_id`, `order_date`, `amount`, and `customer_id` (foreign key to `customers`).
- MongoDB Collection:
  - customer_preferences: Stores documents with `customer_email`, `preference`, `loyalty_status`, and `last_updated`.
Traditional Approach vs. Dremio Approach
Traditional Approach: Managing Multiple Database Connections in Python
In a conventional setup without Dremio, you would need to:
- Establish Separate Connections: Write code to connect to both PostgreSQL and MongoDB using their respective drivers (e.g., `psycopg2` for PostgreSQL and `pymongo` for MongoDB), as sketched in the example after this list.
- Retrieve Data Individually: Execute queries to fetch data from each database.
- Data Transformation: Convert the retrieved data into compatible formats (e.g., Pandas DataFrames).
- Data Joining Logic: Implement logic to merge datasets, often requiring careful handling of data types and missing values.
- Error Handling: Manage exceptions and connection issues for both databases separately.
- Maintain Credentials and Configurations: Securely store and manage connection strings, usernames, and passwords for each database.
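To illustrate the overhead, here is a rough sketch of what the traditional approach might look like for this article's exercise data. It assumes the host port mappings and credentials from the Docker setup described later in this post; real code would also need error handling, connection pooling, and secure credential management.

```python
import pandas as pd
import psycopg2
from pymongo import MongoClient

# Connection 1: PostgreSQL (host port 5435 per the Docker Compose file shown later)
pg_conn = psycopg2.connect(
    host="localhost", port=5435, dbname="mydatabase", user="admin", password="password"
)
orders_df = pd.read_sql(
    """
    SELECT c.customer_id, c.customer_name, c.email, o.order_id, o.order_date, o.amount
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    """,
    pg_conn,
)
pg_conn.close()

# Connection 2: MongoDB, queried separately with its own driver
mongo = MongoClient("mongodb://admin:password@localhost:27017/")
prefs_df = pd.DataFrame(
    list(mongo["mydatabase"]["customer_preferences"].find({}, {"_id": 0}))
)
mongo.close()

# The join now lives in application code instead of a query engine
unified_df = orders_df.merge(
    prefs_df, left_on="email", right_on="customer_email", how="inner"
)
```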
Challenges with the Traditional Approach:
- Complexity: Increased code complexity due to handling multiple connections and data transformations.
- Maintenance Overhead: More code to maintain, test, and debug.
- Performance Issues: Potential inefficiencies in data retrieval and merging, especially with large datasets.
- Security Risks: Managing multiple sets of credentials increases the risk of misconfiguration or exposure.
Dremio Approach: Simplifying Data Unification
By leveraging Dremio, you can streamline the entire process:
- Single Connection Point: Connect to Dremio using its powerful SQL query engine.
- Unified Data Access: Dremio connects to both PostgreSQL and MongoDB, handling the data retrieval internally.
- Federated Queries: Write SQL queries in Dremio to join data across both databases seamlessly.
- Virtual Datasets: Create virtual datasets (views) in Dremio that represent the joined data, without physically moving or copying data.
- Simplified Application Code: In your Streamlit app, connect only to Dremio to access the unified dataset (see the brief sketch after this list).
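By contrast, here is a hedged sketch of the Dremio-only path: one authentication call, one Arrow Flight connection, one query against the unified view. It uses the same `dremio-simple-query` calls demonstrated in detail later in this article; the endpoints and credentials shown are placeholders for this exercise's Docker setup.

```python
from dremio_simple_query.connect import DremioConnection, get_token

# Authenticate once against Dremio (adjust host/credentials for your environment)
token = get_token(
    uri="http://localhost:9047/apiv2/login",
    payload={"userName": "your_dremio_username", "password": "your_dremio_password"},
)

# A single Arrow Flight connection replaces separate PostgreSQL and MongoDB clients
dremio = DremioConnection(token, "grpc://localhost:32010")

# The federated join lives in Dremio as a virtual dataset; the app just queries it
df = dremio.toPandas('SELECT * FROM "@username".customer_unified_view')
```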
Advantages of the Dremio Approach:
- Reduced Complexity: Eliminate the need to manage multiple database connections in your application code.
- Performance Optimization: Dremio's query engine optimizes data retrieval and joins, improving performance.
- Centralized Security: Manage access control and credentials centrally within Dremio.
- Scalability: Easily extend to additional data sources by configuring them in Dremio without altering application code.
- Consistent Data Schema: Ensure consistent data types and schemas through Dremio's semantic layer.
How Dremio Simplifies the Process
Data Source Configuration
In Dremio, you configure data sources for PostgreSQL and MongoDB:
- Add PostgreSQL Source: Specify connection details such as host, port, database name, username, and password.
- Add MongoDB Source: Provide connection information, including host, port, and authentication details.
Creating a Unified View
With data sources configured, you can:
- Explore Schemas: Browse the tables in PostgreSQL and collections in MongoDB directly from Dremio's UI.
- Write SQL Queries: Use Dremio's SQL editor to write queries that join data across these sources.
- Save as Virtual Dataset: Save this query as a virtual dataset in Dremio for easy reuse.
- Apply Transformations: Use Dremio's tools to further refine data, such as filtering, calculating new fields, or standardizing data types.
Accessing Unified Data in Streamlit
In your Streamlit app:
- Single Connection: Use the `dremio-simple-query` library to connect to Dremio.
- Query Unified Dataset: Fetch data from the virtual dataset you created.
- Build Interactive Features: Utilize Streamlit's components to display and interact with the data.
By handling the data unification within Dremio, you significantly simplify your application code. This approach allows you to focus on delivering insights and enhancing user experience rather than dealing with the intricacies of data integration.
In the next sections, we'll guide you through setting up the environment, configuring data sources in Dremio, and building the Streamlit app step by step. You'll see firsthand how Dremio streamlines data unification and how easy it is to build a powerful, interactive application using Streamlit.
Setting Up the Environment
To embark on this exercise, you'll need to set up a development environment that simulates a real-world data ecosystem with multiple data sources. We'll use Docker to containerize and manage our services, making it easier to replicate and maintain the environment. The repository `AlexMercedCoder/dremio-notebook` provides all the necessary configurations and seed data to get you started.
Forking and Cloning the Repository
First, you'll need to fork and clone the repository to your local machine:
Step 1: Fork the Repository
- Navigate to the Repository: Go to AlexMercedCoder/dremio-notebook on GitHub.
- Fork the Repository: Click on the Fork button at the top right corner of the page. This creates a copy of the repository under your GitHub account.
Step 2: Clone the Repository
Once you've forked the repository:
- Clone the Repository: Open your terminal and run the following command, replacing `your-username` with your GitHub username:

```bash
git clone https://github.com/your-username/dremio-notebook.git
```

- Navigate to the Directory:

```bash
cd dremio-notebook
```
Now, you have the repository cloned locally and are ready to set up the Docker environment.
Understanding the docker-compose.yml File
The `docker-compose.yml` file orchestrates multiple services required for this exercise. Let's break down the services and their configurations.
Overview of Services
The Docker Compose file defines the following services:
- Dremio: The Lakehouse Platform that will unify data sources.
- Spark: Provides a notebook server and computational engine.
- Minio: An S3-compatible object storage for data lakes.
- Nessie: A version control system for data lakes.
- PostgreSQL: Relational database holding customer data.
- MongoDB: NoSQL database storing customer preferences.
Detailed Breakdown
1. Dremio Service
```yaml
dremio:
  image: dremio/dremio-oss:latest
  container_name: dremio
  ports:
    - 9047:9047   # Dremio UI
    - 31010:31010 # Dremio JDBC/ODBC
    - 32010:32010 # Dremio Arrow Flight
    - 45678:45678 # Dremio internal
  environment:
    - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
  networks:
    - dremio-notebook
```
- Image: Uses the latest Community Edition Dremio image.
- Ports:
  - `9047`: Access the Dremio UI.
  - `32010`: Used for Arrow Flight connections (important for our `dremio-simple-query` library).
- Environment Variables: Configures Dremio's Java options to store reflections on the container.
- Network: Connected to the `dremio-notebook` network for inter-container communication.
2. Spark Service
```yaml
spark:
  platform: linux/x86_64
  image: alexmerced/spark35nb:latest
  ports:
    - 8080:8080           # Spark Master Web UI
    - 7077:7077           # Spark Master Port
    - 8081:8081           # Spark Worker Web UI
    - 4040-4045:4040-4045 # Spark Job UIs
    - 18080:18080         # Spark History Server
    - 8888:8888           # Jupyter Notebook
    - 8501:8501           # For Streamlit Apps
  environment:
    # Spark and AWS configurations
  volumes:
    - ./seed/notebook-seed:/workspace/seed-data
  container_name: spark
  entrypoint: >
    /bin/bash -c "
    pip install streamlit python-dotenv && \
    # Start Spark services
    /opt/spark/sbin/start-master.sh && \
    /opt/spark/sbin/start-worker.sh spark://$(hostname):7077 && \
    mkdir -p /tmp/spark-events && \
    start-history-server.sh && \
    # Start Jupyter Lab
    jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token='' --NotebookApp.password='' && \
    tail -f /dev/null
    "
  networks:
    - dremio-notebook
```
- Image: Custom image with Spark and Jupyter Lab.
- Ports:
  - `8888`: Access the Jupyter Notebook server.
  - `8501`: Serves our Streamlit app.
  - Other ports are for Spark services and UIs.
- Volumes: Mounts seed data into the container.
- Entry Point:
  - Installs Streamlit and python-dotenv.
  - Starts the Spark master and worker.
  - Launches Jupyter Lab.
- Network: Connected to `dremio-notebook`.
3. Minio Service (not needed for this exercise)
```yaml
minio:
  image: minio/minio
  container_name: minio
  environment:
    # Minio configurations
  ports:
    - "9000:9000" # Minio API
    - "9001:9001" # Minio Console
  volumes:
    - ./seed/minio-data:/minio-data
  entrypoint: >
    /bin/sh -c "
    minio server /data --console-address ':9001' & \
    sleep 5; \
    mc alias set myminio http://localhost:9000 admin password; \
    # Create buckets and upload seed data
    mc mb myminio/datalake; \
    mc mb myminio/datalakehouse; \
    mc mb myminio/warehouse; \
    mc mb myminio/seed; \
    mc cp /minio-data/* myminio/seed/; \
    tail -f /dev/null"
  networks:
    - dremio-notebook
```
- Purpose: Provides S3-compatible storage.
- Ports:
  - `9000`: Minio API.
  - `9001`: Minio Console for UI access.
- Volumes: Seeds data into Minio buckets.
- Entry Point:
  - Starts the Minio server.
  - Uses `mc` (the Minio Client) to create buckets and upload seed data.
4. Nessie Service (not needed for this exercise)
```yaml
nessie:
  image: projectnessie/nessie:latest
  container_name: nessie
  environment:
    # Nessie configurations
  ports:
    - "19120:19120" # Nessie API port
  networks:
    - dremio-notebook
```
- Purpose: Provides cataloging of Iceberg tables and catalog versioning.
- Ports:
  - `19120`: Nessie API.
- Configuration: Uses RocksDB for storage.
5. PostgreSQL Service
```yaml
postgres:
  image: postgres:13
  container_name: postgres
  environment:
    POSTGRES_USER: admin
    POSTGRES_PASSWORD: password
    POSTGRES_DB: mydatabase
  ports:
    - "5435:5432"
  volumes:
    - ./seed/postgres:/docker-entrypoint-initdb.d/
  networks:
    - dremio-notebook
```
- Purpose: Stores relational customer data.
- Ports:
  - `5435`: Mapped to PostgreSQL's default port `5432`.
- Volumes: Seeds initial data using SQL scripts placed in `./seed/postgres`.
- Environment Variables: Sets up the default user, password, and database.
6. MongoDB Service
```yaml
mongo:
  image: mongo:4.4
  container_name: mongo
  ports:
    - "27017:27017"
  volumes:
    - ./seed/mongo:/docker-entrypoint-initdb.d/
  environment:
    MONGO_INITDB_ROOT_USERNAME: admin
    MONGO_INITDB_ROOT_PASSWORD: password
  networks:
    - dremio-notebook
```
- Purpose: Holds customer preference data in a NoSQL format.
- Ports:
  - `27017`: Default MongoDB port.
- Volumes: Seeds initial data using a JavaScript file in `./seed/mongo`.
- Environment Variables: Configures the root username and password.
Network Configuration
All services are connected to a custom Docker network named `dremio-notebook`. This allows the containers to communicate with each other using their service names (e.g., `postgres`, `mongo`).
Seed Files for MongoDB and PostgreSQL
To simulate a realistic data environment, seed files are used to populate the databases with initial data upon container startup.
MongoDB Seed File (`./seed/mongo/init.js`)
Purpose: Inserts customer preference data into the `customer_preferences` collection.
Content:
```javascript
// Use the 'mydatabase' database
db = db.getSiblingDB('mydatabase');

// Create a collection named 'customer_preferences'
db.createCollection('customer_preferences');

// Insert data into the 'customer_preferences' collection
db.customer_preferences.insertMany([
  {
    customer_email: '[email protected]',
    preference: 'Electronics',
    loyalty_status: 'Gold',
    last_updated: new Date('2024-09-01')
  },
  {
    customer_email: '[email protected]',
    preference: 'Books',
    loyalty_status: 'Silver',
    last_updated: new Date('2024-09-02')
  },
  {
    customer_email: '[email protected]',
    preference: 'Clothing',
    loyalty_status: 'Bronze',
    last_updated: new Date('2024-09-03')
  }
]);
```
Data Schema:
- customer_email: String (acts as a key to join with PostgreSQL data).
- preference: String (customer's preferred category).
- loyalty_status: String (e.g., Gold, Silver, Bronze).
- last_updated: Date (timestamp of the last update).
PostgreSQL Seed File (`./seed/postgres/init.sql`)
Purpose: Creates tables and inserts customer and order data.
Content:
```sql
-- Create customers table
CREATE TABLE customers (
  customer_id SERIAL PRIMARY KEY,
  customer_name VARCHAR(255) NOT NULL,
  email VARCHAR(255) NOT NULL UNIQUE,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert customer data
INSERT INTO customers (customer_name, email) VALUES
  ('John Doe', '[email protected]'),
  ('Jane Smith', '[email protected]'),
  ('Bob Johnson', '[email protected]');

-- Create orders table
CREATE TABLE orders (
  order_id SERIAL PRIMARY KEY,
  order_date DATE NOT NULL,
  amount DECIMAL(10, 2) NOT NULL,
  customer_id INT REFERENCES customers(customer_id)
);

-- Insert order data
INSERT INTO orders (order_date, amount, customer_id) VALUES
  ('2024-09-01', 100.50, 1),
  ('2024-09-02', 150.75, 2),
  ('2024-09-03', 200.00, 3);
```
Data Schema:
- Customers Table:
  - customer_id: Serial primary key.
  - customer_name: String.
  - email: String (unique, used to join with MongoDB data).
  - created_at: Timestamp.
- Orders Table:
  - order_id: Serial primary key.
  - order_date: Date.
  - amount: Decimal.
  - customer_id: Integer (foreign key referencing `customers.customer_id`).
How the Seed Files Work
- Automatic Execution:
  - For PostgreSQL, any `.sql` files in `/docker-entrypoint-initdb.d/` are executed upon container startup.
  - For MongoDB, any `.js` files in the same directory are executed.
- Data Population:
  - This ensures that the databases are pre-populated with sample data, eliminating the need for manual data entry.
Running the Environment
Step 1: Build and Run the Containers
In your terminal, within the `dremio-notebook` directory, run:

```bash
docker-compose up -d
```

- The `-d` flag runs the containers in detached mode.
Step 2: Verify the Services
Ensure that all services are running:
```bash
docker-compose ps
```

You should see all the services (`dremio`, `spark`, `minio`, `nessie`, `postgres`, `mongo`) listed and running.
Step 3: Access the Services
- Dremio UI: Open your browser and navigate to `http://localhost:9047`.
- Jupyter Notebook: Accessible at `http://localhost:8888`.
- Minio Console: Visit `http://localhost:9001`.
- PostgreSQL: Can be accessed via `localhost:5435`.
- MongoDB: Available at `localhost:27017`.
By setting up this environment, you have all the necessary components to proceed with connecting the data sources to Dremio and building your Streamlit application. The Docker Compose setup abstracts the complexity of installing and configuring each service individually, allowing you to focus on the core objectives of the exercise.
In the next section, we'll guide you through connecting PostgreSQL and MongoDB to Dremio, joining the data, and creating a unified view that will serve as the backbone of our Streamlit app.
Connecting PostgreSQL and MongoDB to Dremio
Now that Dremio is running, the next step is to connect our data sources. When you first access Dremio at `localhost:9047`, you'll be prompted to set up a username and password; make a note of these for later.
Step 1: Add PostgreSQL as a Data Source
Steps:
- Log in to Dremio: If you're not already logged in, go to `http://localhost:9047` and log in.
- Navigate to Sources: Click on the Sources tab in the left-hand sidebar.
- Add Postgres as a Source:
  - Click + Source in the upper-right corner.
  - Select PostgreSQL from the list of available source types.
- Configure Postgres Source:
  - Name: Give your Postgres source a name (e.g., `postgres`, which matches how it's referenced in the queries later in this exercise).
  - Hostname: Enter `postgres` (the Docker container name).
  - Port: Enter `5432`.
  - Database: Enter `mydatabase` (the database name set in your Docker Compose environment variables).
  - Username: Enter `admin` (the Postgres user defined in your Docker Compose environment).
  - Password: Enter `password` (the Postgres password defined in your Docker Compose environment).
- Click Save: After filling in the details, click Save. You can now query data stored in your Postgres database from Dremio.
Step 2: Add MongoDB as a Data Source
MongoDB is a NoSQL database, and Dremio has native support for MongoDB connections.
Steps:
- Log in to Dremio: If you're not already logged in, go to `http://localhost:9047` and log in.
- Navigate to Sources: Click on the Sources tab in the left-hand sidebar.
- Add MongoDB as a Source:
  - Click + Source in the upper-right corner.
  - Select MongoDB from the list of available source types.
- Configure MongoDB Source:
  - Name: Give your MongoDB source a name (e.g., `mongo`, which matches how it's referenced in later queries).
  - Host: Enter `mongo` (the Docker container name).
  - Port: Enter `27017` (the default MongoDB port).
  - Authentication: If MongoDB is set to require authentication, enable authentication and fill in the following:
    - Username: `admin` (as per your Docker Compose configuration).
    - Password: `password` (as per your Docker Compose configuration).
    - Authentication Database: `admin` (the default MongoDB authentication database).
- Click Save: Once you've filled in the necessary details, click Save. Dremio will now be connected to MongoDB, allowing you to query collections directly from Dremio.
Creating a Unified View
With both data sources connected, we can now join the data to create a unified view.
Step 1: Explore the Data Sources
PostgreSQL Data
- Navigate to "Sources" > "postgres" > "mydatabase" > "public".
- You should see the tables `customers` and `orders`.
MongoDB Data
- Navigate to "Sources" > "mongo" > "mydatabase".
- You should see the collection `customer_preferences`.
Step 2: Join the PostgreSQL Tables
1. Create a New Tab for a New Query
- Open a new query tab in the SQL editor.
2. Write the SQL Query
Join the `customers` and `orders` tables:
```sql
SELECT
  c.customer_id,
  c.customer_name,
  c.email,
  c.created_at,
  o.order_id,
  o.order_date,
  o.amount
FROM postgres.public.customers c
INNER JOIN postgres.public.orders o
  ON c.customer_id = o.customer_id
```
3. Run the Query
- Click on "Run" to execute the query and preview the results.
4. Save the Result as a Virtual Dataset
- Click on "Save View" (the disk icon).
- Name it `customer_orders`.
- Save it under your "@username" space.
Step 3: Join with MongoDB Collection
1. Open the `customer_orders` Dataset
- Navigate to "@username" > `customer_orders`.
- Click on it to open.
2. Create a New Query Based on `customer_orders`
- With `customer_orders` open, open a new tab for the new query.
3. Write the SQL Query to Join with MongoDB
```sql
SELECT
  co.customer_id,
  co.customer_name,
  co.email,
  co.created_at,
  co.order_id,
  co.order_date,
  co.amount,
  cp.preference,
  cp.loyalty_status,
  cp.last_updated
FROM "@username".customer_orders co
INNER JOIN mongo.mydatabase.customer_preferences cp
  ON co.email = cp.customer_email
```
- Explanation:
  - We're joining the `customer_orders` dataset with the `customer_preferences` collection from MongoDB.
  - The join is performed on the `email` field from PostgreSQL and the `customer_email` field from MongoDB.
4. Run the Query
- Click on "Run" to execute and preview the unified dataset.
5. Save the Unified Dataset
- Click on "Save View".
- Name it `customer_unified_view`.
- Save it under "@username".
Step 4: Verify the Unified Dataset
- Navigate to "@username" >
customer_unified_view
. - Open it to ensure all fields are correctly populated and the data looks as expected.
Step 5: Optional Data Refinement
You can use Dremio's tools to further refine your dataset (a brief illustrative sketch follows the list below):
- Rename Columns: Click on the column header to rename it for clarity.
- Change Data Types: Ensure all data types are correct (e.g., dates are recognized as date types).
- Add Calculated Fields: Create new fields based on existing data if necessary.
- Filter Data: Apply filters to focus on specific subsets of data.
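As an illustration, the same kinds of refinements can also be expressed in SQL and fetched through the `dremio-simple-query` connection built later in this article. This is only a hypothetical sketch: the renamed columns, the 10% loyalty-credit field, and the tier filter are examples rather than part of the original dataset, and `dremio` refers to the `DremioConnection` created in the app code below.

```python
# A hedged sketch of refining the unified view: renamed columns, an explicit
# date cast, a hypothetical calculated field, and a filter. Replace "username"
# with your Dremio username, and reuse the `dremio` connection shown later.
refined_df = dremio.toPandas("""
    SELECT
        customer_name            AS "Customer",
        CAST(order_date AS DATE) AS "Order Date",       -- ensure a proper date type
        amount                   AS "Order Amount",
        amount * 0.10            AS "Loyalty Credit",   -- hypothetical calculated field
        loyalty_status
    FROM "@username".customer_unified_view
    WHERE loyalty_status IN ('Gold', 'Silver')          -- focus on higher tiers
""")
print(refined_df.head())
```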
Benefits of Using Dremio for Data Unification
By using Dremio to join data from PostgreSQL and MongoDB:
- Simplified Data Access: You interact with a single unified dataset instead of managing multiple data sources in your application code.
- Performance Optimization: Dremio optimizes queries and can cache results, improving performance.
- No Data Movement: Data remains in its original source; Dremio queries data in place.
- Security and Governance: Centralized control over data access and permissions.
- Scalability: Easily extend to include additional data sources without altering your application code.
With the unified dataset `customer_unified_view` now available in Dremio, we're ready to move on to building the Streamlit app. In the next section, we'll access the notebook server, write the Streamlit code to consume this dataset, and run the app to visualize and interact with our unified data.
Building the Streamlit App
With the unified dataset `customer_unified_view` available in Dremio, it's time to build an interactive Streamlit app to visualize and interact with the data. In this section, we'll:
- Access the Jupyter Notebook server provided by the Spark container.
- Write the Streamlit application code in a Python file.
- Use the `dremio-simple-query` library to connect to Dremio and fetch the data.
- Create a notebook to run the bash command that launches the Streamlit app.
- Confirm that the `docker-compose.yml` file exposes the necessary port for Streamlit.
Accessing the Notebook Server
The Spark container in our Docker environment comes with a Jupyter Notebook server, which we'll use to write and manage our code.
Step 1: Access the Jupyter Notebook Server
- Open your web browser and navigate to `http://localhost:8888`.
- Since the notebook server was started with `--NotebookApp.token=''` and `--NotebookApp.password=''`, no authentication is required.
Writing the Streamlit Code
We'll create a Python script for the Streamlit app that connects to Dremio, fetches the unified dataset, and displays it.
Step 1: Create a New Python File
- In the Jupyter interface, click on "New" > "Text File".
- Rename the file to `app.py`:
  - Click on "Untitled.txt".
  - Rename it to `app.py` and confirm.
Step 2: Install Required Libraries
Normally, before writing the code, you'd need to ensure that all required Python libraries are installed in the Spark container. In this case, the container should have all standard data libraries pre-installed, including the ones needed for this exercise (`dremio-simple-query`, `pandas`, `streamlit`, `python-dotenv`).
Step 3: Write the Streamlit App Code
Open `app.py` and start coding.
Import Necessary Libraries
```python
import streamlit as st
import pandas as pd
from dremio_simple_query.connect import DremioConnection, get_token
from dotenv import load_dotenv
import os
```
Load Environment Variables
We'll use a `.env` file to store sensitive information like usernames and passwords.
- Create a file named `vars.env`.
- Add the following content (replace the placeholders with your actual credentials):

```
DREMIO_USERNAME=your_dremio_username
DREMIO_PASSWORD=your_dremio_password
DREMIO_ENDPOINT=dremio:9047
DREMIO_FLIGHT_ENDPOINT=dremio:32010
```

- In your `app.py`, load the environment variables:
```python
# Load environment variables from .env file
load_dotenv('vars.env')

# Retrieve variables
username = os.getenv('DREMIO_USERNAME')
password = os.getenv('DREMIO_PASSWORD')
dremio_endpoint = os.getenv('DREMIO_ENDPOINT')
flight_endpoint = os.getenv('DREMIO_FLIGHT_ENDPOINT')
```
Obtain Dremio Authentication Token
```python
# Dremio REST API login endpoint
login_endpoint = f"http://{dremio_endpoint}/apiv2/login"

# Payload for login
payload = {
    "userName": username,
    "password": password
}

# Get authentication token
token = get_token(uri=login_endpoint, payload=payload)
```
Set Up Dremio Connection
```python
# Dremio Arrow Flight endpoint
arrow_endpoint = f"grpc://{flight_endpoint}"

# Establish connection
dremio = DremioConnection(token, arrow_endpoint)
```
Fetch Data from Dremio
```python
# SQL query to fetch data from the unified view
query = """
SELECT * FROM "@username".customer_unified_view
"""

# Fetch data as a Pandas DataFrame
df = dremio.toPandas(query)
```
Make sure to replace "username" with your Dremio username; for example, if your username were john, the reference would read `"@john".customer_unified_view`.
Build the Streamlit App Interface
```python
# Set the title of the app
st.title("Customer Insights Dashboard")

# Display the DataFrame
st.subheader("Unified Customer Data")
st.dataframe(df)

# Add interactive elements
st.sidebar.header("Filter Options")

# Filter by Loyalty Status
loyalty_status_options = df['loyalty_status'].unique()
selected_loyalty_status = st.sidebar.multiselect(
    "Select Loyalty Status:",
    options=loyalty_status_options,
    default=loyalty_status_options
)

# Filter by Preference
preference_options = df['preference'].unique()
selected_preferences = st.sidebar.multiselect(
    "Select Preferences:",
    options=preference_options,
    default=preference_options
)

# Apply Filters
filtered_df = df[
    (df['loyalty_status'].isin(selected_loyalty_status)) &
    (df['preference'].isin(selected_preferences))
]

st.subheader("Filtered Customer Data")
st.dataframe(filtered_df)

# Display Summary Metrics
st.subheader("Summary Metrics")

total_customers = filtered_df['customer_id'].nunique()
total_orders = filtered_df['order_id'].nunique()
total_revenue = filtered_df['amount'].sum()

col1, col2, col3 = st.columns(3)
col1.metric("Total Customers", total_customers)
col2.metric("Total Orders", total_orders)
col3.metric("Total Revenue", f"${total_revenue:,.2f}")
```
Add Data Visualization
```python
# Import additional libraries
import matplotlib.pyplot as plt

# Revenue by Loyalty Status
revenue_by_loyalty = filtered_df.groupby('loyalty_status')['amount'].sum().reset_index()

st.subheader("Revenue by Loyalty Status")

fig, ax = plt.subplots()
ax.bar(revenue_by_loyalty['loyalty_status'], revenue_by_loyalty['amount'], color='skyblue')
ax.set_xlabel("Loyalty Status")
ax.set_ylabel("Total Revenue")
st.pyplot(fig)
```
Step 4: Save app.py
- Ensure you save the file by clicking "File" > "Save" or pressing `Ctrl + S`.
Creating a Notebook to Run the Streamlit App
We need to run the Streamlit app from within the container. We'll create a Jupyter notebook to execute the necessary bash command.
Step 1: Create a New Notebook
- In the Jupyter interface, click on "New" > "Python 3" to create a new notebook.
- Rename the notebook to `run_streamlit.ipynb`:
  - Click on "Untitled" at the top.
  - Rename it and confirm.
Step 2: Write the Bash Command
In the first cell of the notebook, write the following code:
```
!pip install python-dotenv
!streamlit run app.py --server.port 8501 --server.address 0.0.0.0
```
Explanation:
- `!` indicates that the line is a shell command.
- `streamlit run app.py` runs the Streamlit application.
- `--server.port 8501` specifies the port on which the app will run.
- `--server.address 0.0.0.0` makes the app accessible from any network interface within the container.
Step 3: Run the Notebook Cell
- Execute the cell by clicking the "Run" button or pressing `Shift + Enter`.
- The output will show logs from Streamlit indicating that the app is running.
Step 4: Expose the Streamlit Port
By default, Streamlit runs on port `8501` inside the container. That port is already exposed to the host machine in the Docker Compose file, so the app should be accessible at `localhost:8501` in your browser.
Step 5: Access the Streamlit App
- Open your web browser and navigate to `http://localhost:8501`.
- You should see your Streamlit app running, displaying the unified customer data with interactive filters and visualizations.
Note on Environment Variables
To avoid hardcoding credentials in your code, using environment variables or a `.env` file is a best practice. We've already demonstrated how to use a `.env` file in this setup.
Conclusion
By following these steps, you've successfully:
- Connected to Dremio from within your Streamlit app using the `dremio-simple-query` library.
- Fetched and displayed unified data from multiple sources without managing multiple database connections.
- Built an interactive Streamlit application that allows users to filter and visualize customer data.
This approach highlights the power of Dremio in simplifying data unification and the ease of building data-driven applications with Streamlit.
Get Started with Dremio Today or Schedule a Meeting to Learn How Dremio Can Bring You Value!
Also make sure to try out this Hands-On Intro to Apache Iceberg with Apache Spark and Dremio!