63 minute read · September 15, 2024

Unifying Data Sources with Dremio to Power a Streamlit App

Alex Merced · Senior Tech Evangelist, Dremio

Businesses often face the daunting task of unifying data scattered across multiple platforms. Whether it's transactional data stored in PostgreSQL, customer preferences housed in MongoDB, or analytics data in a data warehouse like Snowflake, integrating these disparate sources into a cohesive dataset is a complex challenge. This fragmentation not only complicates data analytics but also hinders the development of robust applications that rely on consolidated information.

Overview of Data Unification Challenges

Complexity of Handling Multiple Data Sources

  • Diverse Data Formats: Different databases and data stores often use varying data formats and schemas, making it difficult to standardize data for analysis.
  • Multiple Connection Protocols: Each data source may require a unique connection setup, authentication method, and query language, increasing the complexity of the codebase.
  • Data Silos: Isolated data sources can lead to data silos, where critical information is inaccessible to other systems or teams, impeding holistic analysis.
  • Performance Bottlenecks: Joining data across platforms can be resource-intensive and slow, negatively impacting application performance and user experience.

The Need for Seamless Integration

To derive meaningful insights and make informed decisions, organizations require a unified view of their data. Seamless integration enables:

  • Efficient Data Analytics: Streamlined access to all data sources accelerates analysis and reporting.
  • Improved Decision-Making: A holistic view of data supports better strategic decisions.
  • Enhanced Application Development: Simplified data management reduces development time and complexity, allowing for more focus on building features and user experiences.

Introducing Dremio and Streamlit

To address these challenges, we explore two powerful tools that simplify data unification and application development: Dremio and Streamlit.

Dremio: The Lakehouse Platform for Unified Analytics

What is Dremio?

Dremio is a data lakehouse platform that offers unified analytics by connecting to a wide array of data sources—all in one place. It bridges the gap between data lakes, data warehouses, and various databases, providing a single interface for querying and managing data.

Key Features:

  • Unified Connectivity: Connects seamlessly with databases like PostgreSQL and MongoDB, data warehouses like Snowflake and Redshift, data lakes, and lakehouse catalogs like Nessie and Polaris.
  • Powerful SQL Query Engine: Allows for federated queries across multiple data sources without the need to move or copy data.
  • Semantic Layer: Enables the curation of business metrics and definitions to ensure consistency across the organization.
  • Lakehouse Management: Automates optimization and cleanup tasks for data formats like Apache Iceberg, and supports catalog versioning for creating zero-copy environments and workload isolation.

Streamlit: Simplifying Data App Development

What is Streamlit?

Streamlit is an open-source Python library designed for creating intuitive and interactive web applications for machine learning and data science projects. It turns data scripts into shareable web apps in minutes.

Why Use Streamlit?

  • Ease of Use: Build apps with minimal code using familiar Python syntax.
  • Interactivity: Incorporate widgets to allow users to interact with data in real-time.
  • Integration: Works seamlessly with popular data manipulation and visualization libraries like Pandas, NumPy, and Matplotlib.
  • Rapid Development: Accelerates the development process, allowing data scientists and analysts to focus on insights rather than web development intricacies.

By leveraging Dremio's unified analytics capabilities and Streamlit's simplicity in app development, we can overcome the challenges of data unification. In the following sections, we'll delve into how to join data from MongoDB and PostgreSQL using Dremio and build a Streamlit app that taps into this unified dataset—all without the need to manage multiple database connections within our Python code.

Introduction to Streamlit

The ability to quickly prototype and deploy interactive applications is invaluable. Streamlit is a heavyweight in this space, offering a straightforward way for data scientists and developers to turn data scripts into shareable web apps with minimal effort.

Key Features of Streamlit

  • Ease of Use: With its intuitive API, you can create web apps by writing Python scripts as you normally would. Streamlit automatically handles the UI elements.
  • Real-Time Interactivity: Streamlit apps update in real-time as users interact with widgets like sliders, text inputs, and buttons.
  • Integration with Data Libraries: Seamlessly integrates with popular Python libraries such as Pandas, NumPy, Matplotlib, Plotly, and more.
  • Rapid Prototyping: Ideal for quickly prototyping ideas and sharing them with others without extensive frontend development knowledge.

Why Use Streamlit?

Simplifies Development of Data-Driven Applications

Traditionally, building a web application requires knowledge of web frameworks and front-end development. Streamlit abstracts these complexities, allowing you to focus on the data and logic rather than the intricacies of web development.

  • No Web Development Skills Required: Write apps entirely in Python without dealing with HTML, CSS, or JavaScript.
  • Fast Development Cycle: Make changes to your script, save, and see the results immediately in the browser.
  • Collaboration-Friendly: Easy to share your app with colleagues or deploy it for broader access.

Enhances Data Visualization and Interaction

Streamlit makes it effortless to visualize data and build interactive dashboards; a small example follows the list below.

  • Dynamic Visualizations: Create charts and graphs that update based on user input.
  • Customizable Widgets: Use built-in widgets to allow users to filter data, select options, and control parameters.
  • Responsive Layouts: Automatically adjusts the layout for different screen sizes and devices.
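
For instance, here is a minimal sketch of a chart that reacts to a widget. The dataset and column names are made up purely for illustration, while st.selectbox and st.bar_chart are standard Streamlit components.

import streamlit as st
import pandas as pd

# Hypothetical sales data used purely for illustration
df = pd.DataFrame({
    "region": ["North", "South", "East", "West"] * 3,
    "month": ["Jan"] * 4 + ["Feb"] * 4 + ["Mar"] * 4,
    "revenue": [120, 90, 150, 80, 130, 95, 160, 85, 140, 100, 170, 90],
})

# The chart below re-renders whenever the user picks a different region
region = st.selectbox("Region", df["region"].unique())
monthly_revenue = df[df["region"] == region].set_index("month")["revenue"]
st.bar_chart(monthly_revenue)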

Supports Machine Learning and AI Workflows

Streamlit is particularly well-suited for machine learning practitioners.

  • Model Exploration: Build interfaces to adjust model parameters and observe outcomes in real-time.
  • Data Inspection: Interactively explore datasets to gain insights and identify patterns.
  • Deployment of Models: Share trained models with stakeholders through an accessible web app.

How Streamlit Works

Streamlit watches your Python script for changes and reruns it from top to bottom whenever it detects a modification or user interaction.

Basic Structure of a Streamlit App

Here's a simple example to illustrate how a Streamlit app is structured:

import streamlit as st
import pandas as pd

# Title of the app
st.title('Simple Streamlit App')

# Load data
data = pd.read_csv('data.csv')

# Display data
st.dataframe(data)

# User input
number = st.slider('Select a number', min_value=1, max_value=100)

# Display output
st.write(f'The selected number is {number}')

When you run this script using streamlit run app.py, it opens in your web browser and provides an interactive interface based on your code.
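
Because the whole script reruns on every interaction, it's common to cache expensive steps such as loading data. Here's a minimal sketch using Streamlit's built-in st.cache_data decorator (the CSV file name is just a placeholder):

import streamlit as st
import pandas as pd

@st.cache_data
def load_data(path: str) -> pd.DataFrame:
    # Re-executes only when 'path' changes; otherwise the cached result is reused
    return pd.read_csv(path)

data = load_data('data.csv')
st.dataframe(data)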

Running a Streamlit App

To run a Streamlit app:

  1. Install Streamlit: If you haven't already, install Streamlit using pip install streamlit.
  2. Execute the Script: Run the command streamlit run your_script.py in your terminal.
  3. View in Browser: The app will automatically open in a new browser tab.

Deployment Options

  • Local Deployment: Ideal for development and testing on your local machine.
  • Cloud Deployment: Deploy your app on cloud platforms like Heroku, AWS, or Streamlit Sharing for broader access.
  • Docker Containers: Use Docker to containerize your app for consistent deployment environments.

By incorporating Streamlit into your workflow, you can significantly enhance the way you interact with your data and models. It transforms static scripts into engaging, interactive applications, making it easier to communicate insights and collaborate with others.

In the context of our exercise, Streamlit will serve as the front-end interface for our unified dataset. By connecting to Dremio's unified data source, we'll build an interactive app that allows users to explore and analyze data joined from MongoDB and PostgreSQL—all without the complexity of managing multiple database connections within our code.

In the next section, we'll delve into a practical exercise where we simulate a real-world scenario: joining data from MongoDB and PostgreSQL using Dremio and building a Streamlit app to interact with this data.

The Exercise: Joining MongoDB and PostgreSQL Data for a Streamlit App

In this section, we'll delve into a practical exercise that demonstrates how Dremio simplifies the process of unifying data from multiple sources. We'll explore how to join data from MongoDB and PostgreSQL—two fundamentally different databases—and use this unified dataset to build an interactive Streamlit application. This exercise highlights the efficiency gained by using Dremio to handle data integration, allowing us to focus on developing the app without worrying about complex database connections and data merging logic in our code.

Scenario Overview

Imagine a scenario where you have customer data spread across two different databases:

  • PostgreSQL: Contains core customer information such as customer IDs, names, emails, and order history.
  • MongoDB: Stores customer preferences, loyalty statuses, and other semi-structured data.

Your goal is to build a Streamlit app that presents a comprehensive view of each customer by joining data from both databases. This unified view will enable users to:

  • View Customer Profiles: See personal information alongside preferences and loyalty status.
  • Analyze Purchasing Behavior: Correlate order history with customer preferences.
  • Enhance Marketing Strategies: Identify high-value customers based on combined data insights.

Data Details

  • PostgreSQL Tables:
    • customers: Contains customer_id, customer_name, email, and created_at.
    • orders: Includes order_id, order_date, amount, and customer_id (foreign key to customers).
  • MongoDB Collection:
    • customer_preferences: Stores documents with customer_email, preference, loyalty_status, and last_updated.

Traditional Approach vs. Dremio Approach

Traditional Approach: Managing Multiple Database Connections in Python

In a conventional setup without Dremio, you would need to (a rough code sketch follows this list):

  1. Establish Separate Connections: Write code to connect to both PostgreSQL and MongoDB using their respective drivers (e.g., psycopg2 for PostgreSQL and pymongo for MongoDB).
  2. Retrieve Data Individually: Execute queries to fetch data from each database.
  3. Data Transformation: Convert the retrieved data into compatible formats (e.g., Pandas DataFrames).
  4. Data Joining Logic: Implement logic to merge datasets, often requiring careful handling of data types and missing values.
  5. Error Handling: Manage exceptions and connection issues for both databases separately.
  6. Maintain Credentials and Configurations: Securely store and manage connection strings, usernames, and passwords for each database.
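
For illustration, here is a rough, hypothetical sketch of what that traditional setup might look like in Python; the hosts, credentials, and field names are placeholders, not code from this exercise.

import pandas as pd
import psycopg2
from pymongo import MongoClient

# 1. Separate connections to each database (placeholder connection details)
pg_conn = psycopg2.connect(host="localhost", port=5432, dbname="mydatabase",
                           user="admin", password="password")
mongo_client = MongoClient("mongodb://admin:password@localhost:27017/")

# 2. Retrieve data individually from each source
customers = pd.read_sql("SELECT customer_id, customer_name, email FROM customers", pg_conn)
orders = pd.read_sql("SELECT order_id, order_date, amount, customer_id FROM orders", pg_conn)
prefs = pd.DataFrame(
    list(mongo_client["mydatabase"]["customer_preferences"].find({}, {"_id": 0}))
)

# 3 & 4. Transform and join the datasets in application code
merged = customers.merge(orders, on="customer_id").merge(
    prefs, left_on="email", right_on="customer_email", how="left"
)

# 5 & 6. Error handling and credential management for both systems
# still have to be layered on top of everything above
pg_conn.close()
mongo_client.close()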

Challenges with the Traditional Approach:

  • Complexity: Increased code complexity due to handling multiple connections and data transformations.
  • Maintenance Overhead: More code to maintain, test, and debug.
  • Performance Issues: Potential inefficiencies in data retrieval and merging, especially with large datasets.
  • Security Risks: Managing multiple sets of credentials increases the risk of misconfiguration or exposure.

Dremio Approach: Simplifying Data Unification

By leveraging Dremio, you can streamline the entire process (see the short sketch after this list):

  1. Single Connection Point: Connect to Dremio using its powerful SQL query engine.
  2. Unified Data Access: Dremio connects to both PostgreSQL and MongoDB, handling the data retrieval internally.
  3. Federated Queries: Write SQL queries in Dremio to join data across both databases seamlessly.
  4. Virtual Datasets: Create virtual datasets (views) in Dremio that represent the joined data, without physically moving or copying data.
  5. Simplified Application Code: In your Streamlit app, connect only to Dremio to access the unified dataset.
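
For contrast, here is a minimal sketch of what the application-side code can shrink to with Dremio. It uses the dremio-simple-query library covered in detail later in this article; the endpoints, credentials, and view name shown are placeholders.

from dremio_simple_query.connect import DremioConnection, get_token

# One login, one connection, one query against the unified view
token = get_token(uri="http://dremio:9047/apiv2/login",
                  payload={"userName": "your_username", "password": "your_password"})
dremio = DremioConnection(token, "grpc://dremio:32010")
df = dremio.toPandas('SELECT * FROM "@your_username".customer_unified_view')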

Advantages of the Dremio Approach:

  • Reduced Complexity: Eliminate the need to manage multiple database connections in your application code.
  • Performance Optimization: Dremio's query engine optimizes data retrieval and joins, improving performance.
  • Centralized Security: Manage access control and credentials centrally within Dremio.
  • Scalability: Easily extend to additional data sources by configuring them in Dremio without altering application code.
  • Consistent Data Schema: Ensure consistent data types and schemas through Dremio's semantic layer.

How Dremio Simplifies the Process

Data Source Configuration

In Dremio, you configure data sources for PostgreSQL and MongoDB:

  • Add PostgreSQL Source:
    • Specify connection details such as host, port, database name, username, and password.
  • Add MongoDB Source:
    • Provide connection information, including host, port, and authentication details.

Creating a Unified View

With data sources configured, you can:

  1. Explore Schemas: Browse the tables in PostgreSQL and collections in MongoDB directly from Dremio's UI.
  2. Write SQL Queries: Use Dremio's SQL editor to write queries that join data across these sources.
  3. Save as Virtual Dataset: Save this query as a virtual dataset in Dremio for easy reuse.
  4. Apply Transformations: Use Dremio's tools to further refine data, such as filtering, calculating new fields, or standardizing data types.

Accessing Unified Data in Streamlit

In your Streamlit app:

  • Single Connection: Use the dremio-simple-query library to connect to Dremio.
  • Query Unified Dataset: Fetch data from the virtual dataset you created.
  • Build Interactive Features: Utilize Streamlit's components to display and interact with the data.

By handling the data unification within Dremio, you significantly simplify your application code. This approach allows you to focus on delivering insights and enhancing user experience rather than dealing with the intricacies of data integration.

In the next sections, we'll guide you through setting up the environment, configuring data sources in Dremio, and building the Streamlit app step by step. You'll see firsthand how Dremio streamlines data unification and how easy it is to build a powerful, interactive application using Streamlit.

Setting Up the Environment

To embark on this exercise, you'll need to set up a development environment that simulates a real-world data ecosystem with multiple data sources. We'll use Docker to containerize and manage our services, making it easier to replicate and maintain the environment. The repository AlexMercedCoder/dremio-notebook provides all the necessary configurations and seed data to get you started.

Forking and Cloning the Repository

First, you'll need to fork and clone the repository to your local machine:

Step 1: Fork the Repository

  1. Navigate to the Repository: Go to AlexMercedCoder/dremio-notebook on GitHub.
  2. Fork the Repository: Click on the Fork button at the top right corner of the page. This creates a copy of the repository under your GitHub account.

Step 2: Clone the Repository

Once you've forked the repository:

  1. Clone the Repository: Open your terminal and run the following command, replacing your-username with your GitHub username:

    git clone https://github.com/your-username/dremio-notebook.git
  2. Navigate to the Directory:

    cd dremio-notebook

Now, you have the repository cloned locally and are ready to set up the Docker environment.

Understanding the docker-compose.yml File

The docker-compose.yml file orchestrates multiple services required for this exercise. Let's break down the services and their configurations.

Overview of Services

The Docker Compose file defines the following services:

  1. Dremio: The Lakehouse Platform that will unify data sources.
  2. Spark: Provides a notebook server and computational engine.
  3. Minio: An S3-compatible object storage for data lakes.
  4. Nessie: A version control system for data lakes.
  5. PostgreSQL: Relational database holding customer data.
  6. MongoDB: NoSQL database storing customer preferences.

Detailed Breakdown

1. Dremio Service

dremio:
  image: dremio/dremio-oss:latest
  container_name: dremio
  ports:
    - 9047:9047       # Dremio UI
    - 31010:31010     # Dremio JDBC/ODBC
    - 32010:32010     # Dremio Arrow Flight
    - 45678:45678     # Dremio internal
  environment:
    - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
  networks:
    - dremio-notebook
  • Image: Uses the latest Community Edition Dremio image.
  • Ports:
    • 9047: Access the Dremio UI.
    • 32010: Used for Arrow Flight connections (important for our dremio-simple-query library).
  • Environment Variables: Configures Dremio's Java options to store reflections on the container.
  • Network: Connected to the dremio-notebook network for inter-container communication.

2. Spark Service

spark:
  platform: linux/x86_64
  image: alexmerced/spark35nb:latest
  ports:
    - 8080:8080       # Spark Master Web UI
    - 7077:7077       # Spark Master Port
    - 8081:8081       # Spark Worker Web UI
    - 4040-4045:4040-4045 # Spark Job UIs
    - 18080:18080     # Spark History Server
    - 8888:8888       # Jupyter Notebook
    - 8501:8501       # For Streamlit Apps
  environment:
    # Spark and AWS configurations
  volumes:
    - ./seed/notebook-seed:/workspace/seed-data
  container_name: spark
  entrypoint: >
    /bin/bash -c "
    pip install streamlit python-dotenv && \
    # Start Spark services
    /opt/spark/sbin/start-master.sh && \
    /opt/spark/sbin/start-worker.sh spark://$(hostname):7077 && \
    mkdir -p /tmp/spark-events && \
    start-history-server.sh && \
    # Start Jupyter Lab
    jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token='' --NotebookApp.password='' && \
    tail -f /dev/null
    "
  networks:
    - dremio-notebook
  • Image: Custom image with Spark and Jupyter Lab.
  • Ports:
    • 8888: Access the Jupyter Notebook server.
    • Other ports are for Spark services and UI.
    • 8501: For our Streamlit app.
  • Volumes: Mounts seed data into the container.
  • Entry Point:
    • Installs Streamlit and python-dotenv.
    • Starts Spark master and worker.
    • Initiates Jupyter Lab.
  • Network: Connected to dremio-notebook.

3. Minio Service (not needed for this exercise)

minio:
  image: minio/minio
  container_name: minio
  environment:
    # Minio configurations
  ports:
    - "9000:9000"     # Minio API
    - "9001:9001"     # Minio Console
  volumes:
    - ./seed/minio-data:/minio-data
  entrypoint: >
    /bin/sh -c "
    minio server /data --console-address ':9001' & \
    sleep 5; \
    mc alias set myminio http://localhost:9000 admin password; \
    # Create buckets and upload seed data
    mc mb myminio/datalake; \
    mc mb myminio/datalakehouse; \
    mc mb myminio/warehouse; \
    mc mb myminio/seed; \
    mc cp /minio-data/* myminio/seed/; \
    tail -f /dev/null"
  networks:
    - dremio-notebook
  • Purpose: Provides S3-compatible storage.
  • Ports:
    • 9000: Minio API.
    • 9001: Minio Console for UI access.
  • Volumes: Seeds data into Minio buckets.
  • Entry Point:
    • Starts Minio server.
    • Uses mc (Minio Client) to create buckets and upload seed data.

4. Nessie Service (not needed for this exercise)

nessie:
  image: projectnessie/nessie:latest
  container_name: nessie
  environment:
    # Nessie configurations
  ports:
    - "19120:19120"   # Nessie API port
  networks:
    - dremio-notebook
  • Purpose: Provides cataloging of Iceberg tables and catalog versioning.
  • Ports:
    • 19120: Nessie API.
  • Configuration: Uses RocksDB for storage.

5. PostgreSQL Service

postgres:
  image: postgres:13
  container_name: postgres
  environment:
    POSTGRES_USER: admin
    POSTGRES_PASSWORD: password
    POSTGRES_DB: mydatabase
  ports:
    - "5435:5432"
  volumes:
    - ./seed/postgres:/docker-entrypoint-initdb.d/
  networks:
    - dremio-notebook
  • Purpose: Stores relational customer data.
  • Ports:
    • 5435: Mapped to PostgreSQL's default port 5432.
  • Volumes: Seeds initial data using SQL scripts placed in ./seed/postgres.
  • Environment Variables: Sets up the default user, password, and database.

6. MongoDB Service

mongo:
  image: mongo:4.4
  container_name: mongo
  ports:
    - "27017:27017"
  volumes:
    - ./seed/mongo:/docker-entrypoint-initdb.d/
  environment:
    MONGO_INITDB_ROOT_USERNAME: admin
    MONGO_INITDB_ROOT_PASSWORD: password
  networks:
    - dremio-notebook
  • Purpose: Holds customer preference data in a NoSQL format.
  • Ports:
    • 27017: Default MongoDB port.
  • Volumes: Seeds initial data using a JavaScript file in ./seed/mongo.
  • Environment Variables: Configures root username and password.

Network Configuration

All services are connected to a custom Docker network named dremio-notebook. This allows the containers to communicate with each other using their service names (e.g., postgres, mongo).

Seed Files for MongoDB and PostgreSQL

To simulate a realistic data environment, seed files are used to populate the databases with initial data upon container startup.

MongoDB Seed File (./seed/mongo/init.js)

Purpose: Inserts customer preference data into the customer_preferences collection.

Content:

// Use the 'mydatabase' database
db = db.getSiblingDB('mydatabase');

// Create a collection named 'customer_preferences'
db.createCollection('customer_preferences');

// Insert data into the 'customer_preferences' collection
db.customer_preferences.insertMany([
  {
    customer_email: 'john@example.com',
    preference: 'Electronics',
    loyalty_status: 'Gold',
    last_updated: new Date('2024-09-01')
  },
  {
    customer_email: 'jane@example.com',
    preference: 'Books',
    loyalty_status: 'Silver',
    last_updated: new Date('2024-09-02')
  },
  {
    customer_email: 'bob@example.com',
    preference: 'Clothing',
    loyalty_status: 'Bronze',
    last_updated: new Date('2024-09-03')
  }
]);

Data Schema:

  • customer_email: String (acts as a key to join with PostgreSQL data).
  • preference: String (customer's preferred category).
  • loyalty_status: String (e.g., Gold, Silver, Bronze).
  • last_updated: Date (timestamp of the last update).

PostgreSQL Seed File (./seed/postgres/init.sql)

Purpose: Creates tables and inserts customer and order data.

Content:

-- Create customers table
CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    customer_name VARCHAR(255) NOT NULL,
    email VARCHAR(255) NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert customer data
INSERT INTO customers (customer_name, email)
VALUES 
('John Doe', 'john@example.com'),
('Jane Smith', 'jane@example.com'),
('Bob Johnson', 'bob@example.com');

-- Create orders table
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    order_date DATE NOT NULL,
    amount DECIMAL(10, 2) NOT NULL,
    customer_id INT REFERENCES customers(customer_id)
);

-- Insert order data
INSERT INTO orders (order_date, amount, customer_id)
VALUES 
('2024-09-01', 100.50, 1),
('2024-09-02', 150.75, 2),
('2024-09-03', 200.00, 3);

Data Schema:

  • Customers Table:
    • customer_id: Serial primary key.
    • customer_name: String.
    • email: String (unique, used to join with MongoDB data).
    • created_at: Timestamp.
  • Orders Table:
    • order_id: Serial primary key.
    • order_date: Date.
    • amount: Decimal.
    • customer_id: Integer (foreign key referencing customers.customer_id).

How the Seed Files Work

  • Automatic Execution:
    • For PostgreSQL, any .sql files in /docker-entrypoint-initdb.d/ are executed upon container startup.
    • For MongoDB, any .js files in the same directory are executed.
  • Data Population:
    • This ensures that the databases are pre-populated with sample data, eliminating the need for manual data entry.

Running the Environment

Step 1: Build and Run the Containers

In your terminal, within the dremio-notebook directory, run:

docker-compose up -d
  • The -d flag runs the containers in detached mode.

Step 2: Verify the Services

Ensure that all services are running:

docker-compose ps

You should see all the services (dremio, spark, minio, nessie, postgres, mongo) listed and running.

Step 3: Access the Services

  • Dremio UI: Open your browser and navigate to http://localhost:9047.
  • Jupyter Notebook: Accessible at http://localhost:8888.
  • Minio Console: Visit http://localhost:9001.
  • PostgreSQL: Can be accessed via localhost:5435.
  • MongoDB: Available at localhost:27017.
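
With the services above running, you can optionally verify that the seed data loaded by connecting to the databases directly from Python. This is a quick sketch, assuming psycopg2-binary and pymongo are installed wherever you run it; the credentials and ports match the Docker Compose file.

import psycopg2
from pymongo import MongoClient

# PostgreSQL is mapped to host port 5435
pg = psycopg2.connect(host="localhost", port=5435, dbname="mydatabase",
                      user="admin", password="password")
with pg.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM customers")
    print("customers rows:", cur.fetchone()[0])
pg.close()

# MongoDB uses the root credentials defined in docker-compose.yml
mongo = MongoClient("mongodb://admin:password@localhost:27017/?authSource=admin")
print("preference docs:", mongo["mydatabase"]["customer_preferences"].count_documents({}))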

By setting up this environment, you have all the necessary components to proceed with connecting the data sources to Dremio and building your Streamlit application. The Docker Compose setup abstracts the complexity of installing and configuring each service individually, allowing you to focus on the core objectives of the exercise.

In the next section, we'll guide you through connecting PostgreSQL and MongoDB to Dremio, joining the data, and creating a unified view that will serve as the backbone of our Streamlit app.

Connecting PostgreSQL and MongoDB to Dremio

Now that Dremio is running, the next step is to connect our data sources. When you first access Dremio at localhost:9047, you'll be prompted to set up a username and password; make note of these for later.

Step 1: Add PostgreSQL as a Data Source

Steps:

  1. Log in to Dremio: If you're not already logged in, go to http://localhost:9047 and log in.
  2. Navigate to Sources:
    • Click on the Sources tab in the left-hand sidebar.
  3. Add Postgres as a Source:
    • Click + Source in the upper-right corner.
    • Select PostgreSQL from the list of available source types.
  4. Configure Postgres Source:
    • Name: Give your Postgres source a name (e.g., postgres, which is the name used in the queries later in this article).
    • Hostname: Enter postgres (the Docker Compose service name, which acts as the hostname on the dremio-notebook network).
    • Port: Enter 5432.
    • Database: Enter mydatabase (the database name set in your Docker Compose environment variables).
    • Username: Enter admin (the Postgres user defined in your Docker Compose environment).
    • Password: Enter password (the Postgres password defined in your Docker Compose environment).
  5. Click Save: After filling in the details, click Save. You can now query data stored in your Postgres database from Dremio.

Step 2: Add MongoDB as a Data Source

MongoDB is a NoSQL database, and Dremio has native support for MongoDB connections.

Steps:

  1. Log in to Dremio: If you're not already logged in, go to http://localhost:9047 and log in.
  2. Navigate to Sources:
    • Click on the Sources tab in the left-hand sidebar.
  3. Add MongoDB as a Source:
    • Click + Source in the upper-right corner.
    • Select MongoDB from the list of available source types.
  4. Configure MongoDB Source:
    • Name: Give your MongoDB source a name (e.g., mongo).
    • Host: Enter mongo (the Docker Compose service name, which acts as the hostname on the dremio-notebook network).
    • Port: Enter 27017 (the default MongoDB port).
    • Authentication: If MongoDB is set to require authentication, enable authentication and fill in the following:
      • Username: admin (as per your Docker Compose configuration).
      • Password: password (as per your Docker Compose configuration).
      • Authentication Database: admin (the default MongoDB authentication database).
  5. Click Save: Once you’ve filled in the necessary details, click Save. Dremio will now be connected to MongoDB, allowing you to query collections directly from Dremio.

Creating a Unified View

With both data sources connected, we can now join the data to create a unified view.

Step 1: Explore the Data Sources

PostgreSQL Data

  • Navigate to "Sources" > "postgres" > "public".
  • You should see the tables customers and orders.

MongoDB Data

  • Navigate to "Sources" > "mongo" > "mydatabase".
  • You should see the collection customer_preferences.

Step 2: Join the PostgreSQL Tables

1. Open a New Query Tab

  • In Dremio's SQL editor, open a new tab for your query.

2. Write the SQL Query

Join the customers and orders tables:

SELECT
  c.customer_id,
  c.customer_name,
  c.email,
  c.created_at,
  o.order_id,
  o.order_date,
  o.amount
FROM
  postgres.public.customers c
  INNER JOIN postgres.public.orders o ON c.customer_id = o.customer_id

3. Run the Query

  • Click on "Run" to execute the query and preview the results.

4. Save the Result as a Virtual Dataset

  • Click on "Save View" (the disk icon).
  • Name it customer_orders.
  • Save it under your "@username" space.

Step 3: Join with MongoDB Collection

1. Open the customer_orders Dataset

  • Navigate to "@username" > customer_orders.
  • Click on it to open.

2. Create a New Query Based on customer_orders

  • With customer_orders open, start a new query in a new tab.

3. Write the SQL Query to Join with MongoDB

SELECT
  co.customer_id,
  co.customer_name,
  co.email,
  co.created_at,
  co.order_id,
  co.order_date,
  co.amount,
  cp.preference,
  cp.loyalty_status,
  cp.last_updated
FROM
  "@username".customer_orders co
  INNER JOIN mongo.mydatabase.customer_preferences cp ON co.email = cp.customer_email
  • Explanation:
    • We're joining the customer_orders dataset with the customer_preferences collection from MongoDB.
    • The join is performed on the email field from PostgreSQL and the customer_email field from MongoDB.

4. Run the Query

  • Click on "Run" to execute and preview the unified dataset.

5. Save the Unified Dataset

  • Click on "Save View".
  • Name it customer_unified_view.
  • Save it under "@username".

Step 4: Verify the Unified Dataset

  • Navigate to "@username" > customer_unified_view.
  • Open it to ensure all fields are correctly populated and the data looks as expected.

Step 5: Optional Data Refinement

You can use Dremio's tools to further refine your dataset:

  • Rename Columns: Click on the column header to rename it for clarity.
  • Change Data Types: Ensure all data types are correct (e.g., dates are recognized as date types).
  • Add Calculated Fields: Create new fields based on existing data if necessary.
  • Filter Data: Apply filters to focus on specific subsets of data.

Benefits of Using Dremio for Data Unification

By using Dremio to join data from PostgreSQL and MongoDB:

  • Simplified Data Access: You interact with a single unified dataset instead of managing multiple data sources in your application code.
  • Performance Optimization: Dremio optimizes queries and can cache results, improving performance.
  • No Data Movement: Data remains in its original source; Dremio queries data in place.
  • Security and Governance: Centralized control over data access and permissions.
  • Scalability: Easily extend to include additional data sources without altering your application code.

With the unified dataset customer_unified_view now available in Dremio, we're ready to move on to building the Streamlit app. In the next section, we'll access the notebook server, write the Streamlit code to consume this dataset, and run the app to visualize and interact with our unified data.

Building the Streamlit App

With the unified dataset customer_unified_view available in Dremio, it's time to build an interactive Streamlit app to visualize and interact with the data. In this section, we'll:

  • Access the Jupyter Notebook server provided by the Spark container.
  • Write the Streamlit application code in a Python file.
  • Use the dremio-simple-query library to connect to Dremio and fetch the data.
  • Create a notebook to run the bash command that launches the Streamlit app.
  • Confirm that the docker-compose.yml file exposes the necessary port (8501) for Streamlit.

Accessing the Notebook Server

The Spark container in our Docker environment comes with a Jupyter Notebook server, which we'll use to write and manage our code.

Step 1: Access the Jupyter Notebook Server

  • Open your web browser and navigate to http://localhost:8888.
  • Since the notebook server was started with --NotebookApp.token='' and --NotebookApp.password='', no authentication is required.

Writing the Streamlit Code

We'll create a Python script for the Streamlit app that connects to Dremio, fetches the unified dataset, and displays it.

Step 1: Create a New Python File

  • In the Jupyter interface, click on "New" > "Text File".
  • Rename the file to app.py:
    • Click on "Untitled.txt".
    • Rename it to app.py and confirm.

Step 2: Install Required Libraries

Normally, before writing the code, you'd need to ensure that all required Python libraries are installed in the Spark container. In this case, the container should already have the standard data libraries pre-installed, including the ones needed for this exercise (dremio-simple-query, pandas, streamlit, python-dotenv).
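
If any of these happen to be missing in your environment, you can install them from a notebook cell, for example:

!pip install dremio-simple-query pandas streamlit python-dotenv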

Step 3: Write the Streamlit App Code

Open app.py and start coding.

Import Necessary Libraries

import streamlit as st
import pandas as pd
from dremio_simple_query.connect import DremioConnection, get_token
from dotenv import load_dotenv
import os

Load Environment Variables

We'll use a .env file to store sensitive information like usernames and passwords.

  • Create a file named vars.env.
  • Add the following content (replace placeholders with your actual credentials):
    DREMIO_USERNAME=your_dremio_username
    DREMIO_PASSWORD=your_dremio_password
    DREMIO_ENDPOINT=dremio:9047
    DREMIO_FLIGHT_ENDPOINT=dremio:32010
  • In your app.py, load the environment variables:
# Load environment variables from the vars.env file
load_dotenv('vars.env')

# Retrieve variables
username = os.getenv('DREMIO_USERNAME')
password = os.getenv('DREMIO_PASSWORD')
dremio_endpoint = os.getenv('DREMIO_ENDPOINT')
flight_endpoint = os.getenv('DREMIO_FLIGHT_ENDPOINT')

Obtain Dremio Authentication Token

# Dremio REST API login endpoint
login_endpoint = f"http://{dremio_endpoint}/apiv2/login"

# Payload for login
payload = {
    "userName": username,
    "password": password
}

# Get authentication token
token = get_token(uri=login_endpoint, payload=payload)

Set Up Dremio Connection

# Dremio Arrow Flight endpoint
arrow_endpoint = f"grpc://{flight_endpoint}"

# Establish connection
dremio = DremioConnection(token, arrow_endpoint)

Fetch Data from Dremio

# SQL query to fetch data from the unified view
query = """
SELECT * FROM "@username".customer_unified_view
"""

# Fetch data as a Pandas DataFrame
df = dremio.toPandas(query)

Make sure to replace "username" with your actual Dremio username; for example, if your username were john, the query would reference "@john".customer_unified_view.

Build the Streamlit App Interface

# Set the title of the app
st.title("Customer Insights Dashboard")

# Display the DataFrame
st.subheader("Unified Customer Data")
st.dataframe(df)

# Add interactive elements
st.sidebar.header("Filter Options")

# Filter by Loyalty Status
loyalty_status_options = df['loyalty_status'].unique()
selected_loyalty_status = st.sidebar.multiselect(
    "Select Loyalty Status:",
    options=loyalty_status_options,
    default=loyalty_status_options
)

# Filter by Preference
preference_options = df['preference'].unique()
selected_preferences = st.sidebar.multiselect(
    "Select Preferences:",
    options=preference_options,
    default=preference_options
)

# Apply Filters
filtered_df = df[
    (df['loyalty_status'].isin(selected_loyalty_status)) &
    (df['preference'].isin(selected_preferences))
]

st.subheader("Filtered Customer Data")
st.dataframe(filtered_df)

# Display Summary Metrics
st.subheader("Summary Metrics")
total_customers = filtered_df['customer_id'].nunique()
total_orders = filtered_df['order_id'].nunique()
total_revenue = filtered_df['amount'].sum()

col1, col2, col3 = st.columns(3)
col1.metric("Total Customers", total_customers)
col2.metric("Total Orders", total_orders)
col3.metric("Total Revenue", f"${total_revenue:,.2f}")

Add Data Visualization

# Import additional libraries (in practice, keep imports at the top of app.py)
import matplotlib.pyplot as plt

# Revenue by Loyalty Status
revenue_by_loyalty = filtered_df.groupby('loyalty_status')['amount'].sum().reset_index()

st.subheader("Revenue by Loyalty Status")
fig, ax = plt.subplots()
ax.bar(revenue_by_loyalty['loyalty_status'], revenue_by_loyalty['amount'], color='skyblue')
ax.set_xlabel("Loyalty Status")
ax.set_ylabel("Total Revenue")
st.pyplot(fig)

Step 4: Save app.py

  • Ensure you save the file by clicking "File" > "Save" or pressing Ctrl + S.

Creating a Notebook to Run the Streamlit App

We need to run the Streamlit app from within the container. We'll create a Jupyter notebook to execute the necessary bash command.

Step 1: Create a New Notebook

  • In the Jupyter interface, click on "New" > "Python 3" to create a new notebook.
  • Rename the notebook to run_streamlit.ipynb:
    • Click on "Untitled" at the top.
    • Rename it and confirm.

Step 2: Write the Bash Command

In the first cell of the notebook, write the following code:

!pip install python-dotenv
!streamlit run app.py --server.port 8501 --server.address 0.0.0.0

Explanation:

  • ! indicates that the line is a shell command.
  • streamlit run app.py runs the Streamlit application.
  • --server.port 8501 specifies the port on which the app will run.
  • --server.address 0.0.0.0 makes the app accessible from any network interface within the container.

Step 3: Run the Notebook Cell

  • Execute the cell by clicking the "Run" button or pressing Shift + Enter.
  • The output will show logs from Streamlit indicating that the app is running.

Step 4: Expose the Streamlit Port

By default, Streamlit runs on port 8501 inside the container. That port is already mapped to the host machine in the docker-compose.yml file, so the app should be accessible at localhost:8501 in your browser.

Step 5: Access the Streamlit App

  • Open your web browser and navigate to http://localhost:8501.
  • You should see your Streamlit app running, displaying the unified customer data with interactive filters and visualizations.

Note on Environment Variables

To avoid hardcoding credentials in your code, using environment variables or a .env file is a best practice. We've already demonstrated how to use a .env file in this setup.

Conclusion

By following these steps, you've successfully:

  • Connected to Dremio from within your Streamlit app using the dremio-simple-query library.
  • Fetched and displayed unified data from multiple sources without managing multiple database connections.
  • Built an interactive Streamlit application that allows users to filter and visualize customer data.

This approach highlights the power of Dremio in simplifying data unification and the ease of building data-driven applications with Streamlit.

Get Started with Dremio Today or Schedule a Meeting to Learn How Dremio Can Bring You Value!

Also make sure to try out this Hands-On Intro to Apache Iceberg with Apache Spark and Dremio!
