
24 minute read · December 11, 2024

Football Playoffs Hackathon powered by Dremio

Alex Merced · Senior Tech Evangelist, Dremio

The Output

Welcome to the 2024 Football Playoffs Hackathon powered by Dremio. Teams from across the globe will apply their analytics prowess to predict: 

  • American Champion
  • National Champion
  • Overall League Winner

Each team must analyze the stats provided and support its selections with detailed insights.

Judging criteria will include the accuracy of predictions, the quality of analysis, the clarity of visual presentation, and the depth of insights shared. 

Time for kickoff!

Why Compete

  • All Qualified Submissions: Meet the requirements, and all team members will receive a special-edition Gnarly football t-shirt, water bottle, and Hackathon digital badge.
  • Top Submission: The selected winning team will have the chance to present their solution in person at Subsurface 2025 (date TBD) in NYC.

Note: Teams shouldn’t be larger than 5 people. You may include more than 5 team members if you wish, but qualified team submissions will only receive Dremio swag for up to 5 people. Current Dremio employees are ineligible to compete in this Hackathon.

Disclaimer: Due to applicable trade control law and the policies of our shipping partners, prizes cannot be shipped to the following countries: Cuba, Iran, North Korea, Syria, Ukraine, Russia, and Belarus. This list is subject to change without notice. 

Introduction

This guide will walk you through setting up a powerful local data environment so you can focus on what matters—gaining insights and building visualizations, applications, or even AI/ML models with real football data.

Using Docker Compose, we’ll set up a local environment to run Dremio for querying, MinIO for data storage, Apache Superset for BI, and a Jupyter Notebook environment for interactive data science work. Here’s everything you need to get started.

How to Participate

Once you've set up your local environment and loaded the football data, here's how to proceed:

  1. Dive into the Dataset
    • Analyze the main dataset provided, exploring key insights and patterns.
    • Feel free to supplement this data by integrating additional sources to enrich your analysis.
  2. Transform Your Data with Dremio Views
    • Use Dremio’s semantic layer to create custom views, transforming the data to suit your specific project goals.
  3. Build Your Final Project
    • Design a compelling visualization, generate an insightful report, or develop an application that highlights your findings.

Link to Dremio University Hackathon Course where you'll submit your entries.

Final Submission Steps:

Once you're happy with your project, create a short video presentation:

  • Record a 3-5 minute video:
    • Spend 1-2 minutes walking through your data modeling and transformations in Dremio.
    • Use 1-2 minutes to showcase your final product.
    • Dedicate 1-2 minutes to share your experience and insights from the project.
  • Upload the Video:
    • Post the video to YouTube as “Unlisted” (or public, if you prefer) and submit the link via the provided form.
  • Submit the form via the Dremio University Hackathon Course linked above.

Setting Up Your Environment

Step 1: Understanding Docker and Docker Compose

Docker is a platform for developing, shipping, and running containerized applications. Containers bundle software with its dependencies, ensuring consistent behavior across environments.

Docker Compose is a tool that allows you to define and run multi-container Docker applications using a single docker-compose.yml file. In this file, you define all services, their configurations, and how they interact.

Step 2: Creating the Docker Compose File

Let’s create the docker-compose.yml file that defines our services.

  1. Open a text editor (VS Code, Notepad, etc.).
  2. Create a new file named docker-compose.yml.

Copy and paste the following configuration into it (superset and datanotebook are optional if you prefer other tools):

version: "3"

services:
  # Nessie Catalog Server Using In-Memory Store
  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    networks:
      - iceberg
    ports:
      - 19120:19120

  # MinIO Storage Server
  ## Creates two buckets named lakehouse and lake
  minio:
    image: minio/minio:latest
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    networks:
      - iceberg
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
    entrypoint: >
      /bin/sh -c "
      minio server /data --console-address ':9001' &
      sleep 5 &&
      mc alias set myminio http://localhost:9000 admin password &&
      mc mb myminio/lakehouse &&
      mc mb myminio/lake &&
      tail -f /dev/null
      "

  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    container_name: dremio
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      - iceberg
    ports:
      - 9047:9047
      - 31010:31010
      - 32010:32010

  # Superset
  superset:
    image: alexmerced/dremio-superset
    container_name: superset
    networks:
      - iceberg
    ports:
      - 8088:8088

  # Data Science Notebook (Jupyter Notebook)
  datanotebook:
    image: alexmerced/datanotebook
    container_name: datanotebook
    environment:
      - JUPYTER_TOKEN= # Set a token if desired, or leave blank to disable token authentication
    networks:
      - iceberg
    ports:
      - 8888:8888
    volumes:
      - ./notebooks:/home/pydata/work # Mounts a local folder for persistent notebook storage

networks:
  iceberg:


Explanation of Services:

  • Nessie: Provides version control for data, useful for tracking data lineage and historical states.
  • MinIO: Acts as an S3-compatible object storage, holding data buckets that Dremio will use as data sources.
  • Dremio: The query engine that enables SQL-based interactions with our data stored in MinIO and Nessie.
  • Superset: A BI tool for creating and visualizing dashboards based on data queried through Dremio.
  • Datanotebook: A Jupyter Notebook environment for interactive data science work, with a local notebooks folder mounted for persistent storage.

Step 3: Running the Environment

With the docker-compose.yml file ready, let’s start the environment.

  1. Open a terminal and navigate to the folder where you saved docker-compose.yml.
  2. Run the following command to start all services in detached mode:
    docker-compose up -d
  3. Wait a few moments for the services to initialize. Verify they are running with:
    docker ps
  4. You should see containers for Nessie, MinIO, Dremio, Datanotebook, and Superset.
  5. Run the following command to initialize Superset before using it:
    docker exec -it superset superset init

Step 4: Verifying the Services

After starting the containers, check that each service is accessible in your browser:

  • Dremio: http://localhost:9047 (create your admin account on first visit)
  • MinIO console: http://localhost:9001 (log in with admin / password)
  • Superset: http://localhost:8088
  • Jupyter Notebook: http://localhost:8888
  • Nessie API: http://localhost:19120

Step 5: Adding Nessie and MinIO as Data Sources in Dremio

Now, let’s configure Dremio to use Nessie as a catalog and MinIO as an S3-compatible data source.

Connecting Nessie as a Catalog in Dremio

  1. In Dremio, go to Add Source.
  2. Choose Nessie from the source types and enter the following configuration:
    • General Settings:
      • Name: lakehouse
      • Endpoint URL: http://nessie:19120/api/v2
      • Authentication: None
    • Storage Settings:
      • Access Key: admin
      • Secret Key: password
      • Root Path: lakehouse
    • Connection Properties:
      • fs.s3a.path.style.access: true
      • fs.s3a.endpoint: minio:9000
      • dremio.s3.compat: true
  3. Save the source. Dremio will now connect to Nessie, and lakehouse should appear in the Datasets section.
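
To confirm the catalog is wired up, you can run a quick smoke test from Dremio’s SQL Runner. This is a minimal sketch; the table name is hypothetical, and the table is dropped at the end:

-- Create a small Iceberg table in the Nessie-backed lakehouse catalog
CREATE TABLE lakehouse.hackathon_smoke_test (id INT, note VARCHAR);

-- Insert a row and read it back to verify writes and reads work
INSERT INTO lakehouse.hackathon_smoke_test VALUES (1, 'lakehouse source is working');
SELECT * FROM lakehouse.hackathon_smoke_test;

-- Clean up the test table
DROP TABLE lakehouse.hackathon_smoke_test;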

Connecting MinIO as an S3-Compatible Source in Dremio

  1. Again, click Add Source in Dremio and select S3 as the source type.
  2. Configure MinIO with these settings:
    • General Settings:
      • Name: lake
      • Credentials: AWS Access Key
      • Access Key: admin
      • Secret Key: password
      • Encrypt Connection: unchecked
    • Advanced Options:
      • Enable Compatibility Mode: true
      • Root Path: /lake
    • Connection Properties:
      • fs.s3a.path.style.access: true
      • fs.s3a.endpoint: minio:9000
  3. Save the source. The lake source will appear in the Datasets section of Dremio.

Step 6: Setting Up Superset for BI Visualizations

Superset allows us to create dashboards based on data queried from Dremio.

If you haven’t already done so in Step 3, initialize Superset by running this command in a new terminal:
docker exec -it superset superset init

  1. Open Superset at http://localhost:8088, log in, and navigate to Settings > Database Connections.
  2. Add a new database:
    • Select Other as the type.
  3. Enter the connection string (replace USERNAME and PASSWORD with your Dremio credentials):
    dremio+flight://USERNAME:PASSWORD@dremio:32010/?UseEncryption=false
  4. Click Test Connection to verify connectivity, then Save.
  5. To add datasets, select the + icon, choose your desired table (e.g., one of the football tables or views you create in Dremio), and add it to your workspace.
  6. Now, create charts and add them to a dashboard.
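
If you would rather not register a physical file directly, Superset’s SQL Lab also lets you save a query as a virtual dataset and build charts on top of it. A minimal sketch, assuming the football files have already been uploaded to the lake bucket (covered in the next section):

-- Save this query as a dataset in SQL Lab, then chart against it
SELECT *
FROM lake."games.parq"
LIMIT 1000;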

Step 7: Shutting Down the Environment

To stop the environment, run:

docker-compose down -v

This will remove all containers and volumes (including any data you uploaded to MinIO), giving you a clean slate for the next session.

Conclusion

You’ve now set up a powerful local data environment with Nessie for versioned data, MinIO for S3-compatible storage, Dremio for SQL querying, and Superset for BI visualization. This setup enables you to perform SQL-based data operations, track data history, and create visual insights from your data lakehouse environment, all running locally on Docker Compose. Happy data engineering!

Loading Your Data

To incorporate the football dataset from Kaggle into your Dremio and MinIO environment, you’ll start by downloading the data, then upload it to the MinIO service. Once in MinIO, it will be accessible in Dremio as a data source for analysis and querying. Here’s how to do it step-by-step.

Step 1: Download the Dataset from Kaggle

  1. Visit the Dataset on Kaggle:
  2. Download the Dataset:
    • If you have a Kaggle account, log in and click on the Download button on the dataset page.
    • The dataset will download as a compressed file (usually a .zip file) containing multiple .parq files (Parquet format), ideal for analysis and compatible with data lake storage.
  3. Extract the Files:
    • Once the download is complete, unzip the file. You’ll find files like:
      • games.parq
      • players.parq
      • plays.parq
      • tackles.parq
      • tracking_all_weeks.parq

Step 2: Prepare the MinIO Environment

Next, we’ll upload this data to the MinIO instance, simulating an S3 bucket storage.

  1. Access MinIO:
    • In your browser, navigate to the MinIO console at http://localhost:9001.
    • Log in using the credentials specified in the Docker Compose setup:
      • Username: admin
      • Password: password
  2. Locate or Create the lake Bucket:
    • In the MinIO console, you should see a bucket called lake, which was created automatically by our Docker Compose configuration.
    • If you do not see the lake bucket, click + to create a new bucket and name it lake.

Step 3: Upload the Dataset Files to MinIO

  1. Upload Files:
    • In the MinIO console, navigate to the lake bucket.
    • Click the Upload button and select the Parquet files you extracted from the Kaggle download (e.g., games.parq, players.parq, plays.parq, tackles.parq, tracking_all_weeks.parq).
    • MinIO will store these files in the lake bucket, making them available as raw data for querying in Dremio.
  2. Verify Uploads:
    • After uploading, you should see each of the dataset files listed in the lake bucket within MinIO.

Step 4: Access the Data in Dremio

Now that the data is stored in MinIO, you can connect to it in Dremio.

  1. Open Dremio:
  2. Access the MinIO Data Source:
    • In Dremio, navigate to the Datasets section.
    • You should already see the lake data source (added in our Docker Compose setup) listed under Datasets.
    • You can then click “Format Dataset” to register each file as a dataset you can query.
  3. Browse the Dataset:
    • Expand the lake source, and you should see the files uploaded to MinIO (games.parq, players.parq, plays.parq, tackles.parq, tracking_all_weeks.parq).
    • Click on any file to view its schema and preview the data.
  4. Run SQL Queries:

Use Dremio’s SQL Runner to query the files. For example:
SELECT * FROM lake."players.parq" LIMIT 10;

  • This will allow you to explore, filter, and analyze the dataset directly from the lake bucket.
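
When you move on to modeling (the “Transform Your Data with Dremio Views” step mentioned earlier), one common pattern is to promote the raw files into Iceberg tables in the Nessie-backed lakehouse catalog and layer views on top for Superset or your notebook. A hedged sketch for the SQL Runner; the table and view names are hypothetical, and depending on your Dremio version you may prefer to save views in a Dremio space rather than in the Nessie source:

-- Promote a raw Parquet file into an Iceberg table in the Nessie catalog
CREATE TABLE lakehouse.games AS
SELECT * FROM lake."games.parq";

-- Layer a view on top for downstream tools such as Superset
CREATE VIEW lakehouse.games_curated AS
SELECT * FROM lakehouse.games;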
