24 minute read · December 11, 2024
Football Playoffs Hackathon powered by Dremio
Senior Tech Evangelist, Dremio
The Output
Welcome to the 2024 Football Playoffs Hackathon powered by Dremio. Teams from across the globe will apply their analytics prowess to predict:
- American Champion
- National Champion
- Overall League Winner
Each team must analyze the provided stats and support their selections with detailed insights.
Judging criteria will include the accuracy of predictions, the quality of analysis, the clarity of visual presentation, and the depth of insights shared.
Time for kickoff!
Why Compete
- All Qualified Submissions: Meet the requirements, and all team members will receive a special edition Gnarly football t-shirt, water bottle and Hackathon digital badge.
- Top Submission: The winning team will be invited to present their solution in person at Subsurface 2025 (date TBD) in NYC.
Note: Teams shouldn’t be larger than 5 people. You can have more than 5 team members if you wish, but qualified team submissions will only receive Dremio swag for up to 5 people. Current Dremio employees are ineligible to compete in this Hackathon.
Disclaimer: Due to applicable trade control law and the policies of our shipping partners, prizes cannot be shipped to the following countries: Cuba, Iran, North Korea, Syria, Ukraine, Russia, and Belarus. This list is subject to change without notice.
Introduction
This guide will walk you through setting up a powerful local data environment so you can focus on what matters—gaining insights and building visualizations, applications, or even AI/ML models with real football data.
Using Docker Compose, we’ll set up a local environment to run Dremio for querying, MinIO for data storage, Apache Superset for BI and a Jupyter Notebook environment for interactive data science work. Here’s everything you need to get started.
How to Participate
Once you've set up your local environment and loaded the football data, here's how to proceed:
- Dive into the Dataset
- Analyze the main dataset provided, exploring key insights and patterns.
- Feel free to supplement this data by integrating additional sources to enrich your analysis.
- Transform Your Data with Dremio Views
- Use Dremio’s semantic layer to create custom views, transforming the data to suit your specific project goals (one scripted approach is sketched just after this list).
- Build Your Final Project
- Design a compelling visualization, generate an insightful report, or develop an application that highlights your findings.
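To make the view-transformation step concrete, here is a minimal sketch of one way to script it once the environment described in the setup steps below is running. It submits a CREATE VIEW statement to Dremio's REST API from Python; the credentials, the players_view name, and the use of the requests package are assumptions, and you can just as easily run the same SQL in Dremio's SQL Runner.

import requests

DREMIO_URL = "http://localhost:9047"
DREMIO_USER = "your_admin_user"          # placeholder: your Dremio admin username
DREMIO_PASSWORD = "your_admin_password"  # placeholder: your Dremio admin password

# Log in to obtain an auth token for the REST API
login = requests.post(
    f"{DREMIO_URL}/apiv2/login",
    json={"userName": DREMIO_USER, "password": DREMIO_PASSWORD},
)
login.raise_for_status()
headers = {"Authorization": f"_dremio{login.json()['token']}"}

# Submit a SQL job that creates a view in the Nessie-backed "lakehouse" catalog
# over a raw Parquet file in the "lake" MinIO source (both configured below)
sql = 'CREATE OR REPLACE VIEW lakehouse.players_view AS SELECT * FROM lake."players.parq"'
job = requests.post(f"{DREMIO_URL}/api/v3/sql", headers=headers, json={"sql": sql})
job.raise_for_status()
print("Submitted Dremio job:", job.json()["id"])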
Link to Dremio University Hackathon Course where you'll submit your entries.
Final Submission Steps:
Once you're happy with your project, create a short video presentation:
- Record a 3-5 minute video:
- Spend 1-2 minutes walking through your data modeling and transformations in Dremio.
- Use 1-2 minutes to showcase your final product.
- Dedicate 1-2 minutes to share your experience and insights from the project.
- Upload the Video:
- Post the video to YouTube as “Unlisted” (or public, if you prefer) and submit the link via the provided form.
- Submit the form.
Setting Up Your Environment
Step 1: Understanding Docker and Docker Compose
Docker is a platform for developing, shipping, and running containerized applications. Containers bundle software with its dependencies, ensuring consistent behavior across environments.
Docker Compose is a tool that allows you to define and run multi-container Docker applications using a single docker-compose.yml file. In this file, you define all services, their configurations, and how they interact.
Step 2: Creating the Docker Compose File
Let’s create the docker-compose.yml file that defines our services.
- Open a text editor (VS Code, Notepad, etc.).
- Create a new file named docker-compose.yml.
Copy and paste the following configuration into it (superset and datanotebook are optional if you prefer other tools):
version: "3"

services:
  # Nessie Catalog Server Using In-Memory Store
  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    networks:
      - iceberg
    ports:
      - 19120:19120

  # MinIO Storage Server
  ## Creates two buckets named lakehouse and lake
  minio:
    image: minio/minio:latest
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    networks:
      - iceberg
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
    entrypoint: >
      /bin/sh -c "
      minio server /data --console-address ':9001' &
      sleep 5 &&
      mc alias set myminio http://localhost:9000 admin password &&
      mc mb myminio/lakehouse &&
      mc mb myminio/lake &&
      tail -f /dev/null
      "

  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    container_name: dremio
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      - iceberg
    ports:
      - 9047:9047
      - 31010:31010
      - 32010:32010

  # Superset
  superset:
    image: alexmerced/dremio-superset
    container_name: superset
    networks:
      - iceberg
    ports:
      - 8088:8088

  # Data Science Notebook (Jupyter Notebook)
  datanotebook:
    image: alexmerced/datanotebook
    container_name: datanotebook
    environment:
      - JUPYTER_TOKEN= # Set a token if desired, or leave blank to disable token authentication
    networks:
      - iceberg
    ports:
      - 8888:8888
    volumes:
      - ./notebooks:/home/pydata/work # Mounts a local folder for persistent notebook storage

networks:
  iceberg:
Explanation of Services:
- Nessie: Provides version control for data, useful for tracking data lineage and historical states.
- MinIO: Acts as an S3-compatible object storage, holding data buckets that Dremio will use as data sources.
- Dremio: The query engine that enables SQL-based interactions with our data stored in MinIO and Nessie.
- Superset: A BI tool for creating and visualizing dashboards based on data queried through Dremio.
- Datanotebook: A Jupyter Notebook environment for interactive data science work in Python against the same data.
Step 3: Running the Environment
With the docker-compose.yml file ready, let’s start the environment.
- Open a terminal and navigate to the folder where you saved docker-compose.yml.
- Run the following command to start all services in detached mode:
docker-compose up -d
- Wait a few moments for the services to initialize. Verify they are running with:
docker ps
- You should see containers for Nessie, MinIO, Dremio, Datanotebook and Superset.
- Run the following command to initialize superset before using it:
docker exec -it superset superset init
Step 4: Verifying the Services
After starting the containers, check each service to ensure it’s accessible (or run the reachability sketch after this list).
- Dremio: Go to http://localhost:9047 and log in or set up a new admin account.
- MinIO: Access MinIO at http://localhost:9001, logging in with admin as the username and password as the password.
- Superset: Visit Superset at http://localhost:8088 and log in using admin for both username and password.
- DataNotebook: Visit http://localhost:8888 to open the Jupyter Notebook server.
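If you would rather script this check than open each page by hand, here is a minimal sketch that pings each UI from your host machine. The ports come from the Docker Compose file above; the requests package is assumed to be installed locally.

import requests

# Web UIs exposed by the Docker Compose file above (host-side ports)
services = {
    "Dremio": "http://localhost:9047",
    "MinIO console": "http://localhost:9001",
    "Superset": "http://localhost:8088",
    "DataNotebook": "http://localhost:8888",
}

for name, url in services.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: reachable (HTTP {status})")
    except requests.RequestException as exc:
        print(f"{name}: not reachable ({exc})")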
Step 5: Adding Nessie and MinIO as Data Sources in Dremio
Now, let’s configure Dremio to use Nessie as a catalog and MinIO as an S3-compatible data source.
Connecting Nessie as a Catalog in Dremio
- In Dremio, go to Add Source.
- Choose Nessie from the source types and enter the following configuration:
- General Settings:
- Name: lakehouse
- Endpoint URL: http://nessie:19120/api/v2
- Authentication: None
- Storage Settings:
- Access Key: admin
- Secret Key: password
- Root Path: lakehouse
- Connection Properties:
- fs.s3a.path.style.access: true
- fs.s3a.endpoint: minio:9000
- dremio.s3.compat: true
- Save the source. Dremio will now connect to Nessie, and lakehouse should appear in the Datasets section.
Connecting MinIO as an S3-Compatible Source in Dremio
- Again, click Add Source in Dremio and select S3 as the source type.
- Configure MinIO with these settings:
- General Settings:
- Name: lake
- Credentials: AWS Access Key
- Access Key: admin
- Secret Key: password
- Encrypt Connection: unchecked
- Advanced Options:
- Enable Compatibility Mode: true
- Root Path: /lake
- Connection Properties:
- fs.s3a.path.style.access: true
- fs.s3a.endpoint: minio:9000
- Save the source. The lake source will appear in the Datasets section of Dremio.
Step 6: Setting Up Superset for BI Visualizations
Superset allows us to create dashboards based on data queried from Dremio.
- If you haven’t already, initialize Superset by running this command in a new terminal:
docker exec -it superset superset init
- Open Superset at http://localhost:8088, log in, and navigate to Settings > Database Connections.
- Add a new database:
- Select Other as the type.
- Enter the connection string (replace USERNAME and PASSWORD with your Dremio credentials):
dremio+flight://USERNAME:PASSWORD@dremio:32010/?UseEncryption=false
- Click Test Connection to verify connectivity, then Save.
- To add datasets, select the + icon, choose your desired table (e.g., a view you created in Dremio), and add it to your workspace.
- Now, create charts and add them to a dashboard.
Step 7: Shutting Down the Environment
To stop the environment, run:
docker-compose down -v
This will remove all containers and volumes, giving you a clean slate for the next session.
Conclusion
You’ve now set up a powerful local data environment with Nessie for versioned data, MinIO for S3-compatible storage, Dremio for SQL querying, and Superset for BI visualization. This setup enables you to perform SQL-based data operations, track data history, and create visual insights from your data lakehouse environment, all running locally on Docker Compose. Happy data engineering!
Loading Your Data
To incorporate the football dataset from Kaggle into your Dremio and MinIO environment, you’ll start by downloading the data, then upload it to the MinIO service. Once in MinIO, it will be accessible in Dremio as a data source for analysis and querying. Here’s how to do it step-by-step.
Step 1: Download the Dataset from Kaggle
- Visit the Dataset on Kaggle:
- Go to the Stats for BDB 2024 and Fantasy Football dataset on Kaggle.
- This dataset includes player stats, game details, and play-by-play data aggregated from several sources.
- Download the Dataset:
- If you have a Kaggle account, log in and click on the Download button on the dataset page.
- The dataset will download as a compressed file (usually a .zip file) containing multiple .parq files (Parquet format), ideal for analysis and compatible with data lake storage.
- Extract the Files:
- Once the download is complete, unzip the file. You’ll find files like:
- games.parq
- players.parq
- plays.parq
- tackles.parq
- tracking_all_weeks.parq
Step 2: Prepare the MinIO Environment
Next, we’ll upload this data to the MinIO instance, which acts as our S3-compatible bucket storage.
- Access MinIO:
- In your browser, navigate to the MinIO console at http://localhost:9001.
- Log in using the credentials specified in the Docker Compose setup:
- Username: admin
- Password: password
- Locate or Create the lake Bucket:
- In the MinIO console, you should see a bucket called lake, which was created automatically by our Docker Compose configuration.
- If you do not see the lake bucket, click + to create a new bucket and name it lake.
Step 3: Upload the Dataset Files to MinIO
- Upload Files:
- In the MinIO console, navigate to the lake bucket.
- Click the Upload button and select the Parquet files you extracted from the Kaggle download (e.g., games.parq, players.parq, plays.parq, tackles.parq, tracking_all_weeks.parq).
- MinIO will store these files in the lake bucket, making them available as raw data for querying in Dremio. (If you prefer to script this step, see the sketch after this list.)
- Verify Uploads:
- After uploading, you should see each of the dataset files listed in the lake bucket within MinIO.
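As an alternative to the console upload, here is a minimal sketch that pushes the same files with the MinIO Python client. It assumes the minio package is installed (pip install minio) and that the extracted Parquet files sit in your current working directory.

from minio import Minio

client = Minio(
    "localhost:9000",   # MinIO API port from the Docker Compose file
    access_key="admin",
    secret_key="password",
    secure=False,       # the local setup uses plain HTTP
)

files = [
    "games.parq",
    "players.parq",
    "plays.parq",
    "tackles.parq",
    "tracking_all_weeks.parq",
]

for name in files:
    client.fput_object("lake", name, name)  # bucket, object name, local file path
    print(f"Uploaded {name} to the lake bucket")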
Step 4: Access the Data in Dremio
Now that the data is stored in MinIO, you can connect to it in Dremio.
- Open Dremio:
- Go to http://localhost:9047 and log into Dremio with your credentials.
- Access the MinIO Data Source:
- In Dremio, navigate to the Datasets section.
- You should already see the lake data source (added in our Docker Compose setup) listed under Datasets.
- You can then click “Format Dataset” on each file to register it as a dataset you can query.
- Browse the Dataset:
- Expand the lake source, and you should see the files uploaded to MinIO (games.parq, players.parq, plays.parq, tackles.parq, tracking_all_weeks.parq).
- Click on any file to view its schema and preview the data.
- Run SQL Queries:
- Use Dremio’s SQL Runner to query the files. For example:
SELECT * FROM lake."players.parq" LIMIT 10;
- This will allow you to explore, filter, and analyze the dataset directly from the lake bucket. You can also run the same queries from the bundled Jupyter notebook; see the sketch below.
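If you prefer to explore the data from the Jupyter notebook started by the Docker Compose file, here is a minimal sketch that pulls the same query result into a pandas DataFrame over Dremio's Arrow Flight port (32010). It assumes pyarrow is available in the notebook image (install it with pip if not) and that you substitute your own Dremio USERNAME and PASSWORD; from inside the container Dremio is reachable by its service name, dremio, while from your host machine you would use localhost instead.

from pyarrow import flight

# Connect to Dremio's Arrow Flight endpoint (service name "dremio" on the shared Docker network)
client = flight.FlightClient("grpc+tcp://dremio:32010")

# Authenticate with your Dremio credentials (placeholders below)
token = client.authenticate_basic_token("USERNAME", "PASSWORD")
options = flight.FlightCallOptions(headers=[token])

# Run the same query shown above and fetch the results
query = 'SELECT * FROM lake."players.parq" LIMIT 10'
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)
reader = client.do_get(info.endpoints[0].ticket, options)

df = reader.read_pandas()  # pandas DataFrame, ready for analysis or charting
print(df.head())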