h4h4h4h4h4h4h4

335 minute read · October 3, 2024

Tutorial: Accelerating Queries with Dremio Reflections (Laptop Exercise)

Alex Merced

Alex Merced · Senior Tech Evangelist, Dremio

In this tutorial, you'll learn how to use Dremio's Reflections to accelerate query performance. We'll walk through the process of setting up a Dremio environment using Docker, connecting to sample datasets, running a complex query, and then using Reflections to significantly improve the query’s performance.

Further Reading on Reflections

Step 1: Spin Up a Dremio Environment

First, let's get Dremio running locally. Use the following command to spin up a Dremio instance in Docker:

docker run -p 9047:9047 -p 31010:31010 -p 45678:45678 -p 32010:32010 -e DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist --name try-dremio dremio/dremio-oss

This will expose Dremio’s UI on port 9047 and set up the necessary ports for other services like JDBC and ODBC connections.

Step 2: Connecting to Data Sources with Dremio

One of the key benefits of Dremio is its ability to connect to a wide variety of data sources, including databases, cloud storage, and file systems. Dremio can federate queries across these data sources, allowing you to perform analytics across multiple systems without moving the data.

For this tutorial, we'll use Dremio’s Sample Source to access a sample dataset.

  1. Open your browser and go to http://localhost:9047 to access the Dremio UI.
  2. Navigate to the Source section and add the Sample Source to your workspace. This will provide access to several preloaded datasets.

Step 3: Promoting the NYC Weather Dataset

Next, we'll use the NYC-weather.csv dataset from the sample source.

  1. Browse to the NYC-weather.csv file in the sample source.
  2. Click on the file and select Format to convert this file into a SQL-queryable dataset.

3. When formatting the dataset, make sure to select Extract Column Names so the first row of the file is used as the header.

    The schema of the dataset should look like this:

    • station: String
    • name: String
    • date: String
    • awnd: String
    • prcp: String
    • snow: String
    • snwd: String
    • tempmax: String
    • tempmin: String

    Step 4: Converting Data Types

    By default, all fields in this CSV dataset are imported as strings since CSVs are schema-less. You'll need to convert some of these fields to appropriate data types to make the dataset more useful for analytical queries. We could do this with SQL but Dremio provides UI helps so users of any technical level can curate the data they need:

    1. Run the query to see the data in the data set.
    2. Convert the date field to a date type.
    3. Convert awnd, prcp, snow, snwd, tempmax, and tempmin to float types.

    The resulting query with all the type conversions should look like this:

    SELECT station, name, TO_DATE("NYC-weather.csv"."date", 'YYYY-MM-DD', 1) AS "date", CONVERT_TO_FLOAT(awnd, 1, 1, 0) AS awnd, CONVERT_TO_FLOAT(prcp, 1, 1, 0) AS prcp, CONVERT_TO_FLOAT(snow, 1, 1, 0) AS snow, CONVERT_TO_FLOAT(snwd, 1, 1, 0) AS snwd, CONVERT_TO_FLOAT(tempmax, 1, 1, 0) AS tempmax, CONVERT_TO_FLOAT(tempmin, 1, 1, 0) AS tempmin
    FROM Samples."samples.dremio.com"."NYC-weather.csv" AS "NYC-weather.csv";

    After making these changes, save the dataset as a view called "weather" to your "home" which is called "@username".

    Now go back to the main data explorer and bring up this new view, Dremio allows you to create views from data across all the sources you connect.

    Step 5: Writing and Running a Complex Query

    Now that the data is ready, write a complex SQL aggregation query. Here’s an example that calculates average weather metrics grouped by month:

    SELECT 
      EXTRACT(MONTH FROM "date") AS "month",
      AVG(tempmax) AS avg_tempmax,
      AVG(tempmin) AS avg_tempmin,
      AVG(awnd) AS avg_wind_speed,
      SUM(prcp) AS total_precipitation,
      SUM(snow) AS total_snowfall
    FROM 
      "@alexmerced".weather
    GROUP BY 
      "month"
    ORDER BY 
      "month";

    Run this query, it should take a few seconds to compute. Using reflections we can make this query run in sub-seconds.

    Step 6: Creating an Aggregate Reflection

    To speed up this query in the future, you can create an Aggregate Reflection.

    1. In the Dremio UI, click on "edit" for the dataset in the details bar on the left


    2. navigate to the Reflections tab and select Toggle On Aggregate Reflections.


    3. Select the view you created earlier as the base dataset.
    4. For the Dimensions, select the date field.
    5. For the Measures, select the aggregated fields tempmax, tempmin, awnd, prcp, and snow.


    6. Save the reflection and allow it to build. You'll see the footprint when done building.

    Step 7: Running the Query Again

    Once the reflection is created, run the same complex SQL query again. You should notice a significant improvement in performance, as Dremio now uses the reflection to speed up the query instead of recomputing the results from scratch. On the jobs page, you'll see the lightning bolt symbol representing a query accelerated by reflections. To see an even more dramatic example try running some aggregate query on the NYC Taxi Dataset in the sample data, but keep in mind that dataset is quite large (over 300 million records) so it may take a few minutes to run non-accelerated aggregations directly from your laptop.

    To shut down the docker container

    docker stop try-dremio

    If you ever want to turn this container back on in the future

    docker start try-dremio

    Conclusion

    With Dremio’s Reflections, you can dramatically accelerate complex queries without the need for excessive partitioning or additional data engineering. By intelligently managing reflections, you can optimize query performance, reduce compute time, and ensure that your data is always ready for fast, real-time analysis.

    In this tutorial, we demonstrated how to set up Dremio, promote and format a dataset, create a complex query, and then use an Aggregate Reflection to optimize that query for better performance. With this approach, you can easily scale your data analytics workload while keeping query times low.

    Further Reading on Reflections

    More Hands-on Tutorials with Dremio

    Ready to Get Started?

    Enable the business to create and consume data products powered by Apache Iceberg, accelerating AI and analytics initiatives and dramatically reducing costs.