23 minute read · May 1, 2019
Visualizing Azure Data Lake with Apache Superset and Dremio
· Dremio Team
Intro
Azure Data Lake is a scalable data storage and analytics service. It reduces efforts for everyone to store data of any format and type and do processing and analytics across platforms and languages. Azure Data Lake provides a variety of functions and solutions for data management and governance. To make your application more powerful, you can integrate ADLS with operational stores and data warehouses.
Superset is a modern BI web application project that is in the incubating stages at The Apache Software Foundation. It is an open source project that provides users with an intuitive, visual and interactive data exploration platform. Some of the key features that Superset offer are:
- Over 30 types of visualizations
- Druid.io integration
- Easy to use constructor for visualizations
- Easy to share and collaborate on dashboards
- Enterprise-ready authentication
- A simple semantic layer that allows users to decide which fields they want to use in their visualizations
- And much more.
In this tutorial we will explore how can we visualize your data lake using Superset and Dremio and Azure Data Lake Store. We will connect Dremio to ADLS and then after preparing and curating the data, we will connect Superset to Dremio and generate a dashboard. Without further ado, let’s dive right in!
Assumptions
We assume that you have Dremio and ODBC driver installed; if not, go to Dremio’s deploy page, pick the installations for your operating system, and visit Dremio Docs to read about installations and deployments. You should have an Azure Account and storage account.
Also, for this tutorial we expect that you have Superset installed and configured, if you haven’t done so, take a look at the Superset installation documentation, there you will find detailed information about pre-requisites, and installation steps for different operating systems.
- Be familiar with the fundamentals of Dremio (we will be using Dremio 3.2)
- Have the latest version of python and pip installed
Installing Superset
For this tutorial, I will be installing Supserset on Mac OSX Mojave. If you need details about how to install it on a different OS, take a look at their documentation page, they have very well detailed information about how to do so.
First, we need to update pip
pip install --upgrade setuptools pip
Then follow these steps to install Superset
# Install superset pip install superset # Create an admin user (you can hit enter in each entry to leave values as default, however, you will need to set an admin password) fabmanager create-admin --app superset # Initialize the database superset db upgrade # Load sample data superset load_examples # Create default roles and permissions superset init # To start a development web server on port 8088, use -p to bind to another port superset runserver -d
Once all the steps are completed, open a new tab in your browser and go to http://localhost:8088/ here you will be presented with the login screen
If you left the default values when you were installing Superset, just type admin and the password that you chose in the respective fields.
Before we continue on Superset, let’s switch gears for a moment and get our data ready in Dremio.
Importing data into Azure Data Lake Store
In this tutorial we will analyze this App reviews dataset from Kaggle. This dataset contains house sale prices for King County, namely homes sold between May 2014 and May 2015.
To import data into Azure Azure Data Lake Storage, go to Azure All services page, open Storage and choose Data Lake Storage Gen1. Then, add a container and upload data as shown on the screenshots below.
Fill in the necessary information
Then, click “create” and after a few minutes your container will be created. You will see a notification when the process has been successfully created. If you don’t see it immediately, simply refresh the page and you will see the item listed on the main screen
To upload the data, select your storage item and then click on “data explorer”
Then from the toolbar on the top of the screen, select ‘Upload’
Then we will select the file that I want to upload
Connecting to Dremio
Once the data has been uploaded, log in to Dremio and select add a new source, and then select “Azure Data Lake Store” from the list of available data sources
Before connecting we need to do some preparations:
- Go to the Azure Portal and select Azure Active Directory from the left navigation bar. Then select App Registrations.
- Click on Endpoints
- Copy the OAUTH 2.0 TOKEN ENDPOINT to the Dremio source form.
- Go back to the Azure portal and click New registration.
- Select a Name
- Click on the registered app and copy Application ID to Dremio source form.
- While in the registered app, select Keys under API Access. Enter a Description and select an expiration. Click Save and copy Value to Dremio source form.
- Also, in Data Explorer -> Access provide your app access to read, write and execute data.
Finally, your ADLS form in Dremio should look like this
If everything went well, you should see the file on your main screen
To see the data, simply click on the listed file and click on “Extract field names” and then “save”
Data curation in Dremio
Now, we can begin data preprocessing. First, I need to change the data type for a few of the columns that. Notice that Rating and Reviews are not integer. To change the type, all I have to do is select the ‘ABC’ icon next to the column name and then select the data type I want to use, in this case ‘Float’
I also will apply the same procedure to the “reviews” field and the ‘Last Updated’ field.
There are many more operations that you can perform on your dataset using Dremio. We have a great tutorial dedicated to data curation where you can take a look at all the tips and tricks related to data curation in Dremio.
Now, let’s save the dataset and since we don’t have to make any more changes to it, let’s start gaining insights from it!
Creating and Connecting a Database in Superset
At this point we are ready to start working with our ADLS data from Superset. Navigate to the Superset instance that you have open in a different tab. On the main screen, select “Sources” and then from the drop-down menu select “Databases”
Then select the plus (+) sign on the main screen
Now type in the following parameters in the indicated fields
Database: Dremio
SQLAlchemy URI: dremio+pyodbc://[Dremio username]:[Dremio password]@[Dremio host]:31010/dremio
Expose in SQL Lab: check
Allow Run Sync: check
NOTE: Replace the parameters in the SQLAlchemy URI to match the information from your environment. In my case the URI looks like the following:
dremio+pyodbc://lucio:pa$$w0rd]@localhost:31010/dremio
Now, click on “Test Connection”, if everything went well, you should see the following pop-up
After all the parameters have been typed in, scroll down and then click ‘Save’
If everything went as expected, you will see your newly created source in the “List Databases” menu
Adding a new table
At this point we have successfully added a new data source to Supserset, which in this case is Dremio. Now we need to add the table that we want to work with. To do this, first select “Tables” from the “Sources menu”
From the new screen, click on the plus sign and then from the database menu, select “Dremio”. Then I’ll go ahead and fill in the rest of the parameters based on the table that I want to use. In this case, is the employee VDS that we created at the beginning of the tutorial.
Once you click “Save”, Superset will present you with the list of available datasets
Previewing your data
Before we start visualizing data, I’ll run a query to double check the data that I connected to. To do this, I’ll go to the SQL Lab tab in Superset
Then I’ll select the database and schema that I want to query and hit “Run query”
Exploring your data
Now we can start exploring the data, to do so, head back to the ‘Sources’ drop-down menu and select ‘Tables’. Then, click on the name of the dataset that we added.
Superset will present you with a table visualization type by default
Now let’s walk through the steps of exploring the data that we are working with. First, I’ll change the default time range filter to ‘No Filter’
Then, since I want to know the distribution of apps by category that I have in the dataset, I’ll group by “category” and then select “count(* )” as the metric to use
After the changes are done, hit the “Run query” button and then you will be presented with the results based on the parameters selected.
Before I continue creating other visualizations, I’ll go ahead and save this table as “Apps by Category”. To name a visualization, simply click on the “- untitled” field and add a name, then hit enter. The visualization will be saved automatically.
Now, I’m going to create a different visualization. I want to see what is the average size of each app for each one of the categories that we have in our dataset. To do so, I’ll select the “Bar chart” chart from the visualization options.
Then, I will change the metric configuration to represent the “size_MB” column with the AVG aggregate
After running the query, the new chart looks like the following
Now, I would like to create a filter that I can use later in a dashboard. To do so, change the visualization type to “Filter Box” and will leave the rest of the parameters with their default value.
Then I’ll save this element as ‘Filter’
Creating a Dashboard in Superset
Creating dashboards in Superset is very simple and intuitive. The first thing that we need to do is select “Dashboards” in the main menu and then click on the plus sign to create a new one
I’ll name it “App Reviews ADLS”, assign an owner and then save it.
Once you save it, Superset will present you with the list of dashboards available. I’ll click on the name link which will take me to the dashboard editor.
Here, select “Edit Dashboard”
Then select “Your charts & filters”
In this screen you will be able to see all the elements available to build the dashboard, in this case, I have three different elements: 2 charts and 1 filter.
To add elements to the dashboard, open the visualization from the charts menu, then click “Save” and select “Add to existing dashboard”
Then select “save and go to dashboard”
The visualization will be automatically added to the dashboard, here we can continue adding more elements. I’ll click on “Edit dashboard” and finish customizing the dashboard with the rest of the elements that I have available.
Once I’m done, I’ll save the changes and I can start slicing the data using the filer that I created. For example, I can change the values of the filter so the dashboard can show me the data about apps that were updated between February and March of 2018
Then I can start answering questions such as which was the highest ranking app between February and March 2018.
Conclusion
In this tutorial, we focused our attention on working with a modern BI tool such as Superset and Dremio. We showed you the steps to connect, curate and prepare data from disparate data sources using Dremio. Moreover, we walked through the steps of connecting Supserset to Dremio and then creating several visualizations and a filter that we finally used to create a dashboard.
This is just the tip of the iceberg of the number of great things that you can do with Dremio. Whether you are a data architect or a data consumer, to learn more about Dremio, and experience the benefits of using Dremio Enterprise Edition, please visit Dremio University, a free online Dremio learning platform where you can become a Dremio expert in no time.