Introduction
Amazon Redshift is a powerful cloud data warehouse service. It provides a simple means of analyzing data across your data warehouse and data lake, and it is cost-effective. Redshift delivers up to ten times faster performance than other data warehouses through techniques such as machine learning, massively parallel query execution, and columnar storage on high-performance disks.
In this tutorial, we will explain how to connect Amazon Redshift, Dremio, and Python. We will also demonstrate options for curating data in Dremio and ways of visualizing data in a Jupyter Notebook.
Assumptions
We assume that you have Dremio and the Dremio ODBC driver installed; if not, go to Dremio’s deployments page, pick the installation for your operating system, and visit the Dremio docs to read about installation and deployment. We will also be working with Amazon Redshift and Python, so you need to have them installed and configured as well.
Connecting Amazon Redshift to Dremio
We will work with the dataset based on tracking sales activity for the fictional TICKIT website, where users buy and sell tickets online for different events. Here we can get information about selling time, rates for sellers, customers, their tastes, etc.
First of all, you have to load the data into Amazon Redshift. To do so:
- Launch your AWS Redshift cluster.
- Create a new database with empty tables in this cluster.
- Load sample data from Amazon S3 by using the COPY command table by table.
- Perform the queries you need and review the results in the AWS console.
You can read more here.
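The COPY step above can be sketched in Python. This is only a sketch: the S3 path follows the pattern of AWS’s public TICKIT sample data, and the IAM role ARN is a placeholder — substitute your own bucket, role, and table names. Executing the statement against your cluster (e.g. via a `pyodbc` or `psycopg2` connection) is left out here.

```python
# Build a Redshift COPY statement for a pipe-delimited file in S3.
# The S3 path and IAM role below are placeholders for illustration.
def build_copy_statement(table, s3_path, iam_role, delimiter="|"):
    """Return a COPY statement loading a delimited S3 file into `table`."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"DELIMITER '{delimiter}'"
    )

sql = build_copy_statement(
    "users",
    "s3://awssampledbuswest2/tickit/allusers_pipe.txt",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
)
print(sql)
```

You would run one such statement per table (`users`, `sales`, `event`, and so on), then verify the row counts with a quick `SELECT COUNT(*)` on each table.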
To connect to Dremio, follow these steps.
- Go to Sources -> Add Source -> Amazon Redshift.
- Fill in the form as we do.
Copy the JDBC connection string from the AWS console:
Enter 0 in the “Record fetch size” field and click Save.
Now, you are connected and can see your data in Dremio.
Data Curation with Dremio
Our data consists of several tables, but we only need part of the information. That is why we will take two tables and join them.
For that, choose your first table and click “Join”. Then select the join condition. Remember that you can preview the result first and only then apply the changes.
In addition, we want to remove null values from our data. This is very easy to do in Dremio: click “Exclude” on the desired field and choose the values to remove.
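The same join-and-filter curation can be sketched in pandas for comparison. The toy frames below stand in for the TICKIT `sales` and `users` tables; the column names follow the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the sales and users tables (TICKIT column names).
sales = pd.DataFrame({
    "salesid": [1, 2, 3],
    "buyerid": [10, 11, 12],
    "pricepaid": [120.0, None, 75.0],
})
users = pd.DataFrame({
    "userid": [10, 11, 12],
    "likesports": [True, True, False],
})

# Inner join on buyer/user id, keep sports fans, drop rows with null
# prices -- the same shape of curation as the Dremio steps above.
curated = (
    sales.merge(users, left_on="buyerid", right_on="userid", how="inner")
         .query("likesports")
         .dropna(subset=["pricepaid"])
)
print(curated)
```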
You can try to perform other preparations using this tutorial.
Here is the final SQL query that Dremio generates automatically:
```sql
SELECT
  nested_0.salesid AS salesid,
  nested_0.listid AS listid,
  nested_0.sellerid AS sellerid,
  nested_0.eventid AS eventid,
  nested_0.dateid AS dateid,
  nested_0.qtysold AS qtysold,
  nested_0.pricepaid AS pricepaid,
  nested_0.commission AS commission,
  nested_0.saletime AS saletime,
  nested_0.buyerid AS buyerid,
  join_users.userid AS userid,
  join_users.username AS username,
  join_users.firstname AS firstname,
  join_users.lastname AS lastname,
  join_users.city AS city,
  join_users.state AS state,
  join_users.email AS email,
  join_users.phone AS phone,
  join_users.likesports AS likesports,
  join_users.liketheatre AS liketheatre,
  join_users.likeconcerts AS likeconcerts,
  join_users.likejazz AS likejazz,
  join_users.likeclassical AS likeclassical,
  join_users.likeopera AS likeopera,
  join_users.likerock AS likerock,
  join_users.likevegas AS likevegas,
  join_users.likebroadway AS likebroadway,
  join_users.likemusicals AS likemusicals
FROM (
  SELECT salesid, listid, sellerid, buyerid, eventid, dateid,
         qtysold, pricepaid, commission, saletime
  FROM "Amazon Redshift".dev.public.sales
) nested_0
INNER JOIN "Amazon Redshift".dev.public.users AS join_users
  ON nested_0.buyerid = join_users.userid
WHERE join_users.likesports
```
Now, you can save the dataset to the appropriate space.
Connecting Dremio to Python
When the data curation is done, we can connect Dremio to Python for further analysis and visualization.
First, we need to make sure that the proper libraries are imported:
```python
import pyodbc
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```
Then, initialize variables:
```python
host = "localhost"
port = 31010
uid = "your user"
pwd = "your password"
driver = "Dremio Connector"            # for Windows
# driver = "file path to ODBC driver"  # for Linux
```
You can copy the SQL query from the Dremio SQL Editor:
```python
cnxn = pyodbc.connect(
    "Driver={};ConnectionType=Direct;HOST={};PORT={};"
    "AuthenticationType=Plain;UID={};PWD={}".format(driver, host, port, uid, pwd),
    autocommit=True,
)
sql = "SELECT * FROM aws.data"
dataframe = pd.read_sql(sql, cnxn)
```
Let’s look at our data:
```python
dataframe.head()
```
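Before plotting, it is worth running a few quick sanity checks on the frame. The toy frame below stands in for the curated dataset (the column names match, the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the curated dataset, with the same numeric columns.
df = pd.DataFrame({
    "pricepaid": [120.0, 75.0, 200.0],
    "commission": [18.0, 11.25, 30.0],
    "likesports": [True, False, True],
})

print(df.shape)         # (rows, columns)
print(df.isna().sum())  # remaining nulls per column
print(df.describe())    # summary statistics for numeric columns
```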
Building plots
After importing the data, we can build plots. Building plots is an important part of data analysis, as it allows us to check dependencies between variables. Let’s review several options.
The first one is a scatter plot. It’s a good way to estimate the data distribution.
```python
# Create scatter plots
g = sns.FacetGrid(dataframe, col="liketheatre", row="likesports", margin_titles=True)
g.map(plt.scatter, "pricepaid", "commission")

# Show the plot
plt.show()
```
Here we can see that the commission depends on the ticket price paid, because the data distribution is close to a line. Therefore, it makes sense to try linear regression in further analysis. Moreover, we can see that theater taste doesn’t affect the data.
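The linear-regression idea mentioned above can be sketched with a simple least-squares fit via `numpy.polyfit`. The synthetic numbers below stand in for `dataframe["pricepaid"]` and `dataframe["commission"]`; in TICKIT the commission is a flat percentage of the price, so an exact line is used here for illustration.

```python
import numpy as np

# Synthetic stand-ins for the pricepaid and commission columns.
pricepaid = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
commission = 0.15 * pricepaid  # commission as a flat 15% of the price

# Fit commission ~ pricepaid with a degree-1 least-squares polynomial.
slope, intercept = np.polyfit(pricepaid, commission, 1)
print(slope, intercept)
```

On the real data the points will scatter around the line rather than sit exactly on it, and the fitted slope estimates the effective commission rate.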
A swarm plot does practically the same, but it also shows the distribution of the data by using different colors.
```python
sns.swarmplot(x="pricepaid", y="commission", data=dataframe)

# Show the plot
plt.show()
```
The last one is a bar plot. In this example, we try to estimate the effect of taste on the ticket price paid. In the case of opera, we can see that people who don’t like opera spend more money.
```python
sns.barplot(x="likeopera", y="pricepaid", data=dataframe)
plt.show()
```
Conclusions
Dremio is a good means of connecting Amazon Redshift and Python, and it makes data curation much easier. Combining Amazon Redshift and Python lets you use a data warehouse and data analysis tools together. In this tutorial, we demonstrated how to connect Amazon Redshift to Dremio and Dremio to Python, and performed a simple analysis using plots.
We hope that you found this tutorial useful. Stay tuned to learn more about how Dremio can help you get more value from your data, faster.