28 minute read · October 22, 2019
Machine Learning Models on S3 and Redshift with Python
· Dremio Team
Introduction
Proper resource management is an important requirement for businesses large and small. Classical solutions to such tasks rely on various optimization and control methods, but over the last few years, approaches based on mathematical tools, statistics, and probability theory have emerged. They solve optimization problems by detecting dependencies in the data. These methods and solutions are known as machine learning algorithms.
In this article, we will focus on creating machine learning models for predicting the usage of bike-sharing resources.
The data processing will follow the scheme below:
The research algorithms and data operations are described in the following steps:
1. Data sources:
- Downloading the dataset in CSV format
- Uploading the complete dataset to AWS S3 storage.
2.1. Databases:
- Creating a data table using AWS Redshift
- Loading the research data into the AWS Redshift data table.
2.2. Data preprocessing:
- Importing data into the Dremio platform
- Data curation.
3. ML algorithms:
- Exporting data to the Python environment (Jupyter Notebook)
- ML model creation and configuration
- Data prediction.
4. Results analysis.
Assumptions
We assume that you have the following items already installed and set up:
- Amazon account
- Dremio
- ODBC driver
- Jupyter Notebook
- Python packages: Pandas, NumPy, datetime, pyodbc, scikit-learn.
Data overview and AWS S3 operations
The Bike Sharing dataset presents the hourly and daily count of rental bikes in the Capital Bikeshare system, together with the corresponding weather and seasonal information.
We will use Jupyter Notebook and Pandas DataFrame to analyze the data structure.
import os
import pandas as pd

# path to the data file
csv_day = os.path.join(os.path.abspath(os.curdir), 'day.csv')

# load the CSV into a DataFrame
df_day = pd.read_csv(csv_day)

# data info
print("info: ", df_day.info())
df_day.head(5)
There are two data categories: Bike Sharing data (date, time, bike-sharing quantities) and weather data (temperature, humidity, wind speed). A more detailed description of these parameters is available in the dataset description.
For this article, we will model one of the best practices for data analysis:
- Split the data into categories.
- Process the Bike Sharing Service rates (metrics) as follows:
  - AWS S3 – data storage
  - AWS Redshift – the main database
  - Dremio – data operations and curation
  - Jupyter Notebook – ML algorithms.
- Process the weather data as follows:
  - Store the data in AWS S3.
  - Attach it to the main data processing flow during data curation in Dremio, as a data augmentation step.
Data splitting and uploading algorithms are shown below.
# splitting the data
df_day[['instant', 'dteday', 'season', 'yr', 'mnth', 'holiday', 'weekday',
        'workingday', 'casual', 'registered', 'cnt']].to_csv(path_or_buf='bike_sharing.csv', index=False)
df_day[['dteday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed']].to_csv(path_or_buf='weather.csv', index=False)

# uploading algorithm
import boto3

# AWS credentials
aws_access_key_id = "KEY_ID"
aws_secret_access_key = "ACCESS_KEY"

# S3 client
s3_client = boto3.client('s3',
                         aws_access_key_id=aws_access_key_id,
                         aws_secret_access_key=aws_secret_access_key)

# create a bucket and upload the files
s3_client.create_bucket(Bucket='bike-sharing')
s3_client.upload_file('bike_sharing.csv', 'bike-sharing', 'bike_sharing.csv')
s3_client.upload_file('weather.csv', 'bike-sharing', 'weather.csv')
After uploading the data, you should see the following:
AWS Redshift database
AWS Redshift is an efficient, flexible, and popular cloud data warehouse, so we are going to use it as the main repository for our data.
The first thing to do is to create a Redshift cluster for the database.
After that, you will have to define the cluster and database parameters:
- node type and quantity – define the database performance
- cluster identifier, database name – database identification parameters
- username and password – credentials for database access.
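As an aside, if you prefer to script this step rather than use the console, the same parameters can be supplied through boto3. This is only a sketch with hypothetical identifiers and credentials, not the console flow described here:

import boto3

# Redshift client (hypothetical region and credentials)
redshift_client = boto3.client('redshift',
                               region_name='us-east-1',
                               aws_access_key_id='KEY_ID',
                               aws_secret_access_key='ACCESS_KEY')

# create a cluster with the same kinds of parameters as in the console
redshift_client.create_cluster(
    ClusterIdentifier='bike-sharing-cluster',  # cluster identifier
    DBName='bikesharing',                      # database name
    NodeType='dc2.large',                      # node type
    ClusterType='multi-node',
    NumberOfNodes=2,                           # node quantity
    MasterUsername='awsuser',                  # database username
    MasterUserPassword='Password123'           # database password
)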
After you have entered valid parameters and clicked the Launch cluster button, you will see the following results in a few minutes.
Once we have obtained the JDBC endpoint, we can proceed with data operations in AWS Redshift.
We’ll use the recommended SQL client, SQL Workbench/J, for database operations. Installation and setup procedures for this software can be found in its manuals. Before performing any data operations, we need to configure the SQL client connection to the AWS Redshift database:
- Define a connection name.
- Select the driver for the connection - Amazon Redshift JDBC driver.
- Enter the database username and password and select the “Autocommit” option.
Next, we need to create an empty data table where we will copy the Bike Sharing Service rates data. Let’s do this with an SQL query:
CREATE TABLE sharing_data (
  instant INTEGER NOT NULL,
  dteday DATE NOT NULL,
  season INTEGER NOT NULL,
  yr INTEGER NOT NULL,
  mnth INTEGER NOT NULL,
  holiday INTEGER NOT NULL,
  weekday INTEGER NOT NULL,
  workingday INTEGER NOT NULL,
  casual INTEGER NOT NULL,
  registered INTEGER NOT NULL,
  cnt INTEGER NOT NULL
);
We will copy the data into the empty AWS Redshift table with the recommended approach, using the COPY command. As a result, we will receive the following data:
Note that the credential parameters are available in the IAM Management console.
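For reference, here is a minimal sketch of the same COPY step run from Python with the psycopg2 driver instead of SQL Workbench/J; the endpoint, database name, and credentials below are placeholders:

import psycopg2

# connect to the Redshift cluster (placeholder endpoint and credentials)
conn = psycopg2.connect(
    host='examplecluster.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com',
    port=5439,
    dbname='bikesharing',
    user='awsuser',
    password='Password123'
)

# COPY the CSV file from S3 into the sharing_data table
copy_sql = """
    COPY sharing_data
    FROM 's3://bike-sharing/bike_sharing.csv'
    CREDENTIALS 'aws_access_key_id=KEY_ID;aws_secret_access_key=ACCESS_KEY'
    CSV
    IGNOREHEADER 1
    DATEFORMAT 'YYYY-MM-DD';
"""
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
conn.close()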
Dremio operations
Dremio is a flexible platform designed for various data operations with the connection options for different data sources. We will use this platform for data conversion, combination and aggregation, and data preparation for the ML algorithm.
Connecting data sources
The first thing we need to do is connect the data sources to the Dremio platform. Let’s connect the bike-sharing service’s main database hosted on AWS Redshift. To do so:
- Add a new data source in the Sources menu.
- Select the data source type, AWS Redshift in our case.
- Click the General tab and specify the main database parameters.
- Copy the JDBC connection string from AWS Redshift console.
- Specify the credential details of your database.
To augment the basic data, we will add the weather data stored in AWS S3.
The details for connecting AWS S3 simple storage to Dremio were described in our previous tutorials. Below, we will only show the basic configuration and note that the main credential parameters are the Access key ID and Secret access key, which can be found in AWS IAM.
As a result, you’ll get two new data sources.
Data curation
Let’s define the weather data format through the Dremio interface. Select the Text data format, set the field delimiter, and enable the option to extract field names.
After performing these steps, we get a data table whose columns are all in text format (the default). Therefore, we will specify the data type for each column.
As a result, we get the following data table for the weather data:
The same operations are not necessary for the main data, since we defined the formats while uploading it into the AWS Redshift database. Once we’ve defined the data formats, we need to save the results as virtual datasets in a Dremio space. Two steps should be done here:
- creating a new Dremio space
- saving each dataset with its own name into the created space.
Next, let’s join the data into one virtual dataset and add new features using Dremio tools. Open the data table of the main dataset, click the Join button, and select the weather table. Define the columns by which the tables will be merged (in this case, the date columns). Click the Preview button to check the merge, then press the Apply button.
After that, one of the duplicate date columns can be dropped.
Next, we will add data to this table by combining the available data using logical and mathematical operators. Click the Add Field button and add a formula for calculating new data based on the available data.
In this example, we have added only one new combination of temperature and bike count. Note, however, that the total number of added parameters and combinations can be quite large. This machine learning procedure is called feature engineering and often strongly affects the resulting accuracy. When all data operations are completed, the result should be saved into a new virtual dataset.
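For reference, the join and the derived feature could be expressed in pandas roughly as follows (a sketch only; the derived column name temp_cnt is hypothetical, and in this article these operations are performed in Dremio):

import pandas as pd

# load the two datasets from the CSV splits created earlier
sharing = pd.read_csv('bike_sharing.csv')
weather = pd.read_csv('weather.csv')

# join on the date column, mirroring the Dremio Join step
merged = sharing.merge(weather, on='dteday')

# add a derived feature combining temperature and bike count,
# mirroring the Dremio Add Field step (hypothetical formula)
merged['temp_cnt'] = merged['temp'] * merged['cnt']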
Machine learning models and algorithms
Machine learning models are used to solve classification, regression, or clustering tasks. We need to predict bike-sharing counts from the available data, which is a regression task.
Usually, the machine learning pipeline includes the following operations:
- Data acquisition
- Feature extraction (feature engineering)
- Data analysis
- Data preparation
- ML model operations: selection, configuration, training, validation, optimization
- Result analysis.
In our case, we only need to complete the last four steps, since we did the rest in Dremio.
Connecting the Jupyter notebook to Dremio
We’ll use Jupyter Notebook as a popular and convenient Python environment for machine learning modeling.
The first thing we are going to do is connect the data to our environment through Dremio’s ODBC interface: we will use the pyodbc library and run a SQL query against the bike_sharing_data virtual dataset.
import pyodbc
import pandas as pd

# credentials and connection string
uid = 'user'
pwd = 'password'
dremio_query = f'DSN=Dremio Connector;UID={uid};PWD={pwd}'

# connect to Dremio through the ODBC driver
dremio_connector = pyodbc.connect(dremio_query, autocommit=True)

# SQL query to Dremio
sql_query = r' SELECT * FROM "bike_sharing"."bike_sharing_data" '

# load data from Dremio with the SQL query
df = pd.read_sql(sql_query, dremio_connector)
Data analysis
Let’s visualize the dataset features to explore the data patterns.
We’ll build histograms to examine the statistical distributions of the dataset features.
import matplotlib.pyplot as plt

# histograms of all features
df.hist(figsize=(12, 10))
plt.show()
As we can see, the data has the following characteristics: normal (or close to normal) distributions and categorical features. Note that features such as ‘instant’ and ‘dteday’ have high correlation values, so we won’t use them in the ML models.
# drop unneeded features
df.drop(['dteday', 'instant'], axis=1, inplace=True)
Let’s plot the main dependence of interest: the number of rentals versus temperature:
import seaborn as sns

# joint distribution of temperature and rental count
sns.jointplot(x='temp', y='cnt', data=df)
We can conclude that the dependence between these parameters is linear and can be described by a simple regression. To confirm, let’s plot the same dependence, but broken down by the categorical features ‘workingday’ and ‘yr’:
sns.lmplot(x='temp', y='cnt', row='workingday', col='yr', data=df)
We can draw the same conclusion as for the previous plots – the dependence between the features can be described by a regression model. Next, let’s build a diagram that shows the correlation between all the dataset features:
# correlation matrix of all features
sns.heatmap(df.corr(), annot=True, linewidths=2, cmap=plt.cm.Blues)
We can see that some features are strongly correlated, so we won’t include them in our ML models.
# drop highly correlated features
df.drop(['registered', 'casual', 'atemp'], axis=1, inplace=True)
Data preparation
We’ll split our data into two parts:
- train – for training the regression model
- test – for testing the model’s accuracy.
Then we’ll separate the main dataset features from the target feature (‘cnt’ – the number of rentals).
from sklearn.model_selection import train_test_split

# dataset features and target feature
X = df.drop('cnt', axis=1)
y = df['cnt']

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.3)
Linear Regression
Let’s build a linear regression model with the default parameters. To simplify the process, let’s create a function for regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics

def regressor(model, name):
    '''model: regressor model
       name: string model name'''
    # fit the model
    model.fit(X_train, y_train)
    # prediction
    predictions = model.predict(X_test)
    # plot prediction vs. test values
    plt.scatter(y_test, predictions)
    plt.title(name)
    plt.xlabel("Test")
    plt.ylabel("Prediction")
    plt.grid(True)
    plt.show()
    # accuracy metrics
    print('Accuracy metrics:')
    print('MAE: ', metrics.mean_absolute_error(y_test, predictions))
    print('MSE: ', metrics.mean_squared_error(y_test, predictions))
    print('Root MSE: ', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
    print('Score: ', model.score(X_test, y_test))

# Linear Regression model
regressor(LinearRegression(), 'Linear Regression Model')
As we can see, the linear regression model gives a fairly high coefficient of determination (0.79) – a measure of the model’s relevance. Let’s try a more complex model – polynomial regression.
Polynomial Regression
This model can describe more complex patterns by representing the features with polynomial terms.
from sklearn.preprocessing import PolynomialFeatures

# polynomial features of degree 2
poly_feat = PolynomialFeatures(2)
X_train = poly_feat.fit_transform(X_train)
X_test = poly_feat.transform(X_test)

regressor(LinearRegression(), 'Polynomial Regression Model')
The polynomial regression model predicts with higher accuracy (an improvement of over 10%), as shown by the error and score metrics. Next, we are going to build even more complex and popular models based on decision trees.
Random Forest and Gradient Boosting
Let’s build models using Random Forest and Gradient Boosting regressors.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor

regressor(GradientBoostingRegressor(), 'Gradient Boosting Model')
regressor(RandomForestRegressor(), 'Random Forest Model')
The results have improved by 4-6%. For further improvement, let’s try one of the most popular hyperparameter optimization methods, GridSearchCV, on the Random Forest model.
from sklearn.model_selection import GridSearchCV

# model and parameter grid
model = RandomForestRegressor()
parameters = [{'n_estimators': [250, 255, 260, 265],
               'criterion': ['mse', 'mae'],
               'max_features': ['auto', 'sqrt', 'log2']}]

# fit the grid search
grid_search = GridSearchCV(estimator=model, param_grid=parameters)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_
print('best parameters:', best_parameters)
print('best accuracy:', best_accuracy)
With the optimal hyperparameters for the Random Forest model, we receive the following result:
We received more accurate results, but at the same time, we can conclude that a simpler model is more suitable for our data. To gain better results, let’s analyze the built model by exploring the feature importances of our data.
# get column titles except the target column
features = df.columns.tolist()
features.remove('cnt')

# refit a Random Forest with the best parameters on the original (non-polynomial)
# features so that the importances can be matched to the feature names
best_model = RandomForestRegressor(**grid_search.best_params_)
best_model.fit(X, y)

# evaluate feature importances
feature_importance = best_model.feature_importances_
feature_importance = 100 * (feature_importance / feature_importance.max())
index = np.argsort(feature_importance)
feature_index = np.arange(index.shape[0])

# plot the feature importances of the Random Forest
plt.barh(feature_index, feature_importance[index], align="center")
plt.yticks(feature_index, np.array(features)[index])
plt.title('Features Importance')
plt.xlabel('Importance in %')
plt.ylabel('Features')
plt.grid(True)
plt.show()
Most of the categorical features contribute to the models’ predictions. So let’s define possible ways to improve our models (a brief sketch follows the list):
- use One Hot encoding for the categorical data
- use cross-validation procedures.
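As an illustration of both ideas, here is a minimal sketch using pandas and scikit-learn; the list of categorical columns is an assumption chosen for the example and is not part of the original notebook:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# hypothetical choice of categorical columns to one-hot encode
categorical_cols = ['season', 'mnth', 'weekday', 'weathersit']

# One Hot encoding of the categorical features
X_encoded = pd.get_dummies(df.drop('cnt', axis=1), columns=categorical_cols)
y = df['cnt']

# 5-fold cross-validation of a Random Forest on the encoded data
scores = cross_val_score(RandomForestRegressor(), X_encoded, y, cv=5, scoring='r2')
print('R^2 per fold:', scores)
print('mean R^2:', np.mean(scores))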
Conclusion
This article has shown the Dremio platform’s functionality for creating ML models and algorithms with data stored in AWS services.
We split and stored the data in AWS services and then connected these services to Dremio. We performed data operations in Dremio and connected Dremio to a Jupyter notebook. Finally, we created ML models for predicting bike-sharing service usage.