28 minute read · October 22, 2019
Machine Learning Models on S3 and Redshift with Python
· Dremio Team
Introduction
Proper resource management is an important requirement for businesses large and small. Classical solutions to such tasks rely on various optimization and control methods, but over the last few years, approaches based on mathematical tools, statistics, and probability theory have emerged. They solve optimization problems by detecting dependencies in the data. These methods and solutions are known as machine learning algorithms.
In this article, we will focus on creating machine learning models for predicting the usage of bike-sharing resources.
The data processing will follow the scheme below:
The research algorithms and data operations are described in the following steps:
1. Data sources:
- Downloading the dataset in CSV format
- Uploading the complete dataset to AWS S3 storage.
2.1. Databases:
- Creating a data table using AWS Redshift
- Loading the research data into the AWS Redshift data table.
2.2. Data preprocessing:
- Importing data into the Dremio platform
- Data curation.
3. ML algorithms:
- Exporting data to the Python environment (Jupyter Notebook)
- ML model creation and configuration
- Data prediction.
4. Results analysis.
Assumptions
We assume that you have the following items already installed and set up:
- Amazon account
- Dremio
- ODBC driver
- Jupyter Notebook
- Python packages: Pandas, NumPy, datetime, pyodbc, scikit-learn.
Data overview and AWS S3 operations
The Bike Sharing dataset presents the hourly and daily count of rental bikes in the Capital Bikeshare system, together with the corresponding weather and seasonal information.
We will use Jupyter Notebook and Pandas DataFrame to analyze the data structure.
import os
import pandas as pd

# path to the data file
csv_day = os.path.join(os.path.abspath(os.curdir), 'day.csv')

# load the CSV into a DataFrame
df_day = pd.read_csv(csv_day)

# data info
print("info: ", df_day.info())
df_day.head(5)
There are two data categories: Bike Sharing data (date, time, bike-sharing quantities) and weather data (temperature, humidity, wind speed). A more detailed description of these parameters is available in the dataset description.
For this article, we will model one of the best practices for data analysis:
- Split the data into categories.
- Process the Bike Sharing Service rates (metrics) as follows:
  - AWS S3 – data storage
  - AWS Redshift – the main database
  - Dremio – data operations and curation
  - Jupyter Notebook – ML algorithms.
- Process the weather data as follows:
  - Store the data in AWS S3.
  - Attach it to the main data processing flow during data curation in Dremio, as a data augmentation step.
Data splitting and uploading algorithms are shown below.
# splitting the data
df_day[['instant', 'dteday', 'season', 'yr', 'mnth', 'holiday', 'weekday',
        'workingday', 'casual', 'registered', 'cnt']].to_csv(path_or_buf='bike_sharing.csv', index=False)
df_day[['dteday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed']].to_csv(path_or_buf='weather.csv', index=False)

# uploading algorithm
import boto3

# AWS credentials
aws_access_key_id = "KEY_ID"
aws_secret_access_key = "ACCESS_KEY"

# S3 client
s3_client = boto3.client('s3',
                         aws_access_key_id=aws_access_key_id,
                         aws_secret_access_key=aws_secret_access_key)

# create a bucket and upload the files
s3_client.create_bucket(Bucket='bike-sharing')
s3_client.upload_file('bike_sharing.csv', 'bike-sharing', 'bike_sharing.csv')
s3_client.upload_file('weather.csv', 'bike-sharing', 'weather.csv')
After uploading the data, you should see the following:
AWS Redshift database
AWS Redshift is an efficient, flexible, and popular cloud data warehouse, so we are going to use it as the main repository for our data.
The first thing to do is to create a Redshift cluster for the database.
After that, you will have to define the cluster and database parameters:
- node type and quantity – define the database performance
- cluster identifier, database name – database identification parameters
- username and password – credentials for database access.
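As an aside, if you prefer to script this step rather than use the console, the same parameters can be supplied through boto3. This is only a sketch with hypothetical identifiers and credentials, not the console flow described here:

import boto3

# Redshift client (hypothetical region and credentials)
redshift_client = boto3.client('redshift',
                               region_name='us-east-1',
                               aws_access_key_id='KEY_ID',
                               aws_secret_access_key='ACCESS_KEY')

# create a cluster with the same kinds of parameters as in the console
redshift_client.create_cluster(
    ClusterIdentifier='bike-sharing-cluster',  # cluster identifier
    DBName='bikesharing',                      # database name
    NodeType='dc2.large',                      # node type
    ClusterType='multi-node',
    NumberOfNodes=2,                           # node quantity
    MasterUsername='awsuser',                  # database username
    MasterUserPassword='Password123'           # database password
)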
After you have entered valid parameters and clicked the Launch cluster button, you will see the following results in a few minutes.
Once we have obtained the JDBC endpoint, we can proceed with data operations in AWS Redshift.
We’ll use the recommended SQL client, SQL Workbench/J, for database operations. Installation and setup procedures for this software can be found in its manuals. Before performing any data operations, we need to configure the SQL client connection to the AWS Redshift database:
- Define a connection name.
- Select the driver for the connection - Amazon Redshift JDBC driver.
- Enter the database username and password and select the “Autocommit” option.
Next, we need to create an empty data table where we will copy the Bike Sharing Service rates data. Let’s do this with an SQL query:
CREATE TABLE sharing_data (
  instant INTEGER NOT NULL,
  dteday DATE NOT NULL,
  season INTEGER NOT NULL,
  yr INTEGER NOT NULL,
  mnth INTEGER NOT NULL,
  holiday INTEGER NOT NULL,
  weekday INTEGER NOT NULL,
  workingday INTEGER NOT NULL,
  casual INTEGER NOT NULL,
  registered INTEGER NOT NULL,
  cnt INTEGER NOT NULL
);
We will copy the data into the empty AWS Redshift table with the recommended approach, using the COPY command. As a result, we will receive the following data:
Note that the credential parameters are available in the IAM Management console.
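For reference, here is a minimal sketch of the same COPY step run from Python with the psycopg2 driver instead of SQL Workbench/J; the endpoint, database name, and credentials below are placeholders:

import psycopg2

# connect to the Redshift cluster (placeholder endpoint and credentials)
conn = psycopg2.connect(
    host='examplecluster.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com',
    port=5439,
    dbname='bikesharing',
    user='awsuser',
    password='Password123'
)

# COPY the CSV file from S3 into the sharing_data table
copy_sql = """
    COPY sharing_data
    FROM 's3://bike-sharing/bike_sharing.csv'
    CREDENTIALS 'aws_access_key_id=KEY_ID;aws_secret_access_key=ACCESS_KEY'
    CSV
    IGNOREHEADER 1
    DATEFORMAT 'YYYY-MM-DD';
"""
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
conn.close()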
Dremio operations
Dremio is a flexible platform designed for various data operations with the connection options for different data sources. We will use this platform for data conversion, combination and aggregation, and data preparation for the ML algorithm.
Connecting data sources
The first thing we need to do is connect the data sources to the Dremio platform. Let’s connect the bike-sharing service’s main database hosted on AWS Redshift. To do so:
- Add a new data source in the Sources menu.
- Select the data source type, AWS Redshift in our case.
- Click the General tab and specify the main database parameters.
- Copy the JDBC connection string from AWS Redshift console.
- Specify the credential details of your database.
To augment the basic data, we will add the weather data stored in AWS S3.
The details for connecting AWS S3 simple storage to Dremio were described in our previous tutorials. Below, we will only show the basic configuration and note that the main credential parameters are the Access key ID and Secret access key, which can be found in AWS IAM.
As a result, you’ll get two new data sources.
Data curation
Let’s define the weather data format through the Dremio interface. Select the Text data format, set the field delimiter, and enable the option to extract field names.
After performing these steps, we get a data table whose columns are all in text format (the default). Therefore, we will specify the data type for each column.
As a result, we get the following data table for the weather data:
The same operations are not necessary for the main data, since we defined the formats while uploading it into the AWS Redshift database. Once we’ve defined the data formats, we need to save the results as virtual datasets in a Dremio space. Two steps should be done here:
- creating a new Dremio space
- saving each dataset with its own name into the created space.
Next, let’s join the data into one virtual dataset and add new features using Dremio tools. Open the data table of the main dataset, click the Join button, and select the weather table. Define the columns by which the tables will be merged (in this case, the date columns). Click the Preview button to check the merge, then press the Apply button.
After that, one of the duplicate date columns can be dropped.
Next, we will add data to this table by combining the available data using logical and mathematical operators. Click the Add Field button and add a formula for calculating new data based on the available data.
In this example, we have added only one new combination of temperature and bike count. Note, however, that the total number of added parameters and combinations can be quite large. This machine learning procedure is called feature engineering and often strongly affects the resulting accuracy. When all data operations are completed, the result should be saved into a new virtual dataset.
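For reference, the join and the derived feature could be expressed in pandas roughly as follows (a sketch only; the derived column name temp_cnt is hypothetical, and in this article these operations are performed in Dremio):

import pandas as pd

# load the two datasets from the CSV splits created earlier
sharing = pd.read_csv('bike_sharing.csv')
weather = pd.read_csv('weather.csv')

# join on the date column, mirroring the Dremio Join step
merged = sharing.merge(weather, on='dteday')

# add a derived feature combining temperature and bike count,
# mirroring the Dremio Add Field step (hypothetical formula)
merged['temp_cnt'] = merged['temp'] * merged['cnt']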
Machine learning models and algorithms
Machine learning models are used to solve classification, regression, or clustering tasks. We need to predict bike-sharing counts from the available data, which is a regression task.
Usually, the machine learning pipeline includes the following operations:
- Data acquisition
- Feature extraction (feature engineering)
- Data analysis
- Data preparation
- ML model operations: selection, configuration, training, validation, optimization
- Result analysis.
In our case, we only need to complete the last four steps, since we did the rest in Dremio.
Connecting the Jupyter notebook to Dremio
We’ll use Jupyter Notebook as a popular and convenient Python environment for machine learning modeling.
The first thing we are going to do is connect the data to our environment through Dremio’s ODBC interface: we will use the pyodbc library and run a SQL query against the bike_sharing_data virtual dataset.
import pyodbc
import pandas as pd

# credentials and connection string
uid = 'user'
pwd = 'password'
dremio_query = f'DSN=Dremio Connector;UID={uid};PWD={pwd}'

# connect to Dremio through the ODBC driver
dremio_connector = pyodbc.connect(dremio_query, autocommit=True)

# SQL query to Dremio
sql_query = r' SELECT * FROM "bike_sharing"."bike_sharing_data" '

# load data from Dremio with the SQL query
df = pd.read_sql(sql_query, dremio_connector)
Data analysis
Let’s visualize the dataset features to explore the data patterns.
We’ll build histograms to examine the statistical distributions of the dataset features.
import matplotlib.pyplot as plt

# histograms of all features
df.hist(figsize=(12, 10))
plt.show()
As we can see, the data has the following characteristics: normal (or close to normal) distributions and categorical features. Note that features such as ‘instant’ and ‘dteday’ have high correlation values, so we won’t use them in the ML models.
# drop unneeded features
df.drop(['dteday', 'instant'], axis=1, inplace=True)
Let’s plot the main dependence of interest: the number of rentals versus temperature:
import seaborn as sns

# joint distribution of temperature and rental count
sns.jointplot(x='temp', y='cnt', data=df)
We can conclude that the dependence between these parameters is linear and can be described by a simple regression. To confirm, let’s plot the same dependence, but broken down by the categorical features ‘workingday’ and ‘yr’:
sns.lmplot(x='temp', y='cnt', row='workingday', col='yr', data=df)
We can draw the same conclusion as for the previous plots – the dependence between the features can be described by a regression model. Next, let’s build a diagram that shows the correlation between all the dataset features:
# correlation matrix of all features
sns.heatmap(df.corr(), annot=True, linewidths=2, cmap=plt.cm.Blues)
We can see that some features are strongly correlated, so we won’t include them in our ML models.
# drop highly correlated features
df.drop(['registered', 'casual', 'atemp'], axis=1, inplace=True)
Data preparation
We’ll split our data into two parts:
- train – for training the regression model
- test – for testing the model’s accuracy.
Then we’ll separate the main dataset features from the target feature (‘cnt’ – the number of rentals).
from sklearn.model_selection import train_test_split

# dataset features and target feature
X = df.drop('cnt', axis=1)
y = df['cnt']

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.3)
Linear Regression
Let’s build a linear regression model with the default parameters. To simplify the process, let’s create a function for regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics

def regressor(model, name):
    '''model: regressor model
       name: string model name'''
    # fit the model
    model.fit(X_train, y_train)
    # prediction
    predictions = model.predict(X_test)
    # plot prediction vs. test values
    plt.scatter(y_test, predictions)
    plt.title(name)
    plt.xlabel("Test")
    plt.ylabel("Prediction")
    plt.grid(True)
    plt.show()
    # accuracy metrics
    print('Accuracy metrics:')
    print('MAE: ', metrics.mean_absolute_error(y_test, predictions))
    print('MSE: ', metrics.mean_squared_error(y_test, predictions))
    print('Root MSE: ', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
    print('Score: ', model.score(X_test, y_test))

# Linear Regression model
regressor(LinearRegression(), 'Linear Regression Model')
As we can see, the linear regression model gives a fairly high coefficient of determination (0.79) – a measure of the model’s relevance. Let’s try a more complex model – polynomial regression.
Polynomial Regression
This model can describe more complex patterns by representing the features with polynomial terms.
from sklearn.preprocessing import PolynomialFeatures

# polynomial features of degree 2
poly_feat = PolynomialFeatures(2)
X_train = poly_feat.fit_transform(X_train)
X_test = poly_feat.transform(X_test)

regressor(LinearRegression(), 'Polynomial Regression Model')
The polynomial regression model predicts with higher accuracy (an improvement of over 10%), as shown by the error and score metrics. Next, we are going to build even more complex and popular models based on decision trees.
Random Forest and Gradient Boosting
Let’s build models using Random Forest and Gradient Boosting regressors.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor

regressor(GradientBoostingRegressor(), 'Gradient Boosting Model')
regressor(RandomForestRegressor(), 'Random Forest Model')
The results have improved by 4-6%. For further improvement, let’s try one of the most popular hyperparameter optimization methods, GridSearchCV, on the Random Forest model.
from sklearn.model_selection import GridSearchCV

# model and parameter grid
model = RandomForestRegressor()
parameters = [{'n_estimators': [250, 255, 260, 265],
               'criterion': ['mse', 'mae'],
               'max_features': ['auto', 'sqrt', 'log2']}]

# fit the grid search
grid_search = GridSearchCV(estimator=model, param_grid=parameters)
grid_search = grid_search.fit(X_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_
print('best parameters:', best_parameters)
print('best accuracy:', best_accuracy)
With the optimal hyperparameters for the Random Forest model, we receive the following result:
We received more accurate results, but at the same time, we can conclude that a simpler model is more suitable for our data. To gain better results, let’s analyze the built model by exploring the feature importances of our data.
# get column titles except the target column
features = df.columns.tolist()
features.remove('cnt')

# refit a Random Forest with the best parameters on the original (non-polynomial)
# features so that the importances can be matched to the feature names
best_model = RandomForestRegressor(**grid_search.best_params_)
best_model.fit(X, y)

# evaluate feature importances
feature_importance = best_model.feature_importances_
feature_importance = 100 * (feature_importance / feature_importance.max())
index = np.argsort(feature_importance)
feature_index = np.arange(index.shape[0])

# plot the feature importances of the Random Forest
plt.barh(feature_index, feature_importance[index], align="center")
plt.yticks(feature_index, np.array(features)[index])
plt.title('Features Importance')
plt.xlabel('Importance in %')
plt.ylabel('Features')
plt.grid(True)
plt.show()
Most of the categorical features contribute to the models’ predictions. So let’s define possible ways to improve our models (a brief sketch follows the list):
- use One Hot encoding for the categorical data
- use cross-validation procedures.
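As an illustration of both ideas, here is a minimal sketch using pandas and scikit-learn; the list of categorical columns is an assumption chosen for the example and is not part of the original notebook:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# hypothetical choice of categorical columns to one-hot encode
categorical_cols = ['season', 'mnth', 'weekday', 'weathersit']

# One Hot encoding of the categorical features
X_encoded = pd.get_dummies(df.drop('cnt', axis=1), columns=categorical_cols)
y = df['cnt']

# 5-fold cross-validation of a Random Forest on the encoded data
scores = cross_val_score(RandomForestRegressor(), X_encoded, y, cv=5, scoring='r2')
print('R^2 per fold:', scores)
print('mean R^2:', np.mean(scores))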
Conclusion
This article has shown the Dremio platform’s functionality for creating ML models and algorithms with data stored in AWS services.
We split and stored the data in AWS services and then connected these services to Dremio. We performed data operations in Dremio and connected Dremio to a Jupyter notebook. Finally, we created ML models for predicting bike-sharing service usage.