
3. Step 3 – Fundamentals of Machine Learning

Manohar Swamynathan, Bangalore, Karnataka, India

This chapter is focused on different algorithms of supervised and unsupervised machine learning using two key Python packages.

Scikit-learn: In 2007, David Cournapeau developed Scikit-learn as part of a Google Summer of Code project. INRIA got involved in 2010, and the first public beta (v0.1) was released. Currently there are more than 700 active contributors, and it has paid sponsorship from INRIA, the Python Software Foundation, Google, and Tinyclues. Many of the functions of Scikit-learn are built upon the SciPy (Scientific Python) library, and it provides a great breadth of efficiently implemented, essential supervised and unsupervised learning algorithms.

Note

Scikit-learn is also known as sklearn in short, so these two terms are used interchangeably throughout this book.

Statsmodels: This complements the SciPy package and is one of the best packages to run regression models, as it provides an extensive list of statistical results for each estimator of the model.

Machine Learning Perspective of Data

Data are the facts and figures (also referred to as raw data) that we have available with respect to the business context. Data are made up of two aspects:

a. Objects such as people, trees, animals, etc.

b. Attributes that were recorded for the objects, such as age, size, weight, cost, etc.

When we measure the attributes of an object, we obtain a value that varies between objects. For example, if we consider individual plants in a garden as objects, the attribute 'height' will vary between them. Correspondingly, different attributes vary between objects, so attributes are collectively known as variables.

The things we measure, control, or manipulate for objects are the variables, and they differ in how well they can be measured, that is, in how much measurable information their measurement scale can provide. The amount of information that can be provided by a variable is determined by its type of measurement scale.

At a high level there are two types of variables, based on the type of values they can take.

1. Continuous or quantitative: variables can take any positive or negative numerical value within a large range. Retail sales amounts and insurance claim amounts are examples of continuous variables that can take any number within large ranges. These types of variables are also generally known as numerical variables.

2. Discrete or qualitative: variables can take only particular values: retail store location area, state, and city are examples of discrete variables, as they can take only one particular value for a store (here the store is our object). These types of variables are also known as categorical variables.

Scales of Measurement

In general, variables can be measured on four different scales. Mean, median, and mode are ways to understand the central tendency, that is, the middle point of the data distribution. Standard deviation, variance, and range are the most commonly used dispersion measures to understand the spread of the data. A short sketch computing these follows.
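As a quick illustration, the minimal sketch below (assuming a small, hypothetical list of plant heights in centimeters) computes these summary statistics with Pandas:

import pandas as pd

# hypothetical plant heights in cm
heights = pd.Series([10, 12, 12, 15, 18, 21, 21, 21, 25])

# central tendency
print("Mean  : %.2f" % heights.mean())
print("Median: %.2f" % heights.median())
print("Mode  : %s" % heights.mode().tolist())

# dispersion
print("Std   : %.2f" % heights.std())
print("Var   : %.2f" % heights.var())
print("Range : %.2f" % (heights.max() - heights.min()))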

Nominal Scale of Measurement

Data are measured at the nominal level when each case is classified into one of a number of discrete categories. This is also called categorical data, that is, data used only for classification. As the mean is not meaningful here, all that we can do is count the number of occurrences of each type and compute the proportion (number of occurrences of each type / total occurrences). See Table 3-1 and the short sketch that follows it.

Table 3-1. Nominal scale examples

Variable Name                      Example Measurement Values
Color                              Red, Green, Yellow, etc.
Gender                             Female, Male
Football Players Jersey Number     1, 2, 3, 4, 5, etc.
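As a minimal sketch (using a hypothetical 'Color' column), counting occurrences and proportions for a nominal variable can be done with Pandas value_counts:

import pandas as pd

colors = pd.Series(['Red', 'Green', 'Red', 'Yellow', 'Red', 'Green'])
print(colors.value_counts())                # count of each category
print(colors.value_counts(normalize=True))  # proportion of each category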

Ordinal Scale of Measurement

Data are measured on an ordinal scale if the categories imply order. The differences between ranks are consistent in direction, but their magnitude is not meaningful. See Table 3-2.

Table 3-2. Ordinal scale examples

Variable Name            Example Measurement Values
Military rank            Second Lieutenant, First Lieutenant, Captain, Major, Lieutenant Colonel, Colonel, etc.
Clothing size            Small, Medium, Large, Extra Large, etc.
Class rank in an exam    1, 2, 3, 4, 5, etc.

Interval Scale of Measurement

If the differences between values have meanings, the data are measured at the interval scale . See Table 3-3.

Table 3-3. Interval scale examples

Variable Name    Example Measurement Values
Temperature      10, 20, 30, 40, etc.
IQ rating        85-114, 115-129, 130-144, 145-159, etc.

Ratio Scale of Measurement

Data measured on a ratio scale have differences that are meaningful, and relate to some true zero point. This is the most common scale of measurement. See Tables 3-4 and 3-5.

Table 3-4. Ratio scale examples

Variable Name    Example Measurement Values
Weight           10, 20, 30, 40, 50, 60, etc.
Height           5, 6, 7, 8, 9, etc.
Age              1, 2, 3, 4, 5, 6, 7, etc.

Table 3-5. Comparison of the different scales of measurement

Scale      Properties                                        Mathematical Operations                          Descriptive Statistics
Nominal    Identity                                          Count                                            Mode, Proportion
Ordinal    Identity, Magnitude                               Rank order                                       Mode, Median, Range statistics
Interval   Identity, Magnitude, Equal intervals              Addition, Subtraction                            Mode, Median, Range statistics, Variance, Standard deviation
Ratio      Identity, Magnitude, Equal intervals, True zero   Addition, Subtraction, Multiplication, Division  Mode, Median, Range statistics, Variance, Standard deviation

Feature Engineering

The output or the prediction quality of any machine learning algorithm depends predominantly on the quality of input being passed. The process of creating appropriate data features by applying business context is called feature engineering , and it is one of the most important aspects of building efficient machine learning systems. Business context here means the expression of the business problem that we are trying to address, why we are trying to solve it, and what is the expected outcome. So let’s understand the fundamentals of feature engineering before proceeding to different types of machine learning algorithms.

The logical flow of raw data to machine learning algorithm is represented in Figure 3-1. Data from different sources “as-is” is the raw data and when we apply business logic to process the raw data, the outcome is information (processed data). Further insight is derived from information. The process of converting raw data into information into insight with a business context to address a particular business problem is an important aspect of feature engineering. The output of feature engineering is a clean and meaningful set of features that can be consumed by algorithms to identify patterns and build a machine learning model, which can further be applied on unseen data to predict the possible outcome. In order to have an efficient machine learning system, often feature optimization is carried out to reduce the feature dimension and retain only the important/meaningful features that will reduce the computation time and improve prediction performance. Note that machine learning model building is an iterative process. Let’s look at some of the common practices that are part of feature engineering.

Figure 3-1. Logical flow of data in machine learning model building

Dealing with Missing Data

Missing data can mislead or create problems for analyzing the data. In order to avoid any such issues, you need to impute missing data. The four most commonly used techniques for data imputation are listed below, followed by a short sketch of the first two.

· Delete: You could simply delete the rows containing missing values. This technique is more suitable and effective when the number of rows with missing values is insignificant (say < 5%) compared to the overall record count. You can achieve this using Pandas' dropna() function.

· Replace with summary: This is probably the most commonly used imputation technique. Summarization here is the mean, mode, or median for the respective column. For continuous or quantitative variables, the mean/average, mode, or median value of the respective column can be used to replace the missing values, whereas for categorical or qualitative variables the mode (most frequent value) works better. You can achieve this using Pandas' fillna() function (please refer to the Chapter 2 Pandas section).

· Random replace: You can also replace the missing values with a value randomly picked from the respective column. This technique is appropriate when the number of rows with missing values is insignificant.

· Using a predictive model: This is an advanced technique. Here you can train a regression model for continuous variables and a classification model for categorical variables with the available data, and use the model to predict the missing values.
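A minimal sketch of the first three techniques, assuming a small hypothetical DataFrame with missing values (the predictive-model approach is not shown here):

import pandas as pd
import numpy as np

df = pd.DataFrame({'age':  [25, np.nan, 32, 41, np.nan],
                   'city': ['NY', 'SF', np.nan, 'NY', 'SF']})

# Delete: drop rows containing any missing value
df_dropped = df.dropna()

# Replace with summary: mean for continuous, mode for categorical
df_imputed = df.copy()
df_imputed['age'] = df_imputed['age'].fillna(df_imputed['age'].mean())
df_imputed['city'] = df_imputed['city'].fillna(df_imputed['city'].mode()[0])

# Random replace: fill with a randomly picked existing value
df_random = df.copy()
df_random['age'] = df_random['age'].fillna(np.random.choice(df_random['age'].dropna()))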

Handling Categorical Data

Most machine learning libraries are designed to work well with numerical variables, so categorical variables in their original form of text description can't be directly used for model building. Let's learn some of the common methods of handling categorical data based on their number of levels.

Create dummy variables: A dummy variable is a Boolean variable that indicates the presence of a category with the value 1 and its absence with 0. You should create k-1 dummy variables, where k is the number of levels. Scikit-learn provides the useful OneHotEncoder function to create dummy variables for a given categorical variable. See Listing 3-1.

import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({'A': ['high', 'medium', 'low'],
                   'B': [10, 20, 30]},
                  index=[0, 1, 2])
print df
#----output----
        A   B
0    high  10
1  medium  20
2     low  30

# using get_dummies function of pandas package
df_with_dummies = pd.get_dummies(df, prefix='A', columns=['A'])
print df_with_dummies
#----output----
    B  A_high  A_low  A_medium
0  10     1.0    0.0       0.0
1  20     0.0    0.0       1.0
2  30     0.0    1.0       0.0

Listing 3-1.

Creating dummy variables

Convert to number: Another simple method is to represent the text description of each level with a number by using the LabelEncoder function of Scikit-learn. If the number of levels is high (for example zip code, state, etc.), then you can apply business logic to combine levels into groups. For example, zip codes or states can be combined into regions; however, in this method there is a risk of losing critical information. Another method is to combine categories based on similar frequency (the new categories can be high, medium, low). See Listing 3-2.

import pandas as pd

# using pandas package's factorize function
df['A_pd_factorized'] = pd.factorize(df['A'])[0]

# Alternatively you can use sklearn package's LabelEncoder function
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['A_LabelEncoded'] = le.fit_transform(df.A)
print df
#----output----
        A   B  A_pd_factorized  A_LabelEncoded
0    high  10                0               0
1  medium  20                1               2
2     low  30                2               1

Listing 3-2.

Converting categorical variable to numerics

Normalizing Data

The units or scales of measurement for different variables vary, so an analysis of the raw measurements could be artificially skewed toward the variables with higher absolute values. Bringing all the different variable units into the same order of magnitude eliminates the potential for extreme measurements to misrepresent the findings and negatively affect the accuracy of the conclusion. Two broadly used methods for rescaling data are normalization and standardization.

Normalizing data can be achieved by min-max scaling; the formula is given below and will scale all numeric values into the range 0 to 1.

$$ X_{normalized} = \frac{X - X_{min}}{X_{max} - X_{min}} $$

Note

Ensure you remove extreme outliers before applying the above technique as it can skew the normal values in your data to a small interval.

The standardization technique will transform the variables to have a zero mean and standard deviation of one. The formula for standardization is given below and the outcome is commonly known as z-scores.

$$ Z = \frac{X - \mu}{\sigma} $$

Where μ is the mean and σ is the standard deviation.

Standardization has often been the preferred method for various analyses, as it tells us where each data point lies within its distribution and gives a rough indication of outliers. See Listing 3-3.

from sklearn import datasets
import numpy as np
from sklearn import preprocessing

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

std_scale = preprocessing.StandardScaler().fit(X)
X_std = std_scale.transform(X)

minmax_scale = preprocessing.MinMaxScaler().fit(X)
X_minmax = minmax_scale.transform(X)

print('Mean before standardization: petal length={:.1f}, petal width={:.1f}'
      .format(X[:,0].mean(), X[:,1].mean()))
print('SD before standardization: petal length={:.1f}, petal width={:.1f}'
      .format(X[:,0].std(), X[:,1].std()))
print('Mean after standardization: petal length={:.1f}, petal width={:.1f}'
      .format(X_std[:,0].mean(), X_std[:,1].mean()))
print('SD after standardization: petal length={:.1f}, petal width={:.1f}'
      .format(X_std[:,0].std(), X_std[:,1].std()))

print('\nMin value before min-max scaling: petal length={:.1f}, petal width={:.1f}'
      .format(X[:,0].min(), X[:,1].min()))
print('Max value before min-max scaling: petal length={:.1f}, petal width={:.1f}'
      .format(X[:,0].max(), X[:,1].max()))
print('Min value after min-max scaling: petal length={:.1f}, petal width={:.1f}'
      .format(X_minmax[:,0].min(), X_minmax[:,1].min()))
print('Max value after min-max scaling: petal length={:.1f}, petal width={:.1f}'
      .format(X_minmax[:,0].max(), X_minmax[:,1].max()))

#----output----
Mean before standardization: petal length=3.8, petal width=1.2
SD before standardization: petal length=1.8, petal width=0.8
Mean after standardization: petal length=0.0, petal width=-0.0
SD after standardization: petal length=1.0, petal width=1.0

Min value before min-max scaling: petal length=1.0, petal width=0.1
Max value before min-max scaling: petal length=6.9, petal width=2.5
Min value after min-max scaling: petal length=0.0, petal width=0.0
Max value after min-max scaling: petal length=1.0, petal width=1.0

Listing 3-3.

Normalization and scaling

Feature Construction or Generation

Machine learning algorithms give the best results only when we provide the best possible features, that is, features that capture the underlying form of the problem you are trying to address. Often these features have to be manually created by spending a lot of time with the actual raw data and trying to understand its relationship with all the other data that you have collected to address the business problem.

It means thinking about aggregating, splitting, or combining features to create new features, or decomposing features. Often this part is talked about as an art form and is the key differentiator in competitive machine learning.

Feature construction is manual and slow, and requires heavy subject-matter-expert intervention in order to create rich features that can be exposed to predictive modeling algorithms to produce the best results.
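As an illustration, the short sketch below (the DataFrame and column names are hypothetical) shows a few common constructions: splitting a date into parts, combining two columns into a ratio, and aggregating at a group level.

import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 1, 2, 2, 2],
                       'order_date': pd.to_datetime(['2017-01-05', '2017-02-10',
                                                     '2017-01-20', '2017-03-02',
                                                     '2017-03-15']),
                       'amount': [120.0, 80.0, 200.0, 150.0, 90.0],
                       'items': [3, 2, 5, 3, 1]})

# Split: decompose a date into parts
orders['order_month'] = orders['order_date'].dt.month
orders['order_weekday'] = orders['order_date'].dt.weekday

# Combine: build a ratio feature
orders['amount_per_item'] = orders['amount'] / orders['items']

# Aggregate: customer-level summary features
customer_features = orders.groupby('customer_id')['amount'].agg(['mean', 'sum', 'count'])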

Summarizing the data is a fundamental technique to help us understand the data quality and issues/gaps. Figure 3-2 maps the tabular and graphical data summarization methods for different data types. Note that this mapping covers the obvious or commonly used methods, not an exhaustive list.

Figure 3-2. Commonly used data summarization methods

Exploratory Data Analysis (EDA)

EDA is all about understanding your data by employing summarizing and visualizing techniques. At a high level, EDA is performed in two ways, that is, univariate analysis and multivariate analysis.

Let's consider an example dataset to learn the practical aspects. The Iris dataset is one of the well-known datasets used extensively in the pattern recognition literature. It is hosted at the UC Irvine Machine Learning Repository. The dataset contains petal length, petal width, sepal length, and sepal width measurements for three types of iris flowers, that is, setosa, versicolor, and virginica. See Figure 3-3.

Figure 3-3. Iris versicolor

Univariate Analysis

Individual variables are analyzed in isolation to get a better understanding of them. Pandas provides the describe function to create summary statistics in tabular format for all variables. These statistics are very useful for numerical variables to understand any quality issues such as missing values and the presence of outliers. See Listings 3-4 and 3-5.

from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

iris = datasets.load_iris()

# Let's convert to dataframe
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['species'])

# replace the values with class labels
iris.species = np.where(iris.species == 0.0, 'setosa', np.where(iris.species==1.0, 'versicolor', 'virginica'))

# let's remove spaces from column name
iris.columns = iris.columns.str.replace(' ','')

iris.describe()
#----output----
       sepallength(cm)  sepalwidth(cm)  petallength(cm)  petalwidth(cm)
count           150.00          150.00           150.00          150.00
mean              5.84            3.05             3.75            1.19
std               0.82            0.43             1.76            0.76
min               4.30            2.00             1.00            0.10
25%               5.10            2.80             1.60            0.30
50%               5.80            3.00             4.35            1.30
75%               6.40            3.30             5.10            1.80
max               7.90            4.40             6.90            2.50

The column 'species' is categorical, so let's check the frequency distribution for each category.

print iris['species'].value_counts()
#----output----
setosa        50
versicolor    50
virginica     50

Pandas supports plotting functions for quick visualization of attributes. We can see from the plot that 'species' has 3 categories with 50 records each.

Listing 3-4.

Univariate analysis

# Set the size of the plot
plt.figure(figsize=(15, 8))

iris.hist()                             # plot histograms
plt.suptitle("Histogram", fontsize=16)  # use suptitle to add a title to all subplots
plt.show()

iris.boxplot()                          # plot boxplot
plt.title("Box Plot", fontsize=16)
plt.show()
#----output----

Listing 3-5.

Pandas dataframe visualization


Multivariate Analysis

In multivariate analysis you try to establish a sense of the relationships of all variables with one another.

Let’s understand the mean of each feature by species type. See Listing 3-6.

# print the mean for each column by species
iris.groupby(by="species").mean()

# plot for mean of each feature for each label class
iris.groupby(by="species").mean().plot(kind="bar")
plt.title('Class vs Measurements')
plt.ylabel('mean measurement(cm)')
plt.xticks(rotation=0)  # manage the xticks rotation
plt.grid(True)
# Use bbox_to_anchor option to place the legend outside plot area to be tidy
plt.legend(loc="upper left", bbox_to_anchor=(1,1))
#----output----
            sepallength(cm)  sepalwidth(cm)  petallength(cm)  petalwidth(cm)
setosa                5.006           3.418            1.464           0.244
versicolor            5.936           2.770            4.260           1.326
virginica             6.588           2.974            5.552           2.026

Listing 3-6.

Multivariate analysis


Correlation Matrix

The correlation function uses the Pearson correlation coefficient, which results in a number between -1 and 1. A strong negative relationship is indicated by a coefficient closer to -1 and a strong positive correlation by a coefficient closer to 1. See Listing 3-7.

# create correlation matrix
corr = iris.corr()
print(corr)

import statsmodels.api as sm
sm.graphics.plot_corr(corr, xnames=list(corr.columns))
plt.show()
#----output----
                 sepallength(cm)  sepalwidth(cm)  petallength(cm)  petalwidth(cm)
sepallength(cm)         1.000000       -0.109369         0.871754        0.817954
sepalwidth(cm)         -0.109369        1.000000        -0.420516       -0.356544
petallength(cm)         0.871754       -0.420516         1.000000        0.962757
petalwidth(cm)          0.817954       -0.356544         0.962757        1.000000

Listing 3-7.

Correlation matrix


Pair Plot

You can understand the relationships between attributes by looking at the distribution of the interactions of each pair of attributes. Pandas provides a built-in function to create a matrix of scatter plots of all attributes against all attributes. See Listing 3-8.

from pandas.tools.plotting import scatter_matrix

scatter_matrix(iris, figsize=(10, 10))
# use suptitle to add a title to all subplots
plt.suptitle("Pair Plot", fontsize=20)
#----output----

Listing 3-8.

Pair plot


Findings from EDA

· There are no missing values.

· Sepal is longer than petal. Sepal length ranges from 4.3 to 7.9 with an average length of 5.8, whereas petal length ranges from 1 to 6.9 with an average length of 3.7.

· Sepal is also wider than petal. Sepal width ranges from 2 to 4.4 with an average width of 3.05, whereas petal width ranges from 0.1 to 2.5 with an average width of 1.19.

· The average petal length of setosa is much smaller than that of versicolor and virginica; however, the average sepal width of setosa is higher than that of versicolor and virginica.

· Petal length and width are strongly positively correlated (correlation coefficient 0.96), that is, petal width tends to increase as petal length increases.

· Petal length has a negative correlation with sepal width (correlation coefficient -0.42), that is, petal length tends to decrease as sepal width increases.

· Initial conclusion from the data: based on the length and width of sepal/petal alone, versicolor and virginica might resemble each other in size; however, setosa's characteristics seem noticeably different from the other two.

Looking further at the characteristics of the three iris flowers visually in Figure 3-4, we can confirm the hypothesis from our EDA.

Figure 3-4. Iris flowers

Statistics and mathematics are the basis for machine learning algorithms. Let's begin by understanding some of the basic concepts and algorithms that are derived from the statistical world and gradually move on to advanced machine learning algorithms.

Supervised Learning – Regression

Can you guess what is common in the below set of business questions across different domains? See Table 3-6.

Table 3-6. Supervised learning use case examples

Domain                   Question
Retail                   How much will be the daily, monthly, and yearly sales for a given store for the next three years?
Retail                   How many car parking spaces should be allocated for a retail store?
Manufacturing            How much will be the product-wise manufacturing labor cost?
Manufacturing / Retail   How much will be my monthly electricity cost for the next three years?
Banking                  What is the credit score of a customer?
Insurance                How many customers will claim insurance this year?
Energy / Environmental   What will be the temperature for the next five days?

You might have guessed it right! The presence of the words 'how much' and 'how many' implies that the answer to these questions will be a quantitative or continuous number. Regression is one of the fundamental techniques that will help us find answers to these types of questions by studying the relationship between the different variables that are relevant to the question we are trying to answer.

Let's consider a use case where we have collected students' average test grade scores and their respective average numbers of study hours for the test, for a group of students with similar IQs. See Listing 3-9.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Load data
df = pd.read_csv('Data/Grade_Set_1.csv')
print df

# Simple scatter plot
df.plot(kind='scatter', x='Hours_Studied', y='Test_Grade', title='Grade vs Hours Studied')

# check the correlation between variables
print df.corr()
# ---- output ----
   Hours_Studied  Test_Grade
0              2          57
1              3          66
2              4          73
3              5          76
4              6          79
5              7          81
6              8          90
7              9          96
8             10         100

Correlation Matrix:
               Hours_Studied  Test_Grade
Hours_Studied       1.000000    0.987797
Test_Grade          0.987797    1.000000

Listing 3-9.

Students score vs. hours studied


A simple scatter plot with hours studied on the x-axis and test grades on the y-axis shows that the grade gradually increases with an increase in hours studied. This implies that there is a linear relationship between the two variables. Performing correlation analysis shows a strong positive relationship (correlation coefficient 0.98) between the two variables, so a change in study hours is very likely to be accompanied by a change in grade.

Correlation and Causation

Although correlation helps us determine the degree of relationship between two or more variables, it does not tell us about the cause and effect relationship. A high degree of correlation does not always necessarily mean a relationship of cause and effect exists between variables. Note that correlation does not imply causation, though the existence of causation always implies correlation. Let's understand this better with examples.

· The presence of more firemen during a fire signifies that the fire is big, but the fire is not caused by the firemen.

· When one sleeps with shoes on, he is likely to get a headache. This may be due to alcohol intoxication.

The significant degree of correlation in the above examples may be due to the reasons below:

· Small samples are prone to show a higher correlation due to pure chance.

· Variables may be influencing each other so it becomes hard to designate one as the cause and the other as the effect.

· Correlated variables may be influenced by one or more other related variables.

Domain knowledge or the involvement of a subject matter expert is very important to ascertain whether a correlation is due to causation.

Fitting a Slope

Let's try to fit a line through all the points such that the error or residual, that is, the distance of the line from each point, is the minimum possible. See Figure 3-5.

Figure 3-5. Linear regression model components

The error could be positive or negative based on a point's location relative to the line, because of which a simple sum of all the errors would cancel out to zero. So we square the errors to get rid of the sign and then sum the squared errors. Hence, the fitted line is also referred to as the least squares line.
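In equation form, the fitted line is the one that minimizes the sum of squared errors (SSE) over all n observed points:

$$ SSE = \sum_{i=1}^{n}\left(y_i - \left(m x_i + c\right)\right)^2 $$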

· The slope equation is given by Y = mX + c, where Y is the predicted value for a given x value.

· m is the change in y divided by the change in x, that is, m is the slope of the line for the x variable, and it indicates the steepness at which y increases with every unit increase in the x variable's value.

· c is the intercept, which indicates the point on the y-axis where the line intersects it; in the case of Figure 3-5 it is 52. The intercept is a constant that represents the variability in Y that is not explained by X. It is the value of Y when X is zero.

Together the slope and intercept define the linear relationship between the two variables and can be used to predict or estimate an average rate of change. Now, using this relationship, we can determine the score for a new student based on his study hours. Say a student is planning to study a total of 6 hours in preparation for the test. Simply drawing a connecting line from the x-axis and y-axis to the fitted line shows that there is a possibility of him scoring 80. We can use the slope equation to predict the score for any given number of hours of study. In this case the test grade is the dependent variable, denoted by 'Y', and hours studied is an independent variable or predictor, denoted by 'X'. Let's use the linear regression function from the scikit-learn library to find the values of m (x's coefficient) and c (intercept). See Listing 3-10.

# importing linear regression function
import numpy as np
import sklearn.linear_model as lm

# Create linear regression object
lr = lm.LinearRegression()

x = df.Hours_Studied[:, np.newaxis]  # independent variable
y = df.Test_Grade.values             # dependent variable

# Train the model using the training sets
lr.fit(x, y)

print "Intercept: ", lr.intercept_
print "Coefficient: ", lr.coef_

# manual prediction for a given value of x
print "Manual prediction :", 52.2928994083 + 4.74260355*6

# predict using the built-in function
print "Using predict function: ", lr.predict(6)

# plotting fitted line
plt.scatter(x, y, color='black')
plt.plot(x, lr.predict(x), color='blue', linewidth=3)
plt.title('Grade vs Hours Studied')
plt.ylabel('Test_Grade')
plt.xlabel('Hours_Studied')

# ---- output ----
Intercept:  52.2928994083
Coefficient:  [ 4.74260355]
Manual prediction : 80.7485207083
Using predict function:  [ 80.74852071]

Listing 3-10.

Linear regression


Let's put the appropriate values into the slope equation (m * X + c = Y): 4.74260355 * 6 + 52.2928994083 = 80.74, which means a student studying 6 hours is predicted to score a test grade of about 80.74.

Note that if X is zero, the value of Y will be 52.29, which means that even if the student does not study there is a possibility that he'll score 52.29; this signifies that there are other variables with a causal effect on the score that we currently do not have access to.

How Good Is Your Model?

There are three metrics widely used for evaluating linear model performance.

· R-squared

· RMSE

· MAE

R-Squared for Goodness of Fit

The R-squared metric is the most popular measure of how well your model fits the data. The R-squared value designates the total proportion of variance in the dependent variable explained by the independent variable. It is a value between 0 and 1; a value toward 1 indicates a better model fit. See Table 3-7.

Table 3-7.

Sample table for R-squared calculation

R-squared is computed as the sum of squares explained by the regression (SSR, based on the squared differences of the predicted values from the mean) divided by the total sum of squares (SST, based on the squared differences of the actual values from the mean), as calculated in Listing 3-11:

$$ R^2 = \frac{\sum SSR}{\sum SST} = \frac{1510.01}{1547.55} = 0.97 $$

In this case R-squared can be interpreted as follows: 97% of the variability in the dependent variable (test score) can be explained by the independent variable (hours studied).

Root Mean Squared Error (RMSE)

This is the square root of the mean of the squared errors. RMSE indicates how close the predicted values are to the actual values; hence a lower RMSE value signifies good model performance. One of the key properties of RMSE is that its unit is the same as that of the target variable.

$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \widehat{y}_i\right)^2} $$

Mean Absolute Error

This is the mean or average of the absolute value of the errors, that is, |predicted - actual|. See Listing 3-11.

$$ MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \widehat{y}_i\right| $$

# function to calculate r-squared, MAE, RMSE
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# add predicted value to the data frame
df['Test_Grade_Pred'] = lr.predict(x)

# Manually calculating R Squared
df['SST'] = np.square(df['Test_Grade'] - df['Test_Grade'].mean())
df['SSR'] = np.square(df['Test_Grade_Pred'] - df['Test_Grade'].mean())

print "Sum of SSR:", df['SSR'].sum()
print "Sum of SST:", df['SST'].sum()
print "R Squared using manual calculation: ", df['SSR'].sum() / df['SST'].sum()

# Using built-in functions
print "R Squared using built-in function: ", r2_score(df.Test_Grade, df.Test_Grade_Pred)
print "Mean Absolute Error: ", mean_absolute_error(df.Test_Grade, df.Test_Grade_Pred)
print "Root Mean Squared Error: ", np.sqrt(mean_squared_error(df.Test_Grade, df.Test_Grade_Pred))

# ---- output ----
Sum of SSR: 1510.01666667
Sum of SST: 1547.55555556
R Squared using manual calculation:  0.97574310741
R Squared using built-in function:  0.97574310741
Mean Absolute Error:  1.61851851852
Root Mean Squared Error:  2.04229959955

Listing 3-11.

Linear regression model accuracy metrics

Polynomial Regression

Polynomial regression is a form of higher-order regression, modeled between the dependent and independent variables as an nth-degree polynomial. Although the model is still linear in its parameters, it can fit curves. Essentially we'll be introducing higher-degree terms of the same independent variable into the equation. See Table 3-8 and Listing 3-12.

Table 3-8. Polynomial regression higher degrees

Degree          Regression Equation
Quadratic (2)   $$ Y = m_1X + m_2X^2 + c $$
Cubic (3)       $$ Y = m_1X + m_2X^2 + m_3X^3 + c $$
Nth             $$ Y = m_1X + m_2X^2 + m_3X^3 + \dots + m_nX^n + c $$

x = np.linspace(-3, 3, 1000)  # 1000 sample points between -3 and 3

# Plot subplots
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(nrows=2, ncols=3)

ax1.plot(x, x)
ax1.set_title('linear')
ax2.plot(x, x**2)
ax2.set_title('degree 2')
ax3.plot(x, x**3)
ax3.set_title('degree 3')
ax4.plot(x, x**4)
ax4.set_title('degree 4')
ax5.plot(x, x**5)
ax5.set_title('degree 5')
ax6.plot(x, x**6)
ax6.set_title('degree 6')

plt.tight_layout()  # tidy layout
# --- output ----

Listing 3-12.

Polynomial regression


Let's consider another set of students' average test grade scores and their respective average numbers of hours studied, again for students with similar IQs. See Listing 3-13.

# Load data
df = pd.read_csv('Data/Grade_Set_2.csv')
print df

# Simple scatter plot
df.plot(kind='scatter', x='Hours_Studied', y='Test_Grade', title='Grade vs Hours Studied')

# check the correlation between variables
print("Correlation Matrix: ")
df.corr()

# Create linear regression object
lr = lm.LinearRegression()

x = df.Hours_Studied[:, np.newaxis]  # independent variable
y = df.Test_Grade                    # dependent variable

# Train the model using the training sets
lr.fit(x, y)

# plotting fitted line
plt.scatter(x, y, color='black')
plt.plot(x, lr.predict(x), color='blue', linewidth=3)
plt.title('Grade vs Hours Studied')
plt.ylabel('Test_Grade')
plt.xlabel('Hours_Studied')

print "R Squared: ", r2_score(y, lr.predict(x))

# ---- output ----
    Hours_Studied  Test_Grade
0             0.5          20
1             1.0          21
2             2.0          22
3             3.0          23
4             4.0          25
5             5.0          37
6             6.0          48
7             7.0          56
8             8.0          67
9             9.0          76
10           10.0          90
11           11.0          89
12           12.0          90

Correlation Matrix:
               Hours_Studied  Test_Grade
Hours_Studied       1.000000    0.974868
Test_Grade          0.974868    1.000000

R Squared:  0.9503677767

Listing 3-13.

Polynomial regression example


The correlation analysis shows a 97% positive relationship between hours studied and test grade, and 95% (R-squared) of the variation in test grade can be explained by hours studied. Note that up to 4 hours of average study results in a test grade below 30, and beyond 9 hours of study there is little additional gain in grade. This is not a perfect linear relationship, although we can fit a linear line. Let's try higher-order polynomial degrees. See Listing 3-14.

lr = lm.LinearRegression()

x = df.Hours_Studied  # independent variable
y = df.Test_Grade     # dependent variable

# NumPy's vander function will return powers of the input vector
for deg in [1, 2, 3, 4, 5]:
    lr.fit(np.vander(x, deg + 1), y)
    y_lr = lr.predict(np.vander(x, deg + 1))
    plt.plot(x, y_lr, label='degree ' + str(deg))
    plt.legend(loc=2)
    print "R-squared for degree %d = " % deg, r2_score(y, y_lr)

plt.plot(x, y, 'ok')
# ---- output ----
R-squared for degree 1 =  0.9503677767
R-squared for degree 2 =  0.960872656868
R-squared for degree 3 =  0.993832312037
R-squared for degree 4 =  0.99550001841
R-squared for degree 5 =  0.99562049139

Listing 3-14.

r-squared for different polynomial degrees


Note that degree 1 here is the linear fit; the higher-order polynomial regressions fit the curve better, and R-squared jumps about 4 percentage points higher at degree 3. Beyond the 3rd degree there is not a large change in R-squared, so we can say that the 3rd degree fits best.

Scikit-learn provides a function to generate a new feature matrix consisting of all polynomial combinations of the features with the degree less than or equal to the specified degree. See Listing 3-15.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

x = df.Hours_Studied[:, np.newaxis]  # independent variable
y = df.Test_Grade                    # dependent variable

degree = 4
model = make_pipeline(PolynomialFeatures(degree), lr)
model.fit(x, y)

plt.scatter(x, y, color='black')
plt.plot(x, model.predict(x), color='green')

print "R Squared using built-in function: ", r2_score(y, model.predict(x))
# ---- output ----
R Squared using built-in function:  0.993832312037

Listing 3-15.

scikit-learn polynomial features


Multivariate Regression

So far we have seen simple regression with one independent variable for a given dependent variable. In most real-life use cases there will be more than one independent variable; regression with multiple independent variables is called multivariate regression. The equation takes the form below.

$$ y = m_1x_1 + m_2x_2 + m_3x_3 + \dots + m_nx_n $$

Here, each independent variable is represented by an x, and the m's are the corresponding coefficients. We'll be using the 'statsmodels' Python library to learn the basics of multivariate regression, as it provides more useful statistical results that are helpful from a learning perspective. Once you understand the fundamental concepts, you can use either the 'scikit-learn' or the 'statsmodels' package, as both are efficient.

We'll be using the housing dataset (from RDatasets), which contains sales prices of houses in the city of Windsor. Below is the brief description about each variable. See Table 3-9.

Table 3-9. Housing dataset (from RDatasets)

Variable Name   Description                                                       Data Type
Price           Sale price of a house                                             Numeric
Lotsize         The lot size of a property in square feet                         Numeric
Bedrooms        Number of bedrooms                                                Numeric
Bathrms         Number of full bathrooms                                          Numeric
Stories         Number of stories excluding basement                              Categorical
Driveway        Does the house have a driveway?                                   Boolean/Categorical
Recroom         Does the house have a recreational room?                          Boolean/Categorical
Fullbase        Does the house have a full finished basement?                     Boolean/Categorical
Gashw           Does the house use gas for hot water heating?                     Boolean/Categorical
Airco           Does the house have central air conditioning?                     Boolean/Categorical
Garagepl        Number of garage places                                           Numeric
Prefarea        Is the house located in the preferred neighborhood of the city?   Boolean/Categorical

Let’s build a model to predict the house price (dependent variable), by considering the rest of the variables as independent variables.

The categorical variables need to be handled appropriately before running the first iteration of the model. Scikit-learn provides useful built-in preprocessing functions to handle categorical variables.

· Label Binarizer: This will replace binary variable text with numeric values. We'll be using this function for the binary categorical variables.

· Label Encoder: This will replace category level with number representation.

· One Hot Encoder: This will convert n levels to n new indicator variables (or n-1 if one is dropped), where each new variable uses 1 to indicate the presence of the level and 0 otherwise. Note that before calling OneHotEncoder, we should use LabelEncoder to convert the levels to numbers. Alternatively, we can achieve the same result using get_dummies from the Pandas package, which is much more convenient as we can use it directly on a column with text descriptions without having to convert to numbers first. A short sketch of both routes follows this list.
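A minimal sketch of the two routes on a hypothetical 'stories' column (scikit-learn versions of that era expected numeric input to OneHotEncoder, hence the two-step conversion):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df_example = pd.DataFrame({'stories': ['one', 'two', 'three', 'two', 'one']})

# Step 1: convert text levels to integers
le = LabelEncoder()
stories_encoded = le.fit_transform(df_example['stories'])

# Step 2: expand the integers into one-hot columns
ohe = OneHotEncoder()
stories_onehot = ohe.fit_transform(stories_encoded.reshape(-1, 1)).toarray()

# Alternatively, pandas get_dummies works directly on the text column
stories_dummies = pd.get_dummies(df_example['stories'], prefix='stories', drop_first=True)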

Multicollinearity and Variation Inflation Factor (VIF)

The dependent variable should have a strong relationship with the independent variables. However, the independent variables should not have strong correlations among themselves. Multicollinearity is a situation in which one or more of the independent variables are strongly correlated with each other. In such situations, we should use only one of the correlated independent variables.

VIF is an indicator of the existence of multicollinearity, and 'statsmodels' provides a function to calculate the VIF for each independent variable; a value greater than 10 is the rule of thumb for the possible existence of high multicollinearity. The standard guideline for VIF values is as follows: VIF = 1 means no correlation exists, and VIF > 1 but < 5 means moderate correlation exists. See Listing 3-16.

$$ VIF_i = \frac{1}{1 - R_i^2} $$

Where Ri² is the coefficient of determination obtained by regressing the variable Xi against all the other independent variables.

# Load data
df = pd.read_csv('Data/Housing_Modified.csv')

# Convert binary fields to numeric boolean fields
lb = preprocessing.LabelBinarizer()
df.driveway = lb.fit_transform(df.driveway)
df.recroom = lb.fit_transform(df.recroom)
df.fullbase = lb.fit_transform(df.fullbase)
df.gashw = lb.fit_transform(df.gashw)
df.airco = lb.fit_transform(df.airco)
df.prefarea = lb.fit_transform(df.prefarea)

# Create dummy variables for stories
df_stories = pd.get_dummies(df['stories'], prefix='stories', drop_first=True)

# Join the dummy variables to the main dataframe
df = pd.concat([df, df_stories], axis=1)
del df['stories']

# let's plot the correlation matrix using the statsmodels graphics package's plot_corr
# create correlation matrix
corr = df.corr()
sm.graphics.plot_corr(corr, xnames=list(corr.columns))
plt.show()
# ---- output ----

Listing 3-16.

Multicollinearity and VIF


We can notice from the plot that stories_one has a strong negative correlation with stories_two. Let’s perform the VIF analysis to eliminate strongly correlated independent variables. See Listings 3-17 and 3-18.

# create a Python list of feature names
independent_variables = ['lotsize', 'bedrooms', 'bathrms', 'driveway', 'recroom',
                         'fullbase', 'gashw', 'airco', 'garagepl', 'prefarea',
                         'stories_one', 'stories_two', 'stories_three']

# use the list to select a subset from original DataFrame
X = df[independent_variables]
y = df['price']

from statsmodels.stats.outliers_influence import variance_inflation_factor

thresh = 10

for i in np.arange(0, len(independent_variables)):
    vif = [variance_inflation_factor(X[independent_variables].values, ix)
           for ix in range(X[independent_variables].shape[1])]
    maxloc = vif.index(max(vif))
    if max(vif) > thresh:
        print "vif :", vif
        print('dropping \'' + X[independent_variables].columns[maxloc] + '\' at index: ' + str(maxloc))
        del independent_variables[maxloc]
    else:
        break

print 'Final variables:', independent_variables

# ---- output ----
vif : [8.9580980878443359, 18.469878559519948, 8.9846723472908643, 7.0885785420918861, 1.4770152815033917, 2.013320236472385, 1.1034879198994192, 1.7567462065609021, 1.9826489313438442, 1.5332946465459893, 3.9657526747868612, 5.5117024083548918, 1.7700402770614867]
dropping 'bedrooms' at index: 1
Final variables: ['lotsize', 'bathrms', 'driveway', 'recroom', 'fullbase', 'gashw', 'airco', 'garagepl', 'prefarea', 'stories_one', 'stories_two', 'stories_three']

Listing 3-17.

Remove multicollinearity

We can notice that the VIF analysis has eliminated 'bedrooms', as its VIF is greater than 10; however, stories_one and stories_two have been retained.

Let's run the first iteration of the multivariate regression model with the set of independent variables that has passed the VIF analysis.

To test the model performance, the common practice is to split the dataset into 80/20 (or 70/30) for train/test respectively, use the train dataset to build the model, and then apply the trained model on the test dataset to evaluate the performance of the model.

# create a Python list of feature names
independent_variables = ['lotsize', 'bathrms', 'driveway', 'recroom', 'fullbase',
                         'gashw', 'airco', 'garagepl', 'prefarea',
                         'stories_one', 'stories_two', 'stories_three']

# use the list to select a subset from original DataFrame
X = df[independent_variables]
y = df['price']

# Split your data set into 80/20 for train/test datasets
from sklearn.cross_validation import train_test_split
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.80, random_state=1)

# create a fitted model & print the summary
lm = sm.OLS(y_train, X_train).fit()
print lm.summary()

# make predictions on the testing set
y_train_pred = lm.predict(X_train)
y_test_pred = lm.predict(X_test)

print "Train MAE: ", metrics.mean_absolute_error(y_train, y_train_pred)
print "Train RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))
print "Test MAE: ", metrics.mean_absolute_error(y_test, y_test_pred)
print "Test RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))

# ---- output ----
Train MAE:  11987.66016
Train RMSE:  15593.4749178
Test MAE:  12722.0796754
Test RMSE:  17509.25004

Listing 3-18.

Build the multivariate linear regression model


Interpreting the OLS Regression Results

Adjusted R-squared: The simple R-squared value will keep increasing with the addition of independent variables. To fix this issue, adjusted R-squared is considered for multivariate regression to understand the explanatory power of the independent variables.

$$ Adjusted\ R^2 = 1 - \frac{\left(1 - R^2\right)\left(N - 1\right)}{N - p - 1} $$

Here, N is total observations or sample size and p is the number of predictors. See Figure 3-6.
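As a small sketch, the adjustment can be computed directly from an R-squared value, the sample size, and the number of predictors (the values used here are hypothetical, for illustration only):

def adjusted_r_squared(r_squared, n_obs, n_predictors):
    # adjusted R-squared penalizes the addition of predictors
    return 1 - (1 - r_squared) * (n_obs - 1) / float(n_obs - n_predictors - 1)

# hypothetical values for illustration
print("Adjusted R-squared: %.3f" % adjusted_r_squared(0.67, 436, 12))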

Figure 3-6. R-squared vs. Adjusted R-squared

· Figure 3-6 shows how adjusted R-squared tracks R-squared as more variables are added.

· With the inclusion of more variables, R-squared always tends to increase.

· Adjusted R-squared will drop if a newly added variable does not explain the variation in the dependent variable.

Coefficient: These are the individual coefficients for the respective independent variables. A coefficient can be a positive or negative number, which indicates that an increase of one unit in that independent variable will have a positive or negative impact on the dependent variable's value.

Standard error: This is the average distance of the observed values from the regression line for the respective independent variable. Smaller values indicate that the model fit is good.

Durbin-Watson: This is one of the common statistics used to determine whether the residuals exhibit autocorrelation, that is, whether the errors of adjacent observations are correlated with each other. The Durbin-Watson statistic always lies between 0 and 4. A value around 2 is ideal (a range of 1.5 to 2.5 is relatively normal), and it means that there is no autocorrelation in the residuals of the model.
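If needed, the statistic can also be computed directly from the fitted model's residuals with statsmodels (a sketch assuming `lm` is the fitted OLS result from the earlier listing):

from statsmodels.stats.stattools import durbin_watson

# values near 2 suggest little or no autocorrelation in the residuals
print("Durbin-Watson: %.3f" % durbin_watson(lm.resid))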

Confidence interval: This gives the 95% confidence interval for each independent variable's coefficient (slope).

t and p-value: The p-value is one of the most important statistics. In order to get a better understanding, we'll have to understand the concepts of hypothesis testing and the normal distribution.

Hypothesis testing consists of making an assertion regarding the distribution of the observations and then validating that assertion. The hypothesis testing steps are given below.

· A hypothesis is made.

· The validity of the hypothesis is tested.

· If the hypothesis is found to be true, it is accepted.

· If it is found to be untrue, it is rejected.

· The hypothesis that is being tested for possible rejection is called null hypothesis.

· Null hypothesis is denoted by H0.

· The hypothesis that is accepted when null hypothesis is rejected is called alternate hypothesis Ha.

· The alternative hypothesis is often the interesting one and often the one that someone sets out to prove.

· For example, the null hypothesis H0 is that the lot size has no real effect on house price; in this case the coefficient m is equal to zero in the regression equation (y = m * lot size + c).

· The alternative hypothesis Ha is that the lot size does have a real effect on house price and the effect you saw was not due to chance alone. This means the coefficient m is not equal to zero in the regression equation.

· In order to be able to say whether the regression estimate is close enough to the hypothesized value to be acceptable, we take the range of estimates implied by the estimated variance and check whether this range contains the hypothesized value. To do this, we can transform the estimate into a standard normal distribution, and we know that 95% of all values of a variable that has a mean of 0 and a variance of 1 lie within approximately 2 standard deviations of the mean. Given a regression estimate and its standard error, we can be 95% confident that the true (unknown) value of m will lie in this region. See Figure 3-7.

Figure 3-7. Normal distribution (red is the rejection region)

· The t-value is used to determine a p-value (probability). A p-value ≤ 0.05 signifies strong evidence against the null hypothesis, so you reject the null hypothesis. A p-value > 0.05 signifies weak evidence against the null hypothesis, so you fail to reject the null hypothesis. So in our case, the variables with p-values ≤ 0.05 are significant for the model.

· Process of testing a hypothesis indicates that there is a possibility of making an error. There are two types of errors for any given dataset, and these two types of errors are inversely related, which means the smaller the risk of one, the higher the risk of the other.

· Type I error: The error of rejecting the null hypothesis H0 even though H0 was true.

· Type II error: The error of accepting the null hypothesis H0 even though H0 was false.

· Note that the variables 'stories_three' and 'recroom' have large p-values, indicating they are insignificant. So let's re-run the regression without these variables and look at the results; a sketch of the re-run follows.
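A sketch of that re-run, assuming the same train/test split and the variables (X_train, y_train, sm, metrics, np) from the earlier listing; the exact code may differ from the book's:

# drop the insignificant variables and refit
X_train_sig = X_train.drop(['stories_three', 'recroom'], axis=1)
X_test_sig = X_test.drop(['stories_three', 'recroom'], axis=1)

lm = sm.OLS(y_train, X_train_sig).fit()
print(lm.summary())

y_train_pred = lm.predict(X_train_sig)
y_test_pred = lm.predict(X_test_sig)
print("Train MAE: %.4f" % metrics.mean_absolute_error(y_train, y_train_pred))
print("Train RMSE: %.4f" % np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
print("Test MAE: %.4f" % metrics.mean_absolute_error(y_test, y_test_pred))
print("Test RMSE: %.4f" % np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))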


Train MAE: 11993.3436816

Train RMSE: 15634.9995429

Test MAE: 12902.4799591

Test RMSE: 17694.9341405

· Note that dropping the variables has not impacted adjusted R-squared negatively.

Regression Diagnosis

· There is a set of procedures and assumptions that need to be verified about our model results; without this, the model could be misleading. Let's look at some of the important regression diagnostics.

Outliers

· Data points that are far away from the fitted regression line are called outliers, and these can impact the accuracy of the model. Plotting normalized residuals vs. leverage will give us a good understanding of the outlier points. A residual is the difference between the actual and predicted values, and leverage is a measure of how far away the independent variable values of an observation are from those of the other observations.

# let's plot the normalized residual vs leverage
from statsmodels.graphics.regressionplots import plot_leverage_resid2
fig, ax = plt.subplots(figsize=(8,6))
fig = plot_leverage_resid2(lm, ax=ax)
# ---- output ----


From the chart we see that there are many observations that have high leverage and residuals. Running a Bonferroni outlier test will give us p-values for each observation, and the observations with p-value < 0.05 are the outliers affecting the accuracy. It is good practice to consult or apply business domain knowledge when deciding whether to remove the outlier points and re-run the model, as these points could be natural to the process even though they are mathematically identified as outliers. See Listing 3-19.

# Find outliers #
# Bonferroni outlier test
test = lm.outlier_test()
print 'Bad data points (bonf(p) < 0.05):'
print test[test.icol(2) < 0.05]
# ---- output ----
Bad data points (bonf(p) < 0.05):
     student_resid   unadj_p   bonf(p)
377       4.387449  0.000014  0.006315

Listing 3-19.

Find outliers

Homoscedasticity and Normality

The error variance should be constant, which is known as homoscedasticity, and the errors should be normally distributed. See Listing 3-20.

# plot to check homoscedasticity
plt.plot(lm.resid, 'o')
plt.title('Residual Plot')
plt.ylabel('Residual')
plt.xlabel('Observation Numbers')
plt.show()

# histogram to check the normality of the residuals
plt.hist(lm.resid, normed=True)
# ---- output ----

Listing 3-20.

Homoscedasticity test


Linearity: the relationships between the predictors and the outcome variable should be linear. If a relationship is not linear, then an appropriate transformation (such as log, square root, higher-order polynomials, etc.) should be applied to the dependent/independent variable to fix the issue. See Listing 3-21.

# linearity plots
fig = plt.figure(figsize=(10,15))
fig = sm.graphics.plot_partregress_grid(lm, fig=fig)
# ---- output ----

Listing 3-21.

Linearity check


Over-fitting and Under-fitting

Under-fitting occurs when the model does not fit the data well and is unable to capture the underlying trend in it. In this case we notice low accuracy on both the training and test datasets.

To the contrary, over-fitting occurs when the model fits the data too well, capturing all the noise. In this case we notice high accuracy on the training dataset, whereas the same model will result in low accuracy on the test dataset. This means the model has fitted the line so well to the train dataset that it fails to generalize to an unseen dataset. Figure 3-8 shows what the different fittings would look like on the example use case discussed earlier, and a short sketch after the figure illustrates the resulting train/test gap. The choice of the right polynomial degree is very important to avoid over-fitting or under-fitting in regression. We'll discuss different ways of handling these problems in detail in the next chapter. See Figure 3-8.

Figure 3-8. Model fittings
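A hedged sketch of how this shows up in practice, reusing the earlier grade dataset and a polynomial pipeline (the degrees chosen here are for illustration only): a very low degree under-fits both sets, while a very high degree scores well on train data but poorly on test data.

import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

df = pd.read_csv('Data/Grade_Set_2.csv')
x = df[['Hours_Studied']]  # independent variable (2D)
y = df.Test_Grade          # dependent variable

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=.80, random_state=1)

for degree in [1, 3, 10]:
    model = make_pipeline(PolynomialFeatures(degree), linear_model.LinearRegression())
    model.fit(x_train, y_train)
    # compare fit quality on seen vs unseen data
    print("Degree %d -> Train R2: %.2f, Test R2: %.2f" % (
        degree,
        r2_score(y_train, model.predict(x_train)),
        r2_score(y_test, model.predict(x_test))))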

Regularization

With an increase in the number of variables and in model complexity, the probability of over-fitting also increases. Regularization is a technique to avoid the over-fitting problem described above, where the model fits the training data so well, noise included, that it fails to generalize to unseen data.

Statsmodels and scikit-learn provide Ridge and LASSO (Least Absolute Shrinkage and Selection Operator) regression to handle the over-fitting issue. With an increase in model complexity, the size of the coefficients tends to increase sharply, so ridge and LASSO regression apply a penalty to the magnitude of the coefficients to handle the issue.

LASSO: This provides a sparse solution, also known as L1 regularization. It guides parameter values toward zero, that is, the coefficients of the variables that add minor value to the model will be exactly zero; it adds a penalty equivalent to the absolute value of the magnitude of the coefficients.

Ridge Regression: Also known as Tikhonov (L2) regularization, it guides parameters to be close to zero, but not exactly zero. You can use this when you have many variables that individually add minor value to the model accuracy but that in combination improve the overall model accuracy and cannot be excluded from the model. Ridge regression applies a penalty to reduce the magnitude of the coefficients of all variables that add minor value to the model accuracy; the penalty is equivalent to the square of the magnitude of the coefficients. Alpha is the regularization strength and must be a positive float. The two penalized objectives are written out below; see also Figure 3-9 and Listing 3-22.
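Written out (a standard formulation consistent with the description above, with the m's denoting the coefficients and α the regularization strength):

$$ Ridge\ loss = \sum_{i=1}^{n}\left(y_i - \widehat{y}_i\right)^2 + \alpha \sum_{j=1}^{p} m_j^2 $$

$$ LASSO\ loss = \sum_{i=1}^{n}\left(y_i - \widehat{y}_i\right)^2 + \alpha \sum_{j=1}^{p} \left|m_j\right| $$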

Figure 3-9. Regularizations

from sklearn import linear_model

# Load data
df = pd.read_csv('Data/Grade_Set_2.csv')
df.columns = ['x','y']

for i in range(2,50):      # power of 1 is already there
    colname = 'x_%d'%i     # new var will be x_power
    df[colname] = df['x']**i

independent_variables = list(df.columns)
independent_variables.remove('y')

X = df[independent_variables]  # independent variables
y = df.y                       # dependent variable

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.80, random_state=1)

# Ridge regression
lr = linear_model.Ridge(alpha=0.001)
lr.fit(X_train, y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

print("------ Ridge Regression ------")
print "Train MAE: ", metrics.mean_absolute_error(y_train, y_train_pred)
print "Train RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))
print "Test MAE: ", metrics.mean_absolute_error(y_test, y_test_pred)
print "Test RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))
print "Ridge Coef: ", lr.coef_

# LASSO regression
lr = linear_model.Lasso(alpha=0.001)
lr.fit(X_train, y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

print("----- LASSO Regression -----")
print "Train MAE: ", metrics.mean_absolute_error(y_train, y_train_pred)
print "Train RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))
print "Test MAE: ", metrics.mean_absolute_error(y_test, y_test_pred)
print "Test RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))
print "LASSO Coef: ", lr.coef_

#--- output ----

------ Ridge Regression ------

Train MAE: 13.1168856247

Train RMSE: 16.8257485401

Test MAE: 22.0861723747

Test RMSE: 22.1213599428

Ridge Coef: [ 9.99646940e-89 1.26287785e-87 1.39941783e-86 1.48384493e-85

1.53867101e-84 1.57509733e-83 1.59948276e-82 1.61560028e-81

1.62575609e-80 1.63139718e-79 1.63345182e-78 1.63252488e-77

1.62901252e-76 1.62317116e-75 1.61516012e-74 1.60506865e-73

1.59293349e-72 1.57875067e-71 1.56248359e-70 1.54406874e-69

1.52341994e-68 1.50043156e-67 1.47498127e-66 1.44693238e-65

1.41613625e-64 1.38243475e-63 1.34566311e-62 1.30565333e-61

1.26223824e-60 1.21525668e-59 1.16455980e-58 1.11001906e-57

1.05153619e-56 9.89055473e-56 9.22579214e-55 8.52186708e-54

7.78057774e-53 7.00501714e-52 6.19992888e-51 5.37214345e-50

4.53111186e-49 3.68955745e-48 2.86427041e-47 2.07707515e-46

1.35600615e-45 7.36735837e-45 2.64306411e-44 -4.77164338e-45

2.09761759e-46]

----- LASSO Regression -----

Train MAE: 0.842374298887

Train RMSE: 1.21912918556

Test MAE: 4.32364759404

Test RMSE: 4.8723243497

LASSO Coef: [ 1.29948409e+00 3.92103580e-01 1.75369422e-02 7.79647589e-04

3.02339084e-05 3.35699852e-07 -1.13749601e-07 -1.79773817e-08

-1.93826156e-09 -1.78643532e-10 -1.50240566e-11 -1.18610891e-12

-8.91794276e-14 -6.43309631e-15 -4.46487394e-16 -2.97784537e-17

-1.89686955e-18 -1.13767046e-19 -6.22157254e-21 -2.84658206e-22

-7.32019963e-24 5.16015995e-25 1.18616856e-25 1.48398312e-26

1.55203577e-27 1.48667153e-28 1.35117812e-29 1.18576052e-30

1.01487234e-31 8.52473862e-33 7.05722034e-34 5.77507464e-35

4.68162529e-36 3.76585569e-37 3.00961249e-38 2.39206785e-39

1.89235649e-40 1.49102460e-41 1.17072537e-42 9.16453614e-44

7.15512017e-45 5.57333358e-46 4.33236496e-47 3.36163309e-48

2.60423554e-49 2.01461728e-50 1.55652093e-51 1.20123190e-52

9.26105400e-54]

Listing 3-22.

Regularization

Nonlinear Regression

Linear models are linear in their parameters, although the fitted curve need not be a straight line. In contrast, a nonlinear model's fitted line can take any shape; this scenario usually occurs when models are derived on the basis of physical or biological considerations. Nonlinear models have a direct interpretation in terms of the process under study. The SciPy library provides the curve_fit function to fit models to scientific data based on a theory, to determine the parameters of a physical system. Some example use cases are Michaelis–Menten enzyme kinetics, the Weibull distribution, the power law distribution, etc. See Listing 3-23.

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

from scipy.optimize import curve_fit

x= np.array([-2,-1.64,-0.7,0,0.45,1.2,1.64,2.32,2.9])

y = np.array([1.0, 1.5, 2.4, 2, 1.49, 1.2, 1.3, 1.2, 0.5])

def func(x, p1, p2):
    return p1*np.sin(p2*x) + p2*np.cos(p1*x)

popt, pcov = curve_fit(func, x, y,p0=(1.0,0.2))

p1 = popt[0]

p2 = popt[1]

residuals = y - func(x,p1,p2)

fres = sum(residuals**2)

curvex=np.linspace(-2,3,100)

curvey=func(curvex,p1,p2)

plt.plot(x, y, 'bo')

plt.plot(curvex,curvey,'r')

plt.title('Non-linear fitting')

plt.xlabel('x')

plt.ylabel('y')

plt.legend(['data','fit'],loc='best')

plt.show()

# ---- output ----

Listing 3-23.

Nonlinear regression


Supervised Learning – Classification

Let's look at another set of business questions across different domains; can you guess what is common among them? See Table 3-10.

Table 3-10.

Classification use case examples

Domain – Question

Telecom – Is a customer likely to leave the network? (churn prediction)

Retail – Is he a prospective customer, that is, what is the likelihood of purchase vs. non-purchase?

Insurance – To issue insurance, should a customer be sent for a medical checkup?

Insurance – Will the customer renew the insurance?

Banking – Will a customer default on the loan amount?

Banking – Should a customer be given a loan?

Manufacturing – Will the equipment fail?

Health Care – Is the patient infected with a disease?

Health Care – What type of disease does a patient have?

Entertainment – What is the genre of music?

The answers to these questions are discrete classes. The number of levels or classes can vary from a minimum of two (example: true or false, yes or no) to multiclass. In machine learning, classification deals with identifying the probability of a new object being a member of a class or set. The classifiers are the algorithms that map the input data (also called features) to categories.

Logistic Regression

Let’s consider a use case where we have to predict students’ test outcomes, that is, pass (1) or fail (0) based on hours studied. In this case the outcome to be predicted is discrete. Let’s build a linear regression and try to use a threshold, that is, anything over some value is pass, or else it’s fail. See Listing 3-24.

# Load data

df = pd.read_csv('Data/Grade_Set_1_Classification.csv')

print df

x= df.Hours_Studied[:, np.newaxis] # independent variable

y= df.Result # dependent variable

# Create linear regression object

lr = lm.LinearRegression()

# Train the model using the training sets

lr.fit(x, y)

# plotting fitted line

plt.scatter(x, y, color='black')

plt.plot(x, lr.predict(x), color='blue', linewidth=3)

plt.title('Hours Studied vs Result')

plt.ylabel('Result')

plt.xlabel('Hours_Studied')

# add predict value to the data frame

df['Result_Pred'] = lr.predict(x)

# Using built-in function

print "R Squared : ", r2_score(df.Result, df.Result_Pred)

print "Mean Absolute Error: ", mean_absolute_error(df.Result, df.Result_Pred)

print "Root Mean Squared Error: ", np.sqrt(mean_squared_error(df.Result, df.Result_Pred))

# ---- output ----

R Squared : 0.675

Mean Absolute Error: 0.22962962963

Root Mean Squared Error: 0.268741924943

Listing 3-24.

Logistic regression


The outcome that we are expecting is either 1 or 0, and the issue with linear regression is that it can give values larger than 1 or less than 0. In the above plot we can see that linear regression is not able to draw boundaries to classify observations.

The solution to this is to introduce a sigmoid or logit function (which takes an S shape) into the regression equation. The fundamental idea here is that the hypothesis uses the linear approximation, then maps it with a logistic function for binary prediction.

The linear regression equation in this case is y = mx + c.

Logistic regression can be explained better in terms of the odds ratio. The odds of an event occurring are defined as the probability of the event occurring divided by the probability of that event not occurring. See Figure 3-10 and Listing 3-25.

Figure 3-10. Linear regression vs. logistic regression

odds ratio of pass vs. fail = $$ \frac{P\left(y=1\right)}{1 - P\left(y=1\right)} $$

A logit is the log base e (natural log) of the odds, so using the logit model:

· log(p / (1 - p)) = mx + c

Logistic regression equation: $$ P\left(y=1\right) = \frac{1}{1 + {e}^{-\left(mx + c\right)}} $$
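As a quick illustrative calculation with hypothetical numbers (not from the dataset used here): if the probability of passing is 0.8, then

$$ odds = \frac{0.8}{1 - 0.8} = 4, \qquad logit = \ln(4) \approx 1.386, \qquad \frac{1}{1 + {e}^{-1.386}} \approx 0.8 $$

so the logit maps the probability onto the whole real line, and the logistic function maps it back.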

# plot sigmoid function

x = np.linspace(-10, 10, 100)

y = 1.0 / (1.0 + np.exp(-x))

plt.plot(x, y, 'r-', label='logit')

plt.legend(loc='lower right')

# --- output ----

Listing 3-25.

Plot sigmoid function


Let’s run logistic regression using the scikit-learn package. See Listing 3-26.

from sklearn.linear_model import LogisticRegression

# manually add intercept

df['intercept'] = 1

independent_variables = ['Hours_Studied', 'intercept']

x = df[independent_variables] # independent variable

y = df['Result'] # dependent variable

# instantiate a logistic regression model, and fit with X and y

model = LogisticRegression()

model = model.fit(x, y)

# check the accuracy on the training set

model.score(x, y)

print model.predict(x)

print model.predict_proba(x)[:,0]

# plotting fitted line

plt.scatter(df.Hours_Studied, y, color='black')

plt.yticks([0.0, 0.5, 1.0])

plt.plot(df.Hours_Studied, model.predict_proba(x)[:,1], color='blue', linewidth=3)

plt.title('Hours Studied vs Result')

plt.ylabel('Result')

plt.xlabel('Hours_Studied')

# ---- output ----

Listing 3-26.

Logistic regression using scikit-learn


Evaluating a Classification Model Performance

A confusion matrix is a table that is used for describing the performance of a classification model. Figure 3-11 shows the confusion matrix.

Figure 3-11. Confusion matrix

· True Negatives (TN): Actual FALSE, which was predicted as FALSE

· False Positives (FP): Actual FALSE, which was predicted as TRUE (Type I error)

· False Negatives (FN): Actual TRUE, which was predicted as FALSE (Type II error)

· True Positives (TP): Actual TRUE, which was predicted as TRUE

Ideally a good model should have high TN and TP and as few Type I and Type II errors as possible. Table 3-11 describes the key metrics derived from the confusion matrix to understand the classification model performance; a small hand-computed sketch of these formulas follows the table. Also see Listing 3-27.

Table 3-11.

Classification performance matrices

Metric – Description – Formula

Accuracy – What % of predictions were correct? – (TP+TN)/(TP+TN+FP+FN)

Misclassification Rate – What % of predictions were wrong? – (FP+FN)/(TP+TN+FP+FN)

True Positive Rate OR Sensitivity OR Recall (completeness) – What % of positive cases did the model catch? – TP/(FN+TP)

False Positive Rate – What % of 'No' cases were predicted as 'Yes'? – FP/(FP+TN)

Specificity – What % of 'No' cases were predicted as 'No'? – TN/(TN+FP)

Precision (exactness) – What % of positive predictions were correct? – TP/(TP+FP)

F1 score – Weighted average of precision and recall – 2*((precision * recall) / (precision + recall))
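As a minimal sketch of how these formulas are applied, the counts below are hypothetical and exist only to illustrate Table 3-11; Listing 3-27 shows the equivalent scikit-learn calls.

# hypothetical confusion-matrix counts for illustration only
TP, TN, FP, FN = 6, 2, 1, 0

accuracy = float(TP + TN) / (TP + TN + FP + FN)
recall = float(TP) / (TP + FN)        # sensitivity / true positive rate
precision = float(TP) / (TP + FP)     # exactness
specificity = float(TN) / (TN + FP)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy    : %.2f" % accuracy)
print("Recall      : %.2f" % recall)
print("Precision   : %.2f" % precision)
print("Specificity : %.2f" % specificity)
print("F1 score    : %.2f" % f1)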

from sklearn import metrics

# generate evaluation metrics

print "Accuracy :", metrics.accuracy_score(y, model.predict(x))

print "AUC :", metrics.roc_auc_score(y, model.predict_proba(x)[:,1])

print "Confusion matrix :",metrics.confusion_matrix(y, model.predict(x))

print "classification report :", metrics.classification_report(y, model.predict(x))

# ----output----

Accuracy : 0.88

AUC : 1.0

Confusion matrix : [[2 1]

[0 6]]

classification report :

precision recall f1-score support

0 1.00 0.67 0.80 3

1 0.86 1.00 0.92 6

avg/total 0.90 0.89 0.88 9

Listing 3-27.

Confusion matrix

ROC Curve

An ROC curve is another important metric; it is the most commonly used way to visualize the performance of a binary classifier, and AUC is believed to be one of the best ways to summarize that performance in a single number. AUC indicates the probability that a randomly selected positive example will be scored higher by the classifier than a randomly selected negative example. If you have multiple models with nearly the same accuracy, you can pick the one that gives a higher AUC. See Listing 3-28.

# Determine the false positive and true positive rates

fpr, tpr, _ = metrics.roc_curve(y, model.predict_proba(x)[:,1])

# Calculate the AUC

roc_auc = metrics.auc(fpr, tpr)

print 'ROC AUC: %0.2f' % roc_auc

# Plot of a ROC curve for a specific class

plt.figure()

plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)

plt.plot([0, 1], [0, 1], 'k--')

plt.xlim([0.0, 1.0])

plt.ylim([0.0, 1.05])

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC Curve')

plt.legend(loc="lower right")

plt.show()

#---- output ----

Listing 3-28.

Area Under the Curve


In the above case, the AUC is 100% (1.0), as the model scores every positive instance higher than the negative ones.

Fitting Line

The inverse of regularization strength is one of the key parameters when fitting a logistic regression line; it controls the complexity of the fitted line. Let's try to fit lines for different values of this parameter (C, default is 1) and see how the fitted line and the accuracy change. See Listing 3-29.

#instantiate a logistic regression model with default c value, and fit with X and y

model = LogisticRegression()

model = model.fit(x, y)

#check the accuracy on the training set

print "C = 1 (default), Accuracy :", metrics.accuracy_score(y, model.predict(x))

#instantiate a logistic regression model with c = 10, and fit with X and y

model1 = LogisticRegression(C=10)

model1 = model1.fit(x, y)

#check the accuracy on the training set

print "C = 10, Accuracy :", metrics.accuracy_score(y, model1.predict(x))

#instantiate a logistic regression model with c = 100, and fit with X and y

model2 = LogisticRegression(C=100)

model2 = model2.fit(x, y)

#check the accuracy on the training set

print "C = 100, Accuracy :", metrics.accuracy_score(y, model2.predict(x))

#instantiate a logistic regression model with c = 1000, and fit with X and y

model3 = LogisticRegression(C=1000)

model3 = model3.fit(x, y)

#check the accuracy on the training set

print "C = 1000, Accuracy :", metrics.accuracy_score(y, model3.predict(x))

#plotting fitted line

plt.scatter(df.Hours_Studied, y, color='black', label='Result')

plt.yticks([0.0, 0.5, 1.0])

plt.plot(df.Hours_Studied, model.predict_proba(x)[:,1], color='gray', linewidth=2, label='C=1.0')

plt.plot(df.Hours_Studied, model1.predict_proba(x)[:,1], color='blue', linewidth=2,label='C=10')

plt.plot(df.Hours_Studied, model2.predict_proba(x)[:,1], color='green', linewidth=2,label='C=100')

plt.plot(df.Hours_Studied, model3.predict_proba(x)[:,1], color='red', linewidth=2,label='C=1000')

plt.legend(loc='lower right') # legend location

plt.title('Hours Studied vs Result')

plt.ylabel('Result')

plt.xlabel('Hours_Studied')

plt.show()

#----output----

C = 1 (default), Accuracy : 0.88

C = 10, Accuracy : 1.0

C = 100, Accuracy : 1.0

C = 1000, Accuracy : 1.0

Listing 3-29.

Controlling complexity for fitting a line


Stochastic Gradient Descent

Fitting the right slope that minimizes the error (also known as the cost function) for a large dataset can be tricky. However this can be achieved through a stochastic gradient descent (steepest descent) optimization algorithm. In the case of regression problems, the cost function J to learn the weights can be defined as the sum of squared errors (SSE) between actual and predicted values.

$$ J(w) = \frac{1}{2}{\displaystyle {\sum}_{i}{\left({y}^{i} - {\widehat{y}}^{i}\right)}^2} $$, where yi is the ith actual value and ŷi is the ith predicted value.

The stochastic gradient descent algorithm to update the weight w, for every weight j of every training sample i, can be given as: repeat until convergence { $$ {w}_{j} := {w}_{j} + \alpha {\displaystyle \sum_{i=1}^{m}}\left({y}^{i} - {\widehat{y}}^{i}\right){x}_{j}^{i} $$ }.

Alpha (α) is the learning rate, and choosing a smaller value for it will ensure that the algorithm does not miss the global cost minimum. See Figure 3-12.

Figure 3-12. Gradient descent

The default solver parameter for logistic regression in scikit-learn is 'liblinear', which works fine for smaller datasets. For a large dataset with a large number of independent variables, ‘sag’ (stochastic average gradient descent) is the recommended solver to fit the optimal slope faster.
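The weight update above can be sketched in a few lines of NumPy. This is an illustrative toy implementation on made-up data, not the solver scikit-learn uses internally.

import numpy as np

# toy data: y = 2x + 1 plus a little noise (hypothetical)
np.random.seed(0)
x = np.random.rand(100)
y = 2 * x + 1 + 0.05 * np.random.randn(100)

w, b = 0.0, 0.0   # initial weight and intercept
alpha = 0.1       # learning rate

for epoch in range(100):
    for xi, yi in zip(x, y):
        error = yi - (w * xi + b)   # actual minus predicted
        # stochastic update: adjust the weights one sample at a time
        w += alpha * error * xi
        b += alpha * error

print("Estimated slope: %.2f, intercept: %.2f" % (w, b))

With a small learning rate the estimates settle close to the true slope and intercept; too large an alpha makes the updates overshoot so the cost oscillates instead of converging.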

Regularization

With an increase in the number of variables, the probability of over-fitting also increases. LASSO (L1) and Ridge (L2) regularization can be applied to logistic regression as well to avoid over-fitting. Let's look at an example to understand the over-/under-fitting issue in logistic regression. See Listing 3-30.

import pandas as pd

data = pd.read_csv('Data/LR_NonLinear.csv')

pos = data['class'] == 1

neg = data['class'] == 0

x1 = data['x1']

x2 = data['x2']

# function to draw scatter plot between two variables

def draw_plot():
    plt.figure(figsize=(6, 6))
    plt.scatter(np.extract(pos, x1),
                np.extract(pos, x2),
                c='b', marker='s', label='pos')
    plt.scatter(np.extract(neg, x1),
                np.extract(neg, x2),
                c='r', marker='o', label='neg')
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.axes().set_aspect('equal', 'datalim')
    plt.legend()

# create higher-order polynomials for the independent variables

order_no = 6

# map the variable 1 & 2 to its higher order polynomial

def map_features(variable_1, variable_2, order=order_no):
    assert order >= 1
    def iter():
        for i in range(1, order + 1):
            for j in range(i + 1):
                yield np.power(variable_1, i - j) * np.power(variable_2, j)
    return np.vstack(iter())

out = map_features(data['x1'], data['x2'], order=order_no)

X = out.transpose()

y = data['class']

# split the data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# function to draw classifier line

def draw_boundary(classifier):
    dim = np.linspace(-0.8, 1.1, 100)
    dx, dy = np.meshgrid(dim, dim)
    v = map_features(dx.flatten(), dy.flatten(), order=order_no)
    z = (np.dot(classifier.coef_, v) + classifier.intercept_).reshape(100, 100)
    plt.contour(dx, dy, z, levels=[0], colors=['r'])

# fit with c = 0.01

clf = LogisticRegression(C=0.01).fit(X_train, y_train)

print 'Train Accuracy for C=0.01: ', clf.score(X_train, y_train)

print 'Test Accuracy for C=0.01: ', clf.score(X_test, y_test)

draw_plot()

plt.title('Fitting with C=0.01')

draw_boundary(clf)

plt.legend();

# fit with c = 1

clf = LogisticRegression(C=1).fit(X_train, y_train)

print 'Train Accuracy for C=1: ', clf.score(X_train, y_train)

print 'Test Accuracy for C=1: ', clf.score(X_test, y_test)

draw_plot()

plt.title('Fitting with C=1')

draw_boundary(clf)

plt.legend();

# fit with c = 10000

clf = LogisticRegression(C=10000).fit(X_train, y_train)

print 'Train Accuracy for C=10000: ', clf.score(X_train, y_train)

print 'Test Accuracy for C=10000: ', clf.score(X_test, y_test)

draw_plot()

plt.title('Fitting with C=10000')

draw_boundary(clf)

plt.legend();

#----output----

Train Accuracy for C=0.01: 0.624242424242

Test Accuracy for C=0.01: 0.619718309859

Train Accuracy for C=1: 0.842424242424

Test Accuracy for C=1: 0.859154929577

Train Accuracy for C=10000: 0.860606060606

Test Accuracy for C=10000: 0.788732394366

Listing 3-30.

Under-fitting, right-fitting, and over-fitting


Notice that with a larger C value (that is, weaker regularization) over-fitting occurs, and the same can be determined by comparing the accuracy between the train and test datasets, that is, the accuracy drops significantly on the test dataset.

Multiclass Logistic Regression

Logistic regression can also be used to predict a dependent or target variable with multiple classes. Let's learn multiclass prediction with the iris dataset, one of the best-known datasets in the pattern recognition literature. The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. It comes as part of the scikit-learn datasets; the third column represents the petal length and the fourth column the petal width of the flower samples. The classes are already converted to integer labels, where 0=Iris-Setosa, 1=Iris-Versicolor, 2=Iris-Virginica. See Listing 3-31.

Load Data

from sklearn import datasets

import numpy as np

import pandas as pd

iris = datasets.load_iris()

X = iris.data

y = iris.target

print('Class labels:', np.unique(y))

#----output----

('Class labels:', array([0, 1, 2]))

Listing 3-31.

Load data

Normalize Data

The units of measurement of the variables might differ, so let's normalize the data before building the model. See Listing 3-32.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

sc.fit(X)

X = sc.transform(X)

Listing 3-32.

Normalize data

Split Data

Split the data into train and test sets. Whenever we use a random function it's advisable to set a seed to ensure the reproducibility of the results. See Listing 3-33.

# split data into train and test

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Listing 3-33.

Split data into train and test

Training Logistic Regression Model and Evaluating

from sklearn.linear_model import LogisticRegression

# l1 regularization gives better results

lr = LogisticRegression(penalty='l1', C=10, random_state=0)

lr.fit(X_train, y_train)

from sklearn import metrics

# generate evaluation metrics

print "Train - Accuracy :", metrics.accuracy_score(y_train, lr.predict(X_train))

print "Train - Confusion matrix :",metrics.confusion_matrix(y_train, lr.predict(X_train))

print "Train - classification report :", metrics.classification_report(y_train, lr.predict(X_train))

print "Test - Accuracy :", metrics.accuracy_score(y_test, lr.predict(X_test))

print "Test - Confusion matrix :",metrics.confusion_matrix(y_test, lr.predict(X_test))

print "Test - classification report :", metrics.classification_report(y_test, lr.predict(X_test))

#----output----

Train - Accuracy : 0.990476190476

Train - Confusion matrix : [[34 0 0]

[ 0 31 1]

[ 0 0 39]]

Train - classification report : precision recall f1-score support

0 1.00 1.00 1.00 34

1 1.00 0.97 0.98 32

2 0.97 1.00 0.99 39

avg / total 0.99 0.99 0.99 105

Test - Accuracy : 0.933333333333

Test - Confusion matrix : [[16 0 0]

[ 0 15 3]

[ 0 0 11]]

Test - classification report : precision recall f1-score support

0 1.00 1.00 1.00 16

1 1.00 0.83 0.91 18

2 0.79 1.00 0.88 11

avg / total 0.95 0.93 0.93 45

Listing 3-34.

Logistic regression model training and evaluation

Generalized Linear Models

GLM was an effort by John Nelder and Robert Wedderburn to unify various commonly used statistical models such as linear, logistic, and Poisson regression. See Table 3-12 and Listing 3-35.

Table 3-12.

Different GLM distribution family

Family – Description

Binomial – Target variable is a binary response.

Poisson – Target variable is a count of occurrences.

Gaussian – Target variable is a continuous number.

Gamma – This distribution arises when the waiting times between Poisson-distributed events are relevant, that is, the number of events occurring between two time periods.

InverseGaussian – The tails of the distribution decrease more slowly than the normal distribution, that is, there is an inverse relationship between the time required to cover a unit distance and the distance covered in unit time.

NegativeBinomial – Target variable denotes the number of successes in a sequence before a random failure.

df = pd.read_csv('Data/Grade_Set_1.csv')

print('####### Linear Regression Model ########')

# Create linear regression object

lr = lm.LinearRegression()

x= df.Hours_Studied[:, np.newaxis] # independent variable

y= df.Test_Grade.values # dependent variable

# Train the model using the training sets

lr.fit(x, y)

print "Intercept: ", lr.intercept_

print "Coefficient: ", lr.coef_

print('\n####### Generalized Linear Model ########')

import statsmodels.api as sm

# To be able to run GLM, we'll have to add the intercept constant to x variable

x = sm.add_constant(x, prepend=False)

# Instantiate a gaussian family model with the default link function.

model = sm.GLM(y, x, family = sm.families.Gaussian())

model = model.fit()

print model.summary()

#----output----

####### Linear Regression Model ########

Intercept: 49.6777777778

Coefficient: [ 5.01666667]

####### Generalized Linear Model ########

Generalized Linear Model Regression Results

==========================================================================

Dep. Variable: y No. Observations: 9

Model: GLM Df Residuals: 7

Model Family: Gaussian Df Model: 1

Link Function: identity Scale: 5.3626984127

Method: IRLS Log-Likelihood: -19.197

Date: Sun, 25 Dec 2016 Deviance: 37.539

Time: 21:27:42 Pearson chi2: 37.5

No. Iterations: 4

=========================================================================

coef std err z P>|z| [95.0% Conf. Int.]

-------------------------------------------------------------------------

x1 5.0167 0.299 16.780 0.000 4.431 5.603

const 49.6778 1.953 25.439 0.000 45.850 53.505

=========================================================================

Listing 3-35.

Generalized Linear Model

Note that the coefficients are the same for both linear regression and GLM. However GLM can be used for other distributions such as binomial, Poisson, etc., by just changing the family parameter.
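For example, to model a binary target the only change needed is the family argument. The snippet below is a sketch on a hypothetical 0/1 outcome array (not the grade dataset used above), assuming the default logit link.

import numpy as np
import statsmodels.api as sm

# hypothetical binary outcome for illustration
x = sm.add_constant(np.arange(10, dtype=float), prepend=False)
y_binary = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# same GLM call, different distribution family
model = sm.GLM(y_binary, x, family=sm.families.Binomial()).fit()
print(model.summary())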

Supervised Learning – Process Flow

At this point you have seen how to build a regression and a logistic regression model, so let me summarize the process flow for supervised learning in Figure 3-13.

Figure 3-13. Supervised learning process flow

First you train and validate a supervised model by applying machine learning techniques to historical data. Then you apply this model to a new dataset to predict the future values.

Decision Trees

In 1986, J. R. Quinlan published Induction of Decision Trees, summarizing an approach to synthesizing decision trees using machine learning with an illustrative example dataset, where the objective is to make a decision on whether to play outside on a Saturday morning. As the name suggests, a decision tree is a tree-like structure where internal nodes represent a test on an attribute, each branch represents the outcome of a test, and each leaf node represents a class label; the decision is made after computing all attributes. A path from root to leaf represents a classification rule. Thus, a decision tree consists of three types of nodes. See Figure 3-14.

Figure 3-14. J. R. Quinlan's example for synthesizing a decision tree

· Root node

· Branch node

· Leaf node (class label)

Decision tree model output is easy to interpret and it provides the rules that drive a decision or event; in the above use case we can get the rules that lead to a "don't play" scenario, that is, (1) sunny and temperature > 30°C, and (2) rainy and windy is true. Often businesses might be interested in these decision rules rather than the decision itself. For example, an insurance company might be interested in the rules or conditions under which an insurance applicant should be sent for a medical checkup, rather than feeding the applicant's data to a black box model to find the decision.

The training data is used to build a tree generator model, which determines which variable to split at each node and the value of the split. A decision to stop or split again assigns leaf nodes to a class. An advantage of a decision tree is that there is no need for the exclusive creation of dummy variables.

How the Tree Splits and Grows?

· The base algorithm is known as a greedy algorithm, in which the tree is constructed in a top-down recursive divide-and-conquer manner.

· At start, all the training examples are at the root.

· Input data is partitioned recursively based on selected attributes.

· Test attributes at each node are selected on the basis of a heuristic or statistical impurity measure, for example, Gini impurity or information gain (entropy).

· Gini = $$ 1 - {\displaystyle \sum_{i}}{\left({p}_{i}\right)}^2 $$, where pi is the probability of each label.

· Entropy = -p log2(p) – q log2(q), where p and q represent the probability of success/failure respectively in a given node (a short computation sketch of both measures follows this list).
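As a small illustration of the two impurity measures (the class proportions here are made up, not from the iris data used later):

import numpy as np

def gini(p):
    # p holds the class proportions at a node
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]               # skip empty classes (0 * log 0 is treated as 0)
    return -np.sum(p * np.log2(p))

# a node holding 80% of one class and 20% of the other
print("Gini    : %.3f" % gini([0.8, 0.2]))     # ~0.32
print("Entropy : %.3f" % entropy([0.8, 0.2]))  # ~0.72

Both measures are 0 for a pure node and largest when the classes are evenly mixed, which is why the tree prefers splits that reduce them the most.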

Conditions for Stopping Partitioning

· All samples for a given node belong to the same class.

· There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf.

· There are no samples left.

Note

Default criterion is “gini” as it’s comparatively faster to compute than “entropy”; however both measures give almost identical decisions on split. See Listing 3-36.

from sklearn import datasets

import numpy as np

import pandas as pd

from sklearn import tree
from sklearn import metrics  # used for the evaluation metrics below

iris = datasets.load_iris()

# X = iris.data[:, [2, 3]]

X = iris.data

y = iris.target

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

sc.fit(X)

X = sc.transform(X)

# split data into train and test

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = tree.DecisionTreeClassifier(criterion = 'entropy', random_state=0)

clf.fit(X_train, y_train)

# generate evaluation metrics

print "Train - Accuracy :", metrics.accuracy_score(y_train, clf.predict(X_train))

print "Train - Confusion matrix :",metrics.confusion_matrix(y_train, clf.predict(X_train))

print "Train - classification report :", metrics.classification_report(y_train, clf.predict(X_train))

print "Test - Accuracy :", metrics.accuracy_score(y_test, clf.predict(X_test))

print "Test - Confusion matrix :",metrics.confusion_matrix(y_test, clf.predict(X_test))

print "Test - classification report :", metrics.classification_report(y_test, clf.predict(X_test))

tree.export_graphviz(clf, out_file='tree.dot')

from sklearn.externals.six import StringIO

import pydot

out_data = StringIO()

tree.export_graphviz(clf, out_file=out_data,

feature_names=iris.feature_names,

class_names=clf.classes_.astype(int).astype(str),

filled=True, rounded=True,

special_characters=True,

node_ids=1,)

graph = pydot.graph_from_dot_data(out_data.getvalue())

graph[0].write_pdf("iris.pdf") # save to pdf

#----output----

Listing 3-36.

Decision tree model


Key Parameters for Stopping Tree Growth

One of the key issues with a decision tree is that the tree can grow very large, ending up creating one leaf per observation. The parameters below help control this growth; a short usage sketch follows the list.

max_features: maximum features to be considered while deciding each split, default=“None” which means all features will be considered

min_samples_split: split will not be allowed for nodes that do not meet this number

min_samples_leaf: leaf node will not be allowed for nodes less than the minimum samples

max_depth: no further split will be allowed, default=“None”
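A minimal usage sketch of these arguments is shown below; the specific values are arbitrary choices for illustration, not recommendations, and X_train/y_train are the arrays prepared in Listing 3-36.

from sklearn import tree

# restrict growth so the tree does not end up with one leaf per observation
clf = tree.DecisionTreeClassifier(criterion='gini',
                                  max_depth=3,
                                  min_samples_split=10,
                                  min_samples_leaf=5,
                                  max_features=None,
                                  random_state=0)
clf.fit(X_train, y_train)   # X_train / y_train as prepared in Listing 3-36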

Support Vector Machine (SVM)

Vladimir N. Vapnik and Alexey Ya. Chervonenkis proposed SVM in 1963. The key objective of SVM is to draw a hyperplane that separates the two classes optimally, such that the margin between the hyperplane and the observations is maximum. Figure 3-15 illustrates that different separating hyperplanes are possible; however the objective of SVM is to find the one that gives us the largest margin.

Figure 3-15. Support Vector Machine

To maximize the margin we need to minimize $$ \frac{1}{2}{\left\Vert w\right\Vert}^2 $$ subject to $$ {y}_{i}\left({w}^{T}{x}_{i} + b\right) - 1 \geq 0 $$ for all i.

The final SVM (dual) equation can be written mathematically as

$$ L = {\displaystyle {\sum}_{i}{\alpha}_{i}} - \frac{1}{2}{\displaystyle {\sum}_{i j}{\alpha}_{i}{\alpha}_{j}{y}_{i}{y}_{j}\left({\overline{X}}_{i}\cdot {\overline{X}}_{j}\right)} $$

Note

SVM is comparatively less prone to outliers than logistic regression as it only cares about the points that are closest to the decision boundary or support vectors.

Key Parameters

C: This is the penalty parameter and helps in fitting the boundaries smoothly and appropriately, default=1

Kernel: A kernel is a similarity function for pattern analysis. It must be one of rbf/linear/poly/sigmoid/precomputed, default=’rbf’ (Radial Basis Function). Choosing an appropriate kernel will result in a better model fit. See Listings 3-37 and 3-38.

from sklearn import datasets

import numpy as np

import pandas as pd

from sklearn import tree

from sklearn import metrics

iris = datasets.load_iris()

X = iris.data[:, [2, 3]]

y = iris.target

print('Class labels:', np.unique(y))

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

sc.fit(X)

X = sc.transform(X)

# split data into train and test

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

from sklearn.svm import SVC

clf = SVC(kernel='linear', C=1.0, random_state=0)

clf.fit(X_train, y_train)

# generate evaluation metrics

print "Train - Accuracy :", metrics.accuracy_score(y_train, clf.predict(X_train))

print "Train - Confusion matrix :",metrics.confusion_matrix(y_train, clf.predict(X_train))

print "Train - classification report :", metrics.classification_report(y_train, clf.predict(X_train))

print "Test - Accuracy :", metrics.accuracy_score(y_test, clf.predict(X_test))

print "Test - Confusion matrix :",metrics.confusion_matrix(y_test, clf.predict(X_test))

print "Test - classification report :", metrics.classification_report(y_test, clf.predict(X_test))

#----output----

Train - Accuracy : 0.952380952381

Train - Confusion matrix : [[34 0 0]

[ 0 30 2]

[ 0 3 36]]

Train - classification report : precision recall f1-score support

0 1.00 1.00 1.00 34

1 0.91 0.94 0.92 32

2 0.95 0.92 0.94 39

avg / total 0.95 0.95 0.95 105

Test - Accuracy : 0.977777777778

Test - Confusion matrix : [[16 0 0]

[ 0 17 1]

[ 0 0 11]]

Test - classification report : precision recall f1-score support

0 1.00 1.00 1.00 16

1 1.00 0.94 0.97 18

2 0.92 1.00 0.96 11

avg / total 0.98 0.98 0.98 45

Plotting Decision Boundary:

Let's consider a two-class example to keep things simple

Listing 3-37.

Support vector machine (SVM) model

# Let's use sklearn make_classification function to create some test data.

from sklearn.datasets import make_classification

X, y = make_classification(100, 2, 2, 0, weights=[.5, .5], random_state=0)

# build a simple linear SVM model

clf = SVC(kernel='linear', random_state=0)

clf.fit(X, y)

# get the separating hyperplane

w = clf.coef_[0]

a = -w[0] / w[1]

xx = np.linspace(-5, 5)

yy = a * xx - (clf.intercept_[0]) / w[1]

# plot the parallels to the separating hyperplane that pass through the

# support vectors

b = clf.support_vectors_[0]

yy_down = a * xx + (b[1] - a * b[0])

b = clf.support_vectors_[-1]

yy_up = a * xx + (b[1] - a * b[0])

# Plot the decision boundary (plot_decision_regions is a plotting helper assumed to be defined earlier in the chapter)

plot_decision_regions(X, y, classifier=clf)

# plot the line, the points, and the nearest vectors to the plane

plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=80, facecolors='none')

plt.plot(xx, yy_down, 'k--')

plt.plot(xx, yy_up, 'k--')

plt.xlabel('X1')

plt.ylabel('X2')

plt.legend(loc='upper left')

plt.tight_layout()

plt.show()

#----output----

Listing 3-38.

Ploting SVM decision boundaries


k Nearest Neighbors (kNN)

K nearest neighbor classification was developed from the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or difficult to determine. Fix and Hodges introduced a non-parametric method for pattern classification in 1951 that has since become known as the k nearest neighbor rule.

As the name suggests, the algorithm works based on a majority vote of the classes of its k nearest neighbors. In Figure 3-16, the k = 5 nearest neighbors of the unknown data point are identified based on the chosen distance measure, and the unknown point is classified based on the majority class among the identified nearest data points. The key drawback of kNN is the complexity of searching for the nearest neighbors of each sample. A small sketch of this voting rule follows the list below; see Listing 3-39 for the scikit-learn version.

Figure 3-16. k Nearest Neighbors with k = 5

Things to remember:

· Choose an odd k value for a two-class problem

· k must not be a multiple of the number of classes.
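Before using scikit-learn's implementation in Listing 3-39 below, here is a tiny sketch of the underlying majority-vote rule on made-up points (illustrative only; knn_predict is a hypothetical helper, not a library function):

import numpy as np
from collections import Counter

def knn_predict(X_known, y_known, new_point, k=5):
    # Euclidean distance from the new point to every known point
    dists = np.sqrt(((X_known - new_point) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]      # indices of the k closest points
    votes = Counter(y_known[nearest])    # majority vote among their labels
    return votes.most_common(1)[0][0]

# made-up two-class data
X_known = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_known = np.array([0, 0, 0, 1, 1, 1])
print("Predicted class: %d" % knn_predict(X_known, y_known, np.array([2, 2])))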

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')

clf.fit(X_train, y_train)

# generate evaluation metrics

print "Train - Accuracy :", metrics.accuracy_score(y_train, clf.predict(X_train))

print "Train - Confusion matrix :",metrics.confusion_matrix(y_train, clf.predict(X_train))

print "Train - classification report :", metrics.classification_report(y_train, clf.predict(X_train))

print "Test - Accuracy :", metrics.accuracy_score(y_test, clf.predict(X_test))

print "Test - Confusion matrix :",metrics.confusion_matrix(y_test, clf.predict(X_test))

print "Test - classification report :", metrics.classification_report(y_test, clf.predict(X_test))

#----output----

Train - Accuracy : 0.971428571429

Train - Confusion matrix : [[34 0 0]

[ 0 31 1]

[ 0 2 37]]

Train - classification report : precision recall f1-score support

0 1.00 1.00 1.00 34

1 0.94 0.97 0.95 32

2 0.97 0.95 0.96 39

avg / total 0.97 0.97 0.97 105

Test - Accuracy : 0.977777777778

Test - Confusion matrix : [[16 0 0]

[ 0 17 1]

[ 0 0 11]]

Test - classification report : precision recall f1-score support

0 1.00 1.00 1.00 16

1 1.00 0.94 0.97 18

2 0.92 1.00 0.96 11

avg / total 0.98 0.98 0.98 45

Listing 3-39.

k Nearest Neighbors model

Note

The decision tree, SVM, and kNN base algorithm concepts can essentially be applied to predict dependent variables that are continuous numbers in nature, and Scikit-learn provides DecisionTreeRegressor, SVR (support vector regressor), and KNeighborsRegressor for the same.
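A brief sketch of the regressor variants on made-up continuous data (the hyperparameter values here are arbitrary illustrations):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# made-up continuous target for illustration
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = np.sin(X).ravel()

for name, reg in [('DecisionTreeRegressor', DecisionTreeRegressor(max_depth=3)),
                  ('SVR', SVR(kernel='rbf')),
                  ('KNeighborsRegressor', KNeighborsRegressor(n_neighbors=5))]:
    reg.fit(X, y)
    print("%s R-squared: %.2f" % (name, reg.score(X, y)))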

Time-Series Forecasting

In simple terms, data points collected sequentially at regular intervals over a period of time are termed time-series data. A time series whose mean and variance are constant over time is called a stationary time series.

Time series tend to have a linear relationship between lagged variables, and this is called autocorrelation. Hence the historic data of a time series can be modeled to forecast future data points without the involvement of any other independent variables; these types of models are generally known as time-series forecasting models. Some key areas of application of time series include sales forecasting, economic forecasting, stock market forecasting, etc.

Components of Time Series

A time series can be made up of three key components. See Figure 3-17.

· Trend – A long-term increase or decrease is termed a trend.

· Seasonality – An effect of seasonal factors over a fixed or known period. For example, retail store sales will be high during weekends and festival seasons.

· Cycle – These are the longer ups and downs that are not of a fixed or known period and are caused by external factors.

Figure 3-17. Time series components

Autoregressive Integrated Moving Average (ARIMA)

ARIMA is one of the key and popular time-series models, so understanding the concept involved will set the base for you around time-series modeling.

Autoregressive Model (AR): As the name indicates, it is a regression of the variable against itself, that is, a linear combination of past values of the variable is used to forecast the future value.

$$ {y}_{t} = c + {\varPhi}_1{y}_{t-1} + {\varPhi}_2{y}_{t-2} + \dots + {\varPhi}_{n}{y}_{t-n} + {e}_{t} $$, where c is a constant, et is the random error, yt-1 is the value one period back (first-order lag), and yt-2 is the value two periods back (second-order lag).

Moving Average (MA): Instead of past values, past forecast errors are used to build the model.

$$ {y}_{t} = c + {\theta}_1{e}_{t-1} + {\theta}_2{e}_{t-2} + \dots + {\theta}_{n}{e}_{t-n} + {e}_{t} $$

An autoregressive (AR), moving average (MA) model with integration (the opposite of differencing) is called an ARIMA model.

$$ {y}_{t} = c + {\varPhi}_1{y}_{t-1} + {\varPhi}_2{y}_{t-2} + \dots + {\varPhi}_{p}{y}_{t-p} + {\theta}_1{e}_{t-1} + {\theta}_2{e}_{t-2} + \dots + {\theta}_{q}{e}_{t-q} + {e}_{t} $$

The predictors on the right side of the equation are the lagged values and the lagged errors; this is also known as the ARIMA(p, d, q) model. These are the key parameters of ARIMA, and picking the right values for p, d, and q will yield better model results.

p = order of the autoregressive part, that is, the number of lagged values of the series (unknown terms that multiply your signal at past times) included in the model.

d = degree of first differencing involved, that is, the number of times you have to difference the time series to make it stationary.

q = order of the moving average part, that is, the number of lagged forecast errors (unknown terms that multiply your forecast errors at past times) included in the model. See Listing 3-40.

Running ARIMA Model

· Plot the chart to ensure trend, cycle, or seasonality exists in the dataset.

· Stationarize the series: To stationarize the series we need to remove the trend (varying mean) and seasonality (varying variance) components. Moving averages and differencing can be used to stabilize the trend, whereas a log transform will stabilize the seasonal variance. Further, the Dickey–Fuller test can be used to assess the stationarity of the series; its null hypothesis is that the series is non-stationary (has a unit root), so a test result with p-value > 0.05 means the data is non-stationary.

· Find optimal parameter: Once the series is stationarized you can look at the Autocorrelation function (ACF) and Partial autocorrelation function (PACF) graphical plot to pick the number of AR or MA terms needed to remove autocorrelation. ACF is a bar chart between correlation coefficients and lags; similarly PACF is the bar chart between partial correlation (correlation between variable and lag of itself not explained by correlation at all lower-order lags) coefficient and lags.

· Build Model and Evaluate: Since a time series is a continuous number, Mean Absolute Error and Root Mean Squared Error can be used to evaluate the deviation between actual and predicted values on the train dataset. Other useful metrics are the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC); these come from information theory and estimate the quality of individual models given a collection of models, favoring models with smaller residual errors.

AIC = -2log(L) + 2(p+q+k+1) where L is the maximum likelihood function of fitted model and p, q, k are the number of parameters in the model

BIC = AIC+(log(T)−2)(p+q+k+1)

# Data Source: O.D. Anderson (1976), in file: data/anderson14, Description: Monthly sales of company X Jan '65 – May '71, C. Chatfield

df = pd.read_csv('Data/TS.csv')

ts = pd.Series(list(df['Sales']), index=pd.to_datetime(df['Month'],format='%Y-%m'))

from statsmodels.tsa.seasonal import seasonal_decompose

# log transform the series first (ts_log is reused in the later listings)
ts_log = np.log(ts)

decomposition = seasonal_decompose(ts_log)

trend = decomposition.trend

seasonal = decomposition.seasonal

residual = decomposition.resid

plt.subplot(411)

plt.plot(ts_log, label='Original')

plt.legend(loc='best')

plt.subplot(412)

plt.plot(trend, label='Trend')

plt.legend(loc='best')

plt.subplot(413)

plt.plot(seasonal,label='Seasonality')

plt.legend(loc='best')

plt.tight_layout()

Listing 3-40.

Decompose time series


Checking for Stationarity

from statsmodels.tsa.stattools import adfuller

# log transform

ts_log = np.log(ts)

ts_log.dropna(inplace=True)

s_test = adfuller(ts_log, autolag='AIC')

print "Log transform stationary check p value: ", s_test[1]

#Take first difference:

ts_log_diff = ts_log - ts_log.shift()

ts_log_diff.dropna(inplace=True)

plt.title('Trend removed plot with first order difference')

plt.plot(ts_log_diff)

plt.ylabel('First order log diff')

s_test = adfuller(ts_log_diff, autolag='AIC')

print "First order difference stationary check p value: ", s_test[1]

#----output----

Log transform stationary check p value: 0.785310212485

First order difference stationary check p value: 0.0240253928399

Listing 3-41.

Check stationary


Autocorrelation Test

We determined that the log of the time series requires at least one order of differencing to stationarize it. Now let's plot the ACF and PACF charts for the first-order differenced log series. See Listing 3-42.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (10,3))

# ACF chart

fig = sm.graphics.tsa.plot_acf(ts_log_diff.values.squeeze(), lags=20, ax=ax1)

# draw 95% confidence interval line

ax1.axhline(y=-1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')

ax1.axhline(y=1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')

ax1.set_xlabel('Lags')

# PACF chart

fig = sm.graphics.tsa.plot_pacf(ts_log_diff, lags=20, ax=ax2)

# draw 95% confidence interval line

ax2.axhline(y=-1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')

ax2.axhline(y=1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')

ax2.set_xlabel('Lags')

#----output----

Listing 3-42.

Check autocorrelation


The PACF plot has a significant spike only at lag 1, meaning that the higher-order autocorrelations are effectively explained by the lag-1 and lag-2 autocorrelations. The ideal lag values here are p = 2 and q = 2, that is, the lag values where the ACF/PACF charts cross the upper confidence interval for the first time.

Build Model and Evaluate

Let’s fit the ARIMA model on the dataset and evaluate the model performance as shown in Listing 3-43.

# build model

model = sm.tsa.ARIMA(ts_log, order=(2,0,2))

results_ARIMA = model.fit(disp=-1)

# in-sample predictions (needed for the error metrics below)
ts_predict = results_ARIMA.predict()

# Evaluate model

print "AIC: ", results_ARIMA.aic

print "BIC: ", results_ARIMA.bic

print "Mean Absolute Error: ", mean_absolute_error(ts_log.values, ts_predict.values)

print "Root Mean Squared Error: ", np.sqrt(mean_squared_error(ts_log.values, ts_predict.values))

# check autocorrelation

print "Durbin-Watson statistic :", sm.stats.durbin_watson(results_ARIMA.resid.values)

#----output-----

AIC: 7.85211053808

BIC: 21.9149430692

Mean Absolute Error: 0.167260085121

Root Mean Squared Error: 0.216145783845

Durbin-Watson statistic : 1.86457752659

Usual practice is to build several models with different p and q and select the one with the smallest values of AIC, BIC, MAE, and RMSE. Now let's increase p to 3 and see if there is any difference in the result.

model = sm.tsa.ARIMA(ts_log, order=(3,0,2))

results_ARIMA = model.fit(disp=-1)

# ts_predict = results_ARIMA.predict('1965-01-01', '1972-05-01', dynamic=True)

ts_predict = results_ARIMA.predict()

plt.title('ARIMA Prediction')

plt.plot(ts_log, label='Actual')

plt.plot(ts_predict, 'r--', label='Predicted')

plt.xlabel('Year-Month')

plt.ylabel('Sales')

plt.legend(loc='best')

print "AIC: ", results_ARIMA.aic

print "BIC: ", results_ARIMA.bic

print "Mean Absolute Error: ", mean_absolute_error(ts_log.values, ts_predict.values)

print "Root Mean Squared Error: ", np.sqrt(mean_squared_error(ts_log.values, ts_predict.values))

# check autocorrelation

print "Durbin-Watson statistic :", sm.stats.durbin_watson(results_ARIMA.resid.values)

#----output----

AIC: -7.78613717769

BIC: 8.62050077528

Mean Absolute Error: 0.167260085121

Root Mean Squared Error: 0.216145783845

Durbin-Watson statistic : 2.51941762513

Listing 3-43.

Build ARIMA model and evaluate


Let's try with first-order differencing, that is, d = 1. See Listing 3-44.

model = sm.tsa.ARIMA(ts_log, order=(3,1,2))

results_ARIMA = model.fit(disp=-1)

ts_predict = results_ARIMA.predict()

# Correction for differencing

predictions_ARIMA_diff = pd.Series(ts_predict, copy=True)

predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()

predictions_ARIMA_log = pd.Series(ts_log.ix[0], index=ts_log.index)

predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)

#----output----

AIC: -35.4189877386

BIC: -19.1038543566

Mean Absolute Error: 0.138765317903

Root Mean Squared Error: 0.183102425139

Durbin-Watson statistic : 1.94116742899

Listing 3-44.

ARIMA with first-order differencing


In the above chart we can see that the model is over-predicting in some places, and the AIC and BIC values are higher than for the previous model. Note: AIC/BIC can be positive or negative; however we should look at the absolute value for evaluation.

Predicting the Future Values

The values (p=3, d=0, q=2) give the smallest numbers for the evaluation metrics, so let's use this as the final model to predict the future values for the year 1972. See Listing 3-45.

# final model

model = sm.tsa.ARIMA(ts_log, order=(3,0,2))

results_ARIMA = model.fit(disp=-1)

# predict future values

ts_predict = results_ARIMA.predict('1971-06-01', '1972-05-01')

plt.title('ARIMA Future Value Prediction - order(3,0,2)')

plt.plot(ts_log, label='Actual')

plt.plot(ts_predict, 'r--', label='Predicted')

plt.xlabel('Year-Month')

plt.ylabel('Sales')

plt.legend(loc='best')

#----output----

Listing 3-45.

ARIMA predict function


Note

A minimum of 3 to 4 years’ worth of historic data is required to ensure the seasonal patterns are regular.

Unsupervised Learning Process Flow

The unsupervised learning process flow is given in Figure 3-18 below. Similar to supervised learning, we train a model and use it to predict on an unknown dataset; however, the key difference is that there is no predefined category or label available for the target variable, and the goal often is to create categories or labels based on the patterns available in the data.

Figure 3-18. Unsupervised learning process flow

Clustering

Clustering is an unsupervised learning problem. The key objective is to identify distinct groups (called clusters) based on some notion of similarity within a given dataset. The origins of cluster analysis can be traced to the areas of anthropology and psychology in the 1930s. The most popularly used clustering techniques are k-means (divisive) and hierarchical (agglomerative).

K-means

The key objective of a k-means algorithm is to organize data into clusters such that there is high intra-cluster similarity and low inter-cluster similarity. An item will belong to only one cluster, not several, that is, it generates a specific number of disjoint, non-hierarchical clusters. K-means uses the strategy of divide and conquer, and it is a classic example of an expectation maximization (EM) algorithm. EM algorithms are made up of two steps: the first step, known as expectation (E), is used to find the expected point associated with a cluster; the second step, known as maximization (M), is used to improve the estimation of the cluster using knowledge from the first step. The two steps are processed repeatedly until convergence is reached.

Suppose we have ‘n’ data points that we need to cluster into k (c1, c2, c3) groups. See Figure 3-19.

Figure 3-19. Expectation maximization algorithm work flow

Step 1: In the first step, k centroids (in the above case k=3) are randomly picked (only in the first iteration), and all the points nearest to each centroid are assigned to that specific cluster. The centroid is the arithmetic mean or average position of all the points.

Step 2: Here each centroid is recalculated using the average of the coordinates of all the points in its cluster. Then step one is repeated (assign the nearest points) until the clusters converge.

Note

K-means is designed for Euclidean distance only.

$$ Euclidean\ distance = d = \sqrt{{\displaystyle \sum_{i=1}^{N}{\left({X}_{i} - {Y}_{i}\right)}^2}} $$
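The two EM steps can be sketched directly with NumPy. This is an illustrative toy run on made-up 2-D points, not the scikit-learn implementation used in the listings below.

import numpy as np

np.random.seed(0)
# made-up 2-D points and k = 2 starting centroids picked from the data
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centroids = points[np.random.choice(len(points), 2, replace=False)]

for _ in range(10):
    # E-step: assign each point to its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # M-step: recompute each centroid as the mean of its assigned points
    centroids = np.array([points[labels == c].mean(axis=0) for c in range(2)])

print(labels)     # cluster assignment of each point
print(centroids)  # final cluster centers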

Limitations of K-means

· K-means clustering needs the number of clusters to be specified.

· K-means has problems when clusters are of differing sizes, densities, and non-globular shapes.

· The presence of outliers can skew the results.

Let's load the iris data and assume for a moment that the species column is missing, that is, we have the measured values for sepal length/width and petal length/width but do not know how many species exist.

Now let's use unsupervised learning, that is, clustering, to find out how many species exist. The goal here is to group all similar items into a cluster. We can assume a k of 3 for now; we'll learn later about approaches to finding the value of k. See Listings 3-46 and 3-47.

from sklearn import datasets

import numpy as np

import pandas as pd

from sklearn.cluster import KMeans

from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# Let's convert to dataframe

iris = pd.DataFrame(data= np.c_[iris['data'], iris['target']],

columns= iris['feature_names'] + ['species'])

# let's remove spaces from column name

iris.columns = iris.columns.str.replace(' ','')

iris.head()

X = iris.ix[:,:3] # independent variables

y = iris.species # dependent variable

sc = StandardScaler()

sc.fit(X)

X = sc.transform(X)

# K Means Cluster

model = KMeans(n_clusters=3, random_state=11)

model.fit(X)

print model.labels_

# ----output----

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 0 1 1 1 1 1 1 1 1 2 2 2 0 2 0 2 0 2 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0

2 2 2 2 0 0 0 0 0 0 0 2 2 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2

2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 2 2 2 2 2 2 2 0 2 2 2 0 2 2 2 0 2 2 2 0 2

2 0]

We see that the clustering algorithm has assigned a cluster label for each record. Let’s compare this with the actual species label to understand the accuracy of grouping similar records.

Listing 3-46.

k-means clustering

# since it's unsupervised, the cluster labels have not been assigned
# in line with the actual labels, so let's convert all the 1s to 0s and 0s to 1s
# (the 2s look fine)

iris['pred_species'] = np.choose(model.labels_, [1, 0, 2]).astype(np.int64)

print "Accuracy :", metrics.accuracy_score(iris.species, iris.pred_species)

print "Classification report :", metrics.classification_report(iris.species, iris.pred_species)

# Set the size of the plot

plt.figure(figsize=(10,7))

# Create a colormap

colormap = np.array(['red', 'blue', 'green'])

# Plot Sepal

plt.subplot(2, 2, 1)

plt.scatter(iris['sepallength(cm)'], iris['sepalwidth(cm)'], c=colormap[iris.species], marker='o', s=50)

plt.xlabel('sepallength(cm)')

plt.ylabel('sepalwidth(cm)')

plt.title('Sepal (Actual)')

plt.subplot(2, 2, 2)

plt.scatter(iris['sepallength(cm)'], iris['sepalwidth(cm)'], c=colormap[iris.pred_species], marker='o', s=50)

plt.xlabel('sepallength(cm)')

plt.ylabel('sepalwidth(cm)')

plt.title('Sepal (Predicted)')

plt.subplot(2, 2, 3)

plt.scatter(iris['petallength(cm)'], iris['petalwidth(cm)'], c=colormap[iris.species],marker='o', s=50)

plt.xlabel('petallength(cm)')

plt.ylabel('petalwidth(cm)')

plt.title('Petal (Actual)')

plt.subplot(2, 2, 4)

plt.scatter(iris['petallength(cm)'], iris['petalwidth(cm)'], c=colormap[iris.pred_species],marker='o', s=50)

plt.xlabel('petallength(cm)')

plt.ylabel('petalwidth(cm)')

plt.title('Petal (Predicted)')

plt.tight_layout()

#----output----

Accuracy : 0.806666666667

Classification report : precision recall f1-score support

0.0 1.00 0.98 0.99 50

1.0 0.71 0.70 0.71 50

2.0 0.71 0.74 0.73 50

avg / total 0.81 0.81 0.81 150

Listing 3-47.

Accuracy of k-means clustering


We can see from the above chart that k-means has done a decent job of clustering the similar labels, with an accuracy of about 80% compared to the actual labels.

Finding Value of k

Two methods are commonly used to determine the value of k.

· Elbow method

· Average silhouette method

Elbow Method

Perform k-means clustering on the dataset for a range of values of k (for example, 1 to 10) and calculate the sum of squared errors (SSE) or the percentage of variance explained for each k. Plot a line chart of cluster number vs. SSE and look for an elbow shape on the line graph, which indicates the ideal number of clusters. With an increase in k, the SSE tends to decrease toward 0; the SSE is zero when k equals the total number of data points in the dataset, as at that stage each data point becomes its own cluster and no error exists between a cluster and its center. So the goal with the elbow method is to choose a small value of k that still has a low SSE, and the elbow usually represents this value. The percentage of variance explained tends to increase with k, and again we pick the point where the elbow shape appears. See Listing 3-48.

from scipy.spatial.distance import cdist, pdist

from sklearn.cluster import KMeans

K = range(1,10)

KM = [KMeans(n_clusters=k).fit(X) for k in K]

centroids = [k.cluster_centers_ for k in KM]

# Euclidean distance from every point to each centroid, for each value of k
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]

cIdx = [np.argmin(D,axis=1) for D in D_k]    # index of the closest centroid for each point

dist = [np.min(D,axis=1) for D in D_k]       # distance of each point to its closest centroid

# Average within-cluster distance for each k
avgWithinSS = [sum(d)/X.shape[0] for d in dist]

# Total within-cluster sum of squares
wcss = [sum(d**2) for d in dist]

tss = sum(pdist(X)**2)/X.shape[0]    # total sum of squares

bss = tss - wcss                     # between-cluster sum of squares

varExplained = bss/tss*100           # percentage of variance explained

##### plot ###

# highlight the elbow point (k = 3, i.e., index 2 in the K range)
kIdx = 2

# elbow curve

# Set the size of the plot

plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)

plt.plot(K, avgWithinSS, 'b*-')

plt.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12,

markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')

plt.grid(True)

plt.xlabel('Number of clusters')

plt.ylabel('Average within-cluster sum of squares')

plt.title('Elbow for KMeans clustering')

plt.subplot(1, 2, 2)

plt.plot(K, varExplained, 'b*-')

plt.plot(K[kIdx], varExplained[kIdx], marker='o', markersize=12,

markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')

plt.grid(True)

plt.xlabel('Number of clusters')

plt.ylabel('Percentage of variance explained')

plt.title('Elbow for KMeans clustering')

plt.tight_layout()

#----output----

Listing 3-48.

Elbow method

[Figure: elbow curves of average within-cluster sum of squares and percentage of variance explained vs. number of clusters]
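As a simpler alternative to the distance calculations above, a similar elbow curve can be drawn directly from KMeans' inertia_ attribute (the within-cluster sum of squares). This is a minimal sketch, assuming X from the earlier listings is already loaded:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# within-cluster sum of squares (inertia) for k = 1 to 9
ks = range(1, 10)
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, 'b*-')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.title('Elbow for KMeans clustering')
plt.show()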

Average Silhouette Method

In 1986, Peter J. Rousseeuw described the silhouette method, which aims to measure the consistency within cluster data. The silhouette value ranges between -1 and 1, and a high value indicates that items are well matched within their own cluster and weakly matched to neighboring clusters. See Listing 3-49.

s(i) = (b(i) – a(i)) / max{a(i), b(i)}, where a(i) is the average dissimilarity of the ith item to the other points in its own cluster, and b(i) is the lowest average dissimilarity of i to the points of any other cluster of which i is not a member. For example, if a(i) = 0.2 and b(i) = 0.8, then s(i) = (0.8 – 0.2) / 0.8 = 0.75, indicating the item is well matched to its own cluster.

from sklearn.metrics import silhouette_score, silhouette_samples

from matplotlib import cm

score = []

for n_clusters in range(2,10):
    kmeans = KMeans(n_clusters=n_clusters)
    kmeans.fit(X)
    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_
    score.append(silhouette_score(X, labels, metric='euclidean'))

# Set the size of the plot

plt.figure(figsize=(10,4))

plt.subplot(1, 2, 1)

plt.plot(range(2,10), score)   # average silhouette score for each k

plt.grid(True)

plt.ylabel("Silouette Score")

plt.xlabel("k")

plt.title("Silouette for K-means")

# Initialize the clusterer with n_clusters value and a random generator

model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)

model.fit_predict(X)

cluster_labels = np.unique(model.labels_)

n_clusters = cluster_labels.shape[0]

# Compute the silhouette scores for each sample

silhouette_vals = silhouette_samples(X, model.labels_)

plt.subplot(1, 2, 2)

y_lower, y_upper = 0,0

yticks = []

for i, c in enumerate(cluster_labels):
    # silhouette values of the samples belonging to cluster c
    c_silhouette_vals = silhouette_vals[model.labels_ == c]
    c_silhouette_vals.sort()
    y_upper += len(c_silhouette_vals)
    color = cm.spectral(float(i) / n_clusters)
    plt.barh(range(y_lower, y_upper), c_silhouette_vals, facecolor=color, edgecolor=color, alpha=0.7)
    yticks.append((y_lower + y_upper) / 2)
    y_lower += len(c_silhouette_vals)

silhouette_avg = np.mean(silhouette_vals)

plt.yticks(yticks, cluster_labels+1)

# The vertical line for average silhouette score of all the values

plt.axvline(x=silhouette_avg, color="red", linestyle="--")

plt.ylabel('Cluster')

plt.xlabel('Silhouette coefficient')

plt.title("Silouette for K-means")

plt.show()

#---output----

Listing 3-49.

Silhouette method

[Figure: average silhouette score vs. k, and silhouette plot for k = 3]
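To pick k programmatically from the average silhouette scores, the k with the highest score is a reasonable choice. A minimal sketch, assuming the score list computed in the listing above:

import numpy as np

# score[i] holds the average silhouette for k = i + 2 (k ranged from 2 to 9)
ks = range(2, 10)
best_k = ks[int(np.argmax(score))]
print("Best k by average silhouette score: %d" % best_k)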

Hierarchical Clustering

Agglomerative clustering is a hierarchical clustering technique that builds nested clusters with a bottom-up approach: each data point starts in its own cluster, and as we move up the hierarchy, pairs of clusters are merged based on a linkage (distance) criterion.

Key Parameters

n_clusters: number of clusters to find, default is 2.

linkage: one of 'ward', 'complete', or 'average'; default is 'ward'.

Let’s understand each linkage option a bit more. Ward’s method merges the pair of clusters that results in the minimum increase in within-cluster variance (sum of squared errors). The ‘average’ method uses all pairwise distances between the points of the two clusters and is less affected by outliers. The ‘complete’ method uses the distance between the farthest elements of the two clusters, so it is also known as maximum linkage. See Figure 3-20 and Listing 3-50.


Figure 3-20.

Agglomerative clustering linkage
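Before running the full example in Listing 3-50, the effect of the linkage choice can be explored with a quick sketch such as the one below. This is an illustrative aside, assuming X and the iris DataFrame from the earlier listings; the adjusted Rand index is used because raw cluster labels are arbitrary:

from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering

# Fit one model per linkage option and compare the grouping against the
# known species labels (possible here only because iris happens to be labeled).
for linkage_method in ('ward', 'complete', 'average'):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage_method)
    labels = model.fit_predict(X)
    print("%s linkage, adjusted Rand index: %.3f"
          % (linkage_method, metrics.adjusted_rand_score(iris.species, labels)))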

from sklearn.cluster import AgglomerativeClustering

# Agglomerative Cluster

model = AgglomerativeClustering(n_clusters=3)

model.fit(X)

iris['pred_species'] = model.labels_

print "Accuracy :", metrics.accuracy_score(iris.species, iris.pred_species)

print "Classification report :", metrics.classification_report(iris.species, iris.pred_species)

#----outout----

Accuracy : 0.773333333333

Classification report : precision recall f1-score support

0.0 1.00 0.98 0.99 50

1.0 0.64 0.74 0.69 50

2.0 0.70 0.60 0.65 50

avg / total 0.78 0.77 0.77 150

The arrangement of hierarchical clustering results can be better interpreted with a dendrogram visualization. SciPy provides the necessary functions for dendrogram visualization (scikit-learn currently lacks these functions).

from scipy.cluster.hierarchy import cophenet, dendrogram, linkage

from scipy.spatial.distance import pdist

# generate the linkage matrix

Z = linkage(X, 'ward')

# cophenetic correlation: how faithfully the dendrogram preserves the pairwise distances
c, coph_dists = cophenet(Z, pdist(X))

# calculate full dendrogram

plt.figure(figsize=(25, 10))

plt.title('Agglomerative Hierarchical Clustering Dendrogram')

plt.xlabel('sample index')

plt.ylabel('distance')

dendrogram(Z,
           leaf_rotation=90.,   # rotates the x axis labels
           leaf_font_size=8.)   # font size for the x axis labels

plt.tight_layout()

#----output----

Listing 3-50.

Hierarchical clustering

[Figure: agglomerative hierarchical clustering dendrogram]

Since we know that k=3, we can cut the tree at a distance threshold of around 10 to get exactly 3 distinct clusters.
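A minimal sketch of this cut, using SciPy's fcluster on the linkage matrix Z from Listing 3-50 (the threshold of 10 is read off the dendrogram above):

import numpy as np
from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at a distance threshold of 10 to obtain flat cluster labels
flat_clusters = fcluster(Z, t=10, criterion='distance')
print("Number of clusters found: %d" % len(np.unique(flat_clusters)))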

Principal Component Analysis (PCA)

The presence of a large number of features or dimensions makes analysis computationally intensive and makes machine learning tasks such as pattern identification harder. PCA is the most popular unsupervised linear transformation technique for dimensionality reduction. PCA finds the directions of maximum variance in high-dimensional data and projects it onto a smaller dimensional subspace such that most of the information is retained. See Figure 3-21.


Figure 3-21.

Principal Component Analysis

The PCA approach can be summarized as follows (see Listing 3-51):

· Standardize the data.

· Use the standardized data to generate the covariance matrix or correlation matrix.

· Perform eigen decomposition, that is, compute the eigenvectors (the principal components, which give the directions) and the eigenvalues (which give the magnitudes).

· Sort the eigen pairs and select the eigenvectors with the largest eigenvalues that cumulatively capture information above a certain threshold (say, 95%).

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

iris = load_iris()

X = iris.data

# standardize data

X_std = StandardScaler().fit_transform(X)

# create covariance matrix

cov_mat = np.cov(X_std.T)

print('Covariance matrix \n%s' %cov_mat)

eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)

print('\nEigenvalues \n%s' %eig_vals)

# sort eigenvalues in decreasing order

eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

tot = sum(eig_vals)

# percentage of variance explained by each component, largest first
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]

# cumulative variance explained
cum_var_exp = np.cumsum(var_exp)

print "Cumulative Variance Explained", cum_var_exp

plt.figure(figsize=(6, 4))

plt.bar(range(4), var_exp, alpha=0.5, align='center',

label='Individual explained variance')

plt.step(range(4), cum_var_exp, where='mid',

label='Cumulative explained variance')

plt.ylabel('Explained variance ratio')

plt.xlabel('Principal components')

plt.legend(loc='best')

plt.tight_layout()

plt.show()

#----output----

Covariance matrix

[[ 1.00671141 -0.11010327 0.87760486 0.82344326]

[-0.11010327 1.00671141 -0.42333835 -0.358937 ]

[ 0.87760486 -0.42333835 1.00671141 0.96921855]

[ 0.82344326 -0.358937 0.96921855 1.00671141]]

Eigenvectors

[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]

[-0.26335492 -0.92555649 0.24203288 -0.12413481]

[ 0.58125401 -0.02109478 0.14089226 -0.80115427]

[ 0.56561105 -0.06541577 0.6338014 0.52354627]]

Eigenvalues

[ 2.93035378 0.92740362 0.14834223 0.02074601]

Cumulative Variance Explained [  72.77045   95.80097   99.48480  100.     ]

Listing 3-51.

Principal component analysis

A434293_1_En_3_Figal_HTML

In the above plot we can see that the first three principal components explain about 99% of the variance. Let’s perform PCA using scikit-learn and plot the first three eigenvectors. See Listing 3-52.

# source: http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html#

import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

from sklearn import datasets

from sklearn.decomposition import PCA

# import some data to play with

iris = datasets.load_iris()

Y = iris.target

# To get a better understanding of the interaction of the dimensions

# plot the first three PCA dimensions

fig = plt.figure(1, figsize=(8, 6))

ax = Axes3D(fig, elev=-150, azim=110)

X_reduced = PCA(n_components=3).fit_transform(iris.data)

ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y, cmap=plt.cm.Paired)

ax.set_title("First three PCA directions")

ax.set_xlabel("1st eigenvector")

ax.w_xaxis.set_ticklabels([])

ax.set_ylabel("2nd eigenvector")

ax.w_yaxis.set_ticklabels([])

ax.set_zlabel("3rd eigenvector")

ax.w_zaxis.set_ticklabels([])

plt.show()

#---output----

Listing 3-52.

Visualize PCA

[Figure: 3D scatter plot of the first three PCA directions of the Iris dataset]
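As a quick cross-check of the manual eigen decomposition above, scikit-learn's PCA exposes the same information through its explained_variance_ratio_ attribute. A minimal sketch, assuming standardized iris data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris_data = load_iris()
X_std = StandardScaler().fit_transform(iris_data.data)

pca = PCA(n_components=4)
pca.fit(X_std)

# proportion of variance captured by each principal component
print("Explained variance ratio : %s" % pca.explained_variance_ratio_)
print("Cumulative : %s" % np.cumsum(pca.explained_variance_ratio_))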

Endnotes

With this we have reached the end of step 3. We briefly learned different fundamental machine learning concepts and their implementation. Data quality is an important aspect of building efficient machine learning systems; in line with this we learned about different types of data, commonly practiced EDA techniques for understanding data quality, and the fundamental preprocessing techniques to fix data gaps. Supervised models such as linear and nonlinear regression techniques are useful for modeling patterns to predict continuous numerical data types, whereas logistic regression, decision trees, SVM, and kNN are useful for modeling classification problems (functions are available for regression as well). You also learned about ARIMA, one of the key time-series forecasting models. Unsupervised techniques such as k-means and hierarchical clustering are useful for grouping similar items, whereas principal component analysis can be used to reduce high-dimensional data to a lower dimension to enable efficient computation.

In the next step you’ll learn how to pick the best parameters for a model, widely known as “hyperparameter tuning,” to improve model accuracy, along with the common practices followed to pick the best model among multiple models for a given problem. You'll also learn to combine multiple models to get the best from each individual model.
