Notes about the IBM course “Machine Learning with Python”

Machine Learning with Python

Linear Regression

In linear regression, the dependent variable (target) must be continuous; it cannot be discrete. The independent variables, however, can be measured on either a categorical or a continuous scale. Linear Regression is used to predict continuous values, for example :

  • Sales forecasting ( Target : total yearly sales. Independent variables : Age, Education, Years of experience )
  • Satisfaction analysis ( Target : Individual satisfaction. Independent variables : Demographic and psychological factors )
  • Price estimation ( Target : Price of a house in an area. Independent variables : Size, number of bedrooms, …)
  • Employment income ( Target : Income. Independent variables : Hours of work, education, occupation, sex, age, years of experience, …)

There are many Regression Algorithms, to name a few :

  • Ordinal regression
  • Poisson regression
  • Fast forest quantile regression
  • Linear, polynomial, Lasso, Stepwise, Ridge regression
  • Bayesian linear regression
  • Neural network regression
  • Decision forest regression
  • Boosted decision tree regression
  • KNN ( K-nearest Neighbors )

Simple Linear Regression

Simple Linear Regression uses one independent variable to estimate a continuous dependent variable. When more than one independent variable is used, the process is called Multiple Linear Regression. Pros of Linear Regression :

  • Very Fast
  • No parameter tuning
  • Easy to understand and highly interpretable

Fitting Line

The fit line is the line that goes through the data and represents the trend of how the dependent variable evolves with the independent variable. It is usually written as a first-degree polynomial :

=> \(\boxed{\hat{y} = \overbrace{\theta_0 + \theta_1}^{\text{parameters}}\underbrace{x_1}_{\text{predictor}}}\)

\(\hat{y}\) = dependent variable or the predicted value
\(x_1\) = independent variable

Coefficients of the linear equation :
\(\theta_0\) = intercept
\(\theta_1\) = slope or gradient of the fitting line

For each data point, there is a difference (the residual error) between the predicted value and the actual value in our data set. The best fit is the line for which the mean of all the squared residual errors is the lowest. See below for the definition of the Mean Squared Error (MSE). The objective of linear regression is to minimize this MSE by finding the best parameters \(\theta_0\) and \(\theta_1\).

Calculating \(\theta_0\) and \(\theta_1\)

Linear regression estimates the coefficients of the line. To do so, we must calculate \(\theta_0\) and \(\theta_1\) (adjust the parameters) to find the line that best fits the data.

\[\hat{y} = \color{orange}{\theta_0} + \color{cyan}{\theta_1x_1}\]

\(\color{cyan}{\theta_1 = \cfrac{\sum_{i=1}^{s}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s}(x_i - \bar{x})^2}}\) (slope)

\(\color{orange}{\theta_0 = \bar{y} - \theta_1\bar{x}}\) (Bias Coefficient)

\(\bar{x}\) = mean of the independent variable (feature)
\(\bar{y}\) = mean of the dependent variable (target)

Here \(\theta_0\) is the intercept and \(\theta_1\) is the slope. The sums go through all the data points one by one : \(i\) is the index of the current value and \(s\) the total number of values.
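The two formulas above can be sketched directly in Python. This is a minimal illustration that uses only the standard library; the function name `fit_simple_linear_regression` and the toy data are assumptions made for the example and do not come from the course.

```python
from statistics import mean


def fit_simple_linear_regression(xs, ys):
    """Return (theta_0, theta_1) for the line y_hat = theta_0 + theta_1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: sum of co-deviations divided by sum of squared x deviations.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    theta_1 = numerator / denominator
    # Intercept (bias coefficient): the line passes through (x_bar, y_bar).
    theta_0 = y_bar - theta_1 * x_bar
    return theta_0, theta_1


# Hypothetical toy data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta_0, theta_1 = fit_simple_linear_regression(xs, ys)
print(theta_0, theta_1)  # 1.0 2.0
```

Because the toy points lie exactly on a line, the recovered intercept and slope match that line; with noisy data the same function returns the least-squares estimates.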

Error and Accuracy

To evaluate our model, we need to compare our predictions ( \(\hat{y}\) ) with the actual values ( \(y\) ). If we train and test on the same dataset, we end up with a model that has a high training accuracy but a low out-of-sample accuracy. This is called over-fitting : the model is overly fitted to the dataset, so it may capture noise and fail to generalize. To avoid over-fitting, we split the dataset into a training set and a testing set, an approach called “Train/Test split”. In this approach, the training set and the testing set are mutually exclusive, which provides a more accurate evaluation of the out-of-sample accuracy.
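A Train/Test split can be sketched in a few lines of plain Python. The helper name `split_train_test`, the 20 % test ratio and the fixed seed are assumptions chosen for the illustration; in practice, scikit-learn provides `train_test_split` in `sklearn.model_selection` for the same purpose.

```python
import random


def split_train_test(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the samples, then cut it into disjoint train and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 10 sample identifiers.
data = list(range(10))
train, test = split_train_test(data, test_ratio=0.2)
print(len(train), len(test))  # 8 2
```

The model is then fitted on `train` only and evaluated on `test`, so no sample is used for both.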

K-Fold cross-validation

One downside of that approach is that the result is highly dependent on which part of the data is used for training and which for testing. K-Fold cross-validation overcomes this downside. It performs multiple (K) Train/Test splits on the same dataset, using a different test sample each time, and averages the accuracy over the different splits. This provides a more consistent out-of-sample accuracy.
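As a rough sketch of the idea, the folds can be generated from index ranges. The names `k_fold_indices`, `cross_validate` and the `evaluate` callback are hypothetical, and the sketch assumes the data is already shuffled, since it cuts contiguous folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train_idx, test_idx) over the k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Each sample lands in exactly one test fold.
print([len(test_idx) for _, test_idx in k_fold_indices(10, 4)])  # [3, 3, 2, 2]
```

Here `evaluate` stands for whatever trains the model on the training indices and returns its score on the test indices.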

Evaluation metrics

The error of the model is the difference between the data points and the trend line generated by the algorithm. The choice of metric completely depends on the type of model, your data type and domain of knowledge.

Mean Absolute Error (MAE)

It is the average of the absolute values of all errors. This metric is used when you want all errors to have the same weight. It is usually used to get a simple and intuitive measure of performance.
MAE = \(\cfrac{1}{n}\sum_{j=1}^n |y_j - \hat{y}_j|\)

MAE = \(\cfrac{|y_1 - \hat{y}_1| + |y_2 - \hat{y}_2| + |y_3 - \hat{y}_3| + ... + |y_n - \hat{y}_n|}{n}\)
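A minimal Python sketch of the MAE follows. The function name and the toy values are assumptions made for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]
print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
```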

Mean Squared Error (MSE)

Very popular, as it gives more weight to large errors. This metric is used when huge errors are more critical and need to be avoided ( health, critical infrastructure, finance ).

MSE = \(\cfrac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2\)

or

MSE = \(\cfrac{RSS}{n}\)

with \(n\) the number of observations

Root Mean Squared Error (RMSE)

Popular as well, as it is expressed in the same unit as the response vector \(y\) (like MAE) while still punishing large errors (like MSE), which makes it easy to interpret.

RMSE = \(\sqrt{\cfrac{1}{n}\sum_{j=1}^n(y_j - \hat{y}_j)^2}\)

Residual Sum of Squares (RSS)

This metric is the sum of the squared residual errors : the lower, the better.

RSS = \(\sum_{i=1}^n(y_i - \hat{y}_i)^2\)
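The three squared-error metrics above (MSE, RMSE and RSS) are closely related, which a short Python sketch makes visible. The function names and toy values are assumptions made for illustration.

```python
import math


def residual_sum_of_squares(y_true, y_pred):
    """RSS: sum of the squared differences between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))


def mean_squared_error(y_true, y_pred):
    """MSE: the RSS divided by the number of observations."""
    return residual_sum_of_squares(y_true, y_pred) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """RMSE: square root of the MSE, in the same unit as y."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Hypothetical actual values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]
print(residual_sum_of_squares(y_true, y_pred))  # 0.25 + 0 + 1 + 0.25 = 1.5
print(mean_squared_error(y_true, y_pred))       # 1.5 / 4 = 0.375
print(root_mean_squared_error(y_true, y_pred))
```

On this toy data the single error of 1 contributes two thirds of the RSS, which illustrates how squaring emphasizes large errors.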

Relative Absolute Error (RAE)

RAE takes the total absolute error and normalizes it by dividing by the total absolute error of a simple predictor that always predicts \(\bar{y}\), the mean value of \(y\). It compares the error of the model against the error of predicting the mean of all targets.

if RAE < 1 : The model is better than a prediction using the average of all targets.
if RAE = 1 : The model is as good ( or as bad ) as using the average of the real values.
if RAE > 1 : The model is worse than using the average.

It can be used in prediction of energy consumption, sales per month, etc
RAE = \(\cfrac{\sum_{j=1}^n|y_j - \hat{y}_j|}{\sum_{j=1}^n|y_j - \bar{y}|}\)
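The RAE formula can be sketched in Python as follows. The function name and the toy values are assumptions made for illustration.

```python
def relative_absolute_error(y_true, y_pred):
    """Total absolute error of the model divided by that of always predicting the mean."""
    y_bar = sum(y_true) / len(y_true)
    model_error = sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred))
    baseline_error = sum(abs(y - y_bar) for y in y_true)
    return model_error / baseline_error


# Hypothetical actual values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]
print(relative_absolute_error(y_true, y_pred))  # 2.0 / 8.0 = 0.25
```

A value of 0.25 is below 1, so on this toy data the predictions beat the mean baseline. Note that the ratio is undefined when all actual values are equal, because the denominator is then zero.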

Relative Squared Error (RSE)

Similar to RAE, but this time focused on large errors, and widely adopted by the data science community.

RSE = \(\cfrac{\sum_{j=1}^n(y_j - \hat{y}_j)^2}{\sum_{j=1}^n(y_j - \bar{y})^2}\)

if RSE < 1 : The model is better than a prediction using the average of all targets.
if RSE = 1 : The model is as good ( or as bad ) as using the average of the real values.
if RSE > 1 : The model is worse than using the average.

The RSE is often used in Data Science as it is needed to obtain the \(R^2\) value, which is used to evaluate the performance of a model. \(R^2\) represents how close the data values are to the fitted regression line. The higher it is, the better your model fits the data.

\[R^2 = 1 - RSE\]

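The Total Sum of Squares (TSS), \(\sum_{j=1}^n(y_j - \bar{y})^2\), expresses the variance inside the data. Since it is the denominator of the RSE and the RSS is its numerator, \(R^2\) can also be expressed this way :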

\[R^2 = 1 - \cfrac{RSS}{TSS}\]
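The relationship between RSE, RSS, TSS and \(R^2\) can be checked with a small Python sketch. The function names and toy values are assumptions made for illustration.

```python
def relative_squared_error(y_true, y_pred):
    """RSE = RSS / TSS: squared error of the model relative to predicting the mean."""
    y_bar = sum(y_true) / len(y_true)
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    tss = sum((y - y_bar) ** 2 for y in y_true)
    return rss / tss


def r_squared(y_true, y_pred):
    """R^2 = 1 - RSE."""
    return 1.0 - relative_squared_error(y_true, y_pred)


# Hypothetical actual values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]
print(relative_squared_error(y_true, y_pred))  # RSS = 1.5, TSS = 20, so about 0.075
print(r_squared(y_true, y_pred))               # about 0.925
```

Predicting the mean for every sample gives RSE = 1 and therefore \(R^2 = 0\), which matches the interpretation of the thresholds above.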

Multiple Linear Regression

When multiple independent variables are used to predict one dependent variable, the process is called Multiple Linear Regression.

Basically there are two applications for Multiple Linear Regression :

  • Effectiveness of the independent variables on the prediction. Does this, this or that impact the prediction ?
  • Predicting the impact of changes. How much is the dependent variable affected when we modify one independent variable ?

It uses multiple independent variables ( or predictors ) to predict a continuous value, the dependent variable.

In Multiple Linear Regression, the target value Y is a linear combination of the independent variables X.

Generally the model is of the form :

-> \(\hat{y} = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n\)

Vector form : \(\hat{y} = \theta^TX\)

It can be shown as a dot product of two vectors : the parameters vector and the feature set vector.

Generally, we can write the equation for a multi-dimensional space as \(\theta\) transposed times \(X\), where \(\theta\) is an n by 1 vector of unknown parameters in a multi-dimensional space and \(X\) is the vector of the feature set.
\(\theta\) is a vector of coefficients that is multiplied by \(X\).

Conventionally it is shown as transposed theta : \(\theta^T = [\theta_0, \theta_1, \theta_2, ...]\)

\(\theta\) is also called the parameter vector or weight vector of the regression equation.

and \(X\) is the feature set : \(X = \begin{bmatrix}1 \\X_1 \\X_2 \\...\end{bmatrix}\)

Here \(X_1\) could be the engine size, \(X_2\) the number of cylinders, and so on.
The first element of the feature set is set to 1 as it turns the \(\theta_0\) into the intercept or bias parameter when the vector is multiplied by the parameter vector.
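The dot product \(\theta^TX\) can be sketched in plain Python. The function name `predict` and the parameter values are hypothetical and chosen only to make the arithmetic easy to follow; they are not fitted coefficients.

```python
def predict(theta, features):
    """Compute theta^T X, prepending the constant 1 that pairs with theta_0."""
    x = [1.0] + list(features)
    if len(theta) != len(x):
        raise ValueError("theta needs one entry per feature plus the intercept")
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical parameters: intercept, weight of engine size, weight of cylinders.
theta = [10.0, 2.0, 3.0]
features = [2.0, 4.0]  # engine size, number of cylinders
print(predict(theta, features))  # 10*1 + 2*2 + 3*4 = 26.0
```

The leading 1 is what lets the intercept be handled as just another term of the dot product.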

\(\theta^TX\) in a 1-dimensional space is the equation of a line. We used it in Simple Linear Regression. In higher dimensions, when we have more than 1 input (or X), the line becomes a plane, or a hyperplane. This is what we use in multiple linear regression. We try to find the best-fit hyperplane for our data.

How do we find the optimized parameters ? How do we find the values of the \(\theta\) vector that minimize the error of the prediction ?

Optimized parameters are the ones which lead to the fewest errors.

Classification

Binary Classification vs Multiclass Classification

K-Nearest Neighbours

Evaluation Metrics in Classification

Jaccard Index / Jaccard Similarity Coefficient

F1-Score

Log Loss