Notes about the IBM course “Machine Learning with Python”

Machine Learning with Python

Linear Regression

In linear regression, the dependent variable (target) must be continuous and cannot be a discrete value. However, the independent variables can be measured on either a categorical or a continuous scale. Linear Regression is used to predict continuous values, for example it can be used to predict :

  • Sales forecasting ( Target : total yearly sales. Independent variables : Age, Education, Years of experience )
  • Satisfaction analysis ( Target : Individual satisfaction. Independent variables : Demographic and psychological factors )
  • Price estimation ( Target : Price of a house in an area. Independent variables : Size, number of bedrooms, …)
  • Employment income ( Target : Income. Independent variables : Hours of work, education, occupation, sex, age, years of experience, …)

There are many Regression Algorithms, to name a few :

  • Ordinal regression
  • Poisson regression
  • Fast forest quantile regression
  • Linear, polynomial, Lasso, Stepwise, Ridge regression
  • Bayesian linear regression
  • Neural network regression
  • Decision forest regression
  • Boosted decision tree regression
  • KNN ( K-nearest Neighbors )

Simple Linear Regression

Simple Linear Regression is when one independent variable is used to estimate a dependent, continuous variable. As you can guess, when more than one independent variable is used, the process is called Multiple Linear Regression (a short code sketch follows the list below). Pros of Linear Regression :

  • Very fast
  • No parameter tuning
  • Easy to understand and highly interpretable
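A minimal sketch of fitting a Simple Linear Regression with scikit-learn (the engine-size/CO2-style toy values and variable names are assumptions for illustration, not the course dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (illustrative values only): one independent variable, one continuous target
X = np.array([[1.6], [2.0], [2.4], [3.0], [3.5], [4.2]])  # shape (n_samples, 1)
y = np.array([180, 205, 230, 260, 288, 320])

model = LinearRegression()
model.fit(X, y)  # estimates theta_0 (intercept_) and theta_1 (coef_)

print("theta_0 (intercept):", model.intercept_)
print("theta_1 (slope):", model.coef_[0])
print("prediction for x = 2.8:", model.predict([[2.8]])[0])
```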

Fitting Line

The fit line is the line that goes through the data, representing the trend of how the dependent variable evolves with the independent variables. It is usually represented as a polynomial :

=> \(\boxed{\hat{y} = \overbrace{\theta_0 + \theta_1}^{\text{parameters}}\underbrace{x_1}_{\text{predictor}}}\)

\(\hat{y}\) = dependent variable or the predicted value
\(x_1\) = independent variable

Coefficients of the linear equation :
\(\theta_0\) = intercept
\(\theta_1\) = slope or gradient of the fitting line

For each data point, there is a delta between the predicted value and the actual value in our data set. The best fit is where the mean (squared here) of all the residual errors is the lowest. See below for the definition of the Mean Squared Error (MSE). The objective of linear regression is to minimize this MSE equation by finding the best parameters \(\theta_0\) and \(\theta_1\).
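Written out as a minimization problem (this simply restates the MSE defined in the Evaluation metrics section below):

\[\min_{\theta_0,\,\theta_1}\ \cfrac{1}{n}\sum_{i=1}^{n}\big(y_i - (\theta_0 + \theta_1 x_i)\big)^2\]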

Calculating \(\theta_0\) and \(\theta_1\)

Linear regression estimates the coefficients of the line. To do so, we must calculate \(\theta_0\) and \(\theta_1\) (adjust the parameters) to find the best line to fit the data.

\[\hat{y} = \color{orange}{\theta_0} + \color{cyan}{\theta_1x_1}\]

\(\color{cyan}{\theta_1 = \cfrac{\sum_{i=1}^{s}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s}(x_i - \bar{x})^2}}\) (slope)

\(\color{orange}{\theta_0 = \bar{y} - \theta_1\bar{x}}\) (Bias Coefficient)

\(\bar{x}\) = mean of the independent variables (features)
\(\bar{y}\) = mean of the dependent variables (target)

With \(\theta_0\) being the intercept and \(\theta_1\) being the slope. As the calculation goes through all the data points one by one, \(i\) represents the current value and \(s\) the total number of values.
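As a small sketch, the two formulas above can be computed directly with NumPy (the x and y arrays are made-up illustrative values):

```python
import numpy as np

# Toy data (illustrative values only)
x = np.array([1.6, 2.0, 2.4, 3.0, 3.5, 4.2])
y = np.array([180, 205, 230, 260, 288, 320])

x_bar, y_bar = x.mean(), y.mean()

# theta_1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
theta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# theta_0 = y_bar - theta_1 * x_bar
theta_0 = y_bar - theta_1 * x_bar

print("theta_0 (intercept):", theta_0)
print("theta_1 (slope):", theta_1)
```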

Error and Accuracy

In order to evaluate our model, we need to compare our predictions ( \(\hat{y}\) ) with the actual values ( \(y\) ). If we “Train and Test” on the same dataset, we will end up with a model that has “High Training accuracy” and “Low out-of-sample accuracy”. This is called over-fitting. Over-fitting is when the model is overly trained to the dataset, which may capture noise and produce a non-generalized model. To avoid over-fitting, we split the dataset into a training set and a testing set in an approach called “Train/Test split”. In this approach, the Training Set and the Testing Set are mutually exclusive. This method provides a more accurate evaluation of out-of-sample accuracy.
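A minimal sketch of the Train/Test split approach with scikit-learn (the 80/20 split ratio and the noisy toy data are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): a noisy linear relationship
rng = np.random.default_rng(42)
X = np.arange(1, 41, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 2.0, size=40)

# Mutually exclusive training and testing sets (80 % / 20 % split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Out-of-sample accuracy: evaluated only on data the model has never seen
print("R^2 on the test set:", model.score(X_test, y_test))
```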

K-Fold cross-validation

One downside of that approach is that the result is highly dependent on which parts of the data the model is trained and tested on. K-Fold cross-validation is there to overcome this downside. It basically uses multiple (K) Train/Test splits on the same dataset, using a different test sample each time, and calculates the average accuracy over the different Train/Test splits. This provides a more consistent out-of-sample accuracy.
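A short sketch of K-Fold cross-validation with scikit-learn's cross_val_score (K = 5, the R² scoring choice, and the toy data are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data (illustrative only): a noisy linear relationship
rng = np.random.default_rng(42)
X = np.arange(1, 41, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 2.0, size=40)

# K = 5 Train/Test splits; each sample ends up in a test fold exactly once
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print("R^2 per fold:", scores)
print("average out-of-sample R^2:", scores.mean())
```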

Evaluation metrics

The error of the model is the difference between the data points and the trend line generated by the algorithm. The choice of metric depends entirely on the type of model, your data type, and your domain knowledge.

Mean Absolute Error (MAE)

It is the average of the absolute errors.

MAE = \(\cfrac{1}{n} \sum_{j=1}^n |y_j - \hat{y}_j|\)

MAE = \(\cfrac{|y_1 - \hat{y}_1| + |y_2 - \hat{y}_2| + |y_3 - \hat{y}_3| + ... + |y_n - \hat{y}_n|}{n}\)

Mean Squared Error (MSE)

More popular than MAE, as the squaring puts more weight on large errors.

MSE = \(\cfrac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2\)

Root Mean Squared Error (RMSE)

Also popular, as it can be interpreted in the same units as the response vector ( \(y\) units ), making it easy to communicate.

RMSE = \(\sqrt{\cfrac{1}{n}\sum_{j=1}^n(y_j - \hat{y}_j)^2}\)

Relative Absolute Error (RAE)

Where \(\bar{y}\) is the mean value of \(y\), the RAE takes the total absolute error and normalizes it by dividing by the total absolute error of the simple predictor (always predicting the mean).

RAE = \(\cfrac{\sum_{j=1}^n|y_j - \hat{y}_j|}{\sum_{j=1}^n|y_j - \bar{y}|}\)

Relative Squared Error (RSE)

Similar to RAE and widely adopted by the data science community.

RSE = \(\cfrac{\sum_{j=1}^n(y_j - \hat{y}_j)^2}{\sum_{j=1}^n(y_j - \bar{y})^2}\)

The RSE is often used in Data Science as it is needed to obtain the \(R^2\) value, which is used to evaluate performance. \(R^2\) represents how close the data values are to the fitted regression line. The higher it is, the better your model fits the data.

\[R^2 = 1 - RSE\]
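A compact sketch that computes all of the metrics above with NumPy (the y_true and y_pred arrays are made-up illustrative values):

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.1, 9.6, 10.5])

errors = y_true - y_pred

mae  = np.mean(np.abs(errors))                                           # Mean Absolute Error
mse  = np.mean(errors ** 2)                                              # Mean Squared Error
rmse = np.sqrt(mse)                                                      # Root Mean Squared Error
rae  = np.sum(np.abs(errors)) / np.sum(np.abs(y_true - y_true.mean()))   # Relative Absolute Error
rse  = np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)       # Relative Squared Error
r2   = 1 - rse                                                           # R-squared

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  RAE={rae:.3f}  RSE={rse:.3f}  R^2={r2:.3f}")
```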

Classification

Binary Classification VS Multiclass Classification

K-Nearest Neighbours

Evaluation Metrics in Classification

Jaccard Index / Jaccard Similarity Coefficient

F1-Score

Log Loss