Link to the HackMD note
Supervised learning
Supervised learning: process of teaching a model by feeding it input data as well as correct output data. The model will (hopefully) deduce a correct relationship between the input and output
- An input/output pair is called labeled data
- All pairs form the training set
- Once training is completed, the model can infer new outputs if fed with new inputs.
Given some training data $\{x_i, y_i\}_{i=1}^n$, supervised learning aims at finding a model $f$ correctly mapping input data $x_i$ to their respective outputs $y_i$
- The model can predict new outputs
- The learning mechanism is called regression or classification
Managing data for supervised learning
Hold some data out during training ($\simeq20\%$ of the data) to evaluate model performance afterwards $\Rightarrow$ train/test split
Use a validation set ($\simeq15\%$ of the data) if parameters are iteratively adjusted $\Rightarrow$ train/validation split
Stratified sampling
For classification purposes
Classes might be imbalanced $\Rightarrow$ use stratified sampling to guarantee a fair balance of train/test samples for each class
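As an illustration, a stratified hold-out split could look like the following scikit-learn sketch (the dataset and its 90/10 class imbalance are invented for the example):

```python
# Sketch: stratified hold-out split with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Hold out ~20% of the data; stratify=y keeps the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```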
Regression
The art of predicting values
Regression: the output value to predict $y$ is quantitative (real number)
$\Rightarrow$ How to mathematically model the relationship between predictor variables $x_i$ and their numerical output $y_i$ ?
Linear regression
Sometimes, there’s no need for a complicated model…
Ordinary Least Squares
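As a reminder, OLS has the closed-form solution $\hat\beta=(X^\top X)^{-1}X^\top y$; a minimal NumPy sketch on toy data invented for the example:

```python
# Sketch: OLS via the normal equations, beta_hat = (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 + 0.5 * x + rng.normal(scale=1.0, size=50)   # toy data, true line y = 3 + 0.5x

X = np.column_stack([np.ones_like(x), x])          # design matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # solve the normal equations
print(beta_hat)                                    # should be close to [3, 0.5]
```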
Anscombe's quartet
For all 4 datasets $\{(x_1,y_1),(x_2,y_2),\dots,(x_{11},y_{11})\}$
The 3rd dataset contains an outlier, i.e. a data point far away from the others that risks skewing the regression (probably due to a faulty sensor)
$\Rightarrow$ Linear regression line $y=3+0.5x$ and $R^2=0.67$ are the SAME for all 4 datasets
Least absolute deviation
Linear regression by OLS is sensitive to outliers (thank you, $L_2$ norm…)
Is it a good idea ?
- $\beta_{LAD}$ is the MLE estimator of $\beta$ when noise follows a Laplace distribution
- No analytical formula for LAD
- Harder to find the solution
- Must use a gradient descent approach (see the sketch after this list)
- Solution of LAD may not be unique
All the lines inside the cone are optimal
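Since LAD has no closed form, here is a minimal sketch of solving it by subgradient descent; the step size and iteration count are arbitrary choices, and a dedicated solver would normally be preferred:

```python
# Sketch: LAD regression, minimizing (1/n) * sum_i |y_i - x_i^T beta|
# by subgradient descent (no closed-form solution exists).
import numpy as np

def lad_fit(X, y, lr=0.1, n_iter=5000):
    n, p = X.shape
    beta = np.zeros(p)
    for t in range(n_iter):
        residuals = y - X @ beta
        # A subgradient of the objective with respect to beta
        subgrad = -X.T @ np.sign(residuals) / n
        beta -= lr / np.sqrt(t + 1) * subgrad   # diminishing step size
    return beta
```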
Adding some regularization
Add a penalty term to OLS to enforce particular properties on $\hat\beta$
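The two most common choices are the $L_2$ penalty (ridge) and the $L_1$ penalty (lasso); a minimal scikit-learn sketch on toy data (the `alpha` values are arbitrary):

```python
# Sketch: ridge (L2 penalty) and lasso (L1 penalty) compared to plain OLS.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Toy data invented for the example
X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# alpha controls the penalty strength; the values here are arbitrary
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_[:5])   # first few estimated coefficients
```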
From regression to classification
Logistic regression
Linear regression predicts a real value $\hat y$ based on predictor variables $x=(x^{(1)},\dots,x^{(k)})$
- Does not work if $y$ is boolean
- $P(y=1)=p$ and $P(y=0)=1-p$
- Use logistic regression instead
Linear relationship between the predictor variables and the logit of the event:
\[\log\frac{p}{1-p}=\beta_0+\beta_1x^{(1)}+\dots+\beta_kx^{(k)}\]
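A minimal scikit-learn sketch of logistic regression on toy binary data (the dataset is invented for the example):

```python
# Sketch: logistic regression on toy binary data with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # toy data

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:3]))   # estimated P(y=0) and P(y=1) for the first 3 points
print(clf.predict(X[:3]))         # hard class predictions
```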
k-nearest neighbors
k-NN classifier simply assigns test data points to the majority class in the neighborhood of the test points
- no real training step
Choosing k
- small k: simple but noisy decision boundary
- large k: smoothed boundaries but computationally intensive
- $k=\sqrt{n}$ can also serve as a starting heuristic, refined by cross-validation
- $k$ should be odd for binary classification
k-nearest neighbors for regression
Use the k nearest neighbors (in terms of features only) and average their outputs to get the predicted value
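A minimal scikit-learn sketch of both uses of k-NN on toy data ($k=5$ is an arbitrary choice, see the heuristics above):

```python
# Sketch: k-NN used for classification (majority vote) and regression (average).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # toy features
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy class labels
y_reg = 2.0 * X[:, 0] + rng.normal(size=100)     # toy regression targets

knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_class)   # majority class among 5 neighbors
knn_reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_reg)      # average of the 5 neighbors' targets

print(knn_clf.predict([[0.5, 0.5]]), knn_reg.predict([[0.5, 0.5]]))
```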
Support Vector Machine
Linear SVM
Training set: $\{x_i,y_i\}_{i=1}^n$ with $x_i\in\mathbb R^p$ and $y_i\in\{-1,+1\}$
Goal: find the hyperplane that best divides positive samples from negative samples
What would we like to do here? Take an average
We look for the line that passes most centrally between the two classes
Reminder: dot product of 2 collinear vectors:
\[\langle\vec w, \vec{AB}\rangle = \Vert \vec w\Vert\,\Vert \vec{AB}\Vert\]
Soft margin SVM
Data may not be fully linearly separable
Kernel SVM
Remember the kernel trick ?
Kernel trick:
- Map data points into a high-dimensional space where they would become linearly separable
- Effortlessly interfaced with the SVM by replacing the dot product $\langle\cdot,\cdot\rangle$ with its kernelized version $k(\cdot,\cdot)$
Widely used kernel functions:
- Polynomial kernel
- Gaussian RBF kernel
- Sigmoid kernel
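For reference, the usual forms of these kernels, written with the common parameterization ($\gamma$, $r$ and the degree $d$ are hyperparameters):
\[k_{\text{poly}}(x,x')=(\gamma\langle x,x'\rangle+r)^d \qquad k_{\text{RBF}}(x,x')=\exp(-\gamma\Vert x-x'\Vert^2) \qquad k_{\text{sig}}(x,x')=\tanh(\gamma\langle x,x'\rangle+r)\]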
Choosing the right kernel with the right hyperparameters
Kernel $\Rightarrow$ try linear first. If it does not work, RBF is probably the best kernel choice (unless you have some prior information on the geometry of your dataset)
Hyperparameters ($C$ + kernel parameter(s)) $\Rightarrow$ grid search and cross-validation
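A minimal sketch of such a grid search with scikit-learn (the grid values are illustrative, not recommendations):

```python
# Sketch: grid search with cross-validation over C and gamma for an RBF SVM.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)  # toy data

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}  # illustrative grid
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)      # 5-fold CV per combination
search.fit(X, y)
print(search.best_params_, search.best_score_)
```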
Multiclass SVM
What if we have more than 2 classes ?
2 possible strategies
One vs all: one SVM model per class $\to$ separate the class from all other classes
- Assign new points with a winner-takes-all rule
- If there is no outright winner, assign the point to the class of the closest hyperplane (Platt scaling)
One versus one: one SVM model per pair of classes $\to$ separate 2 classes at a time, ignoring the other data
- Assign new points with a majority voting rule
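Both strategies exist as generic wrappers in scikit-learn; a minimal sketch on an invented 3-class dataset:

```python
# Sketch: one-vs-rest and one-vs-one multiclass strategies wrapping a binary SVM.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Toy 3-class dataset
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)   # one model per class
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)    # one model per pair of classes
print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```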
Decision trees
Decision trees use recursive partitioning to create a sequence of decision rules on input features, yielding nested splits of the data points
Input features can be numeric (decision $\le$) or categorical (decision $==$)
Decision node $=$ decision rule for one feature
Classification tree $\to$ predict a class
Regression tree $\to$ predict a real number
On the current node, try all possible decision rules for all features and select the decision that best splits the data
- Classification tree $\to$ impurity criterion
- Regression tree $\to$ variance reduction
Final decision boundaries $\equiv$ overlapping orthogonal half-planes
Decision on new data $\to$ run it down through the branches and assign the class of the leaf it reaches
How to split a node
Which split should we choose?
The answer is the left one
Stop recursive partitioning if the node is pure
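A minimal sketch of how a split threshold can be chosen on one numeric feature, assuming the Gini impurity as the impurity criterion (the notes do not name a specific one):

```python
# Sketch: choosing the best split threshold on one numeric feature with Gini impurity.
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_c p_c^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Try every candidate threshold x <= t and keep the one minimizing the weighted impurity."""
    best_t, best_impurity = None, np.inf
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:   # skip splits that leave a child empty
            continue
        w_impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if w_impurity < best_impurity:
            best_t, best_impurity = t, w_impurity
    return best_t, best_impurity

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 0, 1, 1])
print(best_threshold(x, y))   # best split at t = 3.0, impurity 0
```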
Pros and cons of decision trees
Pros
- Simple decision rules
- Surprisingly computationally efficient
- Handle multiclass problems
- Handle numeric and categorical features at the same time
Cons
- Strongly overfit data
- Bad predictive accuracy
Potential solution: restrain the growth of the tree by imposing a maximal tree depth
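A minimal scikit-learn sketch of this restriction (the depth value is arbitrary):

```python
# Sketch: limiting the maximal tree depth to curb overfitting.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)              # toy data
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)   # depth capped at 4
print(tree.get_depth())                                                # <= 4
```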
Random forests
Bagging several decision trees
Decision trees are weak classifiers when considered individually
- Average the decision of several of them
- Compensate their respective errors (wisdom of crowds)
- Useless if all decision trees see the same data
- Introduce some variability with bagging (bootstrap aggregating)
- Introduce more variability by selecting only $p$ out of $m$ total features for each split in each decision tree (typically $p=\sqrt{m}$)
Final decision is taken by majority voting on all decision tree outputs
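A minimal scikit-learn sketch of a random forest with $p=\sqrt{m}$ features per split, on toy data invented for the example:

```python
# Sketch: random forest = bagged decision trees + random feature subsets at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=0)  # toy data

# max_features="sqrt" selects p = sqrt(m) features per split; 100 trees is an arbitrary choice
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))   # majority vote over the 100 trees
```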
Decision boundaries comparison
Evaluating regression/classification performances
Cross-validation
$k$-fold cross validation
- Divide whole data into $k$ non-overlapping sample blocks
- Train $k$ models, each on $(k-1)$ training blocks, and test on the remaining block
- Compute the performance metrics of each model + the average & standard deviation over all $k$ models
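A minimal scikit-learn sketch of $k$-fold cross-validation ($k=5$; the model and data are invented for the example):

```python
# Sketch: 5-fold cross-validation, reporting mean and standard deviation of the scores.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)   # toy data

scores = cross_val_score(SVC(), X, y, cv=5)   # one accuracy score per held-out block
print(scores.mean(), scores.std())
```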