
IML: Supervised learning

Link to the Hackmd note

Supervised learning

  • An input/output pair is called labeled data
    • All pairs form the training set
  • Once training is completed, the model can infer new outputs if fed with new inputs.

Given some training data $\{x_i,y_i\}_{i=1}^n$, supervised learning aims at finding a model $f$ correctly mapping input data $x_i$ to their respective outputs $y_i$

  • The model can predict new outputs
  • The learning task is called regression when the output is numerical, and classification when it is categorical

Managing data for supervised learning

Hold out some data during training ($\simeq20\%$ of the data) to later evaluate model performance $\Rightarrow$ train/test split. Use a validation set ($\simeq15\%$ of the data) if parameters are iteratively adjusted $\Rightarrow$ train/validation split

Stratified sampling

For classification purposes

Classes might be imbalanced $\Rightarrow$ use stratified sampling to guarantee a fair balance of train/test samples for each class
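For illustration, here is a minimal sketch of such a stratified split using scikit-learn's `train_test_split` (the synthetic imbalanced dataset below is an assumption, not from the notes):

```python
# Minimal sketch of a stratified train/test split with scikit-learn
# (the toy dataset and the 80/20 split ratio are arbitrary choices).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy dataset: ~90% of samples in class 0, ~10% in class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the class proportions identical in the train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```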

Regression

The art of predicting values

$\Rightarrow$ How to mathematically model the relationship between predictor variables $x_i$ and their numerical output $y_i$ ?

Linear regression

Sometimes, there’s no need for a complicated model…

Ordinary Least Squares
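For reference (notation added here: $X$ stacks the inputs $x_i$ as rows and $y$ is the vector of outputs), OLS minimizes the sum of squared residuals, and when $X^TX$ is invertible the solution has a closed form:

\[\hat\beta_{OLS}=\arg\min_\beta\Vert y-X\beta\Vert_2^2=(X^TX)^{-1}X^Ty\]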

Anscombe’s quartet

For all 4 datasets $\{(x_1,y_1),(x_2,y_2),\dots,(x_{11},y_{11})\}$

The 3rd dataset contains an outlier, i.e. a data point far away from the others that can skew the regression (probably due to a faulty sensor)

$\Rightarrow$ Linear regression line $y=3+0.5x$ and $R^2=0.67$ are the SAME for all 4 datasets

Least absolute deviation

Linear regression by OLS is sensitive to outliers (thank you $L_2$ norm…)
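Least absolute deviation (LAD) replaces the squared residuals of OLS with absolute values, which reduces the influence of outliers:

\[\hat\beta_{LAD}=\arg\min_\beta\sum_{i=1}^n\vert y_i-x_i^T\beta\vert\]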

Is it a good idea ?

  • $\beta_{LAD}$ is the MLE estimator of $\beta$ when noise follows a Laplace distribution
  • No analytical formula for LAD
    • Harder to find the solution
    • Must use a gradient descent approach
  • Solution of LAD may not be unique

Adding some regularization

Add a penalty term to OLS to enforce particular properties of $\hat\beta$
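For reference, two common penalties are the $L_2$ penalty (ridge regression) and the $L_1$ penalty (lasso), with regularization strength $\lambda$:

\[\hat\beta_{ridge}=\arg\min_\beta\Vert y-X\beta\Vert_2^2+\lambda\Vert\beta\Vert_2^2\qquad\hat\beta_{lasso}=\arg\min_\beta\Vert y-X\beta\Vert_2^2+\lambda\Vert\beta\Vert_1\]

Ridge shrinks all coefficients towards zero, while the lasso tends to set some of them exactly to zero (implicit feature selection).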

From regression to classification

Logistic regression

Linear regression predicts a real value $\hat y$ based on predictor variables $x=(x^{(1)},\dots,x^{(k)})$

  • Does not work if $y$ is boolean
  • $P(y=1)=p$ and $P(y=0)=1-p$
  • Use logistic regression instead

Linear relationship between predictor variables and logit of event:
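With $p=P(y=1)$, the standard formulation of this relationship is:

\[\log\frac{p}{1-p}=\beta_0+\beta_1x^{(1)}+\dots+\beta_kx^{(k)}\quad\Leftrightarrow\quad p=\frac{1}{1+e^{-(\beta_0+\beta_1x^{(1)}+\dots+\beta_kx^{(k)})}}\]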

k-nearest neighbors

k-NN classifier simply assigns test data points to the majority class in the neighborhood of the test points

  • no real training step

Result:

Choosing k

  • small k: simple but noisy decision boundary
  • large k: smoothed boundaries but computationally intensive
  • $k=\sqrt{n}$ can also serve as a starting heuristic, refined by cross-validation
  • $k$ should be odd for binary classification

k-nearest neighbors for regression

Use the k nearest neighbors (in terms of features only) and average their outputs to get the predicted value
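A minimal scikit-learn sketch of both uses (the synthetic data and the choice of $k$ are arbitrary):

```python
# k-NN for classification (majority vote) and regression (average of neighbors).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

k = int(np.sqrt(len(X_train))) | 1      # sqrt(n) heuristic, rounded to an odd number
clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))

# Regression variant: the prediction is the mean target value of the k neighbors
reg = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train.astype(float))
print("prediction:", reg.predict(X_test[:1]))
```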

Support Vector Machine

Linear SVM

Training set: $\{x_i,y_i\}_{i=1}^n$ with $x_i\in\mathbb R^p$ and $y_i\in\{-1,+1\}$

Goal: find the hyperplane that best divides positive samples from negative samples

What do we want to do here? Take an average

Reminder: the dot product of 2 collinear vectors:

\[\langle\vec w,\vec{AB}\rangle=\Vert\vec w\Vert\cdot\Vert\vec{AB}\Vert\]

Soft margin SVM

Data may not be fully linearly separable
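For reference, the usual soft-margin formulation introduces slack variables $\xi_i$ and a trade-off parameter $C$ that controls how much margin violations are penalized:

\[\min_{w,b,\xi}\ \frac12\Vert w\Vert^2+C\sum_{i=1}^n\xi_i\quad\text{s.t.}\quad y_i(\langle w,x_i\rangle+b)\ge1-\xi_i,\ \xi_i\ge0\ \forall i\]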

Kernel SVM

Remember the kernel trick ?

Kernel trick:

  • map data points into a high-dimensional space where they would become linearly separable
  • Effortlessly interfaced with the SVM by replacing the dot product $\langle\cdot,\cdot\rangle$ with its kernelized version $k(\cdot,\cdot)$

Widely used kernel functions:

  • Polynomial kernel
  • Gaussian RBF kernel
  • Sigmoid kernel
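Their usual definitions, for reference ($d$, $\gamma$ and $c$ are the kernel hyperparameters):

\[k_{poly}(x,x')=(\gamma\langle x,x'\rangle+c)^d\qquad k_{RBF}(x,x')=\exp(-\gamma\Vert x-x'\Vert^2)\qquad k_{sig}(x,x')=\tanh(\gamma\langle x,x'\rangle+c)\]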

Choosing the right kernel with the right hyperparameters

Kernel $\Rightarrow$ Try linear first. If it does not work, RBF is probably the best kernel choice (unless you have some prior information on the geometry of your dataset)

Hyperparameters ($C$ + kernel parameter(s)) $\Rightarrow$ grid search and cross-validation
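A minimal sketch of that grid search with scikit-learn (the parameter grid values below are arbitrary):

```python
# Grid search over C and the RBF kernel width gamma, scored by 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```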

Multiclass SVM

What if we have more than 2 classes ?

2 possible strategies

One vs all: one SVM model per class $\to$ separate that class from all other classes

  • Assign new points with a winner-takes-all rule
  • If there is no outright winner, assign the point to the class of the closest hyperplane (Platt scaling)

One versus one: one SVM model per pair of classes $\to$ separate 2 classes at a time, ignoring the other data

  • Assign new points with a majority voting rule
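Both strategies exist as generic wrappers in scikit-learn; a quick sketch on a 3-class dataset:

```python
# One-vs-rest and one-vs-one wrappers around a linear SVM.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                  # 3 classes

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # one model per class
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # one model per pair of classes

print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```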

Decision trees

Input features can be numeric (decision $\le$) or categorical (decision $==$)

Decision node $=$ decision rule for one feature

Classification tree $\to$ predict a class. Regression tree $\to$ predict a real number.

On the current node, try to apply all the possible decision rules for all features and select the decision that best splits the data. Classification tree $\to$ impurity criterion. Regression tree $\to$ variance reduction.

Final decision boundaries $\equiv$ overlapping orthogonal half-planes. Decision on new data $\to$ run it down through the branches and assign a class.

How to split a node

Which split should we choose between the two candidates?

The answer is the left one
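For classification trees, a common impurity criterion (stated here for reference) is the Gini index; the retained split is the one that most decreases the impurity of the child nodes, weighted by their sizes:

\[G=1-\sum_{c=1}^{C}p_c^2\]

where $p_c$ is the proportion of samples of class $c$ in the node (entropy $-\sum_c p_c\log p_c$ is another common choice).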

Pros and cons of decision trees

Pros

  • Simple decision rules
  • Surprisingly computationally efficient
  • Handle multiclass problems
  • Handle numeric and categorical features at the same time

Cons

  • Strongly overfit data
    • Bad predictive accuracy

Random forests

Bagging several decision trees

Decision trees are weak classifiers when considered individually

  • Average the decision of several of them
    • Compensate their respective errors (wisdom of crowds)
  • Useless if all decision trees see the same data
    • introduce some variability with bagging (bootstrap aggregating)
  • Introduce more variability by selecting only $p$ out of $m$ total features for each split in each decision tree (typically $p=\sqrt{m}$)

Final decision is taken by majority voting on all decision tree outputs
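A minimal scikit-learn sketch tying these ingredients together (the synthetic dataset and the number of trees are arbitrary):

```python
# A random forest = bagged decision trees + random feature subsampling at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of bagged trees
    max_features="sqrt",   # p = sqrt(m) features considered at each split
    bootstrap=True,        # each tree sees a bootstrap sample of the data
    random_state=0,
).fit(X, y)

print(forest.predict(X[:5]))   # final decision: majority vote over the trees
```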

Decision boundaries comparison

Evaluating regression/classification performances

Cross-validation

$k$-fold cross validation

  • Divide the whole dataset into $k$ non-overlapping blocks
  • Train $k$ models, each on $(k-1)$ training blocks, and test on the remaining block
  • Compute performance metrics for each model + average & standard deviation over all $k$ models
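A minimal sketch with scikit-learn's `cross_val_score` (the model and data below are placeholders):

```python
# 5-fold cross-validation: returns one score per held-out fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)
print(scores.mean(), scores.std())   # average and standard deviation over the k models
```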

Confusion matrix
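The confusion matrix cross-tabulates true classes against predicted classes; a quick sketch with hypothetical labels:

```python
# Rows = true classes, columns = predicted classes (scikit-learn convention).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 2, 2]   # hypothetical ground-truth labels
y_pred = [0, 1, 1, 1, 0, 2, 2]   # hypothetical predictions
print(confusion_matrix(y_true, y_pred))
```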

This post is licensed under CC BY 4.0 by the author.