Link to the HackMD note
Supervised learning
Supervised learning: process of teaching a model by feeding it input data as well as correct output data. The model will (hopefully) deduce a correct relationship between the input and output
- An input/output pair is called labeled data
- All pairs form the training set
- Once training is completed, the model can infer new outputs if fed with new inputs.
Given some training data $\{x_i, y_i\}_{i=1}^n$, supervised learning aims at finding a model $f$ correctly mapping input data $x_i$ to their respective outputs $y_i$
- The model can predict new outputs
- The learning mechanism is called regression or classification
Managing data for supervised learning
Hold some data out during training ($\simeq20\%$ of the data) to evaluate model performance afterwards $\Rightarrow$ train/test split
Use a validation set ($\simeq15\%$ of the data) if parameters are iteratively adjusted $\Rightarrow$ train/validation split
Stratified sampling
For classification purposes
Classes might be imbalanced $\Rightarrow$ use stratified sampling to guarantee a fair balance of train/test samples for each class
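As an illustration, a stratified hold-out split could look like the following scikit-learn sketch (the dataset and its 90/10 class imbalance are invented for the example):

```python
# Sketch: stratified hold-out split with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Hold out ~20% of the data; stratify=y keeps the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```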
Regression
The art of predicting values
Regression: the output value to predict $y$ is quantitative (real number)
$\Rightarrow$ How to mathematically model the relationship between predictor variables $x_i$ and their numerical output $y_i$ ?
Linear regression
Sometimes, there’s no need for a complicated model…
Ordinary Least Squares
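As a reminder, OLS has the closed-form solution $\hat\beta=(X^\top X)^{-1}X^\top y$; a minimal NumPy sketch on toy data invented for the example:

```python
# Sketch: OLS via the normal equations, beta_hat = (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 + 0.5 * x + rng.normal(scale=1.0, size=50)   # toy data, true line y = 3 + 0.5x

X = np.column_stack([np.ones_like(x), x])          # design matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # solve the normal equations
print(beta_hat)                                    # should be close to [3, 0.5]
```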
Anscombe's quartet
For all 4 datasets $\{(x_1,y_1),(x_2,y_2),\dots,(x_{11},y_{11})\}$
The 3rd dataset contains an outlier, i.e. a data point far away from the others that risks skewing the regression (probably due to a faulty sensor)
$\Rightarrow$ Linear regression line $y=3+0.5x$ and $R^2=0.67$ are the SAME for all 4 datasets
Least absolute deviation
Linear regression by OLS is sensitive to outliers (thank you, $L_2$ norm…)
Is it a good idea ?
- $\beta_{LAD}$ is the MLE estimator of $\beta$ when noise follows a Laplace distribution
- No analytical formula for LAD
- Harder to find the solution
- Must use a gradient descent approach (see the sketch after this list)
- Solution of LAD may not be unique
All the lines inside the cone are optimal
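Since LAD has no closed form, here is a minimal sketch of solving it by subgradient descent; the step size and iteration count are arbitrary choices, and a dedicated solver would normally be preferred:

```python
# Sketch: LAD regression, minimizing (1/n) * sum_i |y_i - x_i^T beta|
# by subgradient descent (no closed-form solution exists).
import numpy as np

def lad_fit(X, y, lr=0.1, n_iter=5000):
    n, p = X.shape
    beta = np.zeros(p)
    for t in range(n_iter):
        residuals = y - X @ beta
        # A subgradient of the objective with respect to beta
        subgrad = -X.T @ np.sign(residuals) / n
        beta -= lr / np.sqrt(t + 1) * subgrad   # diminishing step size
    return beta
```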
Adding some regularization
Add a penalty term to OLS to enforce particular properties on $\hat\beta$
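The two most common choices are the $L_2$ penalty (ridge) and the $L_1$ penalty (lasso); a minimal scikit-learn sketch on toy data (the `alpha` values are arbitrary):

```python
# Sketch: ridge (L2 penalty) and lasso (L1 penalty) compared to plain OLS.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Toy data invented for the example
X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# alpha controls the penalty strength; the values here are arbitrary
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_[:5])   # first few estimated coefficients
```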
From regression to classification
Logistic regression
Linear regression predicts a real value $\hat y$ based on predictor variables $x=(x^{(1)},\dots,x^{(k)})$
- Does not work if $y$ is boolean
- $P(y=1)=p$ and $P(y=0)=1-p$
- Use logistic regression instead
Linear relationship between the predictor variables and the logit of the event:
\[\log\frac{p}{1-p}=\beta_0+\beta_1x^{(1)}+\dots+\beta_kx^{(k)}\]
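A minimal scikit-learn sketch of logistic regression on toy binary data (the dataset is invented for the example):

```python
# Sketch: logistic regression on toy binary data with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # toy data

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:3]))   # estimated P(y=0) and P(y=1) for the first 3 points
print(clf.predict(X[:3]))         # hard class predictions
```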
k-nearest neighbors
k-NN classifier simply assigns test data points to the majority class in the neighborhood of the test points
- no real training step
Choosing k
- small k: simple but noisy decision boundary
- large k: smoothed boundaries but computationally intensive
- $k=\sqrt{n}$ can also serve as a starting heuristic, refined by cross-validation
- $k$ should be odd for binary classification
k-nearest neighbors for regression
Use the k nearest neighbors (in terms of features only) and average their outputs to get the predicted value
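A minimal scikit-learn sketch of both uses of k-NN on toy data ($k=5$ is an arbitrary choice, see the heuristics above):

```python
# Sketch: k-NN used for classification (majority vote) and regression (average).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # toy features
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy class labels
y_reg = 2.0 * X[:, 0] + rng.normal(size=100)     # toy regression targets

knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_class)   # majority class among 5 neighbors
knn_reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_reg)      # average of the 5 neighbors' targets

print(knn_clf.predict([[0.5, 0.5]]), knn_reg.predict([[0.5, 0.5]]))
```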
Support Vector Machine
Linear SVM
Training set: $\{x_i,y_i\}_{i=1}^n$ with $x_i\in\mathbb R^p$ and $y_i\in\{-1,+1\}$
Goal: find the hyperplane that best divides positive samples from negative samples
What would we like to do here? Take an average
We look for the line that passes most centrally between the two classes
Reminder: dot product of 2 collinear vectors:
\[\langle\vec w, \vec{AB}\rangle = \Vert \vec w\Vert\,\Vert \vec{AB}\Vert\]
Soft margin SVM
Data may not be fully linearly separable
Kernel SVM
Remember the kernel trick ?
Kernel trick:
- Map data points into a high-dimensional space where they would become linearly separable
- Effortlessly interfaced with the SVM by replacing the dot product $\langle\cdot,\cdot\rangle$ with its kernelized version $k(\cdot,\cdot)$
Widely used kernel functions:
- Polynomial kernel
- Gaussian RBF kernel
- Sigmoid kernel
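For reference, the usual forms of these kernels, written with the common parameterization ($\gamma$, $r$ and the degree $d$ are hyperparameters):
\[k_{\text{poly}}(x,x')=(\gamma\langle x,x'\rangle+r)^d \qquad k_{\text{RBF}}(x,x')=\exp(-\gamma\Vert x-x'\Vert^2) \qquad k_{\text{sig}}(x,x')=\tanh(\gamma\langle x,x'\rangle+r)\]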
Choosing the right kernel with the right hyperparameters
Kernel $\Rightarrow$ try linear first. If it does not work, RBF is probably the best kernel choice (unless you have some prior information on the geometry of your dataset)
Hyperparameters ($C$ + kernel parameter(s)) $\Rightarrow$ grid search and cross-validation
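A minimal sketch of such a grid search with scikit-learn (the grid values are illustrative, not recommendations):

```python
# Sketch: grid search with cross-validation over C and gamma for an RBF SVM.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)  # toy data

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}  # illustrative grid
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)      # 5-fold CV per combination
search.fit(X, y)
print(search.best_params_, search.best_score_)
```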
Multiclass SVM
What if we have more than 2 classes ?
2 possible strategies
One vs all: one SVM model per class $\to$ separate the class from all other classes
- Assign new points with a winner-takes-all rule
- If there is no outright winner, assign the point to the class of the closest hyperplane (Platt scaling)
One versus one: one SVM model per pair of classes $\to$ separate 2 classes at a time, ignoring the other data
- Assign new points with a majority voting rule
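Both strategies exist as generic wrappers in scikit-learn; a minimal sketch on an invented 3-class dataset:

```python
# Sketch: one-vs-rest and one-vs-one multiclass strategies wrapping a binary SVM.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Toy 3-class dataset
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)   # one model per class
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)    # one model per pair of classes
print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```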
Decision trees
Decision trees use recursive partitioning to create a sequence of decision rules on input features, yielding nested splits of the data points
Input features can be numeric (decision $\le$) or categorical (decision $==$)
Decision node $=$ decision rule for one feature
Classification tree $\to$ predict a class
Regression tree $\to$ predict a real number
On the current node, try all possible decision rules for all features and select the decision that best splits the data
- Classification tree $\to$ impurity criterion
- Regression tree $\to$ variance reduction
Final decision boundaries $\equiv$ overlapping orthogonal half-planes
Decision on new data $\to$ run it down through the branches and assign the class of the leaf it reaches
How to split a node
Which split should we choose?
The answer is the left one
Stop recursive partitioning if the node is pure
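A minimal sketch of how a split threshold can be chosen on one numeric feature, assuming the Gini impurity as the impurity criterion (the notes do not name a specific one):

```python
# Sketch: choosing the best split threshold on one numeric feature with Gini impurity.
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_c p_c^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Try every candidate threshold x <= t and keep the one minimizing the weighted impurity."""
    best_t, best_impurity = None, np.inf
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:   # skip splits that leave a child empty
            continue
        w_impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if w_impurity < best_impurity:
            best_t, best_impurity = t, w_impurity
    return best_t, best_impurity

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 0, 1, 1])
print(best_threshold(x, y))   # best split at t = 3.0, impurity 0
```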
Pros and cons of decision trees
Pros
- Simple decision rules
- Surprisingly computationally efficient
- Handle multiclass problems
- Handle numeric and categorical features at the same time
Cons
- Strongly overfit data
- Bad predictive accuracy
Potential solution: restrain the growth of the tree by imposing a maximal tree depth
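A minimal scikit-learn sketch of this restriction (the depth value is arbitrary):

```python
# Sketch: limiting the maximal tree depth to curb overfitting.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)              # toy data
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)   # depth capped at 4
print(tree.get_depth())                                                # <= 4
```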
Random forests
Bagging several decision trees
Decision trees are weak classifiers when considered individually
- Average the decision of several of them
- Compensate their respective errors (wisdom of crowds)
- Useless if all decision trees see the same data
- Introduce some variability with bagging (bootstrap aggregating)
- Introduce more variability by selecting only $p$ out of $m$ total features for each split in each decision tree (typically $p=\sqrt{m}$)
Final decision is taken by majority voting on all decision tree outputs
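A minimal scikit-learn sketch of a random forest with $p=\sqrt{m}$ features per split, on toy data invented for the example:

```python
# Sketch: random forest = bagged decision trees + random feature subsets at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=0)  # toy data

# max_features="sqrt" selects p = sqrt(m) features per split; 100 trees is an arbitrary choice
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))   # majority vote over the 100 trees
```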
Decision boundaries comparison
Evaluating regression/classification performances
Cross-validation
$k$-fold cross validation
- Divide whole data into $k$ non-overlapping sample blocks
- Train $k$ models, each on $(k-1)$ training blocks, and test on the remaining block
- Compute the performance metrics of each model + the average & standard deviation over all $k$ models
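A minimal scikit-learn sketch of $k$-fold cross-validation ($k=5$; the model and data are invented for the example):

```python
# Sketch: 5-fold cross-validation, reporting mean and standard deviation of the scores.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)   # toy data

scores = cross_val_score(SVC(), X, y, cv=5)   # one accuracy score per held-out block
print(scores.mean(), scores.std())
```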