Link to the HackMD note
Practice 5: Take home messages
BoVW is linear classification friendly
- And linear classifiers are to be preferred whenever possible
Data preparation is tedious
- An important part of the time is dedicated to data analysis
- Plus we prepared a lot of things for you in the previous sessions
Scikit-learn is easy and super powerful
- Classifier evaluation in 1 line (see the sketch after this list)
- But there is more: parameter tuning, cross-validation, etc. in 1 or 2 lines
- Data preprocessing + classification (pipelines) in 1-3 lines
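As a quick illustration of the three points above, a minimal sketch (the digits toy dataset and the classifier/parameter choices are only placeholders):

```python
# Minimal sketch of the three points above (dataset and parameters are placeholders).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Classifier evaluation in 1 line (5-fold cross-validation).
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# Parameter tuning in 1-2 lines.
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5).fit(X, y)

# Data preprocessing + classification (pipeline) in 1-3 lines.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
```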
Some classifiers – part 2
How to build non-linear classifiers?
2 solutions:
- Preprocess the data (seen last time)
  - Ex: explicit embedding, kernel trick…
  - Change the input to make it linearly separable
- Combine multiple linear classifiers into a non-linear classifier (current topic)
  - Ex: boosting, neural networks…
  - Split the input space into linear subspaces
Non-linear classification using combinations of linear classifiers
Multi-layer Perceptron
Combine features linearly, apply an activation function $\phi$, repeat (note: with a linear $\phi$, the whole network remains a linear classifier)
Universal approximation theorem
What if $\phi$ is not linear?
Universal approximation theorem (Cybenko 89, Hornik 91)
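Informally (the exact hypotheses differ between the Cybenko and Hornik versions, so take this as a sketch): a one-hidden-layer network
$$g(x) = \sum_{i=1}^{N} \alpha_i\, \phi(w_i^\top x + b_i)$$
with a non-constant, bounded, continuous activation $\phi$ can approximate any continuous function on a compact set to arbitrary precision, provided $N$ is large enough.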
Decision tree
Works on categorical (e.g. “red”, “black”) and numerical (both discrete and continuous) variables
Train by optimizing classification “purity” at each decision (threshold on a particular dimension in numerical case)
Very fast training and testing. Non-parametric.
No need to preprocess the features
BUT: very prone to overfitting without strong limits on depth
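A minimal sketch of such a depth limit with scikit-learn (dataset and values are only illustrative):

```python
# Minimal sketch: a depth-limited decision tree (dataset and values are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth and min_samples_leaf are the usual guards against overfitting.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
print(cross_val_score(tree, X, y, cv=5).mean())
```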
Random Forests
Average the decision of multiple decision trees
Randomize in 2 ways:
- For each tree, pick a bootstrap sample of data
- For each split, pick random sample of features
More trees are always better
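The two randomization knobs above map directly to scikit-learn parameters; a minimal sketch (values are illustrative):

```python
# Minimal sketch: a random forest (values are illustrative).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_estimators: number of bootstrapped trees (more is better, but slower);
# max_features: size of the random feature subset tried at each split.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt")
print(cross_val_score(forest, X, y, cv=5).mean())
```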
Ensemble methods
“Bagging” or “bootstrap aggregating”
Underlying idea: part of the variance is due to the specific choice of the training data set
- Let us create many similar training data sets using the bootstrap
- For each of them, train a new classifier
- The final function will be the average of the individual function outputs
If the generalization error is decomposed into bias and variance terms, then bagging reduces variance (the average of a large number of random errors $\simeq 0$)
Random forest = a way of bagging trees
“Boosting”, AdaBoost variant
Combination of weak classifiers: $\sum_m \alpha_m G_m(x)$
$\alpha_m$ increases with accuracy (fewer errors, bigger $\alpha_m$)
The classifier $G_m$ is trained with an increased error cost for the observations which were misclassified by $G_{m-1}$
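In the classical discrete AdaBoost formulation (variants differ in the exact constants), with $err_m$ the weighted error rate of $G_m$ and $w_i$ the weight of observation $i$:
$$\alpha_m = \log\frac{1 - err_m}{err_m}, \qquad w_i \leftarrow w_i\, e^{\alpha_m \mathbf{1}[y_i \neq G_m(x_i)]}, \qquad G(x) = \mathrm{sign}\Big(\sum_m \alpha_m G_m(x)\Big)$$
so accurate weak learners get a larger $\alpha_m$, and misclassified observations get a larger weight for the next round.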
A quick comparison
More tricks
Data augmentation
Add realistic deformations to your input in order to improve domain coverage.
For image data, depending on what is possible in production: rotations, horizontal & vertical flips, scaling, translation, illumination change, warping, noise, etc.
For vector data: interesting problem. Possible approach: train/fit PCA then add random noise in low-energy features
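One possible (and entirely hypothetical) implementation of that PCA idea; the energy threshold and noise scale are arbitrary and would need tuning:

```python
# Hypothetical sketch of PCA-based augmentation for vector data:
# fit a PCA, add Gaussian noise only on the low-energy components, map back.
import numpy as np
from sklearn.decomposition import PCA

def augment_with_pca_noise(X, noise_scale=0.1, keep_energy=0.95, n_copies=1, seed=None):
    rng = np.random.default_rng(seed)
    pca = PCA().fit(X)
    Z = pca.transform(X)
    # Components beyond `keep_energy` of the cumulative explained variance
    # are considered "low energy" and receive the noise.
    low = np.cumsum(pca.explained_variance_ratio_) > keep_energy
    std = np.sqrt(pca.explained_variance_)
    copies = []
    for _ in range(n_copies):
        noise = rng.normal(scale=noise_scale * std, size=Z.shape) * low
        copies.append(pca.inverse_transform(Z + noise))
    return np.vstack(copies)
```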
Reject
Several options:
- Improve the model of the class boundary
  - In 1-vs-all training, add noise to the “others” samples
- Adjust the decision function depending on your application
  - Look at the prediction probability of your classifier, and threshold it as your application requires, using a ROC curve (see the sketch below)
- Model the noise
  - Add a “none” class to your classifier, with samples from real-life negative cases
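A minimal sketch of the probability-thresholding option (synthetic data, model choice and the 5% false-positive budget are purely illustrative):

```python
# Minimal sketch: reject by thresholding predict_proba, picking the threshold
# on a ROC curve (data, model and the 5% false-positive budget are illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = clf.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, proba)

# Keep the last threshold whose false positive rate stays under 5%;
# samples below it are rejected (treated as negatives).
threshold = thresholds[np.searchsorted(fpr, 0.05, side="right") - 1]
y_pred = (proba >= threshold).astype(int)
```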
More theory on ML
What is our goal?
Given samples (described by features) and true labels, find a good function which will correctly predict labels given new data samples
Problems:
- Which family for our function?
- What is “good”?
- How to train / find such function?
What are the sources of error?
- Noise
  - Your data is not perfect (or: “Every model is wrong.”)
  - Even if there exists an optimal underlying model, the observations are corrupted by noise (e.g. multiple $y$ for a given $x$).
  - Even the optimal solution could be wrong.
- Bias
  - You need to simplify to generalize.
  - Your classifier needs to drop some information about the training set to have generalization power.
  - The set of solutions explored does not contain the optimal solution.
- Variance
  - You have many ways to explain your training dataset.
  - It is hard to find an optimal solution among those many possibilities.
  - If we draw another training set from the same distribution, we would obtain another solution.
2 big issues
Under-fitting
- Caused by bias
- Your model assumptions are too strong for the data, so the model won’t fit well
Over-fitting
- Caused by variance
- Your algorithm has memorized the data including the noise, so it can’t generalize.
The theory
Bias (statistical definition)
Let $T$ be a statistic used to estimate a parameter $\theta$.
If $E[T] = \theta + bias(\theta)$ then $bias(\theta)$ is called the bias of the statistic $T$, where $E[T]$ represents the expected value of the statistic $T$.
If $bias(\theta) = 0$, then $E[T] = \theta$ and $T$ is an unbiased estimator of the true parameter $\theta$.
Expected Risk
Let $D_n$ be a training set of examples $z_i$ drawn independently from an unknown distribution $p(z)$
We need a set of functions $F$. Example: linear functions $f(x) = a \times x + b$
We need a loss function $L(z, f)$. Example: $L((x, y), f) = (f(x) - y)^2$
The Expected Risk, i.e. the expected generalization error, is:
$$R(f) = E_{z \sim p(z)}\big[L(z, f)\big] = \int L(z, f)\, p(z)\, dz$$
But we do not know $p(z)$, and we cannot test all $z$!
Empirical Risk
Because we cannot measure the real Expected Risk, we have to estimate it using the Empirical Risk:
$$\hat{R}(f, D_n) = \frac{1}{n} \sum_{i=1}^{n} L(z_i, f)$$
$D_n$ is our dataset
And our training procedure then relies on Empirical Risk Minimization (ERM):
$$f^\star(D_n) = \arg\min_{f \in F} \hat{R}(f, D_n)$$
And the training error is given by:
$$\hat{R}\big(f^\star(D_n), D_n\big) = \min_{f \in F} \hat{R}(f, D_n)$$
Does this make sense?
The empirical risk is an unbiased estimate of the risk, i.e. the more test samples we have, the more accurate our estimate is, under iid assumption.
But the training risk is biased
The training error is a biased estimate of the risk, i.e. the solution $f^\star(D_n)$ found by minimizing the training error is better on $D_n$ than on any other set $D'_n$ drawn from $p(z)$.
However, under certain assumptions, the difference between the expected and the empirical risks can be bounded. This is an important result from the work of Vapnik.
Note that the empirical risk on the test set is an unbiased estimate of the risk.
Estimate the Expected Risk with the Empirical Risk
For a given capacity, using more samples to train and evaluate your predictor should make your Empirical Risk converge toward the best possible Expected Risk, if the ERM is consistent for $F$, given your training set $D_n$.
The difference between Expected Risk and Empirical Risk is bounded but depends on the capacity of $F$ (set of possible functions).
There is an optimal capacity for a given number of training samples $n$.
Capacity
The capacity $h(F)$ of a set of functions $F$ is a measure of its size or complexity (its VC dimension)
For classification, the capacity of $F$ is defined by Vapnik & Chervonenkis as:
- the largest $n$
- such that there exists a set of $n$ examples $D_n$
- such that one can always find an $f \in F$
- which gives the correct answer for all examples in $D_n$,
- for any possible labeling.
The Bias-Variance Dilemma
Intrinsic dilemma: when the capacity $h(F)$ grows, the bias goes down, but the variance goes up!
Decomposing the bias-variance-error for MSE
For a regression problem with a mean square loss, we have the following decomposition. Let $Y = f(X) + \varepsilon$, with $\varepsilon \sim N(0, \sigma_{\varepsilon}^2)$ and $f_D(X)$ an estimator of $f(X)$, learned over the training set $D$. The error at a particular point $X = x_0$ is:
$$E\big[(Y - f_D(x_0))^2\big] = \sigma_{\varepsilon}^2 + \big(E[f_D(x_0)] - f(x_0)\big)^2 + E\big[(f_D(x_0) - E[f_D(x_0)])^2\big]$$
i.e. irreducible noise + squared bias + variance (expectations taken over training sets $D$ and noise $\varepsilon$).
In practice
Empirical Risk and Expected Risk
Measure train and test error
Use hold-out sets, cross-validation, etc. to get a test error.
Train error: Empirical Risk. Can my model learn something (by heart)?
Test error: Coarse estimate of the Expected Risk. Can my model generalize to unseen data?
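A minimal sketch of this train/test comparison (toy dataset and model are only illustrative):

```python
# Minimal sketch: train error (empirical risk) vs. hold-out test error
# (a coarse estimate of the expected risk). Dataset and model are illustrative.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(gamma=0.001).fit(X_train, y_train)
print("train score:", clf.score(X_train, y_train))  # can my model learn (by heart)?
print("test score: ", clf.score(X_test, y_test))    # can it generalize?
```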
Detect under-fitting and over-fitting
Some solutions / hints
How to get started?
- Get enough data in the right format from your customer (hard)
- Check and split data (boring but mandatory)
- Agree on a loss function and minimum performance goal (moderate)
- Try to overfit a predictor on some samples (train set loss), increase complexity only if needed (capacity check)
- Fit on more data (more = better)
- Check for overfitting (val set loss) and add regularization if needed
- Evaluate performance thoroughly (test set loss) (reports, identify failure cases, etc.)
- Do some hyper-parameter optimization, try other models…
- …
Introduction to practice session 6
Using classification to segment images
Until now
- 1 image $\to$ many vectors (instance recognition)
- 1 image $\to$ 1 vector (image retrieval, image classification)
Today / next practice session :
- 1 pixel $\to$ 1 vector (pixel classification, image semantic segmentation)
Brain Anatomy and Imaging
Human brain = Where human OS is stored and run
To investigate brain malfunction, two options:
Magnetic Resonance Imaging (MRI)
Everything you always wanted to know about MRI
Hydrogen atoms are naturally abundant in humans, particularly in water and fat.
Pulses of radio waves excite the nuclear spin energy transition, producing a macroscopic polarization that is detected by antennas.
Magnetic field gradients localize the polarization in space. By varying the parameters of the pulse sequence, different contrasts may be generated between tissues based on the relaxation properties of the hydrogen atoms therein.
What you actually need to know
MRI is a large family of imaging techniques
They can produce 3D scans of various appearances in order to emphasize some human tissues versus others.
BraTS: Brain Tumor Segmentation Competition
Original segmentation task
Given a 3D scan (skull-stripped, registered) of a patient with T1, T2, T1C and FLAIR modalities, predict a tumor class for each voxel (the patient suffers from a glioma):
With:
- edema (yellow),
- non-enhancing solid core (red),
- necrotic/cystic core (green),
- enhancing core (blue).
Original dataset
We use data from the 2018 edition of the competition, which originally contains 285 brain scans.
Your Mission
A simplified competition
Because dealing with 3D data and normalization would cost you a lot of time and pain, we:
- already performed data normalization
- extracted 2D (axial) slices that you have to process
Actual task
Given a $240\times240$ image with $4$ modalities (already normalized), predict for each pixel whether it belongs to a tumor or not.
Actual dataset
Train set
- $256$ normalized slices, one per patient, each a $240\times240$ image with $4$ channels ($1$ for each modality), float32
- $256$ target segmentations, one per patient, each a $240\times240$ image with $1$ channel (indicating tumor or clean region), uint8
Test set
- $29$ normalized slices, one per patient (not in the training set), each a $240\times240$ image with $4$ channels ($1$ for each modality), float32
- Ground truth kept secret for grading
Suggested Pipeline
Data preprocessing
We already did this step.
Choose and train a classifier
There are several suggestions in the reference notebook: SVM, neural network, etc.
- Input = 1 vector of 4 components for each pixel (see the sketch below)
- Output = 1 for tumor, 0 for “not tumor”
Do not use background (“black”) pixels for training; they would ruin your classification.
Deep nets can work but they are harder to train well.
Besides, don’t use deep nets here: we’ll play with them next semester.
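A minimal sketch of this setup (the `slices` and `masks` arrays are hypothetical placeholders for the data actually loaded in the notebook, and the classifier is just one of the possible choices):

```python
# Hypothetical sketch: build one 4-component vector per pixel and train a classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholders for the real data loaded in the notebook:
# slices: (n_slices, 240, 240, 4) float32, masks: (n_slices, 240, 240) uint8.
slices = np.random.rand(2, 240, 240, 4).astype(np.float32)
masks = (np.random.rand(2, 240, 240) > 0.95).astype(np.uint8)

X = slices.reshape(-1, 4)          # input: 1 vector of 4 components per pixel
y = masks.reshape(-1)              # output: 1 for tumor, 0 for "not tumor"

keep = X.any(axis=1)               # drop background ("black") pixels
X, y = X[keep], y[keep]

clf = RandomForestClassifier(n_estimators=50, n_jobs=-1).fit(X, y)
```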
Validate your training
Create and use a validation set extracted from the full training set.
Do not train on the samples it contains.
`sklearn.model_selection.train_test_split` may be your friend.
Check visually results from both train and val sets!
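Continuing the hypothetical sketch above, holding out a validation set is one line; just make sure the classifier is refitted only on the training part:

```python
# Continuing the hypothetical sketch above: hold out 20% of the pixels as a
# validation set, fit on the rest, and compare train vs. validation scores.
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_tr, y_tr)
print("train:", clf.score(X_tr, y_tr), "val:", clf.score(X_val, y_val))
```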
Interpret your results
Add some context to each pixel
You can get better results by looking at the neighborhood of a pixel to classify it better: train with vectors of size $N\times M$ instead of $1\times M$.
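One possible way to build such neighborhood vectors, as a hypothetical sketch (the window radius is arbitrary):

```python
# Hypothetical sketch: give each pixel the values of its (2*radius+1)^2 neighbors,
# i.e. feature vectors of size N*M with N = window size and M = 4 modalities.
import numpy as np

def pixel_features_with_context(img, radius=1):
    """img: (H, W, C) slice -> (H*W, (2*radius+1)**2 * C) feature matrix."""
    H, W, C = img.shape
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    feats = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            feats.append(padded[dy:dy + H, dx:dx + W, :])   # shifted copy of the slice
    return np.concatenate(feats, axis=-1).reshape(H * W, -1)
```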
Fighting underfitting and overfitting
You do not have much data to train on
If you pick a classifier which is too simple, you may underfit: you will get low and similar scores both on the train and test sets
Choosing another classifier may be a good idea here.
You may also easily overfit your classifier, especially if you use one with a large capacity: you will get excellent scores on the train set, and bad ones on the test set.
Regularization may be necessary.
Post processing
We suggest in the notebook to “clean up” the results by removing very small isolated pixels marked as tumor.
You may have many other ideas here
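One possible sketch of that clean-up, assuming scikit-image is available (the `min_size` value is arbitrary and should be tuned on your validation set):

```python
# Possible sketch: drop connected tumor components smaller than min_size pixels.
from skimage.morphology import remove_small_objects

def clean_prediction(pred_mask, min_size=20):
    """pred_mask: (240, 240) binary array of per-pixel tumor predictions."""
    return remove_small_objects(pred_mask.astype(bool), min_size=min_size)
```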
Going Further
Many options
- Data augmentation to increase train set
- Larger / better neighborhood for each pixel
- Better ANN structure than the one suggested in the notebook
- Change the representation space? (Fourier, wavelets…)
- As the tumors under consideration may not have “holes”, improve the post-processing
- Super heavy classifiers (UNet, Gradient Boosted Trees…)
- …
Conclusion
Course overview: a very small glimpse of CV/PR/ML
Welcome to 2012
AlexNet, by A. Krizhevsky, I. Sutskever and G. E. Hinton, halved the error rate in the ImageNet competition
Deep learning
Will be there for a few years!
Is a natural extension of what we saw: feature extraction, encoding, pooling, classification in a single, integrated, globally optimized pipeline.
Requires skills you learned: dev, math, data preparation, evaluation. Input data still need to be properly normalized, for instance.
Requires a lot of practice: read papers, don’t be impressed by the math, implement them.
If not applicable, then pick one of the good old techniques we talked about.