Link to the HackMD note
Practice 5: Take home messages
BoVW is linear classification friendly
- And linear classifiers are to be preferred whenever possible
Data preparation is tedious
- An important part of the time is dedicated to data analysis
- Plus we prepared a lot of things for you in the previous sessions
Scikit-learn is easy and super powerful
- Classifier evaluation in 1 line (see the sketch after this list)
- But there is more: parameter tuning, cross-validation, etc. in 1 or 2 lines
- Data preprocessing + classification (pipelines) in 1-3 lines
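As a quick illustration of the three points above, a minimal sketch (the digits toy dataset and the classifier/parameter choices are only placeholders):

```python
# Minimal sketch of the three points above (dataset and parameters are placeholders).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Classifier evaluation in 1 line (5-fold cross-validation).
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# Parameter tuning in 1-2 lines.
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5).fit(X, y)

# Data preprocessing + classification (pipeline) in 1-3 lines.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
```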
Some classifiers – part 2
How to build non-linear classifiers?
2 solutions:
- Preprocess the data (seen last time)
  - Ex: explicit embedding, kernel trick…
  - Change the input to make it linearly separable
- Combine multiple linear classifiers into a non-linear classifier (current topic)
  - Ex: boosting, neural networks…
  - Split the input space into linear subspaces
Non-linear classification using combinations of linear classifiers
Multi-layer Perceptron
Combine features linearly, apply an activation function $\phi$, repeat (note: with a linear $\phi$, the whole network remains a linear classifier)
Universal approximation theorem
What if $\phi$ is not linear?
Universal approximation theorem (Cybenko 89, Hornik 91)
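Informally (the exact hypotheses differ between the Cybenko and Hornik versions, so take this as a sketch): a one-hidden-layer network
$$g(x) = \sum_{i=1}^{N} \alpha_i\, \phi(w_i^\top x + b_i)$$
with a non-constant, bounded, continuous activation $\phi$ can approximate any continuous function on a compact set to arbitrary precision, provided $N$ is large enough.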
Decision tree
Works on categorical (e.g. “red”, “black”) and numerical (both discrete and continuous) variables
Train by optimizing classification “purity” at each decision (threshold on a particular dimension in numerical case)
Very fast training and testing. Non-parametric.
No need to preprocess the features
BUT: very prone to overfitting without strong limits on depth
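A minimal sketch of such a depth limit with scikit-learn (dataset and values are only illustrative):

```python
# Minimal sketch: a depth-limited decision tree (dataset and values are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth and min_samples_leaf are the usual guards against overfitting.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
print(cross_val_score(tree, X, y, cv=5).mean())
```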
Random Forests
Average the decision of multiple decision trees
Randomize in 2 ways:
- For each tree, pick a bootstrap sample of data
- For each split, pick random sample of features
More trees are always better
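The two randomization knobs above map directly to scikit-learn parameters; a minimal sketch (values are illustrative):

```python
# Minimal sketch: a random forest (values are illustrative).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_estimators: number of bootstrapped trees (more is better, but slower);
# max_features: size of the random feature subset tried at each split.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt")
print(cross_val_score(forest, X, y, cv=5).mean())
```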
Ensemble methods
“Bagging” or “bootstrap aggregating”
Underlying idea: part of the variance is due to the specific choice of the training data set
- Let us create many similar training data sets using the bootstrap
- For each of them, train a new classifier
- The final function will be the average of the individual function outputs
If the generalization error is decomposed into bias and variance terms, then bagging reduces variance (the average of a large number of random errors $\simeq 0$)
Random forest = a way of bagging trees
“Boosting”, AdaBoost variant
Combination of weak classifiers: $\sum_m \alpha_m G_m(x)$
$\alpha_m$ increases with accuracy (fewer errors, bigger $\alpha_m$)
The classifier $G_m$ is trained with an increased error cost for the observations which were misclassified by $G_{m-1}$
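In the classical discrete AdaBoost formulation (variants differ in the exact constants), with $err_m$ the weighted error rate of $G_m$ and $w_i$ the weight of observation $i$:
$$\alpha_m = \log\frac{1 - err_m}{err_m}, \qquad w_i \leftarrow w_i\, e^{\alpha_m \mathbf{1}[y_i \neq G_m(x_i)]}, \qquad G(x) = \mathrm{sign}\Big(\sum_m \alpha_m G_m(x)\Big)$$
so accurate weak learners get a larger $\alpha_m$, and misclassified observations get a larger weight for the next round.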
A quick comparison
More tricks
Data augmentation
Add realistic deformations to your input in order to improve domain coverage.
For image data, depending on what is possible in production: rotations, horizontal & vertical flips, scaling, translation, illumination change, warping, noise, etc.
For vector data: interesting problem. Possible approach: train/fit PCA then add random noise in low-energy features
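One possible (and entirely hypothetical) implementation of that PCA idea; the energy threshold and noise scale are arbitrary and would need tuning:

```python
# Hypothetical sketch of PCA-based augmentation for vector data:
# fit a PCA, add Gaussian noise only on the low-energy components, map back.
import numpy as np
from sklearn.decomposition import PCA

def augment_with_pca_noise(X, noise_scale=0.1, keep_energy=0.95, n_copies=1, seed=None):
    rng = np.random.default_rng(seed)
    pca = PCA().fit(X)
    Z = pca.transform(X)
    # Components beyond `keep_energy` of the cumulative explained variance
    # are considered "low energy" and receive the noise.
    low = np.cumsum(pca.explained_variance_ratio_) > keep_energy
    std = np.sqrt(pca.explained_variance_)
    copies = []
    for _ in range(n_copies):
        noise = rng.normal(scale=noise_scale * std, size=Z.shape) * low
        copies.append(pca.inverse_transform(Z + noise))
    return np.vstack(copies)
```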
Reject
Several options:
- Improve the model of the class boundary
  - In 1-vs-all training, add noise to the “others” samples
- Adjust the decision function depending on your application
  - Look at the prediction probability of your classifier, and threshold it as your application requires, using a ROC curve (see the sketch below)
- Model the noise
  - Add a “none” class to your classifier, with samples from real-life negative cases
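A minimal sketch of the probability-thresholding option (synthetic data, model choice and the 5% false-positive budget are purely illustrative):

```python
# Minimal sketch: reject by thresholding predict_proba, picking the threshold
# on a ROC curve (data, model and the 5% false-positive budget are illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = clf.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, proba)

# Keep the last threshold whose false positive rate stays under 5%;
# samples below it are rejected (treated as negatives).
threshold = thresholds[np.searchsorted(fpr, 0.05, side="right") - 1]
y_pred = (proba >= threshold).astype(int)
```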
More theory on ML
What is our goal?
Given samples (described by features) and true labels, find a good function which will correctly predict labels given new data samples
Problems:
- Which family for our function?
- What is “good”?
- How to train / find such function?
What are the sources of error?
- Noise
  - Your data is not perfect (or: “Every model is wrong.”)
  - Even if there exists an optimal underlying model, the observations are corrupted by noise (e.g. multiple $y$ for a given $x$).
  - Even the optimal solution could be wrong.
- Bias
  - You need to simplify to generalize.
  - Your classifier needs to drop some information about the training set to have generalization power.
  - The set of solutions explored does not contain the optimal solution.
- Variance
  - You have many ways to explain your training dataset.
  - It is hard to find an optimal solution among those many possibilities.
  - If we draw another training set from the same distribution, we would obtain another solution.
2 big issues
Under-fitting
- Caused by bias
- Your model assumptions are too strong for the data, so the model won’t fit well
Over-fitting
- Caused by variance
- Your algorithm has memorized the data including the noise, so it can’t generalize.
The theory
Bias (statistical definition)
Let $T$ be a statistic used to estimate a parameter $\theta$.
If $E[T] = \theta + bias(\theta)$ then $bias(\theta)$ is called the bias of the statistic $T$, where $E[T]$ represents the expected value of the statistic $T$.
If $bias(\theta) = 0$, then $E[T] = \theta$ and $T$ is an unbiased estimator of the true parameter $\theta$.
Expected Risk
Let $D_n$ be a training set of examples $z_i$ drawn independently from an unknown distribution $p(z)$
We need a set of functions $F$. Example: linear functions $f(x) = a \times x + b$
We need a loss function $L(z, f)$. Example: $L((x, y), f) = (f(x) - y)^2$
The Expected Risk, i.e. the expected generalization error, is:
$$R(f) = E_{z \sim p(z)}\big[L(z, f)\big] = \int L(z, f)\, p(z)\, dz$$
But we do not know $p(z)$, and we cannot test all $z$!
Empirical Risk
Because we cannot measure the real Expected Risk, we have to estimate it using the Empirical Risk:
$$\hat{R}(f, D_n) = \frac{1}{n} \sum_{i=1}^{n} L(z_i, f)$$
$D_n$ is our dataset
And our training procedure then relies on Empirical Risk Minimization (ERM):
$$f^\star(D_n) = \arg\min_{f \in F} \hat{R}(f, D_n)$$
And the training error is given by:
$$\hat{R}\big(f^\star(D_n), D_n\big) = \min_{f \in F} \hat{R}(f, D_n)$$
Does this make sense?
The empirical risk is an unbiased estimate of the risk, i.e. the more test samples we have, the more accurate our estimate is, under iid assumption.
But the training risk is biased
The training error is a biased estimate of the risk, i.e. the solution $f^\star(D_n)$ found by minimizing the training error is better on $D_n$ than on any other set $D'_n$ drawn from $p(z)$.
However, under certain assumptions, the difference between the expected and the empirical risks can be bounded. This is an important result from the work of Vapnik.
Note that the empirical risk on the test set is an unbiased estimate of the risk.
Estimate the Expected Risk with the Empirical Risk
For a given capacity, using more samples to train and evaluate your predictor should make your Empirical Risk converge toward the best possible Expected Risk, if the ERM is consistent for $F$, given your training set $D_n$.
The difference between Expected Risk and Empirical Risk is bounded but depends on the capacity of $F$ (set of possible functions).
There is an optimal capacity for a given number of training samples $n$.
Capacity
The capacity $h(F)$ of a set of functions $F$ is a measure of its size or complexity (its VC dimension)
For classification, the capacity of $F$ is defined by Vapnik & Chervonenkis as:
- the largest $n$
- such that there exists a set of $n$ examples $D_n$
- such that one can always find an $f \in F$
- which gives the correct answer for all examples in $D_n$,
- for any possible labeling.
The Bias-Variance Dilemma
Intrinsic dilemma: when the capacity $h(F)$ grows, the bias goes down, but the variance goes up!
Decomposing the bias-variance-error for MSE
For a regression problem with a mean square loss, we have the following decomposition. Let $Y = f(X) + \varepsilon$, with $\varepsilon \sim N(0, \sigma_{\varepsilon}^2)$ and $f_D(X)$ an estimator of $f(X)$, learned over the training set $D$. The error at a particular point $X = x_0$ is:
$$E\big[(Y - f_D(x_0))^2\big] = \sigma_{\varepsilon}^2 + \big(E[f_D(x_0)] - f(x_0)\big)^2 + E\big[(f_D(x_0) - E[f_D(x_0)])^2\big]$$
i.e. irreducible noise + squared bias + variance (expectations taken over training sets $D$ and noise $\varepsilon$).
In practice
Empirical Risk and Expected Risk
Measure train and test error
Use hold-out sets, cross-validation, etc. to get a test error.
Train error: Empirical Risk. Can my model learn something (by heart)?
Test error: Coarse estimate of the Expected Risk. Can my model generalize to unseen data?
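A minimal sketch of this train/test comparison (toy dataset and model are only illustrative):

```python
# Minimal sketch: train error (empirical risk) vs. hold-out test error
# (a coarse estimate of the expected risk). Dataset and model are illustrative.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(gamma=0.001).fit(X_train, y_train)
print("train score:", clf.score(X_train, y_train))  # can my model learn (by heart)?
print("test score: ", clf.score(X_test, y_test))    # can it generalize?
```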
Detect under-fitting and over-fitting
Some solutions / hints
How to get started?
- Get enough data in the right format from your customer (hard)
- Check and split data (boring but mandatory)
- Agree on a loss function and minimum performance goal (moderate)
- Try to overfit a predictor on some samples (train set loss), increase complexity only if needed (capacity check)
- Fit on more data (more = better)
- Check for overfitting (val set loss) and add regularization if needed
- Evaluate performance thoroughly (test set loss) (reports, identify failure cases, etc.)
- Do some hyper-parameter optimization, try other models…
- …
Introduction to practice session 6
Using classification to segment images
Until now
- 1 image $\to$ many vectors (instance recognition)
- 1 image $\to$ 1 vector (image retrieval, image classification)
Today / next practice session :
- 1 pixel $\to$ 1 vector (pixel classification, image semantic segmentation)
Brain Anatomy and Imaging
Human brain = Where human OS is stored and run
To investigate brain malfunction, two options:
Magnetic Resonance Imaging (MRI)
Everything you always wanted to know about MRI
Hydrogen atoms are naturally abundant in humans, particularly in water and fat.
Pulses of radio waves excite the nuclear spin energy transition, producing a macroscopic polarization that is detected by antennas.
Magnetic field gradients localize the polarization in space. By varying the parameters of the pulse sequence, different contrasts may be generated between tissues based on the relaxation properties of the hydrogen atoms therein.
What you actually need to know
MRI is a large family of imaging techniques
They can produce 3D scans of various appearances in order to emphasize some human tissues versus others.
BraTS: Brain Tumor Segmentation Competition
Original segmentation task
Given a 3D scan (skull-stripped, registered) of a patient with T1, T2, T1C and FLAIR modalities, predict a tumor class for each voxel (the patient suffers from a glioma):
With:
- edema (yellow),
- non-enhancing solid core (red),
- necrotic/cystic core (green),
- enhancing core (blue).
Original dataset
We use data from the 2018 edition of the competition, which originally contains 285 brain scans.
Your Mission
A simplified competition
Because dealing with 3D data and normalization would cost you a lot of time and pain, we:
- already performed data normalization
- extracted 2D (axial) slices that you have to process
Actual task
Given a $240\times240$ image with $4$ modalities (already normalized), predict for each pixel whether it belongs to a tumor or not.
Actual dataset
Train set
- $256$ normalized slices, one per patient, each a $240\times240$ image with $4$ channels ($1$ for each modality), float32
- $256$ target segmentations, one per patient, each a $240\times240$ image with $1$ channel (indicating tumor or clean region), uint8
Test set
- $29$ normalized slices, one per patient (not in the training set), each a $240\times240$ image with $4$ channels ($1$ for each modality), float32
- Ground truth kept secret for grading
Suggested Pipeline
Data preprocessing
We already did this step.
Choose and train a classifier
There are several suggestions in the reference notebook: SVM, neural network, etc.
- Input = 1 vector of 4 components for each pixel (see the sketch below)
- Output = 1 for tumor, 0 for “not tumor”
Do not use background (“black”) pixels for training; they would ruin your classification.
Deep nets can work but they are harder to train well.
Besides, don’t use deep nets here: we’ll play with them next semester.
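A minimal sketch of this setup (the `slices` and `masks` arrays are hypothetical placeholders for the data actually loaded in the notebook, and the classifier is just one of the possible choices):

```python
# Hypothetical sketch: build one 4-component vector per pixel and train a classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholders for the real data loaded in the notebook:
# slices: (n_slices, 240, 240, 4) float32, masks: (n_slices, 240, 240) uint8.
slices = np.random.rand(2, 240, 240, 4).astype(np.float32)
masks = (np.random.rand(2, 240, 240) > 0.95).astype(np.uint8)

X = slices.reshape(-1, 4)          # input: 1 vector of 4 components per pixel
y = masks.reshape(-1)              # output: 1 for tumor, 0 for "not tumor"

keep = X.any(axis=1)               # drop background ("black") pixels
X, y = X[keep], y[keep]

clf = RandomForestClassifier(n_estimators=50, n_jobs=-1).fit(X, y)
```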
Validate your training
Create and use a validation set extracted from the full training set.
Do not train on the samples it contains.
`sklearn.model_selection.train_test_split` may be your friend.
Check visually results from both train and val sets!
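Continuing the hypothetical sketch above, holding out a validation set is one line; just make sure the classifier is refitted only on the training part:

```python
# Continuing the hypothetical sketch above: hold out 20% of the pixels as a
# validation set, fit on the rest, and compare train vs. validation scores.
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
clf.fit(X_tr, y_tr)
print("train:", clf.score(X_tr, y_tr), "val:", clf.score(X_val, y_val))
```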
Interpret your results
Add some context to each pixel
You can get better results by looking at the neighborhood of a pixel to classify it better: train with vectors of size $N\times M$ instead of $1\times M$.
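One possible way to build such neighborhood vectors, as a hypothetical sketch (the window radius is arbitrary):

```python
# Hypothetical sketch: give each pixel the values of its (2*radius+1)^2 neighbors,
# i.e. feature vectors of size N*M with N = window size and M = 4 modalities.
import numpy as np

def pixel_features_with_context(img, radius=1):
    """img: (H, W, C) slice -> (H*W, (2*radius+1)**2 * C) feature matrix."""
    H, W, C = img.shape
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    feats = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            feats.append(padded[dy:dy + H, dx:dx + W, :])   # shifted copy of the slice
    return np.concatenate(feats, axis=-1).reshape(H * W, -1)
```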
Fighting underfitting and overfitting
You do not have much data to train on
If you pick a classifier which is too simple, you may underfit: you will get low and similar scores both on the train and test sets
Choosing another classifier may be a good idea here.
You may also easily overfit your classifier, especially if you use one with a large capacity: you will get excellent scores on the train set, and bad ones on the test set.
Regularization may be necessary.
Post processing
We suggest in the notebook to “clean up” the results by removing very small isolated pixels marked as tumor.
You may have many other ideas here
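One possible sketch of that clean-up, assuming scikit-image is available (the `min_size` value is arbitrary and should be tuned on your validation set):

```python
# Possible sketch: drop connected tumor components smaller than min_size pixels.
from skimage.morphology import remove_small_objects

def clean_prediction(pred_mask, min_size=20):
    """pred_mask: (240, 240) binary array of per-pixel tumor predictions."""
    return remove_small_objects(pred_mask.astype(bool), min_size=min_size)
```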
Going Further
Many options
- Data augmentation to increase train set
- Larger / better neighborhood for each pixel
- Better ANN structure than the one suggested in the notebook
- Change the representation space? (Fourier, wavelets…)
- As the tumors under consideration may not have “holes”, improve the post-processing
- Super heavy classifiers (UNet, Gradient Boosted Trees…)
- …
Conclusion
Course overview: a very small glimpse of CV/PR/ML
Welcome to 2012
AlexNet, by A. Krizhevsky, I. Sutskever and G. E. Hinton, halved the error rate in the ImageNet competition
Deep learning
Will be there for a few years!
Is a natural extension of what we saw: feature extraction, encoding, pooling, classification in a single, integrated, globally optimized pipeline.
Requires skills you learned: dev, math, data preparation, evaluation. Input data still need to be properly normalized, for instance.
Requires a lot of practice: read papers, don’t be impressed by the math, implement them.
If not applicable, then pick one of the good old techniques we talked about.