Link to the HackMD note
Agenda
- Introduction
- Image classification overview
- Some classifiers - part 1
- Classifier evaluation
Summary of last lecture
Content-based image retrieval
- 2 strategies: keep all local descriptors for all images vs 1 descriptor per image
- Bag of Visual Words pipeline
- Focus on encoding
Evaluation of image retrieval systems
- Precision
- Recall
- F-Measure
- mAP
Texture descriptors (we did not cover them at all)
- What is a texture?
- Fast and classic approaches
- Old-school descriptors
Practice session 4: Take home messages
BoVW
- Usually requires some preprocessing of the descriptors: centering, rotation/axes permutation, dimensionality reduction…
- Is based on a quantization step (assign descriptors to clusters)
- Is just a histogram, like the color histogram of session 2
- We can compute more advanced statistics to get better results (VLAD, FVs)
Best practices:
- Test arrays shapes and types as soon as possible
- Make a small change, test, fix, test, validate, repeat
- Get a complete, basic pipeline ASAP and improve it until time is over
Next practice session
Implement a simple image classifier:
Will be graded
Steps
- Load resources
- Train a BoVW model
- Split the dataset into training and validation sets
- Compute the BoVW descriptor for each image
- We will make a small change here (sqrt + L2-norm)
- Prepare training structures
- Train a classifier and evaluate its performance
- Training and evaluating are easy with scikit-learn (see the sketch after this list)
- Display some results
- Test on meme image
- Compute the results on the test set and export them
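As a rough sketch of the split / train / evaluate steps listed above, the scikit-learn part could look like the following; the data is random and the variable names are placeholders, not the actual resources of the practice session:

```python
# Minimal sketch of the split / train / evaluate steps with scikit-learn.
# `bovw_vectors` and `labels` are placeholder names with fake data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

bovw_vectors = np.random.rand(100, 64)          # fake BoVW descriptors
labels = np.random.randint(0, 2, size=100)      # fake targets: 0 = muffin, 1 = chihuahua

X_train, X_val, y_train, y_val = train_test_split(
    bovw_vectors, labels, test_size=0.2, random_state=0, stratify=labels)

clf = SVC()                                     # any scikit-learn classifier would work here
clf.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```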
Image classification overview
Instance recognition vs Class recognition
Instance recognition
Re-recognize a known 2D or 3D rigid object, potentially being viewed from a novel viewpoint, against a cluttered background, and with partial occlusions
Ex: practice session 3
Class recognition
Recognize any instance of a particular general class such as “cat”, “car” or “bicycle”
Aka category-level or generic object recognition
More challenging
This lecture and next practice session
Pipeline overview
Our image classification pipeline
This is a supervised machine learning task
- We need a dataset with samples
- Images will be represented as BoVW vectors of fixed size
- Targets will be encoded as integers
- 0: Muffin
- 1: Chihuahua
This is a very usual data representation for a classification problem
Classifier inputs = “samples” with “features”
Classifier outputs = “labels”
Now we just need to select an appropriate method, prepare our data, run some training, test the results, adjust some parameters, compare approaches, display results, …
Data preparation
NumPy formatting
Training/validation/test separation
- You cannot estimate the generalization performance of your predictor/estimator/classifier on its training set
- You need to keep some samples aside for later evaluation
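A minimal sketch of the expected NumPy formatting and of the holdout split; the shapes, dtypes and split ratio are illustrative assumptions:

```python
# Samples as rows, features as columns; one integer label per sample.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((500, 128), dtype=np.float32)   # (n_samples, n_features)
y = np.zeros(500, dtype=np.int64)            # (n_samples,)

# Keep 25% of the samples aside; never evaluate on the training set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_val.shape)            # (375, 128) (125, 128)
```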
Other “funny” things to do IRL
- Collect data
- Clean data
- Check data
- Clean again
- Annotate
- Check
- Compute/convert/scale features
Feature selection
Consists in dropping some data columns
Can help later stages:
- Less data to process
- Better properties (like decorrelated features, etc.)
Which columns?
- Hard problem in general
- Because features may be informative as a group
- Some simpler and helpful techniques:
- Remove features with low variances (see the sketch below)
- Dimensionality reduction techniques are not exactly feature selection, but can still have a similar effect
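A hedged sketch of the low-variance technique mentioned above, using scikit-learn's VarianceThreshold; the threshold value and the toy data are arbitrary:

```python
# Drop columns whose variance is below a threshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.0, 0.1],
              [0.0, 2.0, 0.1],
              [0.0, 3.0, 0.1]])   # columns 0 and 2 are constant

selector = VarianceThreshold(threshold=1e-3)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)            # (3, 1): only the informative column remains
```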
Some classifiers - part 1
Disclaimer
What follows is a very limited selection
Only classifiers suitable for image classification as we present it today
input = feature vector, output = label
Many other approaches
What is our goal?
Given samples (described by features) and true labels, find a good function which will correctly predict labels given new data samples
Parametric vs Non Parametric Classifiers
Parametric examples
Logistic regression, Linear Discriminant Analysis, naive Bayes, Perceptron, simple neural networks…
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.
Non-parametric examples
k-Nearest Neighbors, Decision Trees, SVMs
“Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from data. The term non-parametric is not meant to imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance” (Wikipedia)
“Nonparametric methods are good when you have a lot of data and no prior knowledge”
Dummy classifiers
Say you have a dataset with 9 muffins and 1 chihuahua.
You have a new sample to classify.
Which class should you bet on?
If your class prior probabilities $P(C_1), P(C_2),\dots$ are not equal, then you should bet on the most frequent class! ($g(x)=argmax_yp(y)$)
Without such information, you can just pick at random
What is the expected accuracy (true predictions / total predictions) if you have N classes and pick one at random?
Scikit-learn offers a DummyClassifier class which makes it easy to test such a strategy (see the sketch below)
What’s the point?
- Quickly build and test your complete pipeline with a mockup classifier
- Quickly get a baseline for the performance
- (look for obvious bias in the dataset, but you should have cleaned it before !)
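A minimal sketch of such a baseline with scikit-learn's DummyClassifier, on the muffin/chihuahua toy example above (the data is made up):

```python
# Always predict the majority class; this is the baseline any real classifier must beat.
import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([0] * 9 + [1])          # 9 muffins, 1 chihuahua
X = np.zeros((10, 5))                # features are ignored by this strategy

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
print(dummy.score(X, y))             # 0.9
```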
K Nearest Neighbor (kNN)
Keep all training samples
View new samples as queries over the previously learned / indexed samples
Assign the class of the closest sample(s)
\[f(x) = y_i,\quad i = argmin_j\Vert x_j-x\Vert\]
We can check more than one sample
Remember the bias/variance trade-off?
Pros
- very simple to implement
- Capacity easily controlled with k
- Can be tuned to work on large datasets: indexing, data cleaning, etc.
- Good baseline
- Non parametric
- Lazy learner
Cons
- In high dimensions, all samples tend to be very close (for the Euclidean distance)
- Large memory consumption on large datasets
- Requires a large number of samples and a large k to get the best performance
Setting K:
\[K\simeq\sqrt{\frac{m}{C}}\]
where $\frac{m}{C}$ is the average number of training samples per class
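A hedged sketch of a kNN baseline with scikit-learn, using the rule of thumb above as a starting value for k (all numbers and data are toy assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

m, C = 300, 2                                   # toy numbers: training samples and classes
k = max(1, int(round(np.sqrt(m / C))))          # k ~ sqrt(m / C) ~ 12

X_train = np.random.rand(m, 64)
y_train = np.random.randint(0, C, size=m)

knn = KNeighborsClassifier(n_neighbors=k)       # "training" is just indexing the samples
knn.fit(X_train, y_train)
print(knn.predict(np.random.rand(3, 64)))       # labels of the 3 query points
```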
Other distance-based classifier
Minimal Euclidean distance
Very basic classifier
Distance to the mean $m_i$ of the class
It does not take into account differences in variance for each class
Predicted class for x:
\[g(x) = argmin_iD_i(x)\]
Minimal quadratic distance (Mahalanobis)
For each class $i$, the mean $m_i$ and covariance matrix $S_i$ are computed from the set of examples
The covariance matrix is taken into account when computing the distance from an image to the class $i$
The feature vector of the image $x$ is projected over the eigenvectors of the class
\[g(x) = argmin_iD_i(x)\]
A quick introduction to Bayesian Decision Theory
Example - RoboCup
General case: maximum a posteriori (MAP)
General case: we need to take into consideration $p(y)$ and $p(x)$
- $p(x\vert y)$: class conditional density (here: histograms)
- $p(y)$: class priors, e.g. for indoor RoboCup
- $p(floor) = 0.6$, $p(goal) = 0.3$, $p(ball) = 0.1$
- $p(x)$: probability of seeing data $x$
Optimal decision rule (Bayes classifier): maximum a posteriori (MAP):
\[g(x) = argmax_{y\in Y}p(y\vert x)\]
How to compute $p(y\vert x)$?
\[p(y\vert x) = \frac{p(x\vert y)p(y)}{p(x)}\quad\text{Bayes' rule}\]
If classes are equiprobable and the error cost is the same, then, because $p(x)$ is constant, we get the maximum likelihood estimation:
\[g(x) = \underbrace{argmax_{y\in Y}p(y\vert x)}_{\text{MAP}}\simeq\underbrace{argmax_{y\in Y}p(x\vert y)}_{\text{ML}}\]
Generative, discriminant and “direct” classifiers
Generative Probabilistic Models
Some classical Generative Probabilistic Models
Training data $X=\{x_1,\dots,x_n\}$, $Y=\{y_1,\dots,y_n\}$, with $(x_i,y_i)\in\mathcal X\times\mathcal Y$
For each $y\in\mathcal Y$, build a model for $p(x\vert y)$ from $X_y:=\{x_i\in X:y_i=y\}$
- Histogram: if $x$ can have only a few discrete values
- Kernel Density Estimator
- Gaussian
- Mixture of Gaussians
Typically, $\mathcal Y$ is small (few possible labels) and $\mathcal X$ is low-dimensional
Class conditional densities and posteriors
Naive Bayes Classifiers
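As an illustration, a naive Bayes classifier with Gaussian class-conditional densities (one of the generative models listed above) can be sketched with scikit-learn's GaussianNB; the toy data and class means are assumptions:

```python
# Fit p(x|y) per class (features assumed independent), then predict via MAP.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),       # class 0 around (0, 0)
               rng.normal(3, 1, (50, 2))])      # class 1 around (3, 3)
y = np.array([0] * 50 + [1] * 50)

nb = GaussianNB()
nb.fit(X, y)
print(nb.predict([[0.2, -0.1], [2.8, 3.1]]))    # -> [0 1]
print(nb.predict_proba([[1.5, 1.5]]))           # posterior p(y|x) for an ambiguous point
```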
Linear discriminant classifiers
General idea for binary classification
Learn w and b
- you can compute $p(y\vert x)\simeq\hat y$
Problem: how to learn w and b?
Logistic Regression
Linear classifier, $f$ is logistic function
\(\sigma(x) = \frac{1}{(1+e^{-x})} = \frac{e^x}{(1+e^x)}\)
- Maps all reals $\to[0,1]$
Optimize $\sigma(w^Tx+b)$ to find best $w$
Trained using gradient descent (no closed form solution)
Gradient descent
Formally:
\[w_{t+1}=w_t-\eta\nabla L(w_t)\]
where $\eta$ is the step size: how far to step relative to the gradient
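A minimal NumPy sketch of this update rule applied to logistic regression; the loss $L$ is taken to be the average log-loss, and the step size, iteration count and synthetic data are arbitrary assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w > 0).astype(float)              # synthetic binary labels

w = np.zeros(3)
eta = 0.1                                       # step size
for _ in range(500):
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)  # gradient of the average log-loss L(w_t)
    w = w - eta * grad                          # w_{t+1} = w_t - eta * grad L(w_t)
print(w)                                        # learned weights, same sign pattern as true_w
```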
From 2 classes to C classes: 2 strategies
\[\hat y = argmax_{i\in Y}w_ix\]
Maximum Margin classification
What is the best $w$ for this dataset?
Trade-off: large margin vs few mistakes on training set
Support Vector Machine (SVM)
Logistic Regression vs SVM
Optimization problems:
About the regularizer
Effect of cost parameter C (regularization, again)
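A hedged sketch of the effect of C on a linear SVM, using synthetic overlapping classes (the data and C values are illustrative only): small C favors a large margin and tolerates more training mistakes, large C does the opposite.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)             # overlapping classes

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    print(C, svm.score(X, y))                   # training accuracy changes with C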
Non-linear discriminant classifiers
Non-linear classification
What is the best linear classifier for this dataset?
None. We need something nonlinear!
2 solutions:
- Preprocess the data (explicit embedding, kernel trick…)
- Combine multiple linear classifiers into nonlinear classifier (boosting, neural networks…)
Non-linear classification using linear classifiers with data preprocessing
Data preprocessing idea
Transform the dataset to enable linear separability
Linear separation is always possible
The original input space can always be mapped to some higher-dimensional feature space where the training set is separable.
Explicit embedding
Compute $\phi(x)$ for all $x$ in the dataset.
Then train a linear classifier just like before
Used to be avoided because of computation issues, but it is a hot topic again.
Kernel trick
Linear classification only requires computing dot products $\phi(x_i)\cdot\phi(x_j)$
The function $\phi(x)$ does not need to be explicit; we can use a kernel function
\[k(x,z)=\phi(x)\cdot\phi(z)\]
which represents a dot product in a “hidden” feature space.
This gives a non-linear boundary in the original feature space.
Popular kernel functions in Computer Vision
“Linear kernel”: identical solution to a linear SVM
“Hellinger kernel”: less sensitive to extreme values in the feature vector
“Histogram intersection kernel”: very robust
“$\chi^2$-distance kernel”: good empirical results
“Gaussian kernel”: overall most popular kernel in Machine Learning
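A hedged sketch of how some of these kernels can be tried with scikit-learn's SVC; the histogram-like data is made up, and passing chi2_kernel as a kernel callable is just one possible way to use it:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 16)))          # non-negative, histogram-like features
y = rng.integers(0, 2, size=100)

SVC(kernel="linear").fit(X, y)                  # linear kernel
SVC(kernel="rbf", gamma="scale").fit(X, y)      # Gaussian kernel
SVC(kernel=chi2_kernel).fit(X, y)               # chi2 kernel via a kernel callable
```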
Explicit embedding for the Hellinger kernel
Using simple square root properties, we have:
\[k(x,x') = \phi(x)\cdot\phi(x') = \sqrt{x}\cdot\sqrt{x'}\]
Tricks for next practice session: given a BoVW vector,
- L1 normalize it (neutralizes effect of number of descriptors)
- Take its square root (explicit Hellinger embedding)
- L2 normalize it (more linear-classifier friendly)
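A minimal sketch of these three tricks applied to a BoVW histogram; the helper name and the epsilon guard are my own additions:

```python
import numpy as np

def preprocess_bovw(h, eps=1e-12):
    h = np.asarray(h, dtype=np.float64)
    h = h / (np.abs(h).sum() + eps)        # 1) L1 normalization
    h = np.sqrt(h)                         # 2) square root = explicit Hellinger embedding
    h = h / (np.linalg.norm(h) + eps)      # 3) L2 normalization
    return h

print(preprocess_bovw([4, 0, 1, 3]))
```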
Metrics
Confusion matrix and Accuracy
Problems with Accuracy
All the following classifiers have a 90% accuracy
Do all errors have the same cost?
Precision, recall, F-score
Plotting a Precision/Recall for classification data
For binary classification
Instead of $\hat y = argmax_yp(y\vert x)$, take all possible thresholds for $p(y\vert x)$
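A hedged sketch of that idea with scikit-learn's precision_recall_curve; the scores are made up, and in practice they would come from predict_proba or decision_function:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r, t in zip(precision, recall, np.append(thresholds, np.inf)):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```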
TPR, FPR, ROC
ROC: “Receiver Operating Characteristic”. A kind of signal/noise measure under various tunings
Pink line: random results
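A small sketch of how an ROC curve can be computed with scikit-learn (made-up scores); an AUC of 0.5 corresponds to the random baseline shown as the pink line:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)
print("AUC:", roc_auc_score(y_true, y_scores))
```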
More about ROC curves:
Adjusting the threshold
Class overlap
Split the dataset to assess generalization performance
Bootstrap
Draw samples randomly, with replacement, from the training set.
Enables us to estimate the variance of estimators we use in the classification rule.
Holdout
Just keep a part of the dataset for later validation/testing
Cross validation
with meta-parameter tuning
StratifiedKFold (best)
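A minimal sketch of stratified k-fold cross-validation with scikit-learn; the imbalanced toy data and the classifier choice are assumptions:

```python
# Each fold keeps the class proportions of the full dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = np.array([0] * 90 + [1] * 10)               # imbalanced classes

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv)    # one accuracy per fold
print(scores.mean(), scores.std())
```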
Missing things
- Cost of misclassification
- Multiclass classification evaluation
- …