Link to the HackMD note
Agenda
- Introduction
- Image classification overview
- Some classifiers - part 1
- Classifier evaluation
Summary of last lecture
Content-based image retrieval
- 2 strategies: keep all local descriptors for all images vs 1 descriptor per image
- Bag of Visual Words pipeline
- Focus on encoding
Evaluation of image retrieval systems
- Precision
- Recall
- F-Measure
- mAP
Texture descriptors (we did not cover them at all)
- What is a texture?
- Fast and classic approaches
- Old-school descriptors
Practice session 4: Take home messages
BoVW
- Usually requires some preprocessing of the descriptors: centering, rotation/axes permutation, dimensionality reduction…
- Is based on a quantization step (assign descriptors to clusters)
- Is just a histogram, like the color histogram of session 2
- We can compute more advanced statistics to get better results (VLAD, FVs)
Best practices:
- Test arrays shapes and types as soon as possible
- Make a small change, test, fix, test, validate, repeat
- Get a complete, basic pipeline ASAP and improve it until time is over
Next practice session
Implement a simple image classifier:
Will be graded
Steps
- Load resources
- Train a BoVW model
- Split the dataset into training and validation sets
- Compute the BoVW descriptor for each image
- We will make a small change here (sqrt + L2-norm)
- Prepare training structures
- Train a classifier and evaluate its performance
- Training and evaluating are easy with scikit-learn (see the sketch after this list)
- Display some results
- Test on meme image
- Compute the results on the test set and export them
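As a rough sketch of the split / train / evaluate steps listed above, the scikit-learn part could look like the following; the data is random and the variable names are placeholders, not the actual resources of the practice session:

```python
# Minimal sketch of the split / train / evaluate steps with scikit-learn.
# `bovw_vectors` and `labels` are placeholder names with fake data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

bovw_vectors = np.random.rand(100, 64)          # fake BoVW descriptors
labels = np.random.randint(0, 2, size=100)      # fake targets: 0 = muffin, 1 = chihuahua

X_train, X_val, y_train, y_val = train_test_split(
    bovw_vectors, labels, test_size=0.2, random_state=0, stratify=labels)

clf = SVC()                                     # any scikit-learn classifier would work here
clf.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```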
Image classification overview
Instance recognition vs Class recognition
Instance recognition
Re-recognize a known 2D or 3D rigid object, potentially being viewed from a novel viewpoint, against a cluttered background, and with partial occlusions
Ex: practice session 3
Class recognition
Recognize any instance of a particular general class such as “cat”, “car” or “bicycle”
Aka category-level or generic object recognition
More challenging
This lecture and next practice session
Pipeline overview
Our image classification pipeline
This is a supervised machine learning task
- We need a dataset with samples
- Images will be represented as BoVW vectors of fixed size
- Targets will be encoded as integers
- 0: Muffin
- 1: Chihuahua
This is a very usual data representation for a classification problem
Classifier inputs = “samples” with “features”
Classifier outputs = “labels”
Now we just need to select an appropriate method, prepare our data, run some training, test the results, adjust some parameters, compare approaches, display results, …
Data preparation
NumPy formatting
Training/validation/test separation
- You cannot estimate the generalization performance of your predictor/estimator/classifier on its training set
- You need to keep some samples aside for later evaluation
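A minimal sketch of the expected NumPy formatting and of the holdout split; the shapes, dtypes and split ratio are illustrative assumptions:

```python
# Samples as rows, features as columns; one integer label per sample.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((500, 128), dtype=np.float32)   # (n_samples, n_features)
y = np.zeros(500, dtype=np.int64)            # (n_samples,)

# Keep 25% of the samples aside; never evaluate on the training set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_val.shape)            # (375, 128) (125, 128)
```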
Other “funny” things to do IRL
- Collect data
- Clean data
- Check data
- Clean again
- Annotate
- Check
- Compute/convert/scale features
Feature selection
Consists in dropping some data columns
Can help later stages:
- Less data to process
- Better properties (like decorrelated features, etc.)
Which columns?
- Hard problem in general
- Because features may be informative as a group
- Some simpler and helpful techniques:
- Remove features with low variances (see the sketch below)
- Dimensionality reduction techniques are not exactly feature selection, but can still have a similar effect
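A hedged sketch of the low-variance technique mentioned above, using scikit-learn's VarianceThreshold; the threshold value and the toy data are arbitrary:

```python
# Drop columns whose variance is below a threshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.0, 0.1],
              [0.0, 2.0, 0.1],
              [0.0, 3.0, 0.1]])   # columns 0 and 2 are constant

selector = VarianceThreshold(threshold=1e-3)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)            # (3, 1): only the informative column remains
```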
Some classifiers - part 1
Disclaimer
What follows is a very limited selection
Only classifiers suitable for image classification as we present it today
input = feature vector, output = label
Many other approaches
What is our goal?
Given samples (described by features) and true labels, find a good function which will correctly predict labels given new data samples
Parametric vs Non Parametric Classifiers
Parametric examples
Logistic regression, Linear Discriminant Analysis, naive Bayes, Perceptron, simple neural networks…
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.
Non-parametric examples
k-Nearest Neighbors, Decision Trees, SVMs
“Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from data. The term non-parametric is not meant to imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance” (Wikipedia)
“Nonparametric methods are good when you have a lot of data and no prior knowledge”
Dummy classifiers
Say you have a dataset with 9 muffins and 1 chihuahua.
You have a new sample to classify.
Which class should you bet on?
If your class prior probabilities $P(C_1), P(C_2),\dots$ are not equal, then you should bet on the most frequent class! ($g(x)=argmax_yp(y)$)
Without such information, you can just pick at random
What is the expected accuracy (true predictions / total predictions) if you have N classes and pick one at random?
Scikit-learn offers a DummyClassifier class which makes it easy to test such a strategy (see the sketch below)
What’s the point?
- Quickly build and test your complete pipeline with a mockup classifier
- Quickly get a baseline for the performance
- (look for obvious bias in the dataset, but you should have cleaned it before !)
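A minimal sketch of such a baseline with scikit-learn's DummyClassifier, on the muffin/chihuahua toy example above (the data is made up):

```python
# Always predict the majority class; this is the baseline any real classifier must beat.
import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([0] * 9 + [1])          # 9 muffins, 1 chihuahua
X = np.zeros((10, 5))                # features are ignored by this strategy

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
print(dummy.score(X, y))             # 0.9
```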
K Nearest Neighbor (kNN)
Keep all training samples
View new samples as queries over the previously learned / indexed samples
Assign the class of the closest sample(s)
\[f(x) = y_i,\quad i = argmin_j\Vert x_j-x\Vert\]
We can check more than one sample
Remember the bias/variance trade-off?
Pros
- very simple to implement
- Capacity easily controlled with k
- Can be tuned to work on large datasets: indexing, data cleaning, etc.
- Good baseline
- Non parametric
- Lazy learner
Cons
- In high dimensions, all samples tend to be very close (for the Euclidean distance)
- Large memory consumption on large datasets
- Requires a large number of samples and a large k to get the best performance
Setting K:
\[K\simeq\sqrt{\frac{m}{C}}\]
where $\frac{m}{C}$ is the average number of training samples per class
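A hedged sketch of a kNN baseline with scikit-learn, using the rule of thumb above as a starting value for k (all numbers and data are toy assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

m, C = 300, 2                                   # toy numbers: training samples and classes
k = max(1, int(round(np.sqrt(m / C))))          # k ~ sqrt(m / C) ~ 12

X_train = np.random.rand(m, 64)
y_train = np.random.randint(0, C, size=m)

knn = KNeighborsClassifier(n_neighbors=k)       # "training" is just indexing the samples
knn.fit(X_train, y_train)
print(knn.predict(np.random.rand(3, 64)))       # labels of the 3 query points
```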
Other distance-based classifier
Minimal Euclidean distance
Very basic classifier
Distance to the mean $m_i$ of the class
It does not take into account differences in variance for each class
Predicted class for x:
\[g(x) = argmin_iD_i(x)\]
Minimal quadratic distance (Mahalanobis)
For each class $i$, the mean $m_i$ and covariance matrix $S_i$ are computed from the set of examples
The covariance matrix is taken into account when computing the distance from an image to the class $i$
The feature vector of the image $x$ is projected over the eigenvectors of the class
\[g(x) = argmin_iD_i(x)\]
A quick introduction to Bayesian Decision Theory
Example - RoboCup
General case: maximum a posteriori (MAP)
General case: we need to take into consideration $p(y)$ and $p(x)$
- $p(x\vert y)$: class conditional density (here: histograms)
- $p(y)$: class priors, e.g. for indoor RoboCup
- $p(floor) = 0.6$, $p(goal) = 0.3$, $p(ball) = 0.1$
- $p(x)$: probability of seeing data $x$
Optimal decision rule (Bayes classifier): maximum a posteriori (MAP):
\[g(x) = argmax_{y\in Y}p(y\vert x)\]
How to compute $p(y\vert x)$?
\[p(y\vert x) = \frac{p(x\vert y)p(y)}{p(x)}\quad\text{Bayes' rule}\]
If classes are equiprobable and the error cost is the same, then, because $p(x)$ is constant, we get the maximum likelihood estimation:
\[g(x) = \underbrace{argmax_{y\in Y}p(y\vert x)}_{\text{MAP}}\simeq\underbrace{argmax_{y\in Y}p(x\vert y)}_{\text{ML}}\]
Generative, discriminant and “direct” classifiers
Generative Probabilistic Models
Some classical Generative Probabilistic Models
Training data $X=\{x_1,\dots,x_n\}$, $Y=\{y_1,\dots,y_n\}$, with $(x_i,y_i)\in\mathcal X\times\mathcal Y$
For each $y\in\mathcal Y$, build a model for $p(x\vert y)$ from $X_y:=\{x_i\in X:y_i=y\}$
- Histogram: if $x$ can have only a few discrete values
- Kernel Density Estimator
- Gaussian
- Mixture of Gaussians
Typically, $\mathcal Y$ is small (few possible labels) and $\mathcal X$ is low-dimensional
Class conditional densities and posteriors
Naive Bayes Classifiers
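As an illustration, a naive Bayes classifier with Gaussian class-conditional densities (one of the generative models listed above) can be sketched with scikit-learn's GaussianNB; the toy data and class means are assumptions:

```python
# Fit p(x|y) per class (features assumed independent), then predict via MAP.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),       # class 0 around (0, 0)
               rng.normal(3, 1, (50, 2))])      # class 1 around (3, 3)
y = np.array([0] * 50 + [1] * 50)

nb = GaussianNB()
nb.fit(X, y)
print(nb.predict([[0.2, -0.1], [2.8, 3.1]]))    # -> [0 1]
print(nb.predict_proba([[1.5, 1.5]]))           # posterior p(y|x) for an ambiguous point
```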
Linear discriminant classifiers
General idea for binary classification
Learn w and b
- you can compute $p(y\vert x)\simeq\hat y$
Problem: how to learn w and b?
Logistic Regression
Linear classifier, $f$ is logistic function
\(\sigma(x) = \frac{1}{(1+e^{-x})} = \frac{e^x}{(1+e^x)}\)
- Maps all reals $\to[0,1]$
Optimize $\sigma(w^Tx+b)$ to find best $w$
Trained using gradient descent (no closed form solution)
Gradient descent
Formally:
\[w_{t+1}=w_t-\eta\nabla L(w_t)\]
where $\eta$ is the step size: how far to step relative to the gradient
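A minimal NumPy sketch of this update rule applied to logistic regression; the loss $L$ is taken to be the average log-loss, and the step size, iteration count and synthetic data are arbitrary assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w > 0).astype(float)              # synthetic binary labels

w = np.zeros(3)
eta = 0.1                                       # step size
for _ in range(500):
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)  # gradient of the average log-loss L(w_t)
    w = w - eta * grad                          # w_{t+1} = w_t - eta * grad L(w_t)
print(w)                                        # learned weights, same sign pattern as true_w
```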
From 2 classes to C classes: 2 strategies
\[\hat y = argmax_{i\in Y}w_ix\]
Maximum Margin classification
What is the best $w$ for this dataset?
Trade-off: large margin vs few mistakes on training set
Support Vector Machine (SVM)
Logistic Regression vs SVM
Optimization problems:
About the regularizer
Effect of cost parameter C (regularization, again)
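A hedged sketch of the effect of C on a linear SVM, using synthetic overlapping classes (the data and C values are illustrative only): small C favors a large margin and tolerates more training mistakes, large C does the opposite.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)             # overlapping classes

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    print(C, svm.score(X, y))                   # training accuracy changes with C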
Non-linear discriminant classifiers
Non-linear classification
What is the best linear classifier for this dataset?
None. We need something nonlinear!
2 solutions:
- Preprocess the data (explicit embedding, kernel trick…)
- Combine multiple linear classifiers into nonlinear classifier (boosting, neural networks…)
Non-linear classification using linear classifiers with data preprocessing
Data preprocessing idea
Transform the dataset to enable linear separability
Linear separation is always possible
The original input space can always be mapped to some higher-dimensional feature space where the training set is separable.
Explicit embedding
Compute $\phi(x)$ for all $x$ in the dataset.
Then train a linear classifier just like before
Used to be avoided because of computation issues, but it is a hot topic again.
Kernel trick
Linear classification only requires computing dot products $\phi(x_i)\cdot\phi(x_j)$
The function $\phi(x)$ does not need to be explicit; we can use a kernel function
\[k(x,z)=\phi(x)\cdot\phi(z)\]
which represents a dot product in a “hidden” feature space.
This gives a non-linear boundary in the original feature space.
Popular kernel functions in Computer Vision
“Linear kernel”: identical solution to a linear SVM
“Hellinger kernel”: less sensitive to extreme values in the feature vector
“Histogram intersection kernel”: very robust
“$\chi^2$-distance kernel”: good empirical results
“Gaussian kernel”: overall most popular kernel in Machine Learning
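A hedged sketch of how some of these kernels can be tried with scikit-learn's SVC; the histogram-like data is made up, and passing chi2_kernel as a kernel callable is just one possible way to use it:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(100, 16)))          # non-negative, histogram-like features
y = rng.integers(0, 2, size=100)

SVC(kernel="linear").fit(X, y)                  # linear kernel
SVC(kernel="rbf", gamma="scale").fit(X, y)      # Gaussian kernel
SVC(kernel=chi2_kernel).fit(X, y)               # chi2 kernel via a kernel callable
```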
Explicit embedding for the Hellinger kernel
Using simple square root properties, we have:
\[k(x,x') = \phi(x)\cdot\phi(x') = \sqrt{x}\cdot\sqrt{x'}\]
Tricks for next practice session: given a BoVW vector,
- L1 normalize it (neutralizes effect of number of descriptors)
- Take its square root (explicit Hellinger embedding)
- L2 normalize it (more linear-classifier friendly)
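A minimal sketch of these three tricks applied to a BoVW histogram; the helper name and the epsilon guard are my own additions:

```python
import numpy as np

def preprocess_bovw(h, eps=1e-12):
    h = np.asarray(h, dtype=np.float64)
    h = h / (np.abs(h).sum() + eps)        # 1) L1 normalization
    h = np.sqrt(h)                         # 2) square root = explicit Hellinger embedding
    h = h / (np.linalg.norm(h) + eps)      # 3) L2 normalization
    return h

print(preprocess_bovw([4, 0, 1, 3]))
```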
Metrics
Confusion matrix and Accuracy
Problems with Accuracy
All the following classifiers have a 90% accuracy
Do all errors have the same cost?
Precision, recall, F-score
Plotting a Precision/Recall for classification data
For binary classification
Instead of $\hat y = argmax_yp(y\vert x)$, take all possible thresholds for $p(y\vert x)$
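A hedged sketch of that idea with scikit-learn's precision_recall_curve; the scores are made up, and in practice they would come from predict_proba or decision_function:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r, t in zip(precision, recall, np.append(thresholds, np.inf)):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```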
TPR, FPR, ROC
ROC: “Receiver Operating Characteristic”. A kind of signal/noise measure under various tunings
Pink line: random results
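A small sketch of how an ROC curve can be computed with scikit-learn (made-up scores); an AUC of 0.5 corresponds to the random baseline shown as the pink line:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)
print("AUC:", roc_auc_score(y_true, y_scores))
```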
More about ROC curves:
Adjusting the threshold
Class overlap
Split the dataset to assess generalization performance
Bootstrap
Draw samples randomly, with replacement, from the training set.
Enables us to estimate the variance of estimators we use in the classification rule.
Holdout
Just keep a part of the dataset for later validation/testing
Cross validation
with meta-parameter tuning
StratifiedKFold (best)
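A minimal sketch of stratified k-fold cross-validation with scikit-learn; the imbalanced toy data and the classifier choice are assumptions:

```python
# Each fold keeps the class proportions of the full dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = np.array([0] * 90 + [1] * 10)               # imbalanced classes

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv)    # one accuracy per fold
print(scores.mean(), scores.std())
```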
Missing things
- Cost of misclassification
- Multiclass classification evaluation
- …