
MLRF: Lecture 05

Link to the HackMD note

Agenda

  1. Introduction
  2. Image classification overview
  3. Some classifiers - part 1
  4. Classifier evaluation

Summary of last lecture

Content-based image retrieval

  • 2 strategies: keep all local descriptors for all images vs 1 descriptor per image
  • Bag of Visual Words pipeline
    • Focus on encoding

Evaluation of image retrieval systems

  • Precision
  • Recall
  • F-Measure
  • mAP

Texture descriptors (not covered at all in the lecture)

  • What is a texture?
  • Fast and classic approaches
  • Old-school descriptors

Practice session 4: Take home messages

BoVW

  • Usually requires some preprocessing of the descriptors: centering, rotation/axes permutation, dimensionality reduction…
  • Is based on a quantization step (assign descriptors to clusters)
  • Is just a histogram, like the color histogram of session 2
  • We can compute more advanced statistics to get better results (VLAD, FVs)

Next practice session

Implement a simple image classifier:

Steps

  1. Load resources
  2. Train a BoVW model
  3. Split the dataset into training and validation sets
  4. Compute the BoVW descriptor for each image (see the sketch after this list)
    • We will make a small change here (sqrt + L2-norm)
  5. Prepare training structures
  6. Train a classifier and evaluate its performance
    • Training and evaluating is easy with scikit learn
  7. Display some results
  8. Test on a meme image
  9. Compute the results on the test set and export them
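
A rough sketch of step 4, assuming a KMeans codebook already fitted on the training descriptors (the function name and inputs are illustrative, not the official practice-session code):

```python
import numpy as np

def bovw_histogram(descriptors, kmeans):
    """Quantize the local descriptors of one image with a fitted KMeans
    codebook and return its visual-word histogram (raw counts)."""
    # `descriptors`: (n_descriptors, d) array for one image (hypothetical input)
    words = kmeans.predict(descriptors)                      # quantization step
    hist = np.bincount(words, minlength=kmeans.n_clusters)   # histogram of visual words
    # The sqrt + L2-norm change mentioned in step 4 is applied afterwards
    # (see the Hellinger embedding trick in the kernels section below).
    return hist.astype(np.float64)
```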

Image classification overview

Instance recognition vs Class recognition

Instance recognition

Re-recognize a known 2D or 3D rigid object, potentially being viewed from a novel viewpoint, against a cluttered background, and with partial occlusions

Ex: practice session 3

Class recognition

Recognize any instance of a particular general class such as “cat”, “car” or “bicycle”

Aka category-level or generic object recognition

This lecture and next practice session

Pipeline overview

Our image classification pipeline

This is a supervised machine learning task

  • We need a dataset with samples
  • Images will be represented as BoVW vectors of fixed size
  • Targets will be encoded as integers
    • 0: Muffin
    • 1: Chihuahua

This is a very usual data representation for a classification problem

Now we just need to select an appropriate method, prepare our data, run some training, test the results, adjust some parameters, compare approaches, display results, …

Data preparation

NumPy formatting

Training/validation/test separation

  • You cannot estimate the generalization performance of your predictor/estimator/classifier on its training set
  • You need to keep some samples aside for later evaluation
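
A minimal split with scikit-learn, assuming the BoVW matrix `X` and the integer labels `y` are already built:

```python
from sklearn.model_selection import train_test_split

# Keep 20% of the samples aside; `stratify=y` preserves the class proportions
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Fit only on (X_train, y_train); report performance on (X_val, y_val).
```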

Other “funny” things to do IRL

  • Collect data
  • Clean data
  • Check data
  • Clean again
  • Annotate
  • Check
  • Compute/convert/scale features

Feature selection

Can help later stages:

  • Less data to process
  • Better properties (like decorrelated features, etc.)

Which columns?

  • Hard problem in general
    • Because features may be informative as a group
  • Some simpler and helpful techniques:
    • Remove features with low variances
    • Dimensionality reduction techniques are not exactly feature selection, but can still have a similar effect
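
For instance, the low-variance filter can be done directly with scikit-learn (the threshold value below is arbitrary):

```python
from sklearn.feature_selection import VarianceThreshold

# Drop features whose variance on the training set is below the threshold
selector = VarianceThreshold(threshold=1e-3)
X_train_sel = selector.fit_transform(X_train)
X_val_sel = selector.transform(X_val)   # apply the same selection to validation data
```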

Some classifiers - part 1

Disclaimer

Only classifiers suitable for image classification as we present it today

Many other approaches

What is our goal?

Parametric vs Non Parametric Classifiers

Parametric examples

Logistic Regression, Linear Discriminant Analysis, Naive Bayes, Perceptron, simple Neural Networks…

A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.

Non-parametric examples

k-Nearest Neighbors, Decision Trees, SVMs

“Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from data. The term non-parametric is not meant to imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance.” — Wikipedia

“Nonparametric methods are good when you have a lot of data and no prior knowledge”

Dummy classifiers

Say you have a dataset with 9 muffins and 1 chihuahua.

You have a new sample to classify.

Which class should you bet on?

If your class prior probabilities $P(C_1), P(C_2),\dots$ are not equal, then you should bet on the most frequent class! ($g(x)=\arg\max_y p(y)$)

Without such information, you can just pick at random

What is the expected accuracy (true predictions / total predictions) if you have $N$ classes and pick one at random?

What’s the point ?

  1. Quickly build and test your complete pipeline with a mockup classifier
  2. Quickly get a baseline for the performance
  3. (look for obvious bias in the dataset, but you should have cleaned it before!)
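
scikit-learn ships exactly such mockup classifiers; a possible baseline setup, reusing the earlier split:

```python
from sklearn.dummy import DummyClassifier

# Always predict the most frequent class (9 muffins vs 1 chihuahua -> ~90% accuracy)
prior_clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("most_frequent:", prior_clf.score(X_val, y_val))

# Uniform random guessing: expected accuracy around 1/N for N balanced classes
random_clf = DummyClassifier(strategy="uniform", random_state=0).fit(X_train, y_train)
print("uniform:", random_clf.score(X_val, y_val))
```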

K Nearest Neighbor (kNN)

Keep all training samples

View new samples as queries over the previously learned / indexed samples

Assign the class of the closest sample(s)

\[f(x) = y_i,\quad i = \arg\min_j\Vert x_j-x\Vert\]

We can check more than one sample

Remember the bias/variance trade-off?

Pros

  • Very simple to implement
  • Capacity easily controlled with k
  • Can be tuned to work on large datasets: indexing, data cleaning, etc.
  • Good baseline
  • Non parametric
  • Lazy learner

Cons

  • In high dimension, all samples tend to be very close (for the Euclidean distance)
  • Large memory consumption on large datasets
  • Requires a large amount of samples and large k to get best performance
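
A possible kNN baseline with scikit-learn (the parameter values are illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier

# k controls the capacity: k=1 gives low bias / high variance, larger k smooths the decision
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)    # "lazy" learner: fitting mostly stores / indexes the samples
print("validation accuracy:", knn.score(X_val, y_val))
```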

Other distance-based classifier

Minimal Euclidean distance

Very basic classifier

Distance to the mean $m_i$ of the class

It does not take into account differences in variance for each class

Predicted class for x:

\[g(x) = \arg\min_i D_i(x)\]

Minimal quadratic distance (Mahalanobis)

For each class $i$, the mean $m_i$ and covariance matrix $S_i$ are computed from the set of examples

The covariance matrix is taken into account when computing the distance from an image to the class $i$

The feature vector of the image $x$ is projected over the eigenvectors of the class

\[g(x) = \arg\min_i D_i(x)\]
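
A NumPy sketch of both distance rules above (function names are mine; a pseudo-inverse is used in case the covariance matrix is singular):

```python
import numpy as np

def fit_class_stats(X, y):
    """Per-class mean and covariance matrix, assuming integer labels."""
    return [(X[y == c].mean(axis=0), np.cov(X[y == c], rowvar=False))
            for c in np.unique(y)]

def predict_min_distance(x, stats, use_covariance=False):
    """g(x) = argmin_i D_i(x): squared Euclidean distance to the class mean,
    or squared Mahalanobis distance when use_covariance=True."""
    dists = []
    for mean, cov in stats:
        diff = x - mean
        if use_covariance:
            dists.append(diff @ np.linalg.pinv(cov) @ diff)  # Mahalanobis
        else:
            dists.append(diff @ diff)                        # Euclidean
    return int(np.argmin(dists))
```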

A quick introduction to Bayesian Decision Theory

Example - RoboCup

General case: maximum a posteriori (MAP)

  • $p(x\vert y)$: class conditional density (here: histograms)
  • $p(y)$: class priors, e.g. for indoor RoboCup:
    • $p(\text{floor}) = 0.6$, $p(\text{goal}) = 0.3$, $p(\text{ball}) = 0.1$
  • $p(x)$: probability of seeing data $x$

Optimal decision rule (Bayes classifier): maximum a posteriori (MAP):

\[g(x) = \arg\max_{y\in Y}p(y\vert x)\]

How to compute $p(y\vert x)$?

\[p(y\vert x) = \frac{p(x\vert y)p(y)}{p(x)}\quad\text{Bayes' rule}\]

If classes are equiprobable and the error cost is the same, then, because $p(x)$ is constant, we get the maximum likelihood estimation:

\[g(x) = \underbrace{\arg\max_{y\in Y}p(y\vert x)}_{\text{MAP}}\simeq\underbrace{\arg\max_{y\in Y}p(x\vert y)}_{\text{ML}}\]
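
A tiny worked example using the RoboCup priors above (the class-conditional values $p(x\vert y)$ are made up for illustration):

```python
priors = {"floor": 0.6, "goal": 0.3, "ball": 0.1}          # p(y)
likelihoods = {"floor": 0.05, "goal": 0.20, "ball": 0.40}   # hypothetical p(x | y)

# MAP: argmax_y p(x|y) p(y) -- p(x) is the same for every class, so it can be dropped
map_class = max(priors, key=lambda y: likelihoods[y] * priors[y])   # -> "goal" (0.06)
# ML: argmax_y p(x|y) -- equivalent to MAP only when the priors are equal
ml_class = max(likelihoods, key=likelihoods.get)                    # -> "ball" (0.40)
```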

Generative, discriminant and “direct” classifiers

Generative Probabilistic Models

Some classical Generative Probabilistic Models

Training data $X=\{x_1,\dots,x_n\}$, $Y=\{y_1,\dots,y_n\}$, with $(x_i, y_i)\in\mathcal X\times\mathcal Y$

For each $y\in\mathcal Y$, build a model for $p(x\vert y)$ from $X_y:=\{x_i\in X: y_i=y\}$

  • Histogram: if $x$ can have only a few discrete values
  • Kernel Density Estimator
  • Gaussian
  • Mixture of Gaussians

Typically, $\mathcal Y$ is small (few possible labels) and $\mathcal X$ is low-dimensional

Class conditional densities and posteriors

Naive Bayes Classifiers

Linear discriminant classifiers

General idea for binary classification

Learn w and b

  • you can compute $p(y\vert x)\simeq\hat y$

Logistic Regression

Linear classifier, $f$ is logistic function

\(\sigma(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{1+e^x}\)

  • Maps all real $\to[0,1]$

Optimize $\sigma(w^Tx+b)$ to find best $w$

Trained using gradient descent (no closed form solution)

Gradient descent

Formally:

\[w_{t+1}=w_t-\eta\nabla L(w)\]

where $\eta$ is the step size (learning rate): how far to step along the negative gradient
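
A minimal NumPy sketch of batch gradient descent for logistic regression (in practice one would rather use scikit-learn’s LogisticRegression):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, eta=0.1, n_steps=1000):
    """Batch gradient descent on the logistic (cross-entropy) loss.
    X: (n, d) features, y: (n,) labels in {0, 1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        y_hat = sigmoid(X @ w + b)        # current predictions
        grad_w = X.T @ (y_hat - y) / n    # gradient of the average loss w.r.t. w
        grad_b = np.mean(y_hat - y)       # gradient w.r.t. b
        w -= eta * grad_w                 # w_{t+1} = w_t - eta * grad L(w_t)
        b -= eta * grad_b
    return w, b
```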

From 2 classes to C classes: 2 strategies

\[\hat y = \arg\max_{i\in Y} w_i^T x\]

Maximum Margin classification

What is the best $w$ for this dataset?

Trade-off: large margin vs few mistakes on training set

Support Vector Machine (SVM)

Logistic Regression vs SVM

Optimization problems:

About the regularizer

Effect of cost parameter C (regularization, again)
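
A quick way to see the effect of C, assuming the BoVW split from earlier (the C values are arbitrary):

```python
from sklearn.svm import LinearSVC

# Small C -> strong regularization (large margin, more training mistakes tolerated);
# large C -> weak regularization (fits the training set more aggressively).
for C in (0.01, 1.0, 100.0):
    svm = LinearSVC(C=C).fit(X_train, y_train)
    print(f"C={C}: train={svm.score(X_train, y_train):.3f} "
          f"val={svm.score(X_val, y_val):.3f}")
```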

Non-linear discriminant classifiers

Non-linear classification

What is the best linear classifier for this dataset?

2 solutions:

  1. Preprocess the data (explicit embedding, kernel trick…)
  2. Combine multiple linear classifiers into nonlinear classifier (boosting, neural networks…)

Non-linear classification using linear classifiers with data preprocessing

Data preprocessing idea

Transform the dataset to enable linear separability

Linear separation is always possible

The original input space can always be mapped to some higher-dimensional feature space where the training set is separable.

Explicit embedding

Compute $\phi(x)$ for all $x$ in the dataset.

Then train a linear classifier just like before

Kernel trick

Linear classification only requires computing dot products $\langle\phi(x_i),\phi(x_j)\rangle$

The function $\phi(x)$ does not need to be explicit; we can use a kernel function

\[k(x,z)=\phi(x)\cdot\phi(z)\]

which represents a dot product in a “hidden” feature space.

  • “Linear kernel”: identical solution to the linear SVM
  • “Hellinger kernel”: less sensitive to extreme values in the feature vector
  • “Histogram intersection kernel”: very robust
  • “$\chi^2$-distance kernel”: good empirical results
  • “Gaussian kernel”: overall the most popular kernel in Machine Learning
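
With scikit-learn, a kernel can be selected directly or passed as a precomputed Gram matrix; a possible setup on the BoVW vectors (parameters are illustrative):

```python
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

# Gaussian (RBF) kernel: the usual general-purpose non-linear choice
rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X_train, y_train)
print("RBF:", rbf_svm.score(X_val, y_val))

# Exponentiated chi2 kernel, well suited to non-negative histograms such as BoVW vectors
K_train = chi2_kernel(X_train)              # (n_train, n_train) Gram matrix
chi2_svm = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
K_val = chi2_kernel(X_val, X_train)         # rows: validation samples, cols: training samples
print("chi2:", chi2_svm.score(K_val, y_val))
```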

Explicit embedding for the Hellinger kernel

Using simple square root properties, we have:

\[k(x,x') = \phi(x)\cdot\phi(x') = \sum_i \sqrt{x_i}\,\sqrt{x'_i},\qquad \phi(x)=\sqrt{x}\ \text{(element-wise)}\]

Tricks for next practice session: given a BoVW vector,

  1. L1 normalize it (neutralizes effect of number of descriptors)
  2. Take its square root (explicit Hellinger embedding)
  3. L2 normalize it (more linear-classifier friendly)
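
A minimal sketch of this normalization, assuming a non-negative BoVW vector (the small epsilons only guard against division by zero):

```python
import numpy as np

def hellinger_embed(bovw):
    """L1-normalize, take the square root, then L2-normalize a BoVW vector."""
    v = np.asarray(bovw, dtype=np.float64)
    v = v / max(v.sum(), 1e-12)                 # 1. L1 norm: removes the descriptor-count effect
    v = np.sqrt(v)                              # 2. explicit Hellinger embedding
    return v / max(np.linalg.norm(v), 1e-12)    # 3. L2 norm: linear-classifier friendly
```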

Metrics

Confusion matrix and Accuracy

Problems with Accuracy

All the following classifiers have a 90% accuracy

Do all errors have the same cost?
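
Both quantities are one call away in scikit-learn; with any fitted classifier (e.g. the kNN from above):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = knn.predict(X_val)
print("accuracy:", accuracy_score(y_val, y_pred))
# Rows: true classes, columns: predicted classes. Two classifiers with the same
# accuracy can have very different confusion matrices (e.g. one never detects
# the rare class while the other spreads its errors evenly).
print(confusion_matrix(y_val, y_pred))
```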

Precision, recall, F-score

Plotting a Precision/Recall for classification data

For binary classification

Instead of $\hat y = \arg\max_y p(y\vert x)$, take all possible thresholds for $p(y\vert x)$
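
With scikit-learn this amounts to one call on continuous scores (decision_function or predict_proba of any fitted classifier, e.g. the RBF SVM above):

```python
from sklearn.metrics import precision_recall_curve

scores = rbf_svm.decision_function(X_val)   # continuous scores for the positive class
precision, recall, thresholds = precision_recall_curve(y_val, scores)
# Each threshold yields one (recall, precision) point of the curve
```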

TPR, FPR, ROC

ROC: “Receiver Operating Characteristic”. A kind of signal/noise measure under various tunings (decision thresholds)

Pink line: random results

More about ROC curves:

Adjusting the threshold

http://www.navan.name/roc/
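
Likewise for the ROC curve, reusing the same scores as for the precision/recall curve (a sketch, not the practice-session code):

```python
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_val, scores)
print("AUC:", roc_auc_score(y_val, scores))
# The diagonal fpr == tpr corresponds to random guessing (the pink line on the slide)
```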

Class overlap

Split the dataset to assess generalization performance

Bootstrap

Draw samples randomly, with replacement, from the training set.

Enables us to estimate the variance of estimators we use in the classification rule.

Holdout

Just keep a part of the dataset for later validation/testing

Cross validation

with meta parameter tuning

StratifiedKFold (best)
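
A possible stratified cross-validation loop with scikit-learn (the classifier and number of folds are illustrative):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps class ratios per fold
scores = cross_val_score(LinearSVC(C=1.0), X, y, cv=cv)
print(scores.mean(), "+/-", scores.std())
```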

Missing things

  • Cost of misclassification
  • Multiclass classification evaluation