Link to the HackMD note
Agenda for lecture 4
- Introduction
- Content-based image retrieval (CBIR) using bags of features
- Evaluating CBIR / Ranked Retrieval (RR) systems
- Texture descriptors
- Character descriptors
Summary of last lecture
- Descriptor matching
- 1-way
- Cross check
- Ratio test
- Radius threshold
- Descriptor indexing
- Indexing pipeline: train/query
- Linear matching
- kD-Trees
- FLANN/hierarchical k-Means
- LSH
- Approximate NN problem
- Projective transformations
- Translation
- Rotation
- Scaling
- …
- Projective
- Homography estimation
- Least squares
- RANSAC
- Geometric validation
Practice session 3: Take home messages
Twin it!: Extracting descriptors, matching them by hand
- Detect keypoints and extract surrounding pixels to flat vector
- Normalize them and compare them using cross-correlation ($\sum_i f_i \cdot g_i$); see the sketch below
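A minimal numpy sketch of this take-home (function names and patch size are illustrative, not the practice session's exact code):

```python
import numpy as np

def patch_to_vector(image, x, y, half_size=8):
    """Extract a square patch around keypoint (x, y) and flatten it to a vector."""
    patch = image[y - half_size:y + half_size, x - half_size:x + half_size]
    return patch.astype(np.float32).ravel()

def normalize(v):
    """Zero-mean, unit-norm normalization to tolerate brightness/contrast changes."""
    v = v - v.mean()
    return v / (np.linalg.norm(v) + 1e-8)

def cross_correlation(f, g):
    """Similarity score sum_i f_i * g_i between two normalized patch vectors."""
    return float(np.dot(normalize(f), normalize(g)))
```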
Augmented Documents: Use an off-the-shelf detector/descriptor: ORB
Augmented Documents: Projective transforms and Homography estimation
- OpenCV provides the solver machinery: list of matches $\to$ $3\times 3$ matrix (see the sketch after this list)
- Just some coordinate transform (a 2D $\to$ 2D mapping)
- Remember the classical matrix forms: translation, rotation, …
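A hedged sketch of that machinery, assuming `img1`, `kp1`, `kp2` and `matches` come from an earlier detection/matching step (e.g. ORB + a brute-force matcher):

```python
import numpy as np
import cv2

# Collect the matched coordinates in both images.
src_pts = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# RANSAC rejects outlier matches; H is the 3x3 projective transform, mask flags inliers.
H, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, ransacReprojThreshold=5.0)

# Apply the 2D -> 2D coordinate transform, e.g. to project the corners of image 1.
h, w = img1.shape[:2]
corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
projected = cv2.perspectiveTransform(corners, H)
```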
Next practice session
Implement a simple image search engine
Will be graded
Local feature descriptors
Introduction
Given some keypoints in image 1, which are the most similar ones in image 2?
This is a nearest neighbor problem in descriptor space. It is also a geometrical problem in coordinate space.
Local feature detectors give us the same feature under several perturbations: perspective, illumination, blur…
Local feature descriptors will associate a vector to each local feature.
Such description vector should be:
- Compact - to enable fast indexing and matching
- Discriminative - to enable object recognition
- Robust to perturbations - to tolerate real conditions
We will focus on 2 widely used descriptors for their pedagogical interest
- HOG (Histogram of Gradients), used in SIFT
- BRIEF (Binary Robust Independent Elementary Features), used in ORB
Histogram of Gradients
Algorithm overview
- (optional) global image normalization
- Compute the gradient image in $x$ and $y$
- Compute gradient histograms
- Normalise across blocks
- Flatten into a feature vector
- (quantize to integers)
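The steps above are available off-the-shelf, e.g. in scikit-image; the parameter values below are illustrative, not the lecture's reference settings:

```python
from skimage import color, io
from skimage.feature import hog

image = color.rgb2gray(io.imread("patch.png"))  # hypothetical input image
features = hog(
    image,
    orientations=9,          # number of gradient orientation bins
    pixels_per_cell=(8, 8),  # cell size for the local histograms
    cells_per_block=(2, 2),  # blocks used for normalization
    block_norm="L2-Hys",     # normalize across blocks
    feature_vector=True,     # flatten into a single feature vector
)
```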
Example
That feeling when you bang your elbow right on the nerve
Summary
Pros
- Very stable under illumination changes, perspective changes, blur
Cons
- Slow to compute
- Quite large (128 bytes for original SIFT)
BRIEF
General idea:
- Sample pairs of points $\{p(x), p(y)\}$ in the smoothed keypoint neighborhood (the raw signal is very spiky otherwise, like derivatives)
- Compute a simple binary test: $p(x)\lt p(y)$
- Accumulate the results of $n_d$ tests to form a binary vector of $n_d$ bits (256 in ref.)
Sampling strategies
- GI: $(x_i, y_i)\sim$ i.i.d. Uniform($-\frac{S}{2}$, $+\frac{S}{2}$)
- GII: $(x_i, y_i)\sim$ i.i.d. Gaussian($0$, $\frac{1}{25}S^2$)
- GIII:
- $x_i\sim$ i.i.d. Gaussian($0$, $\frac{1}{25}S^2$)
- $y_i\sim$ i.i.d. Gaussian($x_i$, $\frac{1}{100}S^2$)
- GIV: $(x_i, y_i)$ randomly sampled from discrete locations of a coarse polar grid introducing a spatial quantization
- GV:
- $x_i=(0,0)$
- $y_i$ takes all possible values on a coarse polar grid containing $n_d$ points
What is the best approach ?
Strategy GII (the second one) gives the best results.
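A minimal BRIEF-like sketch (not the OpenCV implementation) using the GII sampling above: draw the $n_d$ test pairs once, then accumulate binary tests over a pre-smoothed patch and compare descriptors with the Hamming distance:

```python
import numpy as np

rng = np.random.default_rng(0)
S, n_d = 31, 256  # patch size and number of binary tests

# GII: both points of each pair drawn i.i.d. from Gaussian(0, S^2/25), clipped to the patch.
pairs = np.clip(np.round(rng.normal(0.0, S / 5.0, size=(n_d, 2, 2))), -(S // 2), S // 2).astype(int)

def brief_descriptor(smoothed, cx, cy):
    """n_d binary tests p(x) < p(y) around keypoint (cx, cy) of a pre-smoothed image."""
    bits = np.empty(n_d, dtype=np.uint8)
    for i, ((x1, y1), (x2, y2)) in enumerate(pairs):
        bits[i] = smoothed[cy + y1, cx + x1] < smoothed[cy + y2, cx + x2]
    return bits

def hamming(d1, d2):
    """Matching cost between two binary descriptors."""
    return int(np.count_nonzero(d1 != d2))
```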
Summary
Pros
- Very fast to compute
- Very fast to match
- Very compact to store
Cons
- Less robust than HoG (SIFT) or DoH (SURF) in several real cases
Invariance check
Rotation invariance
- Add an angle measure
- Take the main gradient orientation
- (Take the average around the keypoint)
We now have for each keypoint:
- Coordinates
- Orientation
Descriptor: computed over normalized patch
Scale invariance
Multi-scale feature detection and computation:
- Add a keypoint for each relevant scale
- Possibly several keypoints at the same position
We now have for each keypoint:
- Coordinates
- Orientation
- Scale
Descriptor: computed on a scaled patch
Reminder: Gaussian sigma vs window size
Illumination invariance
SIFT approach:
- Normalize the vector
- Solves affine changes, but what about non-linear sources like camera saturation?
- Cap the vector elements to $20\%$ (!) and renormalize
- Now we have some illumination invariance
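A sketch of this normalization scheme, assuming a HOG-like descriptor vector `v`:

```python
import numpy as np

def sift_normalize(v, cap=0.2):
    v = v / (np.linalg.norm(v) + 1e-8)      # unit norm: removes affine illumination changes
    v = np.minimum(v, cap)                  # cap elements at 20% to limit non-linear effects (saturation)
    return v / (np.linalg.norm(v) + 1e-8)   # renormalize
```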
Viewpoint invariance
Better, but more complex approaches can tolerate extreme viewpoint change
Complete pipelines
SIFT (Scale-Invariant Feature Transform)
- Construct scale space
- Take difference of Gaussians
- Locate DoG Extrema
- Sub-pixel localization of potential feature points
- Filter edge and low-contrast responses
- Assign keypoint orientations
- Build keypoint descriptors
- Matching, etc.
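The whole pipeline is available off-the-shelf (assuming OpenCV ≥ 4.4, where SIFT is in the main package):

```python
import cv2

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each keypoint carries coordinates, orientation and scale; descriptors are 128-D vectors.
```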
ORB (Oriented FAST and Rotated BRIEF)
- Use FAST in pyramids to detect stable keypoints
- Select the strongest features using FAST or Harris response
- Find their orientation using first-order moments
- Compute the descriptors using BRIEF
- Where the coordinates of random point pairs are rotated according to the measured orientation
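Same thing for ORB in OpenCV; its binary descriptors are matched with the Hamming distance (file names and `nfeatures` are illustrative):

```python
import cv2

img1 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching with cross-check, best matches first.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
```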
Conclusion about feature extraction
Selection of appropriate features:
- It is a critical decision
- Depends on the specific application
Features must:
- Be invariant to data variations (depending on the application)
- rotation
- perspective
- noise
- etc.
- Have low dimensionality for fast training and matching, and reasonable storage
Features determine the type of info to work with:
- gray-level, binary, color image
- contours, vectorization, skeleton
Features also determine the type of classifier / indexer
Content-based image retrieval
Two strategies using local descriptors
Keep all local descriptors
Pros:
- Enables geometric validation
- Better part detection in theory
Cons:
- Huge memory requirements
Like what we did in practice session 3 to match parts of an image (useful to validate geometric constraints and classify an image at the same time)
Build a global descriptor using local ones
- Inspired by text retrieval
- Compact representation
- Tricks to embed spatial information
- Limited memory requirements
Like what we did in practice session 2 with the color histogram, at the bubble level
Bag of Feature approach
Pipeline with local descriptors (prev. lecture)
Pipeline with bag of features (current lecture)
Feature extraction
Sparse vs Dense detection
For dense detection, we usually filter regions with low variance
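A hedged sketch of dense sampling with variance filtering (grid step, patch size and threshold are illustrative):

```python
import cv2

def dense_keypoints(gray, step=8, patch=16, min_var=25.0):
    """Place keypoints on a regular grid, discarding flat (low-variance) regions."""
    kps = []
    for y in range(patch, gray.shape[0] - patch, step):
        for x in range(patch, gray.shape[1] - patch, step):
            region = gray[y - patch // 2:y + patch // 2, x - patch // 2:x + patch // 2]
            if region.var() >= min_var:
                kps.append(cv2.KeyPoint(float(x), float(y), float(patch)))
    return kps
```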
Dimensionality reduction
Often used before encoding to:
- limit dictionary sizes
- facilitate quantization
Several techniques:
- Principal Component Analysis
- Singular Value Decomposition
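For instance with scikit-learn's PCA (64 components is an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA

descriptors = np.random.rand(10000, 128).astype(np.float32)  # placeholder for real local descriptors
pca = PCA(n_components=64)
reduced = pca.fit_transform(descriptors)  # fit on training descriptors, reuse pca.transform() at query time
```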
Encoding
Bag of Visual Words
- Modern approaches are derived from this one
- Reuses ideas from text/web search for images
- From a set of descriptors, build a histogram of quantized descriptors, much like a color histogram
Quantization
- Discretization of some signal - Lossy process!
- Vectorial formulation: $f:\mathbb{R}^d\to F$, with $F=\{1,2,\dots,k\}$
- Defines a Voronoi diagram, i.e. a decomposition of a metric space determined by the distances to a discrete set of points
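A sketch of such a quantizer: each descriptor is mapped to the index of its closest centroid, which implicitly defines the Voronoi cells (indices are 0-based here):

```python
import numpy as np

def quantize(descriptors, centroids):
    """For each descriptor, return the index of the nearest centroid."""
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)
```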
Bag of Visual Words (continued)
- Cluster centers are determined using k-Means (once and for all, on a training set)
- Each descriptor is quantized: store the code of the closest centroid
- Build a histogram of descriptor count for each cluster
The set of cluster centers is called the dictionary, the codebook, or the visual vocabulary
We can choose the number of words!
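A hedged end-to-end sketch with scikit-learn, assuming `train_descriptors` stacks the local descriptors of a training set; the vocabulary size is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

k = 1000  # illustrative vocabulary size
kmeans = KMeans(n_clusters=k, n_init=4, random_state=0).fit(train_descriptors)  # learn the codebook once

def bovw_vector(image_descriptors):
    words = kmeans.predict(image_descriptors)                  # quantize each local descriptor
    return np.bincount(words, minlength=k).astype(np.float32)  # raw histogram; normalization is discussed below
```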
Vector size
The resulting vector size for a given image is given by:
\[D=\text{vocabulary size}\]
Usually, the bigger the vocabulary, the better the results. Several thousand words are common.
Normalization
First method
Problem:
- The values in the histogram are absolute: each bin counts the number of occurrences of its visual word
- This makes the descriptor sensitive to variations in the number of descriptors per image
Solution:
- Normalize the histogram
Second technique
- As in text retrieval, it is common to reweight the BoVW vectors using the TF-IDF technique
- Goal: give more importance to rare words than to frequent ones
- For each dimension of the histogram, compute a new value $t_i$
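A hedged sketch of this reweighting, following the usual text-retrieval formulation ($t_i = \text{tf}_i \cdot \text{idf}_i$); `histograms` is an (images × vocabulary size) matrix of raw BoVW counts:

```python
import numpy as np

def tfidf_reweight(histograms, eps=1e-8):
    tf = histograms / (histograms.sum(axis=1, keepdims=True) + eps)  # word frequency within each image
    df = np.count_nonzero(histograms > 0, axis=0)                    # number of images containing each word
    idf = np.log(histograms.shape[0] / (df + eps))                   # rare words get larger weights
    return tf * idf
```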
Variant: Soft BoVW
Use soft assignment to clusters, add counts to neighbor bins
Other variants
BoVW only counts the number of local descriptors assigned to each visual word
VLAD: vector of locally aggregated descriptors
Fisher vector
IR evaluation
How to evaluate a retrieval system?
We need a set of queries for which we know the expected results: the “ground truth”, aka “targets” or “gold standard”
Precision and recall
Used to measure the balance between
- Returning many results, hence retrieving many of the relevant results present in the database, but also a lot of noise
- Returning very few results, leading to less noise, but also fewer relevant results
Precision (P) is the fraction of retrieved documents that are relevant.
Recall (R) is the fraction of relevant documents that are retrieved.
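In symbols (standard IR definitions):
\[P=\frac{\#(\text{relevant}\cap\text{retrieved})}{\#\,\text{retrieved}}\qquad R=\frac{\#(\text{relevant}\cap\text{retrieved})}{\#\,\text{relevant}}\]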
F-measure
The F-measure is the weighted harmonic mean of precision and recall.
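The usual formulation, with $F_1$ as the balanced special case:
\[F_\beta=\frac{(1+\beta^2)\,P\,R}{\beta^2 P+R}\qquad F_1=\frac{2PR}{P+R}\]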
How to evaluate a ranked retrieval system?
When results are ordered, more measures are available.
Common useful measures are:
- Precision-recall
- ROC graph and the area under it (AUC)
Precision-recall graph
Mean average precision (mAP)
Example: Compute the AP for a given query
For this query and the following results, plot the precision/recall graph
Results 1, 3 and 9 are relevant here
Is the first result relevant? Yes
- Compute the current precision: 1 relevant / 1 retrieved = 1
- Compute the current recall: 1 relevant retrieved / 3 relevant in total = 1/3
Repeat for each result
Build the sawtooth precision/recall graph
Some libraries keep only the upper (interpolated) values, which is not ideal
Case 2: what if $\vert e_i\vert=4$?
What we want is the area under the curve
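A hedged numpy sketch of the AP computation for this example, assuming 10 results are returned and exactly 3 relevant items exist in total (case 1):

```python
import numpy as np

relevance = np.array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0])  # 1 = relevant result at that rank
n_relevant = relevance.sum()

hits = np.cumsum(relevance)
precision = hits / np.arange(1, len(relevance) + 1)  # precision after each rank
recall = hits / n_relevant                           # recall after each rank

# AP = mean of the precisions measured at the ranks of the relevant results.
ap = precision[relevance == 1].mean()  # (1/1 + 2/3 + 3/9) / 3 = 2/3
```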