Link to the HackMD note
Agenda for lecture 4
- Introduction
- Content-based image retrieval (CBIR) using bags of features
- Evaluating CBIR / Ranked Retrieval (RR) systems
- Texture descriptors
- Character descriptors
Summary of last lecture
- Descriptor matching
- 1-way
- Cross check
- Ratio test
- Radius threshold
- Descriptor indexing
- Indexing pipeline: train/query
- Linear matching
- kD-Trees
- FLANN/hierarchical k-Means
- LSH
- Approximate NN problem
- Projective transformations
- Translation
- Rotation
- Scaling
- …
- Projective
- Homography estimation
- Least squares
- RANSAC
- Geometric validation
Practice session 3: Take home messages
Twin it!: Extracting descriptors, matching them by hand
- Detect keypoints and extract surrounding pixels to flat vector
- Normalize them and compare them using cross-correlation ($\sum_i f_i \cdot g_i$); see the sketch below
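A minimal numpy sketch of this take-home (function names and patch size are illustrative, not the practice session's exact code):

```python
import numpy as np

def patch_to_vector(image, x, y, half_size=8):
    """Extract a square patch around keypoint (x, y) and flatten it to a vector."""
    patch = image[y - half_size:y + half_size, x - half_size:x + half_size]
    return patch.astype(np.float32).ravel()

def normalize(v):
    """Zero-mean, unit-norm normalization to tolerate brightness/contrast changes."""
    v = v - v.mean()
    return v / (np.linalg.norm(v) + 1e-8)

def cross_correlation(f, g):
    """Similarity score sum_i f_i * g_i between two normalized patch vectors."""
    return float(np.dot(normalize(f), normalize(g)))
```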
Augmented Documents: Use an off-the-shelf detector/descriptor: ORB
Augmented Documents: Projective transforms and Homography estimation
- OpenCV provides the solver machinery: list of matches $\to$ $3\times 3$ matrix (see the sketch after this list)
- Just some coordinate transform (a 2D $\to$ 2D mapping)
- Remember the classical matrix forms: translation, rotation, …
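A hedged sketch of that machinery, assuming `img1`, `kp1`, `kp2` and `matches` come from an earlier detection/matching step (e.g. ORB + a brute-force matcher):

```python
import numpy as np
import cv2

# Collect the matched coordinates in both images.
src_pts = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# RANSAC rejects outlier matches; H is the 3x3 projective transform, mask flags inliers.
H, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, ransacReprojThreshold=5.0)

# Apply the 2D -> 2D coordinate transform, e.g. to project the corners of image 1.
h, w = img1.shape[:2]
corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
projected = cv2.perspectiveTransform(corners, H)
```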
Next practice session
Implement a simple image search engine
Will be graded
Local feature descriptors
Introduction
Given some keypoints in image 1, which are the most similar ones in image 2?
This is a nearest neighbor problem in descriptor space. It is also a geometrical problem in coordinate space.
Local feature detectors give us the same feature under several perturbations: perspective, illumination, blur…
Local feature descriptors will associate a vector to each local feature.
Such description vector should be:
- Compact - to enable fast indexing and matching
- Discriminative - to enable object recognition
- Robust to perturbations - to tolerate real conditions
We will focus on 2 widely used descriptors for their pedagogical interest
- HOG (Histogram of Gradients), used in SIFT
- BRIEF (Binary Robust Independent Elementary Features), used in ORB
Histogram of Gradients
Algorithm overview
- (optional) global image normalization
- Compute the gradient image in $x$ and $y$
- Compute gradient histograms
- Normalise across blocks
- Flatten into a feature vector
- (quantize to integers)
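The steps above are available off-the-shelf, e.g. in scikit-image; the parameter values below are illustrative, not the lecture's reference settings:

```python
from skimage import color, io
from skimage.feature import hog

image = color.rgb2gray(io.imread("patch.png"))  # hypothetical input image
features = hog(
    image,
    orientations=9,          # number of gradient orientation bins
    pixels_per_cell=(8, 8),  # cell size for the local histograms
    cells_per_block=(2, 2),  # blocks used for normalization
    block_norm="L2-Hys",     # normalize across blocks
    feature_vector=True,     # flatten into a single feature vector
)
```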
Example
That feeling when you bang your elbow right on the nerve
Summary
Pros
- Very stable under illumination changes, perspective changes, blur
Cons
- Slow to compute
- Quite large (128 bytes for original SIFT)
BRIEF
General idea:
- Sample pairs of points $\{p(x), p(y)\}$ in the smoothed keypoint neighborhood (the raw signal is very spiky otherwise, like derivatives)
- Compute a simple binary test: $p(x)\lt p(y)$
- Accumulate the results of $n_d$ tests to form a binary vector of $n_d$ bits (256 in ref.)
Sampling strategies
- GI: $(x_i, y_i)\sim$ i.i.d. Uniform($-\frac{S}{2}$, $+\frac{S}{2}$)
- GII: $(x_i, y_i)\sim$ i.i.d. Gaussian($0$, $\frac{1}{25}S^2$)
- GIII:
- $x_i\sim$ i.i.d. Gaussian($0$, $\frac{1}{25}S^2$)
- $y_i\sim$ i.i.d. Gaussian($x_i$, $\frac{1}{100}S^2$)
- GIV: $(x_i, y_i)$ randomly sampled from discrete locations of a coarse polar grid introducing a spatial quantization
- GV:
- $x_i=(0,0)$
- $y_i$ takes all possible values on a coarse polar grid containing $n_d$ points
What is the best approach ?
Strategy GII (the second one) gives the best results.
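A minimal BRIEF-like sketch (not the OpenCV implementation) using the GII sampling above: draw the $n_d$ test pairs once, then accumulate binary tests over a pre-smoothed patch and compare descriptors with the Hamming distance:

```python
import numpy as np

rng = np.random.default_rng(0)
S, n_d = 31, 256  # patch size and number of binary tests

# GII: both points of each pair drawn i.i.d. from Gaussian(0, S^2/25), clipped to the patch.
pairs = np.clip(np.round(rng.normal(0.0, S / 5.0, size=(n_d, 2, 2))), -(S // 2), S // 2).astype(int)

def brief_descriptor(smoothed, cx, cy):
    """n_d binary tests p(x) < p(y) around keypoint (cx, cy) of a pre-smoothed image."""
    bits = np.empty(n_d, dtype=np.uint8)
    for i, ((x1, y1), (x2, y2)) in enumerate(pairs):
        bits[i] = smoothed[cy + y1, cx + x1] < smoothed[cy + y2, cx + x2]
    return bits

def hamming(d1, d2):
    """Matching cost between two binary descriptors."""
    return int(np.count_nonzero(d1 != d2))
```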
Summary
Pros
- Very fast to compute
- Very fast to match
- Very compact to store
Cons
- Less robust than HoG (SIFT) or DoH (SURF) in several real cases
Invariance check
Rotation invariance
- Add an angle measure
- Take the main gradient orientation
- (Take the average around the keypoint)
We now have for each keypoint:
- Coordinates
- Orientation
Descriptor: computed over normalized patch
Scale invariance
Multi-scale feature detection and computation:
- Add a keypoint for each relevant scale
- Possibly several keypoints at the same position
We now have for each keypoint:
- Coordinates
- Orientation
- Scale
Descriptor: computed on a scaled patch
Reminder: Gaussian sigma vs window size
Illumination invariance
SIFT approach:
- Normalize the vector
- Solves affine changes, but what about non-linear sources like camera saturation?
- Cap the vector elements to $20\%$ (!) and renormalize
- Now we have some illumination invariance
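A sketch of this normalization scheme, assuming a HOG-like descriptor vector `v`:

```python
import numpy as np

def sift_normalize(v, cap=0.2):
    v = v / (np.linalg.norm(v) + 1e-8)      # unit norm: removes affine illumination changes
    v = np.minimum(v, cap)                  # cap elements at 20% to limit non-linear effects (saturation)
    return v / (np.linalg.norm(v) + 1e-8)   # renormalize
```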
Viewpoint invariance
Better, but more complex approaches can tolerate extreme viewpoint change
Complete pipelines
SIFT (Scale-Invariant Feature Transform)
- Construct scale space
- Take difference of Gaussians
- Locate DoG Extrema
- Sub-pixel localization of potential feature points
- Filter edge and low-contrast responses
- Assign keypoint orientations
- Build keypoint descriptors
- Matching, etc.
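The whole pipeline is available off-the-shelf (assuming OpenCV ≥ 4.4, where SIFT is in the main package):

```python
import cv2

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each keypoint carries coordinates, orientation and scale; descriptors are 128-D vectors.
```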
ORB (Oriented FAST and Rotated BRIEF)
- Use FAST in pyramids to detect stable keypoints
- Select the strongest features using FAST or Harris response
- Find their orientation using first-order moments
- Compute the descriptors using BRIEF
- Where the coordinates of random point pairs are rotated according to the measured orientation
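Same thing for ORB in OpenCV; its binary descriptors are matched with the Hamming distance (file names and `nfeatures` are illustrative):

```python
import cv2

img1 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching with cross-check, best matches first.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
```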
Conclusion about feature extraction
Selection of appropriate features:
- It is a critical decision
- Depends on the specific application
Features must:
- Be invariant to data variations (depending on the application)
- rotation
- perspective
- noise
- etc.
- Have low dimensionality for fast training and matching, and reasonable storage
Features determine the type of info to work with:
- gray-level, binary, color image
- contours, vectorization, skeleton
Features also determine the type of classifier / indexer
Content-based image retrieval
Two strategies using local descriptors
Keep all local descriptors
Pros:
- Enables geometric validation
- Better part detection in theory
Cons:
- Huge memory requirements
Like what we did in practice session 3 to match parts of an image (useful to validate geometric constraints and classify an image at the same time)
Build a global descriptor using local ones
- Inspired by text retrieval
- Compact representation
- Tricks to embed spatial information
- Limited memory requirements
Like what we did in practice session 2 with the color histogram, at the bubble level
Bag of Feature approach
Pipeline with local descriptors (prev. lecture)
Pipeline with bag of features (current lecture)
Feature extraction
Sparse vs Dense detection
For dense detection, we usually filter regions with low variance
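A hedged sketch of dense sampling with variance filtering (grid step, patch size and threshold are illustrative):

```python
import cv2

def dense_keypoints(gray, step=8, patch=16, min_var=25.0):
    """Place keypoints on a regular grid, discarding flat (low-variance) regions."""
    kps = []
    for y in range(patch, gray.shape[0] - patch, step):
        for x in range(patch, gray.shape[1] - patch, step):
            region = gray[y - patch // 2:y + patch // 2, x - patch // 2:x + patch // 2]
            if region.var() >= min_var:
                kps.append(cv2.KeyPoint(float(x), float(y), float(patch)))
    return kps
```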
Dimensionality reduction
Often used before encoding to:
- limit dictionary sizes
- facilitate quantization
Several techniques:
- Principal Component Analysis
- Singular Value Decomposition
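For instance with scikit-learn's PCA (64 components is an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA

descriptors = np.random.rand(10000, 128).astype(np.float32)  # placeholder for real local descriptors
pca = PCA(n_components=64)
reduced = pca.fit_transform(descriptors)  # fit on training descriptors, reuse pca.transform() at query time
```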
Encoding
Bag of Visual Words
- Modern approaches are derived from this one
- Reuses ideas from text/web search for images
- From a set of descriptors, build a histogram of quantized descriptors, much like a color histogram
Quantization
- Discretization of some signal - Lossy process!
- Vectorial formulation: $f:\mathbb{R}^d\to F$, with $F=\{1,2,\dots,k\}$
- Defines a Voronoi diagram, i.e. a decomposition of a metric space determined by the distances to a discrete set of points
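A sketch of such a quantizer: each descriptor is mapped to the index of its closest centroid, which implicitly defines the Voronoi cells (indices are 0-based here):

```python
import numpy as np

def quantize(descriptors, centroids):
    """For each descriptor, return the index of the nearest centroid."""
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)
```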
Bag of Visual Words (continued)
- Cluster centers are determined using k-Means (once and for all, on a training set)
- Each descriptor is quantized: store the code of the closest centroid
- Build a histogram of descriptor count for each cluster
The set of cluster centers is called the dictionary, the codebook, or the visual vocabulary
We can choose the number of words!
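A hedged end-to-end sketch with scikit-learn, assuming `train_descriptors` stacks the local descriptors of a training set; the vocabulary size is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

k = 1000  # illustrative vocabulary size
kmeans = KMeans(n_clusters=k, n_init=4, random_state=0).fit(train_descriptors)  # learn the codebook once

def bovw_vector(image_descriptors):
    words = kmeans.predict(image_descriptors)                  # quantize each local descriptor
    return np.bincount(words, minlength=k).astype(np.float32)  # raw histogram; normalization is discussed below
```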
Vector size
The resulting vector size for a given image is given by:
\[D=\text{vocabulary size}\]
Usually, the bigger the vocabulary, the better the results. Several thousand words are common.
Normalization
First method
Problem:
- The values in the histogram are absolute: each bin counts the number of occurrences of its visual word
- This makes the descriptor sensitive to variations in the number of descriptors per image
Solution:
- Normalize the histogram
Second technique
- As in text retrieval, it is common to reweight the BoVW vectors using the TF-IDF technique
- Goal: give more importance to rare words than to frequent ones
- For each dimension of the histogram, compute a new value $t_i$
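A hedged sketch of this reweighting, following the usual text-retrieval formulation ($t_i = \text{tf}_i \cdot \text{idf}_i$); `histograms` is an (images × vocabulary size) matrix of raw BoVW counts:

```python
import numpy as np

def tfidf_reweight(histograms, eps=1e-8):
    tf = histograms / (histograms.sum(axis=1, keepdims=True) + eps)  # word frequency within each image
    df = np.count_nonzero(histograms > 0, axis=0)                    # number of images containing each word
    idf = np.log(histograms.shape[0] / (df + eps))                   # rare words get larger weights
    return tf * idf
```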
Variant: Soft BoVW
Use soft assignment to clusters, add counts to neighbor bins
Other variants
BoVW only counts the number of local descriptors assigned to each visual word
VLAD: vector of locally aggregated descriptors
Fisher vector
IR evaluation
How to evaluate a retrieval system?
We need a set of queries for which we know the expected results: the “ground truth”, aka “targets” or “gold standard”
Precision and recall
Used to measure the balance between
- Returning many results, hence retrieving many of the relevant results present in the database, but also a lot of noise
- Returning very few results, leading to less noise, but also fewer relevant results
Precision (P) is the fraction of retrieved documents that are relevant.
Recall (R) is the fraction of relevant documents that are retrieved.
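In symbols (standard IR definitions):
\[P=\frac{\#(\text{relevant}\cap\text{retrieved})}{\#\,\text{retrieved}}\qquad R=\frac{\#(\text{relevant}\cap\text{retrieved})}{\#\,\text{relevant}}\]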
F-measure
The F-measure is the weighted harmonic mean of precision and recall.
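The usual formulation, with $F_1$ as the balanced special case:
\[F_\beta=\frac{(1+\beta^2)\,P\,R}{\beta^2 P+R}\qquad F_1=\frac{2PR}{P+R}\]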
How to evaluate a ranked retrieval system?
When results are ordered, more measures are available.
Common useful measures are:
- Precision-recall
- ROC graph and the area under it (AUC)
Precision-recall graph
Mean average precision (mAP)
Example: Compute the AP for a given query
For this query and the following results, plot the precision/recall graph
Results 1, 3 and 9 are relevant here
Is the first result relevant? Yes
- Compute the current precision: 1 relevant / 1 retrieved = 1
- Compute the current recall: 1 relevant retrieved / 3 relevant in total = 1/3
Repeat for each result
Build the sawtooth precision/recall graph
Some libraries keep only the upper (interpolated) values, which is not ideal
Case 2: what if $\vert e_i\vert=4$?
What we want is the area under the curve
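A hedged numpy sketch of the AP computation for this example, assuming 10 results are returned and exactly 3 relevant items exist in total (case 1):

```python
import numpy as np

relevance = np.array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0])  # 1 = relevant result at that rank
n_relevant = relevance.sum()

hits = np.cumsum(relevance)
precision = hits / np.arange(1, len(relevance) + 1)  # precision after each rank
recall = hits / n_relevant                           # recall after each rank

# AP = mean of the precisions measured at the ranks of the relevant results.
ap = precision[relevance == 1].mean()  # (1/1 + 2/3 + 3/9) / 3 = 2/3
```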