
MLRF: Lecture 04

Link to the Hackmd note

Agenda for lecture 4

  1. Introduction
  2. Content-based image retrieval (CBIR) using bags of features
  3. Evaluating CBIR / Ranked Retrieval (RR) systems
  4. Texture descriptors
  5. Character descriptors

Summary of last lecture

  • Descriptor matching
    • 1-way
    • Cross check
    • Ratio test
    • Radius threshold
  • Descriptor indexing
    • Indexing pipeline: train/query
    • Linear matching
    • kD-Trees
    • FLANN/hierarchical k-Means
    • LSH
    • Approximate NN problem
  • Projective transformations
    • Translation
    • Rotation
    • Scaling
    • Projective
  • Homography estimation
    • Least squares
    • RANSAC
  • Geometric validation

Practice session 3: Take home messages

Twin it!: Extracting descriptors, matching them by hand

  • Detect keypoints and extract the surrounding pixels into a flat vector
  • Normalize them and compare them using cross correlation ($\sum_i f_i \cdot g_i$); see the sketch below
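
A minimal sketch of this matching-by-hand approach, assuming grayscale images as NumPy arrays and keypoints given as (row, col) coordinates (function names are illustrative):

```python
import numpy as np

def extract_patch(img, center, size=16):
    """Flatten the square patch of pixels around a keypoint (row, col)."""
    r, c = center                       # keypoint assumed far enough from the border
    h = size // 2
    return img[r - h:r + h, c - h:c + h].astype(np.float32).ravel()

def normalized_cross_correlation(f, g):
    """Compare two flattened patches with sum_i f_i * g_i after normalization."""
    f = f - f.mean()
    g = g - g.mean()
    f /= np.linalg.norm(f) + 1e-8       # unit norm so the score lies in [-1, 1]
    g /= np.linalg.norm(g) + 1e-8
    return float(np.dot(f, g))          # higher means more similar
```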

Augmented Documents: Use an off-the-shelf detector/descriptor: ORB

Augmented Documents: Projective transforms and Homography estimation

  • OpenCV provides the solver machinery: list of matches $\to$ $3\times 3$ matrix (see the sketch below)
  • Just some coordinate transform (a 2D $\to$ 2D transform)
  • Remember the classical matrix forms: translation, rotation, …
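
As an illustration, a possible OpenCV sketch for this part of the practice session (file names are placeholders; parameter values are common choices, not the session's exact settings):

```python
import cv2
import numpy as np

img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)    # placeholder file names
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

# Off-the-shelf detector/descriptor: ORB
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match binary descriptors with Hamming distance and cross-check
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# List of matches -> 3x3 homography matrix, with RANSAC for robustness
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
```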

Next practice session

Local feature descriptors

Introduction

Given some keypoints in image 1, which are the most similar ones in image 2?

Local feature descriptors will associate a vector to each local feature.

Such description vector should be:

  • Compact - to enable fast indexing and matching
  • Discriminative - to enable object recognition
  • Robust to perturbations - to tolerate real conditions

We will focus on 2 widely used descriptors for their pedagogical interest

  • HOG (Histogram of Gradients), used in SIFT
  • BRIEF (Binary Robust Independent Elementary Features), used in ORB

Histogram of Gradients

Algorithm overview

  1. (optional) global image normalization
  2. Compute the gradient image in $x$ and $y$
  3. Compute gradient histograms
  4. Normalise across blocks
  5. Flatten into a feature vector
  6. (optionally quantize to integers; see the sketch below)
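
One convenient way to experiment with this pipeline is scikit-image's `hog` function; the parameter values below are common defaults, not the lecture's exact settings:

```python
from skimage import data
from skimage.feature import hog

image = data.camera()             # any grayscale test image
features, hog_image = hog(
    image,
    orientations=9,               # number of bins in each gradient histogram
    pixels_per_cell=(8, 8),       # step 3: one histogram per cell
    cells_per_block=(2, 2),       # step 4: normalization across blocks
    visualize=True,               # also return an image of the dominant orientations
)
print(features.shape)             # step 5: flattened feature vector
```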

Example

That feeling when you bump your elbow right on the nerve

Summary

Pros

  • Very stable over illumination changes, perspective changes, blur

Cons

  • Slow to compute
  • Quite large (128 bytes for original SIFT)

BRIEF

General idea:

  1. Sample pairs of points $\{p(x), p(y)\}$ in the smoothed keypoint neighborhood (the raw signal is very spiky otherwise, like derivatives)
  2. Compute a simple binary test: $p(x) < p(y)$
  3. Accumulate the results of $n_d$ tests to form a binary vector of $n_d$ bits (256 in the reference paper); see the sketch below
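
A toy NumPy sketch of this idea (not the reference BRIEF implementation: the smoothing and the sampling pattern are simplified, and the keypoint is assumed far from the image border):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def toy_brief(img, keypoint, n_d=256, patch_size=31, rng=None):
    """n_d binary tests p(x) < p(y) on a smoothed patch, packed into n_d bits."""
    if rng is None:
        rng = np.random.default_rng(0)          # fixed pattern, reused for all keypoints
    smoothed = gaussian_filter(img.astype(np.float32), sigma=2)
    r, c = keypoint
    h = patch_size // 2
    # Pairs of offsets sampled uniformly in the patch (strategy GI below)
    offsets = rng.integers(-h, h + 1, size=(n_d, 4))
    bits = [smoothed[r + dr1, c + dc1] < smoothed[r + dr2, c + dc2]
            for dr1, dc1, dr2, dc2 in offsets]
    return np.packbits(bits)                    # 256 bits -> 32 bytes
```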

Sampling strategies

  • GI: $(x_i, y_i)\sim$ i.i.d. Uniform($-\frac{S}{2}$, $+\frac{S}{2}$) (see the sampling sketch after this list)
  • GII: $(x_i, y_i)\sim$ i.i.d. Gaussian($0$, $\frac{1}{25}S^2$)
  • GIII:
    • $x_i\sim$ i.i.d. Gaussian($0$, $\frac{1}{25}S^2$)
    • $y_i\sim$ i.i.d. Gaussian($x_i$, $\frac{1}{100}S^2$)
  • GIV: $(x_i, y_i)$ randomly sampled from discrete locations of a coarse polar grid, introducing a spatial quantization
  • GV:
    • $x_i=(0,0)$
    • $y_i$ takes all possible values on a coarse polar grid containing $n_d$ points
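
For illustration, the first three strategies can be sampled as follows ($S$ is the patch size; coordinates are offsets from the keypoint center; this is only a sketch of the distributions):

```python
import numpy as np

rng = np.random.default_rng(42)
S, n_d = 31, 256

# GI: both endpoints i.i.d. uniform over the patch
gi = rng.uniform(-S / 2, S / 2, size=(n_d, 2, 2))

# GII: both endpoints i.i.d. Gaussian with variance S^2 / 25
gii = rng.normal(0.0, S / 5, size=(n_d, 2, 2))

# GIII: x_i ~ Gaussian(0, S^2/25), y_i ~ Gaussian(x_i, S^2/100)
x = rng.normal(0.0, S / 5, size=(n_d, 2))
y = rng.normal(x, S / 10)
```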

What is the best approach ?

Summary

Pros

  • Very fast to compute
  • Very fast to match
  • Very compact to store

Cons

  • Less robust than HoG (SIFT) or DoH (SURF) in several real cases

Invariance check

Rotation invariance

  • Add an angle measure
  • Take the main gradient orientation
    • (take the average around the keypoint; see the sketch below)

We now have for each keypoint:

  • Coordinates
  • Orientation

Descriptor: computed over normalized patch
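
A simplified sketch of how such a dominant orientation can be estimated from a patch around the keypoint (the bin count and the absence of a weighting window are illustrative choices):

```python
import numpy as np

def dominant_orientation(patch, n_bins=36):
    """Magnitude-weighted histogram of gradient orientations; return the peak angle (radians)."""
    gy, gx = np.gradient(patch.astype(np.float32))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)                                   # in [-pi, pi]
    hist, edges = np.histogram(angle, bins=n_bins,
                               range=(-np.pi, np.pi), weights=magnitude)
    peak = int(np.argmax(hist))
    return 0.5 * (edges[peak] + edges[peak + 1])                 # center of the winning bin
```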

Scale invariance

Multi scale feature detection and computation:

  • Add a keypoint for each relevant scale
  • Possibly several keypoints at the same position

We now have for each keypoint:

  • Coordinates
  • Orientation
  • Scale

Descriptor: computed on a scaled patch

Reminder: Gaussian sigma vs window size

Illumination invariance

SIFT approach:

  1. Normalize the vector
    • Solves affine changes, but what about non-linear sources like camera saturation?
  2. Cap the vector elements to $20\%$ (!) and renormalize (see the sketch below)
    • Now we have some illumination invariance
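
A minimal sketch of this normalize, clip, renormalize step (0.2 is the capping threshold reported for the original SIFT):

```python
import numpy as np

def sift_illumination_normalize(descriptor, cap=0.2):
    """L2-normalize, cap large components, renormalize."""
    v = descriptor / (np.linalg.norm(descriptor) + 1e-12)   # handles affine illumination changes
    v = np.minimum(v, cap)                                   # limits non-linear effects (saturation)
    return v / (np.linalg.norm(v) + 1e-12)
```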

Viewpoint invariance

Better, but more complex approaches can tolerate extreme viewpoint change

Complete pipelines

SIFT (Scale-Invariant Feature Transform)

  1. Construct scale space
  2. Take difference of Gaussians
  3. Locate DoG Extrema
  4. Sub pixel locate potential feature points
  5. Filter edge and low contrast responses
  6. Assign keypoint orientations
  7. Build keypoint descriptors
  8. Matching, etc. (see the sketch below)
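
With OpenCV (version 4.4 or later, where SIFT is available in the main package), the whole pipeline is exposed in a single call; the file name is a placeholder:

```python
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)      # placeholder file name
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each keypoint carries a position, scale and orientation; each descriptor is a 128-D vector
print(len(keypoints), descriptors.shape)
```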

ORB (Oriented FAST and Rotated BRIEF)

  1. Use FAST in pyramids to detect stable keypoints
  2. Select the strongest features using FAST or Harris response
  3. Find their orientation using first-order moments
  4. Compute the descriptors using BRIEF
    • Where the coordinates of random point pairs are rotated according to the measured orientation (see the sketch below)
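
The corresponding OpenCV call, with a few parameters that map to the steps above (the values shown are OpenCV's defaults; the file name is a placeholder):

```python
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
orb = cv2.ORB_create(
    nfeatures=500,                      # keep the strongest keypoints (step 2)
    nlevels=8, scaleFactor=1.2,         # FAST on an image pyramid (step 1)
    scoreType=cv2.ORB_HARRIS_SCORE,     # or cv2.ORB_FAST_SCORE (step 2)
    patchSize=31,                       # patch used for orientation and BRIEF tests (steps 3-4)
)
keypoints, descriptors = orb.detectAndCompute(img, None)   # 32-byte binary descriptors
```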

Conclusion about feature extraction

Selection of appropriate features:

  • It is a critical decision
  • Depends on the specific application

Features must:

  • Be invariant to data variations (depending on the application)
    • rotation
    • perspective
    • noise
    • etc.
  • Have low dimensionality for fast training, matching, reasonable storage

Features determine the type of info to work with:

  • gray-level, binary, color image
  • contours, vectorization, skeleton

Features also determine the type of classifier / indexer

Content based image retrieval

Two strategies using local descriptors

Keep all local descriptors

Pros:

  • Enables geometric validation
  • Better part detection in theory

Cons:

  • Huge memory requirements

Build a global descriptor using local ones

  • Inspired by text retrieval
  • Compact representation
  • Tricks to embed spatial information
  • Limited memory requirements

Pipeline with local descriptors (prev. lecture)

Pipeline with bag of features (current lecture)

Features extraction

Sparse vs Dense detection

Dimensionality reduction

Often used before encoding to:

  • limit dictionary sizes
  • facilitate quantization

Several techniques:

  • Principal Component Analysis (PCA); see the sketch below
  • Singular Value Decomposition (SVD)
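
For example, with scikit-learn (the dimensions and data are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

descriptors = np.random.rand(10000, 128).astype(np.float32)   # e.g. SIFT-like descriptors
pca = PCA(n_components=64)                   # 128-D -> 64-D before building the dictionary
reduced = pca.fit_transform(descriptors)
print(pca.explained_variance_ratio_.sum())   # fraction of the variance kept
```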

Encoding

Bag of Visual Words

  • Modern approaches are derived from this one
  • Reuses ideas from text/web search for images
  • From a set of descriptors, build a histogram of quantized descriptors, much like a color histogram

Quantization

  • Discretization of some signal - Lossy process!
  • Vectorial formulation: $f:\mathbb{R}^d\to F$, with $F=\{1,2,\dots,k\}$
  • Defines a Voronoi diagram, i.e. a decomposition of a metric space determined by the distances to a discrete set of points (see the sketch below)
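
In code, such a quantizer $f$ is simply a nearest-centroid assignment; a minimal NumPy sketch (returning indices $0,\dots,k-1$):

```python
import numpy as np

def quantize(descriptors, centroids):
    """Map each d-dimensional descriptor to the index of its closest centroid (its Voronoi cell)."""
    # Pairwise squared distances, shape (n_descriptors, k)
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)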

Bag of Visual Words (continued)

  • Cluster centers are determined using k-Means (once for all on a training set)
  • Each descriptor is quantized: store the code of the closest centroid
  • Build a histogram of descriptor counts for each cluster (see the sketch below)
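
Putting the three steps together with scikit-learn (vocabulary size and data are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

k = 1000                                              # vocabulary size
train_descriptors = np.random.rand(50000, 64)         # all descriptors of the training set
kmeans = MiniBatchKMeans(n_clusters=k, random_state=0).fit(train_descriptors)  # once for all

def bovw_histogram(image_descriptors):
    """Quantize one image's descriptors and count occurrences of each visual word."""
    words = kmeans.predict(image_descriptors)
    return np.bincount(words, minlength=k)            # vector of size D = vocabulary size
```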

Vector size

The resulting vector size for a given image is given by:

\[D=\text{vocabulary size}\]

Usually, the bigger the vocabulary the better the results. Several thousands of words are common.

Normalization

First method

Problem:

  • The values in the histogram are absolute: each bin counts the number of occurrences of each visual word
  • This makes the descriptor sensitive to variations in the number of descriptors per image

Solution:

  • Normalize the histogram (see the sketch below)
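
A minimal sketch (L1 normalization turns counts into frequencies; L2 normalization is another common choice):

```python
import numpy as np

def normalize_histogram(h, norm="l1"):
    """Make the BoVW vector independent of the number of descriptors in the image."""
    h = h.astype(np.float64)
    denom = h.sum() if norm == "l1" else np.linalg.norm(h)
    return h / (denom + 1e-12)
```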

Second technique

  • As in text retrieval, it is common to reweight the BoVW vectors using the TF-IDF technique
  • Goal: give more importance to rare words than to frequent ones
  • For each dimension of the histogram, compute a new value $t_i$ (see the sketch below)
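
A rough sketch of such a reweighting over a database of BoVW histograms, using one standard TF-IDF formulation (the lecture's exact formula for $t_i$ may differ):

```python
import numpy as np

def tfidf(histograms):
    """histograms: (n_images, k) matrix of visual-word counts."""
    tf = histograms / (histograms.sum(axis=1, keepdims=True) + 1e-12)  # term frequency
    df = (histograms > 0).sum(axis=0)                                  # images containing each word
    idf = np.log(histograms.shape[0] / (df + 1))                       # rare words weigh more
    return tf * idf
```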

Variant: Soft BoVW

Use soft assignment to clusters: add counts to the neighboring bins as well

Other variants

BoVW is only about counting the number of local descriptors

VLAD: vector of locally aggregated descriptors

Fisher vector

IR evaluation

How to evaluate a retrieval system ?

We need a set of queries for which we know the expected results: the “ground truth”, aka “targets” or “gold standard”

Precision and recall

Used to measure the balance between

  • Returning many results, hence a lot of the relevant results present in the database, but also a lot of noise
  • Returning very few results, leading to less noise, but also fewer relevant results

F-measure
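
As a reminder, the standard definitions, writing $R$ for the set of relevant documents and $A$ for the set of retrieved answers (this notation is ours):

\[\text{Precision}=\frac{\vert R\cap A\vert}{\vert A\vert},\qquad \text{Recall}=\frac{\vert R\cap A\vert}{\vert R\vert},\qquad F_1=2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}\]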

How to evaluate a ranked retrieval system ?

When results are ordered, more measures are available.

Common useful measures are:

  • Precision-recall
  • ROC graph and the area under it (AUC)

Precision-recall graph

Mean-average precision

Example: Compute the AP for a given query

For this query and the following results, plot the precision/recall graph

Is the first result relevant? Yes

  • Compute the current precision: 1 relevant / 1 retrieved = 1
  • Compute the current recall: 1 relevant retrieved out of $\vert e_i\vert$ expected

Repeat for each result

Case 2: what if $\vert e_i\vert=4$ ?
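
A small sketch to reproduce this computation: `relevance` is a hypothetical binary list over the ranked results and `n_relevant` is the total number of expected relevant items $\vert e_i\vert$ (set it to 4 for case 2):

```python
import numpy as np

def precision_recall_points(relevance, n_relevant):
    """Running precision and recall after each retrieved result."""
    relevance = np.asarray(relevance, dtype=float)
    tp = np.cumsum(relevance)                        # relevant results retrieved so far
    retrieved = np.arange(1, len(relevance) + 1)     # results retrieved so far
    return tp / retrieved, tp / n_relevant           # precision, recall

def average_precision(relevance, n_relevant):
    """AP: mean of the precision values at the ranks of the relevant results."""
    precision, _ = precision_recall_points(relevance, n_relevant)
    hits = np.asarray(relevance, dtype=bool)
    return precision[hits].sum() / n_relevant

# Example: the first result is relevant -> precision 1/1 = 1, recall 1/n_relevant
print(average_precision([1, 0, 1, 1, 0], n_relevant=3))
```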
