Link to the HackMD note
Face detection in general
Why is face detection so difficult?
- Pose (out-of-plane rotation) and orientation (in-plane rotation)
- Presence or absence of structural components
- Occlusions
- Imaging conditions
- Faces are highly non-rigid objects (deformations)
Related problems
- Face localization
- Facial feature extraction (landmarks such as eyes, mouth, …)
- Face recognition
- Verification
- Facial expression
Overview of different approaches:
- Knowledge-based top-down methods
- Feature invariant methods (localization)
- Template-matching methods (localization)
- Appearance-based methods (detection)
Appearance-based methods in detail
- Eigenfaces
- Distribution-based methods
- Support Vector Machines (SVM)
- Sparse Network of Winnows
- Naive Bayes Classifier
- Hidden Markov models
- Information Theoretic Approaches (ITA)
- Inductive Learning (C4.5 and Find-S algorithms)
- Artificial Neural Network (ANN) techniques
- Shallow networks (as opposed to deep ones)
- Deep learning
Residual connections allow the error to be backpropagated much further back than in the rest of the network.
The big flaw of VGG networks: they have an enormous number of weights.
weights $=$ parameters $\neq$ hyperparameters
Proportionally (in %), there are fewer and fewer neurons.
The more weights we have, the more likely the network is powerful, but also the more likely a part of it serves no purpose.
The beginning in 1994
Burel and Carel propose a methodology for ANNs:
- The training phase, where the system tunes its internal parameters
- The local training phase, where the system adapts the weights to the specific environment of a local site
- The detection phase, during which the weights no longer change
Vaillant, Monrocq and Le Cun: first translation-invariant ANN; it decides whether each pixel belongs to a given object or not
Yang and Huang: first fully automatic human face recognition system
1997
Rowley, Baluja and Kanade propose the first rotation-invariant method:
- Uses a template-based approach
- Methodology:
- Regions are proposed
- A router network estimates the orientation of this region
- The window is then derotated (rotated back to upright) using this angle
- A detector network decides if the window contains a face
2004
First real-time face detection algorithm by Viola & Jones
- Tells if a given image of arbitrary size contains a human face, and if so, where it is
- Minimizes false positive and false negative rates
- Usually 5 types of Haar-like features
- A $24\times 24$ image contains a huge number of such features ($162886$)
- Integral image for feature computation
- In the integral image, the value at point $1$ is the sum of the pixels in region $A$, at point $2$ it is $A+B$, at point $3$ it is $A+C$, and at point $4$ it is $A+B+C+D$; the sum over $D$ alone is thus $4+1-2-3$
This allows the features to be computed at a very low cost
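A minimal NumPy sketch of the integral-image trick (array names and sizes are illustrative): once the integral image is built, the sum over any rectangle costs at most 4 lookups, whatever the rectangle's size.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y+1, :x+1] (cumulative sum over rows, then columns)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] using only 4 lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.random.rand(24, 24)
ii = integral_image(img)
# A Haar-like feature is then just a few box sums, each in constant time:
assert np.isclose(box_sum(ii, 2, 3, 10, 12), img[2:11, 3:13].sum())
```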
Principle
The algorithm should deploy more resources to work on those windows more likely to contain a face while spending as little effort as possible on the rest
- We can use weak classifiers
- Then we can make a strong one with a sequence of weak ones
- Viola & Jones: use AdaBoost
The more layers, the fewer false positives.
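A toy sketch of this attentional cascade (the stage, weights and thresholds below are placeholders, not the actual trained Viola & Jones ones): each stage can reject a window early, so the many background windows cost almost nothing.

```python
import numpy as np

def cascade_classify(window, stages):
    """stages: list of (weak_classifiers, threshold); reject as early as possible."""
    for weak_classifiers, threshold in stages:
        # AdaBoost-style weighted vote of this stage's weak classifiers.
        score = sum(alpha * h(window) for alpha, h in weak_classifiers)
        if score < threshold:
            return False  # early rejection: most windows stop here cheaply
    return True  # survived every stage: candidate face

# Illustrative stage: one weak classifier thresholding a (fake) Haar-like feature.
stage = ([(1.0, lambda w: float(w[:12].mean() - w[12:].mean() > 0.0))], 0.5)
print(cascade_classify(np.random.rand(24, 24), [stage]))
```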
Overfeat (2014)
- Winner of the ImageNet Large Scale Visual Recognition Challenge of 2013
- Performs at the same time classification (blocks), localization (grouping blocks) and detection (merging windows)
- This multitask approach boosts the performance of the network
- Trained on ImageNet 2012
- Inspired by multi-view voting
- Uses a multiscale factor of 1.4
- Uses dense sliding windows thanks to convolutions (a sketch follows below)
The better aligned the network window and the object, the stronger the confidence of the network response.
- Efficiency: convolution computations in overlapping regions are shared
- Bounding boxes are accumulated instead of suppressed
- Only one shared network for 3 functionalities
- Uses a feature extractor for classification purposes
- Uses offsets to refine the resolution of the proposed windows
- Detection fine-tuning: negative training on the fly
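A minimal sketch of the "dense sliding windows through convolution" idea (shapes are illustrative, not OverFeat's actual layers): a fully connected classifier is rewritten as a convolution kernel, so a larger input yields a whole grid of window scores, with overlapping windows sharing computation.

```python
import numpy as np

def conv_valid(x, k):
    """Naive 'valid' 2D correlation: one output per window position."""
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

# A "fully connected" scorer over 14x14 inputs is exactly a 14x14 conv kernel:
fc_as_kernel = np.random.rand(14, 14)

# Applied convolutionally to a larger 20x20 image, it produces a 7x7 grid of
# sliding-window scores in one pass instead of a single score.
image = np.random.rand(20, 20)
scores = conv_valid(image, fc_as_kernel)
print(scores.shape)  # (7, 7)
```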
Methodology
- decomposition into blocks with 3 offsets
- for each block, estimation of the most probable corresponding class
- (overlapping) region proposals for each class (see below)
- bounding box deduction for each class (see below)
The MTCNN face detection algorithm (2016)
Zhang, Zhang & Li
- Real-time deep-learning-based face detection algorithm
- The MTCNN is a cascade of 3 similar networks (P/R/O-nets)
- The four steps:
- Computation of the (multiscale) image pyramid
- P-net: proposal network
- R-net: refinement network (filters and refines the results of the P-net)
- O-net: output network (refines further, and proposes landmarks)
- Uses hard sample mining (the $30\%$ easiest cases do not take part in the backpropagation) to improve the detection results
- Originality: uses multi-task learning, that is, every network
- predicts bounding boxes
- uses regression to refine/calibrate the positions of the edges of the bounding boxes
- applies Non-Maximal Suppression (NMS) to keep only relevant candidate windows (merging highly overlapping candidates; see the NMS sketch below)
- (can) propose 5 facial landmarks
- This multi-task learning seems to improve face detection compared to usual mono-task learning
- How does it work in practice? It minimizes
$$\min \sum_{i=1}^{N} \sum_{j \in \{\text{det},\ \text{box},\ \text{landmark}\}} \alpha_j\, \beta_i^j\, L_i^j$$
where the detection loss $L_i^{\text{det}}$ is based on cross-entropy, and the box and landmark losses are based on the Euclidean loss ($\alpha_j$ weights the tasks, $\beta_i^j \in \{0, 1\}$ selects which losses apply to sample $i$)
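A minimal NumPy sketch of Non-Maximal Suppression as used by each stage (the box format $[x_1, y_1, x_2, y_2]$ and the threshold are illustrative): keep the best-scoring box, drop every candidate overlapping it too much, repeat.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; returns the indices of the kept boxes."""
    order = scores.argsort()[::-1]  # best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the kept box and every remaining candidate.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # drop highly overlapped candidates
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # [0, 2]: box 1 overlaps box 0 too much
```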
Fast R-CNN and its predecessors (2014-2015)
Spatial Pyramid Pooling network (SPPnet, 2014)
- It was proposed to speed up the R-CNN by sharing computation,
- The SPPnet computes a shared feature map using convolutions over the entire image, and only then extracts the features corresponding to each proposal to make the prediction,
- Then it concatenates the features of the proposal coming from each scale thanks to max-pooling into a $6 \times 6 \times$ scales map (spatial pyramid; see the sketch after the drawbacks below),
- SPPnets accelerate the R-CNN by 10 to 100 times at test time and by 3 times at training time.
- Drawback 1: Like the R-CNN, it is a multi-stage approach:
- First, feature extraction using convolution,
- Second, fine-tuning of a network using log loss,
- Third, SVM training,
- Fourth, fitting bounding-box regressors.
- Drawback 2: Features are written to disk,
- Drawback 3: The fine-tuning cannot update the convolutional layers that precede the spatial pyramid pooling (limited accuracy).
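A minimal sketch of spatial pyramid pooling on one feature-map channel (the pyramid levels below are illustrative): whatever the size of the proposal's feature map, max-pooling it into a fixed grid of bins per level yields a fixed-length vector.

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 6)):
    """Max-pool the map into n x n bins for each pyramid level, then concatenate."""
    H, W = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # Integer bin boundaries that always cover the whole map.
                y0, y1 = (i * H) // n, ((i + 1) * H + n - 1) // n
                x0, x1 = (j * W) // n, ((j + 1) * W + n - 1) // n
                pooled.append(feature_map[y0:y1, x0:x1].max())
    return np.array(pooled)  # fixed length: 1 + 4 + 36 = 41 values per channel

print(spp(np.random.rand(13, 9)).shape)   # (41,)
print(spp(np.random.rand(30, 40)).shape)  # (41,): same length, any input size
```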
Fast R-CNN (2015)
A Fast Region-based Convolutional Network method,
- Mainly made of several innovations to make it faster:
- Uses Singular Value Decomposition (SVD) truncation to speed up the computations (see the sketch after this list),
- Uses a multi-task loss to train the whole network in one single stage (it jointly learns to classify object proposals (windows) and refine their spatial locations),
- Trains the VGG16 9 times faster than the R-CNN and 3 times faster than the SPPnets,
- Is able to backpropagate the error into the convolutional layers (contrary to SPPnets and R-CNN), which increases the accuracy,
- No disk storage is required for feature caching.
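A minimal NumPy sketch of the truncated-SVD trick (the layer size and $k$ are illustrative): a big fully connected layer $y = Wx$ is replaced by two thinner ones, $y \approx U_k(\Sigma_k V_k^\top x)$, cutting the number of multiplications.

```python
import numpy as np

W = np.random.rand(1024, 1024)        # a big fully connected layer
U, s, Vt = np.linalg.svd(W, full_matrices=False)

k = 64                                # keep only the top-k singular values
W1 = np.diag(s[:k]) @ Vt[:k]          # first thin layer:  (k, 1024)
W2 = U[:, :k]                         # second thin layer: (1024, k)

x = np.random.rand(1024)
y_full = W @ x                        # ~1024 * 1024 multiplications
y_trunc = W2 @ (W1 @ x)               # ~2 * 64 * 1024 multiplications (8x fewer)
# On a trained layer the spectrum decays fast, so a small k loses little accuracy
# (this random W is a worst case, shown only for the mechanics).
print(np.abs(y_full - y_trunc).mean())
```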
Faster R-CNN (2016)
- Usual object detection methods depended on (slow) region proposal algorithms,
- They had the original idea of using ANNs to make these predictions on GPU (much faster),
- They called this technology Region Proposal Networks (RPNs).
Properties
- is just made of several convolutional layers applied on the feature maps,
- It is thus a fully convolutional network (weights are shared in space),
- It is therefore translation-invariant in space (contrary to the MultiBox method),
- it can be seen as a mini-network with a sliding-window applied on the feature map to predict proposals,
- predicts at the same time proposals (via regression) and objectness scores,
- is able to predict proposals with a wide range of scales and aspect ratios (by default, 3 and 3 respectively; see the anchor sketch below).
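A minimal sketch of the default anchor set (the paper's values: 3 scales $\times$ 3 aspect ratios, i.e. 9 anchors per feature-map position; the exact box format is illustrative):

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """9 anchors centered at the origin, as [x1, y1, x2, y2]."""
    anchors = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)  # area stays ~s^2, shape varies
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# At every position of the feature map, the RPN regresses box offsets and an
# objectness score for each of these anchors.
print(make_anchors().shape)  # (9, 4)
```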
Since the Fast R-CNN does not have a region proposal stage, they added their RPN before the Fast R-CNN to obtain the Faster R-CNN,
- The RPN is then an attention network, since it tells the Fast R-CNN where to look,
- Since the efficiency of the Fast R-CNN depends on the region proposals, better proposals thanks to the RPN imply a better accuracy of the Faster R-CNN,
- To ensure that the features used by the RPN and the Fast R-CNN are the same, they share the weights of the feature extractor between them (faster, more accurate).
- Computing the predictions of the RPN then takes only 10 milliseconds.
Mask R-CNN (2018)
Extension of Faster R-CNN
Its aim is instance segmentation
It has 3 outputs/predictions:
- the usual bounding box predictions (from Faster R-CNN),
- the usual classification predictions (still from Faster R-CNN),
- the mask predictions (A small FCN applied to each RoI – NEW !!),
No competition is done among class predictions (see the sketch at the end of this section)
- Mask prediction is done in parallel
- The training is done with a multi-task loss: $L = L_{cls} + L_{box} + L_{mask}$
- We can easily change the backbone (feature extractor)
- It runs at 5 fps
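A minimal NumPy sketch of the "no competition among classes" design (the class count and mask size follow the usual 80-class, $28 \times 28$ setting; the rest is illustrative): each class has its own binary mask with an independent per-pixel sigmoid, and only the ground-truth class's mask enters the loss, so classes never compete as they would under a softmax.

```python
import numpy as np

def mask_loss(mask_logits, gt_mask, gt_class):
    """mask_logits: (num_classes, H, W); per-pixel binary cross-entropy computed
    on the ground-truth class's mask only (sigmoid, not softmax)."""
    logits = mask_logits[gt_class]             # the other classes are ignored
    p = 1.0 / (1.0 + np.exp(-logits))          # independent per-pixel sigmoid
    eps = 1e-7
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()

logits = np.random.randn(80, 28, 28)           # 80 classes, 28x28 RoI masks
gt = (np.random.rand(28, 28) > 0.5).astype(float)
print(mask_loss(logits, gt, gt_class=3))
```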
R-FCN Architectures (2016)
Region-based Fully Convolutional Networks
- 2-stage object detection strategy
- Every layer is convolutional, whatever its role
- Almost all the computations are shared on the entire image
- RoIs (candidate regions) are extracted by a Region Proposal Network (RPN)
- Uses position-sensitive score maps
We shift the window to the right:
In the top-middle probability map, the white pixels correspond to the probability that the top-middle part of a head is present at that location.
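A minimal sketch of position-sensitive RoI pooling for a single class ($k = 3$ for readability; shapes are illustrative): the RoI is split into a $k \times k$ grid, and bin $(i, j)$ is pooled only from the score map specialized for that relative position (e.g. "top-middle of a head"), then the bins vote.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    """score_maps: (k*k, H, W); roi = (y0, x0, y1, x1).
    Bin (i, j) of the RoI is average-pooled from score map i*k + j only."""
    y0, x0, y1, x1 = roi
    h, w = (y1 - y0) / k, (x1 - x0) / k
    bins = []
    for i in range(k):
        for j in range(k):
            ys = slice(int(y0 + i * h), int(y0 + (i + 1) * h))
            xs = slice(int(x0 + j * w), int(x0 + (j + 1) * w))
            bins.append(score_maps[i * k + j][ys, xs].mean())
    return np.mean(bins)  # the bins vote for the class score of this RoI

maps = np.random.rand(9, 40, 40)  # 3x3 position-sensitive score maps
print(ps_roi_pool(maps, (5, 5, 29, 29)))
```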
RetinaNet (2018)
- One-stage detector
- Uses an innovative focal loss (see the sketch after this list)
- Naturally handles class imbalance
- Uses a Feature Pyramid Network (FPN) backbone on top of a ResNet architecture
- It thus provides a rich multi-scale feature pyramid (efficiency)
- At each scale, they attach subnetworks to classify and make regressions
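A minimal NumPy sketch of the focal loss ($\gamma = 2$ and $\alpha = 0.25$ are the paper's defaults): the $(1 - p_t)^\gamma$ factor crushes the loss of easy, well-classified examples, the overwhelming majority of which are background, which is how the class imbalance is handled.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """p: predicted probability of the positive class; y: 0/1 labels."""
    p_t = np.where(y == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t + 1e-7)

# An easy background example (p = 0.01) costs almost nothing, while a hard
# positive (p = 0.6) keeps a sizeable loss:
print(focal_loss(np.array([0.01, 0.6]), np.array([0, 1])))
```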
Detectrons (2018-2019)
Detectron V1 (2018, Facebook)
Detectron V2 (Facebook)
Real-time detection algorithms
YOLO (You Only Look Once) (2016)
- single-shot detection architecture
- Designed for real-time applications
- It does NOT predict regions of interests
- It predicts a fixed amount of detections on the image directly,
- They are then filtered to contain only the actual detections.
- faster than region-based architectures
- lower detection accuracy
- performs a multi-box bounding box regression on the input image directly
- Method: the image is overlayed by a grid, and for each grid cell, a fixed amount of detections are predicted.
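A minimal sketch of YOLO's fixed output layout (the v1 paper's values: $S = 7$, $B = 2$ boxes per cell, $C = 20$ classes): the whole image is processed in one forward pass that emits a single tensor of detections.

```python
S, B, C = 7, 2, 20        # grid size, boxes per cell, classes (YOLOv1 values)

# Each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities.
per_cell = B * 5 + C
print(S * S, "cells x", per_cell, "values =", S * S * per_cell)
# 49 cells x 30 values = 1470 outputs, then filtered to the actual detections
```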
SSD (Single Shot Multibox Detector) (2016)
- Is a single-shot detection architecture
- Instead of performing bounding box regression on the final layer like YOLO, SSDs append additional convolutional layers that gradually decrease in size.
- For each additional layer, a fixed amount of predictions with diverse aspect ratios are computed,
- It results in a large number of predictions that differ heavily across size and aspect ratio.
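A minimal sketch of how the shrinking extra layers multiply the prediction count, using the grid sizes and per-location box counts usually quoted for SSD300 (taken as assumptions here):

```python
# (feature map side, default boxes per location) for each SSD300 prediction layer.
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total = sum(side * side * boxes for side, boxes in layers)
print(total)  # 8732 default boxes, heavily varied in size and aspect ratio
```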
YOLOv2 (YOLO 9000) (2016)
- Extension of YOLOv1
- Ability to predict objects at different resolutions,
- Computes the initial bounding box priors using clustering (see the sketch below),
- Better performance than SSD.
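A minimal NumPy sketch of this clustering ($k$ and the data are illustrative): k-means over the training boxes' widths and heights, but with $d(\text{box}, \text{centroid}) = 1 - \text{IoU}$ instead of the Euclidean distance, so large and small boxes are treated fairly.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, as if all boxes shared the same center."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, steps=20):
    """boxes: (N, 2) widths/heights of training boxes; returns k prior shapes."""
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(steps):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # d = 1 - IoU
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids

print(kmeans_anchors(np.random.rand(500, 2) * 100, k=5))
```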