Link to the HackMD note
Face detection in general
Why is face detection so difficult?
- Pose (out-of-plane rotation) and orientation (in-plane rotation)
- Presence or absence of structural components
- Occlusions
- Imaging conditions
- Faces are highly non-rigid objects (deformations)
Related problems
- Face localization
- Facial feature extraction (landmarks such as eyes, mouth, …)
- Face recognition
- Verification
- Facial expression
Overview of different approaches:
- Knowledge-based top-down methods
- Feature invariant methods (localization)
- Template-matching methods (localization)
- Appearance-based methods (detection)
Appearance-based methods in detail
- Eigenfaces
- Distribution-based methods
- Support Vector Machines (SVM)
- Sparse Network of Winnows
- Naive Bayes Classifier
- Hidden Markov models
- Information Theoretic Approaches (ITA)
- Inductive Learning (C4.5 and Find-S algorithms)
- Artificial Neural Network (ANN) techniques
- Shallow networks (as opposed to deep ones)
- Deep learning
Residual connections allow the error to be backpropagated much further back than in the rest of the network.
The big flaw of VGG networks: they have an enormous number of weights.
weights $=$ parameters $\neq$ hyperparameters
Proportionally (in %), there are fewer and fewer neurons.
The more weights we have, the more likely the network is powerful, but also the more likely a part of it serves no purpose.
The beginning in 1994
Burel and Carel propose a methodology for ANNs:
- The training phase, where the system tunes its internal parameters
- The local training phase, where the system adapts the weights to the specific environment of a local site
- The detection phase, during which the weights no longer change
Vaillant, Monrocq and Le Cun: first translation-invariant ANN; it decides whether each pixel belongs to a given object or not
Yang and Huang: first fully automatic human face recognition system
1997
Rowley, Baluja and Kanade propose the first rotation-invariant method:
- Uses a template-based approach
- Methodology:
- Regions are proposed
- A router network estimates the orientation of this region
- The window is then derotated (rotated back to upright) using this angle
- A detector network decides if the window contains a face
2004
First real-time face detection algorithm by Viola & Jones
- Tells if a given image of arbitrary size contains a human face, and if so, where it is
- Minimizes false positive and false negative rates
- Usually 5 types of Haar-like features
- A $24\times 24$ image contains a huge number of such features ($162886$)
- Integral image for feature computation
- In the integral image, the value at point $1$ is the sum of the pixels in region $A$, at point $2$ it is $A+B$, at point $3$ it is $A+C$, and at point $4$ it is $A+B+C+D$; the sum over $D$ alone is thus $4+1-2-3$
This allows the features to be computed at a very low cost
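A minimal NumPy sketch of the integral-image trick (array names and sizes are illustrative): once the integral image is built, the sum over any rectangle costs at most 4 lookups, whatever the rectangle's size.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y+1, :x+1] (cumulative sum over rows, then columns)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] using only 4 lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.random.rand(24, 24)
ii = integral_image(img)
# A Haar-like feature is then just a few box sums, each in constant time:
assert np.isclose(box_sum(ii, 2, 3, 10, 12), img[2:11, 3:13].sum())
```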
Principle
The algorithm should deploy more resources to work on those windows more likely to contain a face while spending as little effort as possible on the rest
- We can use weak classifiers
- Then we can make a strong one with a sequence of weak ones
- Viola & Jones: use AdaBoost
The more layers, the fewer false positives.
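A toy sketch of this attentional cascade (the stage, weights and thresholds below are placeholders, not the actual trained Viola & Jones ones): each stage can reject a window early, so the many background windows cost almost nothing.

```python
import numpy as np

def cascade_classify(window, stages):
    """stages: list of (weak_classifiers, threshold); reject as early as possible."""
    for weak_classifiers, threshold in stages:
        # AdaBoost-style weighted vote of this stage's weak classifiers.
        score = sum(alpha * h(window) for alpha, h in weak_classifiers)
        if score < threshold:
            return False  # early rejection: most windows stop here cheaply
    return True  # survived every stage: candidate face

# Illustrative stage: one weak classifier thresholding a (fake) Haar-like feature.
stage = ([(1.0, lambda w: float(w[:12].mean() - w[12:].mean() > 0.0))], 0.5)
print(cascade_classify(np.random.rand(24, 24), [stage]))
```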
Overfeat (2014)
- Winner of the ImageNet Large Scale Visual Recognition Challenge of 2013
- Performs at the same time classification (blocks), localization (grouping blocks) and detection (merging windows)
- This multitask approach boosts the performance of the network
- Trained on ImageNet 2012
- Inspired by multi-view voting
- Uses a multiscale factor of 1.4
- Uses dense sliding windows thanks to convolutions (a sketch follows below)
The better aligned the network window and the object, the stronger the confidence of the network response.
- Efficiency: convolution computations in overlapping regions are shared
- Bounding boxes are accumulated instead of suppressed
- Only one shared network for 3 functionalities
- Uses a feature extractor for classification purposes
- Uses offsets to refine the resolution of the proposed windows
- Detection fine-tuning: negative training on the fly
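A minimal sketch of the "dense sliding windows through convolution" idea (shapes are illustrative, not OverFeat's actual layers): a fully connected classifier is rewritten as a convolution kernel, so a larger input yields a whole grid of window scores, with overlapping windows sharing computation.

```python
import numpy as np

def conv_valid(x, k):
    """Naive 'valid' 2D correlation: one output per window position."""
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

# A "fully connected" scorer over 14x14 inputs is exactly a 14x14 conv kernel:
fc_as_kernel = np.random.rand(14, 14)

# Applied convolutionally to a larger 20x20 image, it produces a 7x7 grid of
# sliding-window scores in one pass instead of a single score.
image = np.random.rand(20, 20)
scores = conv_valid(image, fc_as_kernel)
print(scores.shape)  # (7, 7)
```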
Methodology
- decomposition into blocks with 3 offsets
- for each block, estimation of the most probable corresponding class
- (overlapping) region proposals for each class (see below)
- bounding box deduction for each class (see below)
The MTCNN face detection algorithm (2016)
Zhang, Zhang & Li
- Real-time deep-learning-based face detection algorithm
- The MTCNN is a cascade of 3 similar networks (P/R/O-nets)
- The four steps:
- Computation of the (multiscale) image pyramid
- P-net: proposal network
- R-net: refinement network (filters and refines the results of the P-net)
- O-net: output network (refines further, and proposes landmarks)
- Uses hard sample mining (the $30\%$ easiest cases do not take part in the backpropagation) to improve the detection results
- Originality: uses multi-task learning, that is, every network
- predicts bounding boxes
- uses regression to refine/calibrate the positions of the edges of the bounding boxes
- applies Non-Maximal Suppression (NMS) to keep only relevant candidate windows (merging highly overlapping candidates; see the NMS sketch below)
- (can) propose 5 facial landmarks
- This multi-task learning seems to improve face detection compared to usual mono-task learning
- How does it work in practice? It minimizes
$$\min \sum_{i=1}^{N} \sum_{j \in \{\text{det},\ \text{box},\ \text{landmark}\}} \alpha_j\, \beta_i^j\, L_i^j$$
where the detection loss $L_i^{\text{det}}$ is based on cross-entropy, and the box and landmark losses are based on the Euclidean loss ($\alpha_j$ weights the tasks, $\beta_i^j \in \{0, 1\}$ selects which losses apply to sample $i$)
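A minimal NumPy sketch of Non-Maximal Suppression as used by each stage (the box format $[x_1, y_1, x_2, y_2]$ and the threshold are illustrative): keep the best-scoring box, drop every candidate overlapping it too much, repeat.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; returns the indices of the kept boxes."""
    order = scores.argsort()[::-1]  # best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the kept box and every remaining candidate.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # drop highly overlapped candidates
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # [0, 2]: box 1 overlaps box 0 too much
```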
Fast R-CNN and its predecessors (2014-2015)
Spatial Pyramid Pooling network (SPPnet, 2014)
- It was proposed to speed up the R-CNN by sharing computation,
- The SPPnet computes a shared feature map using convolutions over the entire image, and only then extracts the features corresponding to each proposal to make the prediction,
- Then it concatenates the features of the proposal coming from each scale thanks to max-pooling into a $6 \times 6 \times$ scales map (spatial pyramid; see the sketch after the drawbacks below),
- SPPnets accelerate the R-CNN by 10 to 100 times at test time and by 3 times at training time.
- Drawback 1: Like the R-CNN, it is a multi-stage approach:
- First, feature extraction using convolution,
- Second, fine-tuning of a network using log loss,
- Third, SVM training,
- Fourth, fitting bounding-box regressors.
- Drawback 2: Features are written to disk,
- Drawback 3: The fine-tuning cannot update the convolutional layers that precede the spatial pyramid pooling (limited accuracy).
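A minimal sketch of spatial pyramid pooling on one feature-map channel (the pyramid levels below are illustrative): whatever the size of the proposal's feature map, max-pooling it into a fixed grid of bins per level yields a fixed-length vector.

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 6)):
    """Max-pool the map into n x n bins for each pyramid level, then concatenate."""
    H, W = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # Integer bin boundaries that always cover the whole map.
                y0, y1 = (i * H) // n, ((i + 1) * H + n - 1) // n
                x0, x1 = (j * W) // n, ((j + 1) * W + n - 1) // n
                pooled.append(feature_map[y0:y1, x0:x1].max())
    return np.array(pooled)  # fixed length: 1 + 4 + 36 = 41 values per channel

print(spp(np.random.rand(13, 9)).shape)   # (41,)
print(spp(np.random.rand(30, 40)).shape)  # (41,): same length, any input size
```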
Fast R-CNN (2015)
A Fast Region-based Convolutional Network method,
- Mainly made of several innovations to make it faster:
- Uses Singular Value Decomposition (SVD) truncation to speed up the computations (see the sketch after this list),
- Uses a multi-task loss to train the whole network in one single stage (it jointly learns to classify object proposals (windows) and refine their spatial locations),
- Trains the VGG16 9 times faster than the R-CNN and 3 times faster than the SPPnets,
- Is able to backpropagate the error into the convolutional layers (contrary to SPPnets and R-CNN), which increases the accuracy,
- No disk storage is required for feature caching.
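A minimal NumPy sketch of the truncated-SVD trick (the layer size and $k$ are illustrative): a big fully connected layer $y = Wx$ is replaced by two thinner ones, $y \approx U_k(\Sigma_k V_k^\top x)$, cutting the number of multiplications.

```python
import numpy as np

W = np.random.rand(1024, 1024)        # a big fully connected layer
U, s, Vt = np.linalg.svd(W, full_matrices=False)

k = 64                                # keep only the top-k singular values
W1 = np.diag(s[:k]) @ Vt[:k]          # first thin layer:  (k, 1024)
W2 = U[:, :k]                         # second thin layer: (1024, k)

x = np.random.rand(1024)
y_full = W @ x                        # ~1024 * 1024 multiplications
y_trunc = W2 @ (W1 @ x)               # ~2 * 64 * 1024 multiplications (8x fewer)
# On a trained layer the spectrum decays fast, so a small k loses little accuracy
# (this random W is a worst case, shown only for the mechanics).
print(np.abs(y_full - y_trunc).mean())
```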
Faster R-CNN (2016)
- Usual object detection methods depended on (slow) region proposal algorithms,
- They had the original idea of using ANNs to make these predictions on GPU (much faster),
- They called this technology Region Proposal Networks (RPNs).
Properties
- is just made of several convolutional layers applied on the feature maps,
- It is thus a fully convolutional network (weights are shared in space),
- It is therefore translation-invariant in space (contrary to the MultiBox method),
- it can be seen as a mini-network with a sliding-window applied on the feature map to predict proposals,
- predicts at the same time proposals (via regression) and objectness scores,
- is able to predict proposals with a wide range of scales and aspect ratios (by default, 3 and 3 respectively; see the anchor sketch below).
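A minimal sketch of the default anchor set (the paper's values: 3 scales $\times$ 3 aspect ratios, i.e. 9 anchors per feature-map position; the exact box format is illustrative):

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """9 anchors centered at the origin, as [x1, y1, x2, y2]."""
    anchors = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)  # area stays ~s^2, shape varies
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# At every position of the feature map, the RPN regresses box offsets and an
# objectness score for each of these anchors.
print(make_anchors().shape)  # (9, 4)
```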
Since the Fast R-CNN does not have a region proposal stage, they added their RPN before the Fast R-CNN to obtain the Faster R-CNN,
- The RPN is then an attention network, since it tells the Fast R-CNN where to look,
- Since the efficiency of the Fast R-CNN depends on the region proposals, better proposals thanks to the RPN imply a better accuracy of the Faster R-CNN,
- To ensure that the features used by the RPN and the Fast R-CNN are the same, they share the weights of the feature extractor between them (faster, more accurate).
- Computing the predictions of the RPN then takes only 10 milliseconds.
Mask R-CNN (2018)
Extension of Faster R-CNN
Its aim is instance segmentation
It has 3 outputs/predictions:
- the usual bounding box predictions (from Faster R-CNN),
- the usual classification predictions (still from Faster R-CNN),
- the mask predictions (A small FCN applied to each RoI – NEW !!),
No competition is done among class predictions (see the sketch at the end of this section)
- Mask prediction is done in parallel
- The training is done with a multi-task loss: $L = L_{cls} + L_{box} + L_{mask}$
- We can easily change the backbone (feature extractor)
- It runs at 5 fps
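A minimal NumPy sketch of the "no competition among classes" design (the class count and mask size follow the usual 80-class, $28 \times 28$ setting; the rest is illustrative): each class has its own binary mask with an independent per-pixel sigmoid, and only the ground-truth class's mask enters the loss, so classes never compete as they would under a softmax.

```python
import numpy as np

def mask_loss(mask_logits, gt_mask, gt_class):
    """mask_logits: (num_classes, H, W); per-pixel binary cross-entropy computed
    on the ground-truth class's mask only (sigmoid, not softmax)."""
    logits = mask_logits[gt_class]             # the other classes are ignored
    p = 1.0 / (1.0 + np.exp(-logits))          # independent per-pixel sigmoid
    eps = 1e-7
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()

logits = np.random.randn(80, 28, 28)           # 80 classes, 28x28 RoI masks
gt = (np.random.rand(28, 28) > 0.5).astype(float)
print(mask_loss(logits, gt, gt_class=3))
```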
R-FCN Architectures (2016)
Region-based Fully Convolutional Networks
- 2-stage object detection strategy
- Every layer is convolutional, whatever its role
- Almost all the computations are shared on the entire image
- RoIs (candidate regions) are extracted by a Region Proposal Network (RPN)
- Uses position-sensitive score maps
We shift the window to the right:
In the top-middle probability map, the white pixels correspond to the probability that the top-middle part of a head is present at that location.
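A minimal sketch of position-sensitive RoI pooling for a single class ($k = 3$ for readability; shapes are illustrative): the RoI is split into a $k \times k$ grid, and bin $(i, j)$ is pooled only from the score map specialized for that relative position (e.g. "top-middle of a head"), then the bins vote.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    """score_maps: (k*k, H, W); roi = (y0, x0, y1, x1).
    Bin (i, j) of the RoI is average-pooled from score map i*k + j only."""
    y0, x0, y1, x1 = roi
    h, w = (y1 - y0) / k, (x1 - x0) / k
    bins = []
    for i in range(k):
        for j in range(k):
            ys = slice(int(y0 + i * h), int(y0 + (i + 1) * h))
            xs = slice(int(x0 + j * w), int(x0 + (j + 1) * w))
            bins.append(score_maps[i * k + j][ys, xs].mean())
    return np.mean(bins)  # the bins vote for the class score of this RoI

maps = np.random.rand(9, 40, 40)  # 3x3 position-sensitive score maps
print(ps_roi_pool(maps, (5, 5, 29, 29)))
```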
RetinaNet (2018)
- One-stage detector
- Uses an innovative focal loss (see the sketch after this list)
- Naturally handles class imbalance
- Uses a Feature Pyramid Network (FPN) backbone on top of a ResNet architecture
- It thus provides a rich multi-scale feature pyramid (efficiency)
- At each scale, they attach subnetworks to classify and make regressions
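A minimal NumPy sketch of the focal loss ($\gamma = 2$ and $\alpha = 0.25$ are the paper's defaults): the $(1 - p_t)^\gamma$ factor crushes the loss of easy, well-classified examples, the overwhelming majority of which are background, which is how the class imbalance is handled.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """p: predicted probability of the positive class; y: 0/1 labels."""
    p_t = np.where(y == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t + 1e-7)

# An easy background example (p = 0.01) costs almost nothing, while a hard
# positive (p = 0.6) keeps a sizeable loss:
print(focal_loss(np.array([0.01, 0.6]), np.array([0, 1])))
```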
Detectrons (2018-2019)
Detectron V1 (2018, Facebook)
Detectron V2 (Facebook)
Real-time detection algorithms
YOLO (You Only Look Once) (2016)
- single-shot detection architecture
- Designed for real-time applications
- It does NOT predict regions of interests
- It predicts a fixed amount of detections on the image directly,
- They are then filtered to contain only the actual detections.
- faster than region-based architectures
- lower detection accuracy
- performs a multi-box bounding box regression on the input image directly
- Method: the image is overlayed by a grid, and for each grid cell, a fixed amount of detections are predicted.
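A minimal sketch of YOLO's fixed output layout (the v1 paper's values: $S = 7$, $B = 2$ boxes per cell, $C = 20$ classes): the whole image is processed in one forward pass that emits a single tensor of detections.

```python
S, B, C = 7, 2, 20        # grid size, boxes per cell, classes (YOLOv1 values)

# Each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities.
per_cell = B * 5 + C
print(S * S, "cells x", per_cell, "values =", S * S * per_cell)
# 49 cells x 30 values = 1470 outputs, then filtered to the actual detections
```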
SSD (Single Shot Multibox Detector) (2016)
- Is a single-shot detection architecture
- Instead of performing bounding box regression on the final layer like YOLO, SSDs append additional convolutional layers that gradually decrease in size.
- For each additional layer, a fixed amount of predictions with diverse aspect ratios are computed,
- It results in a large number of predictions that differ heavily across size and aspect ratio.
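A minimal sketch of how the shrinking extra layers multiply the prediction count, using the grid sizes and per-location box counts usually quoted for SSD300 (taken as assumptions here):

```python
# (feature map side, default boxes per location) for each SSD300 prediction layer.
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total = sum(side * side * boxes for side, boxes in layers)
print(total)  # 8732 default boxes, heavily varied in size and aspect ratio
```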
YOLOv2 (YOLO 9000) (2016)
- Extension of YOLOv1
- Ability to predict objects at different resolutions,
- Computes the initial bounding box priors using clustering (see the sketch below),
- Better performance than SSD.
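A minimal NumPy sketch of this clustering ($k$ and the data are illustrative): k-means over the training boxes' widths and heights, but with $d(\text{box}, \text{centroid}) = 1 - \text{IoU}$ instead of the Euclidean distance, so large and small boxes are treated fairly.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, as if all boxes shared the same center."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, steps=20):
    """boxes: (N, 2) widths/heights of training boxes; returns k prior shapes."""
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(steps):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # d = 1 - IoU
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids

print(kmeans_anchors(np.random.rand(500, 2) * 100, k=5))
```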