Link to the HackMD note
Why do we care
Group the input data into clusters that share some characteristics
- Find patterns in the data (data mining problem)
- Visualize the data in a simpler way
- Infer some properties of a given data point based on how it relates to other data points (statistical learning)
Why is it tricky
Belongs to unsupervised learning
- No ground truth available to learn from or to evaluate the quality of the algorithm
How to assess how closely data points are related to each other?
- Which criteria (features) are the most relevant
- Which metrics make the most sense
How to assess the soundness of the resulting clusters? Are they relevant?
Families of clustering approaches
- Distance-based clustering
- centroid-based approaches ($k$-means)
- connectivity-based approaches (based on distance)
- Density-based clustering
- set of dense points
- Distribution-based clustering
- likelihood of points belonging to the same distribution
- Fuzzy clustering
- Relaxed clustering paradigm where a data point can be assigned to multiple clusters with a quantified degree of belongingness metric (fuzzy $c$-means clustering,…).
$k$-means clustering
Partition $n$ observations $x_1,…,x_n$ into $k$ clusters $C=\{C_1,…,C_k\}$ where each observation $x_i$ belongs to the cluster $C_{j^*}$ whose mean $\mu_{j^*}$ is the closest: $x_i\in C_{j^*}$ with $j^{*}=\operatorname{argmin}_j\Vert x_i-\mu_j \Vert_2$
The cross represents the center: we want the smallest distance from a center to assign a point to a cluster:
- Minimize within-cluster sum of squares (variance)
- Overall optimization problem: $\operatorname{argmin}_{C}\sum_{j=1}^{k}\sum_{x_i\in C_j}\Vert x_i-\mu_j\Vert_2^2$
- NP-hard problem, no guarantee to find the optimal value
- Stochastic and very sensitive to initial conditions
- Sensitive to outliers (thank you $L_2$ norm)
- Probably the most used clustering algorithm
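As an illustration, a minimal $k$-means run with scikit-learn (assuming it is installed; the toy data and parameter values are made up for the example):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the cluster means mu_j
print(km.inertia_)           # within-cluster sum of squares being minimized
labels = km.labels_          # index j* of the closest mean for each x_i
```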
$k$-means and Voronoi tessellation
A Voronoi tessellation is a partition of the Euclidean space relative to a set of discrete points/seeds. Each region/Voronoi cell is composed of all the points in the space that are closer to the cell's seed than to any other seed.
- $k$-means provides a way to obtain a Voronoi tessellation of the input space, where the seeds are the final cluster means
- Alternatively, one can use pre-computed Voronoi tessellation seeds as initial clusters for $k$-means
Determining the optimal number of clusters
How many clusters, at a glance, for this image?
2, 3, 4, 14….
Compute explained variance for an increasing number of clusters $k$
Plot it against $k$ and find the bend of the elbow
Sometimes it does not work :(
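A minimal sketch of this elbow procedure, assuming scikit-learn and matplotlib are available (here the within-cluster sum of squares is used as the variance proxy; the blob data is made up):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Within-cluster sum of squares (inertia) for an increasing number of clusters
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()   # look for the bend of the elbow
```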
Sometimes, $k$-means works…
But most of the time not as expected.
Probably because of the $L_2$ norm that $k$-means tries to minimize:
- Sensitive to the curse of dimensionality
- Forms “normalized Gaussian” clusters
- Does not adapt to manifold geometry
- Sensitive to class imbalance
- Sensitive to outliers
Simple Linear Iterative Clustering
A kick-ass image segmentation algorithm using $k$-means
SLIC superpixels use a modified $k$-means clustering in $Labxy$ space to produce $k$ clusters that are regularly sampled and perceptually coherent from a color point of view.
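A possible use through scikit-image (assuming it is available; the sample image and parameter values are arbitrary):

```python
import matplotlib.pyplot as plt
from skimage import color, data, segmentation

img = data.astronaut()   # any RGB image

# SLIC: k-means in (L, a, b, x, y) space; compactness balances color vs spatial proximity
segments = segmentation.slic(img, n_segments=250, compactness=10, start_label=1)

out = color.label2rgb(segments, img, kind="avg")   # paint each superpixel with its mean color
plt.imshow(out)
plt.axis("off")
plt.show()
```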
$k$-medoids clustering
Possible extension to $k$-means
Cluster centroids are not actual data points $\Rightarrow$ can be problematic
$\Rightarrow$ Replace centroids by medoids (the point with the smallest distance to all other points in the cluster)
$\Rightarrow$ $k$-medoid algorithm
Overall objective: find $k$ medoids $m_1,\dots,m_k$ that minimize the partitioning cost $\sum_{j=1}^{k}\sum_{x_i\in C_j} d(x_i, m_j)$
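A minimal NumPy sketch of the simple alternating (Voronoi-iteration) variant; the function name and default parameters are made up, and the PAM algorithm used in practice is more involved:

```python
import numpy as np

def k_medoids(X, k=3, n_iter=100, seed=0):
    """Naive alternating k-medoids: cluster centers are actual data points (medoids)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)              # random initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)   # assign each point to its closest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size == 0:
                continue
            # new medoid = member with the smallest total distance to the other members
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):    # converged
            break
        medoids = new_medoids
    return medoids, labels
```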
Fuzzy $c$-means clustering
$k$-means is a hard clustering method: each data point belongs 100% to its cluster
Soft clustering methods allow each data point to belong to several clusters with various degrees of membership
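A minimal NumPy sketch of fuzzy $c$-means using the usual alternating updates (centers as membership-weighted means, memberships from inverse distances); the function name and default parameters are made up:

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    """Returns cluster centers and the membership matrix U (n x c), rows summing to 1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)          # random initial memberships
    for _ in range(n_iter):
        Um = U ** m                            # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]        # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                  # avoid division by zero
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)               # updated degrees of membership
    return centers, U
```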
Gaussian mixture models
$k$-means on steroids
$k$-means works for spherical clusters, but fails in most other cases $\Rightarrow$ try harder: model the probability density function $f$ of the data as a mixture of multivariate Gaussians, $f(x)=\sum_{j=1}^{k}\phi_j\,\mathcal N(x\vert\mu_j,\Sigma_j)$
This curve is a superposition of several Gaussians:
We need to estimate the proportion factors of these Gaussians in the sum
The EM algorithm
Initialization
- Select $k$ random points as initial means $\hat\mu_1,…,\hat\mu_k$
- Init all covariance matrices $\hat\Sigma_1,…,\hat\Sigma_k$ as the whole-data sample covariance matrix $\hat\Sigma$
- Set uniform mixture weights $\hat\phi_1=\dots=\hat\phi_k=\frac{1}{k}$
Alternate until convergence
Expectation step Compute the membership weight $\hat\gamma_{ij}$ of $x_i$ with respect to the $j^{th}$ component $\mathcal N(x\vert\hat\mu_j,\hat\Sigma_j)$: $\hat\gamma_{ij}=\frac{\hat\phi_j\,\mathcal N(x_i\vert\hat\mu_j,\hat\Sigma_j)}{\sum_{l=1}^{k}\hat\phi_l\,\mathcal N(x_i\vert\hat\mu_l,\hat\Sigma_l)}$
Maximization step Update, in that order, the mixture weights $\hat\phi_j=\frac{1}{n}\sum_{i=1}^{n}\hat\gamma_{ij}$, the means $\hat\mu_j=\frac{\sum_{i}\hat\gamma_{ij}x_i}{\sum_{i}\hat\gamma_{ij}}$, and the covariances $\hat\Sigma_j=\frac{\sum_{i}\hat\gamma_{ij}(x_i-\hat\mu_j)(x_i-\hat\mu_j)^{T}}{\sum_{i}\hat\gamma_{ij}}$
Tadaaaa
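In practice the whole EM loop is available off the shelf; a minimal sketch with scikit-learn's `GaussianMixture` (toy data and parameter values made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian blobs with different spreads
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(200, 2)),
               rng.normal([3, 3], 1.0, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.weights_, gmm.means_)   # estimated phi_j and mu_j
resp = gmm.predict_proba(X)       # membership weights gamma_ij (soft assignment)
labels = gmm.predict(X)           # hard assignment = argmax_j gamma_ij
```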
$k$-means vs GMM
Let the fight begin!
Kernel Density Estimation
Nonparametric estimation
Goal Estimate the probability density function $f$ based on observations $x_1,…,x_n$ only, assumed to be drawn from $f$ (otherwise wtf are we doing here)
The kernel density estimator with bandwidth $h$ at a given point $x$ is given by $\hat f_h(x)=\frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-x_i}{h}\right)$
Examples
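A minimal sketch with SciPy's Gaussian KDE (the sample data and bandwidth value are made up; `bw_method` plays the role of the bandwidth factor $h$):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Samples drawn from an unknown bimodal density f
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

kde = gaussian_kde(samples, bw_method=0.2)   # Gaussian kernel, fixed bandwidth factor
xs = np.linspace(-5, 5, 500)
plt.plot(xs, kde(xs), label="KDE estimate of f")
plt.hist(samples, bins=50, density=True, alpha=0.3, label="data histogram")
plt.legend()
plt.show()
```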
Mean shift clustering
Shift each point to the local density maximum of its KDE, and assign to the same cluster all the points that converge to the same maximum
Examples
The same thing can be done on color images:
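A minimal sketch with scikit-learn's `MeanShift` (toy blobs and an estimated bandwidth; all values made up):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=0)

bw = estimate_bandwidth(X, quantile=0.2)   # bandwidth of the underlying KDE
ms = MeanShift(bandwidth=bw).fit(X)
print(ms.cluster_centers_)                 # local density maxima (modes)
print(np.unique(ms.labels_).size, "clusters found")   # k is not fixed in advance
```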
DBSCAN
Density-based spatial clustering of applications with noise
- Divide points into 3 categories (core, boundary, outliers) depending on whether there are at least $minPts$ points in their $\epsilon$-neighborhood or not
- Find the connected components of core points (ignoring all non-core points)
- Assign each non-core point to a nearby cluster if it is less than $\epsilon$ away, otherwise assign it to noise
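A minimal sketch with scikit-learn's `DBSCAN` (toy two-moons data; the `eps` and `min_samples` values are made up):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps ~ epsilon, min_samples ~ minPts
labels = db.labels_                          # label -1 marks points assigned to noise
print("clusters:", labels.max() + 1, "| noise points:", (labels == -1).sum())
```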
Spectral clustering
View the clustering task as a min-cut operation in a graph
- Compute similarity graph (but which one?) of data $x_1,…,x_n$
- Compute (weighted) adjacency matrix $W$, degree matrix $D$ and Laplacian matrix $L=D-W$
- Perform the eigendecomposition of $L = E\Delta E^{T}$ (eigenvectors $E$, eigenvalues $\Delta$)
Fact #1 0 is an eigenvalue of $L$ with multiplicity equal to the number of connected components in the graph; its eigenvectors are the indicator vectors of those connected components
Fact #2 The eigenvector of the smallest non-zero eigenvalue (the Fiedler vector) gives the normalized min-cut of the graph
- Perform $k$-means clustering on the $k$ smallest eigenvectors $[e_1,…,e_k]_{n\times k}$
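A minimal sketch of this pipeline, assuming a symmetrized $k$-NN similarity graph and the unnormalized Laplacian (these choices and the parameter values are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# 1. Similarity graph: symmetrized k-nearest-neighbor graph (0/1 weights)
A = kneighbors_graph(X, n_neighbors=10, include_self=False).toarray()
W = np.maximum(A, A.T)        # weighted adjacency matrix
D = np.diag(W.sum(axis=1))    # degree matrix
L = D - W                     # unnormalized graph Laplacian

# 2. Eigendecomposition of L, keep the k eigenvectors with smallest eigenvalues
k = 2
vals, vecs = np.linalg.eigh(L)
U = vecs[:, :k]               # n x k spectral embedding [e_1, ..., e_k]

# 3. k-means on the rows of the embedding
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```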
Hierarchical clustering
A very natural way of handling data
Goal Generate a sequence of nested clusters and order them in a hierarchy, represented by a dendrogram
- Leaves of the dendrogram = initial data points
- Inner nodes of the dendrogram = clusters
Example
Agglomerative vs divisive clustering
Agglomerative: merge clusters from fine to coarse (bottom-up approach)
Divisive clustering: split clusters (top-down approach)
- Needs some heuristics to avoid exploring the $O(2^n)$ ways of splitting each cluster…
- Not so used in practice
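A minimal agglomerative sketch with SciPy (the toy data and the Ward linkage choice are made up for the example):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)), rng.normal(4, 0.5, size=(20, 2))])

Z = linkage(X, method="ward")   # agglomerative merges, from fine to coarse
dendrogram(Z)                   # leaves = initial data points, inner nodes = clusters
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
```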