Link to the HackMD note
Why do we care
Group the input data into clusters that share some characteristics
- Find patterns in the data (data mining problem)
- Visualize the data in a simpler way
- Infer some properties of a given data point based on how it relates to other data points (statistical learning)
Why is it tricky
Belongs to unsupervised learning
- No ground truth available to learn from or to evaluate the quality of the algorithm
How to assess how closely data points are related to each other?
- Which criteria (features) are the most relevant
- Which metrics make the most sense
How to assess the soundness of the resulting clusters? Are they relevant?
Families of clustering approaches
- Distance-based clustering
- centroid-based approaches ($k$-means)
- connectivity-based approaches (based on distance)
- Density-based clustering
- set of dense points
- Distribution-based clustering
- likelihood of points belonging to the same distribution
- Fuzzy clustering
- Relaxed clustering paradigm where a data point can be assigned to multiple clusters with a quantified degree of belongingness metric (fuzzy $c$-means clustering,…).
$k$-means clustering
Partition $n$ observations $x_1,…,x_n$ into $k$ clusters $C=\{C_1,…,C_k\}$ where each observation $x_i$ belongs to the cluster $C_{j^*}$ whose mean $\mu_{j^*}$ is the closest: $x_i\in C_{j^*}$ with $j^{*}=\operatorname{argmin}_j\Vert x_i-\mu_j \Vert_2$
The cross represents the center: we want the smallest distance from a center to assign a point to a cluster:
- Minimize within-cluster sum of squares (variance)
- Overall optimization problem: $\operatorname{argmin}_{C}\sum_{j=1}^{k}\sum_{x_i\in C_j}\Vert x_i-\mu_j\Vert_2^2$
- NP-hard problem, no guarantee to find the optimal value
- Stochastic and very sensitive to initial conditions
- Sensitive to outliers (thank you $L_2$ norm)
- Probably the most used clustering algorithm
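As an illustration, a minimal $k$-means run with scikit-learn (assuming it is installed; the toy data and parameter values are made up for the example):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the cluster means mu_j
print(km.inertia_)           # within-cluster sum of squares being minimized
labels = km.labels_          # index j* of the closest mean for each x_i
```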
$k$-means and Voronoi tessellation
A Voronoi tessellation is a partition of the Euclidean space relative to a set of discrete points/seeds. Each region/Voronoi cell is composed of all the points in the space that are closer to the cell's seed than to any other seed.
- $k$-means provides a way to obtain a Voronoi tessellation of the input space, where the seeds are the final cluster means
- Alternatively, one can use pre-computed Voronoi tessellation seeds as initial clusters for $k$-means
Determining the optimal number of clusters
How many clusters, at a glance, for this image?
2, 3, 4, 14….
Compute explained variance for an increasing number of clusters $k$
Plot it against $k$ and find the bend of the elbow
Sometimes it does not work :(
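A minimal sketch of this elbow procedure, assuming scikit-learn and matplotlib are available (here the within-cluster sum of squares is used as the variance proxy; the blob data is made up):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Within-cluster sum of squares (inertia) for an increasing number of clusters
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()   # look for the bend of the elbow
```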
Sometimes, $k$-means works…
But most of the time not as expected.
Probably because of the $L_2$ norm that $k$-means tries to minimize:
- Sensitive to the curse of dimensionality
- Forms “normalized Gaussian” clusters
- Does not adapt to manifold geometry
- Sensitive to class imbalance
- Sensitive to outliers
Simple Linear Iterative Clustering
A kick-ass image segmentation algorithm using $k$-means
SLIC superpixels use a modified $k$-means clustering in $Labxy$ space to produce $k$ clusters that are regularly sampled and perceptually coherent from a color point of view.
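A possible use through scikit-image (assuming it is available; the sample image and parameter values are arbitrary):

```python
import matplotlib.pyplot as plt
from skimage import color, data, segmentation

img = data.astronaut()   # any RGB image

# SLIC: k-means in (L, a, b, x, y) space; compactness balances color vs spatial proximity
segments = segmentation.slic(img, n_segments=250, compactness=10, start_label=1)

out = color.label2rgb(segments, img, kind="avg")   # paint each superpixel with its mean color
plt.imshow(out)
plt.axis("off")
plt.show()
```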
$k$-medoids clustering
Possible extension to $k$-means
Cluster centroids are not actual data points $\Rightarrow$ can be problematic
$\Rightarrow$ Replace centroids by medoids (the point with the smallest distance to all other points in the cluster)
$\Rightarrow$ $k$-medoid algorithm
Overall objective: find $k$ medoids $m_1,\dots,m_k$ that minimize the partitioning cost $\sum_{j=1}^{k}\sum_{x_i\in C_j} d(x_i, m_j)$
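A minimal NumPy sketch of the simple alternating (Voronoi-iteration) variant; the function name and default parameters are made up, and the PAM algorithm used in practice is more involved:

```python
import numpy as np

def k_medoids(X, k=3, n_iter=100, seed=0):
    """Naive alternating k-medoids: cluster centers are actual data points (medoids)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)              # random initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)   # assign each point to its closest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size == 0:
                continue
            # new medoid = member with the smallest total distance to the other members
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):    # converged
            break
        medoids = new_medoids
    return medoids, labels
```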
Fuzzy $c$-means clustering
$k$-means is a hard clustering method: each data point belongs 100% to its cluster
Soft clustering methods allow each data point to belong to several clusters with various degrees of membership
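A minimal NumPy sketch of fuzzy $c$-means using the usual alternating updates (centers as membership-weighted means, memberships from inverse distances); the function name and default parameters are made up:

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    """Returns cluster centers and the membership matrix U (n x c), rows summing to 1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)          # random initial memberships
    for _ in range(n_iter):
        Um = U ** m                            # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]        # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                  # avoid division by zero
        inv = d ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)               # updated degrees of membership
    return centers, U
```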
Gaussian mixture models
$k$-means on steroids
$k$-means works for spherical clusters, but fails in most other cases $\Rightarrow$ try harder: model the probability density function $f$ of the data as a mixture of multivariate Gaussians, $f(x)=\sum_{j=1}^{k}\phi_j\,\mathcal N(x\vert\mu_j,\Sigma_j)$
This curve is a superposition of several Gaussians:
We need to estimate the proportion factors of these Gaussians in the sum
The EM algorithm
Initialization
- Select $k$ random points as initial means $\hat\mu_1,…,\hat\mu_k$
- Init all covariance matrices $\hat\Sigma_1,…,\hat\Sigma_k$ as the whole-data sample covariance matrix $\hat\Sigma$
- Set uniform mixture weights $\hat\phi_1=\dots=\hat\phi_k=\frac{1}{k}$
Alternate until convergence
Expectation step Compute the membership weight $\hat\gamma_{ij}$ of $x_i$ with respect to the $j^{th}$ component $\mathcal N(x\vert\hat\mu_j,\hat\Sigma_j)$: $\hat\gamma_{ij}=\frac{\hat\phi_j\,\mathcal N(x_i\vert\hat\mu_j,\hat\Sigma_j)}{\sum_{l=1}^{k}\hat\phi_l\,\mathcal N(x_i\vert\hat\mu_l,\hat\Sigma_l)}$
Maximization step Update, in that order, the mixture weights $\hat\phi_j=\frac{1}{n}\sum_{i=1}^{n}\hat\gamma_{ij}$, the means $\hat\mu_j=\frac{\sum_{i}\hat\gamma_{ij}x_i}{\sum_{i}\hat\gamma_{ij}}$, and the covariances $\hat\Sigma_j=\frac{\sum_{i}\hat\gamma_{ij}(x_i-\hat\mu_j)(x_i-\hat\mu_j)^{T}}{\sum_{i}\hat\gamma_{ij}}$
Tadaaaa
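In practice the whole EM loop is available off the shelf; a minimal sketch with scikit-learn's `GaussianMixture` (toy data and parameter values made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian blobs with different spreads
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(200, 2)),
               rng.normal([3, 3], 1.0, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print(gmm.weights_, gmm.means_)   # estimated phi_j and mu_j
resp = gmm.predict_proba(X)       # membership weights gamma_ij (soft assignment)
labels = gmm.predict(X)           # hard assignment = argmax_j gamma_ij
```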
$k$-means vs GMM
Let the fight begin!
Kernel Density Estimation
Nonparametric estimation
Goal Estimate the probability density function $f$ based on observations $x_1,…,x_n$ only, assumed to be drawn from $f$ (otherwise wtf are we doing here)
The kernel density estimator with bandwidth $h$ at a given point $x$ is given by $\hat f_h(x)=\frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-x_i}{h}\right)$
Examples
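A minimal sketch with SciPy's Gaussian KDE (the sample data and bandwidth value are made up; `bw_method` plays the role of the bandwidth factor $h$):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Samples drawn from an unknown bimodal density f
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

kde = gaussian_kde(samples, bw_method=0.2)   # Gaussian kernel, fixed bandwidth factor
xs = np.linspace(-5, 5, 500)
plt.plot(xs, kde(xs), label="KDE estimate of f")
plt.hist(samples, bins=50, density=True, alpha=0.3, label="data histogram")
plt.legend()
plt.show()
```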
Mean shift clustering
Shift each point to the local density maximum of its KDE, and assign to the same cluster all the points that converge to the same maximum
Examples
The same thing can be done on color images:
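A minimal sketch with scikit-learn's `MeanShift` (toy blobs and an estimated bandwidth; all values made up):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=0)

bw = estimate_bandwidth(X, quantile=0.2)   # bandwidth of the underlying KDE
ms = MeanShift(bandwidth=bw).fit(X)
print(ms.cluster_centers_)                 # local density maxima (modes)
print(np.unique(ms.labels_).size, "clusters found")   # k is not fixed in advance
```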
DBSCAN
Density-based spatial clustering of applications with noise
- Divide points into 3 categories (core, boundary, outliers) depending on whether there are at least $minPts$ points in their $\epsilon$-neighborhood or not
- Find the connected components of core points (ignoring all non-core points)
- Assign each non-core point to a nearby cluster if it is less than $\epsilon$ away, otherwise assign it to noise
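A minimal sketch with scikit-learn's `DBSCAN` (toy two-moons data; the `eps` and `min_samples` values are made up):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps ~ epsilon, min_samples ~ minPts
labels = db.labels_                          # label -1 marks points assigned to noise
print("clusters:", labels.max() + 1, "| noise points:", (labels == -1).sum())
```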
Spectral clustering
View the clustering task as a min-cut operation in a graph
- Compute similarity graph (but which one?) of data $x_1,…,x_n$
- Compute (weighted) adjacency matrix $W$, degree matrix $D$ and Laplacian matrix $L=D-W$
- Perform the eigendecomposition of $L = E\Delta E^{T}$ (eigenvectors $E$, eigenvalues $\Delta$)
Fact #1 0 is an eigenvalue of $L$ with multiplicity equal to the number of connected components in the graph; its eigenvectors are the indicator vectors of those connected components
Fact #2 The eigenvector of the smallest non-zero eigenvalue (the Fiedler vector) gives the normalized min-cut of the graph
- Perform $k$-means clustering on the $k$ smallest eigenvectors $[e_1,…,e_k]_{n\times k}$
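A minimal sketch of this pipeline, assuming a symmetrized $k$-NN similarity graph and the unnormalized Laplacian (these choices and the parameter values are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# 1. Similarity graph: symmetrized k-nearest-neighbor graph (0/1 weights)
A = kneighbors_graph(X, n_neighbors=10, include_self=False).toarray()
W = np.maximum(A, A.T)        # weighted adjacency matrix
D = np.diag(W.sum(axis=1))    # degree matrix
L = D - W                     # unnormalized graph Laplacian

# 2. Eigendecomposition of L, keep the k eigenvectors with smallest eigenvalues
k = 2
vals, vecs = np.linalg.eigh(L)
U = vecs[:, :k]               # n x k spectral embedding [e_1, ..., e_k]

# 3. k-means on the rows of the embedding
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```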
Hierarchical clustering
A very natural way of handling data
Goal Generate a sequence of nested clusters and order them in a hierarchy, represented by a dendrogram
- Leaves of the dendrogram = initial data points
- Inner nodes of the dendrogram = clusters
Example
Agglomerative vs divisive clustering
Agglomerative: merge clusters from fine to coarse (bottom-up approach)
Divisive clustering: split clusters (top-down approach)
- Needs some heuristics to avoid exploring the $O(2^n)$ ways of splitting each cluster…
- Not so used in practice
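A minimal agglomerative sketch with SciPy (the toy data and the Ward linkage choice are made up for the example):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)), rng.normal(4, 0.5, size=(20, 2))])

Z = linkage(X, method="ward")   # agglomerative merges, from fine to coarse
dendrogram(Z)                   # leaves = initial data points, inner nodes = clusters
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
```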