Link to the HackMD note
Why do we care?
We have at hand $n$ points $x_1,\dots,x_n$ lying in some $N$-dimensional space, $x_i \in\mathbb R^N,\ \forall i = 1, \dots, n$, compactly written as an $n \times N$ matrix $X$
- One row of $X$ = one sample
- One column of $X$ = a given feature value for all samples
Example of real high-dimensional data
Real world data is very often high-dimensional
MNIST image classification:
- Sample $x$: image with 28x28 pixels
- Data set: 60000 samples
- Dimensionality: $x \in\mathbb R^{28×28=784}$
MUSE hyperspectral image analysis:
- Sample $x$: pixel with 3600 spectral bands
- Data set: image with 300x300 pixels
- Dimensionality: $x \in \mathbb R^{3600}$
Used to discriminate between galaxies
The curse of dimensionality
High-dimensional spaces suffer from the curse of dimensionality (also called Hughes’ phenomenon)
In $\mathbb R$
In $\mathbb R^2$: to recover the same sampling density, many more samples are needed
In $\mathbb R^3$: to recover the same sampling density, even more samples are needed (the required number of samples grows exponentially with the dimension)
\[\frac{\nu(\mathbb S^n)}{\nu([-1,1]^n)}=\frac{\pi^{\frac{n}{2}}}{2^n\,\Gamma(\frac{n}{2}+1)}\xrightarrow[n\to+\infty]{}0\]
Points uniformly distributed in an $n$-cube of side 2 mostly fall outside of the unit sphere!
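As a sanity check, here is a minimal sketch (assuming NumPy and SciPy are available) comparing the empirical fraction of uniform cube samples that land inside the unit ball with the analytic ratio above; the dimensions and sample size are illustrative only.

```python
import numpy as np
from scipy.special import gammaln

def ball_to_cube_ratio(n):
    # volume(unit n-ball) / volume([-1, 1]^n) = pi^(n/2) / (2^n * Gamma(n/2 + 1))
    return np.exp(0.5 * n * np.log(np.pi) - n * np.log(2) - gammaln(0.5 * n + 1))

rng = np.random.default_rng(0)
for n in (2, 3, 5, 10, 20):
    points = rng.uniform(-1, 1, size=(100_000, n))         # uniform samples in the n-cube
    inside = np.mean(np.linalg.norm(points, axis=1) <= 1)  # fraction inside the unit ball
    print(n, inside, ball_to_cube_ratio(n))
```

Both numbers collapse towards zero as $n$ grows, which is exactly the curse of dimensionality statement above.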
Why is it tricky?
- We naturally cannot picture anything beyond 3D in our minds
- Picturing something 3D on a flat 2D screen can already be misleading
- Real data naturally lives in (complex) high-dimensional space
- Real data is often strongly correlated
And somehow, we want to have a good look at our data before feeding it to some machine learning algorithm (can I use the inherent structure of my data to boost my machine learning performance?)
How ?
Dimensionality reduction: transform the data set $X$ with dimensionality $N$ into a new data set $Y$ (an $n \times M$ matrix) with dimensionality $M \lt N$ (hopefully $M \ll N$) such that as little information as possible is lost in the process. $y_i$ (the $i$th row of $Y$) is the low-dimensional counterpart (the projection) of $x_i$.
INFORMATION ???
Linear approaches
Somehow trying to find a low-dimensional subspace onto which the data can be projected without too much distortion.
- Johnson-Lindenstrauss lemma
- Classical scaling
- (The one and only) Principal Component Analysis
- And much more…
Johnson-Lindenstrauss lemma
Just because you can doesn’t mean you will
Let $0\lt\varepsilon\lt1$ and let $x_1,\dots,x_n$ be $n$ points in $\mathbb R^N$. Then there exists a linear map $f:\mathbb R^N\to\mathbb R^M$ such that for all points $x_i$ and $x_j$
\[(1-\varepsilon)\Vert x_i-x_j \Vert^2\le \Vert f(x_i)-f(x_j) \Vert^2\le(1+\varepsilon)\Vert x_i-x_j \Vert^2\]
with $M=\frac{4\log(n)}{\frac{\varepsilon^2}{2} - \frac{\varepsilon^3}{3}}$.
Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary mathematics.
The catch: you still have to find the projection matrix (the map $f$).
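A minimal sketch of how this is used in practice, assuming scikit-learn's random projection module: the JL bound gives a target dimension $M$, and a Gaussian random projection then provides a suitable map $f$ with high probability. The data here is synthetic.

```python
import numpy as np
from sklearn.random_projection import (GaussianRandomProjection,
                                       johnson_lindenstrauss_min_dim)

n, N, eps = 1000, 10_000, 0.2
X = np.random.default_rng(0).normal(size=(n, N))   # synthetic high-dimensional data

# Target dimension from the JL bound, roughly 4 log(n) / (eps^2/2 - eps^3/3)
M = johnson_lindenstrauss_min_dim(n_samples=n, eps=eps)
Y = GaussianRandomProjection(n_components=M, random_state=0).fit_transform(X)
print(X.shape, "->", Y.shape)                      # (1000, 10000) -> (1000, M)
```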
Classical scaling
Also called Principal Coordinates Analysis (PCoA)
Lots of formulas here, but you just need to retain the overall idea
PCoA: project data points $X$ onto $Y$ with a linear mapping $M$, $Y = XM$, such that all pairwise distances between points change as little as possible before/after projection
If $D$ is the $n \times n$ Euclidean distance matrix with entries $d_{ij} = \Vert x_i - x_j\Vert_2$ and $D^{(2)} = [d_{ij}^2]$, PCoA seeks the linear mapping $M$ that minimizes
\[\phi(Y)=\sum_{i,j}(d_{ij}^2-\Vert y_i-y_j\Vert^2)\]
with $y_i = x_iM$ and $\Vert m_i\Vert^2=1\ \forall i$
Solution: eigendecomposition (= diagonalisation) of the Gram matrix $K = XX^T = E\Delta E^T$
$K$ can be obtained by double-centering $D^{(2)}$: $K=-\frac{1}{2}C_nD^{(2)}C_n$ with centering matrix $C_n=I_n-\frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T$
Optimal projection onto the first $M$ dimensions: $Y=E_M\Delta_M^{\frac{1}{2}}$ with $E_M$ the matrix of eigenvectors of $K$ associated with the $M$ largest eigenvalues.
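A minimal NumPy sketch of the PCoA recipe above (squared distances, double centering, eigendecomposition, projection); the random data is purely illustrative.

```python
import numpy as np

def pcoa(X, M=2):
    n = X.shape[0]
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared Euclidean distances
    C = np.eye(n) - np.ones((n, n)) / n                          # centering matrix C_n
    K = -0.5 * C @ D2 @ C                                        # Gram matrix via double centering
    eigval, eigvec = np.linalg.eigh(K)                           # eigenvalues in ascending order
    idx = np.argsort(eigval)[::-1][:M]                           # keep the M largest
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0))  # Y = E_M Delta_M^(1/2)

Y = pcoa(np.random.default_rng(0).normal(size=(100, 10)), M=2)
print(Y.shape)  # (100, 2)
```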
Principal component analysis
Also known as the Karhunen-Loève transform
Closely related to PCoA, but operates on the covariance matrix $X_c^T X_c$. PCA seeks the linear mapping $M$ that maximizes the projected variance $\mathrm{tr}(M^T \mathrm{cov}(X)M)$ with $\Vert m_i\Vert^2 = 1\ \forall i$.
\[X=\begin{bmatrix} \overbrace{x_{11}}^{u_1=\text{mean}} & \overbrace{x_{12}}^{u_2}\\ \vdots & \vdots\\ x_{n1}&x_{n2} \end{bmatrix} \Rightarrow \text{center the data}\]
\[X_c=\begin{bmatrix} x_{11}-u_1 & x_{12}-u_2\\ \vdots & \vdots\\ x_{n1}-u_1&x_{n2}-u_2 \end{bmatrix}\]
- Center the data $X_c = C_nX$ (optionally, also standardize the data)
- Compute the covariance matrix $\Sigma=\frac{1}{n-1}X_c^TX_c$
- Perform the eigendecomposition $(E,\Delta)$ of $\Sigma$
- Project on the first $M$ principal axes: $Y=X_cE_M$
Data after projection is uncorrelated, but has lost some interpretability
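A minimal NumPy sketch of the steps above (in practice, scikit-learn's `PCA` class does the same job); the random data is illustrative.

```python
import numpy as np

def pca(X, M=2):
    Xc = X - X.mean(axis=0)                          # 1. center the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)               # 2. covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)             # 3. eigendecomposition (ascending order)
    E_M = eigvec[:, np.argsort(eigval)[::-1][:M]]    #    keep the M principal axes
    return Xc @ E_M                                  # 4. project onto the first M axes

Y = pca(np.random.default_rng(0).normal(size=(200, 5)), M=2)
print(Y.shape)  # (200, 2)
```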
Major challenges related to PCA
PCA is probably the most popular unsupervised linear dimensionality reduction technique, but it comes with a bunch of operability questions, the two main ones being:
- How to automatically select the right number of dimensions to project?
- How to project a new data point on a learned projection subspace?
- See you in the lab session for the answers
Non-linear approaches
When it is assumed that the data does not live in a Euclidean subspace (why would it anyway?), more advanced techniques must be relied on.
- Isomap
- Locally linear embedding
- Kernel Principal Component Analysis (aka PCA on steroids)
- Multilayer autoencoders
- And much more…
Isomap
Geodesic distance rocks
Isometric feature mapping: same idea as classical scaling, but using the geodesic distance instead of the Euclidean distance.
- Compute k-nearest neighbor graph of data $x_1,…,x_n$
- Compute all pairwise geodesic distances
- Apply classical scaling
Example
Isomap applied to some images of the digit 2 in MNIST data
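A minimal sketch assuming scikit-learn: Isomap applied to the small 8x8 digits dataset bundled with scikit-learn, as a stand-in for the MNIST example above.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)                          # 1797 images of 8x8 = 64 pixels
Y = Isomap(n_neighbors=10, n_components=2).fit_transform(X)  # k-NN graph + geodesic distances + classical scaling
print(Y.shape)                                               # (1797, 2)
```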
Locally linear embedding
Locally linear embedding: the manifold can be locally considered Euclidean
For each point $x_i$:
- get its k-nearest neighbors $x_j$, $j=1,…,k$
- Get the weights $w_{ij}$ that best linearly reconstruct $x_i$ from its neighbors $x_j$: minimize $\sum_{i=1}^n\Vert x_i-\sum_j w_{ij}x_j\Vert^2$
- with constraints $\sum_j w_{ij}=1$ (closed-form solution)
- Low-dimensional embedding $\to$ reconstruct $y_i$ from the $y_j$ with the same weights $w_{ij}$:
minimize \(\sum_{i=1}^n\Vert y_i-\sum_j w_{ij}y_j\Vert^2\)
with constraints $\frac{1}{n}\sum_iy_iy_i^T=I$ and $\sum_iy_i=0$ (eigendecomposition of a Gram matrix)
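A minimal sketch assuming scikit-learn, whose `LocallyLinearEmbedding` implements the two-step procedure above (local weights, then global embedding); the data is synthetic.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.default_rng(0).normal(size=(500, 10))          # synthetic data
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Y = lle.fit_transform(X)                                     # low-dimensional embedding
print(Y.shape, lle.reconstruction_error_)
```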
The kernel trick
When one actually wants to increase the dimension
Base idea: map $n$ non-linearly separable points, with a function $\phi$, to a (possibly infinite-dimensional) feature space where they would be linearly separable
- How should we define $\phi$ ?
- Do we really want to compute stuff in a (possibly infinite) feature space?
Mercer's theorem: we do not need to know the mapping $\phi$ explicitly as long as we have a positive semi-definite kernel/Gram matrix $K=[k(x_i,x_j)]=[\langle\phi(x_i),\phi(x_j)\rangle]$
Widely used kernel functions:
- Polynomial kernel: $k(x_i,x_j)=(x_i^Tx_j+1)^d$
- Gaussian RBF kernel: $k(x_i,x_j)=e^{-\gamma\Vert x_i-x_j\Vert^2}$
- Sigmoid kernel: $k(x_i,x_j)=\tanh(bx_i^Tx_j+c)$
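A minimal sketch assuming scikit-learn's pairwise kernel helpers: it builds the three Gram matrices above on synthetic data and prints their smallest eigenvalue (note that the sigmoid kernel is only conditionally positive semi-definite, so its value may be slightly negative).

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel

X = np.random.default_rng(0).normal(size=(50, 5))    # synthetic data
kernels = {
    "polynomial": polynomial_kernel(X, degree=3),
    "RBF":        rbf_kernel(X, gamma=0.5),
    "sigmoid":    sigmoid_kernel(X, gamma=0.01, coef0=1.0),
}
for name, K in kernels.items():
    # A valid (Mercer) kernel gives a PSD Gram matrix: smallest eigenvalue >= 0
    print(name, K.shape, np.linalg.eigvalsh(K).min())
```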
Kernel PCA
PCA on steroids
The maths behind it are quite hard, but the following scikit-learn recipe works fine:
- Compute the kernel matrix $K=[k(x_i,x_j)]=[\langle\phi(x_i),\phi(x_j)\rangle]$ and double-center it: $K_c=C_nKC_n$
- The eigendecomposition of $K_c$ is strongly related to that of the (intractable) covariance matrix in the feature space $\to$ get the eigenvectors $V$ and corresponding eigenvalues $\Delta$ of $K_c$.
- Keep the first $M$ columns of $V\Delta^{\frac{1}{2}}$ to get the coordinates of the projected data points in the low $M$-dimensional space.
But things get nasty when one wants to project a new data point $x$ that was not known when constructing the kernel…
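A minimal sketch assuming scikit-learn's `KernelPCA`, which follows the recipe above and also exposes a `transform` method for projecting new points; the two-circles data set is illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Y = kpca.fit_transform(X)        # projection of the training points
Y_new = kpca.transform(X[:5])    # projection of "new" points on the learned subspace
print(Y.shape, Y_new.shape)      # (400, 2) (5, 2)
```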
Non-linear PCA
Also known as autoencoder
Overall idea:
- train an autoencoder (neural network with an autoassociative architecture) to perform an identity mapping.
- use the output of the bottleneck layer as low-dimensional code.
The bottleneck code is a non-linear combination of the inputs (thanks to the activation functions on the encoder layers) $\to$ the learned mapping is a non-linear PCA.
Principal components are generalized from straight lines to curves: the projection subspace described by all the nonlinear components is also curved.
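A minimal autoencoder sketch, assuming PyTorch (any deep learning framework would do); the layer sizes, placeholder data and training loop are illustrative only.

```python
import torch
import torch.nn as nn

N, M = 784, 2                                    # input dimension, bottleneck dimension
encoder = nn.Sequential(nn.Linear(N, 128), nn.ReLU(), nn.Linear(128, M))
decoder = nn.Sequential(nn.Linear(M, 128), nn.ReLU(), nn.Linear(128, N))
autoencoder = nn.Sequential(encoder, decoder)

X = torch.rand(256, N)                           # placeholder batch (e.g. flattened MNIST images)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(100):                             # train the identity mapping
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(X), X)
    loss.backward()
    optimizer.step()

codes = encoder(X)                               # low-dimensional code from the bottleneck layer
print(codes.shape)                               # torch.Size([256, 2])
```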
Let’s recap
High-dimensional data set $X$ is a $n \times N$ matrix, with $n =$ number of samples and $N =$ dimensionality of underlying space.
- Parametric $\equiv$ explicit embedding from high-dimensional space to low-dimensional one
- For LLE: $p$ is the ratio of non-zero elements in a sparse matrix to the total number of elements
- For NL-PCA: $i$ is the number of iterations and $w$ is the number of weights in the neural network
t-Distributed Stochastic Neighbor Embedding
t-SNE is a popular method to see in 2D or 3D what is going on in a high-dimensional space.
- Construct a probability distribution $p$ over pairs of points in the high-dim space: the more similar (the closer) two points are, the higher the probability
- Define a second probability distribution $q$ over the points in the low-dim space, and arrange the points such that the divergence between $p$ and $q$ is minimized (in the sense of the Kullback-Leibler divergence)
- t-SNE is excellent at visualizing well-separated clusters, but fails to preserve the global geometry of the data.
- t-SNE depends on a perplexity parameter, which reflects the scale of search for close points.
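A minimal sketch assuming scikit-learn's `TSNE`, again on the small digits dataset; the perplexity value is illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Y.shape)   # (1797, 2), ready for a 2D scatter plot colored by digit label y
```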
Independent component analysis
ICA aims to solve the so-called cocktail party problem: retrieving independent sources that were mixed together with unknown scaling coefficients.
Goal: estimate source $s$ and mixing matrix $A$ from observation $x = As$.
- Ill-posed $\Rightarrow$ enforce independence of the source components
- Works on higher-order statistics (PCA is limited to order-2 statistics)
- The unknown sources must not be Gaussian-distributed
Unlike PCA vectors, ICA vectors are not orthogonal and not ranked by importance, but they are mutually independent.
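A minimal cocktail-party sketch assuming scikit-learn's `FastICA`: two synthetic non-Gaussian sources are mixed by a random matrix $A$, then recovered up to scale and permutation.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]   # two non-Gaussian sources
A = rng.normal(size=(2, 2))                        # unknown mixing matrix
X = S @ A.T                                        # observations x = As

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                       # estimated sources
A_hat = ica.mixing_                                # estimated mixing matrix
print(S_hat.shape, A_hat.shape)                    # (2000, 2) (2, 2)
```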