
IML: Dimensionality reduction

Link to the HackMD note

Why do we care?

We have at hand $n$ points $x_1,\dots, x_n$ lying in some $N$-dimensional space, $x_i \in\mathbb R^N,\ \forall i = 1, \dots, n$, compactly written as an $n \times N$ matrix $X$

  • One row of $X$ = one sample
  • One column of $X$ = a given feature value for all samples

Examples of real high-dimensional data

MNIST image classification:

  • Sample $x$: image with 28x28 pixels
  • Data set: 60000 samples
  • Dimensionality: $x \in\mathbb R^{28×28=784}$

MUSE hyperspectral image analysis:

  • Sample $x$: pixel with 3600 spectral bands
  • Data set: image with 300x300 pixels
  • Dimensionality: $x \in \mathbb R^{3600}$

Discriminating between galaxies? That's racist, sir.

The curse of dimensionality

In $\mathbb R$: a handful of samples is enough to cover the interval $[-1;1]$ with a given sampling density.

In $\mathbb R^2$: recovering the same sampling density requires squaring the number of samples.

In $\mathbb R^3$: recovering the same sampling density requires cubing it.

The number of samples needed to keep a fixed sampling density therefore grows exponentially with the dimension. Worse, the volume of the unit ball $\mathbb B^n$ becomes negligible compared to that of the enclosing cube $[-1;1]^n$:

\[\frac{\nu(\mathbb B^n)}{\nu([-1;1]^n)}=\frac{\pi^{\frac{n}{2}}}{2^n\,\Gamma(\frac{n}{2}+1)}\to_{n\to+\infty}0\]
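
A quick numerical check of this ratio, as a minimal sketch in plain Python (the dimensions tested are arbitrary):

```python
from math import gamma, pi

def ball_to_cube_ratio(n):
    """Volume of the unit n-ball divided by the volume of the cube [-1, 1]^n."""
    return pi ** (n / 2) / (2 ** n * gamma(n / 2 + 1))

for n in (2, 3, 10, 20):
    print(n, ball_to_cube_ratio(n))
# The ratio collapses quickly: about 0.79 for n=2, 0.0025 for n=10, 2.5e-8 for n=20
```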

Why is it tricky?

  • We naturally cannot picture anything beyond 3D in our minds
  • Picturing something 3D on a flat 2D screen can already be misleading
  • Real data naturally lives in (complex) high-dimensional spaces
  • Real data is often strongly correlated

How?

The key question: what information about the data should be preserved (and what can be discarded) when reducing the dimension?

Linear approaches

Somehow trying to find a low-dimensional subspace in which the projected data is not too distorted.

  • Johnson-Lindenstrauss lemma
  • Classical scaling
  • (The one and only) Principal Component Analysis
  • And much more…

Johnson-Lindenstrauss lemma

Just because you can does not mean you will: the lemma guarantees that any $n$ points can be linearly embedded into $O(\varepsilon^{-2}\log n)$ dimensions while preserving all pairwise distances up to a factor $1\pm\varepsilon$, but that guaranteed dimension can still be surprisingly large in practice.
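
As an illustration, a minimal sketch with scikit-learn's random-projection utilities (the sample count and distortion $\varepsilon$ are arbitrary choices):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

# Dimension guaranteed by the lemma to preserve pairwise distances within 10%
k = johnson_lindenstrauss_min_dim(n_samples=1000, eps=0.1)
print(k)  # can exceed the original dimensionality (e.g. 784), hence the tagline

# In practice, one picks a target dimension and applies a random linear projection
X = np.random.default_rng(0).normal(size=(1000, 784))   # placeholder data
Y = GaussianRandomProjection(n_components=50, random_state=0).fit_transform(X)
print(Y.shape)  # (1000, 50)
```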

Classical scaling

Also called Principal Coordinates Analysis (PCoA)

There are lots of formulas here, but you only need to retain the overall idea

If $D$ is the $n \times n$ Euclidean distance matrix with entries $d_{ij} = \Vert x_i − x_j\Vert_2$ and $D^{(2)} = [d_{ij}^2]$, PCoA seeks the linear mapping $M$ that minimizes

\[\phi(Y)=\sum_{i,j}(d_{ij}^2-\Vert y_i-y_j\Vert^2)\]

with $y_i = x_iM$ and $\Vert m_i\Vert^2=1\ \forall i$ ($m_i$ being the columns of $M$)

The Gram matrix $K$ can be obtained by double-centering $D^{(2)}$: $K=-\frac{1}{2}C_nD^{(2)}C_n$, with centering matrix $C_n=I_n-\frac{1}{n}\mathbf{1}_{n\times n}$ (where $\mathbf{1}_{n\times n}$ is the all-ones matrix)

Optimal projection onto the first $M$ dimensions: $Y=\Delta_M^{\frac{1}{2}}E_M^T$, with $E_M$ the matrix of eigenvectors associated with the $M$ largest eigenvalues of $K$ and $\Delta_M$ the diagonal matrix of those eigenvalues.
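
A minimal NumPy sketch of this recipe (function and variable names are mine, not from the course):

```python
import numpy as np

def pcoa(X, M=2):
    """Classical scaling (PCoA): embed n points into M dimensions from pairwise distances."""
    n = X.shape[0]
    D2 = np.square(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))  # squared distances
    C = np.eye(n) - np.ones((n, n)) / n        # centering matrix C_n
    K = -0.5 * C @ D2 @ C                      # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(K)       # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:M]        # indices of the M largest eigenvalues
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))  # Y = E_M * Delta_M^(1/2)
```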

Principal component analysis

Also known as the Karhunen-Loève transform

Closely related to PCoA, but operates on the covariance matrix $\frac{1}{n-1}X_c^T X_c$. PCA seeks the linear mapping $M$ that maximizes the projection variance $\operatorname{tr}(M^T \operatorname{cov}(X)M)$ with $\Vert m_i\Vert^2 = 1\ \forall i$.

\[X=\begin{bmatrix} \overbrace{x_{11}}^{u_1=\text{column mean}} & \overbrace{x_{12}}^{u_2}\\ \vdots & \vdots\\ x_{n1}&x_{n2} \end{bmatrix} \Rightarrow \text{center the data}\] \[X_c=\begin{bmatrix} x_{11}-u_1 & x_{12}-u_2\\ \vdots & \vdots\\ x_{n1}-u_1&x_{n2}-u_2 \end{bmatrix}\]
  1. Center the data: $X_c = C_nX$
  1b. (optional) Scale the data (divide each column by its standard deviation)
  2. Compute the covariance matrix $\Sigma=\frac{1}{n-1}X_c^TX_c$
  3. Perform the eigendecomposition $(E,\Delta)$ of $\Sigma$
  4. Project onto the first $M$ principal axes: $Y=X_cE_M$
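
The same four steps in a minimal NumPy sketch (names are illustrative):

```python
import numpy as np

def pca(X, M=2):
    """PCA following the recipe above: center, covariance, eigendecomposition, projection."""
    Xc = X - X.mean(axis=0)                          # 1. center the data
    cov = Xc.T @ Xc / (X.shape[0] - 1)               # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # 3. eigendecomposition (ascending order)
    E_M = eigvecs[:, np.argsort(eigvals)[::-1][:M]]  # keep the M leading principal axes
    return Xc @ E_M                                  # 4. project onto those axes
```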

Data after projection is uncorrelated, but has lost some interpretability (each principal axis is a mixture of all the original features)

PCA is probably the most popular unsupervised linear dimensionality reduction technique, but it comes with a bunch of practical questions, the two main ones being:

  1. How to automatically select the right number of dimensions to project onto?
  2. How to project a new data point onto a learned projection subspace?
    • See you in the lab session for the answers

Non-linear approaches

When the data cannot be assumed to live in a Euclidean subspace (why would it, anyway?), more advanced techniques are needed.

  • Isomap
  • Locally linear embedding
  • Kernel Principal Component Analysis (aka PCA on steroids)
  • Multilayer autoencoders
  • And much more…

Isomap

Geodesic distance rocks

Isometric feature mapping: same idea as classical scaling, but using geodesic distance instead of Euclidean distance.

  1. Compute the k-nearest-neighbor graph of the data $x_1,…,x_n$
  2. Compute all pairwise geodesic distances (shortest paths in the graph)
  3. Apply classical scaling
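
scikit-learn wraps these three steps in one estimator; a minimal usage sketch (the neighbor count, target dimension, and data are arbitrary):

```python
import numpy as np
from sklearn.manifold import Isomap

X = np.random.default_rng(0).normal(size=(500, 50))   # placeholder data
Y = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(Y.shape)  # (500, 2)
```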

Example

Isomap applied to some images of the digit 2 from the MNIST data set

Locally linear embedding

Locally linear embedding: the manifold can be locally considered Euclidean

For each point $x_i$:

  1. Get its k-nearest neighbors $x_j$, $j=1,…,k$
  2. Find the weights $w_{ij}$ that best linearly reconstruct $x_i$ from its neighbors: minimize $\sum_{i=1}^n\Vert x_i-\sum_j w_{ij}x_j\Vert^2$
    • with constraints $\sum_j w_{ij}=1$ (closed-form solution)
  3. Low-dimensional embedding $\to$ reconstruct $y_i$ from the $y_j$ with the same weights $w_{ij}$:

minimize \(\sum_{i=1}^n\Vert y_i-\sum_j w_{ij}y_j\Vert^2\)

with constraints $\frac{1}{n}\sum_iy_iy_i^T=I$ and $\sum_iy_i=0$ (eigendecomposition of a Gram matrix)
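
This is also available off the shelf in scikit-learn; a minimal usage sketch with arbitrary parameters and placeholder data:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.default_rng(0).normal(size=(500, 50))   # placeholder data
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Y = lle.fit_transform(X)
print(Y.shape)  # (500, 2)
```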

The kernel trick

When one actually wants to increase the dimension

Base idea: map $n$ points that are not linearly separable to a (possibly infinite-dimensional) feature space where they would be linearly separable, using a mapping $\phi$

  • How should we define $\phi$ ?
  • Do we really want to compute stuff in a (possibly infinite) feature space?

The kernel trick answers both questions: never compute $\phi$ explicitly, and only evaluate inner products in the feature space through a kernel function $\mathcal k(x_i,x_j)=\langle\phi(x_i),\phi(x_j)\rangle$ computed directly on the inputs. Widely used kernel functions:

  • Polynomial kernel: $\mathcal k(x_i,x_j)=(x_i^Tx_j+1)^d$
  • Gaussian RBF kernel: $\mathcal k(x_i,x_j)=e^{-\gamma\Vert x_i-x_j\Vert^2}$
  • Sigmoid kernel: $\mathcal k(x_i,x_j)=\tanh(bx_i^Tx_j+c)$
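
As a small sketch, the Gaussian RBF kernel matrix can be computed directly from the inputs, without ever touching the feature space ($\gamma$ here is an arbitrary choice):

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.1):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2), evaluated in the input space only."""
    sq_dists = np.square(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1))
    return np.exp(-gamma * sq_dists)
```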

Kernel PCA

PCA on steroids

The maths behind it are quite hard, but the following scikit-learn recipe works fine:

  1. Compute the kernel matrix $K=[\mathcal k(x_i,x_j)]=[\langle\phi(x_i),\phi(x_j)\rangle]$ and double-center it: $K_c=C_nKC_n$
  2. The eigendecomposition of $K_c$ is strongly related to that of the (intractable) covariance matrix in the feature space $\to$ get the eigenvectors $V$ and corresponding eigenvalues $\Delta$ of $K_c$.
  3. Keep the first $M$ columns of $V\sqrt{\Delta}$ to get the coordinates of the projected data points in the low $M$-dimensional space.
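
In practice one would use scikit-learn's KernelPCA rather than reimplementing the recipe; a minimal usage sketch (the kernel, its parameters, and the data are arbitrary):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.random.default_rng(0).normal(size=(500, 50))   # placeholder data
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Y = kpca.fit_transform(X)
print(Y.shape)  # (500, 2)
```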

But things get nasty when one wants to project a new data point $x$ that was not known when constructing the kernel…

Non-linear PCA

Also known as autoencoder

Overall idea:

  1. Train an autoencoder (a neural network with an autoassociative architecture) to perform an identity mapping.
  2. Use the output of the bottleneck layer as the low-dimensional code.

The bottleneck code is a non-linear combination of the inputs (thanks to the activation functions on the encoder layers) $\to$ the learned mapping is a non-linear PCA.

Principal components are generalized from straight lines to curves: the projection subspace described by all the non-linear components is itself curved.
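
A minimal PyTorch sketch of such an autoencoder (the layer sizes and names are illustrative assumptions, not the course's architecture):

```python
from torch import nn

class Autoencoder(nn.Module):
    """Autoassociative network: the bottleneck output is the low-dimensional code."""
    def __init__(self, n_features=784, n_code=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, n_code))
        self.decoder = nn.Sequential(nn.Linear(n_code, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Train with a reconstruction loss, e.g. nn.MSELoss()(model(x), x), so the network
# learns an identity mapping; then use model.encoder(x) as the non-linear code.
```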

Let’s recap

High-dimensional data set $X$ is a $n \times N$ matrix, with $n =$ number of samples and $N =$ dimensionality of underlying space.

  • Parametric $\equiv$ explicit embedding from the high-dimensional space to the low-dimensional one
  • For LLE: $p$ is the ratio of non-zero elements in the sparse weight matrix to the total number of elements
  • For NL-PCA: $i$ is the number of iterations and $w$ is the number of weights in the neural network

t-Distributed Stochastic Neighbor Embedding

t-SNE is a popular method to see in 2D or 3D wtf is going on in a high-dimensional space.

  1. Construct a probability distribution $p$ over pairs of points in the high-dimensional space: the more similar (the closer) two points are, the higher the probability
  2. Define a second probability distribution $q$ over the points in the low-dimensional space, and arrange the points so that the distance between $p$ and $q$ is minimized (in the sense of the Kullback-Leibler divergence)

  • t-SNE is excellent at visualizing well-separated clusters, but fails to preserve the global geometry of the data.
  • t-SNE depends on a perplexity parameter, which reflects the scale at which close points are searched for.
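
A minimal usage sketch with scikit-learn (the perplexity value and placeholder data are arbitrary choices):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(500, 50))   # placeholder data
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Y.shape)  # (500, 2)
```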

Independent component analysis

ICA aims to solve the so-called cocktail party problem: retrieving independent sources that got mixed up together with unknown mixing coefficients.

Goal: estimate source $s$ and mixing matrix $A$ from observation $x = As$.

  • Ill-posed $\Rightarrow$ enforce independence of the source components
  • Works on higher-order statistics (PCA is limited to order-2 statistics)
  • The unknown sources must not be Gaussian-distributed (at most one Gaussian source is allowed)

Unlike PCA vectors, ICA vectors are not orthogonal and not ranked by importance, but they are mutually independent.
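
A minimal sketch with scikit-learn's FastICA on a synthetic two-source mixture (the sources and mixing matrix are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent, non-Gaussian sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # unknown mixing matrix
X = S @ A.T                                        # observations x = As

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                       # estimated sources (up to scale and order)
A_est = ica.mixing_                                # estimated mixing matrix
```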

This post is licensed under CC BY 4.0 by the author.