
PCA - Definitions

In this post I will describe Principal Component Analysis (PCA), and in particular sample PCA. I show that the common definition of PCA can be formulated as finding the “best” low rank approximation of the data matrix.

The objective of PCA is to reduce the number of variables in your dataset whilst retaining as much of the information in the original data as possible.

I will denote a dataset by the matrix $X \in \mathbb{R}^{n \times p}$, where each row $\bx_i \in \mathbb{R}^{p}$ is an observation of $p$ variables.

\[\underset{(n \times p)}{X} = \left[ \begin{array}{ccc} - & \bx_{1}^T & - \\ - & \bx_{2}^T & - \\ & \vdots & \\ - & \bx_{n}^T & - \end{array} \right]\]

Informal Definition

In PCA, the objective of representing the $p$ variables by a smaller set of variables is achieved by projecting the data matrix onto vectors in $\mathbb{R}^{p}$.

In the definition of PCA, the vectors that the data are projected onto are chosen iteratively. The first vector is the one that maximises the variance of the projections $(\bx_i^T\mathbf{v})_{i=1,\dots,n}$. Each subsequent vector has the same objective, but with the additional constraint that it must be orthogonal to all of the previously chosen vectors.

\[\left[ \begin{array}{ccc} - & \bx_{1}^T & - \\ - & \bx_{2}^T & - \\ & \vdots & \\ - & \bx_{n}^T & - \end{array} \right] \left[ \begin{array}{ccc} | \\ \mathbf{v}\\ | \\ \end{array} \right]\]

A projection vector $\mathbf{v}$, as defined in the previous paragraph, is often called a PC loading, and the projections $(\bx_i^T\mathbf{v})_{i=1,\dots,n}$ are referred to as PC scores, or principal components.
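To make this terminology concrete, here is a minimal numpy sketch (the data matrix and the candidate loading $\mathbf{v}$ below are arbitrary choices of mine, purely for illustration): it centres a small data matrix, projects every observation onto a unit vector, and computes the sample variance of the resulting scores.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))      # toy data matrix: rows are observations
Xc = X - X.mean(axis=0)          # centre each variable (column)

v = np.array([1.0, 2.0, -1.0])
v = v / np.linalg.norm(v)        # a candidate PC loading must have unit norm

scores = Xc @ v                  # PC-score-style projections x_i^T v
print(scores.shape)              # (100,) -- one score per observation
print(scores.var())              # sample variance of the projected data
```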

So far, I have mentioned that PC loadings are defined as vectors that maximise the variance of the projected data whilst being orthogonal to each other. It turns out that, if the dataset has been centred, the PC loadings can be equivalently defined as the vectors forming an orthonormal coordinate system for a $d$-dimensional subspace of the $p$-dimensional space, chosen so that the average squared distance between the data vectors and their orthogonal projections onto this $d$-dimensional subspace is minimised.

In the upcoming sections, I provide these definitions formally and then show that they are indeed equivalent.


Formal Definitions

Maximal Variance

Below I give the formal definition of the $k$th principal component loading $\mathbf{v}_k$.

\[\begin{align}\label{eq:PCA def} &\underset{\mathbf{v}: ||\mathbf{v}|| = 1}{\argmax} \quad \mathbf{v}^TS\mathbf{v} \\ \text{s.t} \quad \mathbf{v}_j^T\mathbf{v} = 0 &\quad \text{for} \quad j = 1, \dots, k-1 \nonumber \end{align}\]

Here $S$ denotes the sample covariance matrix, and note that if the data matrix $X$ has centred columns, $S = \frac{X^TX}{n}$.
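As a quick numerical sanity check of this formula (on an arbitrary synthetic matrix), $\frac{X^TX}{n}$ for a column-centred $X$ should match numpy's covariance with the $1/n$ normalisation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                     # centre the columns

S = Xc.T @ Xc / n                           # sample covariance, 1/n convention
S_np = np.cov(Xc, rowvar=False, bias=True)  # bias=True also divides by n
assert np.allclose(S, S_np)
```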

In the objective we are maximising $\mathbf{v}^TS\mathbf{v}$, which is equal to the sample variance of the projections $(\bx_i^T\mathbf{v})_{i=1,\dots,n}$. We also have the constraint $||\mathbf{v}|| = 1$; if this constraint were not present, we could always increase $\mathbf{v}^TS\mathbf{v}$ by scaling up $\mathbf{v}$.
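For the first loading ($k = 1$, so there are no orthogonality constraints yet), a standard way to see what the maximiser looks like is via the Lagrangian of the constrained problem; at a stationary point,

\[\begin{align*} \frac{\partial}{\partial \mathbf{v}}\Big( \mathbf{v}^TS\mathbf{v} - \lambda(\mathbf{v}^T\mathbf{v} - 1) \Big) = 2S\mathbf{v} - 2\lambda\mathbf{v} = 0 \quad \Longrightarrow \quad S\mathbf{v} = \lambda\mathbf{v}, \end{align*}\]

so the candidates are unit eigenvectors of $S$, and since the objective at such a point is $\mathbf{v}^TS\mathbf{v} = \lambda$, the first PC loading is an eigenvector associated with the largest eigenvalue of $S$. The later loadings turn out to be the subsequent eigenvectors, though I do not prove that here.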


The definition of PCA is stated so that the vectors are chosen iteratively: we select the first PC loading to maximise the sample variance of the projected data, $\mathbf{v}^TS\mathbf{v}$, and then choose the $k$th PC loading with the same objective, subject to it being orthogonal to all of the previously chosen PC loadings. A natural question is then: could we obtain a set of orthonormal vectors whose projections have a larger total variance if we selected them all at once?

It turns out that the solutions to the two problems, iterative and direct, coincide. Finding the first $d$ PC loadings iteratively is therefore equivalent to finding the matrix $V$ with orthonormal columns that solves:

\[\begin{align*} \underset{V \in \mathbb{R}^{p \times d}}{\argmax} \quad &\text{Tr}(V^TSV) \\ &\text{s.t} \quad V^TV = I_d \end{align*}\]
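The standard solution to this direct problem takes the columns of $V$ to be the top $d$ eigenvectors of $S$ (a fact I use but do not prove here). A small numpy sketch, again on arbitrary synthetic data, checks that such a $V$ satisfies the constraint and that the attained objective equals the sum of the $d$ largest eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, d = 500, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated toy data
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

evals, evecs = np.linalg.eigh(S)        # eigenvalues in ascending order
V = evecs[:, ::-1][:, :d]               # loadings: top-d eigenvectors of S

assert np.allclose(V.T @ V, np.eye(d))  # constraint V^T V = I_d holds
obj = np.trace(V.T @ S @ V)             # objective Tr(V^T S V)
assert np.allclose(obj, evals[::-1][:d].sum())  # = sum of top-d eigenvalues
```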

Best Low Rank Approximation

An optimisation problem equivalent to the one given in the previous section is obtained by looking for a lower-dimensional representation of the data vectors that is “closest” to the original data vectors.

We choose a $d$-dimensional representation of a data vector $\bx_i \in \mathbb{R}^p$ by selecting an orthonormal coordinate system and a set of coordinates. We denote the coordinate system by the matrix $\phi \in \mathbb{R}^{p \times d}$, whose columns are orthonormal, and the chosen coordinates by a vector $\mathbf{y}_i \in \mathbb{R}^{d}$.

Therefore, $\phi \mathbf{y}_i$ is a vector that belongs to the subspace spanned by the columns of $\phi$, which has dimension $d < p$.

The objective is then to choose the coordinate system $\phi$ and coordinates $\mathbf{y}_i$ such that the sum of squared Euclidean distances is minimised:

\[\begin{align*} \underset{\phi \in \mathbb{R}^{p \times d}}{\argmin} \underset{y_1, \dots ,y_n \in \mathbb{R}^d}{\argmin} \sum_{i=1}^n{||\bx_i - \phi \mathbf{y}_i||_2^2}\\ \text{s.t} \quad \phi^T\phi = I_d \end{align*}\]
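To get a feel for this objective, the sketch below (arbitrary synthetic data again) compares the reconstruction error $\sum_i ||\bx_i - \phi\phi^T\bx_i||_2^2$ for two choices of $\phi$ with orthonormal columns: the top-$d$ eigenvectors of $S$, and a random orthonormal basis from a QR decomposition. The eigenvector basis should never do worse.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, d = 300, 6, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

def recon_error(phi):
    """Sum of squared distances to the projections, using y_i = phi^T x_i."""
    residual = Xc - Xc @ phi @ phi.T
    return np.sum(residual ** 2)

evals, evecs = np.linalg.eigh(S)
phi_pca = evecs[:, ::-1][:, :d]                       # top-d eigenvectors of S
phi_rand, _ = np.linalg.qr(rng.normal(size=(p, d)))   # random orthonormal basis

print(recon_error(phi_pca), recon_error(phi_rand))    # PCA basis is no worse
```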

Equivalence of the Two Definitions

I will now show that searching for the best rank $d$ approximation for a centred data matrix is an equivalent problem to searching for the linear projections that maximise the sample variance.

First of all, if we fix our choice of coordinate system $\phi$, the closest vector to $\bx_i$ that belongs to the column space of $\phi$ is the orthogonal projection of $\bx_i$ onto the column space of $\phi$. The orthogonal projection is given by:

\[\begin{align*} \phi(\phi^T\phi)^{-1}\phi^T\bx_i \end{align*}\]

Since $\phi^T\phi = I_d$, this simplifies to $\phi\phi^T\bx_i$, and the minimising coordinates are $\mathbf{y}_i = \phi^T\bx_i$. We can write this formally as:

\[\begin{align*} &\underset{y_i \in \mathbb{R}^d}{\argmin} ||\bx_i - \phi \mathbf{y}_i||_2^2 \; = \; \phi^T \bx_i\\ &\text{s.t} \quad \phi^T\phi = I_d \end{align*}\]
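A quick numerical check of this claim, with an arbitrary orthonormal $\phi$ and data vector: the least-squares coordinates returned by numpy coincide with $\phi^T\bx_i$, and the projection formula simplifies to $\phi\phi^T\bx_i$.

```python
import numpy as np

rng = np.random.default_rng(4)
p, d = 5, 2
phi, _ = np.linalg.qr(rng.normal(size=(p, d)))    # orthonormal columns
x = rng.normal(size=p)

y_star, *_ = np.linalg.lstsq(phi, x, rcond=None)  # argmin_y ||x - phi y||^2
assert np.allclose(y_star, phi.T @ x)             # optimal coordinates

proj = phi @ np.linalg.inv(phi.T @ phi) @ phi.T @ x
assert np.allclose(proj, phi @ phi.T @ x)         # projection simplifies
```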

Now, note that we can rewrite the sum of squared Euclidean distances in matrix form using the Frobenius norm:

\[\begin{align*} \sum_{i=1}^n{||\bx_i - \phi \mathbf{y}_i||_2^2} \; = \; || X^T - \phi Y^T||_F^2 \end{align*}\]
Viewing the Matrices
\[X^T \; = \; \left[ \begin{array}{cccc} | & | & & |\\ \bx_{1} & \bx_{2} & \dots & \bx_{n}\\ | & | & & | \end{array} \right]\] \[\phi \; Y^T = \left[ \begin{array}{ccc} | & & |\\ \phi_{(1)} & \dots & \phi_{(d)}\\ | & & | \end{array} \right] \left[ \begin{array}{cccc} | & | & & |\\ \mathbf{y}_{1} & \mathbf{y}_{2} & \dots & \mathbf{y}_{n}\\ | & | & & | \end{array} \right]\]
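The Frobenius identity above just says that the squared Frobenius norm of a matrix is the sum of the squared Euclidean norms of its columns, here the columns of $X^T - \phi Y^T$. A numerical check on arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, d = 50, 4, 2
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)
phi, _ = np.linalg.qr(rng.normal(size=(p, d)))
Y = Xc @ phi                      # coordinates y_i = phi^T x_i, stacked as rows

sum_of_norms = sum(np.sum((Xc[i] - phi @ Y[i]) ** 2) for i in range(n))
frobenius = np.linalg.norm(Xc.T - phi @ Y.T, 'fro') ** 2
assert np.allclose(sum_of_norms, frobenius)
```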

The problem of finding the coordinate system and coordinates now reduces to the following:

\[\begin{align*} \underset{\phi \in \mathbb{R}^{p \times d}}{\argmin} || X^T - \phi \phi^T X^T||_F^2\\ \text{s.t} \quad \phi^T\phi = I_d \end{align*}\]

This is equivalent to maximising the total variance of the data projected onto the columns $\phi_{(1)}, \dots, \phi_{(d)}$ of $\phi$:

\[\begin{align*} \underset{\phi \in \mathbb{R}^{p \times d}}{\argmax} \quad &\sum_{i=1}^d\phi_{(i)}^T\Big(\frac{X^TX}{n}\Big)\phi_{(i)} \\ &\text{s.t} \quad \phi^T\phi = I_d \end{align*}\]
Algebra showing equivalence

Expanding the Frobenius norm and using $\phi^T\phi = I_d$,

\[\begin{align*} || X^T - \phi \phi^T X^T||_F^2 &= \text{Tr}\big((X - X\phi\phi^T)(X^T - \phi\phi^TX^T)\big) \\ &= \text{Tr}(XX^T) - \text{Tr}(\phi^TX^TX\phi). \end{align*}\]

The first term does not depend on $\phi$, so minimising the reconstruction error is the same as solving

\[\begin{align*} \underset{\phi \in \mathbb{R}^{p \times d}}{\argmax} \quad &\text{Tr}(\phi^TX^TX\phi)\\ &\text{s.t} \quad \phi^T\phi = I_d \end{align*}\]

and dividing the objective by $n$ does not change the maximiser, which gives the problem above with $S = \frac{X^TX}{n}$.
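The identity derived above is easy to verify numerically; a sketch with arbitrary centred data and a random orthonormal $\phi$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, d = 80, 5, 2
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)
phi, _ = np.linalg.qr(rng.normal(size=(p, d)))

lhs = np.linalg.norm(Xc.T - phi @ phi.T @ Xc.T, 'fro') ** 2
rhs = np.trace(Xc @ Xc.T) - np.trace(phi.T @ Xc.T @ Xc @ phi)
assert np.allclose(lhs, rhs)   # minimising lhs over phi = maximising the trace
```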

We can also rewrite the objective of the low rank approximation in terms of the columns of $X$ (the variables) rather than its rows (the observations); below, $\bx_{(j)}$ denotes the $j$th column of $X$ and $\phi_j$ the $j$th row of $\phi$:

\[\begin{align*} \sum_{i=1}^n{||\bx_i - \phi \mathbf{y}_i||_2^2} \; = \; || X^T - \phi Y^T||_F^2 \; = \; || X - Y \phi^T ||_F^2 \; = \; \sum_{j=1}^p{||\bx_{(j)} - Y \phi_j ||_2^2} \end{align*}\]
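Since a matrix and its transpose have the same Frobenius norm, the row-wise and column-wise forms of the objective agree; a short check on arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, d = 40, 3, 2
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)
phi, _ = np.linalg.qr(rng.normal(size=(p, d)))
Y = Xc @ phi

row_form = np.linalg.norm(Xc.T - phi @ Y.T, 'fro') ** 2
col_form = np.linalg.norm(Xc - Y @ phi.T, 'fro') ** 2
assert np.allclose(row_form, col_form)   # transpose preserves the Frobenius norm
```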

Conclusion

In conclusion, I have shown that searching for the first $d$ principal components, in the sense of searching for projections that maximise the sample variance, is equivalent to searching for a $d$-dimensional subspace such that the total squared distance between the (centred) data vectors and their orthogonal projections onto that subspace is minimised. In other words, PCA finds the best rank-$d$ approximation of the centred data matrix.

This post is licensed under CC BY 4.0 by the author.