Principal Component Analysis (PCA) - demystify dimensionality reduction techniques (1)
One of the most commonly used techniques in population genetics to analyze population structure is definitely Principal Component Analysis (PCA). The following figure from Novembre et al. 2008 displays European populations in the space of first two principal components based on their genetic data, which is much higher dimensional, showing an astonishing resemblance to the map of Europe. This is just one application of dimensionality reduction techniques for analyzing genetic data, or more broadly, studying biological problems.
PCA (principal component analysis), PCoA (principal coordinate analysis), MDS (multidimensional scaling), FA (factor analysis), … all these terms frequently show up when we talk about dimensionality reduction, both in population genetics and beyond. At least for me, they were fairly confusing at the beginning, so I think it would be nice to write up some notes on these methods to explain 1) the mathematics behind them, 2) their usage and intuition, as well as 3) the differences and similarities between them.
Linear Algebra Review
Before we start discussing the dimensionality reduction methods, let’s first review some linear algebra concepts that will be essential for understanding these methods.
Eigendecomposition
Suppose there is a square matrix A of size n×n. By definition, an eigenvector u of the matrix A is a nonzero vector that satisfies the following equation:
Au = λu,
where λ is the eigenvalue associated with this eigenvector. Putting the eigenvalues in a diagonal matrix Λ and the associated eigenvectors as columns in a matrix U:
Λ = diag(λ₁, λ₂, …, λₙ), U = [u₁, u₂, …, uₙ],
we can then rewrite the above equation as
AU = UΛ.
Rearranging this equation gives the eigendecomposition of matrix A:
A = UΛU⁻¹.
Note that not all square matrices have an eigendecomposition; only a diagonalizable matrix, namely a matrix similar to a diagonal matrix, does. We say a matrix A is similar to a matrix D if D=P⁻¹AP for some invertible matrix P.
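To make this concrete, here is a minimal NumPy sketch (the 2×2 symmetric matrix is just a made-up example) that computes an eigendecomposition and numerically verifies Au = λu and A = UΛU⁻¹:

```python
import numpy as np

# A small symmetric (hence diagonalizable) matrix, chosen only for illustration
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# np.linalg.eig returns the eigenvalues and the eigenvectors as columns of U
lam, U = np.linalg.eig(A)
Lambda = np.diag(lam)

# Check A u = lambda u for the first eigenvector
print(np.allclose(A @ U[:, 0], lam[0] * U[:, 0]))     # True

# Check the eigendecomposition A = U Lambda U^{-1}
print(np.allclose(A, U @ Lambda @ np.linalg.inv(U)))  # True
```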
For a special type of matrices, called positive semi-definite (PSD) matrices, the eigendecomposition always exists.
A matrix A is positive semi-definite if
zᵀAz ≥ 0
for every nonzero real column vector z. An alternative definition is that a matrix A is positive semi-definite if A=ZᵀZ for some matrix Z containing real numbers.
All eigenvalues of a PSD matrix are non-negative, and all eigenvectors associated with different eigenvalues are pairwise orthogonal to each other. As a result, UᵀU is a diagonal matrix. If, further, the eigenvectors u’s are normalized to have a 2-norm of one (i.e., uᵀu=1), then UᵀU=I, and U is called an orthogonal matrix (or orthonormal matrix).
PSD matrices we commonly encounter include correlation, covariance, and cross-product matrices.
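As a quick sanity check, here is a NumPy sketch (the matrix Z is randomly generated) showing that a cross-product matrix ZᵀZ is PSD: its eigenvalues are non-negative and its normalized eigenvectors form an orthogonal matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 4))   # an arbitrary real matrix

# The cross-product matrix Z^T Z is positive semi-definite
A = Z.T @ Z

# eigh is meant for symmetric matrices and returns orthonormal eigenvectors
lam, U = np.linalg.eigh(A)

print(np.all(lam >= -1e-10))             # all eigenvalues non-negative (up to rounding)
print(np.allclose(U.T @ U, np.eye(4)))   # U^T U = I, i.e., U is orthogonal
```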
SVD (Singular Value Decomposition)
Unlike eigendecomposition, which only exists for certain square matrices, the SVD exists for any rectangular or square matrix.
Suppose a matrix A of size m×n, where m≥n. Then its SVD gives
A = UΣVᵀ,
where r=rank(A)≤min{m,n}, and
- Σ is a diagonal matrix of size r×r with entries σᵢ’s corresponding to the singular values of A,
- U is a unitary matrix (or orthogonal matrix if over the real number field) of size m×r with columns corresponding to the left singular vectors uᵢ’s, and
- V is another unitary matrix of size n×r with columns corresponding to the right singular vectors vᵢ’s.
This is the reduced SVD.
Typically, the r singular values σᵢ’s are ordered in descending order.
In some cases, we may also see what is called a full SVD, in which
- Σ is an m×n matrix with the r entries σᵢ followed by n−r zeros along the diagonal of its upper n×n block, and m−n rows of zeros below,
- U is a unitary matrix of size m×m, having an additional m-r orthonormal columns, and
- V is a unitary matrix of size n×n, having an additional n-r orthonormal columns.
When m≤n instead, the reduced SVD form still makes sense.
A nice illustration of the variants of SVD can be found on the Wikipedia page.
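Here is a minimal NumPy sketch contrasting the reduced and full forms (the matrix is randomly generated purely for shape-checking; note that NumPy's reduced form keeps min{m,n} singular values, including any zeros when A is rank-deficient):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))   # m = 6, n = 4

# Reduced SVD: U is m x n, s holds the singular values (descending), Vt is n x n
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)            # (6, 4) (4,) (4, 4)
print(np.allclose(A, U @ np.diag(s) @ Vt))   # A = U Sigma V^T

# Full SVD: U is m x m, V is n x n, and Sigma is padded with zero rows to m x n
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=True)
print(U_full.shape, Vt_full.shape)           # (6, 6) (4, 4)
```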
PCA (Principal Component Analysis)
Now it’s time to break down PCA.
The goal of PCA can be thought of as finding a new set of attributes/coordinates that better capture the variability in the data.
Suppose there is a data matrix X of size m×n, where there are m rows corresponding to individuals/samples/objects and n columns corresponding to (original) attributes/variables/features.
Denote the covariance matrix of this data as S:
S = (1/(m−1)) (X − X̄)ᵀ(X − X̄),
where X̄ is the m×n matrix whose columns hold the attribute means.
If the data is already centered, that is, the mean of each attribute is 0, then the covariance matrix of the data is simply
S = (1/(m−1)) XᵀX.
Centered and non-centered data will have identical covariance matrices. Recall that a covariance matrix is PSD, thus diagonalizable with nonnegative eigenvalues.
Next, we perform eigendecomposition on this n×n matrix S, and transform the data using the eigenvalues and eigenvectors.
Denote the eigenvalues as
λ₁ ≥ λ₂ ≥ … ≥ λₙ ≥ 0,
and the eigenvectors in an n×n matrix U as
U = [u₁, u₂, …, uₙ].
The data is transformed to be
X′ = XU.
It is now the same set of m objects represented by n new columns, each corresponding to a new attribute. The new i-th attribute is obtained as a linear combination of all the original attributes weighted by the entries in the i-th eigenvector, and the variance of the new i-th attribute is given by the i-th eigenvalue. These new attributes are termed principal components (PCs). The new attributes are ordered by decreasing eigenvalue, making sure that the leading ones capture as much variance as possible; the greatest variance of any such projection of the data thus lies along the first principal component.
This ordering of the new attributes allows us to reduce the dimensionality of the data while keeping as much variability in the data as possible!
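Putting the steps above together, here is a minimal PCA sketch in NumPy (the data matrix X is randomly generated as a stand-in for real data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # m = 100 samples, n = 5 attributes (made-up data)

# 1. Center the data so each attribute has mean 0
Xc = X - X.mean(axis=0)

# 2. Covariance matrix S = Xc^T Xc / (m - 1)
S = Xc.T @ Xc / (Xc.shape[0] - 1)

# 3. Eigendecomposition of S (eigh, since S is symmetric PSD),
#    then sort eigenvalues and eigenvectors in descending order
lam, U = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

# 4. Transform the data: the columns of X_pc are the principal components
X_pc = Xc @ U

# The variance of the i-th PC equals the i-th eigenvalue
print(np.allclose(X_pc.var(axis=0, ddof=1), lam))   # True
```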
For example, suppose we want to obtain a 2D visualization of our data. We need to reduce the dimension to 2 (if the number of original attributes, n, is larger than 2). We can take only the first two eigenvectors (if we aim to keep as much variance as possible), which give the weights for the first two PCs,
U₂ = [u₁, u₂],
then transform the data as
X′₂ = XU₂.
The transformed data will have two coordinates, which can be nicely visualized in the 2D space formed by the first two PCs. Moreover, we know that the proportion of variance explained by these two PCs is
(λ₁ + λ₂) / (λ₁ + λ₂ + … + λₙ).
Essentially, the i-th PC, an m-by-1 column vector, can be obtained by
PCᵢ = Xuᵢ,
and we can visualize the data on, or reduce the data to, whichever PCs we choose.
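As a self-contained sketch of the 2D case (pca_scores is a hypothetical helper wrapping the steps above, and matplotlib is used only for the scatter plot):

```python
import numpy as np
import matplotlib.pyplot as plt

def pca_scores(X, k):
    """Project the data onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                      # center the data
    S = Xc.T @ Xc / (Xc.shape[0] - 1)            # covariance matrix
    lam, U = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]                # sort by decreasing eigenvalue
    lam, U = lam[order], U[:, order]
    scores = Xc @ U[:, :k]                       # first k PCs
    explained = lam[:k].sum() / lam.sum()        # fraction of total variance kept
    return scores, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # stand-in for real data
pc12, frac = pca_scores(X, k=2)

plt.scatter(pc12[:, 0], pc12[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title(f"First two PCs ({frac:.1%} of total variance)")
plt.show()
```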
Dimensionality reduction is not only used for visualization. It can also be used to extract a set of new features from the data. One benefit of doing this is that we can select a desired number of new attributes (features) according to the total variance explained (by summing up the corresponding eigenvalues), and these new attributes “summarize” the information in the old ones. A second benefit is that each pair of the new attributes now has zero covariance, so we can avoid collinearity issues and improve algorithmic efficiency when using them in downstream analyses.
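A quick check of that decorrelation property (again with randomly generated data): the covariance matrix of the transformed data is diagonal, i.e., the PCs are pairwise uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # made-up data
Xc = X - X.mean(axis=0)

lam, U = np.linalg.eigh(Xc.T @ Xc / (Xc.shape[0] - 1))
X_pc = Xc @ U

# Covariance of the PCs: off-diagonal entries are (numerically) zero
C = np.cov(X_pc, rowvar=False)
print(np.allclose(C, np.diag(np.diag(C))))       # True
```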
For instance, as shown in the figure at the beginning of this post, by summarizing the information from genetic loci, researchers were able to reveal genetic patterns in European populations that aligned nicely with their geographical patterns, using the 2D visualization to demonstrate how “genes mirror geography”.
In a word, PCA reduces the dimensionality of data while retaining the most important information, making it easier to visualize and analyze.
In my next post, I will discuss PCoA (principal coordinate analysis), a dimensionality reduction technique somewhat “relevant” to PCA, at least judging from its name.
Here are my other posts in this series if you are interested:
- PCoA & MDS: Demystify dimensionality reduction techniques (2): PCoA & Multidimensional Scaling
- FA: Demystify dimensionality reduction techniques (3): Factor Analysis
- MFA: Demystify dimensionality reduction techniques (4): Multiple Factor Analysis
- Practical dimensionality reduction in R and Python: TBC