Weighted PCA

Fri, 08 Jul 2022

Data Science, Machine Learning, Mathematics, Statistics

Consider an m by n matrix \(\mathbf{A}\) and an m by 1 vector of integers \(\mathbf{w}\). Now consider the matrix \(\mathbf{R}\) formed by taking \(\mathbf{A_i}\), that is, the i-th row of \(\mathbf{A}\), and repeating it \(\mathbf{w_i}\) times. This new matrix \(\mathbf{R}\) has dimension s by n, where

\[\displaylines{s = \sum_{i=1}^{m}{\mathbf{w_i}}}\ .\]

If \(s \gg m\), then it is undesirable to form the full matrix \(\mathbf{R}\) and perform principal component analysis (PCA) on it. Instead, the highly structured form of \(\mathbf{R}\) suggests that there should be a more efficient way to perform this decomposition. This method is known as weighted PCA and is the topic of this post.
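For concreteness, here is a minimal sketch of the setup in NumPy; the array names and values are illustrative, and the expanded matrix is built explicitly only to show what \(\mathbf{R}\) looks like:

import numpy as np

# Illustrative arrays: m = 3 rows, n = 2 columns
A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2, 1, 3])      # integer repeat counts for each row of A
R = np.repeat(A, w, axis=0)  # the expanded s x n matrix; s = w.sum() = 6
print(R.shape)               # (6, 2)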

In vanilla PCA, the data matrix must first be centered by subtracting the column means from each column. In the full data matrix \(\mathbf{R}\), the element \(\mathbf{A_{ij}}\) in the i-th row and j-th column of \(\mathbf{A}\) is repeated \(\mathbf{w_i}\) times, so that the j-th column average is

\[\displaylines{\frac{\sum_{i=1}^{m}{\mathbf{w_i} * \mathbf{A_{ij}}}}{\sum_{i=1}^{m}{\mathbf{w_i}}}}\ .\]

To write this more succinctly, define \(\mathbf{v} = (1/s) * \mathbf{w}\), that is, the vector \(\mathbf{w}\) normalized to sum to 1. In this way, the weighted column averages can be written as the matrix product

\[\displaylines{\mathbf{a} = \mathbf{v}^{T}\mathbf{A}}\ .\]
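As a quick sanity check (using the small illustrative arrays from above), the weighted means computed from \(\mathbf{v}\) and \(\mathbf{A}\) should match the ordinary column means of the expanded matrix \(\mathbf{R}\):

import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2, 1, 3])
v = w / w.sum()              # w normalized to sum to 1
a = v @ A                    # weighted column means

R = np.repeat(A, w, axis=0)
assert np.allclose(a, R.mean(0))                         # matches the expanded matrix
assert np.allclose(a, np.average(A, axis=0, weights=w))  # np.average agrees as well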

Though it is not necessary, it is frequently desirable to fully standardize the matrix \(\mathbf{A}\) by dividing each of its columns by the corresponding column standard deviation. To implicitly normalize the full matrix \(\mathbf{R}\), the weighted standard deviations need to be computed instead. Using "broadcasting" and several other abuses of notation, these can be written as

\[\displaylines{\mathbf{d} = \sqrt{\mathbf{v}^{T} (\mathbf{A} - \mathbf{a})^2}}\ ,\]

where the square and square-root are both applied element-wise to the resulting vector.
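Similarly, the weighted standard deviations computed from the reduced matrix should agree with the ordinary column standard deviations of the expanded matrix; a small check, again with illustrative array names:

import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2, 1, 3])
v = w / w.sum()
a = v @ A
d = np.sqrt(v @ np.square(A - a))  # weighted (population) column standard deviations

R = np.repeat(A, w, axis=0)
assert np.allclose(d, R.std(0))    # matches the expanded matrix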

With the weighted average and standard deviation vectors computed, the matrix \(\mathbf{R}\) can be implicitly standardized by subtracting the column means and dividing by the column standard deviations. That is, define the standardized matrix as

\[\displaylines{\mathbf{M} = (\mathbf{A} - \mathbf{a}) / \mathbf{d}}\ ,\]

where the subtraction and division operations are performed using broadcasting.

Next, the goal is to use the singular value decomposition (SVD), which in vanilla PCA provides the eigendecomposition of \((1/m) * \mathbf{M}^T\mathbf{M}\), to ultimately obtain the eigendecomposition of \((1/s) * \mathbf{R}_{s}^T\mathbf{R}_{s}\), where \(\mathbf{R}_{s}\) is the appropriately standardized full matrix \(\mathbf{R}\) (*). Notice that the empirical covariance matrix \((1/s) * \mathbf{R}_{s}^T\mathbf{R}_{s}\) can be computed as \(\mathbf{M}^T\mathbf{W}\mathbf{M}\), where \(\mathbf{W} = diag(\mathbf{v})\) is a diagonal matrix with the vector \(\mathbf{v} = (1/s) * \mathbf{w}\) along the diagonal. This formula can be rearranged as follows:

\[\displaylines{\mathbf{M}^T\mathbf{W}\mathbf{M}}\]

\[\displaylines{= \mathbf{M}^T\mathbf{W}^{1/2}\mathbf{W}^{1/2}\mathbf{M}}\]

\[\displaylines{= (\mathbf{W}^{1/2}\mathbf{M})^T\mathbf{W}^{1/2}\mathbf{M}}\]

\[\displaylines{= \mathbf{B}^T\mathbf{B}}\ ,\]

since \(\mathbf{W} = \mathbf{W}^T\), as the matrix is diagonal and thus symmetric, and where \(\mathbf{B} = \mathbf{W}^{1/2}\mathbf{M}\).
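The identity above is easy to verify numerically; the following sketch (with illustrative names) checks that \(\mathbf{M}^T\mathbf{W}\mathbf{M} = \mathbf{B}^T\mathbf{B}\) and that both equal the empirical covariance of the standardized expanded matrix:

import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2, 1, 3])
v = w / w.sum()
a = v @ A
d = np.sqrt(v @ np.square(A - a))
M = (A - a) / d                 # implicitly standardized reduced matrix
W = np.diag(v)                  # diagonal weight matrix
B = np.sqrt(v)[:, None] * M     # B = W^(1/2) M via broadcasting

assert np.allclose(M.T @ W @ M, B.T @ B)

# Both equal the empirical covariance of the standardized expanded matrix
R  = np.repeat(A, w, axis=0)
Rs = (R - R.mean(0)) / R.std(0)
assert np.allclose(B.T @ B, (Rs.T @ Rs) / len(Rs))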

In this way, one can compute the SVD of the matrix \(\mathbf{B} = \mathbf{W}^{1/2}\mathbf{M}\) to implicitly perform PCA on the full data matrix \(\mathbf{R}\). That is, find the familiar \(\mathbf{U}\), \(\mathbf{\Sigma}\), and \(\mathbf{V}\) such that

\[\displaylines{\mathbf{B} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T}\ .\]

Using the compact SVD, the principal components are simply the columns of \(\mathbf{V}\), so that the projected data is defined to be \(\mathbf{P}=\mathbf{M}\mathbf{V}\). To obtain the projection of the full matrix \(\mathbf{R}\), simply repeat each row \(\mathbf{P}_i\) of the reduced projected matrix \(\mathbf{w}_i\) times!

A simple Python helper class for performing a weighted PCA is available here on GitHub.

The following code snippet illustrates the above process in more detail:



import numpy as np
from sklearn.decomposition import PCA

# %% Generate a random problem instance
m, n = 8, 4
A = np.random.randint(0, 10, size=(m, n))
# Repeats (or weightings) of the rows of A
W = np.random.randint(1, 10, size=m)
# Row i of A is repeated W[i] times
R = np.repeat(A, W, axis=0)
# %% Perform PCA on the full (repeated) data matrix
Ra = R.mean(0)
Rd = R.std(0)
M  = (R - Ra) / Rd

U1, S1, VT1 = np.linalg.svd(M, full_matrices=False)

TA1 = M @ VT1.T
# %% Perform PCA on the reduced data matrix with weights
Wn = W / W.sum()
Aa = Wn @ A
Ad = np.sqrt(Wn @ np.square(A - Aa))
M  = (A - Aa) / Ad
# B = W^(1/2) M; scaling by sqrt(W) rather than sqrt(Wn)
# only rescales the singular values and leaves V unchanged
B  = np.sqrt(W)[:, None] * M

U2, S2, VT2 = np.linalg.svd(B, full_matrices=False)

# SVD is sign indeterminate; fix order of signs
# based on first sample
for c, (i, j) in enumerate(zip(U1[0], U2[0])):
  if np.sign(i) != np.sign(j):
    U2[:, c] *= -1
    VT2[c]   *= -1

TA2 = np.repeat(M @ VT2.T, W, axis=0)
# %% Assert both results are close
assert np.allclose(TA1, TA2)
# %% Validate against sklearn
Ra = R.mean(0)
Rd = R.std(0)
M  = (R - Ra) / Rd

pca = PCA().fit(M)

# SVD is sign indeterminate; fix order of signs
VT3 = pca.components_
for c, (i, j) in enumerate(zip(VT2[:, 0], VT3[:, 0])):
  if np.sign(i) != np.sign(j):
    pca.components_[c] *= -1

TA3 = pca.transform(M)
assert np.allclose(TA1, TA3)





(*) The unbiased estimates could also be used instead. They are not used here, as the difference is small and they complicate the notation somewhat.
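For reference, with integer frequency weights the unbiased column variances of the implicit matrix \(\mathbf{R}\) simply replace \(s\) with \(s - 1\) in the denominator; a small sketch of this, with illustrative names:

import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
w = np.array([2, 1, 3])
s = w.sum()
a = (w / s) @ A
# Unbiased weighted column variances for integer frequency weights
var_u = (w @ np.square(A - a)) / (s - 1)

R = np.repeat(A, w, axis=0)
assert np.allclose(var_u, R.var(0, ddof=1))  # matches the expanded matrix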