\[\displaylines{s = \sum_{i=1}^{m}{w_i}}\ .\]
If \(s \gg m\), then it is undesirable to construct the full matrix \(\mathbf{R}\) and perform principal component analysis (PCA) on it. Instead, the highly structured form of \(\mathbf{R}\) suggests there should be a more efficient method to perform this decomposition. This method is known as weighted PCA and is the topic of this post. The first step is to center the data. Since row \(i\) of \(\mathbf{A}\) is repeated \(w_i\) times in \(\mathbf{R}\), the column means of \(\mathbf{R}\) are exactly the weighted column means of \(\mathbf{A}\); for column \(j\),\[\displaylines{\frac{\sum_{i=1}^{m}{w_i A_{ij}}}{\sum_{i=1}^{m}{w_i}}}\ .\]
To write this more succinctly, define \(\mathbf{v} = (1/s)\,\mathbf{w}\), that is, the vector \(\mathbf{w}\) normalized to sum to 1. The weighted column averages can then be written as the matrix product\[\displaylines{\mathbf{a} = \mathbf{v}^{T}\mathbf{A}}\ .\]
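As a quick sanity check, the weighted means can be compared against the ordinary column means of the repeated matrix (a minimal sketch; the small arrays here are purely illustrative):

import numpy as np

# Illustrative example: three distinct rows, repeated 2, 1, and 3 times
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
w = np.array([2, 1, 3])

v = w / w.sum()               # w normalized to sum to 1
a = v @ A                     # weighted column means of A
R = np.repeat(A, w, axis=0)   # the full (repeated) matrix

assert np.allclose(a, R.mean(axis=0))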
Though it is not necessary, it is frequently desirable to fully standardize the matrix \(\mathbf{A}\) by dividing each of its columns by the corresponding column standard deviation. To implicitly standardize the full matrix \(\mathbf{R}\), the weighted standard deviations need to be computed instead. Using "broadcasting", and several other abuses of notation, these can be written as\[\displaylines{\mathbf{d} = \sqrt{\mathbf{v}^{T} (\mathbf{A} - \mathbf{a})^2}}\ ,\]
where the square and square root are both applied element-wise to the resulting vector. The standardized, reduced data matrix is then\[\displaylines{\mathbf{M} = (\mathbf{A} - \mathbf{a}) / \mathbf{d}}\ ,\]
where the subtraction and division operations are performed using broadcasting.
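As before, these weighted quantities agree with the unweighted statistics of the repeated matrix (again a minimal, illustrative sketch; note that NumPy's std normalizes by the row count by default, which matches the weighted formula):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
w = np.array([2, 1, 3])
v = w / w.sum()
R = np.repeat(A, w, axis=0)

a = v @ A                          # weighted column means
d = np.sqrt(v @ np.square(A - a))  # weighted column standard deviations
M = (A - a) / d                    # standardized reduced matrix

assert np.allclose(d, R.std(axis=0))
assert np.allclose(np.repeat(M, w, axis=0), (R - R.mean(axis=0)) / R.std(axis=0))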
Now, let \(\mathbf{W}\) be the \(m \times m\) diagonal matrix with \(\mathbf{w}\) along its diagonal. Since standardizing \(\mathbf{R}\) produces the rows of \(\mathbf{M}\) with row \(i\) repeated \(w_i\) times, the matrix that PCA decomposes is proportional to\[\displaylines{\mathbf{M}^T\mathbf{W}\mathbf{M}}\]\[\displaylines{= \mathbf{M}^T\mathbf{W}^{1/2}\mathbf{W}^{1/2}\mathbf{M}}\]
\[\displaylines{= (\mathbf{W}^{1/2}\mathbf{M})^T\mathbf{W}^{1/2}\mathbf{M}}\]
\[\displaylines{= \mathbf{B}^T\mathbf{B}}\ ,\]
since \(\mathbf{W} = \mathbf{W}^T\), as the matrix is diagonal and thus symmetric, and where \(\mathbf{B} = \mathbf{W}^{1/2}\mathbf{M}\).
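The factorization is easy to confirm numerically (a minimal sketch on arbitrary data, forming \(\mathbf{W}^{1/2}\mathbf{M}\) with broadcasting rather than an explicit matrix product):

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))   # any reduced, standardized matrix
w = rng.integers(1, 10, size=5)   # any positive weights

B = np.sqrt(w)[:, None] * M       # B = W^(1/2) M via broadcasting

assert np.allclose(M.T @ np.diag(w) @ M, B.T @ B)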
Rather than forming \(\mathbf{B}^T\mathbf{B}\) explicitly, the components can be obtained directly from the singular value decomposition (SVD) of \(\mathbf{B}\),\[\displaylines{\mathbf{B} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T}\ .\]
Using the compact SVD, the components are simply the columns of \(\mathbf{V}\), so that the projected data is defined to be \(\mathbf{P}=\mathbf{M}\mathbf{V}\). To obtain the projection of the full matrix \(\mathbf{R}\), simply repeat each row \(\mathbf{P}_i\) of the reduced projected matrix \(w_i\) times! The NumPy script below demonstrates the equivalence: it performs PCA on the full repeated matrix, then the weighted variant on the reduced matrix, and finally validates the result against scikit-learn.

import numpy as np
from sklearn.decomposition import PCA
# %% Generate a random problem instance
m, n = 8, 4
A = np.random.randint(0, 10, size=(m, n))
# Repeats (or weightings) of the rows of A
W = np.random.randint(1, 10, size=m)
# Row i of A is repeated W[i] times
R = np.repeat(A, W, axis=0)
# %% Perform PCA on the full (repeated) data matrix
# Standardize R by its column means and standard deviations
Ra = R.mean(0)
Rd = R.std(0)
M = (R - Ra) / Rd
U1, S1, VT1 = np.linalg.svd(M, full_matrices=False)
# Project the standardized data onto the components
TA1 = M @ VT1.T
# %% Perform PCA on the reduced data matrix with weights
# Normalized weights, v = w / s
Wn = W / W.sum()
# Weighted column means and standard deviations
Aa = Wn @ A
Ad = np.sqrt(Wn @ np.square(A - Aa))
# Standardized reduced matrix and B = W^(1/2) M;
# project with M, not B, so keep the two separate
M = (A - Aa) / Ad
B = np.sqrt(W)[:, None] * M
U2, S2, VT2 = np.linalg.svd(B, full_matrices=False)
# SVD is sign-indeterminate; align signs based on the first sample
for c, (i, j) in enumerate(zip(U1[0], U2[0])):
    if np.sign(i) != np.sign(j):
        U2[:, c] *= -1
        VT2[c] *= -1
# Project the reduced data, then repeat row i of the result W[i] times
TA2 = np.repeat(M @ VT2.T, W, axis=0)
# %% Assert both results are close
assert np.allclose(TA1, TA2)
# %% Validate against sklearn
# Re-standardize R (M was overwritten above)
Ra = R.mean(0)
Rd = R.std(0)
M = (R - Ra) / Rd
pca = PCA().fit(M)
# SVD is sign-indeterminate; align component signs
VT3 = pca.components_
for c, (i, j) in enumerate(zip(VT2[:, 0], VT3[:, 0])):
    if np.sign(i) != np.sign(j):
        pca.components_[c] *= -1
TA3 = pca.transform(M)
assert np.allclose(TA1, TA3)