Weighted Sparse PCA for Rectangular Matrices

Tue, 06 Sep 2022

Computer Science, Data Science, Machine Learning, Mathematics, Statistics

Consider a sparse \(m \times n\) rectangular matrix \(\mathbf{X}\), where either \(m \gg n\) or \(m \ll n\). Performing a principal component analysis (PCA) on \(\mathbf{X}\) involves computing the eigenvectors of its covariance matrix. This is often accomplished using the singular value decomposition (SVD) of the centered matrix \(\mathbf{X}-\mathbf{U}\), where \(\mathbf{U}\) contains the column means of \(\mathbf{X}\). With large sparse matrices, however, this centering step is frequently intractable, since subtracting the means produces a dense matrix. If it is nevertheless tractable to compute the eigendecomposition of either an \(m \times m\) or an \(n \times n\) dense matrix in memory, other approaches are possible.

Weighted PCA

Let \(\mathbf{U}\) be the \(m \times n\) matrix whose every row is the vector of weighted column means of \(\mathbf{X}\), so that the j-th column of \(\mathbf{U}\) contains the j-th weighted column mean repeated m times, and let \(\mathbf{W}\) be a diagonal matrix with the non-negative sample weights along its diagonal. It is assumed that these weights are normalized so that \(\sum_{i=1}^{m}{W_{ii}}=1\). The weighted covariance matrix may then be estimated as:

\[\displaylines{\mathbf{\Sigma} = ({\mathbf{X}-\mathbf{U}})^T\mathbf{W}({\mathbf{X}-\mathbf{U}})}\ .\]
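
As a concrete reference point, a direct implementation of this definition might look as follows, assuming \(\mathbf{X}\) is a SciPy CSR matrix and the sample weights are given in a vector w (the data and names here are only illustrative). Note that the last two lines center explicitly and therefore densify \(\mathbf{X}\), which is precisely the step the rest of the post avoids.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = sparse.random(10_000, 50, density=0.01, format='csr', random_state=0)
w = rng.random(10_000)
w /= w.sum()                         # normalize so that the weights sum to one

mu = np.asarray(X.T @ w).ravel()     # weighted column means

# Direct (dense) weighted covariance for reference only:
# subtracting the means destroys the sparsity of X.
Xc = X.toarray() - mu
Sigma = Xc.T @ (w[:, None] * Xc)     # (X - U)^T W (X - U)
```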

Again, this definition is problematic for sparse matrices, as centering by subtracting the column means \(\mathbf{U}\) destroys the sparsity of the matrix. For this reason, the SVD of the uncentered matrix \(\mathbf{X}\) is often employed as an alternative (e.g. scikit-learn's TruncatedSVD). Unfortunately, the resulting components may be dominated by the column means and produce misleading results.
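
To see this effect, one can compare the leading uncentered singular vector with the direction of the column means. The snippet below is only an illustrative experiment with arbitrary synthetic data whose column means are large relative to the centered variation; it is not part of the method described in this post.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Sparse non-negative data whose nonzero values sit far from zero,
# so the column means dominate the centered variation.
X = sparse.random(5_000, 100, density=0.05, format='csr', random_state=0)
X.data += 5.0

svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
mu = np.asarray(X.mean(axis=0)).ravel()          # column means of X

v1 = svd.components_[0]
cos = abs(mu @ v1) / (np.linalg.norm(mu) * np.linalg.norm(v1))
print(cos)  # near 1: the leading uncentered component mostly tracks the mean direction
```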

When either m or n is sufficiently small, it is possible to perform PCA efficiently using the eigenvalue decomposition of \(\mathbf{\Sigma}\) while avoiding the memory-intensive centering operation. There are two cases to cover: one when \(m \gg n\) and the other when \(m \ll n\).

Vertical Case

When \(m \gg n\), it is more efficient to perform the PCA using the usual \(n \times n\) covariance matrix. The covariance formula can be expanded so that it may be computed efficiently with sparse matrices. Namely,

\[\displaylines{\mathbf{\Sigma} = ({\mathbf{X}-\mathbf{U}})^T\mathbf{W}({\mathbf{X}-\mathbf{U}})}\]

\[\displaylines{= {(\mathbf{W}^{1/2}(\mathbf{X}-\mathbf{U}))}^T{(\mathbf{W}^{1/2}(\mathbf{X}-\mathbf{U}))}}\]

\[\displaylines{= {(\mathbf{W}^{1/2}\mathbf{X}-\mathbf{W}^{1/2}\mathbf{U})}^T{(\mathbf{W}^{1/2}\mathbf{X}-\mathbf{W}^{1/2}\mathbf{U})}}\]

\[\displaylines{= {(\mathbf{W}^{1/2}\mathbf{X})^T(\mathbf{W}^{1/2}\mathbf{X})-(\mathbf{W}^{1/2}\mathbf{X})^T(\mathbf{W}^{1/2}\mathbf{U})-(\mathbf{W}^{1/2}\mathbf{U})^T(\mathbf{W}^{1/2}\mathbf{X})+(\mathbf{W}^{1/2}\mathbf{U})^T(\mathbf{W}^{1/2}\mathbf{U})}}\]

\[\displaylines{= {\mathbf{X}^T\mathbf{W}\mathbf{X}-\mathbf{X}^T\mathbf{W}\mathbf{U}-\mathbf{U}^T\mathbf{W}\mathbf{X}+\mathbf{U}^T\mathbf{W}\mathbf{U}}}\ .\]

Now, each of the four terms in this expression is an \(n \times n\) matrix that can be computed without the memory-intensive centering operation. However, the formula can be simplified further by noting that

\[\displaylines{\mathbf{X}^T\mathbf{W}\mathbf{U} = \mathbf{X}^T\mathbf{W}^T\mathbf{U} = (\mathbf{W}\mathbf{X})^T\mathbf{U}=(\mathbf{U}^T(\mathbf{W}\mathbf{X}))^T }\ .\]

Further, since the sample weights are normalized so that \(\mathrm{Tr}[\mathbf{W}]=1\) and \(\mathbf{U}\) contains the weighted column means of \(\mathbf{X}\) repeated in its columns, the expression \((\mathbf{W}\mathbf{X})^T\mathbf{U}\) is equal to the outer product of the vector of weighted column means with itself: \(\mathbf{\mu}\mathbf{\mu}^T\). Since this outer product is symmetric, both cross terms equal \(\mathbf{\mu}\mathbf{\mu}^T\) and the formula reduces to:

\[\displaylines{= {\mathbf{X}^T\mathbf{W}\mathbf{X}-2\mathbf{\mu}\mathbf{\mu}^T+\mathbf{U}^T\mathbf{W}\mathbf{U}}}\ .\]

Finally, the last term involves the two column-mean matrices. Element \((i,j)\) of \(\mathbf{U}^T\mathbf{W}\mathbf{U}\) is \(\sum_{k=1}^{m}{W_{kk}\mu_i\mu_j}=\mu_i\mu_j\), since every row of \(\mathbf{U}\) equals \(\mathbf{\mu}^T\) and \(\mathrm{Tr}[\mathbf{W}]=1\). This term is therefore also equal to the outer product \(\mathbf{\mu}\mathbf{\mu}^T\).

Thus, the final equation is as follows:

\[\displaylines{\mathbf{\Sigma} = {\mathbf{X}^T\mathbf{W}\mathbf{X}-2\mathbf{\mu}\mathbf{\mu}^T+\mathbf{\mu}\mathbf{\mu}^T}}\]

\[\displaylines{= {\mathbf{X}^T\mathbf{W}\mathbf{X}-\mathbf{\mu}\mathbf{\mu}^T}}\ .\]
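
A sketch of this computation with SciPy sparse matrices might look as follows; the function name and variables are illustrative, and X is assumed to be a sparse CSR matrix with the sample weights in a vector w.

```python
import numpy as np

def weighted_cov_vertical(X, w):
    """Sigma = X^T W X - mu mu^T for sparse X, without explicit centering."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                    # enforce Tr[W] = 1
    mu = np.asarray(X.T @ w).ravel()   # weighted column means
    WX = X.multiply(w[:, None])        # W X: scale each row by its weight (stays sparse)
    Sigma = (X.T @ WX).toarray()       # n x n matrix X^T W X
    return Sigma - np.outer(mu, mu), mu
```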

With \(\mathbf{\Sigma}\) estimated, the eigenvalue decomposition can be computed using a routine for symmetric matrices. That is, find an orthogonal matrix \(\mathbf{Q}\) and a diagonal matrix \(\mathbf{\Lambda}\) such that

\[\displaylines{\mathbf{\Sigma} = {\mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T}}\ .\]

The principal components are then recovered from the columns of \(\mathbf{Q}\).
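
For example, using numpy.linalg.eigh, which returns eigenvalues in ascending order, the leading components can be obtained by reversing that order; the helper below is only a sketch of the idea, not the post's exact implementation.

```python
import numpy as np

def principal_components(Sigma, k=None):
    """Return eigenvalues and components of a symmetric Sigma, largest variance first."""
    evals, Q = np.linalg.eigh(Sigma)   # eigh: ascending eigenvalues for symmetric input
    order = np.argsort(evals)[::-1]    # reorder so the leading components come first
    evals, Q = evals[order], Q[:, order]
    if k is not None:
        evals, Q = evals[:k], Q[:, :k]
    return evals, Q
```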

Horizontal Case

When \(m \ll n\) it is more efficient to perform the PCA using a row-oriented analysis. This analysis hinges on the fact from linear algebra that if \(\mathbf{v}\) is an eigenvector of \(\mathbf{A}\mathbf{A}^T\) with eigenvalue \(\lambda \neq 0\), then \(\mathbf{A}^T\mathbf{v}\) is an eigenvector of \(\mathbf{A}^T\mathbf{A}\) with the same eigenvalue. This can be seen using the definition of eigenvectors and eigenvalues:

\[\displaylines{(\mathbf{A}\mathbf{A}^T)\mathbf{v} = \lambda\mathbf{v} }\]

\[\displaylines{\mathbf{A}^T(\mathbf{A}\mathbf{A}^T)\mathbf{v} = \mathbf{A}^T\lambda\mathbf{v} }\]

\[\displaylines{(\mathbf{A}^T\mathbf{A})(\mathbf{A}^T\mathbf{v}) = \lambda(\mathbf{A}^T\mathbf{v}) }\]

\[\displaylines{(\mathbf{A}^T\mathbf{A})\mathbf{w} = \lambda\mathbf{w} }\ ,\]

which implies that \(\mathbf{w}=\mathbf{A}^T\mathbf{v}\) is an eigenvector of \(\mathbf{A}^T\mathbf{A}\). Thus, the approach is to compute the smaller matrix \(\mathbf{A}\mathbf{A}^T\) and then use the above formula to find the principal components.
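
This identity is easy to verify numerically; the short check below uses an arbitrary random matrix with \(m \ll n\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 200))      # short, wide matrix (m << n)

lam, V = np.linalg.eigh(A @ A.T)       # eigenpairs of the small m x m matrix
v = V[:, -1]                           # eigenvector with the largest eigenvalue
w = A.T @ v                            # candidate eigenvector of A^T A

# (A^T A) w should equal lambda * w
print(np.allclose((A.T @ A) @ w, lam[-1] * w))   # True
```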

Let \(\mathbf{V} = \mathbf{V}^T = \mathbf{W}^{1/2}\); since \(\mathbf{W}\) is diagonal, \(\mathbf{V}\) is symmetric. Recall from above that \(\mathbf{\Sigma}={(\mathbf{W}^{1/2}(\mathbf{X}-\mathbf{U}))}^T{(\mathbf{W}^{1/2}(\mathbf{X}-\mathbf{U}))}\). With sample weights and centering, the matrix to expand is therefore:

\[\displaylines{{(\mathbf{W}^{1/2}(\mathbf{X}-\mathbf{U}))}{(\mathbf{W}^{1/2}(\mathbf{X}-\mathbf{U}))}^T }\]

\[\displaylines{= \mathbf{V}[(\mathbf{X}-\mathbf{U})(\mathbf{X}-\mathbf{U})^T]\mathbf{V}^T }\]

\[\displaylines{=\mathbf{V}[\mathbf{X}\mathbf{X}^T - \mathbf{X}\mathbf{U}^T - \mathbf{U}\mathbf{X}^T + \mathbf{U}\mathbf{U}^T]\mathbf{V} }\ .\]

In this case, the cross terms do not simplify as much because of the placement of the \(\mathbf{W}^{1/2}\) matrices. However, the two cross terms are simply transposes of each other, so the product \(\mathbf{X}\mathbf{U}^T\) need only be computed once and then transposed. The final term is a constant matrix in which every element equals \(\mathbf{\mu}^T\mathbf{\mu}\). Ultimately, broadcasting allows the final three terms to be computed without explicitly forming \(\mathbf{U}\).
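
A sketch of this computation, under the same assumptions as before (sparse CSR X, normalized weight vector w, and the weighted column means in mu), might look as follows; broadcasting with mu and the vector X mu stands in for the matrices U and X U^T.

```python
import numpy as np

def weighted_gram_horizontal(X, w, mu):
    """V [X X^T - X U^T - U X^T + U U^T] V without forming the dense centered matrix."""
    v = np.sqrt(np.asarray(w, dtype=float))   # diagonal of V = W^(1/2)
    G = (X @ X.T).toarray()                   # m x m matrix X X^T
    Xmu = np.asarray(X @ mu).ravel()          # X U^T has identical columns, each equal to X mu
    mumu = float(mu @ mu)                     # every element of U U^T equals mu^T mu
    C = G - Xmu[:, None] - Xmu[None, :] + mumu
    return (v[:, None] * C) * v[None, :]      # left- and right-multiply by the diagonal V
```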

With this \(m \times m\) matrix computed, the eigenvalue decomposition can again be performed, and the principal components can then be obtained as \((\mathbf{V}(\mathbf{X}-\mathbf{U}))^T\mathbf{Q}\). Note the inclusion of the weighting and centering terms. This product can also be expanded to avoid centering the sparse matrix:

\[\displaylines{(\mathbf{V}(\mathbf{X}-\mathbf{U}))^T\mathbf{Q} = (\mathbf{V}\mathbf{X})^T\mathbf{Q} - (\mathbf{V}\mathbf{U})^T\mathbf{Q}}\ .\]

Both terms above can be computed efficiently using broadcasting, again without explicitly forming \(\mathbf{U}\). Finally, the resulting components may be re-normalized to form orthonormal principal components.
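
Continuing the sketch with the matrix Q returned by the eigendecomposition of the matrix above, the components can be recovered and re-normalized without centering X (variable names carried over from the earlier snippets; this is illustrative, not the post's exact code).

```python
import numpy as np

def components_from_rows(X, w, mu, Q):
    """Recover (V(X - U))^T Q = (V X)^T Q - (V U)^T Q and normalize the columns."""
    v = np.sqrt(np.asarray(w, dtype=float))            # diagonal of V = W^(1/2)
    VXtQ = np.asarray(X.multiply(v[:, None]).T @ Q)    # (V X)^T Q, computed sparsely
    VUtQ = np.outer(mu, v @ Q)                         # (V U)^T = mu v^T, so (V U)^T Q = mu (v^T Q)
    comps = VXtQ - VUtQ
    return comps / np.linalg.norm(comps, axis=0, keepdims=True)
```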

Source Code

In conclusion, this post details a method to compute the full PCA of a sparse \(m \times n\) rectangular matrix. Specifically, the method applies to problems where a dense SVD of the original matrix is not possible, but it is tractable to compute the eigendecomposition of either an \(m \times m\) or an \(n \times n\) dense matrix in memory. The two approaches to weighted PCA described above are implemented in the class EigenPCA, available here on GitHub.