Kernel Recipes and Kernel PCA
Tue, 01 Nov 2022
Computer Science, Data Science, Data Visualization, Machine Learning
One strength of kernel methods is their ability to operate directly on non-numerical objects like sets. As seen in the previous post, the Jaccard index on sets satisfies Mercer's condition and thus is a valid kernel. The process of proving a similarity measure is a valid kernel is somewhat involved, but thankfully several theorems can be employed to get more mileage out of the set of known good kernels. This post outlines some recipes for producing new valid kernels and introduces a method for obtaining numerical representations of samples using kernel methods.
Recipes for New Kernels
A number of transformations of valid kernels are known to produce new valid kernels, only a few of which are demonstrated here.
Theorem 1
If \(k_1(\mathbf{x}, \mathbf{y})\) and \(k_2(\mathbf{x}, \mathbf{y})\) are both valid kernels, then \(k(\mathbf{x}, \mathbf{y}) = k_1(\mathbf{x}, \mathbf{y}) + k_2(\mathbf{x}, \mathbf{y})\) is also a valid kernel.
Proof
For any given set of points \(\{\mathbf{x}_i\}_{i=1}^{n}\), let the Gram matrices produced by \(k_1\) and \(k_2\) be \(\mathbf{A}\) and \(\mathbf{B}\) respectively. Note that, since both \(k_1\) and \(k_2\) are valid kernels, both \(\mathbf{A}\) and \(\mathbf{B}\) are positive semi-definite matrices. Next, by definition the Gram matrix produced by \(k(\mathbf{x}, \mathbf{y})\) is \(\mathbf{C} = \mathbf{A} + \mathbf{B}\). Since \(\mathbf{C}\) is a sum of positive semi-definite matrices, it is also positive semi-definite: for any vector \(\mathbf{z}\), \(\mathbf{z}^\top \mathbf{C} \mathbf{z} = \mathbf{z}^\top \mathbf{A} \mathbf{z} + \mathbf{z}^\top \mathbf{B} \mathbf{z} \geq 0\). Finally, since the Gram matrix produced by \(k(\mathbf{x}, \mathbf{y})\) is positive semi-definite for an arbitrary set of points, \(k(\mathbf{x}, \mathbf{y})\) is also a positive semi-definite kernel. \(\blacksquare\)
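To make the argument concrete, here is a minimal numpy sketch (not part of the proof itself) that builds Gram matrices for two known-valid kernels, a linear kernel and a Gaussian RBF kernel, on random data and checks that their sum has no negative eigenvalues. The data and kernel choices are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # illustrative random samples

# Two valid kernels on numeric vectors: linear and Gaussian RBF.
A = X @ X.T                                               # linear-kernel Gram matrix
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
B = np.exp(-0.5 * sq_dists)                               # RBF-kernel Gram matrix

C = A + B                                                 # Gram matrix of the summed kernel

# None of the three matrices should have (numerically) negative eigenvalues.
for name, M in [("A", A), ("B", B), ("C", C)]:
    print(name, np.linalg.eigvalsh(M).min() >= -1e-9)
```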
Theorem 2
If \(k_1(\mathbf{x}, \mathbf{y})\) is a valid kernel and \(c > 0\) is a positive constant, then \(k(\mathbf{x}, \mathbf{y}) = c k_1(\mathbf{x}, \mathbf{y})\) is also a valid kernel.
Proof
For any given set of points \(\{\mathbf{x}_i\}_{i=1}^{n}\), let the Gram matrix produced by \(k_1\) be \(\mathbf{A}\), which again is positive semi-definite. Next, by definition the Gram matrix produced by \(k(\mathbf{x}, \mathbf{y})\) is \(\mathbf{C} = c \mathbf{A}\). Multiplying \(\mathbf{A}\) by the positive constant \(c\) scales each of its eigenvalues by \(c\) without changing their signs, so \(\mathbf{C}\) is also a positive semi-definite matrix. Again, since \(\mathbf{C}\) is positive semi-definite for an arbitrary set of points, \(k(\mathbf{x}, \mathbf{y})\) is also a positive semi-definite kernel. \(\blacksquare\)
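The scaling step can also be checked numerically; this tiny sketch (illustrative only) confirms that the eigenvalues of \(c\mathbf{A}\) are exactly the eigenvalues of \(\mathbf{A}\) scaled by \(c\).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
A = X @ X.T                       # positive semi-definite linear-kernel Gram matrix
c = 3.5                           # any positive constant
# Eigenvalues of c*A are c times those of A, so none become negative.
print(np.allclose(np.linalg.eigvalsh(c * A), c * np.linalg.eigvalsh(A)))
```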
Putting Theorems 1 and 2 together, it is easy to prove that a weighted sum of positive semi-definite kernels with positive weights is itself a positive semi-definite kernel.
Theorem 3
\[\displaylines{k(\mathbf{x}, \mathbf{y}) = \sum\limits_{i=1}^{n}{c_i k_i(\mathbf{x}, \mathbf{y})} }\ ,\]
where \(c_i > 0\) are positive constants and \(k_i\) are positive semi-definite kernels, is a positive semi-definite kernel.
Proof
By Theorem 2 each element of the sum is a valid kernel. By Theorem 1, a sum of valid kernels is itself a valid kernel, and so \(k(\mathbf{x}, \mathbf{y})\) is also a valid kernel. \(\blacksquare\)
Theorem 3 allows practitioners to build kernels as weighted combinations of valid similarity measures. Each individual measure can be based on dot products of numerical vectors, similarity functions over sets like the Jaccard index, or even functions comparing strings. Kernel functions allow rich expressiveness in modeling that is backed by solid theoretical underpinnings.
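As an illustration of Theorem 3, the sketch below combines a linear kernel on hypothetical numeric features with a Jaccard kernel on hypothetical set-valued features, using made-up positive weights, and verifies that the resulting Gram matrix on a few toy samples is positive semi-definite.

```python
import numpy as np

def jaccard(s, t):
    """Jaccard index between two sets; a valid kernel (see the previous post)."""
    union = s | t
    return len(s & t) / len(union) if union else 1.0

def combined_kernel(num_a, num_b, set_a, set_b, w_num=1.0, w_set=2.0):
    """Weighted sum of a linear kernel on numeric features and a Jaccard
    kernel on set-valued features (Theorem 3); the weights are illustrative."""
    return w_num * float(np.dot(num_a, num_b)) + w_set * jaccard(set_a, set_b)

# Toy heterogeneous samples: (numeric vector, set of tags) -- made up for this demo.
samples = [
    (np.array([1.0, 0.5]), {"fire", "flying"}),
    (np.array([0.2, 1.3]), {"water"}),
    (np.array([0.9, 0.4]), {"fire"}),
]

n = len(samples)
K = np.array([[combined_kernel(samples[i][0], samples[j][0],
                               samples[i][1], samples[j][1])
               for j in range(n)] for i in range(n)])
print(np.linalg.eigvalsh(K))   # all eigenvalues should be >= 0
```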
Kernel Principal Component Analysis
Kernel principal component analysis (KPCA) generalizes the method of principal component analysis (PCA) by substituting an arbitrary kernel function for the typical dot product used in PCA. Through some algebraic manipulation, formulae for the projections of the data onto the implied principal components with respect to a given kernel \(k(\mathbf{x}, \mathbf{y})\) are obtained. Specifically, for a set of points \(\{\mathbf{x}_i\}_{i=1}^{n}\) and an arbitrary vector \(\mathbf{x}\), the \(k\)-th component of the KPCA projection of the vector is
\[\displaylines{\sum\limits_{i=1}^{n}{\mathbf{v}_i^{k}k(\mathbf{x}_i, \mathbf{x})} }\ ,\]
where \(\mathbf{v}^k\) is the \(k\)-th normalized eigenvector of the implicitly centered Gram matrix. Thus, to compute the projection, the kernel function is first evaluated between the query point and each of the training samples in turn. Next, the resulting vector is dotted with the appropriately normalized eigenvectors of the centered Gram matrix.
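The following numpy sketch implements that recipe directly: center the Gram matrix, take its top eigenvectors normalized so that \(\lambda_k \lVert \mathbf{v}^k \rVert^2 = 1\), and dot the vector of kernel evaluations against the training samples with those eigenvectors. The sketch also centers the query kernel vector to stay consistent with the centered Gram matrix, a step folded into "implicitly centered" above. The linear kernel and random data are stand-ins for whatever kernel and samples are actually in use.

```python
import numpy as np

def kpca_fit(K, n_components=2):
    """Eigendecompose the centered Gram matrix K (n x n) of the training data."""
    n = K.shape[0]
    J = np.full((n, n), 1.0 / n)
    Kc = K - J @ K - K @ J + J @ K @ J        # the implicitly centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)     # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    # Normalize the eigenvectors so that lambda_k * ||v^k||^2 = 1.
    return eigvecs / np.sqrt(np.clip(eigvals, 1e-12, None))

def kpca_transform(k_query, K, alphas):
    """Project a point given k_query[i] = k(x_i, x) against the training samples."""
    # Center the query kernel values consistently with the training Gram matrix,
    # then take the weighted sum from the formula above.
    k_c = k_query - k_query.mean() - K.mean(axis=1) + K.mean()
    return k_c @ alphas

# Illustrative use with a linear kernel on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
K = X @ X.T
alphas = kpca_fit(K, n_components=2)
print(kpca_transform(K[0], K, alphas))        # 2-D embedding of the first sample
```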
By leveraging Kernel PCA with the above theorems, one can obtain vector representations for arbitrary non-numerical objects like sets and strings. Further, practitioners may combine numerical and non-numerical data via multiple weighted kernel functions to obtain numerical vectors for complex heterogeneous data types.
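For instance, one way to turn raw sets into coordinates is to precompute a Jaccard Gram matrix and hand it to scikit-learn's KernelPCA with kernel="precomputed"; the short example below uses made-up tag sets purely for illustration.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def jaccard(s, t):
    """Jaccard index between two sets."""
    union = s | t
    return len(s & t) / len(union) if union else 1.0

# Hypothetical set-valued samples (e.g. tag sets); no numeric features needed.
samples = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"d"}, {"c", "d"}]

# Precompute the Jaccard Gram matrix and hand it to scikit-learn's KernelPCA.
K = np.array([[jaccard(s, t) for t in samples] for s in samples])
coords = KernelPCA(n_components=2, kernel="precomputed").fit_transform(K)
print(coords)   # a 2-D numeric embedding of each set
```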
Interactive KPCA Embeddings
The plots page of my new github.io site contains several interactive scatter plots built using Kernel PCA. Click on the links below to see more.
The 1000 Genomes embedding is built using Jaccard similarity on the genotype sets for each individual. The Pokémon embedding factors in multiple pieces of information including Pokémon powers, abilities, and types. Finally, the embedding of the 117th U.S. senate is constructed using Jaccard similarity over the votes cast by each senator.