\[\displaylines{\mathbf{A}\mathbf{x}=\mathbf{y}}\ ,\]
where \(\mathbf{A}\) is the m x n data matrix, \(\mathbf{x}\) is the n x 1 vector of regression coefficients, and \(\mathbf{y}\) is the m x 1 vector of target values. As discussed in a previous post, estimation and interpretation of this problem are made difficult if the matrix \(\mathbf{A}\) suffers from multicollinearity. Unfortunately, this is frequently the case in practical problems. In the sense used here, the columns of \(\mathbf{A}\) are decorrelated when\[\displaylines{\mathbf{A}^{T}\mathbf{A}=\mathbf{I}_n}\ ,\]
where \(\mathbf{I}_n\) is the n x n identity matrix. Consider an n x n transformation matrix that is capable of decorrelating the columns of \(\mathbf{A}\). That is, consider a matrix \(\mathbf{W}\) such that\[\displaylines{(\mathbf{A}\mathbf{W})^{T}(\mathbf{A}\mathbf{W})=\mathbf{I}_n}\ .\]
Using properties of the matrix transpose, this is equivalent to\[\displaylines{\mathbf{W}^T\mathbf{A}^{T}\mathbf{A}\mathbf{W} =\mathbf{W}^T\mathbf{X}\mathbf{W}=\mathbf{I}_n}\ ,\]where \(\mathbf{X}\) denotes the n x n matrix \(\mathbf{A}^{T}\mathbf{A}\).
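As a concrete illustration of the condition above, the following sketch (a minimal example assuming NumPy; the synthetic matrix and variable names are illustrative only) builds a small data matrix with correlated columns and shows that \(\mathbf{A}^{T}\mathbf{A}\) is far from the identity; the transformation \(\mathbf{W}\) derived below is what closes this gap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic data matrix with deliberately correlated columns:
# the second column is a noisy copy of the first.
m, n = 100, 3
A = rng.normal(size=(m, n))
A[:, 1] = A[:, 0] + 0.1 * rng.normal(size=m)

# Before any transformation, A^T A has large off-diagonal entries,
# reflecting the correlation between columns -- it is far from I_n.
print(np.round(A.T @ A, 2))
```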
It can be shown that \(\mathbf{X}\) is at least positive semi-definite: for any n x 1 vector \(\mathbf{a}\),\[\displaylines{\mathbf{a}^T\mathbf{X}\mathbf{a}}\]
\[\displaylines{=\mathbf{a}^T\mathbf{A}^T\mathbf{A}\mathbf{a}}\]
\[\displaylines{=(\mathbf{A}\mathbf{a})^T\mathbf{A}\mathbf{a}}\]
\[\displaylines{=\mathbf{b}^T\mathbf{b}\geq 0 }\ ,\]
where the substitution \(\mathbf{b}=\mathbf{A}\mathbf{a}\) has been made. Thus, since \(\mathbf{X}\) is symmetric and positive semi-definite, it may be re-written as the product of an n x n lower triangular matrix \(\mathbf{L}\) with its transpose:\[\displaylines{\mathbf{X}=\mathbf{L}\mathbf{L}^{T}}\ .\]
This decomposition is known as the Cholesky decomposition, and it is guaranteed to exist for all real-valued symmetric positive semidefinite matrices. (When \(\mathbf{A}\) has full column rank, \(\mathbf{X}\) is in fact positive definite and the factor \(\mathbf{L}\) is invertible; this is assumed in what follows.) Substituting this decomposition into the decorrelation condition gives\[\displaylines{\mathbf{I}_n=\mathbf{W}^T\mathbf{X}\mathbf{W}}\]
\[\displaylines{=\mathbf{W}^T(\mathbf{L}\mathbf{L}^{T})\mathbf{W}}\]
\[\displaylines{=(\mathbf{W}^T\mathbf{L})(\mathbf{L}^{T}\mathbf{W})}\]
\[\displaylines{=(\mathbf{L}^{T}\mathbf{W})^{T}(\mathbf{L}^{T}\mathbf{W})}\ .\]
Thus, the matrix \((\mathbf{L}^{T}\mathbf{W})\) is an orthogonal matrix. However, in the above equation, all matrices are now of dimension n x n, and further, the matrix \(\mathbf{L}^{T}\) is an upper triangular matrix. From this, a convenient solution for performing decorrelation is apparent: it is sufficient to let \(\mathbf{W}=(\mathbf{L}^{T})^{-1}=\mathbf{L}^{-T}\), as\[\displaylines{(\mathbf{A}\mathbf{W})^{T}(\mathbf{A}\mathbf{W})}\]
\[\displaylines{=\mathbf{W}^T\mathbf{A}^{T}\mathbf{A}\mathbf{W}}\]
\[\displaylines{=\mathbf{W}^T\mathbf{X}\mathbf{W}}\]
\[\displaylines{=\mathbf{W}^T(\mathbf{L}\mathbf{L}^{T})\mathbf{W}}\]
\[\displaylines{=(\mathbf{L}^{T}\mathbf{W})^T(\mathbf{L}^{T}\mathbf{W})}\]
\[\displaylines{=(\mathbf{L}^{T}\mathbf{L}^{-T})^T(\mathbf{L}^{T}\mathbf{L}^{-T})}\]
\[\displaylines{=\mathbf{I}_n^T\mathbf{I}_n=\mathbf{I}_n}\]
produces the desired result. Thus, the inverse of the transposed Cholesky factor of \(\mathbf{X}=\mathbf{A}^{T}\mathbf{A}\) decorrelates the columns of the input matrix. Further, this approach is also convenient as \(\mathbf{L}^{T}\) is triangular, so the required inverse is inexpensive to compute. Now, let \(\mathbf{B}=\mathbf{A}\mathbf{W}\) denote the decorrelated data matrix and \(\mathbf{\tilde{x}}\) the regression coefficients in the transformed space, so that the regression problem becomes\[\displaylines{\mathbf{B}\mathbf{\tilde{x}}=\mathbf{y}}\ .\]
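A minimal sketch of this whitening step, assuming NumPy and SciPy (the synthetic data and variable names are illustrative), computes \(\mathbf{L}\), forms \(\mathbf{W}=\mathbf{L}^{-T}\) via a triangular solve rather than an explicit inverse, and verifies that \(\mathbf{B}=\mathbf{A}\mathbf{W}\) has orthonormal columns:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)

# Synthetic data matrix with correlated columns (as in the earlier sketch).
m, n = 100, 3
A = rng.normal(size=(m, n))
A[:, 1] = A[:, 0] + 0.1 * rng.normal(size=m)

# Scatter matrix X = A^T A and its lower-triangular Cholesky factor L.
X = A.T @ A
L = cholesky(X, lower=True)      # X = L @ L.T

# W = L^{-T}: obtained by solving the triangular system L^T W = I_n.
W = solve_triangular(L.T, np.eye(n), lower=False)

# B = A W has (numerically) orthonormal columns: B^T B = I_n.
B = A @ W
assert np.allclose(B.T @ B, np.eye(n))
```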
The scalar loss of the least squares formulation can be written as:\[\displaylines{L=(\mathbf{y}-\mathbf{B}\mathbf{\tilde{x}})^{T}(\mathbf{y}-\mathbf{B}\mathbf{\tilde{x}})}\ .\]
Setting the derivative of this expression with respect to \(\mathbf{\tilde{x}}\) to zero produces\[\displaylines{\frac{\partial L}{\partial \mathbf{\tilde{x}}}=-2\mathbf{B}^T\mathbf{y}+2\mathbf{B}^T\mathbf{B}\mathbf{\tilde{x}}=0}\ .\]
Now, from above, \(\mathbf{B}^T\mathbf{B}=\mathbf{I}_n\), and so this equation simplifies as follows:
\[\displaylines{\mathbf{B}^T\mathbf{B}\mathbf{\tilde{x}}=\mathbf{B}^T\mathbf{y}}\]
\[\displaylines{\mathbf{I}_n\mathbf{\tilde{x}}=\mathbf{B}^T\mathbf{y}}\]
\[\displaylines{\mathbf{\tilde{x}}=\mathbf{B}^T\mathbf{y}}\ .\]
Thus, the least squares coefficients in the transformed space are simply the product of the transpose of the decorrelated matrix \(\mathbf{B}\) with the target vector. Substituting this back into the transformed problem gives the linear least squares approximation:\[\displaylines{\mathbf{\hat{y}}=\mathbf{B}\mathbf{\tilde{x}}=\mathbf{B}\mathbf{B}^T\mathbf{y}}\ .\]
From this, it is apparent that the quality of the approximation improves the closer the m x m scatter matrix \(\mathbf{B}\mathbf{B}^T\) is to the identity matrix, which occurs when the rows of \(\mathbf{B}\) are orthonormal. Finally, the coefficients in the original space can be recovered. Since \(\mathbf{B}=\mathbf{A}\mathbf{W}\),\[\displaylines{\mathbf{B}\mathbf{\tilde{x}}=\mathbf{\hat{y}}}\]
\[\displaylines{\mathbf{A}\mathbf{W}\mathbf{\tilde{x}}=\mathbf{\hat{y}}}\]
\[\displaylines{\mathbf{A}\mathbf{x}=\mathbf{\hat{y}}}\ .\]
From the above, it is apparent that the coefficients in the original space may be recovered using the following equation:\[\displaylines{\mathbf{x}=\mathbf{W}\mathbf{\tilde{x}}}\ .\]
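Continuing the sketch above (again assuming NumPy and SciPy, with an illustrative synthetic target \(\mathbf{y}\)), the coefficients can be computed in the decorrelated space and mapped back with \(\mathbf{W}\); the result agrees with a standard least-squares solver applied to the original problem:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)

# Synthetic data with correlated columns and a known-coefficient target.
m, n = 100, 3
A = rng.normal(size=(m, n))
A[:, 1] = A[:, 0] + 0.1 * rng.normal(size=m)
y = A @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=m)

# Decorrelate: W = L^{-T}, where A^T A = L L^T.
L = cholesky(A.T @ A, lower=True)
W = solve_triangular(L.T, np.eye(n), lower=False)
B = A @ W

# Coefficients in the decorrelated space, then mapped back to the original space.
x_tilde = B.T @ y
x = W @ x_tilde

# Matches an off-the-shelf least-squares solution of A x = y.
x_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(x, x_ref)
```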
In summary, this approach may be used, among other things, to orthogonalize the columns of a matrix and to perform least-squares approximation. In total, it requires computing one Cholesky decomposition, one inverse of a triangular matrix, and several matrix multiplications.
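The procedure is compact enough to collect into a single helper. The sketch below (assuming NumPy and SciPy; the function name is illustrative, not from the original post) performs exactly the operations listed above: one Cholesky decomposition, one triangular inverse, and a few matrix multiplications.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular


def decorrelate_and_solve(A, y):
    """Orthonormalize the columns of A via the Cholesky factor of A^T A,
    then solve the least-squares problem A x ~ y in the decorrelated space.

    Returns (B, W, x): B = A @ W has orthonormal columns, and
    x = W @ (B.T @ y) are the coefficients in the original space.
    """
    n = A.shape[1]
    L = cholesky(A.T @ A, lower=True)                  # one Cholesky decomposition
    W = solve_triangular(L.T, np.eye(n), lower=False)  # one triangular inverse
    B = A @ W                                          # remaining matrix products
    x = W @ (B.T @ y)
    return B, W, x
```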