  1. What’s there?
    1. Why do I write this?
  2. QR Decomposition
    1. Basics?
    2. Ok, what is it?
    3. Ok, but why?
    4. How do I do it?
    5. Applications!!
  3. Cholesky decomposition
    1. Axiomatic basis
    2. What is it?
    3. Isn’t this very specific?
  4. LU decomposition
    1. Definition
    2. How do I do it?
    3. Applications
  5. Sources

What’s there?

There are basically 5 (ಐದು / पाँच) types of matrix decompositions that are like super important. I strive to go in depth into all of them. These are

  1. Eigendecomposition (Matrix Decompositions 1)
  2. QR decomposition (This post!)
  3. Cholesky decomposition (This post!)
  4. LU decomposition (This post!)
  5. (perhaps the most important) Singular Value Decomposition

Why do I write this?

I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

QR Decomposition

Basics?

Yeah, you remember vector spaces and orthogonality? Yeah, that’s it.

Ok, what is it?

A factorization of a matrix \mathbf{A} \in \mathbb{R}^{m \times n} into the product of two specific types of matrices: an orthogonal matrix \mathbf{Q} and an upper triangular matrix \mathbf{R}.

Definition (QR Decomposition): Let \mathbf{A} be an m \times n matrix with linearly independent columns. The QR decomposition of \mathbf{A} is a factorization:

\mathbf{A} = \mathbf{Q}\mathbf{R}

where:

  • \mathbf{Q} is an m \times n matrix with orthonormal columns. Thus, \mathbf{Q}^\top \mathbf{Q} = \mathbf{I}_n, where \mathbf{I}_n is the n \times n identity matrix.
  • \mathbf{R} is an n \times n upper triangular matrix. This means all entries below the main diagonal are zero (R_{ij} = 0 for i > j).
    • R? Right? Right-upper? And Q preceded R? Get it? Idk.
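
To make the definition concrete, here is a tiny NumPy check (the matrix \mathbf{A} below is an arbitrary example I made up; np.linalg.qr returns exactly the reduced form described above):

import numpy as np

# An arbitrary 3x2 example with linearly independent columns
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

Q, R = np.linalg.qr(A)                    # reduced QR: Q is 3x2, R is 2x2

print(np.allclose(Q.T @ Q, np.eye(2)))    # orthonormal columns: Q^T Q = I_n
print(np.allclose(np.triu(R), R))         # R is upper triangular (entries below the diagonal are 0)
print(np.allclose(Q @ R, A))              # A = QR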

Ok, but why?

Orthogonal matrices are awesome.

QR decomposition formalizes a process of transforming an arbitrary basis (the columns of \mathbf{A}) into an orthonormal basis (the columns of \mathbf{Q}), while encoding the transformation information in the upper triangular matrix \mathbf{R}.

How do I do it?

Gram-Schmidt process!

This handout from UCLA is awesome. I have realised that I’m not made of time, so I want these notes to be mostly about intuition, with rigorous math only in the places that I deem super important (around ~50%).
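
For intuition, here is a bare-bones classical Gram-Schmidt QR in NumPy. This is a sketch only (the function name gram_schmidt_qr is mine, and in practice you’d use modified Gram-Schmidt or Householder reflections, which are more numerically stable):

import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt: returns Q (orthonormal columns) and R (upper triangular) with A = QR."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # coefficient of a_j along the already-built q_i
            v -= R[i, j] * Q[:, i]        # strip that component away
        R[j, j] = np.linalg.norm(v)       # what is left sets the diagonal of R
        Q[:, j] = v / R[j, j]             # normalise the leftover direction
    return Q, R

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A))              # True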

Applications!!

This is why I do this – yayyy (first time I got to this section in under ~20 minutes).

  1. Solving Linear Systems and Least-Squares Problems
  2. In robotics?

Solving Linear Systems and Least-Squares Problems

Say we want to solve \mathbf{A}\mathbf{x} = \mathbf{b}. We can write –

\mathbf{Q}\mathbf{R}\mathbf{x} = \mathbf{b}

Since \mathbf{Q} is orthogonal, \mathbf{Q}^{-1} = \mathbf{Q}^\top. We can multiply by \mathbf{Q}^\top:

\mathbf{Q}^\top\mathbf{Q}\mathbf{R}\mathbf{x} = \mathbf{Q}^\top\mathbf{b} \implies \mathbf{R}\mathbf{x} = \mathbf{Q}^\top\mathbf{b}

This new system is trivial to solve because \mathbf{R} is upper triangular. We can solve for the components of \mathbf{x} starting from the last one and moving upwards, a process known as back substitution.
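
A minimal sketch of this recipe (the matrix and right-hand side are made-up example values; scipy.linalg.solve_triangular does the back substitution):

import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                # square, invertible example
b = np.array([3.0, 5.0])

Q, R = np.linalg.qr(A)
x = solve_triangular(R, Q.T @ b, lower=False)   # back substitution on R x = Q^T b

print(np.allclose(A @ x, b))              # True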

Now, for the linear least squares regression.

The objective is to find the vector \mathbf{x} that minimizes the squared Euclidean norm of the residual:

\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2

That means, for us,

\|\mathbf{Q}\mathbf{R}\mathbf{x} - \mathbf{b}\|_2^2 = \|\mathbf{Q}^\top(\mathbf{Q}\mathbf{R}\mathbf{x} - \mathbf{b})\|_2^2 = \|\mathbf{R}\mathbf{x} - \mathbf{Q}^\top\mathbf{b}\|_2^2

How did we do this? Orthogonal transformations preserve the Euclidean norm!!! (Strictly, this needs the full, square \mathbf{Q} from a full QR decomposition; with the thin \mathbf{Q} from above, the leftover piece of the residual is orthogonal to the columns of \mathbf{Q} and doesn’t depend on \mathbf{x}, so the minimizer is the same either way.)

Very cool and smart sounding, right? But, why did we do this?

The textbook route is to solve the normal equations, \mathbf{A}^\top\mathbf{A}\mathbf{x} = \mathbf{A}^\top\mathbf{b}. The main issue lies with the formation of \mathbf{A}^\top\mathbf{A}: it has a high condition number, which makes the solve less numerically stable.

A condition number is a score that measures how sensitive a function or a matrix is to small changes in its input.

This means that –

  • A low condition number means a matrix is well-conditioned. Small changes in the input data will only cause small changes in the output or solution. This is stable and reliable.
  • A high condition number means a matrix is ill-conditioned. Tiny, insignificant changes in the input data can cause huge, dramatic changes in the solution. This is unstable and unreliable.

In numerical calculations, an ill-conditioned matrix can amplify tiny rounding errors into completely wrong answers.

Forming the matrix \mathbf{A}^\top\mathbf{A} can be disastrous for numerical stability. Mathematically, the condition number of \mathbf{A}^\top\mathbf{A} is the square of the condition number of \mathbf{A}.

\text{cond}(\mathbf{A}^\top\mathbf{A}) = [\text{cond}(\mathbf{A})]^2

This means that if your original matrix \mathbf{A} is even slightly ill-conditioned, the matrix \mathbf{A}^\top\mathbf{A} you need to work with will be extremely ill-conditioned.

The QR method is superior because it cleverly bypasses the need to form the unstable \mathbf{A}^\top\mathbf{A} matrix.

  1. Decomposition: First, it decomposes the original matrix \mathbf{A} into \mathbf{Q}\mathbf{R}, where \mathbf{Q} is an orthogonal matrix and \mathbf{R} is an upper triangular matrix. This decomposition process itself is numerically stable.
  2. Stable Transformation: The next step uses a key property of orthogonal matrices: they don’t change the length (Euclidean norm) of vectors when they multiply them. This allows the problem to be transformed without amplifying errors:

\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 = \|\mathbf{Q}\mathbf{R}\mathbf{x} - \mathbf{b}\|_2^2 = \|\mathbf{R}\mathbf{x} - \mathbf{Q}^\top\mathbf{b}\|_2^2

This transformation is numerically stable because the condition number of \mathbf{R} is the same as the original matrix \mathbf{A}. We haven’t made the problem worse:))))
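
A small numerical sketch of that claim. The Vandermonde matrix below is just one convenient, mildly ill-conditioned example; the point is that cond(\mathbf{A}^\top\mathbf{A}) comes out as roughly cond(\mathbf{A})^2, while the QR route never forms \mathbf{A}^\top\mathbf{A} at all:

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
A = np.vander(t, 8)                       # 50x8 design matrix, mildly ill-conditioned
b = rng.standard_normal(50)

print(np.linalg.cond(A))                  # cond(A)
print(np.linalg.cond(A.T @ A))            # roughly cond(A)^2

# QR route: never form A^T A
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Normal-equations route, for comparison
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

print(np.linalg.norm(A @ x_qr - b))       # residuals of the two solutions
print(np.linalg.norm(A @ x_ne - b))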

In Robotics?

Literally anywhere you use least squares. Example – Levenberg-Marquardt algorithm in inverse kinematics:)


Cholesky decomposition

Axiomatic basis

We do have something for this!

The entire theory of Cholesky decomposition rests upon the properties of Symmetric Positive Definite (SPD) matrices. Let \mathbf{A} \in \mathbb{R}^{n \times n} be a square matrix.

  1. Symmetry: The matrix is symmetric if it is equal to its transpose, \mathbf{A} = \mathbf{A}^\top. This means A_{ij} = A_{ji} for all i, j (the more you know!).
  2. Positive Definiteness: A symmetric matrix \mathbf{A} is positive definite if for every non-zero vector \mathbf{x} \in \mathbb{R}^n \setminus \{\mathbf{0}\}, the quadratic form \mathbf{x}^\top \mathbf{A} \mathbf{x} is strictly positive.

\mathbf{x}^\top \mathbf{A} \mathbf{x} > 0 \quad \forall \mathbf{x} \neq \mathbf{0}

Key Properties of SPD Matrices:

  • All eigenvalues of an SPD matrix are real and strictly positive.
  • All diagonal elements (A_{ii}) are strictly positive.
  • The matrix is always invertible (since no eigenvalue is zero).

Proofs!

What is it?

Let \mathbf{A} be a symmetric, positive definite matrix. Its Cholesky decomposition is a factorization of the form:

\mathbf{A} = \mathbf{L}\mathbf{L}^\top

where:

  • \mathbf{L} is a lower triangular matrix with strictly positive diagonal entries (L_{ii} > 0).
  • \mathbf{L}^\top is the transpose of \mathbf{L}, which is an upper triangular matrix.
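
Quick NumPy sanity check of the definition (the SPD matrix below is constructed as \mathbf{B}\mathbf{B}^\top + 4\mathbf{I} purely so that positive definiteness is guaranteed; np.linalg.cholesky returns the lower-triangular factor):

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)               # symmetric positive definite by construction

L = np.linalg.cholesky(A)
print(np.allclose(L, np.tril(L)))         # L is lower triangular
print(np.all(np.diag(L) > 0))             # strictly positive diagonal entries
print(np.allclose(L @ L.T, A))            # A = L L^T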

Isn’t this very specific?

So, why do we need this?

We often need to generate random samples from a multivariate Gaussian distribution \mathcal{N}(\boldsymbol{\mu}, \mathbf{\Sigma}). The covariance matrix \mathbf{\Sigma} is symmetric and, for a non-degenerate Gaussian, positive definite – exactly SPD.

How do we use it?

  1. Generate Standard Normal Variates: Generate a vector \mathbf{z} \in \mathbb{R}^D where each component z_i is an independent sample from the standard normal distribution \mathcal{N}(0, 1). This vector \mathbf{z} has a mean of \mathbf{0} and a covariance matrix of \mathbf{I}.
  2. Decompose: Compute the Cholesky factor \mathbf{L} of the target covariance matrix, \mathbf{\Sigma} = \mathbf{L}\mathbf{L}^\top.
  3. Transform: Create the desired sample \mathbf{x} using the affine transformation:

\mathbf{x} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z}

This works because the mean and covariance of the transformed variable \mathbf{x} are:

  • \mathbb{E}[\mathbf{x}] = \mathbb{E}[\boldsymbol{\mu} + \mathbf{L}\mathbf{z}] = \boldsymbol{\mu} + \mathbf{L}\mathbb{E}[\mathbf{z}] = \boldsymbol{\mu}.
  • \text{Cov}(\mathbf{x}) = \text{Cov}(\boldsymbol{\mu} + \mathbf{L}\mathbf{z}) = \mathbf{L}\text{Cov}(\mathbf{z})\mathbf{L}^\top = \mathbf{L}\mathbf{I}\mathbf{L}^\top = \mathbf{L}\mathbf{L}^\top = \mathbf{\Sigma}.
    The Cholesky decomposition provides the correct linear transformation \mathbf{L} to “color” the uncorrelated noise \mathbf{z} into the correlated noise specified by \mathbf{\Sigma}.
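
The whole recipe is a few lines of NumPy (\boldsymbol{\mu} and \mathbf{\Sigma} below are made-up example values):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # SPD covariance

L = np.linalg.cholesky(Sigma)             # Sigma = L L^T
Z = rng.standard_normal((2, 100_000))     # each column is an iid N(0, I) sample z
X = mu[:, None] + L @ Z                   # x = mu + L z, applied column by column

print(X.mean(axis=1))                     # close to mu
print(np.cov(X))                          # close to Sigma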

LU decomposition

Definition

The LU decomposition of a square matrix \mathbf{A} is a factorization:

\mathbf{A} = \mathbf{L}\mathbf{U}

where \mathbf{L} is a unit lower triangular matrix and \mathbf{U} is an upper triangular matrix. Not every matrix has an LU decomposition. However, if row swaps are allowed, a stable factorization is always possible. This leads to the more general form \mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}, where \mathbf{P} is a permutation matrix that records the row swaps.

How do I do it?

Remember how Gaussian elimination does elementary row operations? And these elementary operations can be written as elementary matrices? So, we can do –

\mathbf{A} = (\mathbf{E}_{32}\mathbf{E}_{31}\mathbf{E}_{21})^{-1}\mathbf{U} = \mathbf{E}_{21}^{-1}\mathbf{E}_{31}^{-1}\mathbf{E}_{32}^{-1}\mathbf{U}

Now, if you work it out – the elementary matrices (and their inverses) are lower triangular, and so is their product. A nice property of these lower triangular inverses is that their product is simple to write down: the multipliers l_{ij} slot directly into place.

\mathbf{L} = \mathbf{E}_{21}^{-1}\mathbf{E}_{31}^{-1}\mathbf{E}_{32}^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{bmatrix}

Now, what if one of the pivots is 0? Or like, super small – leading to instability?

To prevent this, we perform pivoting: swapping the current row with a subsequent row that has a larger pivot element.

Each row swap can be represented by left-multiplication with a permutation matrix \mathbf{P}. A permutation matrix is an identity matrix with its rows reordered. Applying all necessary row swaps to \mathbf{A} at the beginning gives the general and stable form of the decomposition:

\mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}

This factorization exists for any square matrix \mathbf{A}.
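
A quick SciPy sketch (the matrix is an arbitrary example whose first pivot is zero, so pivoting is genuinely needed). One caveat: scipy.linalg.lu returns the factorization as \mathbf{A} = \mathbf{P}\mathbf{L}\mathbf{U}, i.e. its permutation sits on the other side, so \mathbf{P}^\top\mathbf{A} = \mathbf{L}\mathbf{U} in the convention used here:

import numpy as np
from scipy.linalg import lu

A = np.array([[0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 1.0, 3.0]])           # first pivot is 0, so rows must be swapped

P, L, U = lu(A)                           # SciPy convention: A = P @ L @ U
print(np.allclose(P @ L @ U, A))          # True
print(np.allclose(np.tril(L), L), np.allclose(np.diag(L), 1.0))   # L is unit lower triangular
print(np.allclose(np.triu(U), U))         # U is upper triangular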

Applications

Literally any decomposition makes solving linear systems easier. I’m gonna skip this, but that’s the primary use (again).

Other than that, the decomposition provides an efficient way to calculate the determinant.

\det(\mathbf{P}\mathbf{A}) = \det(\mathbf{L})\det(\mathbf{U}) \implies \det(\mathbf{P})\det(\mathbf{A}) = \det(\mathbf{L})\det(\mathbf{U})

  • \det(\mathbf{L}) = 1 (since it is unit triangular).
  • \det(\mathbf{U}) = \prod_{i=1}^n U_{ii} (the product of its diagonal elements).
  • \det(\mathbf{P}) = (-1)^s, where s is the number of row swaps performed.
    Therefore, \det(\mathbf{A}) = (-1)^s \prod_{i=1}^n U_{ii}.
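
A sketch of this formula using SciPy’s lu_factor (same example matrix as above; lu_factor packs \mathbf{L} and \mathbf{U} into one array, with \mathbf{U}’s diagonal on the diagonal, and returns pivot indices from which the number of swaps s can be counted):

import numpy as np
from scipy.linalg import lu_factor

A = np.array([[0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 1.0, 3.0]])

lu_mat, piv = lu_factor(A)
s = np.sum(piv != np.arange(len(piv)))    # number of row swaps actually performed
det_lu = (-1) ** s * np.prod(np.diag(lu_mat))   # (-1)^s * prod(diag(U))

print(det_lu, np.linalg.det(A))           # the two values agree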

Sources

  1. Mathematics for Machine Learning (link)