  1. What’s there?
    1. Why do I write this?
  2. What are S, V and D?
    1. Geometric action of a matrix!
    2. Formal definition!!
    3. Woah! Column space, row space? What?
    4. An Exposition on the Fundamental Subspaces of a Linear Transformation
  3. Theoretical development
    1. From eigendecomposition
    2. Singular values and geometry
    3. Reduced SVD
  4. Applications
    1. Low-Rank Approximation and Principal Component Analysis (PCA)
    2. Moore-Penrose Pseudoinverse and Linear Regression
    3. Numerical Stability and Condition Number
  5. Aside: Eckart-Young-Mirsky Theorem
    1. Matrix norms
    2. Theorem
  6. Aside: Intuitions for SVD
    1. Theoretical development for the Moore-Penrose inverse
    2. Listing the intuitions
  7. Sources

What’s there?

There are basically 5 (ಐದು / पाँच) types of matrix decompositions that are like super important. I strive to go in depth into all of them. These are

  1. Eigendecomposition (Matrix Decompositions 1)
  2. QR decomposition (Matrix Decompositions 2)
  3. Cholesky decomposition (Matrix Decompositions 2)
  4. LU decomposition (Matrix Decompositions 2)
  5. (perhaps the most important) Singular Value Decomposition (This post!)

Why do I write this?

I wanted to create an authoritative document I could refer back to later. I thought I might as well make it public (build in public, serve society and that sort of stuff).

What are S, V and D?

Singular, Value and Decomposition? Let’s start!

Geometric action of a matrix!

Remember eigendecomposition? Yes? No? See Matrix Decompositions 1.

What if the matrix is not square?? We use SVD! Wait lemme explain:

Any linear transformation \mathbf{A} maps the set of unit vectors in the input space (a unit sphere in \mathbb{R}^n) to a hyperellipse in the output space \mathbb{R}^m.

This is the main geometric insight of the SVD:

This hyperellipse lies within a subspace of the codomain. The SVD provides a complete characterization of this transformation by identifying the specific orthonormal basis vectors in the input space that are mapped directly onto the principal axes of this resulting hyperellipse.
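
If you want to see the hyperellipse happen, here is a minimal NumPy sketch (my own toy example with a random 3×2 matrix and a fixed seed, not from any of the sources): it maps a sampled unit circle through \mathbf{A} and checks that the longest and shortest images have lengths equal to the largest and smallest singular values, i.e. the semi-axes.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))   # maps the unit circle in R^2 into R^3

# Sample the unit circle densely and push every point through A.
theta = np.linspace(0, 2 * np.pi, 2000)
circle = np.stack([np.cos(theta), np.sin(theta)])   # shape (2, 2000)
lengths = np.linalg.norm(A @ circle, axis=0)        # lengths of points on the hyperellipse

# The extreme lengths match the singular values (the semi-axis lengths).
sigma = np.linalg.svd(A, compute_uv=False)
print(lengths.max(), "~", sigma[0])   # longest image  ~ sigma_1
print(lengths.min(), "~", sigma[1])   # shortest image ~ sigma_2
```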

Formal definition!!

Theorem (Singular Value Decomposition): For any matrix \mathbf{A} \in \mathbb{R}^{m \times n} of rank r \le \min(m, n), there exists a factorization of the form:

\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top

where the components have the following properties:

  1. \mathbf{U} \in \mathbb{R}^{m \times m} is an Orthogonal Matrix. The columns of \mathbf{U}, denoted {\mathbf{u}_1, \dots, \mathbf{u}_m}, are the left-singular vectors. They form an orthonormal basis for the output space (codomain), \mathbb{R}^m. The first r of these vectors, {\mathbf{u}_1, \dots, \mathbf{u}_r}, form an orthonormal basis for the column space (range) of \mathbf{A}. Geometrically, these are the unit vectors that define the directions of the principal axes of the output hyperellipse.
  2. \mathbf{V} \in \mathbb{R}^{n \times n} is an Orthogonal Matrix. The columns of \mathbf{V}, denoted {\mathbf{v}_1, \dots, \mathbf{v}_n}, are the right-singular vectors. They form an orthonormal basis for the input space (domain), \mathbb{R}^n. The first r of these vectors, {\mathbf{v}_1, \dots, \mathbf{v}_r}, form an orthonormal basis for the row space of \mathbf{A}. Geometrically, these are the specific orthonormal vectors on the input unit sphere that are mapped by \mathbf{A} onto the principal axes of the output hyperellipse.
  3. \mathbf{\Sigma} \in \mathbb{R}^{m \times n} is a Rectangular Diagonal Matrix. The only non-zero entries of \mathbf{\Sigma} are on its main diagonal. These entries, \Sigma_{ii} = \sigma_i, are the singular values of \mathbf{A}. They are real, non-negative, and by convention, sorted in descending order:

\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0, \quad \text{and} \quad \sigma_{r+1} = \dots = \sigma_{\min(m,n)} = 0

The number of non-zero singular values is exactly the rank r of the matrix \mathbf{A}. Geometrically, each singular value \sigma_i is the length of the i-th principal semi-axis of the output hyperellipse.
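
Quick sanity check of the definition (a sketch of mine, using np.linalg.svd on a random matrix with a fixed seed): reconstruct \mathbf{A}, confirm the orthogonality of \mathbf{U} and \mathbf{V}, the descending order of the singular values, and that the rank equals the number of non-zero singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
A = rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A)     # full SVD: U is (m, m), s has min(m, n) values, Vt is (n, n)
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)      # rebuild the rectangular "diagonal" matrix

assert np.allclose(U @ Sigma @ Vt, A)        # A = U Σ Vᵀ
assert np.allclose(U.T @ U, np.eye(m))       # U is orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(n))     # V is orthogonal
assert np.all(s[:-1] >= s[1:])               # σ_1 ≥ σ_2 ≥ ...
print(np.sum(s > 1e-12), "==", np.linalg.matrix_rank(A))   # rank = number of non-zero σ_i
```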

Woah! Column space, row space? What?

Actually, I’ll call this –

An Exposition on the Fundamental Subspaces of a Linear Transformation

The most profound insights into the nature of this transformation are revealed by analyzing four special vector subspaces that are intrinsically associated with the matrix \mathbf{A}. These are the four fundamental subspaces. Understanding them provides a complete geometric and algebraic picture of the action of the matrix.

Subspace 1: The Column Space (or Range)

The column space of a matrix \mathbf{A} \in \mathbb{R}^{m \times n}, denoted \text{Col}(\mathbf{A}) or \mathcal{R}(\mathbf{A}), is the set of all possible linear combinations of its column vectors. If the columns of \mathbf{A} are {\mathbf{a}_1, \dots, \mathbf{a}_n}, where each \mathbf{a}_j \in \mathbb{R}^m, then:

\text{Col}(\mathbf{A}) = \text{span}\{\mathbf{a}_1, \dots, \mathbf{a}_n\} = \left\{ \sum_{j=1}^n c_j \mathbf{a}_j \mid c_j \in \mathbb{R} \right\}

Notice how the column space is the set of all vectors \mathbf{b} for which the system \mathbf{A}\mathbf{x} = \mathbf{b} has a solution.

Subspace 2: The Null Space (or Kernel)

The null space of a matrix \mathbf{A} \in \mathbb{R}^{m \times n}, denoted \text{Null}(\mathbf{A}) or \mathcal{N}(\mathbf{A}), is the set of all solutions to the homogeneous linear system \mathbf{A}\mathbf{x} = \mathbf{0}.

\text{Null}(\mathbf{A}) = \{ \mathbf{x} \in \mathbb{R}^n \mid \mathbf{A}\mathbf{x} = \mathbf{0} \}

The null space answers the question: “What is the set of all input vectors that get crushed or annihilated by the transformation?”

For a system \mathbf{A}\mathbf{x} = \mathbf{b}, if a solution \mathbf{x}_p exists (a particular solution), then the set of all solutions is given by {\mathbf{x}_p + \mathbf{x}_n \mid \mathbf{x}_n \in \text{Null}(\mathbf{A})}.

The null space thus describes the ambiguity or non-uniqueness of the solution.

Subspace 3: The Row Space

The row space of \mathbf{A}, denoted \text{Row}(\mathbf{A}), is the span of its row vectors. Equivalently, it is the column space of the transpose: \text{Row}(\mathbf{A}) = \text{Col}(\mathbf{A}^\top).

Subspace 4: The Left Null Space

The left null space of a matrix \mathbf{A} \in \mathbb{R}^{m \times n} is the set of all solutions to the homogeneous system \mathbf{A}^\top\mathbf{y} = \mathbf{0}.

\text{Null}(\mathbf{A}^\top) = \{ \mathbf{y} \in \mathbb{R}^m \mid \mathbf{A}^\top\mathbf{y} = \mathbf{0} \}

It is called the “left” null space because the equation can be written as \mathbf{y}^\top\mathbf{A} = \mathbf{0}^\top, where the vector \mathbf{y} is on the left.

Notice how the left null space contains all vectors in the output space that are orthogonal to every vector in the column space.

It represents the part of the codomain that is “unreachable” by the transformation.

The Rank-Nullity Theorem

Theorem (Rank-Nullity Theorem): For any matrix \mathbf{A} \in \mathbb{R}^{m \times n}:

\dim(\text{Col}(\mathbf{A})) + \dim(\text{Null}(\mathbf{A})) = n

or, using the definition of rank:

\text{rank}(\mathbf{A}) + \dim(\text{Null}(\mathbf{A})) = n

Interpretation: The dimension of the input space (n) is partitioned between two subspaces: the part that is mapped to the column space (the row space, whose dimension is the rank) and the part that is crushed to zero (the null space).

The Fundamental Theorem of Linear Algebra

Part 1: Dimensions

By applying the Rank-Nullity theorem to both \mathbf{A} and its transpose \mathbf{A}^\top, we arrive at a complete description of the dimensions of all four fundamental subspaces:

  • \dim(\text{Col}(\mathbf{A})) = r
  • \dim(\text{Row}(\mathbf{A})) = r
  • \dim(\text{Null}(\mathbf{A})) = n - r
  • \dim(\text{Null}(\mathbf{A}^\top)) = m - r

Part 2: Geometry

  • The row space and the null space are orthogonal complements in the input space \mathbb{R}^n.
  • The column space and the left null space are orthogonal complements in the output space \mathbb{R}^m.
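
The SVD hands you orthonormal bases for all four subspaces in one shot: the first r columns of \mathbf{U} span the column space, the remaining columns span the left null space, the first r columns of \mathbf{V} span the row space, and the remaining columns span the null space. Here is a small sketch (mine, on a deliberately rank-deficient random matrix) verifying the dimensions and the orthogonality claims above.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, true_rank = 5, 4, 2
A = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))  # rank 2 by construction

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))      # numerical rank

col_space  = U[:, :r]           # basis of Col(A),    dim r
left_null  = U[:, r:]           # basis of Null(Aᵀ),  dim m - r
row_space  = Vt[:r].T           # basis of Row(A),    dim r
null_space = Vt[r:].T           # basis of Null(A),   dim n - r

assert np.allclose(A @ null_space, 0)            # A annihilates its null space
assert np.allclose(A.T @ left_null, 0)           # Aᵀ annihilates the left null space
assert np.allclose(row_space.T @ null_space, 0)  # Row(A) ⟂ Null(A)
assert np.allclose(col_space.T @ left_null, 0)   # Col(A) ⟂ Null(Aᵀ)
print(r, n - r, m - r)                           # 2, 2, 3 -- the rank-nullity bookkeeping
```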

Theoretical development

From eigendecomposition

Consider the matrix \mathbf{S}_1 = \mathbf{A}^\top\mathbf{A} \in \mathbb{R}^{n \times n}. By construction, \mathbf{S}_1 is symmetric (\mathbf{S}_1^\top = (\mathbf{A}^\top\mathbf{A})^\top = \mathbf{A}^\top(\mathbf{A}^\top)^\top = \mathbf{A}^\top\mathbf{A} = \mathbf{S}_1) and positive semi-definite (\mathbf{x}^\top\mathbf{S}_1\mathbf{x} = \mathbf{x}^\top\mathbf{A}^\top\mathbf{A}\mathbf{x} = (\mathbf{A}\mathbf{x})^\top(\mathbf{A}\mathbf{x}) = |\mathbf{A}\mathbf{x}|_2^2 \ge 0).

By the Spectral Theorem, any real symmetric matrix has a full set of orthonormal eigenvectors and real eigenvalues; positive semi-definiteness further guarantees that those eigenvalues are non-negative. Therefore, \mathbf{S}_1 admits an eigendecomposition:

\mathbf{A}^\top\mathbf{A} = \mathbf{V}\mathbf{D}\mathbf{V}^\top

where \mathbf{V} is an orthogonal matrix of eigenvectors and \mathbf{D} is a diagonal matrix of eigenvalues \lambda_i \ge 0.

Now, substitute the SVD form \mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top into this expression:

\mathbf{A}^\top\mathbf{A} = (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top)^\top (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top) = (\mathbf{V}\mathbf{\Sigma}^\top\mathbf{U}^\top) (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top)

Since \mathbf{U} is orthogonal, \mathbf{U}^\top\mathbf{U} = \mathbf{I}_m, which simplifies the expression to:

\mathbf{A}^\top\mathbf{A} = \mathbf{V}(\mathbf{\Sigma}^\top\mathbf{\Sigma})\mathbf{V}^\top

Comparing this with the eigendecomposition \mathbf{V}\mathbf{D}\mathbf{V}^\top, we establish the following fundamental relationships:

  • The right-singular vectors of \mathbf{A} (the columns of \mathbf{V}) are the eigenvectors of \mathbf{A}^\top\mathbf{A}.
  • The eigenvalues of \mathbf{A}^\top\mathbf{A} are the squares of the singular values of \mathbf{A}, i.e., \lambda_i = \sigma_i^2.

A symmetric argument applies to the matrix \mathbf{S}_2 = \mathbf{A}\mathbf{A}^\top \in \mathbb{R}^{m \times m}:

\mathbf{A}\mathbf{A}^\top = (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top)(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top)^\top = \mathbf{U}(\mathbf{\Sigma}\mathbf{\Sigma}^\top)\mathbf{U}^\top

This reveals that:

  • The left-singular vectors of \mathbf{A} (the columns of \mathbf{U}) are the eigenvectors of \mathbf{A}\mathbf{A}^\top.
  • The non-zero eigenvalues of \mathbf{A}\mathbf{A}^\top are identical to the non-zero eigenvalues of \mathbf{A}^\top\mathbf{A}, and are also equal to \sigma_i^2.
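
These relationships are easy to check numerically. A sketch of mine (random 4×3 matrix, NumPy's eigh for the symmetric eigenproblem):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(A)
V = Vt.T

# Columns of V are eigenvectors of AᵀA with eigenvalues σ_i².
assert np.allclose(A.T @ A @ V, V @ np.diag(s**2))

# Columns of U are eigenvectors of AAᵀ; the extra m - r eigenvalues are 0.
assert np.allclose(A @ A.T @ U, U @ np.diag(np.append(s**2, 0.0)))

# eigh on AᵀA returns the same eigenvalues (in ascending order, so flip them).
assert np.allclose(np.linalg.eigh(A.T @ A)[0][::-1], s**2)
```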

Singular values and geometry

The relationship between the components can be summarized by the singular value equations, derived by right-multiplying \mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top by \mathbf{V}:

\mathbf{A}\mathbf{V} = \mathbf{U}\mathbf{\Sigma}

Considering this equation one column at a time, for i=1, \dots, \min(m, n):

\mathbf{A}\mathbf{v}_i = \sigma_i \mathbf{u}_i

This equation provides the geometric interpretation:

  • The linear transformation \mathbf{A} maps the i-th right-singular vector (an orthonormal basis vector in the domain) to a scaled version of the i-th left-singular vector (an orthonormal basis vector in the codomain).
  • The scaling factor is the i-th singular value.
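
Column by column, that is a one-line check (my sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(A)

for i in range(len(s)):                             # i = 1, ..., min(m, n)
    assert np.allclose(A @ Vt[i], s[i] * U[:, i])   # A v_i = σ_i u_i
```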

Reduced SVD

What if m \gg n or n \gg m? We use the reduced version!

  • Full SVD: The form described above, where \mathbf{U} is m \times m and \mathbf{V} is n \times n.
  • Reduced SVD: If m > n, the matrix \mathbf{\Sigma} has m-n rows of all zeros. The last m-n columns of \mathbf{U} only ever multiply these zero rows, so they contribute nothing to the final product. The reduced SVD drops these zero rows and the corresponding columns of \mathbf{U}:

\mathbf{A} = \mathbf{U}_r \mathbf{\Sigma}_r \mathbf{V}_r^\top

where \mathbf{U}_r \in \mathbb{R}^{m \times n}, \mathbf{\Sigma}_r \in \mathbb{R}^{n \times n} (now a square diagonal matrix), and \mathbf{V}_r = \mathbf{V} \in \mathbb{R}^{n \times n}. This is a more memory-efficient representation.
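
In NumPy this is the full_matrices=False flag of np.linalg.svd; a quick sketch (my example, a tall random matrix) showing the shape difference and that the product is unchanged:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((1000, 20))                  # m >> n

U_full, s, Vt = np.linalg.svd(A)                     # full SVD
U_red, s_red, Vt_red = np.linalg.svd(A, full_matrices=False)   # reduced SVD

print(U_full.shape, U_red.shape)                         # (1000, 1000) vs (1000, 20)
assert np.allclose(U_red @ np.diag(s_red) @ Vt_red, A)   # the product is unchanged
```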

Applications

Low-Rank Approximation and Principal Component Analysis (PCA)

The expression \mathbf{A} = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top represents \mathbf{A} as a weighted sum of rank-one matrices. The Eckart-Young-Mirsky Theorem states that the optimal rank-k approximation of \mathbf{A} (the matrix \mathbf{A}_k that minimizes |\mathbf{A} - \mathbf{A}_k|) is found by truncating this sum:

\mathbf{A}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top

What does this mean? You keep only the rank-one terms corresponding to the k largest singular values :) This is what’s used in PCA! (PCA, Spectral Clustering and t-SNE)
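
Here is a rough sketch of that PCA connection (entirely my own toy data, not from the sources): center the data matrix, take its SVD, and the top right-singular vectors are the principal directions, matching the leading eigenvectors of the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # 200 samples, 5 correlated features
Xc = X - X.mean(axis=0)                      # PCA operates on centered data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
scores = Xc @ Vt[:k].T                       # coordinates along the top-k principal directions

# The leading right-singular vector is the leading eigenvector of the sample
# covariance, with eigenvalue σ_1² / (N - 1).
cov = Xc.T @ Xc / (len(Xc) - 1)
assert np.allclose(cov @ Vt[0], (s[0] ** 2 / (len(Xc) - 1)) * Vt[0])
print(scores.shape)                          # (200, 2)
```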

Moore-Penrose Pseudoinverse and Linear Regression

For any matrix \mathbf{A}, the SVD provides its pseudoinverse: \mathbf{A}^\dagger = \mathbf{V}\mathbf{\Sigma}^\dagger\mathbf{U}^\top, where \mathbf{\Sigma}^\dagger is formed by taking the reciprocal of the non-zero singular values in \mathbf{\Sigma} and transposing.

In ML: The solution to the linear least-squares problem, \min_{\mathbf{x}} |\mathbf{A}\mathbf{x} - \mathbf{b}|_2^2, is given by \mathbf{x} = \mathbf{A}^\dagger\mathbf{b}.
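
A minimal sketch of that claim (mine, on a made-up noisy overdetermined system), comparing \mathbf{A}^\dagger\mathbf{b} against NumPy's reference least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((50, 3))                    # 50 equations, 3 unknowns
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.standard_normal(50)     # noisy observations

x_pinv = np.linalg.pinv(A) @ b                      # x = A⁺ b
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]        # NumPy's least-squares solver

assert np.allclose(x_pinv, x_ref)
print(x_pinv)                                       # close to x_true
```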

Numerical Stability and Condition Number

Remember? I spoke about this in Matrix Decompositions 2? Condition number, that is.

The condition number of a matrix \mathbf{A} measures the sensitivity of the solution of \mathbf{A}\mathbf{x}=\mathbf{b} to perturbations in the data. It is defined as

\kappa(\mathbf{A}) = \frac{\sigma_1}{\sigma_r} \quad (\text{the ratio of the largest to the smallest non-zero singular value})
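
A sketch (mine) checking that this ratio is exactly what np.linalg.cond reports for the 2-norm:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((4, 4))
s = np.linalg.svd(A, compute_uv=False)

kappa = s[0] / s[-1]                           # σ_1 / σ_r (A is full rank here)
assert np.allclose(kappa, np.linalg.cond(A))   # NumPy's 2-norm condition number
print("condition number:", kappa)              # large κ => small data errors can blow up in x
```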

Aside: Eckart-Young-Mirsky Theorem

Matrix norms

Spectral Norm: The maximum scaling factor the matrix can apply to any unit vector: |\mathbf{M}|_2 = \max_{|\mathbf{x}|_2=1} |\mathbf{M}\mathbf{x}|_2. It can be proven that this is equal to the largest singular value, |\mathbf{M}|_2 = \sigma_1.

Frobenius Norm: The matrix equivalent of the vector Euclidean norm, defined as the square root of the sum of the squares of all its elements: |\mathbf{M}|_F = \sqrt{\sum_{i,j} M_{ij}^2}. It can be proven that this is equal to the square root of the sum of the squares of its singular values: |\mathbf{M}|_F = \sqrt{\sum_{i=1}^r \sigma_i^2}.
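
Both identities are a two-line numerical check (my sketch):

```python
import numpy as np

rng = np.random.default_rng(9)
M = rng.standard_normal((6, 4))
s = np.linalg.svd(M, compute_uv=False)

assert np.allclose(np.linalg.norm(M, 2), s[0])                        # spectral norm = σ_1
assert np.allclose(np.linalg.norm(M, 'fro'), np.sqrt(np.sum(s**2)))   # Frobenius norm = sqrt(Σ σ_i²)
```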

Theorem

Let the SVD of \mathbf{M} be \mathbf{M} = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top. The best rank-k approximation to \mathbf{M} (for k \le r), denoted \mathbf{M}_k, is the one that minimizes the approximation error |\mathbf{M} - \mathbf{M}_k|. For both the spectral and Frobenius norms, this minimum is achieved by the truncated SVD sum:

\mathbf{M}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top

The approximation error is given by:

  • |\mathbf{M} - \mathbf{M}_k|_2 = \sigma_{k+1}
  • |\mathbf{M} - \mathbf{M}_k|_F = \sqrt{\sum_{i=k+1}^r \sigma_i^2}

This theorem guarantees that truncating the SVD provides the most compact linear approximation of a matrix for a given rank, capturing the “most important parts of M” by retaining the components with the largest singular values.
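
And the error formulas themselves can be verified in a few lines (my sketch, random 8×6 matrix, k = 3):

```python
import numpy as np

rng = np.random.default_rng(10)
M = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 3
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # truncated SVD = best rank-k approximation

assert np.allclose(np.linalg.norm(M - M_k, 2), s[k])                           # σ_{k+1}
assert np.allclose(np.linalg.norm(M - M_k, 'fro'), np.sqrt(np.sum(s[k:]**2)))  # sqrt of the tail
```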

Aside: Intuitions for SVD

I highly suggest looking up Six (and a half) intuitions for SVD, but here are some things that aren’t directly covered earlier in this blog:

Theoretical development for the Moore-Penrose inverse

We substitute the SVD of \mathbf{M} into the objective function and use the property that the L2 norm is invariant under orthogonal transformations:

\begin{aligned} |\mathbf{M}\mathbf{x} - \mathbf{b}|_2^2 &= |\mathbf{U}\mathbf{S}\mathbf{V}^\top\mathbf{x} - \mathbf{b}|_2^2 \\ &= |\mathbf{U}^\top(\mathbf{U}\mathbf{S}\mathbf{V}^\top\mathbf{x} - \mathbf{b})|_2^2 \\ &= |\mathbf{S}\mathbf{V}^\top\mathbf{x} - \mathbf{U}^\top\mathbf{b}|_2^2 \end{aligned}

By defining a change of variables \mathbf{x}' = \mathbf{V}^\top\mathbf{x} and \mathbf{b}' = \mathbf{U}^\top\mathbf{b}, the problem becomes \min_{\mathbf{x}'} |\mathbf{S}\mathbf{x}' - \mathbf{b}'|_2^2. Since \mathbf{S} is diagonal, this is a sum of independent squared errors:

\sum_{i=1}^r (\sigma_i x'_i - b'_i)^2 + \sum_{i=r+1}^m (b'_i)^2

The minimum is achieved when \sigma_i x'_i - b'_i = 0 for i=1,\dots,r, yielding x'_i = b'_i / \sigma_i. For i > r, x'_i is unconstrained as \sigma_i=0; the minimum-norm solution sets these to zero. Transforming back gives the solution:

\mathbf{x} = \mathbf{V}\mathbf{x}' = \sum_{i=1}^{r} \frac{b'_i}{\sigma_i}\mathbf{v}_i = \sum_{i=1}^{r} \frac{\mathbf{u}_i^\top \mathbf{b}}{\sigma_i}\mathbf{v}_i = \mathbf{M}^\dagger \mathbf{b}

where \mathbf{M}^\dagger = \mathbf{V}\mathbf{S}^\dagger\mathbf{U}^\top is the Moore-Penrose Pseudoinverse.
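
To close the loop, here is a sketch (mine) that builds \mathbf{M}^\dagger explicitly as \mathbf{V}\mathbf{S}^\dagger\mathbf{U}^\top for a rank-deficient matrix and checks that it matches np.linalg.pinv and returns the minimum-norm least-squares solution. The cutoff tolerance is my choice, applied consistently to all three computations.

```python
import numpy as np

rng = np.random.default_rng(11)
M = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))   # 6 x 5, rank 4 (rank-deficient)
b = rng.standard_normal(6)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
cutoff = 1e-10 * s[0]                         # tolerance for treating σ_i as zero (my choice)

s_inv = np.zeros_like(s)
s_inv[s > cutoff] = 1.0 / s[s > cutoff]       # reciprocal of the non-zero singular values only
M_pinv = Vt.T @ np.diag(s_inv) @ U.T          # M⁺ = V S⁺ Uᵀ

assert np.allclose(M_pinv, np.linalg.pinv(M, rcond=1e-10))

x = M_pinv @ b                                 # minimum-norm least-squares solution
x_ref = np.linalg.lstsq(M, b, rcond=1e-10)[0]  # same cutoff for a fair comparison
assert np.allclose(x, x_ref)
```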

Listing the intuitions

We have mostly covered all of these above, btw:

  1. Rotations and Scalings (The Geometric Picture)
  2. Best Rank-k Approximations
  3. Least Squares Regression
  4. Input and Output Directions
  5. Lost & Preserved Information
  6. PCA and Information Compression

Yay, we are (mostly) done with matrix decompositions!

Sources

  1. Mathematics for Machine Learning (link)
  2. Six (and a half) intuitions for SVD (link)
