  1. What’s there?
    1. Why do I write this?
  2. Affine Spaces
    1. Definition(s)
    2. Very math stuff – what’s the use?
  3. Inner Products (of functions)
    1. Starting from dot products
    2. Theoretical Development
    3. Where can I see this in action?
  4. Orthogonal Projections
    1. What is it, exactly?
    2. Two Cases
    3. Projection onto a subspace
    4. Where have we seen this?
  5. Matrix Approximation
    1. Applications
  6. Sources

What’s there?

I want to try and cover all the other kinda trivial / small things related to matrices / LinAlg here. This covers

  1. Affine spaces (This post!)
  2. Inner products of functions (This post!)
  3. Orthogonal projections (This post!)
  4. Matrix approximation (This post!)
  5. Rotations and quaternions

Why do I write this?

I wanted to create an authoritative document I could refer back to later. I thought I might as well make it public (build in public, serve society, and that sort of stuff).

Affine Spaces

Definition(s)

An affine space A is a set of elements called points, together with a vector space V, and an operation of vector addition that acts on points. This operation, +: A \times V \to A, satisfies the following axioms:

  1. For any point P \in A, P + \mathbf{0} = P.
  2. For any point P \in A and any two vectors \mathbf{v}_1, \mathbf{v}_2 \in V, (P + \mathbf{v}_1) + \mathbf{v}_2 = P + (\mathbf{v}_1 + \mathbf{v}_2).
  3. For any two points P, Q \in A, there exists a unique vector \mathbf{v} \in V such that P + \mathbf{v} = Q. This unique vector is denoted \vec{PQ} = Q - P.

The vector space V is called the direction space associated with the affine space A.

Some other definitions

Affine Subspaces: An affine subspace L of an affine space A is a subset of A that is itself an affine space with a direction space U that is a vector subspace of V. More concretely, an affine subspace is formed by taking a vector subspace and translating it.

L = P_0 + U := \{ P_0 + \mathbf{u} \mid \mathbf{u} \in U \}

where P_0 \in A is a support point (an origin for the subspace) and U \subseteq V is the direction space. A vector space is a special case of an affine subspace where the support point is the origin.

Parametric Representation: If the direction space U has a basis \{\mathbf{b}_1, \dots, \mathbf{b}_k\}, then any point P in the affine subspace L can be uniquely represented in parametric form as:

P = P_0 + \sum_{i=1}^k \lambda_i \mathbf{b}_i, \quad \text{where } \lambda_i \in \mathbb{R}

Examples include lines (k=1) and planes (k=2). A hyperplane is an affine subspace of codimension 1 (i.e., of dimension n-1 in an n-dimensional space).
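
Here’s a tiny NumPy sketch of the parametric form (the support point and basis vectors are just numbers I made up):

```python
import numpy as np

# An affine plane (k = 2) in R^3: L = P0 + span{b1, b2}.
P0 = np.array([1.0, 2.0, 3.0])          # support point
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])              # columns are the basis vectors b1, b2 of U

def point_on_L(lambdas):
    """P = P0 + sum_i lambda_i * b_i."""
    return P0 + B @ lambdas

print(point_on_L(np.array([2.0, -1.0])))  # [3. 1. 4.]
```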

Affine Mappings: An affine map f: A \to B between two affine spaces is a function that preserves the structure of affine subspaces. It can be shown that any affine map can be uniquely decomposed into a linear map on the underlying vector spaces plus a translation. In a coordinate system, this takes the familiar form:

f(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{t}

where \mathbf{A} is the matrix of the underlying linear map and \mathbf{t} is a translation vector.
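
Here’s a minimal sketch (my own toy example, not from the book) of an affine map as “linear map plus translation”, and of the fact that it preserves affine combinations of points:

```python
import numpy as np

# A toy affine map f(x) = A x + t in R^2: rotate by 90 degrees, then translate.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # matrix of the underlying linear map
t = np.array([2.0, 3.0])      # translation vector

def f(x):
    return A @ x + t

print(f(np.array([1.0, 1.0])))  # [1. 4.]: (1, 1) rotates to (-1, 1), then shifts by (2, 3)

# Affine maps preserve affine combinations: f(l*p + (1-l)*q) == l*f(p) + (1-l)*f(q).
p, q, lam = np.array([0.0, 1.0]), np.array([4.0, -2.0]), 0.3
print(np.allclose(f(lam * p + (1 - lam) * q), lam * f(p) + (1 - lam) * f(q)))  # True
```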

Very math stuff – what’s the use?

Use 1: Support Vector Machines!

The separating hyperplane in a Support Vector Machine (SVM) is actually an affine subspace!

In a D-dimensional feature space, the hyperplane is defined by the set of points \mathbf{x} satisfying the equation:

\langle \mathbf{w}, \mathbf{x} \rangle + b = 0

Here, \mathbf{w} is the normal vector that defines the direction space’s orthogonal complement, and b is a scalar related to the translation from the origin.
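
To make the claim concrete, here’s a small sketch with made-up numbers: pick a support point x_0 on the hyperplane, take the direction space to be the orthogonal complement of \mathbf{w}, and check that x_0 + \mathbf{u} still satisfies the hyperplane equation.

```python
import numpy as np
from scipy.linalg import null_space

# A made-up hyperplane <w, x> + b = 0 in R^3.
w = np.array([1.0, 2.0, 2.0])
b = -3.0

# Support point: the point on the hyperplane closest to the origin.
x0 = -b * w / (w @ w)
print(np.isclose(w @ x0 + b, 0.0))   # True: x0 lies on the hyperplane

# Direction space U = orthogonal complement of w (dimension D - 1 = 2).
U = null_space(w[None, :])           # columns span {u : <w, u> = 0}

# Any x0 + U @ lambda stays on the hyperplane.
lam = np.array([0.7, -1.3])
print(np.isclose(w @ (x0 + U @ lam) + b, 0.0))   # True
```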

Inner Products (of functions)

Starting from dot products

The concept of an inner product on a space of functions is a direct generalization of the dot product on finite-dimensional Euclidean space \mathbb{R}^n.

  1. The dot product in \mathbb{R}^n is defined as \langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top\mathbf{y} = \sum_{i=1}^n x_i y_i. It satisfies the axioms of an inner product: bilinearity, symmetry, and positive definiteness.
  2. Now, a function f: [a,b] \to \mathbb{R} can be conceptualized as an infinite-dimensional vector, where each point x \in [a,b] corresponds to a component with value f(x). In this view, the discrete sum of the dot product becomes a continuous integral.
  3. Further, to ensure the integrals are well-defined, we consider a vector space of functions, typically the space of square-integrable functions on an interval [a,b], denoted L^2([a,b]).

Formal Definition: The inner product of two real-valued functions f(x) and g(x) in the space L^2([a,b]) is defined as:

\langle f, g \rangle := \int_a^b f(x) g(x) dx

More generally, a weight function w(x) > 0 can be introduced: \langle f, g \rangle_w = \int_a^b f(x) g(x) w(x) dx.
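
Here’s a quick numerical sketch of this definition (the example functions are my own picks), approximating the integral with SciPy’s quad:

```python
import numpy as np
from scipy.integrate import quad

def inner(f, g, a, b, w=lambda x: 1.0):
    """Approximate <f, g>_w = integral from a to b of f(x) g(x) w(x) dx."""
    val, _ = quad(lambda x: f(x) * g(x) * w(x), a, b)
    return val

# <x, x^2> on [0, 1] = integral of x^3 dx = 1/4.
print(inner(lambda x: x, lambda x: x**2, 0.0, 1.0))   # ~0.25

# sin and cos are orthogonal on [-pi, pi].
print(inner(np.sin, np.cos, -np.pi, np.pi))           # ~0.0
```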

Theoretical Development

Gonna info dump some things –

  • Induced Norm: The inner product induces a norm, known as the L^2-norm, which measures the “length” or “magnitude” of a function:

|f|_{L^2} = \sqrt{\langle f, f \rangle} = \left(\int_a^b f(x)^2 dx\right)^{1/2}

  • Orthogonality of Functions: Two functions f and g are orthogonal if their inner product is zero:

\langle f, g \rangle = \int_a^b f(x) g(x) dx = 0

  • Orthogonal Basis and Fourier Series: A set of functions \{\phi_k(x)\} is an orthogonal basis if its members are mutually orthogonal and the set spans the function space.

Where can I see this in action?

Use 1: Fourier Series

Functional analysis says that many well-behaved functions can be represented as a linear combination of orthogonal basis functions.

E.g., the Fourier series, which represents a periodic function as a sum of sines and cosines; these form an orthogonal set on the interval [-\pi, \pi].

The coefficient for each basis function \phi_k is found by projecting f onto it: c_k = \frac{\langle f, \phi_k \rangle}{|\phi_k|^2}.
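
As a sketch (the square wave is my own choice of example), this projection formula recovers the textbook Fourier coefficients numerically:

```python
import numpy as np
from scipy.integrate import quad

def inner(f, g, a=-np.pi, b=np.pi):
    val, _ = quad(lambda x: f(x) * g(x), a, b)
    return val

f = np.sign   # square wave on [-pi, pi]

# c_k = <f, phi_k> / ||phi_k||^2 with phi_k(x) = sin(k x).
for k in range(1, 6):
    phi = lambda x, k=k: np.sin(k * x)
    print(k, inner(f, phi) / inner(phi, phi))   # ~4/(k*pi) for odd k, ~0 for even k
```

This matches the classical expansion of the square wave, \text{sign}(x) \approx \sum_{k \text{ odd}} \frac{4}{k\pi}\sin(kx).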

Beyond this, you can actually show that solutions in kernel methods, models of unknown functions in robotics (e.g., friction), etc., can be written in terms of inner products of functions.

Orthogonal Projections

What is it, exactly?

An orthogonal projection is a mapping that takes a vector and finds its closest point within a given subspace.

It is the formalization of the geometric idea of casting a shadow.

Ok, fine, but what is it?

Given a vector space V and a subspace U \subseteq V, for any vector \mathbf{v} \in V, we want to find the unique vector \mathbf{u} \in U that is closest to \mathbf{v}. This is equivalent to minimizing the distance:

\text{proj}_U(\mathbf{v}) = \arg\min_{\mathbf{u} \in U} |\mathbf{v} - \mathbf{u}|

The solution to this minimization problem occurs when the “error” vector, \mathbf{v} - \mathbf{u}, is orthogonal to the subspace U. That is, \langle \mathbf{v} - \mathbf{u}, \mathbf{z} \rangle = 0 for all \mathbf{z} \in U.

This is basically the Geometric Condition!

Two Cases

You can either

  1. Project onto a line, or,
  2. Project onto a subspace

Projection onto a Line: The simplest case is projecting a vector \mathbf{v} onto the line spanned by a single non-zero vector \mathbf{b}. The projection is a scaled version of \mathbf{b}, \text{proj}_{\mathbf{b}}(\mathbf{v}) = \lambda \mathbf{b}. Using the orthogonality condition:

\langle \mathbf{v} - \lambda\mathbf{b}, \mathbf{b} \rangle = 0 \implies \langle \mathbf{v}, \mathbf{b} \rangle - \lambda\langle \mathbf{b}, \mathbf{b} \rangle = 0 \implies \lambda = \frac{\langle \mathbf{v}, \mathbf{b} \rangle}{|\mathbf{b}|^2}

The projection is thus \text{proj}_{\mathbf{b}}(\mathbf{v}) = \frac{\langle \mathbf{v}, \mathbf{b} \rangle}{|\mathbf{b}|^2} \mathbf{b}. In matrix form for the standard dot product in \mathbb{R}^n, this is represented by the projection matrix: \mathbf{P} = \frac{\mathbf{b}\mathbf{b}^\top}{\mathbf{b}^\top\mathbf{b}}.

Hold up here. Notice that \text{proj}_{\mathbf{b}}(\mathbf{v}) is a vector. This is the vector of the projection. The magnitude of the projection, just the “size” of the shadow so to say, is the scalar component:

\text{comp}_{\mathbf{b}}(\mathbf{v}) = \frac{\langle \mathbf{v}, \mathbf{b} \rangle}{|\mathbf{b}|}
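
A minimal NumPy sketch of both quantities (the vectors are arbitrary):

```python
import numpy as np

b = np.array([2.0, 0.0, 0.0])   # direction of the line
v = np.array([3.0, 4.0, 0.0])   # vector being projected

# Vector projection: proj_b(v) = (<v, b> / ||b||^2) b
proj = (v @ b) / (b @ b) * b
print(proj)                      # [3. 0. 0.]

# Scalar component ("size of the shadow"): comp_b(v) = <v, b> / ||b||
print((v @ b) / np.linalg.norm(b))   # 3.0

# Same thing via the projection matrix P = b b^T / (b^T b).
P = np.outer(b, b) / (b @ b)
print(np.allclose(P @ v, proj))  # True
```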

Projection onto a Subspace: This generalizes to projecting a vector \mathbf{b} \in \mathbb{R}^m onto the subspace U spanned by the columns of a matrix \mathbf{A} \in \mathbb{R}^{m \times n} (with linearly independent columns). The projection \mathbf{p} lies in the column space, so \mathbf{p} = \mathbf{A}\hat{\mathbf{x}} for some coefficient vector \hat{\mathbf{x}}. The error vector \mathbf{b} - \mathbf{A}\hat{\mathbf{x}} must be orthogonal to the column space of \mathbf{A}, which means it must lie in the left null space of \mathbf{A}:

\mathbf{A}^\top(\mathbf{b} - \mathbf{A}\hat{\mathbf{x}}) = \mathbf{0} \implies \mathbf{A}^\top\mathbf{A}\hat{\mathbf{x}} = \mathbf{A}^\top\mathbf{b}

This is the normal equation. Solving for \hat{\mathbf{x}} and plugging back gives the projection \mathbf{p}:

\hat{\mathbf{x}} = (\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{b} \implies \mathbf{p} = \mathbf{A}(\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{b}

The projection matrix is \mathbf{P} = \mathbf{A}(\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top. If the columns of \mathbf{A} are orthonormal, then \mathbf{A}^\top\mathbf{A} = \mathbf{I}, and the formula simplifies to \mathbf{P} = \mathbf{A}\mathbf{A}^\top.

Wait, this doesn’t make sense directly :( Ok, imma write it again!

Projection onto a subspace

Yes, its own section.

| Concept | Projection onto a Line (1D) | Projection onto a Subspace (nD) |
| --- | --- | --- |
| The “Ground” | A single line defined by one vector \mathbf{b}. | A “floor” or subspace defined by multiple basis vectors, which are the columns of a matrix \mathbf{A}. |
| The “Shadow” | The projection \mathbf{p} is on the line, so it’s a scaled version of \mathbf{b}. | The projection \mathbf{p} is on the “floor,” so it’s a linear combination of the basis vectors in \mathbf{A}. |
| Orthogonality | The error vector (\mathbf{b} - \mathbf{p}) must be orthogonal to the line’s direction vector \mathbf{b}. | The error vector (\mathbf{b} - \mathbf{p}) must be orthogonal to the entire subspace. This means it must be orthogonal to every column of \mathbf{A}. |

Now, let’s derive!

State the Goal: We want to find the projection \mathbf{p}, which we know is a combination of \mathbf{A}‘s columns: \mathbf{p} = \mathbf{A}\hat{\mathbf{x}}.

Use the Orthogonality Condition: The error vector (\mathbf{b} - \mathbf{p}) must be orthogonal to the column space of \mathbf{A}.

\mathbf{A}^\top(\mathbf{b} - \mathbf{p}) = \mathbf{0}

Substitute \mathbf{p}: Replace \mathbf{p} with what we know it is, \mathbf{A}\hat{\mathbf{x}}.

\mathbf{A}^\top(\mathbf{b} - \mathbf{A}\hat{\mathbf{x}}) = \mathbf{0}

Solve for the Coefficients \hat{\mathbf{x}}: Distribute \mathbf{A}^\top to get the normal equation. This equation’s name comes from the fact that it enforces the “normal” (orthogonal) condition.

\mathbf{A}^\top\mathbf{b} - \mathbf{A}^\top\mathbf{A}\hat{\mathbf{x}} = \mathbf{0}

\mathbf{A}^\top\mathbf{A}\hat{\mathbf{x}} = \mathbf{A}^\top\mathbf{b}

Now, solve for the coefficient vector \hat{\mathbf{x}}:

\hat{\mathbf{x}} = (\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{b}

This formula gives us the perfect “recipe” of coefficients to create the projection.

Find the Projection Vector \mathbf{p}: Now that we have the recipe (\hat{\mathbf{x}}), we can build the projection vector \mathbf{p}.

\mathbf{p} = \mathbf{A}\hat{\mathbf{x}} = \mathbf{A}(\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{b}

The projection matrix is the part that does all the work: \mathbf{P} = \mathbf{A}(\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top. It’s a single matrix that takes any vector \mathbf{b} and spits out its projection onto the subspace defined by \mathbf{A}.
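
Here’s the whole recipe in NumPy (the matrix and vector are made up for illustration; in practice you’d solve the normal equation rather than explicitly inverting \mathbf{A}^\top\mathbf{A}):

```python
import numpy as np

# Columns of A span a 2D subspace of R^3; b is the vector we want to project.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Normal equation: A^T A x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Projection p = A x_hat, and the projection matrix P = A (A^T A)^{-1} A^T.
p = A @ x_hat
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ b, p))            # True: P does the whole job in one shot
print(np.allclose(A.T @ (b - p), 0.0))  # True: the error is orthogonal to col(A)
print(np.allclose(P @ P, P))            # True: projecting twice changes nothing
```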

Where have we seen this?

The solution to the least-squares problem, \min_{\boldsymbol{\theta}} |\mathbf{y} - \mathbf{X}\boldsymbol{\theta}|_2^2, is found by solving the normal equation \mathbf{X}^\top\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^\top\mathbf{y}.

This solution, \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\theta}}, is the orthogonal projection of the observed target vector \mathbf{y} onto the column space of the design matrix \mathbf{X}.
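
A quick sketch tying the two together (synthetic data, variable names are mine): np.linalg.lstsq returns the same \boldsymbol{\theta} as the normal equation here, and the fitted values are exactly the projection \mathbf{P}\mathbf{y}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: y is roughly linear in x, plus noise.
n = 50
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])        # design matrix with a bias column
y = 2.0 + 3.0 * x + 0.1 * rng.normal(size=n)

theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution

# Fitted values = orthogonal projection of y onto the column space of X.
P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = X @ theta
print(np.allclose(y_hat, P @ y))            # True

# Residuals are orthogonal to every column of X.
print(np.allclose(X.T @ (y - y_hat), 0.0))  # True
```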

Matrix Approximation

I just wanted to add this as a heading since it’s not directly evident that this is just the EYM theorem discussed in depth within Matrix Decompositions 3. Anyway, here’s the statement again –

Theorem (Eckart-Young-Mirsky): Let the SVD of a matrix \mathbf{M} \in \mathbb{R}^{m \times n} of rank r be \mathbf{M} = \mathbf{U}\mathbf{S}\mathbf{V}^\top = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top. The best rank-k approximation to \mathbf{M} (for k < r) that minimizes the error |\mathbf{M} - \mathbf{M}_k| (in both spectral and Frobenius norms) is given by the truncated SVD sum:

\mathbf{M}_k = \arg\min_{\text{rank}(\mathbf{B})=k} |\mathbf{M} - \mathbf{B}| = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top

The theorem states that to get the best approximation, one should keep the components corresponding to the largest singular values and discard the rest. The singular values quantify the “importance” of each rank-one component.
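
As a sketch (a random low-rank-plus-noise matrix, my own setup), the truncated SVD sum and its Frobenius error behave exactly as the theorem says:

```python
import numpy as np

rng = np.random.default_rng(0)

# A matrix that is approximately rank 3: rank-3 signal plus small noise.
M = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 30)) + 0.01 * rng.normal(size=(40, 30))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

def rank_k_approx(k):
    """Best rank-k approximation: keep only the top k singular triplets."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

for k in (1, 2, 3, 4):
    err = np.linalg.norm(M - rank_k_approx(k))   # Frobenius norm of the error
    print(k, err, np.sqrt(np.sum(s[k:] ** 2)))   # equals sqrt of the discarded sigma_i^2
```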

Applications

Gosh, what’s not there?

  • Machine Learning (Principal Component Analysis): PCA is the most direct application of this theorem. It seeks the best rank-k approximation of a centered data matrix. The SVD provides both the optimal subspace (spanned by the first k right-singular vectors) and the coordinates of the data projected onto it; a small sketch follows this list.
  • Machine Learning (Recommender Systems): A user-item rating matrix is often very large, sparse, and assumed to be approximately low-rank. Low-rank matrix factorization, which is a form of matrix approximation, is used to fill in the missing entries. The SVD finds the optimal low-rank approximation, uncovering latent factors that describe user preferences and item characteristics.
  • AI (Natural Language Processing – Topic Modeling): In Latent Semantic Analysis (LSA), a large term-document matrix is constructed. Its low-rank approximation via SVD is used to find a “semantic space” where terms and documents with similar meanings are close together. The singular vectors represent abstract “topics.”
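
Here’s a minimal PCA-via-SVD sketch along the lines of the first bullet (synthetic data, variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points in R^5 that mostly live on a 2D subspace.
Z = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

# Center the data, then take the SVD.
Xc = Z - Z.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
components = Vt[:k]            # first k right-singular vectors = principal directions
scores = Xc @ components.T     # coordinates of the data in that k-dimensional subspace

# The rank-k reconstruction is exactly the truncated SVD of the centered data.
Xc_k = scores @ components
print(np.linalg.norm(Xc - Xc_k))                       # small: most variance captured
print(np.allclose(Xc_k, (U[:, :k] * s[:k]) @ Vt[:k]))  # True: same as the EYM optimum
```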

Sources

  1. Mathematics for Machine Learning (link)
