1. What’s there?
      1. Why do I write this?
    2. Axioms of Probability
    3. Random Variables
      1. PDF and PMF
    4. Conditional Probability, Independence, and the Chain Rule
      1. Chain Rule
    5. Bayes’ Theorem
      1. Terminology
      2. The theorem itself
    6. Expectation, Variance and Moments
      1. Moments
      2. Random facts about central moments
    7. Covariance and correlation
    8. Law of Large Numbers (LLN)
      1. Theoretical Development: Forms of the Law
      2. Why is this important in ML?
    9. Central Limit Theorem (CLT)
      1. Foundations
      2. Theoretical Development
      3. Why’s this so cool?
    10. Sources

    What’s there?

    I want to cover like some basics of probability before I get into Gaussians. Saliently, this covers LLN and CLT.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Axioms of Probability

    Imma just drone on about theory in this post.

    Sample Space (\Omega): The sample space is the set of all possible elementary outcomes of a random experiment. The outcomes \omega \in \Omega must be mutually exclusive, and the set of outcomes must be exhaustive.

    Event Space (\mathcal{F}): An event is a subset of the sample space \Omega. The event space, denoted \mathcal{F}, is a collection of events to which we can assign probabilities. It is not always the set of all possible subsets (the power set), but it must be a \sigma-algebra, which is a collection of subsets of \Omega that satisfies three properties –

    1. The empty set is included: \emptyset \in \mathcal{F}.
    2. Closure under complementation: If an event A \in \mathcal{F}, then its complement \Omega \setminus A is also in \mathcal{F}.
    3. Closure under countable unions: If A_1, A_2, \dots is a countable sequence of events in \mathcal{F}, then their union \bigcup_{i=1}^\infty A_i is also in \mathcal{F}.

    Probability Measure (P): The probability measure is a function that maps an event in the event space \mathcal{F} to a real number. This function must satisfy the Kolmogorov Axioms

    1. Non-negativity: The probability of any event is non-negative.

    P(A) \ge 0 \quad \forall A \in \mathcal{F}

    2. Normalization (Unit Measure): The probability of the entire sample space is 1. This means that some outcome is guaranteed to occur.

    P(\Omega) = 1

    3. Countable Additivity (or \sigma-additivity): For any countable sequence of pairwise disjoint events A_1, A_2, \dots in \mathcal{F} (i.e., A_i \cap A_j = \emptyset for i \neq j), the probability of their union is the sum of their individual probabilities.

    P\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty P(A_i)

    Random Variables

    A random variable X is a function X: \Omega \to \mathbb{R} that assigns a real number to every possible outcome \omega \in \Omega.

    The key technical requirement is that this function must be measurable, which ensures that sets of the form \{\omega \in \Omega \mid X(\omega) \le x\} are valid events in \mathcal{F}, allowing us to assign probabilities to them.

    PDF and PMF

    The behavior of a discrete random variable is described by its PMF, p_X(x), which gives the probability of the variable taking on a specific value x:

    p_X(x) = P(X=x) := P(\{\omega \in \Omega \mid X(\omega) = x\})

    The PMF satisfies p_X(x) \ge 0 and \sum_x p_X(x) = 1.

    The probability that a continuous random variable takes on any single specific value is zero. Instead, we describe its behavior with a PDF, f_X(x). The probability of the variable falling within an interval [a, b] is the integral of the PDF over that interval:

    P(a \le X \le b) = \int_a^b f_X(x) dx

    The PDF satisfies f_X(x) \ge 0 and \int_{-\infty}^\infty f_X(x) dx = 1. Note that, unlike a PMF, a PDF can have values greater than 1.

    Also gonna throw this in here

    A random variable is mixed if it has properties of both discrete and continuous variables.

    Its probability distribution is a mixture of a PMF (for the discrete parts) and a PDF (for the continuous parts).
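    A quick numerical sanity check of the two normalization facts above (this is my own minimal numpy sketch, not from the book; the Uniform(0, 0.5) density is picked only because it sits at 2 > 1):

```python
import numpy as np

# PMF of a fair six-sided die: non-negative and sums to 1.
pmf = np.full(6, 1 / 6)
print(pmf.min() >= 0, np.isclose(pmf.sum(), 1.0))  # True True

# PDF of Uniform(0, 0.5): f(x) = 2 on [0, 0.5], which is > 1,
# yet it still integrates to 1 over its support.
xs = np.linspace(0.0, 0.5, 100_001)
pdf = np.full_like(xs, 2.0)
integral = np.sum(pdf[:-1]) * (xs[1] - xs[0])  # left Riemann sum
print(integral)  # ~1.0
```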

    Conditional Probability, Independence, and the Chain Rule

    Conditional Probability: The conditional probability of event A occurring, given that event B has already occurred, is denoted P(A|B). It represents an update to our belief about A in light of the new information that B is true. It is defined as:

    P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad \text{provided } P(B) > 0

    Geometrically, this is a re-normalization of the probability space: we restrict our attention to the outcomes within event B and re-normalize their probabilities to sum to one.

    Statistical Independence: Two events A and B are statistically independent if the occurrence of one does not provide any information about the occurrence of the other. Mathematically (this is all I care about rn ngl), this means:

    P(A|B) = P(A) \quad \text{and} \quad P(B|A) = P(B)

    Substituting this into the definition of conditional probability yields the more common test for independence:

    P(A \cap B) = P(A)P(B)

    Two random variables X and Y are independent if this property holds for all possible values they can take: p(x, y) = p(x)p(y).

    Chain Rule

    In the language of random variables, for a set of variables X_1, \dots, X_n, the chain rule allows us to express their joint distribution as:

    p(x_1, \dots, x_n) = \prod_{i=1}^n p(x_i | x_1, \dots, x_{i-1})
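    Here's a tiny sketch of the factorization on a made-up joint over two binary variables (the table is purely hypothetical, just to show that p(x)p(y|x) reproduces p(x, y)):

```python
import numpy as np

# Hypothetical joint p(x, y) over two binary variables; rows = x, cols = y.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)              # marginal p(x)
p_y_given_x = joint / p_x[:, None]   # conditional p(y | x)

# Chain rule: p(x, y) = p(x) * p(y | x), recovered entry by entry.
reconstructed = p_x[:, None] * p_y_given_x
print(np.allclose(reconstructed, joint))  # True
```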

    Bayes’ Theorem

    Bayes’ theorem provides a formal mechanism for inverting conditional probabilities.

    Terminology

    • Hypothesis (H): A proposition about the world whose truth we are uncertain about (e.g., “The patient has the disease,” “This email is spam”).
    • Evidence (E): A new piece of data or observation that is relevant to the hypothesis.
    • Prior (P(H)): Our initial belief in the hypothesis before observing the evidence.
    • Likelihood (P(E|H)): The probability of observing the evidence if the hypothesis were true. This is the forward, generative model.
    • Posterior (P(H|E)): Our updated belief in the hypothesis after observing the evidence. This is the quantity we want to compute.

    The theorem itself

    P(H|E) = \frac{P(E|H)P(H)}{P(E)}

    • The Marginal Likelihood (or Evidence): The denominator, P(E), is the total probability of observing the evidence, averaged over all possible hypotheses. It is computed using the law of total probability: P(E) = P(E|H)P(H) + P(E|H^c)P(H^c). It acts as a normalization constant, ensuring that the posterior probabilities sum to one.

    The theorem can be written in a more memorable form using the terms:

    \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}
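    A worked example with made-up numbers (a 1% prior, a 95% true-positive rate, a 5% false-positive rate), just to see the formula in action:

```python
# Bayes' theorem for a hypothetical diagnostic test.
p_h = 0.01              # prior P(H): patient has the disease
p_e_given_h = 0.95      # likelihood P(E|H): test positive if diseased
p_e_given_not_h = 0.05  # false-positive rate P(E|H^c)

# Marginal likelihood via the law of total probability.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

posterior = p_e_given_h * p_h / p_e
print(round(posterior, 4))  # ~0.161: a positive test is far from conclusive
```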

    Expectation, Variance and Moments

    Imma just skip the basic stuff.

    This is the expectation of a function of a random variable:

    \mathbb{E}[g(X)] = \sum_{x} g(x) \, p(x) \quad \text{or} \quad \int_{-\infty}^{\infty} g(x) \, f(x) dx

    Moments

    Moments about the Origin: The k-th moment of a random variable X about the origin is defined as the expected value of X^k:

    \mu'_k = \mathbb{E}[X^k]

    The first moment, \mu'_1 = \mathbb{E}[X], is the mean.

    Central Moments (Moments about the Mean): The k-th central moment is the expected value of the k-th power of the deviation from the mean, (X-\mu)^k:

    \mu_k = \mathbb{E}[(X - \mu)^k]

    Random facts about central moments

    • The first central moment is always zero: \mu_1 = \mathbb{E}[X-\mu] = \mathbb{E}[X] - \mu = 0.
    • The second central moment, \mu_2 = \mathbb{E}[(X - \mu)^2], is the variance. It is the most important measure of the spread or dispersion of the distribution. It is denoted \text{Var}(X) or \sigma^2. The standard deviation, \sigma = \sqrt{\text{Var}(X)}, is the square root of the variance, returned to the original units of the variable.
    • The third central moment (normalized) is related to the skewness, which measures the asymmetry of the distribution.
    • The fourth central moment (normalized) is related to the kurtosis, which measures the “tailedness” or propensity for outliers.
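    To make the list above concrete, here's a small numpy sketch (my own) that estimates the first four moments from samples of an Exp(1) distribution, using the usual normalizations for skewness and kurtosis:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)

mu = x.mean()                               # first raw moment (mean)
var = ((x - mu) ** 2).mean()                # second central moment (variance)
skew = ((x - mu) ** 3).mean() / var ** 1.5  # normalized third central moment
kurt = ((x - mu) ** 4).mean() / var ** 2    # normalized fourth central moment

# For Exp(1): mean = 1, variance = 1, skewness = 2, kurtosis = 9 (6 in "excess" form).
print(mu, var, skew, kurt)
```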

    Covariance and correlation

    We’ve discussed covariance in PCA, Spectral Clustering and t-SNE.

    Now, notice that the magnitude of the covariance is difficult to interpret because it depends on the units and variances of the individual variables.

    A covariance of 100 might be very strong for one pair of variables but very weak for another.

    To create a standardized, unit-free measure of the linear relationship, we normalize the covariance by the standard deviations of the two variables. This gives the Pearson correlation coefficient, \rho

    \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}

    • \rho = 1: Perfect positive linear relationship.
    • \rho = -1: Perfect negative linear relationship.
    • \rho = 0: No linear relationship (uncorrelated).
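    A minimal sketch of the scale problem and its fix (my own toy data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(scale=0.5, size=10_000)

cov = np.cov(x, y)[0, 1]
rho = cov / (x.std(ddof=1) * y.std(ddof=1))

# Rescaling a variable changes the covariance but not the correlation.
print(cov, np.cov(100 * x, y)[0, 1])        # covariance blows up by ~100x
print(rho, np.corrcoef(100 * x, y)[0, 1])   # correlation stays the same
```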

    Law of Large Numbers (LLN)

    It says that the average of a large number of independent samples from a distribution converges to the theoretical expected value of that distribution.

    It is the theorem that guarantees that sampling works.

    Theoretical Development: Forms of the Law

    Weak Law of Large Numbers: This law states that for any arbitrarily small positive number \epsilon, the probability that the sample mean deviates from the true mean by more than \epsilon approaches zero as the sample size N goes to infinity.

    \lim_{N \to \infty} P(|\bar{X}_N - \mu| > \epsilon) = 0

    This is also called convergence in probability. It doesn’t guarantee that for a single long experiment the average will be close to the mean, but that it is overwhelmingly likely to be.

    Strong Law of Large Numbers: This law makes a much stronger statement. It asserts that, with probability 1, the sample mean will converge to the true mean.

    P(\lim_{N \to \infty} \bar{X}_N = \mu) = 1

    This is also called almost sure convergence. This guarantees that for a single, infinitely long experiment, the sample average will eventually settle down at the true population mean.
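    A quick simulation sketch (my own, with Exp(1) samples whose true mean is 1) showing the sample mean settling down:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=100_000)  # true mean is 1

# Running sample mean \bar{X}_N as N grows.
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (10, 100, 10_000, 100_000):
    print(n, running_mean[n - 1])
# The running mean drifts toward the true mean 1 as N increases.
```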

    Why is this important in ML?

    The LLN is the theoretical justification for Empirical Risk Minimization.

    When we minimize a loss function on a finite training set, we are computing an empirical average. The LLN gives us confidence that, with enough data, this empirical risk will be a good approximation of the true expected risk over the entire data distribution.

    Central Limit Theorem (CLT)

    Foundations

    • The Question: The LLN tells us where the sample mean \bar{X}_N is going (it converges to \mu). The CLT tells us how it gets there: it describes the shape of the probability distribution of the sample mean around the true mean for a large but finite N.
    • The Setup: Let X_1, X_2, \dots, X_N be a sequence of i.i.d. random variables with finite mean \mu and finite variance \sigma^2.

    Theoretical Development

    Theorem (Central Limit Theorem): As the sample size N becomes large, the distribution of the sample mean \bar{X}_N approaches a normal (Gaussian) distribution with mean \mu and variance \sigma^2/N.

    \bar{X}_N \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{N}\right)

    More formally, the standardized version of the sample mean converges in distribution to a standard normal distribution:

    \frac{\bar{X}_N - \mu}{\sigma/\sqrt{N}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } N \to \infty
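    Here's a minimal simulation sketch (mine, using Uniform(0, 1) samples, which are decidedly non-Gaussian) showing the standardized sample mean behaving like \mathcal{N}(0, 1):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50              # samples per mean
trials = 200_000    # number of sample means

# Uniform(0, 1): mu = 0.5, sigma^2 = 1/12.
means = rng.uniform(0.0, 1.0, size=(trials, N)).mean(axis=1)

# Standardize and compare with N(0, 1): mean ~ 0, std ~ 1.
z = (means - 0.5) / (np.sqrt(1 / 12) / np.sqrt(N))
print(z.mean(), z.std())
# Fraction within one standard deviation, vs. ~0.683 for a Gaussian.
print(np.mean(np.abs(z) < 1))
```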

    Why’s this so cool?

    1. The CLT is incredibly powerful because it holds true regardless of the shape of the original distribution of the X_i. Whether the individual variables are from a uniform, Bernoulli, exponential, or any other well-behaved distribution, the distribution of their average will always tend towards a Gaussian bell curve.
    2. This theorem explains the ubiquity of the Gaussian distribution in nature and science.

    Many complex phenomena can be modeled as the sum or average of many small, independent random effects. According to the CLT, the distribution of this aggregate phenomenon will be approximately Gaussian, even if the individual effects are not. For example, measurement error in an instrument is often the sum of many small, independent sources of error, so it is well-modeled by a Gaussian.

    Basically, Gaussians are common and this can be exploited.

    Sources

    1. Mathematics for Machine Learning (link)

    1. What’s there?
      1. Why do I write this?
    2. Rotations as linear transformations
      1. As always, basics!
      2. Rotations in 2D and 3D
    3. Pre- and Post-Multiplication: The Duality of Coordinate Frames
      1. Foundations
      2. Scenario A: Pre-Multiplication (Rotations in a Fixed Global Frame)
      3. Scenario B: Post-Multiplication (Rotations Relative to a Moving Local Frame)
      4. How do you determine the choice?
      5. Application: Kinematic Chains
    4. Gimbal Lock
    5. Quaternions
      1. Hierarchy of number systems
      2. Quaternion Algebra
      3. Quaternion Operations
      4. Geometry of 3D Rotations
      5. Spherical Linear Interpolation (SLERP)
      6. Applications
    6. Sources

    What’s there?

    I want to try and cover every other kinda trivial / small things related to matrices / LinAlg in this. This covers

    1. Affine spaces (Other important matrix bits)
    2. Inner products of functions (Other important matrix bits)
    3. Orthogonal projections (Other important matrix bits)
    4. Matrix approximation (Other important matrix bits)
    5. Rotations and quaternions (This post!)

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Rotations as linear transformations

    As always, basics!

    Isometry

    The defining characteristic of a rigid body motion is that it preserves the distances between all points on the body.

    A transformation that preserves distances is called an isometry.

    In a vector space equipped with an inner product (and its induced norm), this means a transformation \Phi: \mathbb{R}^n \to \mathbb{R}^n must preserve the norm of vectors and the angles between them. This is mathematically equivalent to preserving the inner product:

    \langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle = \langle \mathbf{x}, \mathbf{y} \rangle \quad \forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^n

    Orthogonal Matrices preserve norms

    For any linear transformation represented by a matrix \mathbf{R}, the inner product preservation property is (\mathbf{R}\mathbf{x})^\top(\mathbf{R}\mathbf{y}) = \mathbf{x}^\top\mathbf{y}. This expands to \mathbf{x}^\top\mathbf{R}^\top\mathbf{R}\mathbf{y} = \mathbf{x}^\top\mathbf{I}\mathbf{y}.

    This implies the defining property of an orthogonal matrix:

    \mathbf{R}^\top\mathbf{R} = \mathbf{I}

    This means the inverse of an orthogonal matrix is its transpose, \mathbf{R}^{-1} = \mathbf{R}^\top.

    There you go, said it in a roundabout way since I’ve spoken about this at least twice now:)

    Distinguishing Rotations from Reflections

    Orthogonal matrices represent all isometries that fix the origin.

    This is a set which includes both rotations and reflections.

    The determinant of an orthogonal matrix is always \det(\mathbf{R}) = \pm 1.

    A transformation is said to preserve orientation (or “handedness”) if its determinant is positive.

    A pure rotation is an orientation-preserving isometry. Therefore, a rotation is represented by an orthogonal matrix with a determinant of +1.

    Special Orthogonal Group SO(n)

    The set of all n \times n rotation matrices forms a mathematical group under matrix multiplication. This group is known as the Special Orthogonal Group, denoted SO(n).

    E.g., SO(2) represents rotations in the plane, and SO(3) represents rotations in 3D space.

    Rotations in 2D and 3D

    Rotations in 2D, SO(2): The action of a 2D rotation on the standard basis vectors \mathbf{e}_1=[1,0]^\top, \mathbf{e}_2=[0,1]^\top completely defines the transformation. A counter-clockwise rotation by an angle \theta maps \mathbf{e}_1 \mapsto [\cos\theta, \sin\theta]^\top and \mathbf{e}_2 \mapsto [-\sin\theta, \cos\theta]^\top. The columns of the rotation matrix are these transformed basis vectors, yielding:

    \mathbf{R}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \in SO(2)

    Rotations in 3D can also be derived to be

    \mathbf{R}_x(\theta) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}

    \mathbf{R}_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}

    \mathbf{R}_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}
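    A small numpy sketch (my own helper rot_z) verifying that these matrices really do satisfy the SO(3) properties:

```python
import numpy as np

def rot_z(theta: float) -> np.ndarray:
    """Rotation by theta about the z-axis, an element of SO(3)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c,  -s, 0.0],
                     [s,   c, 0.0],
                     [0.0, 0.0, 1.0]])

R = rot_z(np.pi / 3)
print(np.allclose(R.T @ R, np.eye(3)))    # orthogonality: R^T R = I
print(np.isclose(np.linalg.det(R), 1.0))  # det = +1: a rotation, not a reflection
```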

    Pre- and Post-Multiplication: The Duality of Coordinate Frames

    This is actually a very important concept!

    The non-commutativity of matrix multiplication (\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}) is not a mathematical inconvenience; it is the essential property that allows matrices to capture the non-commutative nature of 3D rotations. The choice of whether to pre-multiply (\mathbf{R}_{\text{new}}\mathbf{R}_{\text{old}}) or post-multiply (\mathbf{R}_{\text{old}}\mathbf{R}_{\text{new}}) depends entirely on the coordinate frame in which the new rotation is defined!

    Foundations

    Coordinate Frames

    A coordinate frame, which we denote \{A\}, is a coordinate system used to represent the position and orientation of objects. In \mathbb{R}^n, it consists of an origin point and a set of n orthonormal basis vectors.

    A vector’s components are only meaningful with respect to a specific frame.

    We denote a vector \mathbf{p} expressed in frame \{A\} as \mathbf{p}^A.

    Active vs. Passive Transformations

    A rotation matrix can be interpreted in two ways –

    1. Active Rotation: The coordinate system is fixed, and the object (represented by a vector) is rotated.
      The matrix transforms the vector’s coordinates to new coordinates within the same frame. This is the more common interpretation.

    \mathbf{p}'^A = \mathbf{R} \mathbf{p}^A

    2. Passive Rotation: The object is fixed in space, and the coordinate system is rotated.
      The matrix transforms the vector’s coordinates from the old frame to the new frame. This is a change of basis.
      A rotation of the frame by \mathbf{R} means the coordinates of a fixed vector in the new frame are given by \mathbf{R}^\top times the old coordinates.

    The choice between pre- and post-multiplication arises when we compose multiple rotations. This is best understood by considering two primary scenarios: rotations defined with respect to a fixed, global frame versus rotations defined with respect to a moving, local frame.

    Scenario A: Pre-Multiplication (Rotations in a Fixed Global Frame)

    This is the case where all rotations are defined with respect to the same, unchanging, external coordinate system.

    This is called the world frame or global frame, which we will denote as \{G\}.

    Here, for two successive rotations, we write:

    \mathbf{R}_{\text{total}} = \mathbf{R}_2 \mathbf{R}_1

    The new rotation matrix, \mathbf{R}_2, is pre-multiplied (multiplied on the left) to the existing rotation matrix, \mathbf{R}_1. The order of matrix multiplication (\mathbf{R}_2 then \mathbf{R}_1) is the same as the order of application to the vector (first \mathbf{R}_1, then \mathbf{R}_2).

    Scenario B: Post-Multiplication (Rotations Relative to a Moving Local Frame)

    This is the case where a sequence of rotations is defined with respect to the object’s own, changing coordinate system.

    This is often called the body frame or local frame, which we will denote as \{B\}.

    • The Transformation: An object initially has an orientation described by the matrix \mathbf{R}_1, which represents the orientation of its local frame \{B\} with respect to the global frame \{G\}. Let’s denote this as \mathbf{R}_{GB_1}.
    • Composition: Now, we want to apply a second rotation, \mathbf{R}_2, but this rotation is defined relative to the object’s current, rotated frame, {B_1}. This local rotation transforms frame {B_1} to a new frame {B_2}, and is denoted \mathbf{R}_{B_1 B_2}. We want to find the final orientation of the object, \mathbf{R}_{GB_2}, in the global frame.
    • The Rule: To find the final orientation, we chain the transformations. The orientation of the new frame {B_2} relative to the global frame {G} is the orientation of {B_1} in {G} followed by the orientation of {B_2} in {B_1}. This corresponds to a matrix product:

    \mathbf{R}_{GB_2} = \mathbf{R}_{GB_1} \mathbf{R}_{B_1 B_2}

    If we let \mathbf{R}_1 = \mathbf{R}_{GB_1} and \mathbf{R}_2 = \mathbf{R}_{B_1 B_2}, the final orientation is:

    \mathbf{R}_{\text{final}} = \mathbf{R}_1 \mathbf{R}_2

    The new rotation matrix, \mathbf{R}_2, is post-multiplied (multiplied on the right) to the existing rotation matrix, \mathbf{R}_1.
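    A minimal numpy sketch (my own helpers rot_x, rot_z) showing that the two conventions genuinely give different results:

```python
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

R1 = rot_z(np.pi / 2)   # existing orientation
R2 = rot_x(np.pi / 2)   # new rotation

global_frame = R2 @ R1  # pre-multiply: R2 is about the fixed global x-axis
local_frame = R1 @ R2   # post-multiply: R2 is about the body's own x-axis

print(np.allclose(global_frame, local_frame))  # False -- order matters
v = np.array([1.0, 0.0, 0.0])
print(global_frame @ v, local_frame @ v)       # ~[0, 0, 1] vs ~[0, 1, 0]
```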

    How do you determine the choice?

    The choice between these two conventions is not arbitrary; it is dictated by the problem’s physical or conceptual setup.

    Pre-multiplication (\mathbf{R}_{\text{new}} = \mathbf{R}_{\text{global}} \mathbf{R}_{\text{old}}):

    • Interpretation: Rotations are applied with respect to a fixed, external (world) coordinate system.
    • Analogy: An air traffic controller describing an airplane’s orientation relative to the ground. “First turn 90 degrees North, then pitch up 20 degrees.”

    Post-multiplication (\mathbf{R}_{\text{new}} = \mathbf{R}_{\text{old}} \mathbf{R}_{\text{local}}):

    • Interpretation: Rotations are applied with respect to the object’s own, moving (body) coordinate system.
    • Analogy: A pilot inside the cockpit describing their maneuvers. “First I will pitch up 20 degrees, then I will yaw 10 degrees to my right.”

    Application: Kinematic Chains

    A robot arm is a chain of links, where each link’s frame is attached to the previous one. The orientation of link 2 is defined relative to link 1. The orientation of link 3 is defined relative to link 2, and so on. The orientation of the end-effector (frame {E}) with respect to the base (frame {0}) is the post-multiplication of the successive joint rotations:

    \mathbf{R}_{0E} = \mathbf{R}_{01}(\theta_1) \mathbf{R}_{12}(\theta_2) \mathbf{R}_{23}(\theta_3) \cdots \mathbf{R}_{(n-1)E}(\theta_n)

    Gimbal Lock

    Getting back to SO(3), any arbitrary rotation can be written as a sequence of three rotations about principal axes; the three angles are known as Euler angles (e.g., yaw, pitch, roll).

    This representation suffers from a singularity where two of the three rotational axes align, causing the loss of one degree of freedom. This is called gimbal lock.

    At this point, it is impossible to perform small rotations about one of the original axes, leading to instability in control and animation systems.

    I really cannot explain the math better than this StackExchange answer; see this YT video for a demo.

    How do you solve this? Quaternions.

    Quaternions

    Hierarchy of number systems

    1. Real Numbers (\mathbb{R}): Ordered, commutative, associative. Lacks algebraic closure (e.g., x^2+1=0 has no real solution).
    2. Complex Numbers (\mathbb{C}): Adds the imaginary unit i where i^2=-1. The system is algebraically closed, commutative, and associative. It sacrifices the property of being an ordered field.
    3. Quaternions (\mathbb{H}): Adds two more imaginary units, j and k. The system is associative but sacrifices the property of commutativity in multiplication (ab \neq ba). This non-commutativity is not a flaw; it is the essential feature that allows quaternions to model non-commutative 3D rotations.
    4. Octonions (\mathbb{O}): Adds four more units. The system is neither commutative nor associative, sacrificing associativity in multiplication ((ab)c \neq a(bc)).

    Quaternion Algebra

    A quaternion q \in \mathbb{H} is an element of the form:

    q = w + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}

    where w, x, y, z are real numbers, and {\mathbf{1}, \mathbf{i}, \mathbf{j}, \mathbf{k}} form a basis for this 4D vector space.

    • w is the scalar part, denoted \text{Re}(q).
    • \mathbf{v} = x\mathbf{i} + y\mathbf{j} + z\mathbf{k} is the vector part, denoted \text{Im}(q). A quaternion with a zero scalar part is called a pure quaternion.

    The algebra is defined by the Hamiltonian rules for the basis elements:

    \mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{i}\mathbf{j}\mathbf{k} = -1

    This single, compact relation generates the entire non-commutative multiplication table:

    \begin{aligned} \mathbf{i}\mathbf{j} = \mathbf{k} \quad &\text{and} \quad \mathbf{j}\mathbf{i} = -\mathbf{k} \\ \mathbf{j}\mathbf{k} = \mathbf{i} \quad &\text{and} \quad \mathbf{k}\mathbf{j} = -\mathbf{i} \\ \mathbf{k}\mathbf{i} = \mathbf{j} \quad &\text{and} \quad \mathbf{i}\mathbf{k} = -\mathbf{j} \end{aligned}

    The multiplication of two general quaternions q_1 = w_1 + \mathbf{v}_1 and q_2 = w_2 + \mathbf{v}_2 is defined by distributing terms and applying these rules. This leads to a more compact formula relating quaternion multiplication to the vector dot and cross products:

    q_1 q_2 = (w_1w_2 - \mathbf{v}_1 \cdot \mathbf{v}_2) + (w_1\mathbf{v}_2 + w_2\mathbf{v}_1 + \mathbf{v}_1 \times \mathbf{v}_2)
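    A minimal implementation sketch of this product formula (the helper name qmul is mine), checking that it reproduces the Hamiltonian rules:

```python
import numpy as np

def qmul(q1, q2):
    """Hamilton product of quaternions given as (w, x, y, z) arrays."""
    w1, v1 = q1[0], np.asarray(q1[1:])
    w2, v2 = q2[0], np.asarray(q2[1:])
    w = w1 * w2 - v1 @ v2
    v = w1 * v2 + w2 * v1 + np.cross(v1, v2)
    return np.array([w, *v])

i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])

print(qmul(i, j))   # [0, 0, 0, 1]  -> ij = k
print(qmul(j, i))   # [0, 0, 0, -1] -> ji = -k (non-commutative)
print(qmul(i, i))   # [-1, 0, 0, 0] -> i^2 = -1
```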

    Quaternion Operations

    • Conjugate: q^* = w - x\mathbf{i} - y\mathbf{j} - z\mathbf{k} = w - \mathbf{v}. The conjugate of a product is the reversed product of the conjugates: (q_1 q_2)^* = q_2^* q_1^*.
    • Norm: The norm of a quaternion, |q|, is a non-negative real number defined as:

    |q| = \sqrt{q q^*} = \sqrt{(w+\mathbf{v})(w-\mathbf{v})} = \sqrt{w^2 + \mathbf{v}\cdot\mathbf{v}} = \sqrt{w^2 + x^2 + y^2 + z^2}

    (Using the product formula above with q_1 = w + \mathbf{v} and q_2 = w - \mathbf{v}: the scalar part is w^2 + \mathbf{v}\cdot\mathbf{v}, and the vector part -w\mathbf{v} + w\mathbf{v} - \mathbf{v}\times\mathbf{v} vanishes, since \mathbf{v}\times\mathbf{v} = \mathbf{0}.)

    This is the standard Euclidean norm in \mathbb{R}^4. The norm is multiplicative: |q_1 q_2| = |q_1||q_2|.

    • Inverse: The multiplicative inverse of a non-zero quaternion q is:

    q^{-1} = \frac{q^*}{|q|^2}

    This ensures that \mathbb{H} is a division algebra: every non-zero element has an inverse.

    • Unit Quaternion: A quaternion with a norm of 1, |q|=1. For unit quaternions, the inverse is simply the conjugate: q^{-1} = q^*. The set of all unit quaternions forms a 3-dimensional sphere (S^3) embedded in \mathbb{R}^4.

    Geometry of 3D Rotations

    Representing Rotations with Unit Quaternions

    A rotation in \mathbb{R}^3 is uniquely defined by an axis of rotation (a unit vector \mathbf{u} \in \mathbb{R}^3) and a right-handed angle of rotation \theta \in [0, 2\pi).

    Euler’s Rotation Theorem states that any composition of rotations is equivalent to a single rotation about some axis.

    I mean, this makes sense intuitively as well, right? Not gonna prove it.

    A unit quaternion q represents such a rotation via the Euler-Rodrigues formula:

    q = \cos(\theta/2) + \mathbf{u}\sin(\theta/2) = \cos(\theta/2) + (u_x\mathbf{i} + u_y\mathbf{j} + u_z\mathbf{k})\sin(\theta/2)

    Notice the use of the half-angle \theta/2. This implies that

    • A rotation of 2\pi about any axis corresponds to q = \cos(\pi) + \mathbf{u}\sin(\pi) = -1, and
    • A rotation of 4\pi is required to return to the identity quaternion q=1.

    This mathematical structure is known as a double cover: the group of unit quaternions (denoted Sp(1) or S^3) is a double cover of the group of 3D rotations SO(3).

    The two quaternions q and -q correspond to the exact same physical rotation in 3D space.

    The Sandwich Product

    1. Embed: Lift the vector \mathbf{v} into the quaternion algebra by representing it as a pure quaternion p:

    p = 0 + v_x\mathbf{i} + v_y\mathbf{j} + v_z\mathbf{k}

    2. Transform: Apply the rotation via the sandwich product:

    p' = q p q^{-1} = q p q^*

    3. Project: The result p' is guaranteed to be another pure quaternion. The vector part of p' is the rotated vector \mathbf{v}'.

    Composition of Rotations

    If a rotation represented by q_1 is followed by a rotation represented by q_2, the single composite rotation is represented by the quaternion product:

    q_{\text{total}} = q_2 q_1

    The non-commutativity of quaternion multiplication (q_2 q_1 \neq q_1 q_2) correctly reflects the non-commutativity of 3D rotations (e.g., rotating 90 degrees about the x-axis then the y-axis is different from the reverse).

    This is a significant advantage over Euler angles, where composition is complex and unintuitive.
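    Putting the sandwich product and composition together in one minimal numpy sketch (all the helper names here, qmul, qconj, axis_angle_quat, rotate, are my own):

```python
import numpy as np

def qmul(q1, q2):
    w1, v1 = q1[0], q1[1:]
    w2, v2 = q2[0], q2[1:]
    return np.concatenate(([w1 * w2 - v1 @ v2],
                           w1 * v2 + w2 * v1 + np.cross(v1, v2)))

def qconj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def axis_angle_quat(axis, theta):
    axis = np.asarray(axis, dtype=float)
    axis /= np.linalg.norm(axis)
    return np.concatenate(([np.cos(theta / 2)], np.sin(theta / 2) * axis))

def rotate(q, v):
    p = np.concatenate(([0.0], v))         # embed as a pure quaternion
    return qmul(qmul(q, p), qconj(q))[1:]  # sandwich product, then project

q1 = axis_angle_quat([0, 0, 1], np.pi / 2)  # 90 deg about z
q2 = axis_angle_quat([1, 0, 0], np.pi / 2)  # 90 deg about x
v = np.array([1.0, 0.0, 0.0])

print(rotate(q1, v))            # ~[0, 1, 0]
print(rotate(qmul(q2, q1), v))  # composition q2 q1: ~[0, 0, 1]
print(np.allclose(rotate(qmul(q2, q1), v), rotate(q2, rotate(q1, v))))  # True
```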

    Spherical Linear Interpolation (SLERP)

    To smoothly interpolate between two orientations represented by unit quaternions q_1 and q_2, we need to find a path of constant rotational velocity on the surface of the S^3 hypersphere.

    This path is a great-circle arc.

    What this means is that it’s an arc of a great circle on the unit sphere in four dimensions. We use SLERP for this.

    \text{Slerp}(q_1, q_2, t) = \frac{\sin((1-t)\Omega)}{\sin\Omega}q_1 + \frac{\sin(t\Omega)}{\sin\Omega}q_2

    where t \in [0,1] is the interpolation parameter and \Omega is the angle between the four-dimensional vectors corresponding to q_1 and q_2, given by \cos\Omega = w_1w_2 + x_1x_2 + y_1y_2 + z_1z_2.

    This formula produces smooth, constant-velocity rotations, a feat that is notoriously difficult and unstable with Euler angles.
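    A minimal SLERP sketch (mine; note that it adds the usual shortest-path and nearly-parallel guards, which the bare formula above doesn't spell out):

```python
import numpy as np

def slerp(q1, q2, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    cos_omega = float(np.dot(q1, q2))
    if cos_omega < 0.0:        # take the short way around the double cover
        q2, cos_omega = -q2, -cos_omega
    if cos_omega > 0.9995:     # nearly parallel: fall back to lerp + renormalize
        q = (1 - t) * q1 + t * q2
        return q / np.linalg.norm(q)
    omega = np.arccos(cos_omega)
    return (np.sin((1 - t) * omega) * q1 + np.sin(t * omega) * q2) / np.sin(omega)

# Identity vs. 90 deg about z; halfway should be 45 deg about z.
q_id = np.array([1.0, 0.0, 0.0, 0.0])
q_90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
print(slerp(q_id, q_90, 0.5))  # ~[cos(pi/8), 0, 0, sin(pi/8)]
```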

    Applications

    Duh, you rotate everywhere in robotics.

    Sources

    1. Mathematics for Machine Learning (link)

    1. What’s there?
      1. Why do I write this?
    2. Affine Spaces
      1. Definition(s)
      2. Very math stuff – what’s the use?
    3. Inner Products (of functions)
      1. Starting from dot products
      2. Theoretical Development
      3. Where can I see this in action?
    4. Orthogonal Projections
      1. What is it, exactly?
      2. Two Cases
      3. Projection onto a subspace
      4. Where have we seen this?
    5. Matrix Approximation
      1. Applications
    6. Sources

    What’s there?

    I want to try and cover every other kinda trivial / small things related to matrices / LinAlg in this. This covers

    1. Affine spaces (This post!)
    2. Inner products of functions (This post!)
    3. Orthogonal projections (This post!)
    4. Matrix approximation (This post!)
    5. Rotations and quaternions

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Affine Spaces

    Definition(s)

    An affine space A is a set of elements called points, together with a vector space V, and an operation of vector addition that acts on points. This operation, +: A \times V \to A, satisfies the following axioms:

    1. For any point P \in A, P + \mathbf{0} = P.
    2. For any point P \in A and any two vectors \mathbf{v}_1, \mathbf{v}_2 \in V, (P + \mathbf{v}_1) + \mathbf{v}_2 = P + (\mathbf{v}_1 + \mathbf{v}_2).
    3. For any two points P, Q \in A, there exists a unique vector \mathbf{v} \in V such that P + \mathbf{v} = Q. This unique vector is denoted \vec{PQ} = Q - P.

    The vector space V is called the direction space associated with the affine space A.

    Some other definitions

    Affine Subspaces: An affine subspace L of an affine space A is a subset of A that is itself an affine space with a direction space U that is a vector subspace of V. More concretely, an affine subspace is formed by taking a vector subspace and translating it.

    L = P_0 + U := { P_0 + \mathbf{u} \mid \mathbf{u} \in U }

    where P_0 \in A is a support point (an origin for the subspace) and U \subseteq V is the direction space. A vector space is a special case of an affine subspace where the support point is the origin.

    Parametric Representation: If the direction space U has a basis {\mathbf{b}_1, \dots, \mathbf{b}_k}, then any point P in the affine subspace L can be uniquely represented in parametric form as:

    P = P_0 + \sum_{i=1}^k \lambda_i \mathbf{b}_i, \quad \text{where } \lambda_i \in \mathbb{R}

    Examples include lines (k=1) and planes (k=2). A hyperplane is an affine subspace of codimension 1 (i.e., of dimension n-1 in an n-dimensional space).

    Affine Mappings: An affine map f: A \to B between two affine spaces is a function that preserves the structure of affine subspaces. It can be shown that any affine map can be uniquely decomposed into a linear map on the underlying vector spaces plus a translation. In a coordinate system, this takes the familiar form:

    f(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{t}

    where \mathbf{A} is the matrix of the underlying linear map and \mathbf{t} is a translation vector.

    Very math stuff – what’s the use?

    Use 1: Support Vector Machines!

    The separating hyperplane in a Support Vector Machine (SVM) is actually an affine subspace!

    In a D-dimensional feature space, the hyperplane is defined by the set of points \mathbf{x} satisfying the equation:

    \langle \mathbf{w}, \mathbf{x} \rangle + b = 0

    Here, \mathbf{w} is the normal vector that defines the direction space’s orthogonal complement, and b is a scalar related to the translation from the origin.
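    A tiny numerical sketch (with a hypothetical \mathbf{w} and b of my choosing) showing that the hyperplane really is a support point plus a direction space:

```python
import numpy as np

# Hyperplane <w, x> + b = 0 in R^3, viewed as an affine subspace P_0 + U.
w = np.array([1.0, 2.0, 2.0])
b = -3.0

p0 = -b * w / (w @ w)   # a support point on the hyperplane
# A basis of the direction space U = w-perp (two independent vectors orthogonal to w).
u1 = np.array([2.0, -1.0, 0.0])
u2 = np.array([2.0, 0.0, -1.0])

# Any point P = p0 + lambda1*u1 + lambda2*u2 satisfies the hyperplane equation.
lam = np.array([0.7, -1.3])
p = p0 + lam[0] * u1 + lam[1] * u2
print(np.isclose(w @ p + b, 0.0))  # True
```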

    Inner Products (of functions)

    Starting from dot products

    The concept of an inner product on a space of functions is a direct generalization of the dot product on finite-dimensional Euclidean space \mathbb{R}^n.

    1. The dot product in \mathbb{R}^n is defined as \langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top\mathbf{y} = \sum_{i=1}^n x_i y_i. It satisfies the axioms of an inner product: bilinearity, symmetry, and positive definiteness.
    2. Now, a function f: [a,b] \to \mathbb{R} can be conceptualized as an infinite-dimensional vector, where each point x \in [a,b] corresponds to a component with value f(x). In this view, the discrete sum of the dot product becomes a continuous integral.
    3. Further, to ensure the integrals are well-defined, we consider a vector space of functions, typically the space of square-integrable functions on an interval [a,b], denoted L^2([a,b]).

    Formal Definition: The inner product of two real-valued functions f(x) and g(x) in the space L^2([a,b]) is defined as:

    \langle f, g \rangle := \int_a^b f(x) g(x) dx

    More generally, a weight function w(x) > 0 can be introduced: \langle f, g \rangle_w = \int_a^b f(x) g(x) w(x) dx.

    Theoretical Development

    Gonna info dump some things –

    • Induced Norm: The inner product induces a norm, known as the L^2-norm, which measures the “length” or “magnitude” of a function:

    |f|_{L^2} = \sqrt{\langle f, f \rangle} = \left(\int_a^b f(x)^2 dx\right)^{1/2}

    • Orthogonality of Functions: Two functions f and g are orthogonal if their inner product is zero:

    \langle f, g \rangle = \int_a^b f(x) g(x) dx = 0

    • Orthogonal Basis and Fourier Series: A set of functions {\phi_k(x)} is an orthogonal basis if its members are mutually orthogonal.

    Where can I see this in action?

    Use 1: Fourier Series

    Functional analysis says that many well-behaved functions can be represented as a linear combination of orthogonal basis functions.

    E.g., Fourier series, which represents a periodic function as a sum of sines and cosines, which form an orthogonal set on the interval [-\pi, \pi].

    The coefficient for each basis function \phi_k is found by projecting f onto it: c_k = \frac{\langle f, \phi_k \rangle}{|\phi_k|^2}.
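    A minimal numerical sketch of these inner products (my own, using a plain Riemann sum rather than symbolic integration):

```python
import numpy as np

# Inner product <f, g> = integral of f*g on [-pi, pi], via a simple Riemann sum.
x = np.linspace(-np.pi, np.pi, 200_001)
dx = x[1] - x[0]
inner = lambda f, g: np.sum(f(x) * g(x)) * dx

print(inner(np.sin, np.cos))  # ~0: sin and cos are orthogonal on [-pi, pi]
print(inner(np.sin, np.sin))  # ~pi: ||sin||^2 on [-pi, pi]

# Fourier coefficient of f(x) = x for the basis function sin(x):
f = lambda t: t
c1 = inner(f, np.sin) / inner(np.sin, np.sin)
print(c1)                     # ~2, matching the Fourier series of x
```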

    Other than this, you can actually show that solutions relating to Kernel methods, unknown functions in robotics (e.g., friction), etc. can be written as inner products of functions.

    Orthogonal Projections

    What is it, exactly?

    An orthogonal projection is a mapping that takes a vector and finds its closest point within a given subspace.

    It is the formalization of the geometric idea of casting a shadow.

    Ok, fine, but what is it?

    Given a vector space V and a subspace U \subseteq V, for any vector \mathbf{v} \in V, we want to find the unique vector \mathbf{u} \in U that is closest to \mathbf{v}. This is equivalent to minimizing the distance:

    \text{proj}_U(\mathbf{v}) = \arg\min_{\mathbf{u} \in U} |\mathbf{v} - \mathbf{u}|

    The solution to this minimization problem occurs when the “error” vector, \mathbf{v} - \mathbf{u}, is orthogonal to the subspace U. That is, \langle \mathbf{v} - \mathbf{u}, \mathbf{z} \rangle = 0 for all \mathbf{z} \in U.

    This is basically the Geometric Condition!

    Two Cases

    You can either

    1. Project onto a line, or,
    2. Project onto a subspace

    Projection onto a Line: The simplest case is projecting a vector \mathbf{v} onto the line spanned by a single non-zero vector \mathbf{b}. The projection is a scaled version of \mathbf{b}, \text{proj}_{\mathbf{b}}(\mathbf{v}) = \lambda \mathbf{b}. Using the orthogonality condition:

    \langle \mathbf{v} - \lambda\mathbf{b}, \mathbf{b} \rangle = 0 \implies \langle \mathbf{v}, \mathbf{b} \rangle - \lambda\langle \mathbf{b}, \mathbf{b} \rangle = 0 \implies \lambda = \frac{\langle \mathbf{v}, \mathbf{b} \rangle}{|\mathbf{b}|^2}

    The projection is thus \text{proj}_{\mathbf{b}}(\mathbf{v}) = \frac{\langle \mathbf{v}, \mathbf{b} \rangle}{|\mathbf{b}|^2} \mathbf{b}. In matrix form for the standard dot product in \mathbb{R}^n, this is represented by the projection matrix: \mathbf{P} = \frac{\mathbf{b}\mathbf{b}^\top}{\mathbf{b}^\top\mathbf{b}}.

    Hold up here. Notice that \text{proj}_{\mathbf{b}}(\mathbf{v}) is a vector. This is the vector of the projection. The magnitude of the projection, just the “size” of the shadow so to say, is the scalar component, \text{comp}_{\mathbf{b}}(\mathbf{v}):

    \text{comp}_{\mathbf{b}}(\mathbf{v}) = \frac{\langle \mathbf{v}, \mathbf{b} \rangle}{|\mathbf{b}|}

    Projection onto a Subspace: This generalizes to projecting onto a subspace U spanned by the columns of a matrix \mathbf{A} \in \mathbb{R}^{m \times n} (with linearly independent columns). The projection \mathbf{p} is in the column space, so \mathbf{p} = \mathbf{A}\hat{\mathbf{x}} for some coefficient vector \hat{\mathbf{x}}. The error vector \mathbf{b} - \mathbf{A}\hat{\mathbf{x}} must be orthogonal to the column space of \mathbf{A}. This means it must be in the left null space of \mathbf{A}:

    \mathbf{A}^\top(\mathbf{b} - \mathbf{A}\hat{\mathbf{x}}) = \mathbf{0} \implies \mathbf{A}^\top\mathbf{A}\hat{\mathbf{x}} = \mathbf{A}^\top\mathbf{b}

    This is the normal equation. Solving for \hat{\mathbf{x}} and plugging back gives the projection \mathbf{p}:

    \hat{\mathbf{x}} = (\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{b} \implies \mathbf{p} = \mathbf{A}(\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{b}

    The projection matrix is \mathbf{P} = \mathbf{A}(\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top. If the columns of \mathbf{A} are orthonormal, then \mathbf{A}^\top\mathbf{A} = \mathbf{I}, and the formula simplifies to \mathbf{P} = \mathbf{A}\mathbf{A}^\top.

    Wait this doesn’t make sense directly:( Ok, imma write it again!

    Projection onto a subspace

    Yes, its own section.

    Projection onto a Line (1D):

    • The “Ground”: A single line defined by one vector b.
    • The “Shadow”: The projection p is on the line, so it’s a scaled version of b.
    • Orthogonality: The error vector (b − p) must be orthogonal to the line’s direction vector b.

    Projection onto a Subspace (nD):

    • The “Ground”: A “floor” or subspace defined by multiple basis vectors, which are the columns of a matrix A.
    • The “Shadow”: The projection p is on the “floor,” so it’s a linear combination of the basis vectors in A.
    • Orthogonality: The error vector (b − p) must be orthogonal to the entire subspace. This means it must be orthogonal to every column of A.

    Now, let’s derive!

    State the Goal: We want to find the projection \mathbf{p}, which we know is a combination of \mathbf{A}‘s columns: \mathbf{p} = \mathbf{A}\hat{\mathbf{x}}.

    Use the Orthogonality Condition: The error vector (\mathbf{b} - \mathbf{p}) must be orthogonal to the column space of \mathbf{A}.

    \mathbf{A}^\top(\mathbf{b} - \mathbf{p}) = \mathbf{0}

    Substitute \mathbf{p}: Replace \mathbf{p} with what we know it is, \mathbf{A}\hat{\mathbf{x}}.

    \mathbf{A}^\top(\mathbf{b} - \mathbf{A}\hat{\mathbf{x}}) = \mathbf{0}

    Solve for the Coefficients \hat{\mathbf{x}}: Distribute \mathbf{A}^\top to get the normal equation. This equation’s name comes from the fact that it enforces the “normal” (orthogonal) condition.

    \mathbf{A}^\top\mathbf{b} - \mathbf{A}^\top\mathbf{A}\hat{\mathbf{x}} = \mathbf{0}

    \mathbf{A}^\top\mathbf{A}\hat{\mathbf{x}} = \mathbf{A}^\top\mathbf{b}

    Now, solve for the coefficient vector \hat{\mathbf{x}}:

    \hat{\mathbf{x}} = (\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{b}

    This formula gives us the perfect “recipe” of coefficients to create the projection.

    Find the Projection Vector \mathbf{p}: Now that we have the recipe (\hat{\mathbf{x}}), we can build the projection vector \mathbf{p}.

    \mathbf{p} = \mathbf{A}\hat{\mathbf{x}} = \mathbf{A}(\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{b}

    The projection matrix is the part that does all the work: \mathbf{P} = \mathbf{A}(\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top. It’s a single matrix that takes any vector \mathbf{b} and spits out its projection onto the subspace defined by \mathbf{A}.

    Where have we seen this?

    The solution to the least-squares problem, \min_{\boldsymbol{\theta}} |\mathbf{y} - \mathbf{X}\boldsymbol{\theta}|_2^2, is found by solving the normal equation \mathbf{X}^\top\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^\top\mathbf{y}.

    This solution, \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}, is the orthogonal projection of the observed target vector \mathbf{y} onto the column space of the design matrix \mathbf{X}.
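    A minimal numpy sketch (mine) tying the projection matrix and least squares together:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))   # design matrix with independent columns
y = rng.normal(size=50)

# Orthogonal projection of y onto Col(X) via the normal equations.
theta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta

# Same thing through the projection matrix P = X (X^T X)^{-1} X^T.
P = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(P @ y, y_hat))          # True
print(np.allclose(X.T @ (y - y_hat), 0))  # residual is orthogonal to Col(X)
print(np.allclose(theta, np.linalg.lstsq(X, y, rcond=None)[0]))  # matches lstsq
```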

    Matrix Approximation

    I just wanted to add this as a heading since it’s not directly evident that this is just the EYM theorem discussed in depth within Matrix Decompositions 3. Anyway, here’s the statement again –

    Theorem (Eckart-Young-Mirsky): Let the SVD of a matrix \mathbf{M} \in \mathbb{R}^{m \times n} of rank r be \mathbf{M} = \mathbf{U}\mathbf{S}\mathbf{V}^\top = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top. The best rank-k approximation to \mathbf{M} (for k < r) that minimizes the error |\mathbf{M} - \mathbf{M}_k| (in both spectral and Frobenius norms) is given by the truncated SVD sum:

    \mathbf{M}_k = \arg\min_{\text{rank}(\mathbf{B})=k} |\mathbf{M} - \mathbf{B}| = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top

    The theorem states that to get the best approximation, one should keep the components corresponding to the largest singular values and discard the rest. The singular values quantify the “importance” of each rank-one component.
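    A minimal numpy sketch (mine) of truncated-SVD approximation; it also checks the known fact that the Frobenius error of the best rank-k approximation is \sqrt{\sum_{i > k} \sigma_i^2}:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.normal(size=(8, 6))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

def rank_k(k):
    """Best rank-k approximation: keep the k largest singular values."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

for k in (1, 2, 4, 6):
    err = np.linalg.norm(M - rank_k(k), ord="fro")
    print(k, err, np.sqrt(np.sum(s[k:] ** 2)))  # the two numbers agree
```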

    Applications

    Gosh, what’s not there?

    • Machine Learning (Principal Component Analysis): PCA is the most direct application of this theorem. It seeks the best rank-k approximation of a centered data matrix. The SVD provides both the optimal subspace (spanned by the first k right-singular vectors) and the coordinates of the data projected onto it.
    • Machine Learning (Recommender Systems): A user-item rating matrix is often very large, sparse, and assumed to be approximately low-rank. Low-rank matrix factorization, which is a form of matrix approximation, is used to fill in the missing entries. The SVD finds the optimal low-rank approximation, uncovering latent factors that describe user preferences and item characteristics.
    • AI (Natural Language Processing – Topic Modeling): In Latent Semantic Analysis (LSA), a large term-document matrix is constructed. Its low-rank approximation via SVD is used to find a “semantic space” where terms and documents with similar meanings are close together. The singular vectors represent abstract “topics.”

    Sources

    1. Mathematics for Machine Learning (link)

    1. What’s there?
      1. Why do I write this?
    2. Introduction
      1. Standard building blocks
      2. Why the single umbrella of data systems?
      3. How do you describe such systems?
    3. Reliability
      1. Fault Tolerant or Fault Resistant?
      2. Do we handle faults or failures?
    4. Scalability
      1. Load parameters
      2. Describing Performance
      3. Coping with load
    5. Maintainability
    6. Takeaways
    7. Sources

    What’s there?

    I’ve started reading DDIA! These are some short notes from Chapter 1.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Also, just reading the textbook will in any case lead to forgetting parts of it. So, thought that I might as well (yay, repetition!) write some short notes for a quick overview later.

    Introduction

    Standard building blocks

    Stuff that pretty much everyone is familiar with –

    1. Databases
    2. Caches
    3. Search indexes
    4. Stream processing
    5. Batch processing

    Why the single umbrella of data systems?

    Soo many different things – why do we group them as one?

    1. Many tools are optimized for a variety of different use cases.
      E.g., Redis, a datastore, can also be used as a message queue.
    2. Many apps have requirements broad enough that no single tool covers them all, so several tools end up stitched together by application code.

    How do you describe such systems?

    Three things –

    1. Reliability
    2. Maintainability
    3. Scalability

    Reliability

    Fault Tolerant or Fault Resistant?

    Tolerant. You cannot guarantee that no fault will ever occur, so the best you can do is design the system to tolerate faults when they happen.

    Do we handle faults or failures?

    A fault is one component of a system failing.

    A failure is the system going down.

    We handle faults, the goal being to keep them from turning into failures.

    Scalability

    A system’s ability to cope with additional load.

    Load parameters

    How you describe the load. E.g., the number of writes / reads per second.

    Describing Performance

    What happens when the load increases?

    • Service Level Objectives (SLOs) and Service Level Agreements (SLAs)
    • Percentile based measurement

    Coping with load

    There are just two ways

    1. Scaling up (vertical)
    2. Scaling out (horizontal)

    The best designs are generally a combination of the two.

    Maintainability

    I think this just revolves around the broader concept of tech debt, but the major aspects are

    1. Operability
    2. Simplicity
    3. Evolvability

    Takeaways

    Yeah, this was a very basic introductory chapter that was (kinda?) trivial since I had done distributed systems in school. The main takeaways are:

    • Access patterns matter!
    • There is no one solution.
    • You should always factor in reliability, scalability and maintainability.

    Sources

    1. Designing Data-Intensive Applications (link)
    1. What’s there?
      1. Why do I write this?
    2. What are S, V and D?
      1. Geometric action of a matrix!
      2. Formal definition!!
      3. Woah! Column space, row space? What?
      4. An Exposition on the Fundamental Subspaces of a Linear Transformation
    3. Theoretical development
      1. From eigendecomposition
      2. Singular values and geometry
      3. Reduced SVD
    4. Applications
      1. Low-Rank Approximation and Principal Component Analysis (PCA)
      2. Moore-Penrose Pseudoinverse and Linear Regression
      3. Numerical Stability and Condition Number
    5. Aside: Eckart-Young-Mirsky Theorem
      1. Matrix norms
      2. Theorem
    6. Aside: Intuitions for SVD
      1. Theoretical development for Moore-Pensore inverse
      2. Listing the intuitions
    7. Sources

    What’s there?

    There are basically 5 (ಐದು / पाँच) types of matrix decompositions that are like super important. I strive to go in depth into all of them. These are

    1. Eigendecomposition (Matrix Decompositions 1)
    2. QR decomposition (Matrix Decompositions 2)
    3. Cholesky decomposition (Matrix Decompositions 2)
    4. LU decomposition (Matrix Decompositions 2)
    5. (perhaps the most important) Singular Value Decomposition (This post!)

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    What are S, V and D?

    Singular, Value and Decomposition? Let’s start!

    Geometric action of a matrix!

    Remember eigendecomposition? Yes? No? See Matrix Decompositions 1

    What if the matrix is not square?? We use SVD! Wait lemme explain:

    Any linear transformation \mathbf{A} maps the set of unit vectors in the input space (a unit sphere in \mathbb{R}^n) to a hyperellipse in the output space \mathbb{R}^m.

    This is the main geometric insight of the SVD –

    This hyperellipse lies within a subspace of the codomain. The SVD provides a complete characterization of this transformation by identifying the specific orthonormal basis vectors in the input space that are mapped directly onto the principal axes of this resulting hyperellipse.

    Formal definition!!

    Theorem (Singular Value Decomposition): For any matrix \mathbf{A} \in \mathbb{R}^{m \times n} of rank r \le \min(m, n), there exists a factorization of the form:

    \mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top

    where the components have the following properties:

    1. \mathbf{U} \in \mathbb{R}^{m \times m} is an Orthogonal Matrix. The columns of \mathbf{U}, denoted {\mathbf{u}_1, \dots, \mathbf{u}_m}, are the left-singular vectors. They form an orthonormal basis for the output space (codomain), \mathbb{R}^m. The first r of these vectors, {\mathbf{u}_1, \dots, \mathbf{u}_r}, form an orthonormal basis for the column space (range) of \mathbf{A}. Geometrically, these are the unit vectors that define the directions of the principal axes of the output hyperellipse.
    2. \mathbf{V} \in \mathbb{R}^{n \times n} is an Orthogonal Matrix. The columns of \mathbf{V}, denoted {\mathbf{v}_1, \dots, \mathbf{v}_n}, are the right-singular vectors. They form an orthonormal basis for the input space (domain), \mathbb{R}^n. The first r of these vectors, {\mathbf{v}_1, \dots, \mathbf{v}_r}, form an orthonormal basis for the row space of \mathbf{A}. Geometrically, these are the specific orthonormal vectors on the input unit sphere that are mapped by \mathbf{A} onto the principal axes of the output hyperellipse.
    3. \mathbf{\Sigma} \in \mathbb{R}^{m \times n} is a Rectangular Diagonal Matrix. The only non-zero entries of \mathbf{\Sigma} are on its main diagonal. These entries, \Sigma_{ii} = \sigma_i, are the singular values of \mathbf{A}. They are real, non-negative, and by convention, sorted in descending order:

    \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0, \quad \text{and} \quad \sigma_{r+1} = \dots = \sigma_{\min(m,n)} = 0

    The number of non-zero singular values is exactly the rank r of the matrix \mathbf{A}. Geometrically, each singular value \sigma_i is the length of the i-th principal semi-axis of the output hyperellipse.
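    A minimal numpy sketch (mine) checking these properties on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A)        # full SVD: U is 5x5, Vt is 3x3
Sigma = np.zeros_like(A)           # rectangular diagonal 5x3
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(A, U @ Sigma @ Vt))    # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(5)))   # U orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(3))) # V orthogonal
print(np.sum(s > 1e-12))                 # number of non-zero singular values = rank
```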

    Woah! Column space, row space? What?

    Actually, I’ll call this –

    An Exposition on the Fundamental Subspaces of a Linear Transformation

    The most profound insights into the nature of this transformation are revealed by analyzing four special vector subspaces that are intrinsically associated with the matrix \mathbf{A}. These are the four fundamental subspaces. Understanding them provides a complete geometric and algebraic picture of the action of the matrix.

    Subspace 1: The Column Space (or Range)

    The column space of a matrix \mathbf{A} \in \mathbb{R}^{m \times n}, denoted \text{Col}(\mathbf{A}) or \mathcal{R}(\mathbf{A}), is the set of all possible linear combinations of its column vectors. If the columns of \mathbf{A} are {\mathbf{a}_1, \dots, \mathbf{a}_n}, where each \mathbf{a}_j \in \mathbb{R}^m, then:

    \text{Col}(\mathbf{A}) = \text{span}\{\mathbf{a}_1, \dots, \mathbf{a}_n\} = \left\{ \sum_{j=1}^n c_j \mathbf{a}_j \mid c_j \in \mathbb{R} \right\}

    Notice how the column space is the set of all vectors \mathbf{b} for which the system \mathbf{A}\mathbf{x} = \mathbf{b} has a solution.

    Subspace 2: The Null Space (or Kernel)

    The null space of a matrix \mathbf{A} \in \mathbb{R}^{m \times n}, denoted \text{Null}(\mathbf{A}) or \mathcal{N}(\mathbf{A}), is the set of all solutions to the homogeneous linear system \mathbf{A}\mathbf{x} = \mathbf{0}.

    \text{Null}(\mathbf{A}) = { \mathbf{x} \in \mathbb{R}^n \mid \mathbf{A}\mathbf{x} = \mathbf{0} }

    The null space answers the question: “What is the set of all input vectors that get crushed or annihilated by the transformation?”

    For a system \mathbf{A}\mathbf{x} = \mathbf{b}, if a solution \mathbf{x}_p exists (a particular solution), then the set of all solutions is given by {\mathbf{x}_p + \mathbf{x}_n \mid \mathbf{x}_n \in \text{Null}(\mathbf{A})}.

    The null space thus describes the ambiguity or non-uniqueness of the solution.

    Subspace 3: The Row Space

    It is equivalent to the column space of \mathbf{A}^\top.

    Subspace 4: The Left Null Space

    The left null space of a matrix \mathbf{A} \in \mathbb{R}^{m \times n} is the set of all solutions to the homogeneous system \mathbf{A}^\top\mathbf{y} = \mathbf{0}.

    \text{Null}(\mathbf{A}^\top) = { \mathbf{y} \in \mathbb{R}^m \mid \mathbf{A}^\top\mathbf{y} = \mathbf{0} }

    It is called the “left” null space because the equation can be written as \mathbf{y}^\top\mathbf{A} = \mathbf{0}^\top, where the vector \mathbf{y} is on the left.

    Notice how the left null space contains all vectors in the output space that are orthogonal to every vector in the column space.

    It represents the part of the codomain that is “unreachable” by the transformation.

    The Rank-Nullity Theorem

    Theorem (Rank-Nullity Theorem): For any matrix \mathbf{A} \in \mathbb{R}^{m \times n}:

    \dim(\text{Col}(\mathbf{A})) + \dim(\text{Null}(\mathbf{A})) = n

    or, using the definition of rank:

    \text{rank}(\mathbf{A}) + \dim(\text{Null}(\mathbf{A})) = n

    Interpretation: The dimension of the input space (n) is partitioned between two subspaces: the part that is mapped to the column space (the row space, whose dimension is the rank) and the part that is crushed to zero (the null space).

    The Fundamental Theorem of Linear Algebra

    Part 1: Dimensions

    By applying the Rank-Nullity theorem to both \mathbf{A} and its transpose \mathbf{A}^\top, we arrive at a complete description of the dimensions of all four fundamental subspaces:

    • \dim(\text{Col}(\mathbf{A})) = r
    • \dim(\text{Row}(\mathbf{A})) = r
    • \dim(\text{Null}(\mathbf{A})) = n - r
    • \dim(\text{Null}(\mathbf{A}^\top)) = m - r

    Part 2: Geometry

    • The row space and the null space are orthogonal complements in the input space \mathbb{R}^n.
    • The column space and the left null space are orthogonal complements in the output space \mathbb{R}^m.

    Theoretical development

    From eigendecomposition

    Consider the matrix \mathbf{S}_1 = \mathbf{A}^\top\mathbf{A} \in \mathbb{R}^{n \times n}. By construction, \mathbf{S}_1 is symmetric (\mathbf{S}_1^\top = (\mathbf{A}^\top\mathbf{A})^\top = \mathbf{A}^\top(\mathbf{A}^\top)^\top = \mathbf{A}^\top\mathbf{A} = \mathbf{S}_1) and positive semi-definite (\mathbf{x}^\top\mathbf{S}_1\mathbf{x} = \mathbf{x}^\top\mathbf{A}^\top\mathbf{A}\mathbf{x} = (\mathbf{A}\mathbf{x})^\top(\mathbf{A}\mathbf{x}) = |\mathbf{A}\mathbf{x}|_2^2 \ge 0).

    By the Spectral Theorem, any real symmetric matrix has a full set of orthonormal eigenvectors and real, non-negative eigenvalues. Therefore, \mathbf{S}_1 admits an eigendecomposition:

    \mathbf{A}^\top\mathbf{A} = \mathbf{V}\mathbf{D}\mathbf{V}^\top

    where \mathbf{V} is an orthogonal matrix of eigenvectors and \mathbf{D} is a diagonal matrix of eigenvalues \lambda_i \ge 0.

    Now, substitute the SVD form \mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top into this expression:

    \mathbf{A}^\top\mathbf{A} = (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top)^\top (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top) = (\mathbf{V}\mathbf{\Sigma}^\top\mathbf{U}^\top) (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top)

    Since \mathbf{U} is orthogonal, \mathbf{U}^\top\mathbf{U} = \mathbf{I}_m, which simplifies the expression to:

    \mathbf{A}^\top\mathbf{A} = \mathbf{V}(\mathbf{\Sigma}^\top\mathbf{\Sigma})\mathbf{V}^\top

    Comparing this with the eigendecomposition \mathbf{V}\mathbf{D}\mathbf{V}^\top, we establish the following fundamental relationships:

    • The right-singular vectors of \mathbf{A} (the columns of \mathbf{V}) are the eigenvectors of \mathbf{A}^\top\mathbf{A}.
    • The eigenvalues of \mathbf{A}^\top\mathbf{A} are the squares of the singular values of \mathbf{A}, i.e., \lambda_i = \sigma_i^2.

    A symmetric argument applies to the matrix \mathbf{S}_2 = \mathbf{A}\mathbf{A}^\top \in \mathbb{R}^{m \times m}:

    \mathbf{A}\mathbf{A}^\top = (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top)(\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top)^\top = \mathbf{U}(\mathbf{\Sigma}\mathbf{\Sigma}^\top)\mathbf{U}^\top

    This reveals that:

    • The left-singular vectors of \mathbf{A} (the columns of \mathbf{U}) are the eigenvectors of \mathbf{A}\mathbf{A}^\top.
    • The non-zero eigenvalues of \mathbf{A}\mathbf{A}^\top are identical to the non-zero eigenvalues of \mathbf{A}^\top\mathbf{A}, and are also equal to \sigma_i^2.
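
    Here’s a tiny NumPy sanity check of these relationships (my own toy example; the random matrix and seed are arbitrary):

    ```python
    # Quick numerical check of the SVD <-> eigendecomposition relationships above.
    # (A sketch; any random A works.)
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))

    U, s, Vt = np.linalg.svd(A)              # A = U diag(s) V^T
    evals, evecs = np.linalg.eigh(A.T @ A)   # eigh is for symmetric matrices
    evals = evals[::-1]                      # eigh returns ascending order; flip to descending

    print(np.allclose(evals, s**2))          # eigenvalues of A^T A = sigma_i^2  -> True
    # columns of V are the eigenvectors of A^T A (up to sign flips)
    print(np.allclose(np.abs(evecs[:, ::-1]), np.abs(Vt.T)))  # True
    ```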

    Singular values and geometry

    The relationship between the components can be summarized by the singular value equations, derived by right-multiplying \mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top by \mathbf{V}:

    \mathbf{A}\mathbf{V} = \mathbf{U}\mathbf{\Sigma}

    Considering this equation one column at a time, for i=1, \dots, n:

    \mathbf{A}\mathbf{v}_i = \sigma_i \mathbf{u}_i

    This equation provides the geometric interpretation:

    • The linear transformation \mathbf{A} maps the i-th right-singular vector (an orthonormal basis vector in the domain) to a scaled version of the i-th left-singular vector (an orthonormal basis vector in the codomain).
    • The scaling factor is the i-th singular value.

    Reduced SVD

    What if m \gg n or n \gg m? We use the reduced version!

    • Full SVD: The form described above, where \mathbf{U} is m \times m and \mathbf{V} is n \times n.
    • Reduced SVD: If m > n, the matrix \mathbf{\Sigma} will have m-n rows of all zeros. These rows, when multiplied by \mathbf{U}, contribute nothing to the final product. The reduced SVD eliminates these zero rows and the corresponding columns of \mathbf{U}:

    \mathbf{A} = \mathbf{U}_r \mathbf{\Sigma}_r \mathbf{V}_r^\top

    where \mathbf{U}_r \in \mathbb{R}^{m \times n}, \mathbf{\Sigma}_r \in \mathbb{R}^{n \times n} (now a square diagonal matrix), and \mathbf{V}_r = \mathbf{V} \in \mathbb{R}^{n \times n}. This is a more memory-efficient representation.
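
    If you want to see the shapes for yourself, here’s a quick sketch (a made-up random matrix, nothing more):

    ```python
    # Full vs reduced SVD in NumPy.
    import numpy as np

    A = np.random.default_rng(1).standard_normal((1000, 10))     # m >> n

    U_full, s, Vt = np.linalg.svd(A, full_matrices=True)         # U_full: 1000 x 1000
    U_r, s_r, Vt_r = np.linalg.svd(A, full_matrices=False)       # U_r:    1000 x 10

    print(U_full.shape, U_r.shape)                    # (1000, 1000) (1000, 10)
    print(np.allclose(A, U_r @ np.diag(s_r) @ Vt_r))  # the reduced form still reconstructs A
    ```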

    Applications

    Low-Rank Approximation and Principal Component Analysis (PCA)

    The expression \mathbf{A} = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top represents \mathbf{A} as a weighted sum of rank-one matrices. The Eckart-Young-Mirsky Theorem states that the optimal rank-k approximation of \mathbf{A} (the matrix \mathbf{A}_k that minimizes |\mathbf{A} - \mathbf{A}_k|) is found by truncating this sum:

    \mathbf{A}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top

    What does this mean? You keep only the rank-one terms corresponding to the largest singular values:) This is what’s used in PCA! (PCA, Spectral Clustering and t-SNE)
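
    A minimal sketch of the truncated sum (toy data; rank_k_approx is just a helper I’m defining here):

    ```python
    # Rank-k approximation by truncating the SVD sum (Eckart-Young in action).
    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((50, 8)) @ rng.standard_normal((8, 40))  # roughly rank-8 structure
    A += 0.01 * rng.standard_normal(A.shape)                         # plus a little noise

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    def rank_k_approx(U, s, Vt, k):
        """Sum of the first k rank-one terms sigma_i * u_i v_i^T."""
        return (U[:, :k] * s[:k]) @ Vt[:k, :]

    A2 = rank_k_approx(U, s, Vt, 2)
    A8 = rank_k_approx(U, s, Vt, 8)
    print(np.linalg.norm(A - A2), np.linalg.norm(A - A8))  # error shrinks as k grows
    ```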

    Moore-Penrose Pseudoinverse and Linear Regression

    For any matrix \mathbf{A}, the SVD provides its pseudoinverse: \mathbf{A}^\dagger = \mathbf{V}\mathbf{\Sigma}^\dagger\mathbf{U}^\top, where \mathbf{\Sigma}^\dagger is formed by taking the reciprocal of the non-zero singular values in \mathbf{\Sigma} and transposing.

    In ML: The solution to the linear least-squares problem, \min_{\mathbf{x}} |\mathbf{A}\mathbf{x} - \mathbf{b}|_2^2, is given by \mathbf{x} = \mathbf{A}^\dagger\mathbf{b}.
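
    Here’s a small sketch of that – building \mathbf{A}^\dagger from the SVD and comparing against NumPy’s own solvers (random toy data, full column rank assumed so all \sigma_i > 0):

    ```python
    # Least squares via the SVD pseudoinverse (np.linalg.lstsq does this for you anyway).
    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((100, 4))
    b = rng.standard_normal(100)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T           # V Sigma^+ U^T (all sigma_i > 0 here)
    x_svd = A_pinv @ b

    x_np, *_ = np.linalg.lstsq(A, b, rcond=None)     # NumPy's own least-squares solver
    print(np.allclose(x_svd, x_np))                  # True
    print(np.allclose(A_pinv, np.linalg.pinv(A)))    # matches np.linalg.pinv as well
    ```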

    Numerical Stability and Condition Number

    Remember? I spoke about this in Matrix Decompositions 2? Condition number, that is.

    The condition number of a matrix \mathbf{A} measures the sensitivity of the solution of \mathbf{A}\mathbf{x}=\mathbf{b} to perturbations in the data. It is defined as

    \kappa(\mathbf{A}) = \frac{\sigma_1}{\sigma_r} \quad (\text{the ratio of the largest to the smallest non-zero singular value})
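
    In code, that’s just two singular values away (toy example):

    ```python
    # Condition number straight from the singular values.
    import numpy as np

    A = np.random.default_rng(4).standard_normal((6, 6))
    s = np.linalg.svd(A, compute_uv=False)
    print(s[0] / s[-1])          # sigma_1 / sigma_r
    print(np.linalg.cond(A))     # same number (NumPy uses the 2-norm by default)
    ```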

    Aside: Eckart-Young-Mirsky Theorem

    Matrix norms

    Spectral Norm: The maximum scaling factor the matrix can apply to any unit vector: |\mathbf{M}|_2 = \max_{|\mathbf{x}|_2=1} |\mathbf{M}\mathbf{x}|_2. It can be proven that this is equal to the largest singular value, |\mathbf{M}|_2 = \sigma_1.

    Frobenius Norm: The matrix equivalent of the vector Euclidean norm, defined as the square root of the sum of the squares of all its elements: |\mathbf{M}|_F = \sqrt{\sum_{i,j} M_{ij}^2}. It can be proven that this is equal to the square root of the sum of the squares of its singular values: |\mathbf{M}|_F = \sqrt{\sum_{i=1}^r \sigma_i^2}.

    Theorem

    Let the SVD of \mathbf{M} be \mathbf{M} = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top. The best rank-k approximation to \mathbf{M} (for k \le r), denoted \mathbf{M}_k, is the one that minimizes the approximation error |\mathbf{M} - \mathbf{M}_k|. For both the spectral and Frobenius norms, this minimum is achieved by the truncated SVD sum:

    \mathbf{M}_k = \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top

    The approximation error is given by:

    • |\mathbf{M} - \mathbf{M}_k|_2 = \sigma_{k+1}
    • |\mathbf{M} - \mathbf{M}_k|_F = \sqrt{\sum_{i=k+1}^r \sigma_i^2}

    This theorem guarantees that truncating the SVD provides the most compact linear approximation of a matrix for a given rank, capturing the “most important parts of M” by retaining the components with the largest singular values.
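
    A quick numerical check of those two error formulas (toy matrix, k chosen arbitrarily):

    ```python
    # Checking: ||M - M_k||_2 = sigma_{k+1} and the Frobenius error formula.
    import numpy as np

    M = np.random.default_rng(5).standard_normal((30, 20))
    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    k = 5
    M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    print(np.isclose(np.linalg.norm(M - M_k, 2), s[k]))                            # spectral norm error
    print(np.isclose(np.linalg.norm(M - M_k, 'fro'), np.sqrt(np.sum(s[k:]**2))))   # Frobenius error
    ```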

    Aside: Intuitions for SVD

    I highly suggest looking up Six (and a half) intuitions for SVD, but here are some things that aren’t directly covered earlier in this blog:

    Theoretical development for Moore-Penrose inverse

    We substitute the SVD of \mathbf{M} into the objective function and use the property that the L2 norm is invariant under orthogonal transformations:

    \begin{aligned} |\mathbf{M}\mathbf{x} - \mathbf{b}|_2^2 &= |\mathbf{U}\mathbf{S}\mathbf{V}^\top\mathbf{x} - \mathbf{b}|_2^2 \\ &= |\mathbf{U}^\top(\mathbf{U}\mathbf{S}\mathbf{V}^\top\mathbf{x} - \mathbf{b})|_2^2 \\ &= |\mathbf{S}\mathbf{V}^\top\mathbf{x} - \mathbf{U}^\top\mathbf{b}|_2^2 \end{aligned}

    By defining a change of variables \mathbf{x}' = \mathbf{V}^\top\mathbf{x} and \mathbf{b}' = \mathbf{U}^\top\mathbf{b}, the problem becomes \min_{\mathbf{x}'} |\mathbf{S}\mathbf{x}' - \mathbf{b}'|_2^2. Since \mathbf{S} is diagonal, this is a sum of independent squared errors:

    \sum_{i=1}^r (\sigma_i x'_i - b'_i)^2 + \sum_{i=r+1}^m (b'_i)^2

    The minimum is achieved when \sigma_i x'_i - b'_i = 0 for i=1,\dots,r, yielding x'_i = b'_i / \sigma_i. For i > r, x'_i is unconstrained as \sigma_i=0; the minimum-norm solution sets these to zero. Transforming back gives the solution:

    \mathbf{x} = \mathbf{V}\mathbf{x}' = \sum_{i=1}^{r} \frac{b'_i}{\sigma_i}\mathbf{v}_i = \sum_{i=1}^{r} \frac{\mathbf{u}_i^\top \mathbf{b}}{\sigma_i}\mathbf{v}_i = \mathbf{M}^\dagger \mathbf{b}

    where \mathbf{M}^\dagger = \mathbf{V}\mathbf{S}^\dagger\mathbf{U}^\top is the Moore-Penrose Pseudoinverse.

    Listing the intuitions

    We have mostly covered all of these above, btw:

    1. Rotations and Scalings (The Geometric Picture)
    2. Best Rank-k Approximations
    3. Least Squares Regression
    4. Input and Output Directions
    5. Lost & Preserved Information
    6. PCA and Information Compression

    Yay, we are (mostly) done with matrix decompositions!

    Sources

    1. Mathematics for Machine Learning (link)
    2. Six (and a half) intuitions for SVD (link)

    1. What’s there?
      1. Why do I write this?
    2. QR Decomposition
      1. Basics?
      2. Ok, what is it?
      3. Ok, but why?
      4. How do I do it?
      5. Applications!!
    3. Cholesky decomposition
      1. Axiomatic basis
      2. What is it?
      3. Isn’t this very specific?
    4. LU decomposition
      1. Definition
      2. How do I do it?
      3. Applications
    5. Sources

    What’s there?

    There are basically 5 (ಐದು / पाँच) types of matrix decompositions that are like super important. I strive to go in depth into all of them. These are

    1. Eigendecomposition (Matrix Decompositions 1)
    2. QR decomposition (This post!)
    3. Cholesky decomposition (This post!)
    4. LU decomposition (This post!)
    5. (perhaps the most important) Singular Value Decomposition

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    QR Decomposition

    Basics?

    Yeah, you remember vector spaces and orthogonality? Yeah, that’s it.

    Ok, what is it?

    A factorization of a matrix \mathbf{A} \in \mathbb{R}^{m \times n} into the product of two specific types of matrices: an orthogonal matrix \mathbf{Q} and an upper triangular matrix \mathbf{R}.

    Definition (QR Decomposition): Let \mathbf{A} be an m \times n matrix with linearly independent columns. The QR decomposition of \mathbf{A} is a factorization:

    \mathbf{A} = \mathbf{Q}\mathbf{R}

    where:

    • \mathbf{Q} is an m \times n matrix with orthonormal columns. Thus, \mathbf{Q}^\top \mathbf{Q} = \mathbf{I}_n, where \mathbf{I}_n is the n \times n identity matrix.
    • \mathbf{R} is an n \times n upper triangular matrix. This means all entries below the main diagonal are zero (R_{ij} = 0 for i > j).
      • R? Right? Right-upper? And Q preceded R? Get it? Idk.

    Ok, but why?

    Orthogonal matrices are awesome.

    QR decomposition formalizes a process of transforming an arbitrary basis (the columns of \mathbf{A}) into an orthonormal basis (the columns of \mathbf{Q}), while encoding the transformation information in the upper triangular matrix \mathbf{R}.

    How do I do it?

    Gram-Schmidt process!

    This handout from UCLA is awesome. I have realised that I’m not made of time, so I want to keep these notes mostly for intuition, with rigorous math only for the places that I deem super important (around ~50%).

    Applications!!

    This is why I do this – yayyy (first time I got to this section in under ~20 minutes).

    1. Solving Linear Systems and Least-Squares Problems
    2. In robotics?

    Solving Linear Systems and Least-Squares Problems

    Say we want to solve \mathbf{A}\mathbf{x} = \mathbf{b}. We can write –

    \mathbf{Q}\mathbf{R}\mathbf{x} = \mathbf{b}

    Since \mathbf{Q} is orthogonal, \mathbf{Q}^{-1} = \mathbf{Q}^\top. We can multiply by \mathbf{Q}^\top:

    \mathbf{Q}^\top\mathbf{Q}\mathbf{R}\mathbf{x} = \mathbf{Q}^\top\mathbf{b} \implies \mathbf{R}\mathbf{x} = \mathbf{Q}^\top\mathbf{b}

    This new system is trivial to solve because \mathbf{R} is upper triangular. We can solve for the components of \mathbf{x} starting from the last one and moving upwards, a process known as back substitution.

    Now, for the linear least squares regression.

    The objective is to find the vector \mathbf{x} that minimizes the squared Euclidean norm of the residual:

    \min_{\mathbf{x}} |\mathbf{A}\mathbf{x} - \mathbf{b}|_2^2

    That means, for us,

    |\mathbf{Q}\mathbf{R}\mathbf{x} - \mathbf{b}|_2^2 = |\mathbf{Q}^\top(\mathbf{Q}\mathbf{R}\mathbf{x} - \mathbf{b})|_2^2 = |\mathbf{R}\mathbf{x} - \mathbf{Q}^\top\mathbf{b}|_2^2

    How did we do this? Orthogonal transformations preserve the Euclidean norm!!!

    Very cool and smart sounding, right? But, why did we do this?

    The main issue with the usual normal-equations approach (solving \mathbf{A}^\top\mathbf{A}\mathbf{x} = \mathbf{A}^\top\mathbf{b}) lies with the formation of \mathbf{A}^\top\mathbf{A}. It has a high condition number and is less numerically stable.

    A condition number is a score that measures how sensitive a function or a matrix is to small changes in its input.

    This means that –

    • A low condition number means a matrix is well-conditioned. Small changes in the input data will only cause small changes in the output or solution. This is stable and reliable.
    • A high condition number means a matrix is ill-conditioned. Tiny, insignificant changes in the input data can cause huge, dramatic changes in the solution. This is unstable and unreliable.

    In numerical calculations, an ill-conditioned matrix can amplify tiny rounding errors into completely wrong answers.

    Forming the matrix \mathbf{A}^\top\mathbf{A} can be disastrous for numerical stability. Mathematically, the condition number of \mathbf{A}^\top\mathbf{A} is the square of the condition number of \mathbf{A}.

    cond(\mathbf{A}^\top\mathbf{A}) = [cond(\mathbf{A})]^2

    This means that if your original matrix \mathbf{A} is even slightly ill-conditioned, the matrix \mathbf{A}^\top\mathbf{A} you need to work with will be extremely ill-conditioned.

    The QR method is superior because it cleverly bypasses the need to form the unstable \mathbf{A}^\top\mathbf{A} matrix.

    1. Decomposition: First, it decomposes the original matrix \mathbf{A} into \mathbf{Q}\mathbf{R}, where \mathbf{Q} is an orthogonal matrix and \mathbf{R} is an upper triangular matrix. This decomposition process itself is numerically stable.
    2. Stable Transformation: The next step uses a key property of orthogonal matrices: they don’t change the length (Euclidean norm) of vectors when they multiply them. This allows the problem to be transformed without amplifying errors:

    |\mathbf{A}\mathbf{x} - \mathbf{b}|_2^2 = |\mathbf{Q}\mathbf{R}\mathbf{x} - \mathbf{b}|_2^2 = |\mathbf{R}\mathbf{x} - \mathbf{Q}^\top\mathbf{b}|_2^2

    This transformation is numerically stable because the condition number of \mathbf{R} is the same as that of the original matrix \mathbf{A}. We haven’t made the problem worse:))))
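
    Here’s a small sketch contrasting the two routes (toy data; the condition-number printout is the interesting bit):

    ```python
    # QR-based least squares vs the normal equations.
    import numpy as np
    from scipy.linalg import solve_triangular   # back substitution for the triangular system

    rng = np.random.default_rng(6)
    A = rng.standard_normal((200, 5))
    b = rng.standard_normal(200)

    # cond(A^T A) is the square of cond(A) -- this is the numerical danger.
    print(np.linalg.cond(A), np.linalg.cond(A.T @ A))

    Q, R = np.linalg.qr(A)                       # reduced QR: Q is 200x5, R is 5x5
    x_qr = solve_triangular(R, Q.T @ b)          # solve R x = Q^T b by back substitution
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)     # normal equations (fine here, less stable in general)
    print(np.allclose(x_qr, x_ne))
    ```

    Same answer on this well-conditioned toy problem, but when cond(\mathbf{A}) is large the normal-equations route starts losing digits first.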

    In Robotics?

    Literally anywhere you use least squares. Example – Levenberg-Marquardt algorithm in inverse kinematics:)


    Cholesky decomposition

    Axiomatic basis

    We do have something for this!

    The entire theory of Cholesky decomposition rests upon the properties of Symmetric Positive Definite (SPD) matrices. Let \mathbf{A} \in \mathbb{R}^{n \times n} be a square matrix.

    1. Symmetry: The matrix is symmetric if it is equal to its transpose, \mathbf{A} = \mathbf{A}^\top. This means A_{ij} = A_{ji} for all i, j (the more you know!).
    2. Positive Definiteness: A symmetric matrix \mathbf{A} is positive definite if for every non-zero vector \mathbf{x} \in \mathbb{R}^n \setminus {\mathbf{0}}, the quadratic form \mathbf{x}^\top \mathbf{A} \mathbf{x} is strictly positive.

    \mathbf{x}^\top \mathbf{A} \mathbf{x} > 0 \quad \forall \mathbf{x} \neq \mathbf{0}

    Key Properties of SPD Matrices:

    • All eigenvalues of an SPD matrix are real and strictly positive.
    • All diagonal elements (A_{ii}) are strictly positive.
    • The matrix is always invertible (since no eigenvalue is zero).

    Proofs!

    What is it?

    Let \mathbf{A} be a symmetric, positive definite matrix. Its Cholesky decomposition is a factorization of the form:

    \mathbf{A} = \mathbf{L}\mathbf{L}^\top

    where:

    • \mathbf{L} is a lower triangular matrix with strictly positive diagonal entries (L_{ii} > 0).
    • \mathbf{L}^\top is the transpose of \mathbf{L}, which is an upper triangular matrix.

    Isn’t this very specific?

    So, why do we need this?

    We often need to generate random samples from a multivariate Gaussian distribution \mathcal{N}(\boldsymbol{\mu}, \mathbf{\Sigma}). The covariance matrix \mathbf{\Sigma} is, by definition, SPD.

    How do we use it?

    1. Generate Standard Normal Variates: Generate a vector \mathbf{z} \in \mathbb{R}^D where each component z_i is an independent sample from the standard normal distribution \mathcal{N}(0, 1). This vector \mathbf{z} has a mean of \mathbf{0} and a covariance matrix of \mathbf{I}.
    2. Decompose: Compute the Cholesky factor \mathbf{L} of the target covariance matrix, \mathbf{\Sigma} = \mathbf{L}\mathbf{L}^\top.
    3. Transform: Create the desired sample \mathbf{x} using the affine transformation:

    \mathbf{x} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z}

    This works because the mean and covariance of the transformed variable \mathbf{x} are:

    • \mathbb{E}[\mathbf{x}] = \mathbb{E}[\boldsymbol{\mu} + \mathbf{L}\mathbf{z}] = \boldsymbol{\mu} + \mathbf{L}\mathbb{E}[\mathbf{z}] = \boldsymbol{\mu}.
    • \text{Cov}(\mathbf{x}) = \text{Cov}(\boldsymbol{\mu} + \mathbf{L}\mathbf{z}) = \mathbf{L}\text{Cov}(\mathbf{z})\mathbf{L}^\top = \mathbf{L}\mathbf{I}\mathbf{L}^\top = \mathbf{L}\mathbf{L}^\top = \mathbf{\Sigma}.
      The Cholesky decomposition provides the correct linear transformation \mathbf{L} to “color” the uncorrelated noise \mathbf{z} into the correlated noise specified by \mathbf{\Sigma}.
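
    Here’s a minimal sketch of exactly this recipe (the \boldsymbol{\mu} and \mathbf{\Sigma} below are made up, obviously):

    ```python
    # Sampling from N(mu, Sigma) by "coloring" standard normal noise with the Cholesky factor.
    import numpy as np

    rng = np.random.default_rng(7)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])              # an SPD covariance

    L = np.linalg.cholesky(Sigma)               # Sigma = L L^T, L lower triangular

    Z = rng.standard_normal((100_000, 2))       # z ~ N(0, I), one sample per row
    X = mu + Z @ L.T                            # x = mu + L z, written row-wise

    print(np.mean(X, axis=0))                   # close to mu
    print(np.cov(X, rowvar=False))              # close to Sigma
    ```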

    LU decomposition

    Definition

    The LU decomposition of a square matrix \mathbf{A} is a factorization:

    \mathbf{A} = \mathbf{L}\mathbf{U}

    where \mathbf{L} is a unit lower triangular matrix and \mathbf{U} is an upper triangular matrix. Not every matrix has an LU decomposition. However, if row swaps are allowed, a stable factorization is always possible. This leads to the more general form \mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}, where \mathbf{P} is a permutation matrix that records the row swaps.

    How do I do it?

    Remember how Gaussian elimination does elementary operations? And, these elementary operations can be written as elementary matrices? So, we can do –

    \mathbf{A} = (\mathbf{E}_{32}\mathbf{E}_{31}\mathbf{E}_{21})^{-1}\mathbf{U} = \mathbf{E}_{21}^{-1}\mathbf{E}_{31}^{-1}\mathbf{E}_{32}^{-1}\mathbf{U}

    Now, if you work it out – the inverses of these elementary matrices are lower triangular, and their product is also lower triangular. A nice property of these lower triangular elementary matrices is that their product is also simple: the multipliers l_{ij} slot directly into the matrix.

    \mathbf{L} = \mathbf{E}_{21}^{-1}\mathbf{E}_{31}^{-1}\mathbf{E}_{32}^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{bmatrix}

    Now, what if one of the pivots is 0? Or like, super small – leading to instability?

    To prevent this, we perform pivoting: swapping the current row with a subsequent row that has a larger pivot element.

    Each row swap can be represented by left-multiplication with a permutation matrix \mathbf{P}. A permutation matrix is an identity matrix with its rows reordered. Applying all necessary row swaps to \mathbf{A} at the beginning gives the general and stable form of the decomposition:

    \mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}

    This factorization exists for any square matrix \mathbf{A}.

    Applications

    Literally any decomposition makes solving linear systems easier. I’m gonna skip this but that’s the primary use (again).

    Other than that, the decomposition provides an efficient way to calculate the determinant.

    \det(\mathbf{P}\mathbf{A}) = \det(\mathbf{L})\det(\mathbf{U}) \implies \det(\mathbf{P})\det(\mathbf{A}) = \det(\mathbf{L})\det(\mathbf{U})

    • \det(\mathbf{L}) = 1 (since it is unit triangular).
    • \det(\mathbf{U}) = \prod_{i=1}^n U_{ii} (the product of its diagonal elements).
    • \det(\mathbf{P}) = (-1)^s, where s is the number of row swaps performed.
      Therefore, \det(\mathbf{A}) = (-1)^s \prod_{i=1}^n U_{ii}.
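
    A tiny sketch using SciPy’s lu (note: SciPy’s convention is \mathbf{A} = \mathbf{P}\mathbf{L}\mathbf{U}, which doesn’t change the determinant bookkeeping):

    ```python
    # det(A) from the LU factorization.
    import numpy as np
    from scipy.linalg import lu

    A = np.random.default_rng(8).standard_normal((4, 4))
    P, L, U = lu(A)   # SciPy returns A = P L U (its P is the transpose of the one in PA = LU)

    det_from_lu = np.linalg.det(P) * np.prod(np.diag(U))   # det(P) = (-1)^s, det(L) = 1
    print(det_from_lu, np.linalg.det(A))                   # the two agree
    ```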

    Sources

    1. Mathematics for Machine Learning (link)
    1. What’s this?
      1. Why do I write this?
    2. Principal Component Analysis (PCA)
      1. Perspectives
      2. Huh? What’s variance? Covariance?
      3. Some math-y aspects of variance
      4. Covariance!
      5. Covariance Matrix!!
      6. Back to PCA
      7. Probabilistic PCA (PPCA)
      8. Kernel PCA
    3. Spectral Clustering
      1. Foundations
      2. Graph cut and Laplacian
      3. Generalized
    4. t-Distributed Stochastic Neighbor Embedding (t-SNE)
      1. Why not just PCA?
      2. Prerequisites
      3. Math
      4. Why not just kernel PCA?
    5. Sources

    What’s this?

    It’s a document on PCA, Spectral Clustering and t-SNE, duh? All of these are used to make sense of the structure of data. More importantly, all of these are intricately related to eigendecomposition (Matrix Decompositions 1).

    Why do I write this?

    As always, I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public.

    Principal Component Analysis (PCA)

    What’s this used for? We use it to find a lower-dimensional linear subspace of a high-dimensional data space that captures the maximal amount of variance in the data.

    Big dimensional data bad, lesser dimensions good:)

    Perspectives

    We can think about PCA from two viewpoints

    1. Maximizing the variance of projected data
    2. Minimizing reconstruction error

    Huh? What’s variance? Covariance?

    Let’s nail down expected value first.

    The expected value of a random variable represents its theoretical mean or center of mass. For a continuous random variable X with probability density function (PDF) p(x), the expected value is:

    \mathbb{E}[X] = \mu = \int_{-\infty}^{\infty} x p(x) dx

    So, as you can probably gauge, this is like the mean – describes the central tendency of a distribution. But, it does not tell us anything about its spread / dispersion. This is where we get variance!

    The variance of a random variable X, denoted \text{Var}(X) or \sigma^2, is the expected value of the squared deviation from its mean \mu.

    \text{Var}(X) = \sigma^2 := \mathbb{E}[(X - \mu)^2]

    The square root of the variance, \sigma, is the standard deviation, which is in the original units of the variable and is often more interpretable.

    Takeaway – A low variance indicates that the values are tightly clustered around the mean, while a high variance indicates that they are spread out over a wider range.

    Some math-y aspects of variance

    The Raw-score Formula

    The definition of variance can be expanded using the linearity of expectation.

    \text{Var}(X) = \mathbb{E}[(X^2 - 2\mu X + \mu^2)] = \mathbb{E}[X^2] - 2\mu\mathbb{E}[X] + \mu^2 = \mathbb{E}[X^2] - 2\mu^2 + \mu^2

    \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

    This is more convenient for us. Why?

    Empirically determining variance

    In practice, we do not have the true distribution p(x) but a finite dataset of N observations {x_n}_{n=1}^N. We estimate the true variance using the empirical variance (or sample variance). First, the empirical mean is calculated:

    \bar{x} = \frac{1}{N} \sum_{n=1}^N x_n

    The empirical variance is the average of the squared deviations from this empirical mean:

    \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^N (x_n - \bar{x})^2

    Important note – The unbiased estimator uses a denominator of N-1, but for large N and in many machine learning contexts, the version with N is used.

    Huh? You ask again.

    Why did I bring up N-1 above?

    First – an unbiased estimator is a formula that, on average, gives us the correct value for the property that we are trying to measure from a population.

    The population bit is important – this is the actual distribution. What we have are samples!

    When you calculate the variance of a sample, you don’t know the true mean (\mu) of the entire population. Instead, you have to use the sample mean (\bar{x}).

    The sample mean (\bar{x}) is calculated from your specific data points. By its nature, it’s the value that is closest to all the points in your sample, minimizing the sum of squared differences (x_n - \bar{x})^2 for that particular sample.

    Since the sample mean is always a “perfect fit” for the sample, the deviations from it are slightly smaller than the deviations would be from the true population mean! So, we divide by N-1 to account for this – we call this Bessel’s Correction.

    Using a smaller denominator makes the final result slightly larger. This small increase perfectly corrects for the underestimation caused by using the sample mean, ensuring that, on average, the calculated variance matches the true population variance. This is why it’s called the unbiased estimator. (What’s this? Keep reading for a few more sections. Or hit ctrl+F and find biased-estimator-def)

    Ok, but, why N-1???

    (Feel free to skip this bit if you don’t care about Math. But, then, that’s why you are here, isn’t it?)

    First, let’s define our terms:

    • \sigma^2 is the true variance of the population (what we want to estimate).
    • \mu is the true mean of the population.
    • \bar{x} = \frac{1}{N}\sum x_i is the sample mean.
    • The sample variance estimator is s^2 = \frac{1}{N-1}\sum(x_i - \bar{x})^2.
    • The expectation operator, E[\cdot], gives the long-run average of an estimator. For an estimator to be unbiased, its expectation must equal the true parameter, i.e., E[s^2] = \sigma^2.

    Now, the derivation!

    Our goal is to compute the expected value of the sum of squared differences, E\left[\sum(x_i - \bar{x})^2\right]. We start by rewriting the term inside the sum:

    \sum(x_i - \bar{x})^2 = \sum(x_i - \mu - (\bar{x} - \mu))^2

    Expanding the squared term gives:

    = \sum\left[ (x_i - \mu)^2 - 2(x_i - \mu)(\bar{x} - \mu) + (\bar{x} - \mu)^2 \right]

    Distributing the summation:

    = \sum(x_i - \mu)^2 - 2(\bar{x} - \mu)\sum(x_i - \mu) + \sum(\bar{x} - \mu)^2

    Now, let’s simplify the middle term, \sum(x_i - \mu), which equals N(\bar{x} - \mu). Substituting this in:

    = \sum(x_i - \mu)^2 - 2(\bar{x} - \mu) \cdot N(\bar{x} - \mu) + N(\bar{x} - \mu)^2

    = \sum(x_i - \mu)^2 - 2N(\bar{x} - \mu)^2 + N(\bar{x} - \mu)^2

    = \sum(x_i - \mu)^2 - N(\bar{x} - \mu)^2

    Now we take the expectation of this entire expression:

    E\left[\sum(x_i - \bar{x})^2\right] = E\left[\sum(x_i - \mu)^2\right] - E\left[N(\bar{x} - \mu)^2\right]

    By definition, E[(x_i - \mu)^2] is the population variance \sigma^2, and E[(\bar{x} - \mu)^2] is the variance of the sample mean, which is \sigma^2/N. Substituting these in:

    E\left[\sum(x_i - \bar{x})^2\right] = \sum E[(x_i - \mu)^2] - N \cdot E[(\bar{x} - \mu)^2]

    = N\sigma^2 - N\left(\frac{\sigma^2}{N}\right)

    = N\sigma^2 - \sigma^2

    = (N-1)\sigma^2

    See? Now remember that you can use either N or N - 1 since for very, very large N, \bar{x} \approx \mu.
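
    If the algebra isn’t convincing enough, here’s an empirical sketch of the bias (the population variance and sample size below are made up):

    ```python
    # Empirical check of Bessel's correction: average many sample variances and compare.
    import numpy as np

    rng = np.random.default_rng(9)
    true_var = 4.0                                # population: N(0, 4), so sigma^2 = 4
    N = 5                                         # small samples make the bias visible

    samples = rng.normal(0.0, np.sqrt(true_var), size=(200_000, N))
    var_N   = samples.var(axis=1, ddof=0)         # divide by N   (biased / MLE)
    var_Nm1 = samples.var(axis=1, ddof=1)         # divide by N-1 (unbiased)

    print(var_N.mean())    # ~ (N-1)/N * 4 = 3.2
    print(var_Nm1.mean())  # ~ 4.0
    ```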

    Covariance!

    Variance describes how a single variable changes. Covariance extends this idea to describe how two variables change together.

    The covariance between two random variables X and Y with means \mu_X and \mu_Y respectively, is the expected value of the product of their deviations from their means:

    \text{Cov}(X, Y) := \mathbb{E}[(X - \mu_X)(Y - \mu_Y)]

    Covariance Matrix!!

    Multidimensional Data

    For a D-dimensional random vector \mathbf{X} = [X_1, \dots, X_D]^\top with mean vector \boldsymbol{\mu} = \mathbb{E}[\mathbf{X}], the relationships between all pairs of its components are captured by the covariance matrix \mathbf{\Sigma} \in \mathbb{R}^{D \times D}.

    \mathbf{\Sigma} := \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top]

    The entry at position (i, j) of this matrix is the covariance between the i-th and j-th components of the vector:

    \Sigma_{ij} = \text{Cov}(X_i, X_j)

    The diagonal elements are the variances of each component, \Sigma_{ii} = \text{Var}(X_i).

    As always, properties!
    1. Symmetry: \mathbf{\Sigma} = \mathbf{\Sigma}^\top, since \text{Cov}(X_i, X_j) = \text{Cov}(X_j, X_i).
    2. Positive Semi-Definiteness: For any vector \mathbf{w} \in \mathbb{R}^D, \mathbf{w}^\top \mathbf{\Sigma} \mathbf{w} \ge 0. This is because \mathbf{w}^\top \mathbf{\Sigma} \mathbf{w} = \text{Var}(\mathbf{w}^\top \mathbf{X}), and variance is always non-negative! (since it’s a square, duh)

    Empirical Covariance Matrix

    As always, we only have a sample; not the distribution / population.

    So, what do we do? N or N-1?

    I will first define population and samples since I’ve just been naming them

    • Population: The entire set of all possible observations of interest. Population parameters, such as the true mean \mu and the true covariance \mathbf{\Sigma}, are considered fixed but unknown constants.
    • Sample: A finite subset of observations {\mathbf{x}_n}_{n=1}^N drawn from the population. We use the sample to make inferences about the unknown population parameters.

    With me so far? Good, now more definitions.

    An estimator is a function of the sample data that is used to estimate an unknown population parameter. For example, the sample mean \bar{\mathbf{x}} is an estimator for the population mean \boldsymbol{\mu}. A crucial property of an estimator is its bias.

    Definition (Bias of an Estimator): Let \theta be an unknown population parameter and let \hat{\theta} be an estimator of \theta based on a sample. The bias of the estimator is defined as the difference between the expected value of the estimator and the true value of the parameter:

    \text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta

    • An estimator is unbiased if its bias is zero, i.e., \mathbb{E}[\hat{\theta}] = \theta. This means that if we were to repeatedly draw samples of size N from the population and compute the estimate \hat{\theta} for each, the average of these estimates would converge to the true parameter \theta.
    • An estimator is biased if \mathbb{E}[\hat{\theta}] \neq \theta.

    (See? I told you a few more sections earlier; keyword: biased-estimator-def)

    Cool, I just want to remind you of Bessel’s Correction once again:

    \mathbb{E}[\hat{\sigma}^2_{N-1}] = \mathbb{E}\left[\frac{N}{N-1}\hat{\sigma}^2_N\right] = \frac{N}{N-1}\mathbb{E}[\hat{\sigma}^2_N] = \frac{N}{N-1} \left( \frac{N-1}{N}\sigma^2 \right) = \sigma^2

    and also the why once again – The bias arises because we are measuring deviations from the sample mean \bar{x}, which is itself derived from the data.

    The data points are, on average, closer to their own sample mean than to the true population mean, leading to an underestimation of the true spread.

    First, the MLE!

    Maximum Likelihood Estimation is a fundamental method for deriving estimators. Given a dataset \mathcal{X} = {x_n}_{n=1}^N and a parameterized probability model p(\mathbf{x} | \boldsymbol{\theta}), the likelihood function is the probability of observing the data, viewed as a function of the parameters \boldsymbol{\theta}:

    L(\boldsymbol{\theta}; \mathcal{X}) = p(\mathcal{X} | \boldsymbol{\theta})

    The MLE principle states that we should choose the parameters \hat{\boldsymbol{\theta}}_{\text{ML}} that maximize this likelihood function. For an i.i.d. sample, this is equivalent to maximizing the log-likelihood:

    \hat{\boldsymbol{\theta}}_{\text{ML}} = \arg\max _{\boldsymbol{\theta}} \sum_{n=1}^N \log p(\mathbf{x}_n | \boldsymbol{\theta})

    How did we get the above? Take log on both sides of

    L(\boldsymbol{\theta}; \mathcal{X}) = p(\mathcal{X} | \boldsymbol{\theta}) = \prod_{n=1}^N p(\mathbf{x}_n | \boldsymbol{\theta})

    Ok what the hell is this? Probability theory solves the “forward problem”: Given a model with known parameters, what does the data look like? (e.g., “If this coin is fair, \theta=0.5, I expect to see roughly 5 heads in 10 flips.”)

    (Statistical inference in general, and) MLE in particular, solves the inverse problem: Given the data we have observed, what were the most likely parameters of the model that generated it? (e.g., “I flipped a coin 10 times and got 8 heads. What is the most plausible value for this coin’s bias, $\theta$?”)

    • Normally, we think of p(\mathcal{X} | \boldsymbol{\theta}) as a probability distribution over possible datasets \mathcal{X}, for a fixed parameter setting \boldsymbol{\theta}.
    • The likelihood function, L(\boldsymbol{\theta}; \mathcal{X}), treats the observed data \mathcal{X} as fixed and views this quantity as a function of the parameters \boldsymbol{\theta}.

    Here, I present a statement:

    The covariance estimator with denominator N is the MLE under a Gaussian assumption

    Ponder this. What can it even mean? Yeah, it’s just that the data \mathcal{X} is taken to be Gaussian.

    Specifically, we assume our data points {\mathbf{x}_n} are drawn i.i.d. from a multivariate Gaussian distribution, \mathcal{N}(\boldsymbol{\mu}, \mathbf{\Sigma}). The parameters to be estimated are \boldsymbol{\theta} = {\boldsymbol{\mu}, \mathbf{\Sigma}}.

    That means that the log-likelihood function for this model and a dataset \mathcal{X} is:

    \log L(\boldsymbol{\mu}, \mathbf{\Sigma}; \mathcal{X}) = \sum_{n=1}^N \log \left( \frac{1}{(2\pi)^{D/2}|\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}_n-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu})\right) \right)

    Do math, take partial derivatives, yada yada and we get:

    • \hat{\boldsymbol{\mu}}_{\text{ML}} = \frac{1}{N}\sum\mathbf{x}_n = \bar{\mathbf{x}} (the sample mean)
    • \hat{\mathbf{\Sigma}}_{\text{ML}} = \frac{1}{N}\sum(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top

    Takeaway – The MLE (estimator / estimation – same thing) is the estimator that makes the observed data “most probable” under a Gaussian model.

    Now, this estimator is biased!

    Changing notation a bit (I started this on another day and am lazy to keep track, haha).

    Here’s what’s up rn:

    The estimator for the covariance matrix using denominator N is the MLE:

    \mathbf{S}_{N} = \frac{1}{N} \sum_{n=1}^N (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top

    Its expected value is: (tiny recap time!)

    \mathbb{E}[\mathbf{S}_N] = \frac{N-1}{N}\mathbf{\Sigma}

    This estimator is biased. The unbiased sample covariance matrix is:

    \mathbf{S}_{N-1} = \frac{1}{N-1} \sum_{n=1}^N (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top

    Its expected value is \mathbb{E}[\mathbf{S}_{N-1}] = \mathbf{\Sigma}.

    Now, which one do we use?

    Use MLE for ML, unbiased for statistics.

    • Statistics: Unbiasedness is a highly desirable property. An unbiased estimator is correct on average.
      • The unbiased estimator \mathbf{S}_{N-1} is the standard. In statistical analysis (e.g., hypothesis testing, confidence intervals), using an unbiased estimate of variance is critical for the validity of the statistical procedures. The focus is on estimation accuracy.
    • Machine Learning: Minimizing prediction error is a desired property. MLE is better for such cases.
      • The quality of an estimator is often judged by its Mean Squared Error (MSE)
      • \text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] = (\text{Bias}(\hat{\theta}))^2 + \text{Var}(\hat{\theta})
      • This is the bias-variance tradeoff. The best estimator is one that minimizes this total error.
      • The MLE estimator \mathbf{S}_{N} has a slightly lower variance than the unbiased estimator \mathbf{S}_{N-1} (what, you want the proof? we did this earlier, dummy). It can be shown that \text{MSE}(\mathbf{S}_{N}) < \text{MSE}(\mathbf{S}_{N-1}). Therefore, if the goal is to find an estimator that is “closer” to the true value on average in terms of squared error, the biased MLE is superior.

    Back to PCA

    We take the empirical covariance matrix to be

    \mathbf{S} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \mathbf{x}_n^\top = \frac{1}{N}\mathbf{X}^\top\mathbf{X}

    where \mathbf{X} \in \mathbb{R}^{N \times D} is the data matrix with observations as rows. This is when we are dealing with a centered dataset, i.e., the sample mean, \frac{1}{N}\sum_n \mathbf{x}_n = \mathbf{0}.

    Notice that the covariance matrix \mathbf{S} \in \mathbb{R}^{D \times D}; it reflects the dimensions, NOT the samples.

    Important Note: Ok, cool, N for ML (explained above). But, does it matter here? NO. We use eigenvectors and these are the same for both \mathbf{S}_{N-1} and \mathbf{S}_{N} since these are only scaled versions of each other.

    (Perspective 1) As maximising variance

    Remember: We have centered the dataset!

    Now, we are trying to find a lower dimensional space. So, we reduce the number of dimensions (duh), but what this means is that we should be able to represent {\mathbf{x}_n} \in \mathbb{R}^D using fewer than D coordinate vectors.

    So, we try to find a set of coordinate vectors in \mathbb{R}^D whose count is less than D. This new set of coordinate vectors will, thus, no longer span the entire space \mathbb{R}^D.

    Now, let us find the direction, represented by a unit vector \mathbf{b}_1 \in \mathbb{R}^D, onto which the projected data has the greatest variance.

    The projection of a data point \mathbf{x}_n onto the direction \mathbf{b}_1 results in a scalar coordinate z_{1n} = \mathbf{b}_1^\top \mathbf{x}_n. The variance of these projected coordinates over the entire dataset is:

    V_{1} = \frac{1}{N}\sum_{n=1}^{N} z_{1n}^{2}

    = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{b}_{1}^\top \mathbf{x}_{n})^{2}

    = \frac{1}{N}\sum_{n=1}^{N} \mathbf{b}_{1}^\top \mathbf{x}_{n} \mathbf{x}_{n}^\top \mathbf{b}_{1}

    = \mathbf{b}_{1}^\top \left(\frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_{n} \mathbf{x}_{n}^\top\right) \mathbf{b}_{1}

    = \mathbf{b}_{1}^\top \mathbf{S} \mathbf{b}_{1}

    What we want to do is maximize this variance subject to the constraint that \mathbf{b}_1 is a unit vector, i.e., |\mathbf{b}_1|^2 = \mathbf{b}_1^\top \mathbf{b}_1 = 1. This is a constrained optimization problem:

    \max_{\mathbf{b}_1} \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 \quad \text{subject to} \quad \mathbf{b}_1^\top \mathbf{b}_1 = 1

    We form the Lagrangian:

    \mathcal{L}(\mathbf{b}_1, \lambda_1) = \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 - \lambda_1(\mathbf{b}_1^\top \mathbf{b}_1 - 1)

    Setting the gradient with respect to \mathbf{b}_1 to zero yields:

    \nabla_{\mathbf{b}_1} \mathcal{L} = 2\mathbf{S}\mathbf{b}_1 - 2\lambda_1\mathbf{b}_1 = \mathbf{0} \implies \mathbf{S}\mathbf{b}_1 = \lambda_1\mathbf{b}_1

    This is the eigenvalue equation for the covariance matrix \mathbf{S}. To maximize the variance V_1 = \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 = \mathbf{b}_1^\top (\lambda_1 \mathbf{b}_1) = \lambda_1, we must choose \lambda_1 to be the largest eigenvalue of \mathbf{S}. The corresponding vector \mathbf{b}_1 is the first principal component.

    Subsequent principal components \mathbf{b}_2, \dots, \mathbf{b}_M are found by finding the eigenvectors of \mathbf{S} corresponding to the next largest eigenvalues, ensuring they are orthogonal to the preceding components.

    Notice how it is also very neat that the total projected variance is then just the sum of the chosen eigenvalues:)
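
    Here’s a minimal sketch of PCA done exactly this way – center, form \mathbf{S}, eigendecompose (toy 2-D data of my own):

    ```python
    # PCA as "eigendecompose the covariance matrix" (centered data assumed).
    import numpy as np

    rng = np.random.default_rng(10)
    X = rng.standard_normal((500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated 2-D data
    X = X - X.mean(axis=0)                       # center the dataset

    S = (X.T @ X) / X.shape[0]                   # empirical covariance, 1/N version

    eigvals, eigvecs = np.linalg.eigh(S)         # ascending order for symmetric S
    order = np.argsort(eigvals)[::-1]            # sort descending
    b1 = eigvecs[:, order[0]]                    # first principal component

    print(eigvals[order])                        # variances along the principal directions
    print(b1 @ S @ b1, eigvals[order][0])        # b1^T S b1 equals the top eigenvalue
    ```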

    (Perspective 2) As minimizing reconstruction error

    No one really talks about this one much but let’s see. I will be very math-y here.

    • Let the principal subspace be an M-dimensional subspace spanned by an orthonormal basis of vectors {\mathbf{b}_{m}}_{m=1}^M
    • \mathbf{B} = [\mathbf{b}_1, \dots, \mathbf{b}_M] \in \mathbb{R}^{D \times M}.
    • The orthogonal projection of a data point \mathbf{x}_n onto this subspace is \tilde{\mathbf{x}}_n = \mathbf{B}\mathbf{B}^\top \mathbf{x}_n.
    • Our goal is to find the basis \mathbf{B} that minimizes the average squared reconstruction error:

    J_M = \frac{1}{N} \sum_{n=1}^N |\mathbf{x}_n - \tilde{\mathbf{x}}_n|^2

    By the properties of orthogonal projections, the total variance can be decomposed: \sum_n |\mathbf{x}_n|^2 = \sum_n |\tilde{\mathbf{x}}_n|^2 + \sum_n |\mathbf{x}_n - \tilde{\mathbf{x}}_n|^2. Since the total variance \sum_n |\mathbf{x}_n|^2 is fixed, minimizing the reconstruction error is equivalent to maximizing the projected variance \sum_n |\tilde{\mathbf{x}}_n|^2.

    So, yes, we have very smartly reduced the 2nd perspective to the first perspective. Yay!

    Probabilistic PCA (PPCA)

    Classical PCA is a geometric, algorithmic procedure. Probabilistic PCA (PPCA) reformulates this procedure within a probabilistic framework by defining a latent variable model that describes the generative process of the data.

    What? Ok, fine, I’ll explain.

    PCA is a point estimate – gives you a single, definitive answer for the principal components and the lower-dimensional space they define, without expressing any uncertainty about that result. It is fundamentally a descriptive tool, not a generative one. This means:

    • It cannot generate new data points
    • It does not provide a likelihood of the data, making principled model selection (e.g., choosing the optimal number of components M) difficult!
    • It cannot handle missing values in the data.

    So, we now have PPCA! PPCA posits that the high-dimensional observed data \mathbf{x} \in \mathbb{R}^D is a manifestation of some simpler, unobserved (latent) cause \mathbf{z} \in \mathbb{R}^M, where M < D.

    PPCA models the relationship between the latent and the observed (this is such a smart sounding statement, omg).

    Definitions

    (No, what did you expect?)

    Latent Variable Prior: A simple cause is assumed for each data point. This is modeled by a prior distribution over the latent variable \mathbf{z}. PPCA assumes a standard multivariate Gaussian prior:

    p(\mathbf{z}) = \mathcal{N}(\mathbf{z} | \mathbf{0}, \mathbf{I})

    This asserts that the latent variables are, a priori, independent and centered at the origin of the latent space.

    Conditional Distribution of Data (Likelihood): The observed data point \mathbf{x} is generated from its corresponding latent variable \mathbf{z} via a linear transformation, with added noise. This is defined by the conditional probability p(\mathbf{x}|\mathbf{z}):

    p(\mathbf{x} | \mathbf{z}) = \mathcal{N}(\mathbf{x} | \mathbf{B}\mathbf{z} + \boldsymbol{\mu}, \sigma^2\mathbf{I})

    This is equivalent to the equation \mathbf{x}_n = \mathbf{B}\mathbf{z}_n + \boldsymbol{\mu} + \boldsymbol{\epsilon}_n, where:

    • \mathbf{B} \in \mathbb{R}^{D \times M} is a matrix that maps the latent space to the data space. Its columns can be interpreted as the basis vectors of the principal subspace.
      • The principal subspace is the lower-dimensional plane or hyperplane that captures the most important variations in your data. Basis vectors are the fundamental directions that define or “span” that subspace.
      • This just means that the reconstructed data point is made by a linear combination of the basis vectors of the principal subspace (plus the following things).
    • \boldsymbol{\mu} \in \mathbb{R}^D is the mean of the data.
      • Yeah I don’t think I can explain this more.
    • \boldsymbol{\epsilon}_n \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}) is isotropic Gaussian noise, which accounts for the variability of data off the principal subspace.
      • isotropic = fluctuations are same in all directions; hence the scalar multiplied with the identity matrix for variance.
      • Also, every dimension is independent and uncorrelated.

    Calculation time!

    To define the likelihood of the model, we must integrate out the latent variables to find the marginal distribution of the observed data, p(\mathbf{x}):

    p(\mathbf{x}) = \int p(\mathbf{x} | \mathbf{z}) p(\mathbf{z}) d\mathbf{z}

    Since both p(\mathbf{x}|\mathbf{z}) and p(\mathbf{z}) are Gaussian, and the relationship is linear, this integral can be solved analytically. The resulting marginal distribution is also a Gaussian, p(\mathbf{x}) = \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \mathbf{C}), with mean and covariance:

    • Mean: \mathbb{E}[\mathbf{x}] = \mathbb{E}[\mathbf{B}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}] = \mathbf{B}\mathbb{E}[\mathbf{z}] + \boldsymbol{\mu} + \mathbb{E}[\boldsymbol{\epsilon}] = \boldsymbol{\mu}.
    • Covariance: \text{Cov}(\mathbf{x}) = \text{Cov}(\mathbf{B}\mathbf{z} + \boldsymbol{\epsilon}) = \mathbf{B}\text{Cov}(\mathbf{z})\mathbf{B}^\top + \text{Cov}(\boldsymbol{\epsilon}) = \mathbf{B}\mathbf{I}\mathbf{B}^\top + \sigma^2\mathbf{I} = \mathbf{B}\mathbf{B}^\top + \sigma^2\mathbf{I}.

    The parameters of the model (\mathbf{B}, \boldsymbol{\mu}, \sigma^2) are found by maximizing the log-likelihood of the observed data under this marginal distribution. How to do this? I dunno and I can’t write that much latex. Go read a book for this please (note to self – code is available thankfully).
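
    What I can do here is simulate the generative story and check the marginal moments we just derived (a sketch with made-up \mathbf{B}, \boldsymbol{\mu} and \sigma):

    ```python
    # The PPCA generative model, simulated: x = B z + mu + eps.
    import numpy as np

    rng = np.random.default_rng(11)
    D, M, sigma = 3, 1, 0.1
    B = rng.standard_normal((D, M))                  # made-up factor loading matrix
    mu = np.array([1.0, 2.0, 3.0])                   # made-up mean

    n = 200_000
    Z = rng.standard_normal((n, M))                  # z ~ N(0, I)
    eps = sigma * rng.standard_normal((n, D))        # isotropic noise
    X = Z @ B.T + mu + eps                           # the generative model, row-wise

    print(X.mean(axis=0))                            # ~ mu
    C = B @ B.T + sigma**2 * np.eye(D)               # the marginal covariance B B^T + sigma^2 I
    print(np.abs(np.cov(X, rowvar=False) - C).max()) # small sampling error
    ```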

    Advantages

    • Handling Missing Data: The generative nature of PPCA allows it to handle missing data through the Expectation-Maximization (EM) algorithm.
      • How? Again, I think I’ll write more on EM later. But, for now, let’s take it as is.
    • Model Selection: The probabilistic formulation provides a log-likelihood value for the dataset given a model.
      • This allows for principled model selection, for instance, choosing the optimal number of principal components $M$ by comparing the likelihood values (often using criteria like AIC or BIC) for different choices of $M$.
    • PPCA is used in robotics for sensor fusion and state estimation where sensor noise is a critical factor. How? Idk – I took this statement as is. Takeaway – super useful.

    Kernel PCA

    This addresses the primary limitation of standard PCA – PCA seeks a linear subspace to represent the data. If the data lies on a complex, non-linear manifold (e.g., a spiral, a “Swiss roll”), any linear projection will fail to capture the underlying structure, leading to a poor low-dimensional representation.

    How? Map the data from the original input space \mathbb{R}^D to a higher (often infinite) dimensional feature space \mathcal{F} via a non-linear mapping \phi: \mathbb{R}^D \to \mathcal{F}. The mapping \phi is chosen such that the structure of the mapped data \phi(\mathbf{x}) becomes linear in \mathcal{F}. Standard linear PCA is then performed on the data in this feature space.

    But, we get a computational barrier:

    • The dimensionality of \mathcal{F} can be enormous or infinite.
    • The function \phi itself may not even be known explicitly.

    So, we use the kernel trick!

    Let’s perform PCA on the mapped data {\phi(\mathbf{x}_n)}_{n=1}^N. The covariance matrix in the feature space \mathcal{F} is:

    \mathbf{S}_\phi = \frac{1}{N} \sum_{n=1}^N \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^\top

    The eigenvalue problem in \mathcal{F} is \mathbf{S}_\phi \mathbf{v} = \lambda \mathbf{v}, where \mathbf{v} \in \mathcal{F} are the principal components in the feature space.

    Want to go on? Follow this lecture pdf from slide 32; I really can’t write it better for myself either. The only extra thing that I can offer is to answer the question: Why can eigenvectors be written as a linear combination of features?

    The eigenvectors \mathbf{v}_j can be written as a linear combination of the features because they must lie in the space spanned by the feature vectors \phi(\mathbf{x}_i) themselves.

    Let’s start with the PCA equation:

    \frac{1}{m} \sum_{i=1}^m \phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^T \mathbf{v}_j = \lambda_j \mathbf{v}_j

    We can isolate the eigenvector \mathbf{v}_j by rearranging the equation (assuming the eigenvalue \lambda_j is not zero):

    \mathbf{v}_j = \frac{1}{m\lambda_j} \sum_{i=1}^m \phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^T \mathbf{v}_j

    Now, let’s group the terms differently:

    \mathbf{v}_j = \sum_{i=1}^m \left( \frac{\phi(\mathbf{x}_i)^T \mathbf{v}_j}{m\lambda_j} \right) \phi(\mathbf{x}_i)

    Notice that the term in the parenthesis, \frac{\phi(\mathbf{x}_i)^T \mathbf{v}_j}{m\lambda_j}, is just a scalar (a single number) because it’s a dot product divided by constants.

    If we call this scalar coefficient a_{ji}, we get the exact relationship shown on the slide:

    \mathbf{v}_j = \sum_{i=1}^m a_{ji} \phi(\mathbf{x}_i)

    This proves that any eigenvector \mathbf{v}_j is simply a weighted sum of all the feature vectors \phi(\mathbf{x}_i).
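
    And here’s a hedged mini-implementation of kernel PCA using exactly those a_{ji} coefficients (RBF kernel; I skip the usual per-component normalization by \sqrt{\lambda_j}, which only rescales axes – for the real thing use something like sklearn.decomposition.KernelPCA):

    ```python
    # A minimal kernel-PCA sketch: kernel matrix -> center -> eigendecompose -> project.
    import numpy as np

    def rbf_kernel(X, gamma=1.0):
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T       # pairwise squared distances
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(12)
    theta = rng.uniform(0, 2 * np.pi, 300)
    X = np.c_[np.cos(theta), np.sin(theta)] * rng.choice([1.0, 3.0], 300)[:, None]  # two rings

    K = rbf_kernel(X, gamma=2.0)
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one             # center the (implicit) features

    eigvals, eigvecs = np.linalg.eigh(Kc)
    alphas = eigvecs[:, ::-1][:, :2]                       # coefficients a_ji for the top 2 components
    Z = Kc @ alphas                                        # non-linear 2-D embedding of the data
    print(Z.shape)                                         # (300, 2)
    ```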

    Spectral Clustering

    This is a graph-based clustering technique that partitions data by analyzing the spectrum (eigenvalues) of a graph’s Laplacian matrix.

    Foundations

    What? Ok, some definitions :sparkles: (I hope you can relate to some things if you took data structures and algorithms)

    Similarity Graph

    Given a dataset of N points {\mathbf{x}_1, \dots, \mathbf{x}_N}, we construct a graph G = (V, E), where the set of vertices V corresponds to the data points. The set of edges E represents the similarity or affinity between pairs of points. This similarity is quantified by a weighted adjacency matrix \mathbf{A} \in \mathbb{R}^{N \times N}, where A_{ij} \ge 0 is the similarity between points \mathbf{x}_i and \mathbf{x}_j. High values indicate high similarity.

    • Common ways to construct \mathbf{A} include the k-nearest neighbor graph or using a Gaussian kernel: A_{ij} = \exp(-|\mathbf{x}_i - \mathbf{x}_j|^2 / 2\sigma^2).

    Graph Partitioning

    The goal of clustering is to partition the vertices V into disjoint sets (clusters) C_1, C_2, \dots, C_k. An ideal partition is one where the edges within a cluster have high weights (high intra-cluster similarity) and the edges between clusters have low weights (low inter-cluster similarity). This is known as finding a minimal graph cut.

    • This is our objective btw in case it wasn’t clear.

    Graph Laplacian

    Ok, this isn’t much of a definition. I’ll try to explain it since it’s the core component / concept.

    It is the fundamental object whose spectrum (eigenvalues and eigenvectors) reveals the cluster structure of the graph.

    First, let’s define the degree matrix. A diagonal matrix where each diagonal element D_{ii} is the sum of weights of all edges connected to vertex i:

    D_{ii} = \sum_{j=1}^N A_{ij}

    The unnormalized graph laplacian is:

    \mathbf{L} := \mathbf{D} - \mathbf{A}

    Some properties of the Graph Laplacian:

    • It is symmetric and positive semi-definite.
      • It is symmetric since both the adjacency matrix and the degree matrix (a diagonal matrix) are symmetric.
      • We later show that \mathbf{x}^\top\mathbf{L}\mathbf{x} turns into a sum of squares, so it is positive semi-definite. Or, ctrl + F and find quadratic-laplacian-square.
      • The Spectral Theorem states that any real, symmetric matrix has real eigenvalues and orthogonal eigenvectors.
        • The added property of being positive semi-definite ensures these real eigenvalues are also non-negative (λ≥0).
        • This is crucial because it allows us to reliably sort the eigenvalues from smallest to largest.
    • The smallest eigenvalue of \mathbf{L} is always \lambda_1 = 0, with the corresponding eigenvector being the vector of all ones, \mathbf{u}_1 = \mathbf{1}.
      • This is because the sum of each row in the Laplacian matrix is always zero by construction (D_{ii} - \sum_j A_{ij} = 0).

    Graph cut and Laplacian

    Ok, simplest case first – we have two partitions, i.e., two disjoint sets, C and its complement \bar{C}.

    The “cut” is the sum of weights of all edges connecting these two sets:

    \text{cut}(C, \bar{C}) = \sum_{i \in C, j \in \bar{C}} A_{ij}

    Repeating for dramatic effect – minimizing this cut directly is an NP-hard combinatorial problem.

    We (I) define a partition indicator vector \mathbf{f} \in \mathbb{R}^N where f_i = 1 if vertex i \in C and f_i = -1 if vertex i \in \bar{C}.

    (Yeah, I didn’t want to write the full piecewise latex.)

    Now, consider the quadratic form of the Laplacian, \mathbf{f}^\top \mathbf{L} \mathbf{f}.

    \mathbf{f}^\top \mathbf{L} \mathbf{f} = \mathbf{f}^\top (\mathbf{D} - \mathbf{A}) \mathbf{f}

    = \sum_{i} D_{ii}f_{i}^2 - \sum_{i,j} A_{ij}f_{i} f_{j}

    = \sum_{i} \left(\sum_{j} A_{ij}\right) f_{i}^2 - \sum_{i,j} A_{ij}f_{i} f_{j}

    = \frac{1}{2} \left( \sum_{i} \sum_{j} A_{ij}f_{i}^2 - 2\sum_{i,j} A_{ij}f_{i} f_{j} + \sum_{j} \sum_{i} A_{ji}f_{j}^2 \right)

    = \frac{1}{2} \sum_{i,j} A_{ij}(f_{i}^2 - 2f_{i} f_{j} + f_{j}^2)

    = \frac{1}{2} \sum_{i,j} A_{ij}(f_{i} - f_{j})^2

    See? I promised you a square earlier! (keyword – quadratic-laplacian-square)

    Now, observe the term (f_i - f_j)^2:

    • If f_i = f_j (i.e., both nodes are in the same partition), (f_i - f_j)^2 = 0.
    • If f_i \neq f_j (i.e., nodes are in different partitions), (f_i - f_j)^2 = (1 - (-1))^2 = 4.

    So / Therefore / Thus / Hence, minimizing the graph cut is equivalent to minimizing \mathbf{f}^\top \mathbf{L} \mathbf{f}.

    How does this help us?

    The optimization problem \min_{\mathbf{f}} \mathbf{f}^\top \mathbf{L} \mathbf{f} with the constraint that f_i \in {1, -1} is still combinatorial.

    So, what we do is – we relax (with multiple x’s: relaxxx). The entries of \mathbf{f} can take any real values now. So, we get

    \min_{\mathbf{f} \in \mathbb{R}^N} \mathbf{f}^\top \mathbf{L} \mathbf{f} \quad \text{subject to} \quad \mathbf{f}^\top \mathbf{f} = 1 \quad (\text{or some other normalization})

    This is the formulation for minimizing the Rayleigh quotient. Its solution? Notice how it’s the same optimization problem we had in PCA?

    The solution is given by the eigenvector of \mathbf{L} corresponding to the smallest eigenvalue.

    However, the smallest eigenvalue is \lambda_1=0 with eigenvector \mathbf{u}_1 = \mathbf{1}. This is a trivial solution where all entries of \mathbf{f} are constant, which does not define a partition.

    So, the non-trivial solution must be orthogonal to this vector of ones.

    \min \mathbf{f}^\top \mathbf{L} \mathbf{f} subject to \mathbf{f}^\top \mathbf{f} = 1 and \mathbf{f}^\top \mathbf{1} = 0

    This is the eigenvector corresponding to the second-smallest eigenvalue of \mathbf{L}.

    • This eigenvector is known as the Fiedler vector.

    Generalized

    (I’m just going to get an LLM to generate it)

    1. Construct Similarity Graph: Given N data points, construct the adjacency matrix \mathbf{A}.
    2. Compute Laplacian: Compute the degree matrix \mathbf{D} and the Laplacian \mathbf{L} = \mathbf{D} - \mathbf{A}.
    3. Compute Eigenvectors: Find the first k eigenvectors \mathbf{u}_1, \dots, \mathbf{u}_k of \mathbf{L} corresponding to the k smallest eigenvalues.
    4. Form Embedding: Construct a matrix \mathbf{U} \in \mathbb{R}^{N \times k} with these eigenvectors as its columns.
    5. Cluster Embedding: Treat each row of \mathbf{U} as a new data point in a k-dimensional space. Cluster these N new points into k clusters using a simple algorithm like k-means.
    6. Assign Original Points: The cluster assignment of the i-th row of \mathbf{U} is the final cluster assignment for the original data point \mathbf{x}_i.
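
    Here’s a small sketch of that recipe end-to-end (RBF similarity, unnormalized Laplacian, and scikit-learn’s k-means for the last step; the two-rings dataset is made up to show a case where plain k-means on the raw points would fail):

    ```python
    # Spectral clustering, following steps 1-6 above.
    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(X, k, sigma=1.0):
        # 1. similarity graph (Gaussian kernel adjacency)
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
        A = np.exp(-d2 / (2 * sigma**2))
        np.fill_diagonal(A, 0.0)                     # no self-loops
        # 2. Laplacian L = D - A
        L = np.diag(A.sum(axis=1)) - A
        # 3-4. embedding from the k smallest eigenvectors
        eigvals, eigvecs = np.linalg.eigh(L)         # ascending order
        U = eigvecs[:, :k]
        # 5-6. k-means on the rows of U
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

    rng = np.random.default_rng(13)
    theta = rng.uniform(0, 2 * np.pi, 200)
    radius = np.repeat([1.0, 4.0], 100)              # two concentric rings
    X = np.c_[radius * np.cos(theta), radius * np.sin(theta)] + 0.05 * rng.standard_normal((200, 2))

    labels = spectral_clustering(X, k=2, sigma=0.5)
    print(np.bincount(labels))                       # roughly 100 / 100, one ring per cluster
    ```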

    t-Distributed Stochastic Neighbor Embedding (t-SNE)

    Why not just PCA?

    Remember kernel PCA? The one that solves the problem of PCA being linear? Yeah, t-SNE solves the same thing.

    Prerequisites

    Gaussians

    The bell curve, duh. I am not explaining this further. Here’s the formula:

    p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

    The important thing to know here is that Gaussians have light tails.

    • Values away from the mean are exceptionally unlikely.

    t-distribution

    These have heavier tails than Gaussians – they are parameterized by the degrees of freedom \nu.

    • The PDF of the t-distribution decays polynomially, not exponentially.
    • As \nu \to \infty, the t-distribution converges to the Gaussian distribution.
    • The specific case used in t-SNE is the t-distribution with one degree of freedom (\nu=1), also known as the Cauchy distribution. Its simplified PDF has the form:

    p(y) \propto \frac{1}{1+y^2}

    Now, before moving on, I want to add a bit more context – the teacher is the Gaussian and the student is the t-distribution (the lower-dimensional one).

    This is a crucial design choice to solve the “crowding problem”. In a 2D map, there isn’t enough space to accurately represent all the neighborhood distances from a high-D space. The heavy tails of the t-distribution mean that two points can be placed far apart in the map and still have a non-trivial similarity score. This allows t-SNE to place moderately dissimilar points far apart, effectively creating more space for the local neighborhoods to be modeled without being “crowded” together.

    Kullback-Leibler (KL) Divergence

    It is a measure of how one probability distribution, Q, is different from a second, reference probability distribution, P. It is not a true distance metric (it is not symmetric), but it is often used to measure the “distance” or “divergence” between distributions.

    For discrete distributions over outcomes i:

    D_{KL}(P || Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}

    Again, context – this is our cost function in t-SNE.
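To make the asymmetry concrete, here's a tiny numpy sketch; the two example distributions are made up.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions p and q over the same outcomes."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                        # terms with P(i) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q))   # D_KL(P || Q)
print(kl_divergence(q, p))   # D_KL(Q || P): a different value, so not symmetric
```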

    Oh? You want more on it? I want to cover both KL divergence and Gaussians more. But, for now, I think I can suggest Wikipedia.

    Math

    Three stages:

    Stage 1: Modeling High-Dimensional Similarities

    The similarity between two data points, \mathbf{x}_i and \mathbf{x}_j, is converted into a conditional probability, p_{j|i}. This is interpreted as the probability that \mathbf{x}_i would choose \mathbf{x}_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian distribution centered at \mathbf{x}_i.

    p_{j|i} = \frac{\exp(-|\mathbf{x}_i - \mathbf{x}_j|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-|\mathbf{x}_i - \mathbf{x}_k|^2 / 2\sigma_i^2)}

    So, sorta like “j given i” but not exactly.

Ok, what is \sigma_i^2? The variance of the local Gaussian. Yes, but how do we get this? We use something called Perplexity.

    Perplexity

(Adding this as well – notice how the perplexity dictates the size of the neighbourhood, in a way. You'll see how in a moment.)

    Now, notice how the scale parameter \sigma_i must be locally adaptive. Each data point \mathbf{x}_i needs its own unique bandwidth \sigma_i that is appropriate for its local neighborhood density.

    Before everything else, focus on the following definition.

    For each data point \mathbf{x}_i, we have a conditional probability distribution P_i over all other points, where P_i = \{p_{j|i}\}_{j \neq i}. The Shannon entropy of this distribution, measured in bits, is:

    H(P_i) = - \sum_{j \neq i} p_{j|i} \log_2 p_{j|i}

    The entropy H(P_i) quantifies the uncertainty involved in choosing a neighbor for point \mathbf{x}_i according to the distribution P_i.

    • If the distribution is sharply peaked on one neighbor (low uncertainty), the entropy will be low.
    • If the distribution is spread out over many neighbors (high uncertainty), the entropy will be high.

    The perplexity of the distribution P_i is defined as the exponential of its Shannon entropy:

    \text{Perp}(P_i) = 2^{H(P_i)}

    (If you’ve done NLP, this would seem very familiar. That’s because it is.)

What does this do? Notice that if the probability distribution P_i is uniform over k neighbors (and zero for all others), its perplexity will be exactly k.

    Perplexity can be interpreted as a smooth, continuous measure of the effective number of neighbors for point \mathbf{x}_i.

    Ok, now, we don’t know H(P_i) – wouldn’t we need the variances to calculate it?

    Yes. So, here’s what we do:

    1. User Input: The user specifies a single, global perplexity value. This value is the same for all points and typically ranges from 5 to 50. Let’s say the user chooses a perplexity of 30.
    2. Per-Point Optimization: For each data point \mathbf{x}_i, the t-SNE algorithm performs an independent optimization. It searches for the specific value of the Gaussian variance, \sigma_i^2, that produces a conditional probability distribution P_i whose perplexity is exactly equal to the user’s desired value (in this case, 30).
    3. The Search Algorithm: This search is performed efficiently using binary search. The algorithm makes a guess for \sigma_i, computes the resulting perplexity, and then adjusts \sigma_i up or down until the target perplexity is met within a given tolerance.

    Neat, isn’t it?
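Here's a minimal numpy sketch of that per-point search. The bracket, tolerance, and iteration cap are arbitrary choices of mine; implementations often binary-search the precision 1/(2\sigma_i^2) instead, but the idea is identical.

```python
import numpy as np

def cond_probs(sq_dists_i, sigma):
    """p_{j|i} for a single point i, given squared distances to all *other* points."""
    logits = -sq_dists_i / (2 * sigma ** 2)
    logits -= logits.max()                     # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sigma_for_perplexity(sq_dists_i, target_perp=30.0, tol=1e-4, max_iter=64):
    """Binary-search sigma_i so that Perp(P_i) = 2^{H(P_i)} matches the target."""
    lo, hi = 1e-10, 1e4                        # arbitrary but generous bracket
    for _ in range(max_iter):
        sigma = (lo + hi) / 2
        p = cond_probs(sq_dists_i, sigma)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        perp = 2.0 ** entropy
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:                 # too many effective neighbors -> shrink sigma
            hi = sigma
        else:                                  # too few -> widen sigma
            lo = sigma
    return sigma, p

# Example: squared distances from one point to its 99 neighbours (made-up data).
rng = np.random.default_rng(0)
sigma_i, p_i = sigma_for_perplexity(rng.random(99) * 10, target_perp=30.0)
```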

    Last thing in this stage – Symmetrization.

    The conditional probabilities p_{j|i} are not symmetric. To create a single joint probability distribution P over all pairs of points, they are symmetrized:

    p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

    where N is the number of data points. This P represents the target similarity structure that we wish to preserve.

    Stage 2: Modeling Low-Dimensional Similarities (Q)

Now, we define a similar probability distribution Q for the points in the low-dimensional embedding space, \{\mathbf{y}_1, \dots, \mathbf{y}_N\}.

    Remember how I told you about the crowding problem? No? Go back.

    The t-distribution is heavy-tailed, meaning it decays much more slowly with distance than a Gaussian.

    q_{ij} = \frac{(1 + |\mathbf{y}_i - \mathbf{y}_j|^2)^{-1}}{\sum_{k \neq l} (1 + |\mathbf{y}_k - \mathbf{y}_l|^2)^{-1}}

This is the Student's t-distribution with one degree of freedom – the Cauchy form from earlier.

    Stage 3: Optimization via KL Divergence Minimization

    We minimize the KL divergence:

    C = \text{KL}(P||Q) = \sum_{i<j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

If you are a dum-dum like me: summing over i < j gives half of what i \neq j gives (both p_{ij} and q_{ij} are symmetric), so minimizing either version is equivalent.

This cost is non-convex: it has multiple local minima. So, we use iterative gradient descent and hope for the best.

    \frac{\partial C}{\partial \mathbf{y}_i} = 4 \sum_{j} (p_{ij} - q_{ij})(1 + |\mathbf{y}_i - \mathbf{y}_j|^2)^{-1}(\mathbf{y}_i - \mathbf{y}_j)

(Together with the random initialization of the \mathbf{y}_i, this makes t-SNE non-deterministic: different runs can produce different maps!)
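A minimal numpy sketch of Stages 2 and 3, assuming the symmetrized P from Stage 1 is already computed. The plain gradient step and learning rate are simplifications of mine; the actual algorithm also uses momentum and early exaggeration of P.

```python
import numpy as np

def q_and_kernel(Y):
    """Low-dimensional similarities q_{ij} under the t-distribution with nu = 1."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)           # (1 + |y_i - y_j|^2)^{-1}
    np.fill_diagonal(inv, 0.0)             # exclude i = j from the normalization
    return inv / inv.sum(), inv

def grad(P, Y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij)(1 + |y_i - y_j|^2)^{-1}(y_i - y_j)."""
    Q, inv = q_and_kernel(Y)
    W = (P - Q) * inv
    # sum_j W_ij (y_i - y_j) = (row-sum of W) * y_i - (W @ Y)_i
    return 4.0 * (np.diag(W.sum(axis=1)) @ Y - W @ Y)

def tsne_step(P, Y, lr=100.0):
    """One plain gradient-descent update of the 2D map coordinates."""
    return Y - lr * grad(P, Y)
```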

    Why not just kernel PCA?

    Kernel PCA: finds an optimal linear projection in a non-linearly transformed space.

    • Non-linear feature extraction and dimensionality reduction. It seeks to find a set of non-linear components that can be used for subsequent tasks (e.g., classification, regression).

t-SNE: finds a faithful representation of the local data topology.

    • Visualization and exploratory data analysis. It seeks to create a 2D or 3D map of the data that reveals its underlying local structure and clustering.

    Now, a very, very important point is that

    Kernel PCA deals with global structure, t-SNE with local.

    You use kernel PCA when you need non-linear feature extraction for a subsequent machine learning task.

    You use t-SNE for visualizing / exploring a high-dimensional feature space.

    Key takeaway: Do NOT use t-SNE for feature extraction.

    Sources

    1. Mathematics for Machine Learning (link)
    2. McGill lecture (link)

    1. What’s there?
      1. Why do I write this?
    2. Eigen decomposition
      1. Wait…
    3. Basics!
      1. Linear Transformation
      2. Eigenvectors and eigenvalues
      3. Important matrix types
    4. Eigendecomposition
      1. Basics
      2. When can you eigendecompose?
      3. Ok, why can you eigendecompose?
      4. Eigendecomposition is powerful!
      5. The Spectral Theorem for Symmetric Matrices
      6. Cool, ok, but where do I use it?
    5. Sources

    What’s there?

    There are basically 5 (ಐದು / पाँच) types of matrix decompositions that are like super important. I strive to go in depth into all of them. These are

    1. Eigendecomposition (This post!)
    2. QR decomposition
    3. Cholesky decomposition
    4. LU decomposition
    5. (perhaps the most important) Singular Value Decomposition

    Edit (July 20, 2025): Added proof of spectral theorem.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Eigen decomposition

    Wait…

    Basics!

    Linear Transformation

A linear transformation is a function \Phi: V \to W between two vector spaces V and W that preserves the operations of vector addition and scalar multiplication. For finite-dimensional vector spaces, any linear transformation \Phi: \mathbb{R}^n \to \mathbb{R}^m can be represented by an m \times n matrix \mathbf{A}. The action of the transformation on a vector \mathbf{x} \in \mathbb{R}^n is given by matrix-vector multiplication: \Phi(\mathbf{x}) = \mathbf{A}\mathbf{x}.

    Ok, got it? Basic linear algebra.

    Eigenvectors and eigenvalues

Let \mathbf{A} \in \mathbb{R}^{n \times n} be a square matrix. A nonzero vector \mathbf{x} \in \mathbb{R}^n \setminus \{\mathbf{0}\} is an eigenvector of \mathbf{A} if there exists a scalar \lambda \in \mathbb{C} such that \mathbf{A}\mathbf{x} = \lambda\mathbf{x}. The scalar \lambda is called the eigenvalue corresponding to the eigenvector \mathbf{x}.

    I will now add a bit of geometry. The equation states that the vector \mathbf{A}\mathbf{x} (the result of the transformation) is collinear (fancy term for along the same line) with the original vector \mathbf{x}. The eigenvalue \lambda is the scaling factor.

    • If \lambda > 1, the eigenvector is stretched.
    • If 0 < \lambda < 1, the eigenvector is compressed.
    • If \lambda < 0, the eigenvector is flipped and scaled.
    • If \lambda = 1, the eigenvector is unchanged.
    • If \lambda = 0, the eigenvector is mapped to the zero vector, meaning it lies in the null space of \mathbf{A}.

    Want a reminder on how to find these? The expression p_{\mathbf{A}}(\lambda) = \det(\mathbf{A} - \lambda\mathbf{I}) is a polynomial in \lambda of degree n, called the characteristic polynomial. The roots of this polynomial are the eigenvalues of \mathbf{A}. By the fundamental theorem of algebra, an n-degree polynomial has n roots (counting multiplicity) in the complex plane \mathbb{C}.

    Important matrix types

    Ok, this is purely LLM generated.

    • Diagonal Matrices (\mathbf{D}): Square matrices where all non-diagonal entries are zero (D_{ij} = 0 for i \neq j). They represent pure scaling along the coordinate axes. Their powers (\mathbf{D}^k) and inverses (\mathbf{D}^{-1}) are trivial to compute.
    • Triangular Matrices (\mathbf{L}, \mathbf{U}): Square matrices that are either lower triangular (\mathbf{L}, where L_{ij} = 0 for i < j) or upper triangular (\mathbf{U}, where U_{ij} = 0 for i > j). Systems of equations involving triangular matrices can be solved efficiently via forward or backward substitution.
    • Orthogonal Matrices (\mathbf{Q}): Square matrices whose columns (and rows) form an orthonormal basis. They satisfy the property \mathbf{Q}^\top\mathbf{Q} = \mathbf{Q}\mathbf{Q}^\top = \mathbf{I}, which implies \mathbf{Q}^{-1} = \mathbf{Q}^\top. Geometrically, orthogonal matrices represent pure rotations or reflections, which preserve lengths and angles of vectors: |\mathbf{Q}\mathbf{x}|_2 = |\mathbf{x}|_2.

    I wrote this (!!) –

    • Also, Diagonalizability. A square matrix \mathbf{A} \in \mathbb{R}^{n \times n} is said to be diagonalizable if it has n linearly independent eigenvectors. Such matrices are also known as non-defective matrices.

    Eigendecomposition

    Basics

Ok, we can start now. For eigendecompositions, we are super concerned with endomorphisms. This is when the domain and the codomain of the linear transformation are the same space. This is basically – \Phi: \mathbb{R}^n \to \mathbb{R}^n, represented by a square matrix \mathbf{A} \in \mathbb{R}^{n \times n}.

    Why do you need to know this? You don’t. I knew this and barfed the information on you.

    When can you eigendecompose?

    If a matrix \mathbf{A} \in \mathbb{R}^{n \times n} is diagonalizable, it can be factored as \mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}, where:

    • \mathbf{P} \in \mathbb{R}^{n \times n} is a matrix whose columns are the n linearly independent eigenvectors of \mathbf{A}, i.e., \mathbf{P} = [\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_n].
    • \mathbf{D} \in \mathbb{R}^{n \times n} is a diagonal matrix whose diagonal entries are the corresponding eigenvalues, D_{ii} = \lambda_i.
    • \mathbf{P}^{-1} is the inverse of the matrix \mathbf{P}. The existence of \mathbf{P}^{-1} is guaranteed by the linear independence of the eigenvectors.

    Ok, why can you eigendecompose?

    Let \mathbf{A} have n linearly independent eigenvectors \mathbf{p}_1, \dots, \mathbf{p}_n with corresponding eigenvalues \lambda_1, \dots, \lambda_n. We can write the n eigenvalue equations as:

    \mathbf{A}\mathbf{p}_i = \lambda_i\mathbf{p}_i, \quad i=1, \dots, n

    We can concatenate these equations into a single matrix equation. Let \mathbf{P} = [\mathbf{p}_1, \dots, \mathbf{p}_n]. Then:

    \mathbf{A}\mathbf{P} = \mathbf{A}[\mathbf{p}_1, \dots, \mathbf{p}_n] = [\mathbf{A}\mathbf{p}_1, \dots, \mathbf{A}\mathbf{p}_n] = [\lambda_1\mathbf{p}_1, \dots, \lambda_n\mathbf{p}_n]

    The right-hand side can be expressed as the product of \mathbf{P} and a diagonal matrix \mathbf{D}:

    [\lambda_1\mathbf{p}_1, \dots, \lambda_n\mathbf{p}_n] = [\mathbf{p}_1, \dots, \mathbf{p}_n] \begin{bmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_n \end{bmatrix} = \mathbf{P}\mathbf{D}

    Thus, we have the relationship \mathbf{A}\mathbf{P} = \mathbf{P}\mathbf{D}. Since the eigenvectors are linearly independent, the matrix \mathbf{P} is invertible. We can right-multiply by \mathbf{P}^{-1} to obtain the eigendecomposition:

    \mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}

Side Note: Notice how \mathbf{P} \in \mathbb{R}^{n \times n} need not have orthonormal columns; the eigenvectors only have to be linearly independent. You can normalize them if you want. Why doesn't it matter? Any rescaling of a column of \mathbf{P} is cancelled by the corresponding row of \mathbf{P}^{-1}, so the product \mathbf{P}\mathbf{D}\mathbf{P}^{-1} is unchanged. Trivial enough (math textbook, much?).

    Eigendecomposition is powerful!

    Eigendecomposition reveals fundamental properties of a matrix:

    • Determinant: The determinant of \mathbf{A} is the product of its eigenvalues.

    \det(\mathbf{A}) = \det(\mathbf{P}\mathbf{D}\mathbf{P}^{-1}) = \det(\mathbf{P})\det(\mathbf{D})\det(\mathbf{P}^{-1}) = \det(\mathbf{D}) = \prod_{i=1}^n \lambda_i

    • Trace: The trace of \mathbf{A} is the sum of its eigenvalues.

    \text{tr}(\mathbf{A}) = \text{tr}(\mathbf{P}\mathbf{D}\mathbf{P}^{-1}) = \text{tr}(\mathbf{D}\mathbf{P}^{-1}\mathbf{P}) = \text{tr}(\mathbf{D}) = \sum_{i=1}^n \lambda_i

    • Matrix Powers: Eigendecomposition simplifies the computation of matrix powers.

    \mathbf{A}^k = (\mathbf{P}\mathbf{D}\mathbf{P}^{-1})^k = \mathbf{P}\mathbf{D}^k\mathbf{P}^{-1}
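A quick numpy sanity check of these three facts; the 2 \times 2 matrix below is just an arbitrary diagonalizable example I picked.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])                  # eigenvalues 5 and 2

eigvals, P = np.linalg.eig(A)               # columns of P are the eigenvectors
D = np.diag(eigvals)
P_inv = np.linalg.inv(P)

print(np.allclose(A, P @ D @ P_inv))                           # A = P D P^{-1}
print(np.isclose(np.linalg.det(A), np.prod(eigvals)))          # det(A) = product of eigenvalues
print(np.isclose(np.trace(A), np.sum(eigvals)))                # tr(A) = sum of eigenvalues
print(np.allclose(np.linalg.matrix_power(A, 5),
                  P @ np.linalg.matrix_power(D, 5) @ P_inv))    # A^5 = P D^5 P^{-1}
```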

    The Spectral Theorem for Symmetric Matrices

    Ok, here’s a special case – what if the matrix is symmetric? (Why do we care about symmetric matrices? Remind yourself about covariance matrices.)

    Theorem (Spectral Theorem): For any real symmetric matrix \mathbf{A} \in \mathbb{R}^{n \times n}:

    1. All eigenvalues of \mathbf{A} are real numbers.
    2. There exists an orthonormal basis of \mathbb{R}^n consisting of the eigenvectors of \mathbf{A}. That is, the eigenvectors can be chosen to be mutually orthogonal and have unit norm.

    Proof 1: Eigenvalues are Real

    Let \lambda be an eigenvalue of \mathbf{A} with a corresponding eigenvector \mathbf{x}. These may be complex, so we start with the standard eigenvalue equation:

    \mathbf{A}\mathbf{x} = \lambda\mathbf{x}

    Now, take the conjugate transpose (denoted by a star, *) of both sides:

    (\mathbf{A}\mathbf{x})^* = (\lambda\mathbf{x})^*

    \mathbf{x}^*\mathbf{A}^* = \bar{\lambda}\mathbf{x}^*

    Since the matrix \mathbf{A} is real and symmetric, its conjugate transpose is itself (\mathbf{A}^* = \mathbf{A}^\top = \mathbf{A}). This simplifies the equation to:

    \mathbf{x}^{*}\mathbf{A} = \bar{\lambda}\mathbf{x}^{*}

    Next, we right-multiply this equation by the original eigenvector \mathbf{x}:

    \mathbf{x}^*\mathbf{A}\mathbf{x} = \bar{\lambda}\mathbf{x}^*\mathbf{x} \quad \cdots \text{(Eq. 1)}

    Now, let’s return to the original equation, \mathbf{A}\mathbf{x} = \lambda\mathbf{x}, and left-multiply it by \mathbf{x}^*:

    \mathbf{x}^*\mathbf{A}\mathbf{x} = \mathbf{x}^*(\lambda\mathbf{x}) = \lambda\mathbf{x}^*\mathbf{x} \quad \cdots \text{(Eq. 2)}

    By comparing Eq. 1 and Eq. 2, we can see that:

    \lambda\mathbf{x}^*\mathbf{x} = \bar{\lambda}\mathbf{x}^*\mathbf{x}

    (\lambda - \bar{\lambda})\mathbf{x}^*\mathbf{x} = 0

    The term \mathbf{x}^*\mathbf{x} is the squared magnitude of the vector \mathbf{x}. Since an eigenvector cannot be a zero vector, \mathbf{x}^*\mathbf{x} must be a positive real number. Therefore, we can safely divide by it, leaving:

    \lambda - \bar{\lambda} = 0 \quad \implies \quad \lambda = \bar{\lambda}

    A number is equal to its own complex conjugate if and only if it is a real number. This completes the proof that all eigenvalues of a real symmetric matrix are real.

    Proof 2: Eigenvectors of Distinct Eigenvalues are Orthogonal

    Let \lambda_1 and \lambda_2 be two distinct eigenvalues (\lambda_1 \neq \lambda_2) of \mathbf{A}, with corresponding eigenvectors \mathbf{x}_1 and \mathbf{x}_2. We have:

    1. \mathbf{A}\mathbf{x}_1 = \lambda_1\mathbf{x}_1
    2. \mathbf{A}\mathbf{x}_2 = \lambda_2\mathbf{x}_2

    Let’s start with the first equation and left-multiply by \mathbf{x}_2^\top:

    \mathbf{x}_2^\top\mathbf{A}\mathbf{x}_1 = \mathbf{x}_2^\top(\lambda_1\mathbf{x}_1) = \lambda_1\mathbf{x}_2^\top\mathbf{x}_1

    Now, let’s take the transpose of the entire expression. Since the result is a scalar, it is equal to its own transpose:

    (\mathbf{x}_2^\top\mathbf{A}\mathbf{x}_1)^\top = (\lambda_1\mathbf{x}_2^\top\mathbf{x}_1)^\top

    \mathbf{x}_1^\top\mathbf{A}^\top\mathbf{x}_2 = \lambda_1\mathbf{x}_1^\top\mathbf{x}_2

    Because \mathbf{A} is symmetric (\mathbf{A}^\top = \mathbf{A}), this becomes:

    \mathbf{x}_1^\top\mathbf{A}\mathbf{x}_2 = \lambda_1\mathbf{x}_1^\top\mathbf{x}_2 \quad \cdots \text{(Eq. 3)}

    Next, we return to the second original equation (\mathbf{A}\mathbf{x}_2 = \lambda_2\mathbf{x}_2) and left-multiply by \mathbf{x}_1^\top:

    \mathbf{x}_1^\top\mathbf{A}\mathbf{x}_2 = \mathbf{x}_1^\top(\lambda_2\mathbf{x}_2) = \lambda_2\mathbf{x}_1^\top\mathbf{x}_2 \quad \cdots \text{(Eq. 4)}

    By comparing Eq. 3 and Eq. 4, we get:

    \lambda_1\mathbf{x}_1^\top\mathbf{x}_2 = \lambda_2\mathbf{x}_1^\top\mathbf{x}_2

    (\lambda_1 - \lambda_2)\mathbf{x}_1^\top\mathbf{x}_2 = 0

    We started with the assumption that the eigenvalues are distinct, so \lambda_1 - \lambda_2 \neq 0. For the product to be zero, the other term must be zero:

    \mathbf{x}_1^\top\mathbf{x}_2 = 0

    This shows that the eigenvectors \mathbf{x}_1 and \mathbf{x}_2 are orthogonal.

Yeah, that's it. (Strictly, for repeated eigenvalues you also pick an orthonormal basis within each eigenspace – Gram–Schmidt does the job – to get the full orthonormal eigenbasis the theorem promises.)

    Cool, now think about eigendecomposition. \mathbf{P} is suddenly an orthogonal matrix. Thus, for any real symmetric matrix \mathbf{A}, its eigendecomposition is:

    \mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^\top

    This form has a powerful geometric interpretation: any transformation by a symmetric matrix can be understood as a sequence of three operations:

    • A rotation back to the original coordinate system (\mathbf{P}).
    • A rotation or reflection (\mathbf{P}^\top) to align the coordinate system with the eigenvectors.
    • A scaling along these new axes by the eigenvalues (multiplication by \mathbf{D}).

    “Huh?”, you ask. Yeah, the above order is wrong.

    When you apply the transformation \mathbf{A} to a vector \mathbf{x}, you calculate \mathbf{A}\mathbf{x} = \mathbf{P}\mathbf{D}\mathbf{P}^\top\mathbf{x}. The correct sequence of geometric operations is:

    1. Rotation/Reflection (\mathbf{P}^\top): Aligns the coordinate system with the eigenvectors. This is essentially a change of basis.
    2. Scaling (\mathbf{D}): Scales the vector along these new axes.
    3. Rotation Back (\mathbf{P}): Returns to the original coordinate system.
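A small numpy check of the symmetric case (again, the 2 \times 2 matrix is an arbitrary symmetric example I picked): np.linalg.eigh exploits symmetry and returns an orthogonal \mathbf{P} directly.

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # symmetric, e.g. a covariance-like matrix

eigvals, P = np.linalg.eigh(S)             # eigh: for symmetric/Hermitian matrices
D = np.diag(eigvals)

print(np.allclose(P.T @ P, np.eye(2)))     # P is orthogonal: P^T P = I
print(np.allclose(S, P @ D @ P.T))         # A = P D P^T, no inverse needed

# Applying S to a vector = rotate into the eigenbasis, scale, rotate back.
x = np.array([1.0, 0.0])
print(np.allclose(S @ x, P @ (D @ (P.T @ x))))
```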

    Cool, ok, but where do I use it?

    I point you back to the “Eigendecomposition is powerful!” subsection.

    Sources

    1. Mathematics for Machine Learning (link)

  • (Written earlier; GitHub permalink)

    Find week 1 at Trying to Teach Freshmen Robotics 1.

    1. Weeks 5 to 8
      1. Stuff that I sent over
      2. PyTorch
      3. Robotic Vision with OpenCV
2. Final Things

Weeks 5 to 8

Stuff that I sent over

1. PyTorch
2. Robotic Vision (with OpenCV)

    Evidently, progress in the stuff I could teach was slow owing to academics (ironic, isn’t it?). Regardless, they are still interested in RRC and a couple of them got part-time internships in a robotics startup (Yay!).

    PyTorch

Not just the basics: also dive into the critical aspects that will be essential for applying neural networks to robotics problems.

1. Getting Started with PyTorch:
2. Data Loaders and Datasets:
• PyTorch’s torch.utils.data.Dataset and DataLoader classes are designed to simplify data handling.
• Key Aspects:
  • Custom Datasets: How to structure and preprocess your robotics data for training.
  • Batching & Shuffling: Efficiently loading data during training to improve performance.
  • Parallel Data Loading: Leveraging multi-threading to accelerate your training pipeline.
• Resource: The Data Loading Tutorial (there is also a short sketch after this list)
3. Optimization Theory:
• Understanding the mathematics and intuition behind optimization is crucial.
• Focus Areas:
  • Gradient Descent and Variants: Learn why and how different optimizers work (SGD, Adam, etc.).
  • Tuning Hyperparameters: Experiment with learning rates, momentum, and other parameters through torch.optim.
• Resource: The PyTorch Optimizers Documentation is a solid starting point.
• Other stuff:
  • You have done quite a bit of this in the mathy bits earlier and will redo it in the mobile robotics course. Take a look at this again for the time being: Intro to Optimization
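For reference, here's a minimal sketch of the Dataset / DataLoader pattern from the "Data Loaders and Datasets" item above. The RobotSensorDataset class, tensor shapes, and labels are invented purely for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RobotSensorDataset(Dataset):
    """Toy in-memory dataset of (sensor reading, label) pairs."""
    def __init__(self, readings, labels):
        self.readings = torch.as_tensor(readings, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.readings[idx], self.labels[idx]

if __name__ == "__main__":
    # 100 fake 6-dimensional sensor readings with binary labels.
    data = RobotSensorDataset(torch.randn(100, 6), torch.randint(0, 2, (100,)))
    # Batching, shuffling, and parallel loading in one place.
    loader = DataLoader(data, batch_size=16, shuffle=True, num_workers=2)

    for x_batch, y_batch in loader:
        print(x_batch.shape, y_batch.shape)  # torch.Size([16, 6]) torch.Size([16])
        break                                # training step would go here
```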
    Robotic Vision with OpenCV
    1. Introduction to OpenCV:
2. Multi-view Geometry:
    • A super important thing in robotic vision is the ability to perceive depth and spatial relationships using multiple images.
    • Focus Areas:
      • Camera Calibration: Learn to determine intrinsic and extrinsic camera parameters to correct lens distortions.
      • Resource: The Camera Calibration Tutorial guides you through this process.
      • Epipolar Geometry & Stereo Vision: Understand how to estimate depth and recover 3D information from two or more camera views.
      • Resource: Multi-view Geometry is perhaps the best book available. There are a few copies in the library as well.
3. Robust Vision Pipelines:
    • Key Concepts:
  • Feature Detection: Experiment with detectors such as SIFT, SURF, or ORB to recognize and match key points in images (there's a short ORB sketch right after this list).
      • Real-time Processing: Build pipelines that can handle continuous video streams and process images on the fly.
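A minimal OpenCV sketch of the feature-detection idea above, using ORB keypoints with binary descriptors and brute-force Hamming matching; the image filenames are placeholders.

```python
import cv2

# Two views of the same scene, loaded in grayscale (placeholder paths).
img1 = cv2.imread("view_left.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_right.png", cv2.IMREAD_GRAYSCALE)

# ORB detects keypoints and computes binary descriptors.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with Hamming distance (the right metric for ORB's binary descriptors).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

print(f"{len(matches)} matches; best distance = {matches[0].distance}")
```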
    Final Things

Once you're done, start reading about SLAM pipelines.

  • (Written on 29th December, 2025; GitHub permalink)

    Essentially, I, along with a collaborator (Mitansh K), did the following –

1. Localized Implementation: Implemented pretty much everything that's offered by the Google File System, including
      1. Write
      2. Read
      3. Replication
      4. Record Append
      5. Re-Replication
      6. Stale Replica Handling
      7. Garbage Collection
      8. Operation Log
      9. Data Integrity

using a codebase of 7200 LOC in Go. Evidently, we missed out on snapshotting and directory management and have left them as to-dos.

2. Enhanced Consistency: Rather than the at-least-once record append operation offered by GFS, we implemented it to be exactly-once. This was done using a modified version of the 2PC protocol and by handling idempotency (do see our report!).
3. Comprehensive System Analysis: Validated system performance through extensive benchmarking, demonstrating near-linear scaling in throughput for reads, writes, and appends, while maintaining consistency and reliability.

The implementation can be found here and the report here.

    Any thoughts/improvements (especially on the GitHub issues) would be welcome.