Linear Algebra

The language of data, weights, and transformations. Every forward pass, every embedding, every optimization step is linear algebra under the hood.

Topics: vectors, dot product, matrices, eigenvalues, SVD, PCA

Topic overview

Treat vectors, dot products, matrices, eigenvectors, SVD, and PCA as the shared language behind features, embeddings, transformations, and compression.

Core concepts

Focus on vector shape, norms, projections, matrix multiplication, basis changes, rank, eigen decomposition, singular values, and geometric intuition.

Why it matters

Most ML systems are matrix operations at scale: model layers, embeddings, similarity search, PCA, and attention all depend on linear algebra.

Interview relevance

Interviewers expect you to explain dimensions, similarity, projections, PCA, and why matrix operations make model computation efficient.

1.1 Vectors

The atomic unit of data in ML

Why — motivation

Every piece of data in ML is a vector. A user is a vector of features. A word is a vector of 768 floats. An image is a vector of pixel values. A model's hidden state is a vector. Before you can understand attention, embeddings, similarity search, or neural networks, you need to know what a vector is and what you can do with it.

Without this, the phrase "cosine similarity between two embeddings" is gibberish. With it, it's immediately obvious what's happening.

Intuition — the mental model

A vector is a list of numbers that you can picture as an arrow: it has a direction and a length. A 2D vector [3, 4] is an arrow that goes 3 steps right and 4 steps up. A 768-dimensional vector is the same idea, just an arrow in a space you can't visualize.

In ML, each number in the vector is a feature, and the entire vector is a compressed description of something — a word, a user, a sentence, an image patch.

Explanation

What a vector is

A vector is an ordered list of numbers: v = [v₁, v₂, ..., vₙ]. The number of elements is the dimension. Vectors can be written as a column vector (n×1 — default in ML literature) or a row vector (1×n). This distinction matters when multiplying with matrices — shape errors are almost always a row/column confusion.

v = [1, 2, 3]        → row vector (1×3)

v = [[1],
     [2],
     [3]]            → column vector (3×1)
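The row/column distinction can be checked directly in code. This is a minimal sketch that assumes NumPy; the matrix W and the variable names are illustrative, not from the text.

```python
import numpy as np

v = np.array([1, 2, 3])      # 1-D array: shape (3,), neither row nor column
row = v.reshape(1, 3)        # row vector, shape (1, 3)
col = v.reshape(3, 1)        # column vector, shape (3, 1)

W = np.ones((2, 3))          # an illustrative 2x3 matrix

print((W @ col).shape)       # (2, 3) @ (3, 1) -> (2, 1)
print((row @ W.T).shape)     # (1, 3) @ (3, 2) -> (1, 2)

# W @ row would raise ValueError: (2, 3) @ (1, 3) has mismatched inner dimensions
```

Printing `.shape` before a multiply is usually the quickest way to spot this kind of mix-up.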

Vector magnitude (L2 norm)

The magnitude is how far the arrow reaches from the origin — square root of the sum of squared components:

||v|| = √(v₁² + v₂² + ... + vₙ²)

For v = [3, 4]: ||v|| = √(9 + 16) = 5. The L1 norm is the sum of absolute values: ||v||₁ = |3| + |4| = 7. Both appear as regularization penalties: Ridge penalizes the squared L2 norm of the weights, Lasso the L1 norm.

Unit vector & normalization

A unit vector has magnitude = 1. To normalize: û = v / ||v||. When comparing word embeddings you normalize first so similarity is purely about direction, not magnitude.
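The norms and the normalization step can be sketched as follows, assuming NumPy and reusing the [3, 4] example from above.

```python
import numpy as np

v = np.array([3.0, 4.0])

l2 = np.linalg.norm(v)          # sqrt(3^2 + 4^2) = 5.0
l1 = np.linalg.norm(v, ord=1)   # |3| + |4| = 7.0

# Normalization divides by the L2 norm; this breaks for the zero vector,
# so real code should guard against l2 == 0.
u = v / l2                      # unit vector [0.6, 0.8]

print(l2, l1, u, np.linalg.norm(u))   # the last value is 1.0 up to rounding
```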

How ML uses vectors

Feature vector: a user in a recommendation system — [age=0.3, purchase_freq=0.7, avg_spend=0.5, ...]
Word embedding (BERT): dense vector of 768 floats encoding the "meaning" of a word
Weight vector: every linear layer is a set of weight vectors, one per output neuron
Gradient vector: ∇L points in the direction of steepest loss increase

Interview Q & A

Q: What is a vector, and how is it used in machine learning?

A: A vector is an ordered list of numbers representing a point or direction in n-dimensional space. In ML, vectors are the fundamental way we represent data — a data sample is a feature vector, a word is an embedding vector, and neural network weights form vectors. Operations like dot products let us measure similarity, which is the basis of attention mechanisms and nearest-neighbor search.

1.2 Dot product

The engine of similarity, attention, and linear layers

Why — motivation

The dot product is the single most used operation in ML. Every linear layer in a neural network is a dot product. Every attention score in a Transformer is a dot product. Every cosine similarity computation involves a dot product. If you understand nothing else from linear algebra, understand this.

Intuition — the mental model

The dot product measures how much two vectors point in the same direction. Same direction — large positive. Perpendicular — zero. Opposite — negative.

Think of it as a vote: each dimension casts a vote, and the dot product is the weighted total agreement between the two vectors.

Explanation

Computing the dot product

Multiply corresponding elements and sum:

a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ

Example: a = [1, 2, 3], b = [4, 5, 6]

a · b = (1×4) + (2×5) + (3×6) = 4 + 10 + 18 = 32
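The same example in code, as a small sketch assuming NumPy. The hand-written version mirrors the formula; the `@` operator is what you would normally use.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

by_hand = sum(x * y for x, y in zip(a, b))  # multiply element-wise, then sum
with_numpy = a @ b                          # same as np.dot(a, b) for 1-D arrays

print(by_hand, with_numpy)                  # 32 32
```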

Geometric meaning

Second formula: a · b = ||a|| × ||b|| × cos(θ), where θ is the angle between the vectors. Large when aligned (θ→0°, cos→1), zero when perpendicular (θ=90°), negative when opposite (θ=180°).

Cosine similarity

Normalizes the dot product to remove magnitude effects:

cosine_similarity(a, b) = (a · b) / (||a|| × ||b||)

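Result is always in [−1, 1] (and undefined if either vector is all zeros). Used everywhere in NLP: comparing sentence embeddings, finding similar documents, semantic search. Vector databases such as Pinecone and Weaviate offer it, or a closely related metric such as the dot product, as the retrieval score.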
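A minimal sketch of the cosine similarity formula, assuming NumPy. The function name follows the formula above; the test vectors are made up to hit the three landmark cases (aligned, opposite, perpendicular).

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the L2 norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

print(cosine_similarity(a, 10 * a))   # same direction: 1.0 up to rounding, magnitude is ignored
print(cosine_similarity(a, -a))       # opposite direction: -1.0 up to rounding
print(cosine_similarity(e1, e2))      # perpendicular: 0.0
```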

Dot product in neural networks

Every neuron computes output = w · x + b — a dot product of the weight vector w with input x. A layer with 512 neurons computes 512 dot products simultaneously, organized as a matrix multiply.

In Transformer attention: scores = Q · Kᵀ, so every query vector is dotted with every key vector (the standard Transformer also divides the scores by √d_k before the softmax). High score = "this query should attend to this key."
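The three uses above (one neuron, a full layer, attention scores) can be sketched with random stand-in data. NumPy is assumed; the sizes 256, 512, 10 and 64 are illustrative, and the division by √d follows the standard scaled dot-product attention rather than anything stated in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

# One neuron: a single dot product plus a bias
x = rng.normal(size=256)            # input vector
w = rng.normal(size=256)            # weight vector of one neuron
b = 0.1
single_out = w @ x + b

# A layer of 512 neurons: 512 dot products as one matrix-vector multiply
W = rng.normal(size=(512, 256))     # one weight vector per row
bias = np.zeros(512)
layer_out = W @ x + bias            # shape (512,)

# Attention scores: every query dotted with every key
Q = rng.normal(size=(10, 64))
K = rng.normal(size=(10, 64))
scores = Q @ K.T / np.sqrt(64)      # shape (10, 10), scaled by sqrt(d)

print(layer_out.shape, scores.shape)   # (512,) (10, 10)
```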

Interview Q & A

Q: What does the dot product measure, and where does it appear in ML?

A: The dot product measures alignment — how much two vectors point in the same direction. Geometrically it equals ||a||·||b||·cos(θ). In ML: linear layers compute w·x+b per neuron, cosine similarity normalizes the dot product for semantic comparison, and Transformer attention scores are dot products between query and key vectors. It's arguably the most fundamental operation in deep learning.

1.3 Matrices

Datasets, weight tables, and linear transformations

Why — motivation

Your entire training dataset is a matrix. Every weight layer in a neural network is a matrix. Every batch of data you pass through a model gets multiplied by weight matrices. Understanding matrices — what they represent, how they transform data — is non-negotiable for understanding how models work.

Intuition — the mental model

A matrix is a function that transforms vectors. Multiply a vector by a matrix and you get a new vector — possibly in a different dimensional space, stretched, rotated, or projected. The weight matrix in a neural layer is a learned transformation: it reshapes input data into a representation more useful for the task.

Explanation

What a matrix is

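A 2D array of numbers with m rows and n columns is an m×n matrix. A dataset X of 1000 samples with 20 features each is a (1000, 20) matrix. A linear layer mapping 20 inputs to 512 outputs has a weight matrix W of shape (512, 20) when it is applied as W·x (the layout PyTorch's nn.Linear stores), or (20, 512) when it is applied to a batch as X @ W.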

A = [[1, 2, 3],
     [4, 5, 6]]      → 2×3 matrix (2 rows, 3 columns)

Transpose

Flips a matrix over its diagonal — rows become columns. An m×n matrix becomes n×m.

A  = [[1, 2],
      [3, 4]]

Aᵀ = [[1, 3],
      [2, 4]]

In attention, Q · Kᵀ requires transposing K so dimensions align. In PCA, you compute XᵀX on mean-centered data (divided by the number of samples) to get the covariance matrix.

Matrix as a transformation

Multiplying matrix A (m×n) by vector x (n×1) gives a new vector (m×1). Each row of A computes one output value as a dot product with x.

y = Ax
shape: (m×n) × (n×1) → (m×1)

A neural layer: output = W·input + bias. Stack multiple layers and each learns a progressively more abstract transformation.
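A small sketch of a matrix acting on a vector, assuming NumPy. It reuses the 2×3 matrix from earlier in this section; the input vector x is made up for illustration.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3): maps 3-D inputs to 2-D outputs
x = np.array([1, 0, -1])           # shape (3,)

y = A @ x                          # shape (2,)
row_by_row = np.array([A[0] @ x, A[1] @ x])   # each row of A dotted with x

print(A.T.shape)                   # (3, 2): transpose swaps rows and columns
print(y, row_by_row)               # [-2 -2] [-2 -2]
```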

Inverse & determinant

The inverse A⁻¹ (only for square matrices) satisfies A·A⁻¹ = I. If A transforms a vector, A⁻¹ undoes it. The closed-form linear regression solution is θ = (XᵀX)⁻¹Xᵀy — in practice you use solvers, not direct inversion.

The determinant measures how a matrix scales space. det = 0 means the matrix collapses space to a lower dimension — singular, not invertible, information is lost.
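The inverse, the determinant, and the "use a solver" advice can be sketched as below. NumPy is assumed; the matrices and the synthetic regression data (X, true_theta) are invented for illustration, and `np.linalg.lstsq` is one reasonable solver choice, not the only one.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))   # True: A times its inverse is the identity
print(np.linalg.det(A))                    # 6.0 up to rounding: areas are scaled by 6

S = np.array([[1.0, 2.0],
              [2.0, 4.0]])                 # second row = 2 * first row
det_S = np.linalg.det(S)                   # zero: S flattens the plane onto a line

# Linear regression: a least-squares solver avoids forming (X^T X)^-1 explicitly
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                         # noiseless targets, for illustration only
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)                               # close to true_theta
```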

Interview Q & A

Q: What is a matrix, and what does it mean geometrically?

A: A matrix is a rectangular array of numbers that represents a linear transformation — it maps vectors from one space to another, rotating, scaling, or projecting them. In ML, weight matrices are the learned transformations in each layer: they reshape input representations into more useful forms. The transpose swaps rows and columns, the inverse undoes a transformation, and a zero determinant means the transformation is irreversible — information is lost.

1.4 Matrix multiplication

The single most important operation in deep learning

Why — motivation

Matrix multiplication is what makes neural networks fast. A forward pass through a linear layer is a matrix multiply. Running inference on a batch of 64 samples is a matrix multiply. The entire efficiency of GPU-accelerated deep learning comes from GPUs being exceptionally good at matrix multiplications. Without understanding this, you can't reason about model efficiency, memory, or batch processing.

Intuition — the mental model

Matrix multiplication is doing many dot products at once. If A has m rows and B has n columns, A×B produces an m×n matrix where entry (i,j) is the dot product of row i of A with column j of B.

The critical rule: inner dimensions must match. A(m×k) × B(k×n) → C(m×n). Shape errors are the most common bug — check shapes first.

Explanation

How matrix multiplication works

Each element C[i][j] = dot(row_i of A, col_j of B)

A = [[1, 2],      B = [[5, 6],
     [3, 4]]           [7, 8]]

C[0][0] = 1×5 + 2×7 = 19
C[0][1] = 1×6 + 2×8 = 22
C[1][0] = 3×5 + 4×7 = 43
C[1][1] = 3×6 + 4×8 = 50

C = [[19, 22],
     [43, 50]]
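The worked example can be reproduced with an explicit loop over rows and columns. This sketch assumes NumPy and spells out the definition only to make it visible; in real code you would call `A @ B` directly.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

m, k = A.shape
k2, n = B.shape
assert k == k2, "inner dimensions must match"

C = np.zeros((m, n), dtype=A.dtype)
for i in range(m):
    for j in range(n):
        C[i, j] = A[i, :] @ B[:, j]   # dot(row i of A, column j of B)

print(C)                              # matches the hand computation above
print(np.array_equal(C, A @ B))       # True: same result as the built-in operator
```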

Shape rule — the most important thing to memorize

(m × k) · (k × n) → (m × n)
     ↑_____↑
inner dims must match; outer dims are the result shape

Linear layer: input (batch=32, features=256) × weights (256, 512) → output (32, 512)
Attention scores: Q(seq=10, d=64) × Kᵀ(d=64, seq=10) → scores(10, 10)
Matrix multiply is NOT commutative: A×B ≠ B×A in general
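The three points in the list above can be checked in a few lines. NumPy is assumed; zero-filled arrays are used because only the shapes matter here.

```python
import numpy as np

# Linear layer: (32, 256) @ (256, 512)
X = np.zeros((32, 256))
W = np.zeros((256, 512))
print((X @ W).shape)                    # (32, 512)

# Attention scores: (10, 64) @ (64, 10)
Q = np.zeros((10, 64))
K = np.zeros((10, 64))
print((Q @ K.T).shape)                  # (10, 10)

# Order matters: A @ B and B @ A generally differ
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.array_equal(A @ B, B @ A))     # False
```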

Batch matrix multiply in practice

In practice, you always process batches. If X is (32, 256) and W is (256, 512), then X @ W produces (32, 512) — all 32 forward passes computed simultaneously. This is why GPUs are effective: they parallelize thousands of dot products at once.

In PyTorch: torch.matmul(X, W) or the @ operator. Know the shapes before you call it.
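To see that one batched multiply really equals 32 separate forward passes, here is a sketch that uses NumPy rather than PyTorch (an assumption made to keep the example dependency-light; the `@` operator behaves the same way on `torch` tensors for this case).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 256))    # batch of 32 samples
W = rng.normal(size=(256, 512))   # weights laid out as (in_features, out_features)

batched = X @ W                              # one matrix multiply
looped = np.stack([x @ W for x in X])        # 32 separate vector-matrix products

print(batched.shape, np.allclose(batched, looped))   # (32, 512) True
```

The results match; the batched form simply hands the whole job to one optimized routine instead of a Python loop.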

Interview Q & A

Q: What is matrix multiplication and why is it central to deep learning?

A: Matrix multiplication combines two matrices by computing dot products between rows of the first and columns of the second. The shape rule is (m×k)·(k×n) → (m×n) — inner dimensions must match. It's central to deep learning because every linear layer computes output = W·x for a batch simultaneously in a single matrix multiply. GPUs are optimized for this operation. Transformers use it for attention scores (Q·Kᵀ), value aggregation, and every projection layer.

1.5 Eigenvalues & eigenvectors

Principal directions of a transformation — the heart of PCA

Why — motivation

PCA — the most common dimensionality reduction technique — is entirely based on eigenvalues and eigenvectors of the covariance matrix. Interviewers ask about PCA constantly, and the follow-up is always "how does it work mathematically?" They also appear in graph neural networks, spectral clustering, and stability analysis of training dynamics.

Intuition — the mental model

Most vectors change direction when multiplied by a matrix — they get rotated and stretched. But certain special vectors don't change direction — they only get stretched or shrunk. These are eigenvectors. The amount of stretching is the eigenvalue.

If a matrix represents "how your data varies," eigenvectors point in directions of maximum variance. The eigenvector with the largest eigenvalue points where your data spreads out the most. That's your first principal component in PCA.

Explanation

The definition

For a square matrix A, a non-zero vector v is an eigenvector if multiplying A by v gives back v scaled by a constant λ (lambda):

A · v = λ · v

v is the eigenvector. λ is the eigenvalue. A large λ means the matrix strongly stretches data in that direction. λ = 0 means the matrix collapses data in that direction — information is lost.

Concrete example

A = [[2, 0],
     [0, 3]]      v = [1, 0]

A · v = [2×1 + 0×0, 0×1 + 3×0] = [2, 0] = 2 × [1, 0]

→ v = [1, 0] is an eigenvector with eigenvalue λ = 2
→ v = [0, 1] is an eigenvector with eigenvalue λ = 3

Interpretation: this matrix stretches x by 2× and y by 3×. If A were a covariance matrix, the y-direction would be the one carrying more variance (3 versus 2).
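The same example can be verified numerically. This sketch assumes NumPy; note that `np.linalg.eig` returns the eigenvectors as the columns of its second result.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # eigenvectors are the COLUMNS

for lam, v in zip(eigenvalues, eigenvectors.T):
    # A @ v should equal lam * v for every eigenpair
    print(lam, v, np.allclose(A @ v, lam * v))
```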

How PCA uses eigenvalues

Step 1: Center the data (subtract each feature's mean), then compute the covariance matrix C = XᵀX / n (or n − 1 for the unbiased sample estimate). This captures how each pair of features varies together.

Step 2: Find eigenvectors of C. Each eigenvector is a direction in feature space. Its eigenvalue = variance of data in that direction.

Step 3: Sort eigenvectors by eigenvalue (largest first). Take the top-k — these are your principal components.

Step 4: Project data onto these k eigenvectors.

X_reduced = X · V_k (V_k = top-k eigenvectors as columns)
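The four steps can be sketched as one short function. NumPy is assumed, and `pca_eig` is a hypothetical helper name. The sketch centers X first and divides by n − 1 (the sample covariance); dividing by n instead only rescales the eigenvalues and leaves the directions unchanged. The synthetic data is made up so that one axis clearly dominates.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix (illustrative helper)."""
    X_centered = X - X.mean(axis=0)                 # Step 1: center each feature
    C = X_centered.T @ X_centered / (len(X) - 1)    # covariance matrix, shape (d, d)
    eigenvalues, eigenvectors = np.linalg.eigh(C)   # Step 2: eigh suits symmetric matrices
    order = np.argsort(eigenvalues)[::-1]           # Step 3: largest variance first
    V_k = eigenvectors[:, order[:k]]                # top-k eigenvectors as columns
    return X_centered @ V_k, eigenvalues[order]     # Step 4: project

rng = np.random.default_rng(0)
# Synthetic data: far more spread along the first axis than the others
X = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.2])

X_reduced, variances = pca_eig(X, k=2)
print(X_reduced.shape)    # (500, 2)
print(variances)          # variance along each principal direction, descending
```

The variance of each projected column equals the corresponding eigenvalue, which is a handy sanity check.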

Key properties

Real symmetric matrices (like covariance matrices) always have real eigenvalues, and their eigenvectors can be chosen to be orthogonal
Sum of eigenvalues = trace of the matrix (sum of diagonal elements)
Product of eigenvalues = determinant of the matrix
Zero eigenvalue → matrix is singular (not invertible)
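These properties are easy to confirm on a random symmetric matrix. The sketch assumes NumPy; building S as M·Mᵀ is just a convenient way to get a symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
S = M @ M.T                                     # symmetric by construction

eigenvalues, eigenvectors = np.linalg.eigh(S)   # real eigenvalues for symmetric input

sum_is_trace = np.isclose(eigenvalues.sum(), np.trace(S))
prod_is_det = np.isclose(eigenvalues.prod(), np.linalg.det(S))
orthonormal = np.allclose(eigenvectors.T @ eigenvectors, np.eye(4))

print(sum_is_trace, prod_is_det, orthonormal)
```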

Interview Q & A

Q: Explain how PCA works using eigenvalues and eigenvectors.

A: PCA finds directions of maximum variance. It mean-centers the data, computes the covariance matrix C = XᵀX/n, then finds its eigenvectors and eigenvalues via Cv = λv. Each eigenvector points in a principal direction; its eigenvalue tells you how much variance lies there. Sorting by eigenvalue and keeping the top-k gives the most informative directions. Projecting data onto these reduces dimensionality while preserving maximum variance. The eigenvectors are orthogonal, so principal components are uncorrelated.

1.6 SVD — Singular Value Decomposition

Generalized decomposition — compression, PCA, and LoRA

Why — motivation

SVD is how many modern systems compress information. Image compression, recommendation systems, and, critically for LLMs, LoRA fine-tuning all use SVD or its low-rank logic. It generalizes eigendecomposition to non-square matrices, which covers most ML weight matrices. Interviewers for ML engineering roles, especially on teams working on LLMs, are likely to ask about it.

Intuition — the mental model

SVD says: any matrix can be decomposed into three simple operations: rotate, stretch, rotate again. The singular values in the middle tell you which directions carry the most information. Keep only the top-k singular values and set the rest to zero, and you get the best possible rank-k approximation: the compressed version that loses the least information.

Explanation

The decomposition

Any matrix A of shape (m×n) factors as:

A = U · Σ · Vᵀ

U:  (m×m), left singular vectors (orthogonal, directions in output space)
Σ:  (m×n), diagonal matrix of singular values σ₁ ≥ σ₂ ≥ ... ≥ 0
Vᵀ: (n×n), right singular vectors transposed (directions in input space)

When you compute y = Av, three things happen: Vᵀ rotates input, Σ stretches along each axis, U rotates into output space.
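The factor shapes and the reconstruction can be checked on a small random matrix. This sketch assumes NumPy; `np.linalg.svd` returns the singular values as a 1-D array, so Σ has to be rebuilt as an m×n matrix by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))                        # m = 5, n = 3

U, s, Vt = np.linalg.svd(A, full_matrices=True)    # s: singular values, largest first
print(U.shape, s.shape, Vt.shape)                  # (5, 5) (3,) (3, 3)

Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)                         # embed the singular values in an m x n matrix

print(np.allclose(A, U @ Sigma @ Vt))              # True: the three factors rebuild A
```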

Truncated SVD — low-rank approximation

Keep only the top-k singular values to get the best rank-k approximation:

A_k = U_k · Σ_k · V_kᵀ
where U_k is (m×k), Σ_k is (k×k), V_kᵀ is (k×n)

You've compressed the original m×n matrix (m·n numbers) into m·k + k + k·n numbers. The Eckart-Young theorem guarantees this is the best possible rank-k approximation: no other rank-k matrix is closer in Frobenius norm.
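A sketch of truncation and the storage count, assuming NumPy. `truncated_svd` is a hypothetical helper name, and the test matrix is synthetic: it is built to be close to rank 10 so that a rank-10 approximation loses very little.

```python
import numpy as np

def truncated_svd(A, k):
    """Factors of the best rank-k approximation of A (illustrative helper)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
m, n, k = 100, 80, 10
# Synthetic matrix: rank-10 signal plus a little noise
A = rng.normal(size=(m, k)) @ rng.normal(size=(k, n)) + 0.01 * rng.normal(size=(m, n))

U_k, s_k, Vt_k = truncated_svd(A, k)
A_k = (U_k * s_k) @ Vt_k                   # same as U_k @ diag(s_k) @ Vt_k

stored = m * k + k + k * n                 # 1810 numbers instead of m * n = 8000
rel_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)   # Frobenius norm by default

print(stored, m * n, rel_error)
```

The Frobenius error of the truncation equals the square root of the sum of the squared singular values that were dropped, which is what the Eckart-Young result predicts.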

SVD vs eigendecomposition

Eigendecomposition only works on square matrices. SVD works on any matrix
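For symmetric matrices: singular values = absolute values of the eigenvalues, and the singular vectors are the eigenvectors up to sign (U = V exactly when the matrix is positive semidefinite)
PCA on data matrix X = truncated SVD of the mean-centered X (equivalent to eigendecomposition of the covariance matrix)
SVD is numerically more stable because it avoids forming XᵀX explicitly, which is why most practical PCA implementations use it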
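The PCA equivalence in the list above can be checked numerically: the eigenvalues of the covariance matrix equal the squared singular values of the centered data divided by n − 1. NumPy is assumed and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)                    # PCA works on centered data
n = len(Xc)

# Route 1: eigendecomposition of the covariance matrix
eigenvalues = np.linalg.eigh(Xc.T @ Xc / (n - 1))[0][::-1]   # reversed to descending

# Route 2: SVD of the centered data matrix; rows of Vt are the principal directions
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(eigenvalues, s ** 2 / (n - 1)))   # True: variance = sigma^2 / (n - 1)
```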

Where it shows up in ML

PCA: sklearn's PCA uses SVD internally, not eigendecomposition
Recommendation systems: matrix factorization — user-item rating matrix decomposed into low-rank factors
LoRA fine-tuning: represents weight updates as A·B (rank-r approximation). Instead of updating full W (d×d), you update A (d×r) and B (r×d) where r ≪ d
Image compression: SVD of pixel matrix, keep top-k singular values
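The LoRA entry is mostly an argument about parameter counts, which a few lines make concrete. This is a sketch under assumptions: NumPy stands in for a training framework, d = 4096 and r = 8 are example sizes, d_demo is a small stand-in so the arrays stay tiny, and the random A and B are placeholders for factors that would be learned by gradient descent.

```python
import numpy as np

# Parameter counts for a d x d weight and a rank-r update
d, r = 4096, 8
full_params = d * d              # entries in a full update of W
lora_params = d * r + r * d      # entries in A (d x r) and B (r x d) together
print(full_params, lora_params, full_params // lora_params)   # 16777216 65536 256

# Small demo of the shapes involved
rng = np.random.default_rng(0)
d_demo = 64
W = rng.normal(size=(d_demo, d_demo))    # frozen pretrained weight (random stand-in)
A = rng.normal(size=(d_demo, r))         # trainable factor, shape (d, r)
B = rng.normal(size=(r, d_demo))         # trainable factor, shape (r, d)

delta_W = A @ B                          # full-size update built from the two factors
W_adapted = W + delta_W                  # same shape as W

print(delta_W.shape, np.linalg.matrix_rank(delta_W))   # the rank is at most r
```

In actual LoRA training one of the two factors starts at zero so that the update begins as a no-op; random values are used here only so the rank is visible.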

Interview Q & A

Q: What is SVD and how does LoRA use it?

A: SVD decomposes any matrix A into U·Σ·Vᵀ: two orthogonal rotation matrices and a diagonal of singular values encoding how much information each direction carries. Truncated SVD keeps only the top-k singular values for the best rank-k approximation. LoRA applies the same low-rank logic to LLM fine-tuning without actually computing an SVD: instead of updating the full weight matrix W (e.g. 4096×4096), it represents the weight update ΔW as a product of two smaller matrices A (4096×r) and B (r×4096), where r is small like 8 or 16, and learns those two factors directly. This assumes the update doesn't need full rank, which empirically holds for fine-tuning tasks, and reduces trainable parameters dramatically.