The language of data, weights, and transformations. Every forward pass, every embedding, every optimization step is linear algebra under the hood.
The atomic unit of data in ML
Every piece of data in ML is a vector. A user is a vector of features. A word is a vector of 768 floats. An image is a vector of pixel values. A model's hidden state is a vector. Before you can understand attention, embeddings, similarity search, or neural networks, you need to know what a vector is and what you can do with it.
Without this, the phrase "cosine similarity between two embeddings" is gibberish. With it, it's immediately obvious what's happening.
A vector is a list of numbers that you can picture as an arrow with a length and a direction. A 2D vector [3, 4] is an arrow that goes 3 steps right and 4 steps up. A 768-dimensional vector is the same idea: an arrow in a space you can't visualize.
In ML, each number in the vector is a feature, and the entire vector is a compressed description of something — a word, a user, a sentence, an image patch.
A vector is an ordered list of numbers: v = [v₁, v₂, ..., vₙ]. The number of elements is the dimension. Vectors can be written as a column vector (n×1, the default in ML literature) or a row vector (1×n). This distinction matters when multiplying with matrices: many shape errors come down to a row/column confusion.
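The magnitude (the L2 norm) is how far the arrow reaches from the origin, computed as the square root of the sum of squared components:
||v|| = √(v₁² + v₂² + ... + vₙ²)
For v = [3, 4]: ||v|| = √(9 + 16) = 5. The L1 norm is the sum of absolute values: ||v||₁ = |v₁| + |v₂| + ... + |vₙ|. Both appear as regularization penalties: Ridge penalizes the squared L2 norm of the weights, Lasso penalizes their L1 norm.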
A unit vector has magnitude = 1. To normalize: û = v / ||v||. When comparing word embeddings you normalize first so similarity is purely about direction, not magnitude.
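The norm and normalization formulas above can be checked in a few lines. This is a minimal sketch using NumPy (the choice of library is an assumption; the text does not prescribe one), applied to the same v = [3, 4] example.

```python
import numpy as np

v = np.array([3.0, 4.0])

# L2 norm: square root of the sum of squared components.
l2 = np.linalg.norm(v)

# L1 norm: sum of absolute values.
l1 = np.linalg.norm(v, ord=1)

# Normalizing divides by the L2 norm, leaving a unit vector
# that keeps the direction but has magnitude 1.
u = v / l2

print(l2, l1, u, np.linalg.norm(u))
```

After normalization only the direction of v remains, which is what makes normalized embeddings comparable regardless of their original length.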
Example user vector: [age=0.3, purchase_freq=0.7, avg_spend=0.5, ...]
The engine of similarity, attention, and linear layers
The dot product is the single most used operation in ML. Every linear layer in a neural network is a dot product. Every attention score in a Transformer is a dot product. Every cosine similarity computation involves a dot product. If you understand nothing else from linear algebra, understand this.
The dot product measures how much two vectors point in the same direction. Same direction — large positive. Perpendicular — zero. Opposite — negative.
Think of it as a vote: each dimension casts a vote, and the dot product is the weighted total agreement between the two vectors.
Multiply corresponding elements and sum:
a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ
Example: a = [1, 2, 3], b = [4, 5, 6] gives a · b = 1×4 + 2×5 + 3×6 = 4 + 10 + 18 = 32.
Second formula: a · b = ||a|| × ||b|| × cos(θ), where θ is the angle between the vectors. Large when aligned (θ→0°, cos→1), zero when perpendicular (θ=90°), negative when opposite (θ=180°).
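Both dot product formulas can be tried on the vectors a = [1, 2, 3] and b = [4, 5, 6] from the example. The sketch below uses NumPy (an assumption) and also shows the perpendicular and opposite cases with small hand-picked vectors.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# First formula: multiply corresponding elements, then sum.
elementwise = float(np.sum(a * b))

# np.dot computes the same quantity directly.
builtin = float(np.dot(a, b))

# Second formula rearranged: cos(theta) = (a . b) / (||a|| * ||b||).
cos_theta = builtin / (np.linalg.norm(a) * np.linalg.norm(b))
theta_deg = float(np.degrees(np.arccos(cos_theta)))

# Perpendicular vectors give zero, opposite vectors give a negative value.
perpendicular = float(np.dot([1.0, 0.0], [0.0, 1.0]))
opposite = float(np.dot([1.0, 0.0], [-1.0, 0.0]))

print(elementwise, builtin, cos_theta, theta_deg, perpendicular, opposite)
```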
Cosine similarity normalizes the dot product to remove magnitude effects:
cos(θ) = (a · b) / (||a|| × ||b||)
Result is always in [−1, 1]. Used everywhere in NLP: comparing sentence embeddings, finding similar documents, semantic search. Vector databases (Pinecone, Weaviate) support it as a retrieval metric.
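A small helper makes the [−1, 1] range concrete. The function name cosine_similarity is a hypothetical one chosen for this sketch, and NumPy is assumed; the function is undefined for zero vectors, which this sketch does not handle.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the two L2 norms.

    Hypothetical helper for illustration; not defined for zero vectors.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A scaled copy points the same way, so magnitude has no effect.
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
perpendicular = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 2], [-1, -2])

print(same_direction, perpendicular, opposite)
```

If the vectors are already normalized to unit length, the denominator is 1 and cosine similarity reduces to a plain dot product.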
Every neuron computes output = w · x + b — a dot product of the weight vector w with input x. A layer with 512 neurons computes 512 dot products simultaneously, organized as a matrix multiply.
In Transformer attention: score = Q · Kᵀ — every query vector dot-producted with every key vector. High score = "this query should attend to this key."
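The attention score computation can be sketched with random data. The shapes (3 queries, 5 keys, dimension 4) are arbitrary assumptions for illustration, and NumPy stands in for a deep learning framework. The full Transformer formulation also divides the scores by √d_k and applies a softmax; both are left out here to keep the focus on the dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 4                            # illustrative vector dimension

Q = rng.normal(size=(3, d_k))      # 3 query vectors, one per row
K = rng.normal(size=(5, d_k))      # 5 key vectors, one per row

# Transposing K aligns the inner dimensions: (3, d_k) @ (d_k, 5) -> (3, 5).
scores = Q @ K.T

# Entry (i, j) is the dot product of query i with key j.
i, j = 1, 2
single = float(np.dot(Q[i], K[j]))

print(scores.shape, scores[i, j], single)
```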
Datasets, weight tables, and linear transformations
Your entire training dataset is a matrix. Every weight layer in a neural network is a matrix. Every batch of data you pass through a model gets multiplied by weight matrices. Understanding matrices — what they represent, how they transform data — is non-negotiable for understanding how models work.
A matrix is a function that transforms vectors. Multiply a vector by a matrix and you get a new vector — possibly in a different dimensional space, stretched, rotated, or projected. The weight matrix in a neural layer is a learned transformation: it reshapes input data into a representation more useful for the task.
A matrix is a 2D array of numbers with m rows and n columns, written m×n. A dataset X of 1000 samples with 20 features each is a (1000, 20) matrix. A linear layer mapping 20 inputs to 512 outputs has a weight matrix W of shape (512, 20): outputs × inputs, so that W·input works on a column-vector input.
The transpose Aᵀ flips a matrix over its diagonal: rows become columns. An m×n matrix becomes n×m.
In attention, Q · Kᵀ requires transposing K so dimensions align. In PCA, you compute XᵀX on mean-centered data (divided by the number of samples) to get the covariance matrix.
Multiplying matrix A (m×n) by vector x (n×1) gives a new vector (m×1). Each row of A computes one output value as a dot product with x.
A neural layer: output = W·input + bias. Stack multiple layers and each learns a progressively more abstract transformation.
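The layer formula output = W·input + bias can be sketched with the (512, 20) weight shape mentioned earlier. The weights here are random placeholders rather than learned values, and NumPy is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(512, 20))   # one row of weights per neuron
b = rng.normal(size=512)         # one bias per neuron
x = rng.normal(size=20)          # a single input with 20 features

# Matrix-vector product: each row of W is dotted with x.
output = W @ x + b               # shape (512,)

# The same value for one neuron, computed as an explicit dot product.
neuron_7 = float(np.dot(W[7], x) + b[7])

print(output.shape, output[7], neuron_7)
```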
The inverse A⁻¹ (only for square matrices) satisfies A·A⁻¹ = I. If A transforms a vector, A⁻¹ undoes it. The closed-form linear regression solution is θ = (XᵀX)⁻¹Xᵀy — in practice you use solvers, not direct inversion.
The determinant measures how a matrix scales space. det = 0 means the matrix collapses space to a lower dimension — singular, not invertible, information is lost.
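The inverse, the determinant, and the "use solvers, not direct inversion" advice can be illustrated together. The matrices and the synthetic regression data below are made-up examples, and NumPy is assumed; np.linalg.solve and np.linalg.lstsq are two of several possible solvers.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
A_inv = np.linalg.inv(A)
identity_check = A @ A_inv        # close to the 2x2 identity
det_A = np.linalg.det(A)          # 2*3 - 1*1 = 5

# The second row is twice the first, so this matrix is singular.
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
det_S = np.linalg.det(S)          # zero up to rounding

# Linear regression on noiseless synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta

# Solve (X^T X) theta = X^T y without forming the inverse explicitly.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# A least-squares solver works on X directly.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(det_A, det_S, theta_normal, theta_lstsq)
```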
The single most important operation in deep learning
Matrix multiplication is what makes neural networks fast. A forward pass through a linear layer is a matrix multiply. Running inference on a batch of 64 samples is a matrix multiply. The entire efficiency of GPU-accelerated deep learning comes from GPUs being exceptionally good at matrix multiplications. Without understanding this, you can't reason about model efficiency, memory, or batch processing.
Matrix multiplication is doing many dot products at once. If A has m rows and B has n columns, A×B produces an m×n matrix where entry (i,j) is the dot product of row i of A with column j of B.
The critical rule: inner dimensions must match. A(m×k) × B(k×n) → C(m×n). Shape errors are the most common bug — check shapes first.
Each element C[i][j] = dot(row_i of A, col_j of B)
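The element-wise definition can be written out as explicit loops and compared against the built-in operator. The function name matmul_by_dot_products is hypothetical, and the loop version is for understanding only; it is far slower than the built-in multiply.

```python
import numpy as np

def matmul_by_dot_products(A, B):
    """Compute C = A x B one entry at a time (illustrative, not efficient)."""
    m, k = A.shape
    k2, n = B.shape
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            # Entry (i, j): row i of A dotted with column j of B.
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # shape (2, 3)
B = np.array([[7.0, 8.0],
              [9.0, 10.0],
              [11.0, 12.0]])         # shape (3, 2)

C = matmul_by_dot_products(A, B)     # shape (2, 2)
print(C)
print(np.allclose(C, A @ B))
```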
In practice, you always process batches. If X is (32, 256) and W is (256, 512), then X @ W produces (32, 512) — all 32 forward passes computed simultaneously. This is why GPUs are effective: they parallelize thousands of dot products at once.
In PyTorch: torch.matmul(X, W) or the @ operator. Know the shapes before you call it.
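The batch shapes from the text can be checked directly. This sketch uses NumPy rather than PyTorch (an assumption made so it runs without a deep learning framework); the @ operator follows the same shape rule for these 2D arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(32, 256))    # batch of 32 samples, 256 features each
W = rng.normal(size=(256, 512))   # maps 256 inputs to 512 outputs

out = X @ W                       # (32, 256) @ (256, 512) -> (32, 512)

# Row 0 of the result is the forward pass for sample 0 alone.
first_sample = X[0] @ W

# Swapping the operands breaks the inner-dimension rule (512 vs 32).
shape_error = None
try:
    W @ X
except ValueError as err:
    shape_error = str(err)

print(out.shape, first_sample.shape, shape_error is not None)
```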
Principal directions of a transformation — the heart of PCA
PCA — the most common dimensionality reduction technique — is entirely based on eigenvalues and eigenvectors of the covariance matrix. Interviewers ask about PCA constantly, and the follow-up is always "how does it work mathematically?" They also appear in graph neural networks, spectral clustering, and stability analysis of training dynamics.
Most vectors change direction when multiplied by a matrix — they get rotated and stretched. But certain special vectors don't change direction — they only get stretched or shrunk. These are eigenvectors. The amount of stretching is the eigenvalue.
If a matrix represents "how your data varies," eigenvectors point in directions of maximum variance. The eigenvector with the largest eigenvalue points where your data spreads out the most. That's your first principal component in PCA.
For a square matrix A, a non-zero vector v is an eigenvector if multiplying A by v gives back v scaled by a constant λ (lambda):
Av = λv
v is the eigenvector. λ is the eigenvalue. A large λ means the matrix strongly stretches data in that direction. λ = 0 means the matrix collapses data in that direction — information is lost.
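Interpretation: for a diagonal matrix such as A = [[2, 0], [0, 3]], the eigenvectors are the x- and y-axes, with eigenvalues 2 and 3. The matrix stretches x by 2× and y by 3×. Read as a covariance matrix, the y-direction has more variance.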
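The eigenvector definition can be verified numerically. The sketch assumes the matrix under discussion is the diagonal matrix with entries 2 and 3 (an assumption consistent with the interpretation above, since the original example matrix is not shown), and uses NumPy.

```python
import numpy as np

# Assumed example: stretches x by 2 and y by 3.
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# Columns of `eigenvectors` pair up with entries of `eigenvalues`.
eigenvalues, eigenvectors = np.linalg.eig(A)

# Check the defining property A v = lambda v for each pair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)

# A vector that is not an eigenvector changes direction.
w = np.array([1.0, 1.0])
Aw = A @ w                                  # [2, 3], not a multiple of w
parallel_test = w[0] * Aw[1] - w[1] * Aw[0]  # zero only if Aw is parallel to w

print(eigenvalues, Aw, parallel_test)
```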
Step 1: Center X by subtracting each feature's mean, then compute the covariance matrix C = XᵀX / n (many libraries divide by n − 1 instead). This captures how each pair of features varies together.
Step 2: Find eigenvectors of C. Each eigenvector is a direction in feature space. Its eigenvalue = variance of data in that direction.
Step 3: Sort eigenvectors by eigenvalue (largest first). Take the top-k — these are your principal components.
Step 4: Project data onto these k eigenvectors.
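The four steps above can be sketched end to end on synthetic data. The dataset, the per-feature scales, and k = 2 are assumptions made for illustration; np.linalg.eigh is used because a covariance matrix is symmetric, and library PCA implementations may differ in details such as the n versus n − 1 divisor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features with very different spreads.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

# Step 1: center the data and form the covariance matrix.
Xc = X - X.mean(axis=0)
n = Xc.shape[0]
C = Xc.T @ Xc / n

# Step 2: eigen-decompose C (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(C)

# Step 3: sort largest first and keep the top-k eigenvectors.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
components = eigvecs[:, :k]          # shape (3, k), one component per column

# Step 4: project the centered data onto the principal components.
Z = Xc @ components                  # shape (200, k)

print(eigvals)
print(Z.shape, Z.var(axis=0))
```

The variance of each projected column matches the corresponding eigenvalue, which is the sense in which an eigenvalue is the variance of the data in that direction.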
Generalized decomposition — compression, PCA, and LoRA
SVD is how modern systems compress information. Image compression, recommendation systems, and — critically for LLMs — LoRA fine-tuning all use SVD or its logic. It generalizes eigendecomposition to non-square matrices, which covers most ML weight matrices. Interviewers at ML engineering roles will ask about this, especially those working on LLMs.
SVD says: any matrix can be decomposed into three simple operations: rotate, stretch, rotate again. The singular values in the middle tell you which directions carry the most information. Keep only the top-k singular values and set the rest to zero, and you get the best possible rank-k approximation: the compressed version that loses the least information.
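Any matrix A of shape (m×n) factors as:
A = U Σ Vᵀ
U is m×m with orthonormal columns, Σ is an m×n diagonal matrix of non-negative singular values sorted from largest to smallest, and V is n×n with orthonormal columns.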
When you compute y = Av, three things happen: Vᵀ rotates input, Σ stretches along each axis, U rotates into output space.
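The factorization and the three-step reading of y = Av can be checked on a small random matrix. The 4×3 shape is an arbitrary assumption; np.linalg.svd returns the singular values as a 1D array and Vᵀ directly, and full_matrices=False requests the reduced form.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))

# Reduced SVD: U is (4, 3), s holds 3 singular values, Vt is (3, 3).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together recovers A.
A_rebuilt = U @ np.diag(s) @ Vt

# Apply A to a vector one step at a time.
v = rng.normal(size=3)
rotated_in = Vt @ v            # Vt rotates the input
stretched = s * rotated_in     # Sigma stretches along each axis
y = U @ stretched              # U rotates into the output space

print(s)
print(np.allclose(A_rebuilt, A), np.allclose(y, A @ v))
```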
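Keep only the top-k singular values to get the best rank-k approximation:
Aₖ = Uₖ Σₖ Vₖᵀ
Here Uₖ is the first k columns of U (m×k), Σₖ is the top-left k×k block of Σ, and Vₖᵀ is the first k rows of Vᵀ (k×n).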
You've compressed the original m×n matrix into m·k + k + k·n numbers. The Eckart-Young theorem proves this is the best possible rank-k approximation — no other rank-k matrix is closer in Frobenius norm.
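Truncation and the storage count can be sketched as follows. The 50×40 random matrix and k = 5 are assumptions for illustration. The check at the end relies on a standard consequence of the Eckart-Young result: the Frobenius error of the rank-k truncation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 50, 40, 5
A = rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the first k columns of U, k singular values, and k rows of Vt.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius-norm error of the approximation.
error = np.linalg.norm(A - A_k, ord="fro")

# The same error predicted from the discarded singular values.
expected_error = np.sqrt(np.sum(s[k:] ** 2))

# Numbers stored: U_k (m*k) + singular values (k) + Vt_k (k*n).
stored = m * k + k + k * n
original = m * n

print(np.linalg.matrix_rank(A_k), error, expected_error, stored, original)
```

A random matrix like this one compresses poorly because its singular values decay slowly; real data and weight matrices with a few dominant directions lose much less.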