What is backpropagation? (medium)
It computes gradients layer by layer so neural networks can learn.
Backpropagation applies the chain rule to calculate how each weight contributed to loss, then an optimizer updates weights.
InterviewSkill
Neural network fundamentals for modern ML and AI engineering interviews.
Activations add non-linearity so networks can model complex patterns.
Without non-linear activations, stacked layers collapse into a linear transformation and lose expressive power.
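The collapse can be checked numerically. The sketch below is only an illustration: the weights `W1`, `W2` and the input `x` are arbitrary values chosen for the example, and biases are left out for brevity.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, t) for t in v]

# Illustrative weights and input, chosen arbitrarily for this sketch.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

# Two stacked linear layers equal one linear layer with weights W2 @ W1.
stacked = matvec(W2, matvec(W1, x))
collapsed = matvec(matmul(W2, W1), x)

# With a ReLU in between, the result no longer matches the collapsed map here.
with_relu = matvec(W2, relu(matvec(W1, x)))
```

Here `stacked` and `collapsed` agree, while `with_relu` differs because the ReLU zeroes the negative hidden unit.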
Dropout randomly disables units during training to reduce overfitting.
It prevents the network from relying too heavily on specific neurons and encourages more robust distributed representations.
A convolutional neural network uses filters to learn spatial features.
CNNs are effective for images because convolutions capture local patterns and reuse weights across spatial positions.
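As a rough illustration of weight reuse, the sketch below slides one small filter over a toy image. The function name `conv2d_valid`, the image, and the edge filter are assumptions made up for this example; real CNN layers add channels, padding, strides, and learned filters.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of rows), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every (i, j) position.
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# A 4x4 toy image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
# Hypothetical 1x2 filter that responds to left-to-right intensity increases.
edge_filter = [[-1, 1]]
feature_map = conv2d_valid(image, edge_filter)
```

The filter fires only at the edge column in every row, even though it has just two weights.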
The vanishing gradient problem occurs when gradients become too small for early layers to learn.
Deep networks can suffer from tiny gradients during backpropagation. Better activations, normalization, residual connections, and initialization help.
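A simplified way to see the effect is to multiply the local activation derivatives along a chain of layers. The sketch below ignores the weight matrices entirely (an assumption made to keep it short) and uses a hypothetical helper `chained_gradient`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def chained_gradient(local_grad, pre_activation, depth):
    """Product of `depth` identical local derivatives, ignoring weights."""
    g = 1.0
    for _ in range(depth):
        g *= local_grad(pre_activation)
    return g

# The sigmoid derivative peaks at 0.25, so 20 layers give at most 0.25 ** 20.
sig_20 = chained_gradient(sigmoid_grad, 0.0, 20)
# An active ReLU has derivative 1, so the product does not shrink.
relu_20 = chained_gradient(relu_grad, 1.0, 20)
```

Even at the sigmoid's best case the product falls below 1e-11 after 20 layers, which is one reason ReLU-style activations help.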
Forward pass: compute activations layer by layer, get a loss. Backward pass: use chain rule to compute gradient of loss w.r.t. each weight. For weight W in layer l: ∂L/∂W = ∂L/∂output × ∂output/∂W. Gradients flow backward through activation derivatives (ReLU → 1 if x>0 else 0, sigmoid → σ(1-σ)). Weights updated: W = W - lr × ∂L/∂W.
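The forward pass, backward pass, and update can be sketched on a two-weight scalar network. Everything here is an illustrative assumption: the network shape (`y = w2 * relu(w1 * x)`), the squared-error loss, and the starting values.

```python
def forward(w1, w2, x, target):
    """Forward pass; returns the loss and cached intermediates."""
    z = w1 * x                 # pre-activation
    h = max(0.0, z)            # ReLU
    y = w2 * h                 # output
    loss = 0.5 * (y - target) ** 2
    return loss, (z, h, y)

def backward(w1, w2, x, target):
    """Backward pass: chain rule from the loss back to each weight."""
    _, (z, h, y) = forward(w1, w2, x, target)
    dL_dy = y - target
    dL_dw2 = dL_dy * h                         # dL/dy * dy/dw2
    dL_dh = dL_dy * w2                         # dL/dy * dy/dh
    dL_dz = dL_dh * (1.0 if z > 0 else 0.0)    # ReLU derivative
    dL_dw1 = dL_dz * x                         # dL/dz * dz/dw1
    return dL_dw1, dL_dw2

def sgd_step(w1, w2, x, target, lr=0.1):
    """W = W - lr * dL/dW for both weights."""
    g1, g2 = backward(w1, w2, x, target)
    return w1 - lr * g1, w2 - lr * g2

w1, w2, x, target = 0.5, -1.0, 2.0, 1.0
loss_before, _ = forward(w1, w2, x, target)
grad_w1, grad_w2 = backward(w1, w2, x, target)
new_w1, new_w2 = sgd_step(w1, w2, x, target)
loss_after, _ = forward(new_w1, new_w2, x, target)
```

A finite-difference check on `forward` is a simple way to confirm that hand-derived gradients like these are right.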
In RNNs unrolled over long sequences, gradients shrink exponentially through time, so early timesteps get near-zero updates. LSTM introduces a cell state with additive (not purely multiplicative) updates and gates (input, forget, output) that regulate what to remember. When the forget gate stays near 1, gradients flow along the cell state largely intact, which mitigates (but does not eliminate) vanishing. GRU is a simpler variant with the same key idea.
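A scalar LSTM step makes the additive update concrete. This is a minimal sketch with made-up parameters in `params`; it tracks only the direct cell-state path (the derivative of c_t with respect to c_{t-1} along that path is the forget gate f) and compares it with a hypothetical plain RNN whose per-step factor is 0.5.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. `p` maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h_prev + b)
    f = gate("forget", sigmoid)
    i = gate("input", sigmoid)
    o = gate("output", sigmoid)
    g = gate("candidate", math.tanh)
    c = f * c_prev + i * g      # additive cell-state update
    h = o * math.tanh(c)
    return h, c, f

# Illustrative parameters; the large forget bias keeps f near 1.
params = {
    "forget": (0.0, 0.0, 5.0),
    "input": (0.5, 0.5, 0.0),
    "output": (0.5, 0.5, 0.0),
    "candidate": (0.5, 0.5, 0.0),
}

h, c, cell_path = 0.0, 0.0, 1.0
for _ in range(50):
    h, c, f = lstm_step(1.0, h, c, params)
    cell_path *= f              # direct path: d c_t / d c_{t-1} = f

rnn_path = 0.5 ** 50            # assumed per-step factor for a plain RNN
```

After 50 steps the cell-state path still carries a gradient factor above 0.5, while the assumed plain-RNN factor is below 1e-10.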
Tokens → embeddings + positional encoding → N encoder blocks. Each block: (1) Multi-head self-attention: Q, K, V projections → attention = softmax(QKᵀ/√d_k)V. (2) Add & LayerNorm. (3) Feed-forward (two linear layers + ReLU). (4) Add & LayerNorm. Decoder adds cross-attention over encoder output. Final → linear + softmax over vocabulary.
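The attention formula softmax(QKᵀ/√d_k)V can be sketched for a single head in plain Python. The Q, K, V values below are arbitrary toy inputs; the learned projections, multiple heads, and masking are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V as lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Each output row is a weighted average of the value rows.
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out, weights

# Toy inputs: 2 queries attending over 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, and the first query puts equal weight on the first and third keys because their dot products with it are equal.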
BERT: encoder-only, bidirectional, pretrained with Masked LM + Next Sentence Prediction. Sees full context, so it is good for classification/extraction. GPT: decoder-only, autoregressive, pretrained with causal LM (predict next token). Good for generation. For a document extraction pipeline, BERT-style is usually the better fit; for generative summarization (e.g., in a fund document processing system), GPT-style is better.
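One way to picture the bidirectional versus autoregressive difference is the attention mask each style uses. The helper `attention_mask` below is a hypothetical illustration, not any library's API.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

# BERT-style: every token sees the full context.
bidirectional = attention_mask(3, causal=False)
# GPT-style: each token sees only itself and earlier tokens.
causal = attention_mask(3, causal=True)
```

In an attention implementation, positions where the mask is False would typically have their scores set to a large negative value before the softmax.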
Pretraining: train on massive corpus to learn general representations (expensive, done once). Fine-tuning: update weights on task-specific labeled data — strong performance, needs labels. Prompt-based: craft inputs to steer a frozen LLM — zero/few-shot, no gradient updates. In production: prompt engineering first (fast, cheap), fine-tune if quality insufficient, pretrain only if domain is radically out-of-distribution.
Transformers have no inherent notion of sequence order. Positional encodings are added (summed) to token embeddings before the first layer. The original paper uses fixed sinusoidal functions. Later models such as BERT and GPT-2 use learned positional embeddings. RoPE (Rotary Position Embedding) instead encodes position by rotating Q and K vectors inside attention rather than summing into the embeddings; it tends to handle long contexts better and is used in LLaMA and Qwen.
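The fixed sinusoidal scheme can be sketched as follows, using the formulation from the original Transformer paper (sine on even dimensions, cosine on odd ones, base 10000). The function names and the toy embeddings in the usage are assumptions for illustration.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)); PE[pos][2i+1] = cos(same)."""
    pe = []
    for pos in range(seq_len):
        row = []
        for k in range(d_model):
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add_positions(embeddings):
    """Sum positional encodings into token embeddings (list of row vectors)."""
    pe = sinusoidal_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

pe = sinusoidal_encoding(3, 4)
# Toy embeddings: identical tokens become distinguishable once positions are added.
with_pos = add_positions([[0.1, 0.2, 0.3, 0.4]] * 3)
```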
BatchNorm normalizes each feature across the batch dimension, which is problematic for variable-length sequences and small batches. LayerNorm normalizes across the feature dimension for each sample independently, so it has no batch dependency. Transformers use LayerNorm. Pre-LN (normalize before each attention and feed-forward sub-layer, inside the residual branch) is more stable than post-LN during training for very deep models.
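The difference in normalization axis can be shown with a small sketch. Both functions below omit the learned scale and shift parameters and running statistics, which is a simplification; `batch_norm` reuses `layer_norm` on each feature column purely to keep the example short.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one sample's feature vector to zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature across the batch (training-mode statistics)."""
    # Transpose, normalize each feature column, transpose back.
    columns = [layer_norm(list(col), eps) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

sample = [1.0, 2.0, 3.0]
batch_a = [sample, [4.0, 6.0, 8.0]]
batch_b = [sample, [0.0, 0.0, 0.0]]

# LayerNorm output for `sample` never depends on the other rows.
ln_out = layer_norm(sample)
# BatchNorm output for `sample` changes with its batch mates.
bn_a = batch_norm(batch_a)[0]
bn_b = batch_norm(batch_b)[0]
```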
During training: randomly zero out activations with probability p, forcing the network to learn redundant representations. During inference: dropout is disabled. To keep expected values consistent, either inference activations are scaled by (1-p) (classic dropout), or surviving training activations are scaled by 1/(1-p) (inverted dropout, the common choice in modern frameworks), in which case inference needs no scaling. In Transformers, dropout is applied to attention weights and to sub-layer outputs such as the FFN. Typical p=0.1.
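Inverted dropout can be sketched in a few lines. The function name, the seeded `random.Random` generator, and the all-ones input are assumptions chosen so the effect on the mean is easy to see.

```python
import random

def inverted_dropout(activations, p, training, rng):
    """Zero each activation with probability p; rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)      # inference: identity, no scaling needed
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                # seeded only to make the sketch repeatable
x = [1.0] * 10000
train_out = inverted_dropout(x, p=0.1, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.1, training=False, rng=rng)
train_mean = sum(train_out) / len(train_out)
```

Because survivors are scaled up during training, `train_mean` stays close to the input mean of 1.0, and the inference path returns the activations unchanged.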
Softmax converts raw logits to a probability distribution: softmax(z_i) = exp(z_i) / Σ_j exp(z_j). Temperature T scales logits before softmax: softmax(z_i/T). T<1 (e.g., 0.3): sharpens → more deterministic. T>1 (e.g., 1.5): flattens → more random/creative. T→0 → argmax (greedy). Use low temperature for extraction tasks (e.g., a document extraction pipeline), higher for creative generation.
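The effect of temperature is easy to see on a small logit vector. The sketch below uses arbitrary example logits and subtracts the maximum before exponentiating, a standard stability trick that does not change the result.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """softmax(z_i / T); T must be positive (T -> 0 approaches argmax)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # arbitrary example values
sharp = softmax_with_temperature(logits, 0.3)
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 1.5)
```

The top token's probability is highest at T=0.3 and lowest at T=1.5, matching the sharpen/flatten behavior described above.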
Encoder-only (BERT): bidirectional, best for classification, extraction, embeddings. Decoder-only (GPT, Claude, LLaMA): autoregressive, best for generation and chat. Encoder-decoder (T5, BART): encoder processes input, decoder generates — best for seq2seq (translation, summarization). For RAG generation, decoder-only LLMs are standard; for embedding/retrieval, encoder-only models.