Deep Learning Interview Prep

What is backpropagation?

Difficulty: medium
Type: conceptual
Topic: backpropagation
Frequency: common
Tags: backpropagation

Answer

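Backpropagation computes the gradient of the loss with respect to every weight, working backward layer by layer, so the network can learn.

Explanation

Backpropagation applies the chain rule to calculate how much each weight contributed to the loss; an optimizer such as gradient descent then uses those gradients to update the weights.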

Follow-up: How does backpropagation relate to gradient descent?

Why do neural networks need activation functions?

Difficulty: medium
Type: conceptual
Topic: neural-networks-need-activation-functions
Frequency: common
Tags: neural, networks, need, activation, functions

Answer

Activations add non-linearity so networks can model complex patterns.

Explanation

Without non-linear activations, any stack of layers collapses into a single linear transformation, because a composition of linear maps is itself linear, so added depth buys no expressive power.
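A minimal sketch of that collapse, using hypothetical toy weights `W1` and `W2` and a small `matmul` helper written only for this example (none of these names come from a library):

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU on a matrix given as lists of rows."""
    return [[max(0.0, v) for v in row] for row in m]

# Hypothetical toy weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[2.0, 1.0, 1.0]]
x = [[4.0], [5.0]]  # column vector

# Two layers with no activation in between...
two_layer_out = matmul(W2, matmul(W1, x))
# ...give the same result as one layer whose weight matrix is W2 @ W1.
W_combined = matmul(W2, W1)
one_layer_out = matmul(W_combined, x)

# With a ReLU between the layers, the single-matrix shortcut no longer applies.
relu_out = matmul(W2, relu(matmul(W1, x)))

print(two_layer_out, one_layer_out, relu_out)
```

Here the two linear variants agree, while the ReLU variant differs because the negative hidden value is clipped to zero before the second layer.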

Follow-up: Why is ReLU popular?

What is dropout?

Difficulty: hard
Type: conceptual
Topic: dropout
Frequency: common
Tags: dropout

Answer

Dropout randomly disables units during training to reduce overfitting.

Explanation

It prevents the network from relying too heavily on specific neurons and encourages more robust distributed representations.

Follow-up: Is dropout used during inference?

What is a CNN?

Difficulty: medium
Type: conceptual
Topic: cnn
Frequency: common
Tags: cnn

Answer

A convolutional neural network uses filters to learn spatial features.

Explanation

CNNs are effective for images because convolutions capture local patterns and reuse weights across spatial positions.
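To make the weight reuse concrete, here is a rough one-dimensional sketch. The `conv1d` helper, the two-tap `edge_kernel`, and the toy `signal` are all assumptions made for illustration; like most deep learning "convolutions", it is a cross-correlation with no kernel flip.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: slide one shared kernel across the signal."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

edge_kernel = [-1.0, 1.0]                # hypothetical 2-tap "edge detector"
signal = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
shifted = [0.0] + signal[:-1]            # same pattern, moved one step right

response = conv1d(signal, edge_kernel)
shifted_response = conv1d(shifted, edge_kernel)

print(response)
print(shifted_response)
```

The same two weights are applied at every position, so the parameter count does not grow with the input length, and moving the pattern in the input moves the response by the same amount.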

Follow-up: Why does weight sharing help?

What is vanishing gradient?

Difficulty: medium
Type: conceptual
Topic: vanishing-gradient
Frequency: common
Tags: vanishing, gradient

Answer

It happens when gradients become too small for early layers to learn.

Explanation

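Deep networks can suffer from tiny gradients during backpropagation because the chain rule multiplies one factor per layer; when those factors are smaller than 1, the product shrinks toward zero by the time it reaches the early layers. Non-saturating activations such as ReLU, normalization, residual connections, and careful initialization help.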
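A small sketch of the shrinking product, under simplifying assumptions: every layer uses a sigmoid, every pre-activation is 0 (where the sigmoid derivative peaks at 0.25), and every weight is 1. The `gradient_scale` function is a name made up for this example.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid at z: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale(depth, z=0.0, weight=1.0):
    """Product of per-layer chain-rule factors across `depth` sigmoid layers."""
    scale = 1.0
    for _ in range(depth):
        scale *= weight * sigmoid_grad(z)
    return scale

for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

Even in this best case for the sigmoid, each layer multiplies the gradient by at most 0.25, so the factor reaching the first layer falls off exponentially with depth.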

Follow-up: How do residual connections reduce this problem?

Walk me through backpropagation from scratch.

Difficulty: medium
Type: scenario
Topic: me-through-backpropagation-from-scratch
Frequency: common
Tags: deep-learning, walk, through, backpropagation, from, scratch

Answer

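Run a forward pass to get the loss, run a backward pass that applies the chain rule to get each weight's gradient, then update the weights.

Explanation

Forward pass: compute activations layer by layer and evaluate the loss. Backward pass: apply the chain rule to get the gradient of the loss with respect to each weight, starting at the output and reusing each layer's upstream gradient for the layer before it. For a weight W in layer l: ∂L/∂W = ∂L/∂output × ∂output/∂W, where output is that layer's output. Gradients flow backward through the activation derivatives (ReLU: 1 if x > 0, else 0; sigmoid: σ(1−σ)). Update step: W = W − lr × ∂L/∂W.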
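The steps above can be sketched on the smallest possible network: one input, one sigmoid hidden unit, one linear output, and a squared-error loss. The network shape, the starting weights, and the names `forward` and `backward` are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x, target):
    """Forward pass: z = w1*x, h = sigmoid(z), y = w2*h, loss = 0.5*(y - target)^2."""
    z = w1 * x
    h = sigmoid(z)
    y = w2 * h
    loss = 0.5 * (y - target) ** 2
    return z, h, y, loss

def backward(w1, w2, x, target):
    """Backward pass: chain rule from the loss back to each weight."""
    z, h, y, loss = forward(w1, w2, x, target)
    dL_dy = y - target                 # derivative of the squared-error loss
    dL_dw2 = dL_dy * h                 # y = w2 * h
    dL_dh = dL_dy * w2                 # gradient flowing into the hidden unit
    dL_dz = dL_dh * h * (1.0 - h)      # sigmoid derivative: s * (1 - s)
    dL_dw1 = dL_dz * x                 # z = w1 * x
    return dL_dw1, dL_dw2

# Hypothetical starting point and learning rate.
w1, w2, x, target, lr = 0.5, -0.3, 1.5, 1.0, 0.1

loss_before = forward(w1, w2, x, target)[3]
g1, g2 = backward(w1, w2, x, target)
w1 -= lr * g1                          # W = W - lr * dL/dW
w2 -= lr * g2
loss_after = forward(w1, w2, x, target)[3]

print(loss_before, loss_after)
```

A useful sanity check in an interview setting is to compare the analytic gradients against finite differences of the loss; the two should agree to several decimal places.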

Follow-up: What tradeoffs did you consider in that implementation?

What is vanishing gradient and how does LSTM solve it?

Difficulty: medium
Type: conceptual
Topic: is-vanishing-gradient-and-how-does-lstm-solve-it
Frequency: common
Tags: deep-learning, what, vanishing, gradient, and, how

Answer

In RNNs unrolled over long sequences, gradients shrink exponentially as they flow back through time, so early timesteps get near-zero updates.

Explanation

In RNNs unrolled over long sequences, gradients shrink exponentially as they flow back through time, so early timesteps get near-zero updates. LSTM introduces a cell state with additive updates (rather than repeated multiplication through the recurrent weights) and gates (input, forget, output) that regulate what to remember. When the forget gate stays near 1, the gradient passes along the cell state almost unchanged, which mitigates vanishing gradients rather than eliminating them. GRU is a simpler variant built on the same key idea.
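A rough numerical illustration, assuming a scalar state and looking only at the direct path through time. It ignores the indirect paths through the gates, and the per-step factors 0.99 and 0.5 are hypothetical values, not measurements.

```python
def path_gradient(step_factors):
    """Gradient scale along one path through time: product of per-step factors."""
    grad = 1.0
    for factor in step_factors:
        grad *= factor
    return grad

steps = 100

# LSTM-like: along the cell state, each step multiplies by the forget gate.
lstm_like = path_gradient([0.99] * steps)

# Vanilla-RNN-like: each step multiplies by weight * activation derivative,
# assumed here to be 0.5.
rnn_like = path_gradient([0.5] * steps)

print(lstm_like, rnn_like)
```

With a forget gate near 1 the signal from 100 steps back is still a usable fraction of its original size, while the smaller per-step factor drives it to effectively zero.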

Follow-up: Can you give a production example?

Explain the Transformer architecture from scratch.

Difficulty: hard
Type: conceptual
Topic: the-transformer-architecture-from-scratch
Frequency: common
Tags: deep-learning, explain, the, transformer, architecture, from

Answer

Tokens → embeddings + positional encoding → N encoder blocks.

Explanation

Tokens → embeddings + positional encoding → N encoder blocks. Each block:
(1) Multi-head self-attention: Q, K, V projections, then attention = softmax(QKᵀ/√d_k)V.
(2) Add & LayerNorm (residual connection, then normalization).
(3) Feed-forward (two linear layers with a ReLU between them).
(4) Add & LayerNorm.
The decoder uses masked self-attention and adds cross-attention over the encoder output. Final step: linear + softmax over the vocabulary.
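The attention formula, softmax(QKᵀ/√d_k)V, can be sketched for a single head in plain Python. This is a teaching sketch with matrices as lists of rows, without the learned Q/K/V projections, masking, or multiple heads; the function names are assumptions for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Hypothetical toy inputs: one query, two key/value pairs.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = attention(Q, K, V)
print(out, attn)
```

With an all-zero query every key scores the same, so the weights are uniform and the output is simply the average of the value vectors.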

Follow-up: Can you give a production example?

How does BERT differ from GPT in architecture and pretraining?

Difficulty: hard
Type: conceptual
Topic: does-bert-differ-from-gpt-in-architecture-and-pretraining
Frequency: common
Tags: deep-learning, how, does, bert, differ, from

Answer

BERT: encoder-only, bidirectional, pretrained with Masked LM + Next Sentence Prediction.

Explanation

BERT: encoder-only, bidirectional, pretrained with Masked LM + Next Sentence Prediction. It sees the full context on both sides of each token, which makes it good for classification and extraction. GPT: decoder-only, autoregressive, pretrained with causal LM (predict the next token). Good for generation. For the extraction step of a document extraction pipeline, a BERT-style model is usually the better fit; for generative summarization (for example, in a fund document processing system), a GPT-style model is the better fit.
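One way to picture the bidirectional versus autoregressive difference is the attention mask each style uses. The `attention_mask` helper below is a hypothetical illustration, not an API from any library; `True` means "position i may attend to position j".

```python
def attention_mask(seq_len, causal):
    """Boolean mask: entry [i][j] is True if token i may attend to token j."""
    return [
        [(j <= i) or not causal for j in range(seq_len)]
        for i in range(seq_len)
    ]

bert_style = attention_mask(3, causal=False)  # every token sees every token
gpt_style = attention_mask(3, causal=True)    # each token sees itself and the past

for row in gpt_style:
    print(row)
```

The lower-triangular causal mask is what lets a decoder-only model be trained to predict the next token without seeing it.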

Follow-up: When would you choose one approach over the other?

Pretraining vs fine-tuning vs prompt-based learning: what are the differences?

Difficulty: hard
Type: conceptual
Topic: pretraining-vs-fine-tuning-vs-prompt-based-learning-differ
Frequency: common
Tags: deep-learning, pretraining, fine, tuning, prompt, based

Answer

Pretraining: train on massive corpus to learn general representations (expensive, done once).

Explanation

Pretraining: train on a massive corpus to learn general representations (expensive, done once). Fine-tuning: update weights on task-specific labeled data; strong performance, but needs labels. Prompt-based: craft inputs to steer a frozen LLM; zero/few-shot, no gradient updates. In production: start with prompt engineering (fast, cheap), fine-tune if quality is insufficient, and pretrain only if the domain is radically out-of-distribution.

Follow-up: When would you choose one approach over the other?

How do token embeddings combine with positional embeddings in a Transformer?

Difficulty: medium
Type: conceptual
Topic: do-token-embeddings-combine-with-positional-embeddings-in
Frequency: common
Tags: deep-learning, how, token, embeddings, combine, with

Answer

Transformers have no inherent notion of sequence order. Positional encodings are added (summed) to token embeddings before the first layer.

Explanation

Transformers have no inherent notion of sequence order. Positional encodings are added (summed) to token embeddings before the first layer. The original Transformer paper uses fixed sinusoidal functions. Models such as BERT and GPT-2 use learned absolute positional embeddings. Many recent LLMs use RoPE (Rotary Position Embedding), which encodes position by rotating the Q and K vectors inside attention instead of adding a vector to the input; it tends to handle long contexts better and is used in LLaMA and Qwen.
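A minimal sketch of the fixed sinusoidal variant and the element-wise sum with token embeddings. The function names and the toy embeddings are assumptions for illustration; real models do this on tensors with `d_model` in the hundreds or thousands.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)), PE[pos][2i+1] = cos(same angle)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(token_embeddings):
    """Sum each token embedding with the encoding for its position."""
    pe = sinusoidal_encoding(len(token_embeddings), len(token_embeddings[0]))
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(token_embeddings, pe)]

# The same hypothetical token embedding repeated at three positions.
same_token = [[1.0, 1.0, 1.0, 1.0]] * 3
for row in add_positions(same_token):
    print(row)
```

Identical tokens end up with different input vectors once the position term is added, which is how the otherwise order-blind attention layers can tell positions apart.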

Follow-up: Can you give a production example?

Batch normalization vs layer normalization: which is preferred in Transformers?

Difficulty: medium
Type: conceptual
Topic: batch-normalization-vs-layer-normalization-which-is-prefer
Frequency: common
Tags: deep-learning, batch, normalization, layer, which

Answer

BatchNorm normalizes across the batch dimension — problematic for variable-length sequences and small batches.

Explanation

BatchNorm normalizes each feature across the batch dimension, which is problematic for variable-length sequences and small batches. LayerNorm normalizes across the feature dimension for each sample (each token) independently, so it has no batch dependency. Transformers use LayerNorm. Pre-LN (normalize before each attention and feed-forward sub-layer) is generally more stable during training than post-LN for very deep models.
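The difference comes down to which axis the statistics are computed over. The sketch below omits the learnable scale and shift and BatchNorm's running statistics; the helper names and toy batches are assumptions made for this example.

```python
import math

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (roughly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    """Normalize each sample across its own features (no batch dependency)."""
    return [normalize(sample) for sample in batch]

def batch_norm(batch):
    """Normalize each feature across the batch (training-mode statistics)."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

# Two hypothetical batches that share the same first sample.
batch_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
batch_b = [[1.0, 2.0, 3.0], [40.0, 50.0, 60.0], [7.0, 8.0, 9.0]]

print(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
print(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
```

The LayerNorm output for the first sample is the same in both batches, while its BatchNorm output changes as soon as another sample in the batch changes.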

Follow-up: When would you choose one approach over the other?

Explain dropout: training vs inference behavior.

Difficulty: medium
Type: conceptual
Topic: dropout-training-vs-inference-behavior
Frequency: common
Tags: deep-learning, explain, dropout, training, inference, behavior

Answer

During training: randomly zero out activations with probability p, forcing the network to learn redundant representations.

Explanation

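During training: randomly zero out each activation with probability p, forcing the network to learn redundant representations. During inference: dropout is disabled. To keep expected values consistent, classic dropout scales activations by (1−p) at inference; inverted dropout, the variant most frameworks implement, instead scales the kept activations by 1/(1−p) during training, so inference needs no change. In Transformers, dropout is applied to the attention weights and to the output of each sub-layer (attention and FFN) before the residual add. Typical p = 0.1.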
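A minimal sketch of inverted dropout on a flat list of activations. The `dropout` function and its `rng` parameter are assumptions for illustration, not a framework API.

```python
import random

def dropout(activations, p, training, rng=random):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)       # inference: identity, no rescaling needed
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                  # seeded for repeatability
acts = [1.0] * 20000

train_out = dropout(acts, p=0.1, training=True, rng=rng)
eval_out = dropout(acts, p=0.1, training=False)

train_mean = sum(train_out) / len(train_out)
print(train_mean, sum(eval_out) / len(eval_out))
```

Because the survivors are scaled up during training, the average activation stays close to its original value, which is why the inference path can leave activations untouched.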

Follow-up: When would you choose one approach over the other?

What is softmax and what does temperature do to it?

Difficulty: medium
Type: conceptual
Topic: is-softmax-and-what-does-temperature-do-to-it
Frequency: common
Tags: deep-learning, what, softmax, and, does

Answer

Softmax converts raw logits to a probability distribution: softmax(z_i) = exp(z_i) / Σ_j exp(z_j).

Explanation

Softmax converts raw logits to a probability distribution: softmax(z_i) = exp(z_i) / Σ_j exp(z_j). Temperature T divides the logits before softmax: softmax(z_i/T). T<1 (e.g., 0.3): sharpens the distribution, so sampling is more deterministic. T>1 (e.g., 1.5): flattens it, so sampling is more random/creative. As T→0, softmax approaches argmax (greedy decoding). Use a low temperature for extraction tasks (such as a document extraction pipeline) and a higher one for creative generation.
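A short sketch of temperature scaling, using the example temperatures from the explanation (0.3 and 1.5) on hypothetical logits. The function name is an assumption for this example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T -> 0")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                     # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                # hypothetical raw scores
sharp = softmax_with_temperature(logits, 0.3)
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 1.5)

for name, probs in (("T=0.3", sharp), ("T=1.0", base), ("T=1.5", flat)):
    print(name, [round(p, 3) for p in probs])
```

The top logit takes a larger share of the probability mass as T drops and a smaller share as T rises, while the ranking of the tokens stays the same.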

Follow-up: Can you give a production example?

Encoder-only vs decoder-only vs encoder-decoder Transformers?

Difficulty: medium
Type: conceptual
Topic: encoder-only-vs-decoder-only-vs-encoder-decoder-transforme
Frequency: common
Tags: deep-learning, encoder, only, decoder

Answer

Encoder-only (BERT): bidirectional, best for classification, extraction, embeddings.

Explanation

Encoder-only (BERT): bidirectional, best for classification, extraction, embeddings. Decoder-only (GPT, Claude, LLaMA): autoregressive, best for generation and chat. Encoder-decoder (T5, BART): encoder processes input, decoder generates — best for seq2seq (translation, summarization). For RAG generation, decoder-only LLMs are standard; for embedding/retrieval, encoder-only models.

Follow-up: When would you choose one approach over the other?
