InterviewSkill

Deep Learning Interview Questions

Neural network fundamentals for modern ML and AI engineering interviews.

15 questions
Deep Learning

What is backpropagation? (medium)

Type
conceptual
Topic
backpropagation
Frequency
common
Tags
backpropagation
Answer

Backpropagation computes the gradient of the loss with respect to every weight, working backward layer by layer, so the network can learn.

Explanation

Backpropagation applies the chain rule to calculate how much each weight contributed to the loss. An optimizer such as SGD or Adam then uses those gradients to update the weights.

Follow-up: How does backpropagation relate to gradient descent?

Why do neural networks need activation functions? (medium)

Type
conceptual
Topic
neural-networks-need-activation-functions
Frequency
common
Tags
neural, networks, need, activation, functions
Answer

Activations add non-linearity so networks can model complex patterns.

Explanation

Without non-linear activations, any stack of layers collapses into a single linear (affine) transformation, so extra depth adds no expressive power.
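The collapse can be checked numerically. The sketch below uses hypothetical 2x2 weights and omits biases for brevity; `matmul` and `relu` are illustrative helpers, not part of any library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu(m):
    """Element-wise ReLU on a matrix."""
    return [[max(0.0, v) for v in row] for row in m]

# Hypothetical weights for two layers (no biases, for brevity).
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [[1.0], [2.0]]  # input as a column vector

# Two linear layers applied in sequence...
two_layers = matmul(W2, matmul(W1, x))
# ...give the same result as one layer whose weight is the product W2 W1.
one_layer = matmul(matmul(W2, W1), x)

# With a ReLU in between, the network is no longer equivalent to that merged matrix.
with_relu = matmul(W2, relu(matmul(W1, x)))

print(two_layers, one_layer, with_relu)
```

Here the first hidden unit is negative before the ReLU, so clipping it changes the output and the merged matrix no longer reproduces the network.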

Follow-up: Why is ReLU popular?

What is dropout? (hard)

Type
conceptual
Topic
dropout
Frequency
common
Tags
dropout
Answer

Dropout randomly disables units during training to reduce overfitting.

Explanation

It prevents the network from relying too heavily on specific neurons and encourages more robust distributed representations.

Follow-up: Is dropout used during inference?

What is a CNN? (medium)

Type
conceptual
Topic
cnn
Frequency
common
Tags
cnn
Answer

A convolutional neural network uses filters to learn spatial features.

Explanation

CNNs are effective for images because convolutions capture local patterns and reuse weights across spatial positions.
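A minimal sketch of weight sharing, assuming a single-channel image, stride 1, and no padding. The function name `conv2d_valid`, the toy image, and the edge-detecting kernel are all illustrative choices.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D image (no padding, stride 1).

    Strictly this is cross-correlation, which is what most deep learning
    libraries compute under the name "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every position (i, j).
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# Toy 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
# 2x2 kernel that responds to a left-to-right increase in intensity.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The kernel has only four weights regardless of image size, and the feature map is non-zero only in the column where the edge sits.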

Follow-up: Why does weight sharing help?

What is the vanishing gradient problem? (medium)

Type
conceptual
Topic
vanishing-gradient
Frequency
common
Tags
vanishing, gradient
Answer

It happens when gradients become too small for early layers to learn.

Explanation

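During backpropagation the gradient is multiplied by one factor per layer. When those factors are smaller than 1 (for example, with saturating sigmoid or tanh units), the product shrinks exponentially with depth and early layers barely update. Non-saturating activations such as ReLU, normalization layers, residual connections, and careful initialization all help.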
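A rough illustration of the shrinkage, assuming a chain of scalar sigmoid layers with weight 1 and pre-activation 0. That is the best case for sigmoid, whose derivative peaks at 0.25. The function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def gradient_scale(num_layers, pre_activation=0.0, weight=1.0):
    """Size of the backpropagated factor through a chain of scalar sigmoid
    layers, each contributing weight * sigmoid'(pre_activation)."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sigmoid_grad(pre_activation)
    return scale

for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

Even in this best case the factor is 0.25 per layer, so the gradient reaching the first layer falls off as 0.25 to the power of the depth.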

Follow-up: How do residual connections reduce this problem?

Walk me through backpropagation from scratch. (medium)

Type
scenario
Topic
me-through-backpropagation-from-scratch
Frequency
common
Tags
deep-learning, walk, through, backpropagation, from, scratch
Answer

Run a forward pass to get the loss, apply the chain rule backward to get the gradient for each weight, then update the weights with gradient descent.

Explanation

Forward pass: compute activations layer by layer and evaluate the loss L. Backward pass: apply the chain rule to get the gradient of the loss with respect to each weight. For a weight W in layer l: ∂L/∂W = ∂L/∂output × ∂output/∂W, where output is that layer's output. Gradients flow backward through the activation derivatives (ReLU: 1 if x > 0, else 0; sigmoid: σ(x)(1 − σ(x))). Update step: W ← W − lr × ∂L/∂W, where lr is the learning rate.
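A minimal sketch of these steps on a hypothetical network with one scalar input, one ReLU hidden unit, and a squared-error loss. The parameter values, learning rate, and function names are illustrative assumptions. The finite-difference check is a common way to confirm hand-written gradients.

```python
def forward(params, x, target):
    w1, b1, w2, b2 = params
    z = w1 * x + b1                  # hidden pre-activation
    h = max(0.0, z)                  # ReLU
    y = w2 * h + b2                  # network output
    loss = 0.5 * (y - target) ** 2   # squared-error loss
    return z, h, y, loss

def backward(params, x, target):
    """Gradients of the loss w.r.t. [w1, b1, w2, b2] via the chain rule."""
    w1, b1, w2, b2 = params
    z, h, y, _ = forward(params, x, target)
    dy = y - target                      # dL/dy
    dw2, db2 = dy * h, dy                # output layer
    dh = dy * w2                         # gradient flowing into h
    dz = dh * (1.0 if z > 0 else 0.0)    # through the ReLU derivative
    dw1, db1 = dz * x, dz                # hidden layer
    return [dw1, db1, dw2, db2]

def numerical_grad(params, x, target, eps=1e-6):
    """Central finite differences, used only to check backward()."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        diff = forward(up, x, target)[3] - forward(down, x, target)[3]
        grads.append(diff / (2 * eps))
    return grads

params = [0.5, 0.1, -0.3, 0.2]   # hypothetical w1, b1, w2, b2
x, target, lr = 2.0, 1.0, 0.1

grads = backward(params, x, target)
new_params = [p - lr * g for p, g in zip(params, grads)]  # W <- W - lr * dL/dW

print(forward(params, x, target)[3], forward(new_params, x, target)[3])
```

One update step lowers the loss on this example, and the analytic gradients agree with the numerical ones to within floating-point error.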

Follow-up: What tradeoffs did you consider in that implementation?

What is the vanishing gradient problem and how does an LSTM address it? (medium)

Type
conceptual
Topic
is-vanishing-gradient-and-how-does-lstm-solve-it
Frequency
common
Tags
deep-learning, what, vanishing, gradient, and, how
Answer

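In RNNs trained on long sequences, gradients shrink exponentially as they are backpropagated through time, so early timesteps receive near-zero updates. LSTMs mitigate this with a gated, additively updated cell state.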

Explanation

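In RNNs trained on long sequences, gradients shrink exponentially as they are backpropagated through time, so early timesteps receive near-zero updates. An LSTM introduces a cell state that is updated additively (c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t) rather than being passed through a squashing non-linearity at every step, plus gates (input, forget, output) that regulate what to remember. When the forget gate stays near 1, the gradient flows along the cell state almost unchanged, which mitigates vanishing gradients without eliminating them. The GRU is a simpler gated variant built on the same idea.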
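A simplified scalar comparison of the two gradient paths. It assumes a vanilla recurrence h_t = tanh(w · h_{t-1}) with no input term, and for the LSTM it follows only the direct cell-state path while treating the gates as constants. Both are simplifications, and the values of `w`, `h0`, and `forget_gate` are hypothetical.

```python
import math

def rnn_gradient_through_time(steps, w=0.9, h0=0.5):
    """d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # chain-rule factor for this timestep
    return grad

def lstm_cell_gradient_through_time(steps, forget_gate=0.99):
    """d c_T / d c_0 along the direct path of c_t = f * c_{t-1} + i * g,
    with the gates treated as constants."""
    grad = 1.0
    for _ in range(steps):
        grad *= forget_gate         # the additive update leaves only f as the factor
    return grad

for steps in (10, 50, 100):
    print(steps,
          rnn_gradient_through_time(steps),
          lstm_cell_gradient_through_time(steps))
```

With these settings the vanilla RNN factor falls below 1e-4 after 100 steps, while the cell-state path with a forget gate of 0.99 still carries roughly a third of the gradient.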

Follow-up: Can you give a production example?

Explain the Transformer architecture from scratch. (hard)

Type
conceptual
Topic
the-transformer-architecture-from-scratch
Frequency
common
Tags
deep-learning, explain, the, transformer, architecture, from
Answer

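Tokens are embedded, combined with positional encodings, and passed through N stacked blocks of multi-head self-attention and feed-forward layers, each wrapped in a residual connection and LayerNorm.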

Explanation

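Tokens → embeddings + positional encoding → N encoder blocks. Each block: (1) multi-head self-attention: project the input to Q, K, V and compute attention = softmax(QKᵀ/√d_k)V per head; (2) residual add and LayerNorm; (3) position-wise feed-forward network (two linear layers with a ReLU between them); (4) residual add and LayerNorm. Decoder blocks use masked (causal) self-attention and add cross-attention over the encoder output. The final decoder output goes through a linear layer and a softmax over the vocabulary.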
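The attention formula can be sketched for a single head in plain Python. This assumes Q, K, and V are already projected and given as lists of row vectors. Masking, multiple heads, and the learned projections are left out, and the example values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs, all_weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
print(weights)
print(outputs)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors.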

Follow-up: Can you give a production example?

How does BERT differ from GPT in architecture and pretraining? (hard)

Type
conceptual
Topic
does-bert-differ-from-gpt-in-architecture-and-pretraining
Frequency
common
Tags
deep-learning, how, does, bert, differ, from
Answer

BERT is encoder-only and bidirectional, pretrained with Masked LM + Next Sentence Prediction; GPT is decoder-only and autoregressive, pretrained to predict the next token.

Explanation

BERT: encoder-only, bidirectional, pretrained with Masked LM + Next Sentence Prediction. It sees the full context on both sides of each token, which suits classification and extraction. GPT: decoder-only, autoregressive, pretrained with causal LM (predict the next token), which suits generation. For a document extraction pipeline, a BERT-style model is usually the better fit; for generative summarization (for example, in a fund document processing system), a GPT-style model is the better fit.
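The architectural difference shows up most directly in the attention mask. A minimal sketch, where `attention_mask` is an illustrative helper rather than a library function:

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

bert_style = attention_mask(4, causal=False)  # every token sees every token
gpt_style = attention_mask(4, causal=True)    # token i sees only tokens 0..i

for row in gpt_style:
    print(["x" if allowed else "." for allowed in row])
```

The causal mask is lower-triangular, which is what lets a decoder-only model be trained to predict each next token without seeing it.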

Follow-up: When would you choose one approach over the other?

Pretraining vs fine-tuning vs prompt-based learning: what are the differences? (hard)

Type
conceptual
Topic
pretraining-vs-fine-tuning-vs-prompt-based-learning-differ
Frequency
common
Tags
deep-learning, pretraining, fine, tuning, prompt, based
Answer

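Pretraining learns general representations from a massive corpus, fine-tuning updates the weights on task-specific labeled data, and prompt-based learning steers a frozen model through its input alone.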

Explanation

Pretraining: train on a massive corpus to learn general representations (expensive, typically done once). Fine-tuning: update the weights on task-specific labeled data; this gives strong performance but needs labels. Prompt-based: craft inputs to steer a frozen LLM in a zero-shot or few-shot setting, with no gradient updates. In production: start with prompt engineering (fast, cheap), fine-tune if quality is insufficient, and pretrain only if the domain is radically out-of-distribution.

Follow-up: When would you choose one approach over the other?

How do token embeddings combine with positional embeddings in a Transformer? (medium)

Type
conceptual
Topic
do-token-embeddings-combine-with-positional-embeddings-in
Frequency
common
Tags
deep-learning, how, token, embeddings, combine, with
Answer

Transformers have no inherent notion of sequence order. Positional encodings are added (summed) to token embeddings before the first layer.

Explanation

Transformers have no inherent notion of sequence order. In the original design, positional encodings are added (summed) to token embeddings before the first layer. The original paper uses fixed sinusoidal functions; BERT and GPT-2 use learned absolute position embeddings. RoPE (Rotary Position Embedding) instead encodes position by rotating the Q and K vectors inside attention rather than adding to the embeddings. It generally handles long contexts better and is used in LLaMA and Qwen.
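A minimal sketch of the fixed sinusoidal scheme and the summation step, following the formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function names and the zero-valued toy embeddings are illustrative.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal positional encoding for one position.

    Even dimensions hold the sine, odd dimensions hold the cosine of the
    same angle.
    """
    pe = []
    for i in range(0, d_model, 2):       # i is the even dimension index (2i in the formula)
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        if i + 1 < d_model:
            pe.append(math.cos(angle))
    return pe

def add_positions(token_embeddings):
    """Element-wise sum of each token embedding and its positional encoding."""
    d_model = len(token_embeddings[0])
    return [[t + p for t, p in zip(tok, sinusoidal_encoding(pos, d_model))]
            for pos, tok in enumerate(token_embeddings)]

# Toy embeddings of zeros, so the output shows the positional part alone.
tokens = [[0.0] * 4 for _ in range(3)]
for row in add_positions(tokens):
    print([round(v, 4) for v in row])
```

Because the encoding is summed rather than concatenated, the model dimension stays at `d_model`.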

Follow-up: Can you give a production example?

Batch normalization vs layer normalization: which is preferred in Transformers? (medium)

Type
conceptual
Topic
batch-normalization-vs-layer-normalization-which-is-prefer
Frequency
common
Tags
deep-learning, batch, normalization, layer, which
Answer

Transformers use LayerNorm: it normalizes each sample over its feature dimension, so it does not depend on batch size or on the other sequences in the batch.

Explanation

BatchNorm normalizes each feature using statistics computed across the batch, which is problematic for variable-length sequences and small batches. LayerNorm normalizes across the feature dimension for each sample (each token) independently, so it has no batch dependency. Transformers use LayerNorm. Pre-LN (normalize before each attention and feed-forward sublayer) trains more stably than post-LN in very deep models.
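A minimal sketch of the two normalization axes on a small batch given as a list of feature rows. The learnable scale and shift parameters and BatchNorm's running statistics are omitted, and the function names and sample values are illustrative.

```python
import math

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    """Normalize each sample over its own features (rows)."""
    return [normalize(sample) for sample in batch]

def batch_norm(batch):
    """Normalize each feature over the samples in the batch (columns)."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

sample = [1.0, 2.0, 3.0]
batch_a = [sample, [10.0, 20.0, 30.0], [4.0, 4.0, 4.0]]
batch_b = [sample, [7.0, 7.0, 8.0], [0.0, 0.0, 0.0]]

# LayerNorm output for `sample` is the same in both batches...
print(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
# ...while BatchNorm output depends on what else is in the batch.
print(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
```

The same sample gets different BatchNorm outputs in the two batches, which is the batch dependency that makes it awkward for sequence models.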

Follow-up: When would you choose one approach over the other?

Explain dropout: training vs inference behavior. (medium)

Type
conceptual
Topic
dropout-training-vs-inference-behavior
Frequency
common
Tags
deep-learning, explain, dropout, training, inference, behavior
Answer

During training, activations are randomly zeroed with probability p, forcing the network to learn redundant representations; during inference, dropout is disabled and all units are used.

Explanation

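During training: randomly zero out activations with probability p, forcing the network to learn redundant representations. During inference: dropout is disabled. To keep expected activations consistent between the two modes, either scale activations by (1 − p) at inference (the original formulation) or scale the surviving activations by 1/(1 − p) during training (inverted dropout, the usual choice in frameworks) so that inference needs no change. In Transformers, dropout is commonly applied to the attention weights and to the output of each sublayer (attention and FFN) before the residual addition. A typical value is p = 0.1.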
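A minimal sketch of inverted dropout on a flat list of activations, assuming 0 ≤ p < 1. The function signature and the injectable `rng` argument are illustrative choices, not a framework API.

```python
import random

def dropout(activations, p, training, rng=random):
    """Inverted dropout.

    Training: zero each unit with probability p and scale survivors by
    1 / (1 - p) so the expected value is unchanged.
    Inference: return the activations untouched.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

activations = [1.0] * 10000
train_out = dropout(activations, p=0.1, training=True, rng=random.Random(0))
eval_out = dropout(activations, p=0.1, training=False)

# The training-time mean stays close to the inference-time mean of 1.0.
print(sum(train_out) / len(train_out), sum(eval_out) / len(eval_out))
```

Scaling at training time is what lets the inference path be a plain pass-through.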

Follow-up: When would you choose one approach over the other?

What is softmax and what does temperature do to it? (medium)

Type
conceptual
Topic
is-softmax-and-what-does-temperature-do-to-it
Frequency
common
Tags
deep-learning, what, softmax, and, does
Answer

Softmax converts raw logits to a probability distribution: softmax(z_i) = exp(z_i) / Σ_j exp(z_j).

Explanation

Softmax converts raw logits to a probability distribution: softmax(z_i) = exp(z_i) / Σ_j exp(z_j). Temperature T divides the logits before the softmax is applied: softmax(z_i/T). T<1 (e.g., 0.3): sharpens the distribution → more deterministic. T>1 (e.g., 1.5): flattens it → more random/creative. T→0 → argmax (greedy). Use a low temperature for extraction tasks (for example, a document extraction pipeline) and a higher one for creative generation.
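A minimal sketch of temperature scaling, assuming T > 0. The logits and the function name are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (temperature must be positive)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                         # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass concentrates on the largest logit, which is why very low temperatures behave like greedy decoding.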

Follow-up: Can you give a production example?

Encoder-only vs decoder-only vs encoder-decoder Transformers? (medium)

Type
conceptual
Topic
encoder-only-vs-decoder-only-vs-encoder-decoder-transforme
Frequency
common
Tags
deep-learning, encoder, only, decoder
Answer

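Encoder-only models (BERT) read the whole input bidirectionally, decoder-only models (GPT) generate autoregressively, and encoder-decoder models (T5) encode an input and then decode an output.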

Explanation

Encoder-only (BERT): bidirectional, best for classification, extraction, and embeddings. Decoder-only (GPT, Claude, LLaMA): autoregressive, best for generation and chat. Encoder-decoder (T5, BART): the encoder processes the input and the decoder generates the output, which suits seq2seq tasks (translation, summarization). For RAG generation, decoder-only LLMs are standard; for embedding and retrieval, encoder-only models are the usual choice.

Follow-up: When would you choose one approach over the other?