InterviewSkill

NLP Interview Questions

Text processing, language representation, attention, and evaluation for NLP interviews.

9 questions
NLP

What is tokenization? (medium)

Type
conceptual
Topic
tokenization
Frequency
common
Tags
tokenization
Answer

Tokenization splits text into units such as words, subwords, or characters.

Explanation

Tokenization converts raw text into model-readable pieces. Modern models often use subword tokenization to handle rare words and multiple languages.
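As a rough illustration of the three granularities, the sketch below tokenizes at word, character, and subword level. The subword function uses a greedy longest-match strategy in the style of WordPiece; the function names and the tiny vocabulary are hypothetical and chosen only for this example, not taken from any real tokenizer.

```python
import re


def word_tokenize(text):
    # Split into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    # Every character (including spaces) becomes its own token.
    return list(text)


def greedy_subword_tokenize(word, vocab):
    # Greedy longest-match-first segmentation (WordPiece-style sketch).
    # Pieces that continue a word are marked with a "##" prefix.
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry matches: fall back to an unknown token.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"un", "##happi", "##ness"}
print(word_tokenize("Hello, world!"))
print(greedy_subword_tokenize("unhappiness", toy_vocab))
```

A rare word such as "unhappiness" is covered by three known pieces instead of needing its own vocabulary entry, which is the practical appeal of subword schemes.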

Follow-up: Why do LLMs use subword tokens?

What is TF-IDF? (medium)

Type
conceptual
Topic
tf-idf
Frequency
common
Tags
tf, idf
Answer

TF-IDF scores words by frequency in a document and rarity across documents.

Explanation

It highlights terms that are important to a document but not common everywhere, making it useful for search and classic text classification.
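A minimal sketch of one common TF-IDF variant follows: term frequency is the count divided by document length, and inverse document frequency is the unsmoothed log of (number of documents / documents containing the term). This is an assumption made for the example; libraries such as scikit-learn default to smoothed variants, so exact numbers will differ. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    # docs: list of token lists. Returns one {term: score} dict per document.
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus, for illustration only.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

In this toy corpus "the" appears in every document, so its score is zero, while a term unique to one document ("sat") outscores one shared by two documents ("cat").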

Follow-up: What are TF-IDF limitations?

What are word embeddings? (hard)

Type
conceptual
Topic
word-embeddings
Frequency
common
Tags
word, embeddings
Answer

Embeddings represent words or text as dense numeric vectors.

Explanation

Similar meanings tend to have nearby vectors, enabling semantic similarity, clustering, retrieval, and downstream model features.
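"Nearby" is usually measured with cosine similarity. The sketch below computes it over hand-made three-dimensional vectors; the vectors are invented for illustration and do not come from any trained model, where dimensions typically run into the hundreds.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, made up for this example.
toy_embeddings = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "banana": [0.10, 0.05, 0.95],
}

print(cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"]))
print(cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"]))
```

With these toy values, "king" and "queen" score close to 1 while "king" and "banana" score much lower, which is the property that retrieval and clustering rely on.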

Follow-up: How are contextual embeddings different?

What is attention? (medium)

Type
conceptual
Topic
attention
Frequency
common
Tags
attention
Answer

Attention lets a model weight relevant tokens when building representations.

Explanation

Attention helps models connect distant words and focus on important context, which is central to transformer architectures.
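The weighting can be sketched with scaled dot-product attention for a single query: score each key against the query, scale by the square root of the dimension, apply softmax, and take the weighted sum of the values. The function names and the toy vectors below are assumptions made for illustration.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    # Scaled dot-product attention for one query vector.
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(dim)
    ]
    return weights, output


# Toy inputs, for illustration only.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention(query, keys, values)
print(weights, output)
```

The key most aligned with the query receives the largest weight, so its value dominates the output regardless of how far apart the tokens sit in the sequence.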

Follow-up: What is self-attention?

How do you evaluate NLP models? (medium)

Type
conceptual
Topic
evaluate-nlp-models
Frequency
common
Tags
evaluate, nlp, models
Answer

Use task-specific metrics plus qualitative error analysis.

Explanation

Classification may use F1, generation may use human evaluation or task success, and retrieval may use precision, recall, or MRR.
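Two of the metrics named above can be sketched directly: F1 for binary classification and mean reciprocal rank (MRR) for retrieval. The function names, labels, and rankings are hypothetical; in practice a library implementation would normally be used.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


def mean_reciprocal_rank(rankings, relevant_sets):
    # For each query, take 1 / rank of the first relevant result (0 if none).
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for rank, item in enumerate(ranking, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Toy data, for illustration only.
print(precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y", "z"]], [{"b"}, {"x"}]))
```

Numbers like these summarize performance but do not show which inputs fail, which is why the answer pairs them with qualitative error analysis.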

Follow-up: Why can BLEU be insufficient?

What does bidirectionality add in a Bi-LSTM text classifier? (medium)

Type
conceptual
Topic
used-bi-lstm-in-sarcasm-detection-what-does-bidirectionali
Frequency
common
Tags
nlp, you, used, lstm, sarcasm, detection
Answer

Bidirectionality gives each timestep both past and future context; a unidirectional LSTM only has past context at each timestep.

Explanation

A unidirectional LSTM only has past context at each timestep. A Bi-LSTM runs two LSTMs, one forward and one backward, and concatenates their hidden states at each timestep. For sarcasm, the final word often recontextualizes earlier words ('Oh great, another Monday'). Bidirectionality lets the model use future context to reinterpret earlier tokens, which is the key reason cited for the 7%+ accuracy gain.
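The effect of the backward pass can be shown with a deliberately simplified sketch. The recurrence below is a scalar tanh cell standing in for an LSTM cell (it is not a real LSTM, and the weights are arbitrary assumptions); the point is only that the backward state at an early position depends on later inputs, while the forward state does not.

```python
import math


def run_recurrence(inputs, w_in=0.5, w_rec=0.8):
    # Toy scalar recurrence standing in for an LSTM cell (NOT a real LSTM).
    state = 0.0
    states = []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional_states(inputs):
    # Forward pass reads left to right; backward pass reads right to left
    # and is re-reversed so both lists are aligned by position.
    forward = run_recurrence(inputs)
    backward = run_recurrence(inputs[::-1])[::-1]
    # "Concatenate" the two states at each position.
    return [(f, b) for f, b in zip(forward, backward)]


# Two toy sequences that differ only in the final element.
a = bidirectional_states([1.0, 2.0, 3.0])
b = bidirectional_states([1.0, 2.0, -3.0])
print(a[0], b[0])
```

At position 0 the forward components are identical for both sequences, but the backward components differ, so only the bidirectional representation of the first token reflects the changed ending.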

Follow-up: Can you give a production example?

What is multi-head attention and why is it better than single-head? (medium)

Type
conceptual
Topic
is-multi-head-attention-and-why-is-it-better-than-single-h
Frequency
common
Tags
nlp, what, multi, head, attention, and
Answer

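Multi-head attention runs several attention functions in parallel, each with its own learned projections, whereas a single head computes one attention distribution: one way of relating tokens.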

Explanation

Single-head attention computes one attention distribution, that is, one way of relating tokens. Multi-head attention runs h parallel attention functions with different learned projections, then concatenates their outputs. Each head can attend to a different type of relationship (syntactic, semantic, positional). For sarcasm or contract parsing, some heads may focus on negation, others on named entities, and others on long-range dependencies.
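A minimal sketch of the mechanics follows, in plain Python lists. Randomly initialised matrices stand in for the learned projections (an assumption for illustration; a trained model would learn them), and the final output projection follows the standard Transformer formulation. All function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    # (n x m) times (m x p) using nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(q, k, v):
    # Scaled dot-product self-attention for every position in one head.
    d = len(q[0])
    out = []
    for q_i in q:
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k]
        weights = softmax(scores)
        out.append([sum(w * v_j[i] for w, v_j in zip(weights, v)) for i in range(len(v[0]))])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, num_heads, rng):
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Random matrices stand in for each head's learned projections.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        head_outputs.append(attention_head(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the heads along the feature dimension, position by position.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    # Output projection mixes information across heads.
    w_o = random_matrix(num_heads * d_head, d_model, rng)
    return matmul(concat, w_o)


# Three toy token vectors of dimension 4, split across 2 heads.
x = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -0.5, 0.5], [0.5, 0.5, 0.0, 1.0]]
out = multi_head_attention(x, num_heads=2, rng=random.Random(0))
print(len(out), len(out[0]))
```

Because each head projects into its own lower-dimensional subspace, the heads can produce different attention distributions over the same tokens at roughly the cost of one full-width head.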

Follow-up: Can you give a production example?

You fine-tuned BERT for sarcasm detection. What layers did you modify? (hard)

Type
conceptual
Topic
fine-tuned-bert-for-sarcasm-detection-what-layers-did-you
Frequency
common
Tags
nlp, you, fine, tuned, bert, for
Answer

Mainly the new classification head on top of the [CLS] token; BERT's pretrained weights were kept mostly frozen (or trained with a very small learning rate).

Explanation

Kept BERT's pretrained weights mostly frozen (or used a very small learning rate). Added a classification head on top of the [CLS] token: linear → dropout → linear → softmax over 2 classes. Also tested adding a Bi-LSTM layer on top of BERT's token outputs before classification. Fine-tuned with AdamW, a warmup scheduler, and early stopping on validation F1.
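The head described above (linear → dropout → linear → softmax) can be sketched as a forward pass in plain Python. Everything here is an assumption for illustration: the [CLS] vector is a made-up 3-dimensional stand-in (real BERT-base outputs are 768-dimensional), the weights are arbitrary, and a real implementation would use a deep learning framework with trainable parameters.

```python
import math
import random


def linear(x, weights, bias):
    # weights holds one row per output unit, each with len(x) entries.
    return [sum(w * x_i for w, x_i in zip(row, x)) + b for row, b in zip(weights, bias)]


def dropout(x, p, training, rng):
    # Inverted dropout: zero units with probability p and rescale survivors.
    # At inference time (training=False) it is the identity.
    if not training or p == 0.0:
        return list(x)
    return [0.0 if rng.random() < p else x_i / (1.0 - p) for x_i in x]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(cls_vector, params, training=False, rng=None):
    rng = rng or random.Random(0)
    hidden = linear(cls_vector, params["w1"], params["b1"])
    hidden = dropout(hidden, params["p_drop"], training, rng)
    logits = linear(hidden, params["w2"], params["b2"])
    return softmax(logits)  # probabilities over the 2 classes


# Hypothetical parameters and [CLS] vector, for illustration only.
toy_params = {
    "w1": [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]],
    "b1": [0.0, 0.1],
    "p_drop": 0.1,
    "w2": [[1.0, -1.0], [-0.5, 0.8]],
    "b2": [0.0, 0.0],
}
print(classification_head([0.2, -0.1, 0.4], toy_params))
```

Only these head parameters are new; whether the encoder underneath is frozen or updated with a small learning rate is a separate training decision.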

Follow-up: Can you give a production example?

How does Word2Vec differ from contextual embeddings like BERT? (medium)

Type
conceptual
Topic
does-word2vec-differ-from-contextual-embeddings-like-bert
Frequency
common
Tags
nlp, how, does, word2vec, differ, from
Answer

Word2Vec assigns one static embedding per word, so 'bank' always has the same vector; BERT produces contextual embeddings that change with the surrounding sentence.

Explanation

Word2Vec: one static embedding per word, so 'bank' always has the same vector. BERT: contextual embeddings, so 'bank' in 'river bank' vs 'bank account' gets different vectors based on the full sentence. For ranking over a domain corpus, Word2Vec was sufficient because the domain terms were less ambiguous. For a document extraction pipeline such as contract extraction, contextual embeddings are critical.
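The contrast can be sketched with a toy example. The static lookup mirrors how a Word2Vec table is used; the "contextual" function is a crude stand-in that mixes each token's vector with the sentence mean, which is not how BERT works internally but is enough to show a vector changing with its context. The vectors and function names are hypothetical.

```python
# Hypothetical 2-dimensional static vectors, made up for this example.
STATIC_VECTORS = {
    "river": [1.0, 0.0],
    "bank": [0.5, 0.5],
    "account": [0.0, 1.0],
}


def static_embed(tokens):
    # Static lookup: the same word always maps to the same vector.
    return [STATIC_VECTORS[token] for token in tokens]


def toy_contextual_embed(tokens):
    # Crude stand-in for a contextual encoder (NOT BERT): blend each
    # token's static vector with the mean vector of the whole sentence.
    vectors = static_embed(tokens)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[0.5 * v[i] + 0.5 * mean[i] for i in range(dim)] for v in vectors]


river_bank = ["river", "bank"]
bank_account = ["bank", "account"]

# Static: identical vector for "bank" in both phrases.
print(static_embed(river_bank)[1], static_embed(bank_account)[0])
# Toy contextual: the vector for "bank" shifts toward its neighbours.
print(toy_contextual_embed(river_bank)[1], toy_contextual_embed(bank_account)[0])
```

When the vocabulary is unambiguous the static lookup loses little, which matches the trade-off described above between the ranking and extraction use cases.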

Follow-up: When would you choose one approach over the other?