NLP Interview Prep

What is tokenization?

Difficulty: medium
Type: conceptual
Topic: tokenization
Frequency: common
Tags: tokenization

Answer

Tokenization splits text into units such as words, subwords, or characters.

Explanation

Tokenization converts raw text into model-readable pieces. Modern models often use subword tokenization to handle rare words and multiple languages.
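
The three granularities can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: the greedy longest-match subword routine is written in the style of WordPiece, and `TOY_VOCAB` is a hand-made vocabulary assumed purely for the example.

```python
import re

def word_tokenize(text):
    # Lowercase, then split into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def char_tokenize(text):
    # Character-level tokenization: every character is a token.
    return list(text)

def subword_tokenize(word, vocab):
    # Greedy longest-match-first split. Pieces that continue a word
    # carry a "##" prefix, as in WordPiece-style vocabularies.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No known piece covers this position: fall back to unknown.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"token", "##ization", "##s", "un", "##believ", "##able"}

print(word_tokenize("Tokenization matters!"))
print(subword_tokenize("tokenization", TOY_VOCAB))
print(subword_tokenize("unbelievable", TOY_VOCAB))
```

A rare word such as "unbelievable" is covered by three known pieces instead of collapsing to a single unknown token, which is the main practical benefit of subword vocabularies.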

Follow-up: Why do LLMs use subword tokens?

What is TF-IDF?

Difficulty: medium
Type: conceptual
Topic: tf-idf
Frequency: common
Tags: tf, idf

Answer

TF-IDF scores words by frequency in a document and rarity across documents.

Explanation

It highlights terms that are important to a document but not common everywhere, making it useful for search and classic text classification.
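
As a rough sketch, the score can be computed directly from token counts. The version below assumes the plain unsmoothed form, tf = count / document length and idf = log(N / document frequency); libraries commonly apply smoothed or normalized variants, so exact numbers will differ. The three sample documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists. Returns one {term: score} dict per document.
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Made-up sample corpus, for illustration only.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
scores = tf_idf(docs)
print(scores[0]["the"])    # "the" is in every document, so its idf is log(1) = 0
print(scores[2]["stock"])  # "stock" is in one document only, so it scores higher
```

The word "the" is frequent in every document but gets a score of zero, while "stock" is rewarded for being specific to one document.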

Follow-up: What are TF-IDF limitations?

What are word embeddings?

Difficulty: hard
Type: conceptual
Topic: word-embeddings
Frequency: common
Tags: word, embeddings

Answer

Embeddings represent words or text as dense numeric vectors.

Explanation

Similar meanings tend to have nearby vectors, enabling semantic similarity, clustering, retrieval, and downstream model features.
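
"Nearby" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors, which are assumptions for illustration only; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between u and v: 1.0 means same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any trained model.
toy_embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

def most_similar(word, embeddings):
    # Nearest neighbour of `word` by cosine similarity, excluding itself.
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))

print(most_similar("king", toy_embeddings))
```

The same nearest-neighbour lookup underlies semantic search and retrieval, only over far more vectors and usually with an approximate index.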

Follow-up: How are contextual embeddings different?

What is attention?

Difficulty: medium
Type: conceptual
Topic: attention
Frequency: common
Tags: attention

Answer

Attention lets a model weight relevant tokens when building representations.

Explanation

Attention helps models connect distant words and focus on important context, which is central to transformer architectures.
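
A minimal sketch of the weighting step, assuming the scaled dot-product form used in transformers: each query is scored against every key, the scores are turned into weights with softmax, and the output is the weighted average of the values. The inputs are toy numbers chosen for the example.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # queries: n x d, keys: m x d, values: m x d_v (lists of lists).
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy inputs: the query points in the same direction as the first key.
queries = [[3.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

outputs, weights = attention(queries, keys, values)
print(weights[0])  # most of the weight goes to the first key
print(outputs[0])  # so the output is pulled toward the first value
```

In self-attention the queries, keys, and values are all projections of the same token sequence, so every token can weight every other token regardless of distance.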

Follow-up: What is self-attention?

How do you evaluate NLP models?

Difficulty: medium
Type: conceptual
Topic: evaluate-nlp-models
Frequency: common
Tags: evaluate, nlp, models

Answer

Use task-specific metrics plus qualitative error analysis.

Explanation

Classification may use F1, generation may use human evaluation or task success, and retrieval may use precision, recall, or MRR.
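
Two of the metrics named above are short enough to write out. The sketch below computes binary F1 and mean reciprocal rank (MRR) from scratch; the labels and rankings are made-up sample data, and in practice a library implementation would normally be used instead.

```python
def f1_score(y_true, y_pred, positive=1):
    # Harmonic mean of precision and recall for the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mean_reciprocal_rank(rankings, relevant):
    # rankings: one ranked list of ids per query.
    # relevant: one set of relevant ids per query.
    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in rel:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(rankings)

# Made-up sample data, for illustration only.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
print(f1_score(y_true, y_pred))

rankings = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"x"}]
print(mean_reciprocal_rank(rankings, relevant))  # (1/2 + 1/1) / 2 = 0.75
```

Numbers like these summarize performance but hide which examples fail, which is why they are paired with qualitative error analysis.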

Follow-up: Why can BLEU be insufficient?

What does bidirectionality add in a Bi-LSTM text classifier?

Difficulty: medium
Type: conceptual
Topic: used-bi-lstm-in-sarcasm-detection-what-does-bidirectionali
Frequency: common
Tags: nlp, bi-lstm, sarcasm-detection

Answer

Bidirectionality gives every timestep access to future context as well as past context; a unidirectional LSTM only has past context at each timestep.

Explanation

A unidirectional LSTM only has past context at each timestep. A Bi-LSTM runs two LSTMs, one forward and one backward, and concatenates their hidden states at each timestep. In sarcasm, the final word often recontextualizes earlier words ('Oh great, another Monday'). Bidirectionality lets the model use future context to reinterpret earlier tokens, which was a key reason for the 7%+ accuracy gain.
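
The forward/backward mechanics can be shown without a deep learning framework. The sketch below is not an LSTM: it swaps in a toy scalar recurrence (h_t = decay * h_{t-1} + x_t) as a stand-in for the cell, purely to show how the two passes are run and concatenated. The input values and the `decay` parameter are assumptions for the example.

```python
def run_recurrence(inputs, decay=0.5):
    # Toy stand-in for an LSTM cell: h_t = decay * h_{t-1} + x_t.
    states, h = [], 0.0
    for x in inputs:
        h = decay * h + x
        states.append(h)
    return states

def bidirectional_states(inputs, decay=0.5):
    # Forward pass reads left to right; backward pass reads right to left
    # and is flipped back so that index t lines up with token t.
    forward = run_recurrence(inputs, decay)
    backward = run_recurrence(inputs[::-1], decay)[::-1]
    # Concatenate the two directions at every timestep.
    return [(f, b) for f, b in zip(forward, backward)]

# Two toy sequences that differ only in the LAST token.
a = bidirectional_states([1.0, 2.0, 3.0])
b = bidirectional_states([1.0, 2.0, -3.0])

print(a[0])  # (1.0, 2.75)
print(b[0])  # (1.0, 1.25)
```

The forward half of the first token's state is identical in both runs, because it has only seen the past. The backward half differs, which is how a change at the end of a sentence can alter the representation of its first word.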

Follow-up: Can you give a production example?

What is multi-head attention and why is it better than single-head?

Difficulty: medium
Type: conceptual
Topic: is-multi-head-attention-and-why-is-it-better-than-single-h
Frequency: common
Tags: nlp, multi-head, attention

Answer

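Multi-head attention runs several attention functions in parallel, each with its own learned projections, so the model can capture several kinds of token relationships at once; a single head computes only one attention distribution.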

Explanation

Single-head attention computes one attention distribution, which is one way of relating tokens. Multi-head attention runs h parallel attention functions with different learned projections, then concatenates their outputs. Each head can attend to different relationship types (syntactic, semantic, positional). For sarcasm or contract parsing, some heads may focus on negation, others on named entities, and others on long-range dependencies.
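
A minimal sketch of the project, attend, and concatenate steps, in plain Python. The projection matrices are random here only so the example runs on its own; in a trained model they are learned. The final output projection `w_o` follows the standard transformer formulation and is not mentioned in the text above, and all sizes are toy values.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # (n x m) times (m x p) -> (n x p), using lists of lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def attend(q, k, v):
    # Scaled dot-product attention for one head.
    d = len(k[0])
    out = []
    for qi in q:
        w = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def multi_head_attention(x, num_heads, rng):
    d_model = len(x[0])
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own projections (random here, learned in practice).
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        head_outputs.append(attend(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the heads along the feature dimension, token by token.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o)

rng = random.Random(0)
x = random_matrix(3, 4, rng)           # 3 tokens, d_model = 4
out = multi_head_attention(x, 2, rng)  # 2 heads of size 2
print(len(out), len(out[0]))           # 3 4
```

Because each head works in a smaller subspace of size d_model / h, the total cost stays close to that of one full-width head while allowing h different attention patterns.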

Follow-up: Can you give a production example?

You fine-tuned BERT for sarcasm detection. What layers did you modify?

Difficulty: hard
Type: conceptual
Topic: fine-tuned-bert-for-sarcasm-detection-what-layers-did-you
Frequency: common
Tags: nlp, fine-tuning, bert, sarcasm-detection

Answer

Kept BERT's pretrained weights mostly frozen (or used a very small LR) and added a classification head on top of the [CLS] token.

Explanation

Kept BERT's pretrained weights mostly frozen (or fine-tuned them with a very small learning rate). Added a classification head on top of the [CLS] token: linear → dropout → linear → softmax over 2 classes. Also tested adding a Bi-LSTM layer on top of BERT's token outputs before classification. Fine-tuned with AdamW, a warmup scheduler, and early stopping on validation F1.
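
The training-loop controls mentioned above, warmup and early stopping, can be sketched independently of any framework. The schedule below assumes linear warmup followed by linear decay, which is one common choice and not necessarily the one used here. The learning rate, step counts, patience, and validation F1 values are all hypothetical.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    # Ramp up linearly to base_lr, then decay linearly to zero.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

class EarlyStopping:
    # Stop once validation F1 has failed to improve for `patience` epochs.
    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, val_f1):
        if val_f1 > self.best:
            self.best = val_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical settings: a small LR, 100 warmup steps out of 1000.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, 2e-5, 100, 1000))

# Hypothetical validation F1 per epoch.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_f1 in enumerate([0.70, 0.74, 0.73, 0.72]):
    if stopper.should_stop(val_f1):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

Warmup keeps the first updates small so the randomly initialized head does not push large gradients into the pretrained weights, and early stopping keeps the checkpoint from the best validation epoch rather than the last one.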

Follow-up: Can you give a production example?

How does Word2Vec differ from contextual embeddings like BERT?

Difficulty: medium
Type: conceptual
Topic: does-word2vec-differ-from-contextual-embeddings-like-bert
Frequency: common
Tags: nlp, word2vec, contextual-embeddings, bert

Answer

Word2Vec assigns one static vector per word, so 'bank' always has the same vector; BERT produces a different vector for each occurrence, depending on the surrounding sentence.

Explanation

Word2Vec: one static embedding per word, so 'bank' always has the same vector. BERT: contextual embeddings, so 'bank' in 'river bank' vs 'bank account' gets different vectors based on the full sentence. For ranking over a domain corpus, Word2Vec was sufficient (the domain terms were less ambiguous). For a contract-extraction pipeline, contextual embeddings are critical.
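
The difference can be illustrated with a toy example. The static lookup below behaves like Word2Vec at inference time: one fixed vector per word. The "contextual" function is only a crude stand-in that mixes each word's vector with the sentence mean; real models such as BERT use stacked self-attention layers. All vectors are hand-made assumptions.

```python
# Hypothetical 2-d static vectors, for illustration only.
STATIC = {
    "river":   [1.0, 0.0],
    "bank":    [0.5, 0.5],
    "account": [0.0, 1.0],
}

def static_embed(tokens):
    # Static lookup: the vector depends on the word alone.
    return [STATIC[t] for t in tokens]

def toy_contextual_embed(tokens):
    # Crude stand-in for a contextual encoder: average each word's static
    # vector with the mean vector of the whole sentence.
    vectors = static_embed(tokens)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[(v[i] + mean[i]) / 2 for i in range(dim)] for v in vectors]

s1 = ["river", "bank"]
s2 = ["bank", "account"]

print(static_embed(s1)[1], static_embed(s2)[0])                  # identical
print(toy_contextual_embed(s1)[1], toy_contextual_embed(s2)[0])  # different
```

With the static lookup, 'bank' is indistinguishable across the two phrases. Once context is mixed in, its vector shifts toward 'river' in one and toward 'account' in the other.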

Follow-up: When would you choose one approach over the other?
