Data Engineer Interview Questions

A path for SQL, DBMS, Python, data modeling, pipelines, and reliability interviews.

Questions

SQLDBMSPythonData ModelingPipelinesReliability

How do you find weekly active users?medium

Answer

Group activity by week and count distinct users.

Explanation

A strong answer defines the activity event, handles timezone and deduplication, then uses a weekly date bucket with COUNT(DISTINCT user_id).

Follow-upHow would you calculate week-over-week growth?

When would you use a window function?medium

Answer

Use it when you need row-level output plus aggregate context.

Explanation

Window functions are useful for ranking, running totals, cohort calculations, lag comparisons, and moving averages without collapsing rows.

Follow-upHow is ROW_NUMBER different from RANK?

How do you debug a slow query?medium

Answer

Inspect the plan, filter early, index join keys, and reduce scanned data.

Explanation

Start with EXPLAIN, check joins, cardinality, missing indexes, large sorts, and whether predicates can use partitions or indexes.

Follow-upWhat makes an index ineffective?

What is normalization?medium

Answer

Normalization organizes tables to reduce redundancy and update anomalies.

Explanation

Explain entities, keys, relationships, and tradeoffs. In analytics, denormalization may be acceptable for performance and usability.

Follow-upWhen would you denormalize?

What is a transaction?medium

Answer

A transaction is a unit of work that should satisfy ACID properties.

Explanation

Good answers cover atomicity, consistency, isolation, durability, plus examples like money transfer or inventory update.

Follow-upWhat problem does isolation solve?

How do indexes speed reads?medium

Answer

Indexes let the database find rows without scanning the whole table.

Explanation

They improve lookup and join performance but add write overhead and storage cost, so they should match query patterns.

Follow-upWhat is a composite index?

How do you write maintainable Python?medium

Answer

Use simple functions, clear names, tests, typing where useful, and small modules.

Explanation

Interviewers look for readability, error handling, separation of concerns, and code that another engineer can change safely.

Follow-upWhen would you add type hints?

What are generators useful for?medium

Answer

They produce values lazily without building a full list.

Explanation

Generators are useful for streams, large files, and pipelines where memory efficiency matters.

Follow-upHow is yield different from return?

How do you handle errors in Python?medium

Answer

Catch specific exceptions and keep recovery close to the failure.

Explanation

Avoid broad except blocks unless re-raising or adding context. Good error handling makes failure modes explicit.

Follow-upWhen should you create a custom exception?

How do you model facts and dimensions?medium

Answer

Facts store measurable events; dimensions describe the entities around them.

Explanation

A clean model separates grain, keys, slowly changing attributes, and measures so analysts can query consistently.

Follow-upWhat is the grain of a table?

What is a star schema?medium

Answer

A fact table connected to dimension tables.

Explanation

Star schemas simplify analytics because measures are centralized and dimensions are easy to join for slicing.

Follow-upWhen would a snowflake schema help?

How do you handle slowly changing dimensions?medium

Answer

Choose a strategy based on whether history matters.

Explanation

Type 1 overwrites, Type 2 keeps history with effective dates, and hybrid approaches depend on reporting needs.

Follow-upWhat is the cost of Type 2?

How do you design a reliable data pipeline?medium

Answer

Make it idempotent, observable, tested, and recoverable.

Explanation

A production pipeline needs retries, checkpoints, data quality checks, lineage, alerting, and backfill support.

Follow-upWhat does idempotent mean here?

How do batch and streaming differ?medium

Answer

Batch processes bounded data; streaming processes events continuously.

Explanation

Batch is simpler for historical workloads, while streaming helps low-latency use cases but adds ordering and state complexity.

Follow-upWhen is micro-batch enough?

How do you handle late-arriving data?medium

Answer

Use event time, watermarks, and correction logic.

Explanation

Late events require clear windows, reprocessing rules, and downstream consumers that understand revised outputs.

Follow-upHow would you communicate corrected metrics?

How do you make systems reliable?medium

Answer

Design for failure with monitoring, retries, fallbacks, and clear ownership.

Explanation

Reliability is built through simple dependencies, tested recovery, good alerts, capacity planning, and post-incident learning.

Follow-upWhat makes an alert actionable?

What is idempotency?medium

Answer

Repeating the same operation produces the same final result.

Explanation

Idempotency matters for retries, payments, data jobs, and APIs because distributed systems often repeat work after failures.

Follow-upHow would you design an idempotency key?

How do you debug production issues?medium

Answer

Start with impact, recent changes, logs, metrics, and rollback options.

Explanation

Good debugging is systematic: narrow the blast radius, form hypotheses, verify with evidence, and communicate status.

Follow-upWhen should you rollback?

Back to Home