How do you find weekly active users? (medium)
Group activity by week and count distinct users.
A strong answer defines the activity event, handles timezone and deduplication, then uses a weekly date bucket with COUNT(DISTINCT user_id).
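A minimal Python sketch of the same idea, under stated assumptions: an "activity event" is any `(user_id, timestamp)` pair, weeks start on Monday, and all timestamps are normalized to UTC before bucketing. The function name `weekly_active_users` and the sample data are illustrative only. A set per week plays the role of `COUNT(DISTINCT user_id)`.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def weekly_active_users(events):
    """Map each week start (a Monday, in UTC) to its number of distinct users."""
    users_by_week = defaultdict(set)
    for user_id, ts in events:
        # Normalize to one time zone so a single event cannot land in two weeks.
        day = ts.astimezone(timezone.utc).date()
        week_start = day - timedelta(days=day.weekday())  # weekday() is 0 on Monday
        # The set deduplicates repeat activity by the same user within a week.
        users_by_week[week_start].add(user_id)
    return {week: len(users) for week, users in sorted(users_by_week.items())}


# Hypothetical sample events.
events = [
    ("u1", datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)),   # Monday
    ("u1", datetime(2024, 1, 3, 9, 0, tzinfo=timezone.utc)),   # same week, same user
    ("u2", datetime(2024, 1, 7, 23, 0, tzinfo=timezone.utc)),  # Sunday, same week
    ("u2", datetime(2024, 1, 8, 0, 30, tzinfo=timezone.utc)),  # following Monday
]
wau = weekly_active_users(events)
```

Here the week of 2024-01-01 has two distinct users and the week of 2024-01-08 has one, even though there are four events.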
A path for SQL, DBMS, Python, data modeling, pipelines, and reliability interviews.
Use a window function when you need row-level output plus aggregate context.
Window functions are useful for ranking, running totals, cohort calculations, lag comparisons, and moving averages without collapsing rows.
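As a rough illustration, the sketch below runs a running total and a lag comparison through Python's built-in `sqlite3` module. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions); the `orders` table and its columns are made up for the example.

```python
import sqlite3

# Hypothetical orders table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id TEXT, order_day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("u1", "2024-01-01", 10),
        ("u1", "2024-01-02", 30),
        ("u2", "2024-01-01", 5),
        ("u1", "2024-01-03", 20),
    ],
)

# Each input row survives; the window adds per-user context next to it.
rows = conn.execute(
    """
    SELECT user_id,
           order_day,
           amount,
           SUM(amount) OVER (PARTITION BY user_id ORDER BY order_day) AS running_total,
           LAG(amount) OVER (PARTITION BY user_id ORDER BY order_day) AS previous_amount
    FROM orders
    ORDER BY user_id, order_day
    """
).fetchall()
```

A `GROUP BY user_id` would return one row per user; the windowed query returns all four rows, each carrying its running total and the previous order amount (`None` for a user's first order).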
Inspect the plan, filter early, index join keys, and reduce scanned data.
Start with EXPLAIN, then check joins, cardinality, missing indexes, large sorts, and whether predicates can use partitions or indexes.
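One way to practice this loop locally is SQLite's `EXPLAIN QUERY PLAN`, shown here through Python's `sqlite3` module. This is only a sketch: the `orders` table, the index name, and the helper `plan` are invented for the example, and the exact plan wording differs between SQLite versions and between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (user_id, amount) VALUES (?, ?)",
    [(i % 100, i) for i in range(1000)],
)


def plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


query = "SELECT amount FROM orders WHERE user_id = ?"

# Without an index on user_id, the plan reports a scan of the whole table.
plan_before = plan(conn, query, (42,))

# After indexing the filter column, the plan reports an index search instead.
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = plan(conn, query, (42,))
```

Comparing `plan_before` and `plan_after` shows the change from a scan to a search that names `idx_orders_user_id`, which is the kind of evidence an interviewer usually wants before and after a tuning step.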
Normalization organizes tables to reduce redundancy and update anomalies.
Explain entities, keys, relationships, and tradeoffs. In analytics, denormalization may be acceptable for performance and usability.
A transaction is a unit of work that should satisfy ACID properties.
Good answers cover atomicity, consistency, isolation, durability, plus examples like money transfer or inventory update.
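The money transfer example can be sketched with Python's `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are assumptions made for illustration. The credit is deliberately applied before the debit so that a failed debit shows the rollback undoing the earlier step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money atomically: both updates are committed, or neither is."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back too.
        return False


first = transfer(conn, "alice", "bob", 30)    # succeeds
second = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After both calls the balances are 70 for alice and 80 for bob: the failed transfer left no partial credit behind, which is atomicity, and the `CHECK` constraint kept the data valid, which is consistency.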
Indexes let the database find rows without scanning the whole table.
They improve lookup and join performance but add write overhead and storage cost, so they should match query patterns.
Use simple functions, clear names, tests, typing where useful, and small modules.
Interviewers look for readability, error handling, separation of concerns, and code that another engineer can change safely.
Generators produce values lazily without building a full list.
Generators are useful for streams, large files, and pipelines where memory efficiency matters.
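A small sketch of a generator pipeline, assuming the input is a stream of text lines holding one integer each (the function names are illustrative). Each stage pulls one item at a time, so memory use stays flat regardless of input size.

```python
from typing import Iterable, Iterator


def parse_amounts(lines: Iterable[str]) -> Iterator[int]:
    """Yield one integer per non-blank line, without loading all lines at once."""
    for line in lines:
        line = line.strip()
        if line:
            yield int(line)


def running_total(values: Iterable[int]) -> Iterator[int]:
    """Yield the cumulative sum after each value."""
    total = 0
    for value in values:
        total += value
        yield total


# An iterator stands in for a large file object here.
lines = iter(["10\n", "\n", "20\n", "5\n"])
totals = list(running_total(parse_amounts(lines)))  # [10, 30, 35]
```

In real use, `lines` could be an open file handle, and the final `list(...)` would typically be replaced by a loop that writes each result out as it arrives.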
Catch specific exceptions and keep recovery close to the failure.
Avoid broad except blocks unless re-raising or adding context. Good error handling makes failure modes explicit.
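A brief sketch of that advice, assuming a hypothetical `load_port` function that reads a port number from a JSON string. `ConfigError` is an invented exception type; the point is that each `except` names the failures it expects and adds context with `raise ... from`.

```python
import json


class ConfigError(Exception):
    """Raised when the configuration cannot be used (hypothetical type)."""


def load_port(raw: str) -> int:
    # Catch only the parsing failure, right where it can happen.
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ConfigError("config is not valid JSON") from exc

    # Separate, specific handlers make each failure mode explicit.
    try:
        return int(config["port"])
    except KeyError as exc:
        raise ConfigError("config is missing the 'port' key") from exc
    except (TypeError, ValueError) as exc:
        raise ConfigError("'port' must be an integer") from exc


port = load_port('{"port": "8080"}')  # 8080
```

Because the original exception is chained as `__cause__`, a caller still sees the low-level error in the traceback while handling a single, well-named exception type.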
Facts store measurable events; dimensions describe the entities around them.
A clean model separates grain, keys, slowly changing attributes, and measures so analysts can query consistently.
A star schema is a fact table connected to dimension tables.
Star schemas simplify analytics because measures are centralized and dimensions are easy to join for slicing.
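A tiny star schema can be sketched with Python's `sqlite3` module. The table names `fact_sales` and `dim_product`, their columns, and the rows are all hypothetical; a real model would carry more dimensions (date, customer, store) around the same fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension: one row per product, with descriptive attributes.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    -- Fact: one row per sale (the grain), with a measure and a dimension key.
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product (product_id),
        amount     INTEGER NOT NULL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games'), (3, 'books');
    INSERT INTO fact_sales VALUES (1, 1, 20), (2, 2, 50), (3, 3, 15), (4, 1, 5);
    """
)

# Slice the centralized measure by a dimension attribute with a single join.
revenue_by_category = conn.execute(
    """
    SELECT d.category, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
```

The query returns 40 for books and 50 for games. Slicing by another attribute means joining one more dimension table to the same fact table, with no change to how the measure is stored.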
Choose a slowly changing dimension strategy based on whether history matters.
Type 1 overwrites, Type 2 keeps history with effective dates, and hybrid approaches depend on reporting needs.
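A minimal in-memory sketch of a Type 2 change, assuming each dimension row carries `valid_from` and `valid_to` fields and that `valid_to = None` marks the current version. The function name and row layout are illustrative, not a standard API.

```python
from datetime import date


def apply_type2_change(history, customer_id, new_city, change_date):
    """Close the current row and append a new version, keeping full history."""
    current = next(
        (
            row
            for row in history
            if row["customer_id"] == customer_id and row["valid_to"] is None
        ),
        None,
    )
    if current is not None:
        if current["city"] == new_city:
            return history  # nothing changed, so no new version is needed
        current["valid_to"] = change_date  # end-date the old version
    history.append(
        {
            "customer_id": customer_id,
            "city": new_city,
            "valid_from": change_date,
            "valid_to": None,  # None marks the current version
        }
    )
    return history


history = [
    {"customer_id": 1, "city": "Oslo", "valid_from": date(2023, 1, 1), "valid_to": None}
]
apply_type2_change(history, 1, "Bergen", date(2024, 6, 1))
apply_type2_change(history, 1, "Bergen", date(2024, 7, 1))  # same value: no new row
```

A Type 1 change would overwrite `city` on the existing row instead, which is simpler but loses the fact that the customer was in Oslo before June 2024.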
Make the pipeline idempotent, observable, tested, and recoverable.
A production pipeline needs retries, checkpoints, data quality checks, lineage, alerting, and backfill support.
Batch processes bounded data; streaming processes events continuously.
Batch is simpler for historical workloads, while streaming helps low-latency use cases but adds ordering and state complexity.
Handle late events with event time, watermarks, and correction logic.
Late events require clear windows, reprocessing rules, and downstream consumers that understand revised outputs.
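The sketch below shows one simple way these pieces can fit together; it is a toy model, not how any particular streaming engine works. It assumes event times are integer seconds, uses 60-second tumbling windows, and defines the watermark as the largest event time seen minus an allowed lateness of 30 seconds. Events for a window the watermark has already passed are routed to a corrections list for reprocessing.

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # assumed tumbling window size
ALLOWED_LATENESS = 30  # assumed tolerance for out-of-order events


def assign_windows(event_times):
    """Count events per window; send events for closed windows to corrections."""
    counts = defaultdict(int)
    corrections = []
    max_seen = 0
    for t in event_times:  # events are processed in arrival order
        max_seen = max(max_seen, t)
        watermark = max_seen - ALLOWED_LATENESS
        window_start = t - t % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= watermark:
            # The window already closed: record a correction instead of
            # silently changing a result that consumers may have read.
            corrections.append((window_start, t))
        else:
            counts[window_start] += 1
    return dict(counts), corrections


# Event times in arrival order; 55 is late but tolerated, 20 is too late.
counts, corrections = assign_windows([10, 50, 65, 55, 100, 20])
```

Here the event at time 55 still counts toward window 0 because the watermark had only reached 35, while the event at time 20 arrives after the watermark reached 70 and becomes a correction for window 0.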
Design for failure with monitoring, retries, fallbacks, and clear ownership.
Reliability is built through simple dependencies, tested recovery, good alerts, capacity planning, and post-incident learning.
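Retries and fallbacks can be sketched in a few lines. The helper below is hypothetical: it assumes transient failures surface as `ConnectionError`, retries with exponential backoff, and returns a fallback value once attempts run out. The `sleep` parameter exists so recovery can be tested without real delays.

```python
import time


def call_with_retries(operation, fallback, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Try an operation a few times with backoff, then use the fallback."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # back off before the next try
    return fallback()


calls = {"count": 0}


def flaky():
    """Fail twice, then succeed (simulates a dependency that recovers)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "fresh"


def always_down():
    raise ConnectionError("dependency unavailable")


recovered = call_with_retries(flaky, fallback=lambda: "cached", sleep=lambda s: None)
degraded = call_with_retries(always_down, fallback=lambda: "cached", sleep=lambda s: None)
```

`recovered` is `"fresh"` after three calls, and `degraded` is `"cached"`. In production this pattern is only safe when the retried operation is idempotent and when fallbacks and exhausted retries are visible in monitoring.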
Idempotency means that repeating the same operation produces the same final result.
Idempotency matters for retries, payments, data jobs, and APIs because distributed systems often repeat work after failures.
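One common way to get this property is an idempotency key, sketched below with an in-memory dictionary standing in for a payments table. The function name, the key format, and the ledger structure are assumptions for illustration.

```python
def apply_payment(ledger, idempotency_key, amount):
    """Record a payment once per key; replays leave the ledger unchanged."""
    if idempotency_key not in ledger:
        ledger[idempotency_key] = amount
    return sum(ledger.values())


ledger = {}
first = apply_payment(ledger, "pay-001", 25)
retry = apply_payment(ledger, "pay-001", 25)  # e.g. the client timed out and retried
```

Both calls return a total of 25 and the ledger holds a single entry. A version that appended to a list on every call would charge the customer twice on retry.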
Start with impact, recent changes, logs, metrics, and rollback options.
Good debugging is systematic: narrow the blast radius, form hypotheses, verify with evidence, and communicate status.