What is an AI agent? (medium)
An AI agent uses a model to decide actions toward a goal.
Agents often combine LLM reasoning with tools, memory, planning, and feedback loops to complete tasks beyond a single response.
InterviewSkill
Tool use, planning, workflows, memory, and guardrails for agentic AI systems.
Tool calling lets a model request external functions or APIs.
Tools can retrieve data, run calculations, search, write files, or trigger workflows. The system decides which tool calls are allowed.
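A minimal sketch of the dispatch step, assuming the model's request arrives as a dict with `name` and `arguments`. The tool names, the `TOOLS` registry, and the `ALLOWED_TOOLS` set are hypothetical and only illustrate that the system, not the model, decides what runs.

```python
from typing import Any, Callable

# Hypothetical tool registry: names and functions are illustrative only.
TOOLS: dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
    "search": lambda query: f"results for {query!r}",
    "delete_file": lambda path: f"deleted {path}",
}

# The system, not the model, decides which tools this run may use.
ALLOWED_TOOLS = {"add", "search"}


def dispatch(tool_call: dict[str, Any]) -> dict[str, Any]:
    """Execute a model-requested tool call only if it is permitted."""
    name = tool_call.get("name")
    args = tool_call.get("arguments", {})
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    if name not in ALLOWED_TOOLS:
        return {"ok": False, "error": f"tool not allowed: {name}"}
    try:
        return {"ok": True, "result": TOOLS[name](**args)}
    except TypeError as exc:  # wrong or missing arguments
        return {"ok": False, "error": f"bad arguments: {exc}"}
```

Returning an error dict instead of raising lets the caller feed the failure back to the model as an observation.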
Memory stores useful state or history for future decisions.
Short-term memory may live in context, while long-term memory may use databases or vector stores. Memory must be curated to avoid noise.
Guardrails constrain behavior to keep agents safe and reliable.
They include tool permissions, input validation, output checks, human approval, policy filters, and execution limits.
Measure task success, safety, cost, latency, tool accuracy, and recovery from errors.
Agent evaluation often needs multi-step test cases because failures can happen in planning, tool selection, execution, or final response.
A planning loop is a cycle where the agent decides a next step, acts, observes the result, and updates its plan.
Planning loops let agents solve multi-step tasks, but they need limits, state tracking, tool validation, and stopping conditions to avoid wasted work or unsafe actions.
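The cycle can be sketched in a few lines. This is an illustration under stated assumptions: `decide` and `act` are hypothetical callables standing in for the model and the tool layer, and the string `"done"` is an assumed stop signal.

```python
from typing import Callable


def run_agent(decide: Callable[[list[str]], str],
              act: Callable[[str], str],
              max_steps: int = 5) -> list[str]:
    """Decide, act, observe, and repeat until 'done' or the step limit."""
    history: list[str] = []        # tracked state: observations so far
    for _ in range(max_steps):     # hard limit prevents runaway loops
        step = decide(history)     # decide the next step from current state
        if step == "done":         # explicit stopping condition
            break
        history.append(act(step))  # act, then record the observation
    return history


# Scripted stand-in for a model: follows a fixed plan, then stops.
PLAN = ["fetch", "summarize"]


def scripted_decide(history: list[str]) -> str:
    return PLAN[len(history)] if len(history) < len(PLAN) else "done"


def scripted_act(step: str) -> str:
    return f"{step}: ok"


result = run_agent(scripted_decide, scripted_act)
```

With a `decide` function that never returns `"done"`, the loop still ends after `max_steps` iterations.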
An agent should ask for human approval before high-impact, irreversible, expensive, or low-confidence actions.
Human approval is useful for payments, deleting data, sending external messages, changing permissions, production deployments, or actions with legal or safety implications.
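One way to express such a policy is a small gate in front of execution. The action names, the 0.8 confidence threshold, and the `approve` callback below are assumptions for illustration; a real system would route the approval request to a review UI or queue.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical policy values; real thresholds depend on the deployment.
HIGH_IMPACT_ACTIONS = {"send_payment", "delete_data", "deploy_production"}
MIN_CONFIDENCE = 0.8


@dataclass
class ProposedAction:
    name: str
    confidence: float  # the agent's own estimate, between 0 and 1


def needs_approval(action: ProposedAction) -> bool:
    """Require a human for high-impact or low-confidence actions."""
    return action.name in HIGH_IMPACT_ACTIONS or action.confidence < MIN_CONFIDENCE


def execute(action: ProposedAction,
            approve: Callable[[ProposedAction], bool]) -> str:
    """Run the action, asking the approver first when the policy says so."""
    if needs_approval(action) and not approve(action):
        return "rejected"
    return "executed"
```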
Give the agent the minimum tools and scopes needed for the task.
Use allowlists, scoped credentials, argument validation, audit logs, dry-run modes, and separate read-only from write-capable tools to reduce blast radius.
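A minimal sketch of scoping, assuming each agent gets its own `ToolScope`. The class names, the `writes` flag, and the log format are hypothetical; the point is that read-only and write-capable tools are separated, write calls default to dry-run, and every attempt is logged.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tool:
    name: str
    writes: bool  # True if the tool changes external state


@dataclass
class ToolScope:
    """The tools one agent may call, plus an audit trail of attempts."""
    allowed: frozenset[str]
    dry_run: bool = True
    audit_log: list[str] = field(default_factory=list)

    def call(self, tool: Tool, **args: str) -> str:
        if tool.name not in self.allowed:
            self.audit_log.append(f"DENIED {tool.name}")
            raise PermissionError(f"{tool.name} is outside this agent's scope")
        if tool.writes and self.dry_run:
            self.audit_log.append(f"DRY-RUN {tool.name} {args}")
            return "dry-run: no changes made"
        self.audit_log.append(f"CALLED {tool.name} {args}")
        return "ok"
```

A read-only agent would be constructed with only read tools in `allowed`, so a write attempt fails before any credentials are touched.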
Tool use: the model dynamically decides to call an external function mid-reasoning, gets the result, and continues. RAG: retrieval is a preprocessing step (fetch relevant docs, inject them into context, then generate). Tool use is dynamic and chainable; basic RAG is a single retrieval step. In a fund document processing system, agents use tool calls to reach financial data APIs, with Step Functions handling orchestration.
Structured output forces the model to return valid JSON matching a schema. Approaches: (1) Prompt instruction + few-shot. (2) JSON mode in the API (guarantees valid JSON, not schema conformance). (3) Tool use / function calling: the model must produce arguments matching the tool's JSON schema. (4) Pydantic parsing with a retry loop: if parsing fails, send the error back with a correction instruction. Layer these approaches in production.
Instrument with OpenTelemetry (as in an agent platform): trace spans per agent call, LLM invocation, tool call. Metrics: end-to-end latency, per-step latency, token count (input/output), cost per run. CloudWatch for Step Functions execution time. Set token budget per agent, log overruns. Use batching and Bedrock prompt caching to reduce cost on repeated document patterns.
OpenTelemetry (OTel) is an open-source observability framework for distributed tracing, metrics, and logs. In an agent platform: instrument each agent run as a trace with spans for LLM calls, tool executions, and memory operations. Capture attributes: model, token count, latency, tool name, success/failure. Export to a backend (Jaeger, Grafana Tempo) for full visibility into agent execution.
Unit tests per agent with mocked tool responses and deterministic LLM outputs. Integration tests: full pipeline on golden test cases, compare final output to expected. Per-agent metrics: task completion rate, tool call accuracy, hallucination rate. System-level: end-to-end latency, cost per run, human escalation rate. Use OTel traces to replay failed runs for debugging. Gate deploys on regression test pass rates.
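A unit test of this kind might look like the sketch below, using `unittest.mock` from the standard library. The agent function `answer_price_question` and its arguments are hypothetical; the mocked tool returns a fixed value and the "LLM" is a deterministic function, so the test never touches a real model.

```python
from unittest.mock import Mock


def answer_price_question(llm, price_tool, ticker: str) -> str:
    """Hypothetical agent under test: look up a price, then ask the LLM to phrase it."""
    price = price_tool(ticker)
    return llm(f"State the price of {ticker}: {price}")


def test_agent_uses_tool_result() -> None:
    price_tool = Mock(return_value=101.5)                  # mocked tool response
    llm = Mock(side_effect=lambda prompt: prompt.upper())  # deterministic "LLM"

    out = answer_price_question(llm, price_tool, "abc")

    price_tool.assert_called_once_with("abc")  # the right tool call was made
    assert "101.5" in out                      # the output uses the tool result


test_agent_uses_tool_result()
```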
ReAct (Reasoning + Acting) interleaves thought and action: model reasons (Thought), calls a tool (Action), observes the result (Observation), repeats until done. In Strands: the agent loop manages this cycle — model generates a response, if it includes a tool call, Strands executes it and feeds the result back, until the model outputs a final response with no tool call.
Orchestrator agent delegates to: parsing agent (extract raw text from PDF), classification agent (identify document type/section), metadata enrichment agent (extract structured financial fields), and validation agent (schema compliance). Step Functions orchestrates the state machine — each agent is a Lambda function. EventBridge triggers pipeline on S3 uploads. Results written to DynamoDB.
HITL pauses the agent at a decision point for human review before proceeding — used for high-stakes actions. In an enterprise AI platform: Step Functions Wait for Callback pattern — agent sends a task token to a review queue (SQS/SNS), human approves via UI, UI calls SendTaskSuccess/Failure with token, agent resumes. In a resume screening system: low-confidence candidates route to HITL before final scoring.
In-context: current session scratch pad. Episodic: records of past interactions. Semantic: long-term factual knowledge. Procedural: learned skills/tools. An agent platform focuses on explicit memory management: in-context (current run state), episodic (stored in a DB, retrieved on demand), and tool memory (registered tools with descriptions). Unlike LangChain's implicit memory, everything is explicit and inspectable.
Tool call: agent invokes a deterministic function (API, DB query, calculator) — takes inputs, returns outputs, no reasoning. Subagent call: agent delegates to another agent with its own LLM, system prompt, memory, and tools. Subagent can reason and make multi-step decisions. Use tools for simple deterministic actions; subagents for complex stateful subtasks that require reasoning.
Step Functions has built-in retry/catch: configure attempts, backoff rate, and interval per state. Catch specific exceptions (LLM timeout, schema failure), route to error handler. Retry transient failures (API rate limits) with exponential backoff. For logical failures: route to HITL or fallback agent. Dead-letter queue for unrecoverable failures. All transitions logged in CloudWatch.
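As an illustration, a single Task state with retry and catch rules can be written in Amazon States Language and built here as a Python dict. `Retry`, `Catch`, `ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`, `States.Timeout`, and `States.ALL` are ASL names; the custom error names, state names, and the Lambda ARN are hypothetical placeholders.

```python
import json

# One Task state in Amazon States Language, built as a Python dict.
# "LLMTimeout", "SchemaValidationError", the state names, and the ARN
# are hypothetical placeholders.
extract_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:extract-agent",
    "Retry": [
        {
            "ErrorEquals": ["LLMTimeout", "States.Timeout"],
            "IntervalSeconds": 2,  # wait before the first retry
            "MaxAttempts": 3,
            "BackoffRate": 2.0,    # each wait is double the previous one
        }
    ],
    "Catch": [
        {"ErrorEquals": ["SchemaValidationError"], "Next": "HumanReview"},
        {"ErrorEquals": ["States.ALL"], "Next": "SendToDeadLetterQueue"},
    ],
    "Next": "Validate",
}


def retry_delays(retry: dict) -> list[float]:
    """Seconds waited before each retry attempt under exponential backoff."""
    return [retry["IntervalSeconds"] * retry["BackoffRate"] ** i
            for i in range(retry["MaxAttempts"])]


print(json.dumps(extract_state, indent=2))
```

The catch-all `States.ALL` rule is placed last so the more specific schema error is routed to human review first.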
Event-driven: triggered by an external event (S3 upload → EventBridge → Step Functions) — fully automated. Used in a fund document processing system (new filing → auto-process). Ad-hoc: triggered on demand by a user or API call (user submits a contract in a document extraction pipeline). Same agent logic, different trigger mechanisms routed through API Gateway or EventBridge rules.
(1) Max iterations / max tool calls limit per run as a hard stop. (2) Step budget: track tokens and calls remaining, and instruct the model to wrap up when low. (3) Loop detection: if the same tool is called with the same args twice, break. (4) Step Functions execution timeout at the state machine level. (5) Tool call validator: reject calls not matching the expected schema. In an agent platform, RunConfig exposes max_steps and max_tokens as explicit constraints.
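The hard stop and loop detection can be sketched as a small guard that is consulted before every tool call. `RunGuard` is a hypothetical helper written for this example, not the RunConfig API mentioned above.

```python
import json


class RunGuard:
    """Stops a run on too many steps or an exact repeated tool call."""

    def __init__(self, max_steps: int = 10) -> None:
        self.max_steps = max_steps
        self.steps = 0
        self.seen: set[tuple[str, str]] = set()

    def check(self, tool: str, args: dict) -> None:
        """Raise RuntimeError if this call should not be executed."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError("step limit reached")
        # Serialize with sorted keys so argument order does not matter.
        key = (tool, json.dumps(args, sort_keys=True))
        if key in self.seen:
            raise RuntimeError(f"repeated call to {tool} with identical args")
        self.seen.add(key)
```

Breaking on the first exact repeat is strict; a looser variant could allow a small number of repeats for tools whose results change over time.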
Define expected output structure as a Pydantic model. After each LLM call, parse the response — Pydantic validates types, required fields, and constraints automatically. On ValidationError: catch it, format it clearly, send back to the model with a correction instruction (self-healing loop). In a document extraction pipeline: caught 100% of structural errors before they hit downstream systems, eliminating silent data corruption.
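The parse, validate, and retry loop can be sketched with the standard library alone. The hand-written `validate` function below is a stand-in for a Pydantic model (catching `ValueError` where Pydantic would raise `ValidationError`); the field names and the `llm` callable are assumptions for illustration.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"fund_name": str, "nav": float}


def validate(raw: str) -> dict:
    """Parse JSON and check required fields and types; raise ValueError if bad."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"{name} must be {expected.__name__}")
    return data


def extract(llm: Callable[[str], str], prompt: str, max_retries: int = 2) -> dict:
    """Call the model, validate, and feed errors back until valid or out of retries."""
    message = prompt
    for _ in range(max_retries + 1):
        raw = llm(message)
        try:
            return validate(raw)
        except ValueError as err:
            # Self-healing step: tell the model exactly what was wrong.
            message = (f"{prompt}\nYour last reply was rejected: {err}. "
                       "Return corrected JSON only.")
    raise RuntimeError("model did not produce valid output")
```

Capping retries matters: without a limit, a model that keeps returning the same malformed output would loop indefinitely.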
Classic LangChain chains are linear (A → B → C), which makes loops and stateful branching awkward to express. LangGraph models workflows as a directed graph with explicit state: nodes are functions/agents, edges define transitions (conditional or fixed). It supports cycles (loop back to previous steps), branching (route based on state), and state persistence for long-running tasks. Used in an enterprise AI platform for complex multi-step workflows with HITL branching.
(1) Pass state explicitly — orchestrator collects outputs and injects relevant parts into the next agent's prompt. (2) Shared store — agents read/write to a central state object (LangGraph StateGraph, DynamoDB, or in-memory dict). (3) Message bus — agents publish events, others subscribe. In a fund document processing system: Step Functions passes execution state between Lambda agents; DynamoDB stores intermediate results.
EventBridge is a serverless event bus. S3 file uploads emit events → EventBridge rule matches on object type → triggers Step Functions or Lambda. Decouples producers (data sources) from consumers (agents). Supports event filtering, scheduling (nightly batch jobs), and cross-account routing. Adding a new agent doesn't require changing the data source — just add a new EventBridge rule.
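For illustration, an event pattern for S3 "Object Created" events might look like the dict below (S3 delivers these only when EventBridge notifications are enabled on the bucket). The bucket name and the `.pdf` suffix are hypothetical, and `matches` is a simplified local approximation that handles exact values and `suffix` only, not the full EventBridge matching rules.

```python
# Event pattern for S3 "Object Created" events delivered through EventBridge.
# The bucket name and key suffix are hypothetical placeholders.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["fund-filings-inbox"]},
        "object": {"key": [{"suffix": ".pdf"}]},
    },
}


def _value_matches(actual, candidate) -> bool:
    """Match one candidate: a literal value or a {"suffix": ...} filter."""
    if isinstance(candidate, dict):  # only the suffix filter is handled here
        return isinstance(actual, str) and actual.endswith(candidate["suffix"])
    return actual == candidate


def matches(pattern: dict, event: dict) -> bool:
    """Simplified local check: every pattern field must match the event."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):  # nested pattern: recurse
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not any(_value_matches(actual, c) for c in expected):
            return False
    return True
```

A local matcher like this is only useful for unit-testing patterns; in production the rule is evaluated by EventBridge itself.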
Every LLM call logs: input (system prompt + messages), output, model ID, timestamp, token count, latency, run ID. Stored immutably in S3 with object lock. Structured as JSON for queryability via Athena. Agent-level: log each tool call (name, args, result) and reasoning steps. Correlation ID traces a request across all agents. Also log which human approved any HITL decision.
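A sketch of one such record, written as a single JSON line so that a query engine can read it. The field names and the `audit_record` helper are assumptions for illustration, not a fixed schema; writing to S3 with object lock is left out.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Optional


def audit_record(correlation_id: str, model_id: str, messages: list[dict],
                 output: str, input_tokens: int, output_tokens: int,
                 latency_ms: float, approved_by: Optional[str] = None) -> str:
    """Build one JSON line describing a single LLM call."""
    record = {
        "record_id": str(uuid.uuid4()),
        "correlation_id": correlation_id,  # ties together all agents in a request
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "input": messages,                 # system prompt + messages
        "output": output,
        "tokens": {"input": input_tokens, "output": output_tokens},
        "latency_ms": latency_ms,
        "approved_by": approved_by,        # set only for HITL decisions
    }
    return json.dumps(record, sort_keys=True)
```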
MCP (Model Context Protocol) is an open protocol introduced by Anthropic that standardizes how LLM applications connect to external tools and data sources. It defines a client-server model: the client in the LLM application discovers and calls tools exposed by an MCP server with consistent schemas. An agent platform uses MCP for external messaging and tool integrations: agents connect to MCP servers (Telegram, Slack, APIs) without custom integration code per tool.
Job templates define required skills, experience levels, and custom screening questions with configurable weights (e.g., Python: 30%, LLM experience: 40%, communication: 30%). The agentic pipeline extracts structured candidate data, scores each dimension using an LLM evaluator against the rubric, computes weighted total. Configurable thresholds route candidates to auto-pass, HITL review, or auto-reject. Integrates with Workday for status updates.
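The weighted total and routing step can be worked through in a few lines. The weights come from the example above; the two thresholds and the 0-to-1 score scale are hypothetical values chosen for illustration.

```python
# Weights from the example rubric; they sum to 1.0.
WEIGHTS = {"python": 0.30, "llm_experience": 0.40, "communication": 0.30}

# Hypothetical routing thresholds on a 0-to-1 scale.
PASS_THRESHOLD = 0.75
REJECT_THRESHOLD = 0.40


def weighted_total(scores: dict[str, float]) -> float:
    """Sum of weight * score over every rubric dimension."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


def route(scores: dict[str, float]) -> str:
    """Send a candidate to auto-pass, auto-reject, or human review."""
    total = weighted_total(scores)
    if total >= PASS_THRESHOLD:
        return "auto-pass"
    if total < REJECT_THRESHOLD:
        return "auto-reject"
    return "hitl-review"
```

For example, scores of 0.9, 0.8, and 0.7 give 0.27 + 0.32 + 0.21 = 0.80, which clears the assumed pass threshold.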
System prompt: static, set at agent initialization — defines persona, capabilities, constraints, output format. Doesn't change per run. Runtime instruction: dynamic, passed per invocation — the specific task for this run. Separating them allows: (1) Reuse the same agent for multiple tasks. (2) Cache the system prompt token cost. (3) Cleaner API — callers only pass the task, not re-specify the agent's full context.
Track token usage cumulatively. When approaching limit: (1) Summarize older turns, replace with summary (rolling context compression). (2) Evict least-relevant messages by importance scoring. (3) Move completed context to external memory (DynamoDB), fetch back when needed. (4) Use Bedrock prompt caching to avoid re-processing stable system prompt on every turn. Hard limits per agent via RunConfig.
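Rolling context compression, approach (1), can be sketched as follows. The word-count `count_tokens` is a crude stand-in for a real tokenizer, and `summarize` is a hypothetical callable that would normally be another LLM call.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def compress(messages: list[str], budget: int,
             summarize: Callable[[list[str]], str],
             keep_recent: int = 2) -> list[str]:
    """Replace older turns with one summary when the history exceeds the budget."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count_tokens(m) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return messages  # still within budget, or nothing old enough to fold
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent
```

Keeping the most recent turns verbatim preserves the details the model is most likely to need next, while older turns survive only as a summary.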
Planner: decomposes a high-level goal into subtasks, creates an execution plan. Doesn't execute. Executor: carries out individual subtasks — calls tools, writes output. No high-level planning. Critic: reviews executor output against the goal — identifies errors or missing steps, feeds back to planner or executor for correction. This pattern is used in AutoGen and similar frameworks for higher-quality autonomous task completion.
LangChain's memory is implicit: ConversationBufferMemory automatically appends everything. An agent platform makes memory explicit: you define what gets stored (agent_result.memory_write), what gets retrieved (memory.fetch(query)), and when memory is cleared. Memory is typed (episodic vs semantic), stored in a pluggable backend (in-memory for tests, DynamoDB for production), and retrieved via semantic search.
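A minimal sketch of an explicit, typed memory store with an in-memory backend. `MemoryStore` and its methods are hypothetical and are not the platform API named above; keyword overlap is used here as a simple stand-in for semantic search.

```python
from dataclasses import dataclass, field

MEMORY_TYPES = {"episodic", "semantic"}


@dataclass
class MemoryStore:
    """In-memory backend; keyword overlap stands in for semantic search."""
    items: list[tuple[str, str]] = field(default_factory=list)  # (kind, text)

    def write(self, kind: str, text: str) -> None:
        """Store one typed memory; nothing is saved unless this is called."""
        if kind not in MEMORY_TYPES:
            raise ValueError(f"unknown memory type: {kind}")
        self.items.append((kind, text))

    def fetch(self, query: str, top_k: int = 1) -> list[str]:
        """Return up to top_k memories sharing the most words with the query."""
        words = set(query.lower().split())
        scored = [(len(words & set(text.lower().split())), text)
                  for _, text in self.items]
        ranked = sorted((s for s in scored if s[0] > 0),
                        key=lambda s: s[0], reverse=True)
        return [text for _, text in ranked[:top_k]]

    def clear(self) -> None:
        """Explicitly drop all stored memories."""
        self.items.clear()
```

Because writes, reads, and clears are all explicit calls, each one can be logged and inspected.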
(1) Unit tests: each agent in isolation with mocked tool responses and deterministic LLM outputs. (2) Integration tests: full pipeline with real LLM calls on golden dataset. (3) Contract tests: verify each agent's input/output schema stays stable (Pydantic). (4) Load tests: N parallel executions, verify Step Functions handles concurrency. (5) Chaos tests: inject failures (tool timeout, LLM error), verify retry/fallback logic. Gate deploys on all passing.