Skip to content
Back to Hub
0/21 complete
Specialization Track

Agentic AI Engineer Path

Bias-free, foundation-first agentic AI engineering. From the ReAct paper and modern model families through MCP (2025-11-25 spec), the Claude Agent SDK, and the full coding-agent landscape — Claude Code, Cursor, Copilot, Aider, Continue — to evals, prompt-injection defense, and a grounded production capstone.

21 lessons
5 phases
11 official source families
arXivAnthropicOpenAIGoogleMetaMistral AIOllamaModel Context ProtocolCursorGitHubAider

Certificate Lane

Docs-Driven Specialization Review

Complete the authored lessons, finish the track assessment, and pass at 80%+ to unlock certificate eligibility.

Certificate in progress
Coursework

0/21

0% complete

Assessment

not started

Certificate

Locked

locked

Lesson Flow

Flow Timeline

0/21 lessons done

Next up

From ELIZA to ReAct: How Agentic AI Got Here

Foundations and Mental Models

Foundations and Mental Models

From ELIZA to ReAct: How Agentic AI Got Here

Pending

Step 1

Foundations and Mental Models

AI Systems Framing: AI, ML, Deep Learning, and LLM Products

Pending

Step 2

Foundations and Mental Models

How LLMs Work: Tokens, Context Windows, Embeddings, and Transformer Intuition

Pending

Step 3

Foundations and Mental Models

Model Selection in 2026: Claude 4.x, GPT-5, Gemini 2.x, Llama, Mistral, Open-Weight

Pending

Step 4

Reliable Application Interfaces

Prompting and Context Engineering

Pending

Step 5

Reliable Application Interfaces

Structured Outputs, Schemas, and Validation

Pending

Step 6

Reliable Application Interfaces

Tool Use, Function Calling, and Approval Boundaries

Pending

Step 7

Performance, Cost, Multimodal, and Connected Context

Prompt Caching: Latency, Cost, and Cache-Friendly Prompt Design

Pending

Step 8

Performance, Cost, Multimodal, and Connected Context

Vision and Multimodal: Images, PDFs, Diagrams, and Screenshots

Pending

Step 9

Performance, Cost, Multimodal, and Connected Context

RAG, Retrieval, and Citation Design

Pending

Step 10

Performance, Cost, Multimodal, and Connected Context

MCP Architecture: Hosts, Clients, Servers, Tools, Resources, Prompts, Elicitation

Pending

Step 11

Performance, Cost, Multimodal, and Connected Context

Building MCP Servers: Transports, Capabilities, and Trust Boundaries

Pending

Step 12

Agents and Coding Agents

Agent Patterns: Augmented LLMs, Workflows, and Autonomous Agents

Pending

Step 13

Agents and Coding Agents

Building Custom Agents with the Claude Agent SDK

Pending

Step 14

Agents and Coding Agents

Coding Agents Landscape: Claude Code, Cursor, Copilot, Aider, Continue

Pending

Step 15

Agents and Coding Agents

Claude Code in Practice: Settings, MCP, and Permission Boundaries

Pending

Step 16

Production: Evals, Defenses, Deployment, Capstone

Evals, Guardrails, Latency, and Cost

Pending

Step 17

Production: Evals, Defenses, Deployment, Capstone

Prompt Injection Defense: Real Attacks, Real Mitigations

Pending

Step 18

Production: Evals, Defenses, Deployment, Capstone

Hosted APIs vs Open-Weight Models

Pending

Step 19

Production: Evals, Defenses, Deployment, Capstone

Local-First Self-Hosting with Ollama

Pending

Step 20

Production: Evals, Defenses, Deployment, Capstone

Capstone: Ship a Grounded, Cached, Defended Agentic Product

Pending

Step 21

Phase 1

Foundations and Mental Models

Trace how today's agentic AI got here, then learn the system framing, mechanics, and current 2026 model lineup that the rest of the track builds on.

Phase 2

Reliable Application Interfaces

Turn raw model behavior into safer software with prompt discipline, structured outputs, and explicit tool boundaries — across Anthropic, OpenAI, Google, and open-weight stacks.

Phase 3

Performance, Cost, Multimodal, and Connected Context

Cache prompts for latency and cost wins, add vision and multimodal where it earns its place, ground answers with retrieval, then connect external capabilities through MCP — including the 2025-11-25 spec additions for elicitation and structured tool output.

Phase 4

Agents and Coding Agents

Move from tool use into agent patterns from Anthropic's 'Building Effective Agents,' build custom agents on the Claude Agent SDK, then survey the modern coding-agent landscape (Claude Code, Cursor, Copilot, Aider, Continue) without vendor bias.

Phase 5

Production: Evals, Defenses, Deployment, Capstone

Close the loop with vendor-neutral evals, layered prompt-injection defense, hosted vs open-weight deployment thinking, local-first Ollama hands-on, and a capstone that ties it all together for review.

Lesson 1 of 21

From ELIZA to ReAct: How Agentic AI Got Here

Difficulty

Beginner

Duration

25 min

Modern coding agents did not appear in 2024. They are the product of a sixty-year arc: rule-based dialog systems in the 1960s, expert systems in the 1980s, statistical NLP in the 2000s, transformer language models from 2017, instruction-tuned chat models in 2022, and tool-using reasoning loops from 2022 onward.

The pivot to today's agentic systems came from one repeatable pattern: interleave reasoning with action. The 2022 ReAct paper (Yao et al.) showed that letting a model think, take a tool action, observe the result, then think again outperforms either chain-of-thought alone or tool use alone. Every modern coding agent — Claude Code, Cursor, GitHub Copilot agent, Aider — runs some variant of this loop.

Knowing this lineage matters because vendor docs assume you do. When Anthropic publishes 'Building Effective Agents' or the MCP spec ships elicitation, the design choices only make sense if you understand what came before: why deterministic workflows still beat agents for stable paths, why tool boundaries matter, and why 'plan, act, observe' became the default loop instead of free-form generation.

  1. Plan

    Decide what to do next based on the goal and prior observations.

  2. Act

    Call a tool, edit a file, or make a request.

  3. Observe

    Read the tool result, error, or new state.

  4. Reflect

    Decide whether to continue, retry, escalate, or stop.

The reason-act-observe loop, generalized to plan-act-observe-reflect, is the shape ReAct (2022) introduced and modern agents inherit.
agentic-timeline.js
1const agenticTimeline = { 2 1966: "ELIZA — pattern-matching chatbot", 3 1980: "MYCIN, expert systems — symbolic rules", 4 2017: "Transformer — attention is all you need", 5 2022: "ChatGPT + ReAct — instruction tuning meets reason-act loops", 6 2023: "Tool use, function calling, autonomous agents (AutoGPT)", 7 2024: "MCP, coding agents, agent SDKs — agent infra standardizes", 8 2025: "MCP 2025-11-25: elicitation, structured tool output", 9};
Real-World Scenario

A learner sees Claude Code, Cursor, and the MCP spec as separate inventions and cannot explain why they all share the same plan-act-observe shape.

What you learned
  • Today's agents are reasoning loops with tools, not magic.
  • ReAct is the foundational pattern under nearly every coding agent.
  • Vendor docs assume you know this lineage — it makes their design choices legible.
Build Mission

Pick one modern coding agent and trace its core loop back to ReAct. Identify which step is reasoning, which is action, and which is observation.

Check Yourself
  • What problem did ReAct solve that chain-of-thought alone could not?
  • Why did rule-based and statistical systems give way to transformers?
  • Which parts of an MCP server map to the action and observation steps in ReAct?
Definition of Done
  • Learner can describe at least four eras in the path from ELIZA to modern agents.
  • Learner can map any modern coding agent's loop onto reason-act-observe.
  • Learner explains why the lineage shapes the constraints in current docs.
Failure modes
  • Treating modern agents as a 2024 invention with no prior art.
  • Confusing chain-of-thought with agentic behavior.
  • Believing larger models automatically remove the need for tool loops.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Which insight from the 2022 ReAct paper underpins most modern coding agents?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Lineage Map

Sketch a one-page timeline from ELIZA (1966) to MCP 2025-11-25 with at least six waypoints. For each, note what it added that the prior era lacked.

Deliverable

A timeline with six labeled waypoints and a one-line capability delta for each.

Lesson 2 of 21

AI Systems Framing: AI, ML, Deep Learning, and LLM Products

Difficulty

Beginner

Duration

20 min

Students often say 'AI' when they really mean one narrow product shape. This lesson fixes that. An AI product is a system with a model, context, product logic, interfaces, safety checks, and success criteria.

Official vendor docs from Anthropic, OpenAI, and Google all assume you already know that the API call is only one layer. The engineering job is deciding where knowledge lives, what the model may do, what the application validates, and how users know whether to trust the result.

That framing matters because everything later in the track is about replacing vague AI talk with concrete system design: model choice, prompt contract, retrieval, tools, MCP, coding agents, and evals.

ai-system-frame.js
1const aiSystem = { 2 model: "llm or classifier", 3 context: ["prompt", "retrieval", "tools", "mcp resources"], 4 applicationLogic: ["validation", "permissions", "fallbacks"], 5 productSurface: ["ui", "api", "logs", "alerts"], 6};
Real-World Scenario

A team keeps describing features as 'add AI here' without defining where knowledge comes from, what actions are allowed, or how errors are handled.

What you learned
  • An AI product is a system, not only a model call.
  • Trust depends on context, validation, and operating rules.
  • Good framing reduces hype and improves design decisions.
Build Mission

Take one AI feature idea and rewrite it as a system with model, context, actions, guardrails, and product outputs.

Check Yourself
  • What belongs to the model versus the application?
  • Why is an LLM answer not the same thing as production truth?
  • What would another engineer need to review before trusting the feature?
Definition of Done
  • Learner can explain the difference between a model and an AI product.
  • Feature design includes product logic and safety decisions.
  • High-level AI language is replaced with concrete system components.
Failure modes
  • Treating every AI feature as a prompt-only problem.
  • Ignoring the application layer that validates or blocks outputs.
  • Using broad AI language instead of naming the actual workflow.

Authored Quiz

AI systems framing check

0/1 correct

Question 1

Which description best matches a production AI feature?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

System Rewrite

Rewrite one vague AI feature request into a system design with model, context, output contract, and guardrails.

Deliverable

A one-page outline with five labeled system components.

Lesson 3 of 21

How LLMs Work: Tokens, Context Windows, Embeddings, and Transformer Intuition

Difficulty

Beginner

Duration

30 min

You do not need to be a researcher to work well with LLMs, but you do need correct mental models. Tokens are the unit the model sees, context windows cap how much can fit, and embeddings turn text into vectors for retrieval and comparison workflows.

Transformer intuition matters because attention is why different prompt layouts and context ordering change results. Long context is not magical memory — even a 1M-token model still allocates attention under limits, and irrelevant context can crowd out useful signals.

Modern models also surface deliberate reasoning budgets (Claude's extended thinking, OpenAI o-series) as a separate layer on top of generation. This lesson gives the vocabulary needed to reason about latency, truncation, retrieval quality, and why the same task behaves differently under different context budgets.

llm-mechanics.js
1const llmMechanics = { 2 tokens: "model-visible text units", 3 contextWindow: "maximum prompt plus response budget", 4 embeddings: "vector representations for similarity", 5 attention: "how the model weighs relevant context", 6 thinkingBudget: "explicit reasoning tokens before the visible answer", 7};
Real-World Scenario

A learner hears about tokens and transformers constantly but still cannot connect them to real design choices like chunking, latency, or prompt layout.

What you learned
  • Tokens shape fit, cost, and latency.
  • Embeddings support retrieval and similarity tasks.
  • Attention explains why context quality matters more than raw length.
  • Reasoning budgets (extended thinking) are a separate dial from context length.
Build Mission

Explain one AI workflow using tokens, embeddings, context limits, attention, and (if applicable) thinking budgets — instead of vague 'the model understands it' language.

Check Yourself
  • What changes when the prompt gets too long?
  • Why are embeddings useful for retrieval but not a replacement for product logic?
  • How can extra context make the answer worse?
  • When does extended thinking help and when does it just add latency?
Definition of Done
  • Learner can explain the prompt-to-output path with correct terminology.
  • Context limits are treated as engineering constraints.
  • Embeddings are linked to retrieval quality and not confused with generation.
  • Learner can name when extended thinking earns its latency cost.
Failure modes
  • Assuming the model reliably understands every token in long context.
  • Using embedding language without knowing what problem embeddings solve.
  • Treating latency and cost as separate from context design.
  • Turning on extended thinking everywhere and paying its latency tax for no quality gain.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

What is the strongest reason to care about tokens in application design?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Token Budget Sketch

Take one workflow and estimate instruction, retrieval, user input, and reserved response tokens. If extended thinking is in scope, add a thinking budget line.

Deliverable

A short token budget with one risk note about truncation or latency.

Lesson 4 of 21

Model Selection in 2026: Claude 4.x, GPT-5, Gemini 2.x, Llama, Mistral, Open-Weight

Difficulty

Beginner

Duration

30 min

Model selection is an engineering choice, not a fandom choice. As of 2026 the practical landscape includes Anthropic Claude 4.x (Opus, Sonnet, Haiku), OpenAI GPT-5 family, Google Gemini 2.x, Meta Llama 4 open-weight, Mistral hosted and open-weight, and a long tail of community models served via Ollama.

Each family varies on the same axes: reasoning quality, tool use reliability, latency, context window, multimodal support, prompt-cache friendliness, hosting options, and total cost. No model wins on every axis. The selection question is always task- and constraint-specific.

The right question is never 'which model is best overall?' The right question is 'which model is best for this exact task, budget, latency target, vision/audio needs, privacy posture, and operations capacity?' Build a small model-selection matrix and revisit it whenever a major new release ships.

model-selection.js
1const selectionMatrix = { 2 workload: "coding agent with repo edits", 3 qualityNeed: "high reasoning, strong tool use", 4 latencyBudgetMs: 2500, 5 contextNeeded: "200k+ tokens with prompt caching", 6 multimodal: "must read screenshots", 7 privacyConstraint: "customer code stays in tenant", 8 candidates: ["claude-sonnet-4-x", "gpt-5", "gemini-2.x", "llama-4 (self-hosted)"], 9};
Real-World Scenario

A team keeps picking models by social hype while ignoring latency ceilings, support burden, multimodal needs, and whether the workflow actually needs frontier reasoning.

What you learned
  • Model choice follows workload fit, not hype or vendor loyalty.
  • Hosted, open-weight, and self-hosted options have different tradeoffs.
  • Quality, latency, multimodal, privacy, and operations all belong in the same decision.
  • Revisit selection whenever a major model release lands.
Build Mission

Pick a model family for one workflow and justify it across quality, latency, cost, multimodal needs, and deployment complexity. Compare at least one Anthropic, one OpenAI, one Google, and one open-weight option.

Check Yourself
  • When does a hosted frontier model make more sense than a local open-weight model?
  • What matters most for coding or tool-using workflows?
  • What operational cost appears when you self-host instead of calling an API?
  • Which capabilities (vision, audio, long context) are non-negotiable for your task?
Definition of Done
  • Learner can compare major model families using practical criteria.
  • The recommendation clearly matches one workload and one set of constraints.
  • Open-weight hosting is treated as a tradeoff, not a badge of seriousness.
  • Selection is documented enough that another engineer can audit the choice.
Failure modes
  • Choosing models based on leaderboard reputation alone.
  • Ignoring operational complexity when comparing hosted and self-hosted options.
  • Using the same model choice for every workload without task-specific evaluation.
  • Locking in a vendor without an exit story when a better model ships.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Which decision rule is strongest when selecting a model for production?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Selection Matrix

Compare one Anthropic, one OpenAI, one Google, and one open-weight option for the same workflow.

Deliverable

A side-by-side table across quality, latency, context, multimodal, cost, privacy, and ops — with one final recommendation and why.

Lesson 5 of 21

Prompting and Context Engineering

Difficulty

Intermediate

Duration

25 min

Prompting is interface design. Stable instructions, task framing, examples, and dynamic context should be separated so the model sees a clear contract instead of a blob of mixed concerns.

Anthropic, OpenAI, and Google all converge on the same principles in their official docs: clarity, examples, structured delimiters, and explicit success criteria. Anthropic emphasizes XML tags and role separation; OpenAI emphasizes the Responses API instruction layer; Google emphasizes system instructions in Gemini. Better prompts are usually clearer prompts, not more theatrical prompts.

Context engineering matters because the prompt is only one part of context. Retrieved evidence, tool results, MCP resources, and prior conversation all compete for attention inside the same bounded budget.

context-engineering.js
1const prompt = [ 2 { role: "system", content: "Answer with a cited policy summary. Refuse if no citation is available." }, 3 { role: "user", content: "Question: Can I refund a shipped order?" }, 4 { role: "user", content: "<policy_excerpt>Shipped orders need manager approval.</policy_excerpt>" }, 5];
Real-World Scenario

A workflow works only when one engineer manually pastes the perfect context into the prompt, and no one else can reproduce the result.

What you learned
  • Prompt quality comes from clarity, boundaries, and reuse.
  • Stable instructions and dynamic context should not be mixed casually.
  • Context engineering is broader than prompt wording alone.
Build Mission

Take one vague prompt and rewrite it into stable instructions, scoped context, and a clear output target.

Check Yourself
  • Which context belongs in retrieval instead of the base prompt?
  • How do you keep prompts reusable across requests?
  • What should be removed from context because it adds noise?
Definition of Done
  • Learner can separate instructions, evidence, and user input cleanly.
  • Prompt changes can be versioned and reviewed.
  • The context plan is disciplined instead of maximalist.
Failure modes
  • Packing instructions, policies, and user data into one wall of text.
  • Adding more context without checking relevance.
  • Treating prompt design as guesswork instead of an interface contract.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Why should stable instructions be separated from dynamic context?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Prompt Rewrite

Convert a messy AI prompt into a task contract with instructions, evidence, and output expectation.

Deliverable

A before-and-after prompt pair with one paragraph of rationale.

Lesson 6 of 21

Structured Outputs, Schemas, and Validation

Difficulty

Intermediate

Duration

25 min

Once a model output feeds automation, prose stops being enough. Structured outputs let the application ask for a strict schema instead of hoping a free-form answer can be parsed reliably. Anthropic exposes this through tool-use schemas; OpenAI through Structured Outputs; Google through Gemini's response_schema.

That improves formatting reliability, but schema validity is still not business correctness. The application still owns policy validation, authorization, retries, and refusal handling.

This is one of the clearest shifts from AI demo thinking to software engineering thinking: typed contracts, explicit fallback paths, and predictable downstream behavior across vendors.

structured-output-contract.js
1const schema = { 2 type: "object", 3 properties: { 4 priority: { type: "string", enum: ["low", "medium", "high"] }, 5 action: { type: "string" }, 6 reason: { type: "string" }, 7 }, 8 required: ["priority", "action", "reason"], 9 additionalProperties: false, 10};
Real-World Scenario

A model output looks good in demos but keeps breaking downstream code because field names and structure drift between requests.

What you learned
  • Structured outputs are stronger than ad hoc parsing.
  • Schemas reduce formatting drift and make automation safer.
  • Validation still belongs to the application.
Build Mission

Design a strict schema for one workflow and define what product-side validation still happens after the model responds.

Check Yourself
  • What does schema validation solve, and what does it not solve?
  • How should the app behave on refusal or invalid output?
  • Which downstream workflows should never depend on free-form prose?
Definition of Done
  • Learner can design a strict schema for an automation workflow.
  • The system has a fallback path for invalid or refused output.
  • Schema-valid and business-valid are clearly distinguished.
Failure modes
  • Using free-form prose where typed output is required.
  • Assuming schema-valid output is automatically safe to execute.
  • Forgetting to handle refusal or retry behavior.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Why are structured outputs safer than plain JSON formatting for automation?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Schema Drill

Create a JSON schema for one AI classification or triage workflow.

Deliverable

A strict schema with one note about downstream validation.

Lesson 7 of 21

Tool Use, Function Calling, and Approval Boundaries

Difficulty

Intermediate

Duration

30 min

Tool use is the moment an AI system stops being only a text generator. The model can request actions or external data, but the application still owns execution, permissions, validation, and auditability.

Anthropic, OpenAI, Google, and Ollama all expose tool-use patterns because real products need this loop: define the tool contract, let the model request a call, execute it in code, return the result, and decide whether another step is safe. The wire formats differ, but the loop shape does not.

The design question is not only 'what tools can the model call?' It is also 'which calls are safe automatically, which need human approval, and which should never be model-driven?'

tool-use-boundary.js
1const refundTool = { 2 name: "request_refund_review", 3 description: "Create a refund review ticket for a shipped order", 4 input_schema: { 5 type: "object", 6 properties: { orderId: { type: "string" }, reason: { type: "string" } }, 7 required: ["orderId", "reason"], 8 }, 9};
Real-World Scenario

A team wants the assistant to create tickets, send actions, and fetch account data without first deciding which actions are low-risk and which require review.

What you learned
  • The model suggests tool calls, but your application owns execution.
  • Approval boundaries matter most for state-changing actions.
  • Good tool contracts reduce ambiguity and overreach.
Build Mission

Define one read-only tool and one write-oriented tool, then state which can run automatically and which must stop for approval.

Check Yourself
  • What should be validated before execution?
  • Which tools are safe only when read-only?
  • How do you log what the model asked for versus what the app actually did?
Definition of Done
  • Learner can explain the tool loop from request to execution to follow-up.
  • Tool schemas are narrow enough to reduce misuse.
  • Approval rules are written before launch.
Failure modes
  • Treating model-proposed arguments as trusted input.
  • Giving the model broad write tools without review boundaries.
  • Using too many overlapping tools with vague descriptions.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

After the model emits a tool call, what should happen next?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Approval Matrix

Define three tools and mark each as automatic, review-required, or blocked.

Deliverable

A one-page tool policy with one sentence of justification per tool.

Lesson 8 of 21

Prompt Caching: Latency, Cost, and Cache-Friendly Prompt Design

Difficulty

Intermediate

Duration

25 min

Prompt caching lets the API store the static prefix of your prompt — long system instructions, retrieved corpora, tool schemas, conversation history — and reuse the cached compute on subsequent calls. Anthropic exposes this through cache_control breakpoints; OpenAI exposes it via automatic prompt caching on supported models. The result is large latency wins (often 50%+) and large cost wins (cached tokens are billed at a fraction of fresh tokens).

Caching is not free or automatic. Effective caching requires designing prompts so the static portion comes first, the dynamic portion comes last, and the cache breakpoint is placed deliberately. Reorder the prompt and you invalidate the cache. Add even one new token in the cached prefix and the cache misses.

For agentic workflows that loop on the same context (RAG over the same corpus, repeated tool-use over the same conversation), caching is the single highest-leverage optimization in the system. Treat it as a first-class part of prompt design, not as a late optimization.

prompt-caching.js
1// Anthropic-style cache_control example 2const messages = [ 3 { 4 role: "system", 5 content: [ 6 { type: "text", text: longCorpus }, 7 { type: "text", text: toolSchemas, cache_control: { type: "ephemeral" } }, 8 ], 9 }, 10 { role: "user", content: userQuestion }, 11];
Real-World Scenario

A team's agent is slow and expensive because every call re-sends the same 50k tokens of system instructions and retrieved corpus.

What you learned
  • Cache the static prefix; never put dynamic data above the breakpoint.
  • Cache hits cut both latency and cost dramatically.
  • Reordering or rewriting the cached prefix invalidates the cache.
  • Caching is highest-leverage for repeated calls over the same context.
Build Mission

Take one workflow that hits the model more than 10 times per session and design its prompt for cache hits. Identify the cache breakpoint and what changes per call.

Check Yourself
  • What part of your prompt never changes between calls?
  • Where is the cleanest cache breakpoint?
  • How will you measure cache hit rate in production?
  • Which workloads do not benefit from caching and should skip it?
Definition of Done
  • Learner can identify cacheable prefix vs dynamic suffix in any prompt.
  • Prompt design places the breakpoint deliberately.
  • Cache hit rate is observable in logs.
Failure modes
  • Putting dynamic context above static context, killing every cache hit.
  • Rewriting the cached prefix on every deploy, invalidating the cache silently.
  • Enabling caching on one-shot workloads that never hit twice.
  • Forgetting that cache TTL means the first call after idle still pays full cost.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Where should the cache breakpoint go in a typical RAG-over-corpus workflow?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Cache Audit

Take one of your agent workflows. List every prompt section and mark which is static (cacheable) vs dynamic (changes per call). Place a cache breakpoint and estimate hit rate.

Deliverable

An annotated prompt outline with the breakpoint marked and projected hit rate.

Lesson 9 of 21

Vision and Multimodal: Images, PDFs, Diagrams, and Screenshots

Difficulty

Intermediate

Duration

30 min

Modern frontier models accept images, PDFs, audio, and video alongside text. Anthropic's Claude vision API takes inline base64 or URL images; OpenAI GPT-4o and beyond handle the same; Google Gemini natively handles multimodal long-context including video. The cost model differs (token-equivalent for images), but the capability shape is converging.

Vision unlocks workflows that were impossible with text alone: reading screenshots in coding agents, parsing scanned documents, interpreting diagrams in technical docs, summarizing UI wireframes, OCR-style extraction with reasoning. The engineering job is choosing when vision is the right tool and when a smaller specialized model (OCR, layout parsers) is faster and cheaper.

Multimodal is also where prompt injection gets sneakier — a hostile image can carry adversarial instructions to the model. Treat any user-supplied image the same way you treat user-supplied text: untrusted input that needs guardrails before tool use.

vision-multimodal.js
1// Anthropic vision: pass an image alongside text 2const message = await client.messages.create({ 3 model: "claude-sonnet-4-x", 4 max_tokens: 1024, 5 messages: [ 6 { 7 role: "user", 8 content: [ 9 { type: "image", source: { type: "base64", media_type: "image/png", data: imageB64 } }, 10 { type: "text", text: "Describe the failing UI state in this screenshot." }, 11 ], 12 }, 13 ], 14});
Real-World Scenario

A team wants their support assistant to accept screenshots from users but has not thought through cost, latency, or what happens when a screenshot contains prompt-injection text.

What you learned
  • Vision is supported across Claude, GPT, and Gemini families.
  • Images count toward token budget — measure cost before scaling.
  • Specialized models (OCR, layout) can beat frontier vision on cost for narrow tasks.
  • Hostile images can carry prompt-injection payloads.
Build Mission

Pick one workflow where vision unlocks a capability text cannot. Choose the model, estimate per-call cost, and define the safety check on user-supplied images.

Check Yourself
  • When does vision beat a specialized OCR or layout model?
  • How does image size affect token cost?
  • What is your defense against adversarial images?
  • Which models handle multi-page PDFs natively vs require pre-conversion?
Definition of Done
  • Learner picks a model with explicit reasoning across cost, capability, and latency.
  • Image input is sanitized before reaching tool-use steps.
  • Cost model includes per-image token equivalents.
Failure modes
  • Pricing the feature without measuring per-image token cost.
  • Trusting OCR text from user images and feeding it directly to tools.
  • Using frontier vision when a specialized OCR would be 10x cheaper and faster.
  • Ignoring image-borne prompt injection.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Which scenario is the strongest case for using a frontier multimodal model over a specialized vision tool?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Vision Decision

Take one product feature where image input is on the roadmap. Document: which model, why not a specialized one, per-image cost estimate, and the prompt-injection mitigation.

Deliverable

A one-page vision feature spec with cost model.

Lesson 10 of 21

RAG, Retrieval, and Citation Design

Difficulty

Intermediate

Duration

30 min

Retrieval solves a specific problem: the model should answer using fresh, private, or domain-specific knowledge instead of relying on base model memory alone.

The quality of retrieval depends on corpus quality, chunking, metadata, ranking, and how clearly the final answer shows what evidence was used. A grounded answer is an evidence path, not just a confident paragraph. Anthropic, OpenAI, and Google all expose hosted retrieval primitives, and Ollama supports local embeddings — pick by cost, privacy, and corpus shape.

Good retrieval design also knows when not to use RAG. If the task is action orchestration, deterministic lookup, or tool execution, retrieval may not be the main problem at all.

rag-plan.js
1const retrievalPlan = { 2 corpus: "employee handbook", 3 chunking: "small policy sections with headers", 4 metadata: ["policy_area", "last_updated"], 5 embedding: "model-family agnostic — anthropic, openai, gemini, or local via ollama", 6 answerUX: "show citation and excerpt with every answer", 7};
Real-World Scenario

A support assistant keeps inventing policy details because it has no reliable path to current documents or citations.

What you learned
  • RAG grounds answers on external knowledge.
  • Chunking and metadata affect answer quality directly.
  • Citations are part of product trust, not a nice-to-have.
  • Embedding choice is independent of generation-model choice.
Build Mission

Design a retrieval plan with corpus choice, chunking, metadata, and answer citation behavior.

Check Yourself
  • Which information belongs in retrieval instead of the base prompt?
  • How can chunking damage answer quality?
  • What should the user see so the answer feels inspectable?
  • Should you embed locally (Ollama) or via a hosted API?
Definition of Done
  • Learner can explain the full retrieval flow from source ingest to answer rendering.
  • Grounding is linked to evidence quality, not magic model memory.
  • Source exposure is part of the feature design.
Failure modes
  • Using oversized chunks that bury the relevant evidence.
  • Returning answers without any source story.
  • Applying retrieval when the real problem is tool orchestration or action logic.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

What does a well-designed RAG workflow primarily improve?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Citation UX

Sketch how your product will show evidence, references, or source excerpts to users.

Deliverable

A small citation design note with one UI rule and one trust rule.

Lesson 11 of 21

MCP Architecture: Hosts, Clients, Servers, Tools, Resources, Prompts, Elicitation

Difficulty

Intermediate

Duration

35 min

Model Context Protocol standardizes how external systems expose capabilities to models. The host, client, and server all have different jobs, and mixing them up causes bad security and bad product design.

MCP separates tools, resources, and prompts. Tools are model-invoked actions. Resources expose context. Prompts package reusable user-controlled workflows. Roots and sampling add boundary decisions. The 2025-11-25 spec adds elicitation (servers can request structured user input mid-call) and structured tool output (tool results carry typed content and annotations the host can render or gate on).

This matters because MCP is not only an implementation detail. It is a portability layer for grounded context and actions across products and coding environments.

Host / Client

initializeprotocolVersion, capabilities

MCP Server

A typical MCP session: initialize, discover tools, call one. The 2025-11-25 spec adds structured content + annotations to tool results.
mcp-architecture.js
1const mcpSurfaceMap = { 2 tool: "model-invoked action", 3 resource: "context exposed by the server", 4 prompt: "reusable user-invoked template", 5 roots: "filesystem boundary", 6 elicitation: "server requests structured user input mid-call (2025-11-25)", 7 structuredContent: "typed tool results with annotations (2025-11-25)", 8};
Real-World Scenario

A team wants one assistant to search docs, inspect repo files, and reuse workflow prompts, but each integration currently uses a different custom contract.

What you learned
  • MCP separates host, client, and server roles.
  • Tools, resources, and prompts are different control surfaces.
  • Roots, sampling, and elicitation exist because boundaries matter.
  • Structured content lets hosts render rich UI and gate destructive actions.
Build Mission

Take three capabilities in one AI product and classify them as an MCP tool, resource, prompt, or elicitation flow.

Check Yourself
  • What does the host own that the server does not?
  • Which MCP surface is user-controlled versus model-controlled?
  • When should a tool ask for elicitation instead of guessing?
  • Why are roots and approvals part of the design from day one?
Definition of Done
  • Learner can explain MCP roles and surfaces clearly.
  • Capability design reflects the right MCP surface.
  • Elicitation is used where guessing user intent is unsafe.
  • Security boundaries are part of architecture, not a late patch.
Failure modes
  • Treating MCP as only a list of callable tools.
  • Exposing data broadly without root boundaries or review.
  • Using the wrong MCP surface for a capability.
  • Skipping elicitation and silently picking a default the user did not approve.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Which MCP surface is specifically designed for model-invoked actions?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Surface Mapping

Map one AI product's capabilities into MCP tools, resources, prompts, and (where relevant) elicitation flows.

Deliverable

A short capability table with one sentence for each mapping choice.

Lesson 12 of 21

Building MCP Servers: Transports, Capabilities, and Trust Boundaries

Difficulty

Intermediate

Duration

35 min

Knowing the protocol is not enough. You also need to understand what it takes to build or integrate a server that exposes useful, narrow capabilities to a host or coding environment. The 2025-11-25 spec covers stdio transport for local servers and HTTP transport for remote servers, with capability negotiation on initialize.

The official docs break this into client lifecycle, server lifecycle, discovery, transport, capability exposure, and approval boundaries. In practice, a good MCP server is intentionally small, specific, and reviewable. Use structured tool output and annotations so the host can render results safely and gate destructive operations.

This lesson connects the specification to practical delivery: how an MCP server fits into coding agents, doc search, internal operations, and reusable product capabilities.

Host / Client

stdio transport openedspawn ./mcp/docs-server

MCP Server

Building a server: stdio transport, narrow capability surface, and elicitation when scope must be user-confirmed.
mcp-server-policy.js
1const mcpServerPolicy = { 2 server: "internal docs server", 3 transport: "stdio (local) or http (remote)", 4 resources: ["product specs", "runbooks"], 5 tools: ["search_docs"], 6 blocked: ["write access", "secret files"], 7 approval: "review required for any broadened capability", 8 outputAnnotations: ["destructive=false", "idempotent=true"], 9};
Real-World Scenario

A team wants to expose internal docs and workflows to coding agents, but they have not defined the boundary between safe context access and unsafe operational access.

What you learned
  • Good MCP servers are narrow and reviewable.
  • Capability exposure should be designed before transport details.
  • Structured tool output + annotations let hosts gate safely.
  • The best server surface is the smallest useful one.
Build Mission

Design one MCP server with a narrow capability set, root boundaries, structured output annotations, and a clear approval policy.

Check Yourself
  • What capabilities should stay out of the first server version?
  • Which data should be resources instead of tool output?
  • Which annotations help the host gate destructive operations?
  • How would you explain the trust boundary to another engineer?
Definition of Done
  • Learner can design a small MCP server with explicit boundaries.
  • Capability exposure is justified by workflow need, not novelty.
  • Tool output annotations support host-side safety gates.
  • Transport and discovery choices stay subordinate to trust design.
Failure modes
  • Building a server that exposes too much too early.
  • Returning contextual data through tool calls when resources fit better.
  • Skipping output annotations so the host cannot gate destructive operations.
  • Treating every internal server as implicitly trusted.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

What is the strongest default for a new MCP server in a real organization?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Server Scope Card

Write the capability list, blocked capabilities, output annotations, and approval policy for one MCP server.

Deliverable

A scope card with three allowed capabilities, at least two blocked ones, and annotation choices for any destructive tools.

Lesson 13 of 21

Agent Patterns: Augmented LLMs, Workflows, and Autonomous Agents

Difficulty

Intermediate

Duration

35 min

Anthropic's 'Building Effective Agents' essay names the patterns that matter in production: augmented LLM, prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer, and (only when warranted) autonomous agents. The 2022 ReAct paper underpins all of them — interleave reasoning, action, and observation.

A fixed workflow is still better when the path is stable. The engineering question is whether the system needs dynamic decision-making or whether a simple deterministic flow is being hidden behind an 'agent' label. Most production wins come from the simpler patterns (chaining, routing) — autonomous agents are a last resort, not a default.

Good agent design includes handoffs, termination rules, approval boundaries, and a clear reason each step is adaptive instead of fixed. If you cannot name which step earns the agent overhead, build a workflow instead.

  1. Plan

    Decide what to do next based on the goal and prior observations.

  2. Act

    Call a tool, edit a file, or make a request.

  3. Observe

    Read the tool result, error, or new state.

  4. Reflect

    Decide whether to continue, retry, escalate, or stop.

Plan → Act → Observe → Reflect: the canonical loop under every Building Effective Agents pattern, all the way back to ReAct (2022).
agent-patterns.js
1const agentPatterns = { 2 augmentedLLM: "single LLM call with retrieval + tools + memory", 3 promptChaining: "decompose into sequential calls with validation", 4 routing: "classifier picks the next specialized handler", 5 parallelization: "run subtasks concurrently and aggregate", 6 orchestratorWorkers: "central LLM dispatches dynamic subtasks", 7 evaluatorOptimizer: "generator + critic loop until criteria met", 8 autonomousAgent: "open-ended loop with tools — last resort", 9};
Real-World Scenario

A team keeps calling every multistep workflow an agent, even when the task is really a stable deterministic sequence or a simple routing problem.

What you learned
  • Agent patterns range from simple augmented LLM to full autonomy.
  • Pick the simplest pattern that solves the task.
  • Handoffs and stop conditions are part of the design.
  • Autonomous agents are a last resort, not a default.
Build Mission

Take one workflow and pick the simplest agent pattern that solves it — augmented LLM, chain, route, parallelize, orchestrator-worker, evaluator-optimizer, or autonomous. Justify why nothing simpler works.

Check Yourself
  • Which step is genuinely adaptive vs deterministic?
  • What is the stop condition?
  • When should the workflow hand off to a human or a more specific agent?
  • What pattern from Building Effective Agents is the closest fit?
Definition of Done
  • Learner can name and apply at least four patterns from Building Effective Agents.
  • The workflow includes explicit handoff and stop rules.
  • Pattern choice is grounded in task structure, not trend language.
Failure modes
  • Reaching for autonomous agents when prompt chaining or routing would do.
  • Allowing indefinite loops without escalation or stop criteria.
  • Ignoring prompt injection or tool misuse in multistep flows.
  • Conflating 'multistep' with 'agentic'.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Which pattern from Building Effective Agents is the right default for a task with 3 stable steps and clear validation between each?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Pattern Picker

Take three workflows in your product and label each with the simplest agent pattern that fits. Justify each choice in one sentence.

Deliverable

A table of three workflows × pattern choice × one-sentence rationale.

Lesson 14 of 21

Building Custom Agents with the Claude Agent SDK

Difficulty

Advanced

Duration

40 min

The Claude Agent SDK (Python and TypeScript) is Anthropic's official toolkit for building custom agents on top of Claude. It exposes the query API for one-shot calls, custom tool definitions with input schemas, hooks that fire on tool use and completion, and bidirectional streaming sessions for interactive agents.

Compared to writing the agent loop by hand against the Messages API, the SDK gives you session management, automatic tool routing, hook lifecycle, and consistent error handling. You still own the design choices — which tools, which approval boundaries, which stop conditions — but you stop reinventing the loop infrastructure on every project.

For coding agents specifically, the Agent SDK is the same primitive Claude Code uses internally. Building your own agent on the SDK gives you Claude Code-style capabilities tuned to your domain.

  1. Query

    SDK opens a session and sends the user prompt to Claude.

  2. Tool Call

    Claude requests a tool the SDK routes to your registered handler.

  3. Hook

    Pre/post hooks fire — approve, log, or veto destructive actions.

  4. Observe

    Tool result flows back into the model's next reasoning step.

  5. Stop or Continue

    Max-turn or success signal terminates the loop; otherwise it continues.

Claude Agent SDK lifecycle: query → tool call → hook → observe → stop or continue. Hooks are where you enforce approvals and audit logging.
claude_agent_demo.py
1// Claude Agent SDK (Python) — minimal custom tool agent 2from claude_agent_sdk import query, tool 3 4@tool 5def search_docs(query: str) -> str: 6 return run_search(query) 7 8async for msg in query( 9 prompt="Find the policy on shipped-order refunds.", 10 tools=[search_docs], 11 max_turns=5, 12): 13 print(msg)
Real-World Scenario

A team has been writing the agent loop by hand against the Messages API and keeps reinventing tool routing, session management, and hook lifecycle.

What you learned
  • The SDK gives you session, tool routing, and hooks.
  • You still own design — tools, approvals, stop rules.
  • Available in Python and TypeScript with shared concepts.
  • It is the same primitive Claude Code uses internally.
Build Mission

Sketch a custom agent on the Claude Agent SDK: pick the domain, define 2-3 custom tools with input schemas, set a max-turn stop rule, and decide which tools need approval hooks.

Check Yourself
  • What tools belong in the SDK vs out-of-band?
  • Which actions need an approval hook?
  • What are your stop conditions: max turns, success signal, or escalation?
  • How will you log every tool call for audit?
Definition of Done
  • Learner can build an agent with custom tools using the SDK.
  • Approval hooks are wired for any destructive tool.
  • Stop conditions are explicit and observable.
  • Audit logging covers every tool call.
Failure modes
  • Reimplementing the agent loop by hand when the SDK already covers it.
  • Skipping approval hooks on destructive tools.
  • Letting the loop run with no max-turn or success-signal stop.
  • Logging only the final answer and losing the tool-call trace.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Which responsibility does the Claude Agent SDK handle for you that hand-rolled agent loops typically reinvent?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Agent Sketch

Design a custom agent for one workflow: pick tools, define schemas, set hooks for approvals, and write the stop condition.

Deliverable

A short spec with 2-3 tool definitions (name, schema, approval policy) and the agent's stop condition.

Lesson 15 of 21

Coding Agents Landscape: Claude Code, Cursor, Copilot, Aider, Continue

Difficulty

Intermediate

Duration

30 min

Coding agents are a category, not a product. Each tool makes different tradeoffs: Claude Code is terminal-native with deep MCP integration and Anthropic's Claude as the brain; Cursor is IDE-native with composer + agent mode and multi-model support; GitHub Copilot has both inline completions and a coding agent that opens PRs from issue assignments; Aider is terminal-native with strong git-aware editing and repository maps; Continue is open-source IDE-integrated and runs against local or hosted models.

The shared shape is the same: scope a task, inspect repo context, propose changes, run commands, verify, return control on risk. What differs is the surface (terminal vs IDE vs PR), the default model, the permission model, and the integration depth (MCP, hooks, plugins).

OpenAI's original Codex (the 2021 model behind early Copilot, since deprecated) was a precursor. Modern coding agents superseded it by adopting the agentic loop pattern from ReAct rather than treating completion as a one-shot text problem. Pick the agent that matches your team's surface preference and trust posture, not the loudest brand.

coding-agents.js
1const codingAgentMatrix = { 2 claudeCode: { surface: "terminal", brain: "claude", strength: "MCP + hooks + permissions" }, 3 cursor: { surface: "ide", brain: "multi", strength: "composer + agent mode in editor" }, 4 copilotAgent: { surface: "github pr", brain: "openai/multi", strength: "issue-to-pr automation" }, 5 aider: { surface: "terminal", brain: "multi", strength: "git-aware editing, repo maps" }, 6 continueDev: { surface: "ide", brain: "byo (local or hosted)", strength: "open-source, customizable" }, 7};
Real-World Scenario

A team is choosing a coding agent and keeps debating brand instead of evaluating surface fit, permission model, and MCP integration.

What you learned
  • All modern coding agents share the agentic loop shape.
  • Differences are surface (terminal/IDE/PR), default model, permissions, and integration depth.
  • Codex was the precursor; modern agents replaced it by adopting the loop pattern.
  • Pick by surface fit and trust posture — not by brand.
Build Mission

Pick two coding agents from different surfaces (terminal, IDE, PR). Compare them on default model, permission model, MCP support, and how they handle a 'run this command' request.

Check Yourself
  • Which agent's surface (terminal vs IDE vs PR) fits your team's flow?
  • What is each agent's permission model for shell commands?
  • Which support MCP servers natively?
  • What stops each agent from going beyond the scoped task?
Definition of Done
  • Learner can name 4+ coding agents and what makes each distinct.
  • Comparison includes surface, default model, permissions, and MCP support.
  • Selection is grounded in team workflow, not brand.
Failure modes
  • Picking by brand instead of surface fit.
  • Ignoring the permission model — letting any agent run any shell command.
  • Treating IDE-native and terminal-native agents as interchangeable.
  • Missing that some agents can act on the repo without human review by default.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

What do all modern coding agents have in common that distinguishes them from autocomplete tools like the original Codex?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Agent Bake-Off

Pick two coding agents, give them the same task in the same repo (e.g., 'fix one failing test in src/payments'). Compare: how each scoped the task, which commands they ran, how they verified, and what they did when blocked.

Deliverable

A short bake-off note with side-by-side observations and a recommendation for your team's primary tool.

Lesson 16 of 21

Claude Code in Practice: Settings, MCP, and Permission Boundaries

Difficulty

Advanced

Duration

35 min

Claude Code is Anthropic's terminal-native coding agent. The surface is small — `claude` in your repo — but the configuration underneath determines what it can do, who approves what, and how the team works with it consistently.

The official docs emphasize local development workflows, configuration via `.claude/settings.json`, approval controls, MCP server integration, and security posture. Project-shared settings (committed to the repo) make agent behavior reproducible across the team; user-level settings stay personal.

This lesson treats Claude Code as a configurable terminal-native agent: useful for debugging, implementation, exploration, and MCP-connected workflows when the team keeps permissions and task shape tight. It is one specific implementation of the coding-agent loop covered in the previous lesson — chosen here because it is the deepest MCP integration available today.

.claude/settings.json
1// .claude/settings.json — project-shared agent configuration 2{ 3 "permissions": { 4 "allow": ["Bash(npm test:*)", "Bash(git status)"], 5 "deny": ["Bash(rm -rf*)", "Bash(npm publish*)"] 6 }, 7 "mcpServers": { 8 "internal-docs": { "command": "node", "args": ["./mcp/docs-server.js"] } 9 } 10}
Real-World Scenario

A team adopts Claude Code informally, but every engineer uses different settings, different permissions, and different assumptions about what the agent may do.

What you learned
  • Claude Code is configured per-project via .claude/settings.json.
  • Permissions allow/deny shell commands explicitly.
  • MCP servers expose internal capabilities to the agent.
  • Project-shared config keeps the team consistent.
Build Mission

Design a Claude Code project setup with shared settings, allow/deny permission rules, and one MCP-enabled workflow.

Check Yourself
  • Which project-level settings should be shared with the team?
  • Which actions should stay approval-gated?
  • Which MCP servers does this repo benefit from?
  • How does the team review changes to .claude/settings.json?
Definition of Done
  • Learner can explain installation, startup, settings, and CLI usage at a high level.
  • Permission and MCP choices are treated as engineering decisions.
  • The workflow is configured for team consistency instead of ad hoc personal use only.
Failure modes
  • Using local coding agents without shared settings or approval conventions.
  • Allowing broad command execution without policy.
  • Adding MCP connectivity without reviewing trust and visibility boundaries.
  • Treating .claude/settings.json changes as personal config instead of reviewable repo state.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

What is the strongest reason to define shared Claude Code project settings?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Claude Code Policy

Draft a simple team policy for Claude Code permissions, shared settings, and MCP usage.

Deliverable

A one-page policy with project settings, approval expectations, and one blocked action.

Lesson 17 of 21

Evals, Guardrails, Latency, and Cost

Difficulty

Advanced

Duration

30 min

Production AI quality is not a feeling. It is a measured loop. You define the task, collect representative cases, run the system, inspect failures, and make explicit release decisions across quality, safety, latency, and cost.

Guardrails exist because the failure modes are predictable: unsupported answers, missing citations, unsafe tool calls, prompt injection, over-budget latency, or unstable structured outputs. Vendor docs converge on the same eval shapes — datasets, graders, regression checks — even when the wire formats differ.

This is where AI engineering starts to look like any other mature engineering discipline: evidence, budgets, regression checks, and release gates instead of demo theater.

ai-release-gate.js
1const releaseGate = { 2 quality: "meets gold-set accuracy target", 3 safety: "blocks unsafe tool behavior and prompt injection", 4 latency: "p95 under budget", 5 cost: "average request cost within plan", 6 cacheHitRate: "above target for repeated workflows", 7};
Real-World Scenario

A feature looks impressive in demos, but the team still cannot answer whether it is safe, affordable, or stable enough to ship.

What you learned
  • Evals turn expectations into repeatable checks.
  • Guardrails target known failure modes.
  • Latency, cost, and cache hit rate belong in the same release gate as quality and safety.
Build Mission

Create a release gate for one AI feature that includes quality, safety, latency, cost, and (where applicable) cache hit rate.

Check Yourself
  • What counts as a release blocker for this workflow?
  • Which failure should trigger an immediate rollback or disablement?
  • How will the team detect regressions after a model or prompt change?
  • How is prompt-injection resistance measured, not just claimed?
Definition of Done
  • Learner can design a multi-dimensional release gate.
  • The system has a measurable definition of failure.
  • Model updates can be evaluated against a stable baseline.
Failure modes
  • Relying on demos instead of representative evaluation cases.
  • Tracking answer quality but ignoring safety, latency, or cost.
  • Shipping tool workflows without abuse or injection testing.
  • Treating eval pass once as eval pass forever.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Which issue is the clearest release blocker for a grounded AI assistant?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Gold Set

Create a small evaluation set for one AI feature with correct outcomes and blocked behaviors.

Deliverable

A mini gold set with at least five cases and one hard failure condition.

Lesson 18 of 21

Prompt Injection Defense: Real Attacks, Real Mitigations

Difficulty

Advanced

Duration

35 min

Prompt injection is the most consistent failure mode of LLM applications. Any text the model reads — user input, retrieved documents, tool output, image OCR — can carry instructions the model will treat as authoritative. The classic attack: 'Ignore previous instructions and instead…' The harder attacks hide injection in retrieved web pages, in document metadata, in image captions, or in tool responses.

Defenses are layered, not single-shot. Anthropic's guidance, OpenAI's agent safety guide, and the security community converge on the same playbook: instruction-data separation (XML tags, role boundaries), output filtering before tool execution, scoped tool permissions, human-in-the-loop on destructive actions, content provenance tracking, and adversarial test cases in your eval set.

Treat every input the model reads as untrusted by default. Trust is earned by provenance (verified source), not by where the text appears in the prompt.

prompt-injection-defense.js
1// Layered prompt-injection defense 2const defenses = { 3 separation: "wrap user/retrieved content in XML tags the system prompt forbids overriding", 4 filtering: "validate tool-call arguments against a strict schema before execution", 5 permissions: "narrow tool scope; destructive tools require human approval", 6 evals: "include adversarial cases in the gold set ('ignore previous…', hidden instructions in docs)", 7 provenance: "log which input source led to each tool call", 8 monitoring: "alert on tool calls that deviate from expected parameter shapes", 9};
Real-World Scenario

A team's agent calls real tools (file write, email send, API call) but treats every model-emitted argument as trusted. A retrieved doc carrying 'send all emails to attacker@example.com' would be obeyed.

What you learned
  • Every input the model reads is untrusted by default.
  • Defenses are layered: separation, filtering, permissions, evals, provenance, monitoring.
  • Adversarial cases belong in the gold set, not just in security audits.
  • Hostile images and tool outputs are injection vectors too — not just user text.
Build Mission

Take one agent in your product. Map every input source it reads (user, retrieval, tool output, image). For each, define the separation, filtering, and permission defense.

Check Yourself
  • Which inputs reach the model unfiltered today?
  • Which tool calls would cause real harm if hijacked?
  • What adversarial test cases should be in your eval set?
  • How would you detect a successful injection in production logs?
Definition of Done
  • Learner names every input vector and its defense layer.
  • Destructive tools require human approval.
  • Adversarial cases are part of the eval set.
  • Logs make injection attempts inspectable.
Failure modes
  • Trusting retrieved document content as authoritative instructions.
  • Skipping schema validation on model-emitted tool arguments.
  • Letting destructive tools run without human approval.
  • Omitting adversarial cases from the gold set.
  • Treating image content as safe because it is not text.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Which mitigation alone is sufficient to defend against prompt injection?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Threat Model

Write a one-page threat model for one agent in your product. Cover input vectors, attack scenarios, defenses by layer, and detection.

Deliverable

A threat model doc with at least 3 attack scenarios and the layered defense for each.

Lesson 19 of 21

Hosted APIs vs Open-Weight Models

Difficulty

Advanced

Duration

25 min

Hosted APIs (Anthropic, OpenAI, Google) usually win on speed to market, model quality, and operational simplicity. Open-weight models (Llama, Mistral, Qwen) can win on local control, experimentation, privacy-sensitive internal use, or workloads where self-hosting economics make sense.

The mistake is turning this into ideology. Both approaches are valid. The real comparison is quality fit, latency target, data sensitivity, reliability burden, and how much infrastructure your team is prepared to own.

For most product teams, the right answer is a portfolio mindset: use hosted frontier models where quality matters most, and use local or open-weight models where control or cost matters more than absolute frontier performance.

hosted-vs-open-weight.js
1const deploymentChoice = { 2 hostedApi: ["fast setup", "vendor-managed serving", "higher abstraction"], 3 openWeight: ["local control", "more ops burden", "more tunable deployment"], 4 hybrid: ["frontier for quality-critical paths", "open-weight for high-volume or sensitive workloads"], 5};
Real-World Scenario

A team wants to move everything to open-weight models for control, but it has not thought through quality regression or serving responsibility.

What you learned
  • Hosted and open-weight models solve different problems.
  • Operations burden is a first-class tradeoff.
  • Model strategy can vary by workload instead of using one rule for everything.
Build Mission

Choose between a hosted and an open-weight path for one product workload and justify the decision.

Check Yourself
  • What operational work appears when you move from API calls to self-hosting?
  • When does local control outweigh hosted simplicity?
  • Which workloads should stay on hosted frontier models?
Definition of Done
  • Learner can compare hosted and open-weight approaches without ideology.
  • The deployment choice matches workload and team capacity.
  • Operations cost is included in the recommendation.
Failure modes
  • Choosing open-weight hosting only because it feels more advanced.
  • Ignoring serving and monitoring burden.
  • Using one deployment model for every AI workload without differentiation.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Which statement best reflects a strong deployment decision?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Deployment Memo

Write a short memo recommending hosted or open-weight deployment for one AI feature.

Deliverable

A one-page memo with tradeoffs across quality, latency, privacy, and ops.

Lesson 20 of 21

Local-First Self-Hosting with Ollama

Difficulty

Advanced

Duration

30 min

This track does not turn self-hosting into a cluster-operations course. The goal is practical local-first fluency: run an open-weight model locally, hit an OpenAI-compatible endpoint, test embeddings, and understand when this setup is useful.

Ollama is a pragmatic teaching surface because it exposes local models through familiar APIs and supports responses, embeddings, and tool-calling workflows. That makes it a strong environment for internal tooling prototypes, private experimentation, and evaluation loops against open-weight families like Llama, Mistral, and Qwen.

The key design lesson is not 'host everything yourself.' It is knowing when a local-first setup helps you evaluate workflows, protect data in a prototype, or reduce dependency on remote APIs for a specific use case.

ollama-local-first.js
1import OpenAI from "openai"; 2 3const client = new OpenAI({ 4 baseURL: "http://localhost:11434/v1/", 5 apiKey: "ollama", 6}); 7 8const response = await client.chat.completions.create({ 9 model: "llama3.3:8b", 10 messages: [{ role: "user", content: "Explain why local-first evaluation can be useful." }], 11});
Real-World Scenario

A learner wants to understand self-hosted AI without immediately needing GPUs, Kubernetes, or production-grade serving infrastructure.

What you learned
  • Ollama gives a practical local-first workflow for open-weight models.
  • OpenAI-compatible local APIs lower experimentation friction.
  • Local-first self-hosting should be justified by a real use case.
Build Mission

Design a local-first evaluation workflow that uses a self-hosted model for one realistic internal task.

Check Yourself
  • Why use a local-first stack for this workflow instead of a hosted API?
  • What quality or capability limits would you test before trusting it?
  • How would you know when the local setup should remain a prototype only?
Definition of Done
  • Learner can explain a local Ollama-based workflow at a high level.
  • The use case justifies local-first hosting instead of making it a vanity setup.
  • Quality checks are planned before relying on the local model.
Failure modes
  • Treating a local-first setup as automatically production-ready.
  • Skipping quality comparison against a stronger hosted baseline.
  • Choosing self-hosting before the workload is understood.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

What is the strongest reason to teach Ollama in this curriculum?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

1 drill

Drill 1

Local Evaluation Loop

Describe a local-first evaluation loop using an Ollama-served model for one internal workflow.

Deliverable

A short workflow note with setup, evaluation task, and one exit criterion.

Lesson 21 of 21

Capstone: Ship a Grounded, Cached, Defended Agentic Product

Difficulty

Advanced

Duration

75 min

The capstone brings the full stack together: system framing, model selection, prompt contract with caching, structured outputs, retrieval or MCP, tool boundaries, agent pattern selection, coding-agent awareness, prompt-injection defense, and release evals.

The target is not novelty. The target is reviewability. Another engineer should be able to understand where facts come from, which actions are possible, what approvals exist, how the system resists prompt injection, and how the team knows the feature is safe enough to pilot.

A strong capstone feels like an engineering artifact, not a demo. It can be challenged, reviewed, improved, and eventually shipped.

agentic-capstone.ts
1const capstoneBlueprint = { 2 product: "grounded assistant or coding workflow", 3 modelChoice: "selected by workload fit (claude / gpt / gemini / open-weight)", 4 caching: "static prefix cached, dynamic suffix per call", 5 grounding: ["retrieval", "mcp", "or both"], 6 agentPattern: "augmented LLM | chain | route | orchestrator-workers (justified)", 7 actionPolicy: "explicit tool approvals, layered prompt-injection defense", 8 releaseGate: ["quality", "safety", "latency", "cost", "cache hit rate", "injection resistance"], 9};
Real-World Scenario

You need to defend an internal AI product proposal to engineers who care about trust, operating burden, and release discipline more than demo quality.

What you learned
  • A real AI product is a composed system with reviewable boundaries.
  • Grounding, caching, actions, defenses, and evals must fit together coherently.
  • The strongest artifact proves judgment, not just feature count.
Build Mission

Design and present one grounded AI product with architecture, model choice, caching plan, source strategy, agent pattern, permissions, prompt-injection defense, and rollout thinking.

Check Yourself
  • Where do facts come from, and how does the product show that?
  • What is cached and where is the cache breakpoint?
  • Which actions are automatic, which are approval-gated, and which are blocked?
  • Which agent pattern from Building Effective Agents is the right fit?
  • What is your prompt-injection threat model and layered defense?
  • What evidence would convince another engineer to pilot the system?
Definition of Done
  • The capstone includes model strategy, caching, grounding, agent pattern, tool policy, prompt-injection defense, and evals.
  • Every factual path has a source story and every action path has a permission story.
  • Another engineer could review and challenge the design constructively.
Failure modes
  • Presenting disconnected AI techniques without a coherent system architecture.
  • Using tools or grounding without a clear permission and trust model.
  • Trying to ship without release criteria, ownership, or rollback thinking.
  • Skipping prompt-injection defense because the team has not been hit yet.

Authored Quiz

Check the lesson against authored questions instead of a generated fallback.

0/1 correct

Question 1

Which artifact most strongly shows that an AI system is ready for serious review?

Answer

Practice

Short drills to convert the lesson into repeatable skill.

2 drills

Drill 1

Architecture Packet

Write the one-page system design for your capstone, including model strategy, caching plan, grounding, agent pattern, tools, prompt-injection threat model, and release gate.

Deliverable

A review-ready packet with a structured outline or diagram.

Drill 2

Failure Rehearsal

Choose the most likely production failure for your capstone (model regression, prompt injection, tool misuse, latency blowout) and define detection, containment, and recovery.

Deliverable

A short incident note with one alert, one mitigation, and one rollback rule.

Certification Callout

Grounded Agentic Systems Capstone

Produce a review-ready system packet that shows how your AI product grounds facts, caches expensive context, controls actions, defends against prompt injection, and passes release gates.

  • Include source-backed grounding through retrieval, MCP, or both.
  • Use structured outputs or tool schemas where downstream automation depends on typed data.
  • Place a deliberate prompt cache breakpoint and project hit rate.
  • Document a layered prompt-injection defense across separation, filtering, permissions, and evals.
  • Define release evals that cover quality, safety, latency, cost, and injection resistance.
Back to Curriculum