Agentic engineering

Engineer's Training

A disclaimer

This field moves fast. Nobody has all the answers. I don't have the answers.

We're all figuring this shit out together.

What follows are heuristics, not absolute truths.

AI is the messiest tech

There are real ecological, economic, and ethical issues, especially with today's closed frontier models.

The job is shifting under our feet, and we didn't opt in to this.

The skills that made you a great engineer two years ago aren't the same skills that will make you great two years from now. That's uncomfortable.

  • We all feel like beginners again
  • You will question whether "engineering" still means what you thought it meant

Lean into it. The engineers who thrive won't be the ones who mastered one fixed toolkit. They'll be the ones who stayed comfortable being uncomfortable.

Where are you today?

What You'll Walk Away With

  • A mental model of the AI coding maturity ladder and where you sit on it
  • Practical agentic engineering heuristics you can apply tomorrow
  • Understanding of what it takes (skills and dollars) to move up
  • Hands-on experience with in-the-loop and on-the-loop workflows

Structure

20% foundations, 80% where the leverage is

  1. Foundations (~20%): LLM mechanical sympathy, chat, mid-loop
  2. In-the-loop agentic coding (~30%): The level every senior engineer needs
  3. On-the-loop agentic coding (~25%): Delegation, specification, parallel work
  4. Multi-agent orchestration (~15%): The frontier
  5. Your growth path (~10%): Mapping skills, costs, next steps

Part 1: Foundations

Table stakes. Let's move through this quickly.

"You don't have to be an engineer to be a racing driver, but you do have to have Mechanical Sympathy."

Jackie Stewart

You don't need to become an ML engineer. But as a power user you need to know how LLMs work.

They're Next-Token Predictors

Given "The cat sits on the", the model outputs probability distributions over possible next tokens.

The model isn't thinking. It's applying statistical patterns from training data.
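A toy sketch of this idea, with made-up scores for candidate next tokens (nothing here comes from a real model):

```python
import math

# Made-up raw scores for candidate continuations of "The cat sits on the".
logits = {"mat": 4.2, "floor": 3.1, "roof": 2.5, "keyboard": 1.8}

# Softmax: turn raw scores into a probability distribution.
total = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / total for tok, v in logits.items()}

for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok:>9}: {p:.2f}")
```

The model then samples from a distribution like this: "mat" is likely, but "keyboard" is always possible.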

Context Window = RAM

Everything the model reasons about must fit in the context window.

System prompt, your code, conversation history, the model's response: all competing for the same finite space.

What's outside the window is invisible.
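A back-of-the-envelope sketch of that budget; every number below is an illustrative guess, and real window sizes vary per model:

```python
CONTEXT_WINDOW = 200_000  # tokens; illustrative, varies per model

# Everything below competes for the same finite space.
usage = {
    "system prompt": 2_000,
    "agent instructions": 3_000,
    "pasted source files": 60_000,
    "conversation history": 40_000,
    "tool outputs": 25_000,
}

used = sum(usage.values())
print(f"used {used:,} / {CONTEXT_WINDOW:,} tokens ({used / CONTEXT_WINDOW:.0%})")
# Whatever doesn't fit is simply invisible to the model.
```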

Fuzzy Memory vs. Sharp Memory

Training data is fuzzy: statistical patterns, potentially outdated, confidently wrong about details.

Context data is sharp: the model can precisely reference what you put in front of it.

Paste the code. Include the error. Provide the docs. Don't trust the model to "know" your framework's API.

Hallucination Is Structural

The model predicts what sounds right, not what is right.

It will confidently suggest APIs that don't exist.

Verify everything, or give the AI tools to check its own work.

Climbing the maturity ladder together

  • Get your reps in individually
  • Scale to team-level standards: Rule of three

Chat and Mid-Loop: Expected Baseline

Chat: copy-paste code, ask questions, ideate, iterate. The key skills are writing good prompts and providing good context.

Mid-loop: autocomplete on steroids. The key skill: evaluating alternatives quickly and critically.

Both of these are expected baseline competency today.

Part 2: In-the-Loop Agentic Coding

The first big jump on the ladder

What Makes It Agentic

An agent is an LLM using tools in a loop to achieve a goal.

  • Tools: reading files, running commands, searching code, executing tests
  • Loop: iterative cycle of perceiving, reasoning, acting
  • Goal: your specified objective

Agents do things in your development environment.
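That definition fits in a few lines; `call_llm` below is a stand-in for a real model API, and the tool set is deliberately tiny:

```python
import subprocess
from pathlib import Path

# Tools: things the agent can actually do in your environment.
TOOLS = {
    "read_file": lambda path: Path(path).read_text(),
    "run_command": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True).stdout,
}

def run_agent(goal, call_llm, max_steps=10):
    """Loop: perceive (history) -> reason (call_llm) -> act (tool call)."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = call_llm(history)          # reason about what to do next
        if action["tool"] == "finish":
            return action["answer"]         # goal reached
        result = TOOLS[action["tool"]](action["input"])      # act
        history.append({"role": "tool", "content": result})  # perceive
    return None  # step budget exhausted without reaching the goal
```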

The Agentic Context Flow

  • LLM (CPU): generates responses based on context
  • Context Window (RAM): everything the model can see right now
  • Agentic Harness Layer: the crucial mediator
    • Prompting, retrieval, tool calling
  • External resources: you, files, tools

Understanding this architecture is the difference between "using a tool" and "engineering an effective workflow."

Context Management Is Everything

Garbage context, garbage output. Curate what goes in.

  • Don't flood the context window: more isn't better, relevant is better
    • source/docs for homebrew libraries
    • good examples (service templates)
  • Reset context between iterations: know when to start fresh. Dumb zone vs. smart zone.
  • AI responds to what you explicitly ask, not what you implicitly meant

Prompts Are Code

  • Prompts are wishes: the model will try to fulfill them literally. Be careful what you wish for.
  • Version control them, test them, refactor them.
  • One good example beats a hundred lines of detailed spec prose.

Point to an existing test file, an existing service, a reference implementation.

Guard Rails Over Good Intentions

If a machine can check it, don't ask the model to remember it.

Wire up a linter, a compiler, a test suite.
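A minimal sketch of wiring that up, assuming a Python project checked with ruff and pytest (both tool choices are assumptions; substitute your own stack):

```python
import subprocess

# Deterministic checks the agent's output must pass; run these from a
# hook or pre-commit step instead of asking the model to "remember".
CHECKS = [
    ["ruff", "check", "."],   # linting
    ["pytest", "-q"],         # test suite
]

def gate(checks=CHECKS):
    """Return 0 if every check passes, 1 at the first failure."""
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            print("guard rail failed:", " ".join(cmd))
            return 1
    return 0
```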

Know which mode you're in: human in the loop, human on the loop, human out of the loop. More AI autonomy means a bigger agentic harness.

Workflow Discipline

  • Separate decisions from execution: decide what to do first, then let the agent execute
  • Velocity is speed + direction: fast in the wrong direction is worse than slow in the right one
  • Not as well as I would do it, good enough to commit: calibrate your quality bar. Don't lower it, but we are not watchmakers anymore.

Perfectionism kills the benefit today. YOLO kills the benefit tomorrow.

The Augmented Coding Pattern Language

Lada Kesseler's named set of patterns, anti-patterns, and obstacles that recur across agentic workflows.

Naming things matters. Once you can say "we have a distracted agent problem" or "this is context rot," the team can communicate and fix things faster.

https://lexler.github.io/augmented-coding-patterns/talk/

Obstacles

Structural constraints you work around:

  • LLM cannot learn: stateless by design, memory is a tooling issue
  • Context window constraints: fixed size, smart zone vs. dumb zone
  • Excess verbosity: LLMs are token hogs by default
  • Non-determinism: same input, different outputs
  • Hallucinations: confidently fabricated, but code is self-verifiable
  • Degrades under complexity: effectiveness drops non-linearly
  • Compliance bias: RLHF trains models to be people pleasers

Patterns

Practices that consistently produce good results:

  • Context management (P1): curate what goes in, start fresh
  • Knowledge document (P2): quickly reload context into a fresh session
  • Ground rules (P3): CLAUDE.md, always in context
  • Focused agent (P6): specialist agents over generalist sessions
  • Knowledge composition (P8): granular docs over monolithic agent.md
  • Parallel implementations (P12): roll multiple dice, remix the best
  • Chain of small steps (P17): tiny, verifiable increments
  • Hooks (P18): deterministic force against innate LLM stubbornness

Anti-Patterns

  • Distracted agent (AP5): unfocused context windows produce slop
  • Perfect recall fallacy (AP14): don't assume the model knows your framework's API
  • Unvalidated leaps (AP16): assumptions stacked on assumptions, errors compound in agentic sequences
  • Silent misalignment (AP20): the agent misunderstands but produces plausible output
  • Answer injection (AP24): anchoring the agent's solution space with your preferred answer

Two Flavors of LLM Stubbornness

Fighting against training ("don't write code comments"): fix with deterministic post-processing (hooks, scripts), not by prompting harder.

Or wait 6 months for better models 😅

Fighting against context rot: fix with fresh sessions and knowledge checkpoints.

Building Your Agentic Infrastructure

The work that compounds:

  • HIGHLY CURATED agent instructions (CLAUDE.md, copilot-instructions.md): encode project conventions and architecture decisions; AI slop is counterproductive here
  • Skills and slash commands: reusable workflows for common tasks, shared on team level
  • Hooks: deterministic checks wired into the agentic loop
  • MCP server configurations: specialized tooling access
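What "highly curated" can look like in practice; this excerpt is illustrative, not a prescription:

```
# CLAUDE.md (illustrative excerpt)

- Architecture: hexagonal; domain code never imports adapter code.
- Tests: pytest, one behaviour per test; no mocking domain objects.
- Run `make check` before declaring any task done.
- Never touch generated files under gen/.
```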

This curation habit makes the difference between decent output on the first try and constant correction.

On productivity (CircleCI 2026)

Maintaining Your Standards

AI is a double-edged amplifier: it amplifies good practices and bad ones.

  • Count your babies: verify every component was delivered to spec
  • Check for cardboard muffins: code that looks right but has no substance
  • Demand excellence explicitly: state your quality standards
  • Clean as you go: tech debt accumulates faster with generated code
  • Trust but verify relentlessly: working code can mask deep quality issues
  • Use safety nets: Git, local dev, fast tests, sonar

Essential Safety Nets

Before scaling up agent usage, you need:

  • Well-designed codebase: clean architecture constrains agent mistakes
  • Fast local builds: tighter feedback loop = cheaper mistakes
  • Fast, comprehensive test suites: your primary verification mechanism
  • Static analysis and security scanners: catch what humans miss
  • Version control + small steps: each commit is a save point

These aren't optional extras. They're prerequisites.

Hands-On: Gilded Rose Kata

  1. Add agent instructions so it knows the codebase
  2. Have the agent refactor the code, backed by existing tests
  3. Have the agent implement "conjured items"

Tips: stay in the goldilocks zone, commit/reset often, give specific refactoring directions.

Reflect: How often did you start a new session? How often did the LLM hallucinate? How happy are you with the result?

Part 3: On-the-Loop

Delegation at scale

Paint by Numbers

You draw the picture. Set the boundaries. Pick the colours. The agent fills it in.

If you never had the picture in your head to begin with, you're not painting by numbers. You're having a mediocre slop artist decorate your house.

The on-the-loop workflow works because the structure exists. You drew the lines. The agent colours inside them.

The Assignment Model

You're no longer watching the agent work. You're delegating like you would to a competent but literal-minded junior.

  1. Write a specification (the "what" and "why")
  2. Point to reference implementations (the "how it should look")
  3. Define acceptance criteria (the "done")
  4. Provide a verification command (the "proof")
  5. Hand it off, go do something else
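A hand-off spec can be a lightweight template; the fields below mirror the five steps, and every concrete name is illustrative:

```
## Spec: <short task name>
Why: one or two sentences of business context.
Constraints: patterns and files to follow (e.g. emulate services/billing/).
Acceptance criteria:
- [ ] testable outcome 1
- [ ] testable outcome 2
Verify with: a single command, e.g. `make verify`
```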

Specification Quality Becomes Critical

When you're babysitting the agent, a vague spec is fine. When you're on the loop, the spec is the goalpost.

A good spec includes:

  • Clear business context (why this matters)
  • Precise technical constraints (patterns to follow)
  • Explicit acceptance criteria (testable outcomes)
  • Reference implementations (existing code to emulate)
  • Verification commands (how to confirm success)

What's absent: step-by-step instructions.

Spec-Driven Development: Three Levels

Spec-first: write spec, generate code, discard spec.
Spec-anchored: spec and code co-evolve.
Spec-as-source: the spec is the source of truth.

The higher the level, the more leverage from on-the-loop work.

Horizontal Scaling

  • The goldilocks zone: too small and overhead dominates, too large and the agent loses coherence
  • One file, one owner: avoid agents stepping on each other's code. git worktrees/containers.
  • Leverage optionality: run multiple approaches in parallel, pick the winner

Speed without guardrails just means you crash the rocket quicker.

Guardrails Are Non-Negotiable

At level 3 (in-the-loop), guardrails are helpful. At level 4 (on-the-loop), they're non-negotiable.

The code you see for the first time should have already passed:

  • Compilation
  • Linting and formatting
  • Full test suite execution
  • Architecture checks
  • Static code analysis and security scanning

Harness Engineering

Birgitta Boeckeler's framework: everything in an AI agent except the model itself.

Guides (feedforward): anticipate behavior before actions. Your instructions.md, reference architecture, coding standards.

Sensors (feedback): observe after action and enable correction. Your tests, linters, architecture checks.

Feedforward-only = unverified rules. Feedback-only = repeated mistakes. You need both.

The 2x2 Matrix

|               | Guides (feedforward)                                | Sensors (feedback)                             |
|---------------|-----------------------------------------------------|------------------------------------------------|
| Computational | Coding conventions, bootstrap scripts, type systems | Tests, linters, type checkers, static analysis |
| Inferential   | Documentation, architecture descriptions, examples  | AI-powered code review, semantic analysis      |

Three Regulation Categories

Maintainability harness: linters, static analysis, code quality tools. Solved problem, wire it up.

Architecture fitness harness: ArchUnit-style checks, fitness functions, performance tests.

Behaviour harness: does the code do what it's supposed to? Clear lineage from acceptance criteria to tests.

Build Today's Harness, Deprecate It Tomorrow

Today's LLMs absolutely need a curated harness. Build it out.

But expect components to be ripped out as model capability increases. Some guardrails will become native model capabilities. Some will shift to commodity tooling.

Keep quality left: pre-commit (fast, cheap), post-integration (thorough), continuous monitoring.

When On-the-Loop Works Best

  • Time-intensive refactoring (clear what, tedious how)
  • Well-understood features with established patterns
  • Legacy code modernization and documentation. Porting is cheaper today.
  • Framework version upgrades

Pattern: the what is clear but the how is tedious.
For exploratory or uncertain work, stay in the loop or use it as a POC-builder and option-generator.

Part 4: Multi-Agent Orchestration

The frontier

Three Emerging Patterns

Agent swarms. Multiple agents working in parallel on related tasks. A "Mayor" agent coordinates workers executing in parallel, or they collaborate peer-to-peer using a shared task board.

Subagents. Hierarchical delegation: a primary agent spawns specialists or interns (context window management). The kitchen brigade metaphor.

AI-as-a-judge. One AI evaluates the work of another. Checks and balances within the AI toolchain.
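A minimal sketch of the shared-task-board idea, with threads standing in for agents; a real swarm would persist the board in a file, database, or issue tracker:

```python
import threading

class TaskBoard:
    """Workers claim tasks atomically, so no task is picked up twice."""
    def __init__(self, tasks):
        self._lock = threading.Lock()
        self._open = list(tasks)
        self.done = []

    def claim(self):
        with self._lock:
            return self._open.pop(0) if self._open else None

    def complete(self, task, result):
        with self._lock:
            self.done.append((task, result))

board = TaskBoard(["refactor pricing", "port auth module", "update docs"])

def worker(name):
    # Each "agent" pulls work until the board is empty.
    while (task := board.claim()) is not None:
        board.complete(task, f"{name} finished {task}")

agents = [threading.Thread(target=worker, args=(f"agent-{i}",)) for i in range(3)]
for t in agents: t.start()
for t in agents: t.join()
```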

Multi-Agent Heuristics

  • The goldilocks zone: break work into right-sized chunks for parallel execution
  • One file, one owner: prevent merge conflicts between concurrent agents. Worktrees or containers.
  • Heresies: incorrect patterns emerge and catastrophically propagate across agents. Document what not to do as much as what to do.

Open Questions (Early 2026)

  • Orchestration design: specialized agents vs. generalist workers?
  • Handling merge conflicts between parallel agents?
  • Managing shared state across agent contexts?
  • What does mastery of this level actually look like?

The tools are immature. The patterns aren't established. The honest answer to "what's best practice?" is: nobody knows yet.

The skill is staying curious, experimenting deliberately, and sharing what you learn.

Part 5: Your Growth Path

Where are you, and where do you want to go?

Yegge's Eight Levels

~70% of engineers are still at levels 1-2.

The gap is big and starting to widen. Compounding at work.

Levels to Skills and Costs*

| Level                    | Core Skills                                       | Approx. Cost   |
|--------------------------|---------------------------------------------------|----------------|
| 1-2: Chat + Autocomplete | Prompt engineering, context basics                | Free - $40/mo  |
| 3-4: In-the-loop         | Context engineering, harness building, guardrails | $40 - $200/mo  |
| 5-6: On-the-loop         | Spec writing, work breakdown, horizontal scaling  | $200 - $600/mo |
| 7-8: Multi-agent         | Orchestration design, pipeline engineering        | $2,000+/mo     |

Levels 1-4 are per-seat OPEX. Levels 5+ shift toward platform OPEX.

The Shifting Skill Landscape

Fading: syntax memorization, handwriting boilerplate, refactoring keyboard shortcuts

Evergreen (more important than ever): system design, specification, judgment and taste, accountability

New to develop: prompt/context/harness engineering, debugging agentic workflows, pipeline design

The shift: from human -> IDE -> code to human -> AI agent(s) -> code

The AI Vampire

AI demonstrably increases productivity but also causes mental fatigue.

Engineers become more productive, but only maintain ~3 hours/day of peak output.

AI automates the easy work. Humans do only the hardest cognitive tasks. The work that remains is denser, more demanding.

Not a reason to avoid AI. A reason to manage energy, expectations, and workload differently.

Security: The Lethal Trifecta

Three factors that combine to create serious risk:

  1. Access to private data (code, credentials, customer data)
  2. Ability to externally communicate (HTTP requests, API calls)
  3. Exposure to untrusted content (user prompts, PR descriptions)

When all three overlap: complete attack chain. The mitigation: eliminate at least one.
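Eliminating the "externally communicate" leg can be mechanical, e.g. an egress allowlist in front of every outbound call (hostnames below are illustrative):

```python
from urllib.parse import urlparse

# Assumption: only your own internal API should ever be reachable.
ALLOWED_HOSTS = {"api.internal.example.com"}

def egress_allowed(url: str) -> bool:
    """Gate every outbound request through this check."""
    return urlparse(url).hostname in ALLOWED_HOSTS
```

Blocking unknown hosts removes one leg of the trifecta, which breaks the attack chain even if the other two remain.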

Getting Started: Your Next Two Months

  • Week 1-2: Find your level on the ladder. Then drop one level below to compensate for Dunning-Kruger.
  • Week 3-6: Climb one level.
    • At 1-2: get comfortable with agentic tools, watch the agent work, learn from the frustration
    • At 3-4: extract reusable artifacts, build agent instructions, wire up hooks
    • At 5+: experiment with parallel agents, practice spec writing
  • Week 7-8: Consolidate. Extract what you learned. Share, institutionalize and industrialize as a team

The Golden Rule

For two months: think AI-first, always, until you've grown enough intuition for its jagged edges.

Not "use AI for everything." But "consider inviting AI to the table first, then decide."

Reflection

Three questions. 15 minutes. 1-2-4-all format.

  1. What did I find most surprising?

  2. What do I want to try first thing tomorrow?

  3. What's blocking me from climbing the next rung on the ladder?

Resources

  • Agentic Engineering Heuristics (complete collection)
  • Steve Yegge - Pragmatic Engineer podcast (8 levels deep dive)
  • Augmented Coding Pattern Language - Lada Kesseler (interactive map)
  • Harness Engineering - Birgitta Boeckeler (article)
  • Vibe Coding book by Steve Yegge & Gene Kim
  • AI Engineering book by Chip Huyen
  • Build a real working AI coding agent (hands-on workshop for building mechanical sympathy)


https://www.youtube.com/@jovaneyck

https://agentic-engineering-weekly.ghost.io/

https://addyosmani.com/blog/code-agent-orchestra/
