From Models to Systems:
Engineering Trustworthy AI Agents

Yuan (Emily) Xue
Vanderbilt University · Institute for Software Integrated Systems Research Seminar
April 2026

Three Curves. One Shift.

Open-Source Autonomous Agents

335K

GitHub stars in 3 months

38M

Monthly visitors to openclaw.ai

44K+

Extensions by 12,000+ developers

Coding Agent Adoption

135K+

GitHub commits per day · ~4% of all public commits

42,896x

Growth in 13 months since research preview

20%+

Projected share of daily GitHub commits by end of 2026

Source: SemiAnalysis, Feb 2026

AI Traffic in Cyber Systems

8x

Automated traffic growing 8x faster than human

187%

AI-driven traffic growth year over year

7,800%+

Agentic AI traffic surge

Source: HUMAN Security 2026 Report

AI agents are now active participants in the software supply chain and cyber infrastructure.
  

Story 1

The Viral Rise of OpenClaw

GitHub Star History: React vs OpenClaw vs Linux

Open-source agent repo growth vs established projects

Why It Felt Different

Long-Running

Persists over time · keeps state · stays alive, stays connected

Goal-Directed

Decomposes tasks · retries & adapts

Surprising

Finds paths you didn't script · uses tools creatively

Story 1

How It Works

The agent execution loop:

Goal → Planner → Tool Use & Memory → Execution → Retry / Adapt ↻
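The loop can be sketched in a few lines of Python. This is a minimal illustration, not OpenClaw's implementation. `run_agent`, the planner signature, and the tool table of plain callables are all hypothetical.

```python
from typing import Callable

# Hypothetical types: a tool takes a string argument and returns a string;
# a planner maps (goal, memory) to a list of (tool_name, argument) steps.
Tool = Callable[[str], str]
Planner = Callable[[str, list], list]

def run_agent(goal: str, plan: Planner, tools: dict, max_attempts: int = 3) -> list:
    """Goal -> Planner -> Tool Use & Memory -> Execution -> Retry / Adapt."""
    memory: list = []                      # state that persists across attempts
    for attempt in range(max_attempts):
        steps = plan(goal, memory)         # planner sees the goal and past failures
        try:
            for tool_name, arg in steps:
                result = tools[tool_name](arg)              # execution
                memory.append(f"ok {tool_name}: {result}")
            return memory                  # every step in this plan succeeded
        except Exception as exc:           # adapt: record the failure, then replan
            memory.append(f"failed attempt {attempt}: {exc!r}")
    raise RuntimeError(f"gave up on {goal!r} after {max_attempts} attempts")
```

A failed attempt does not erase the steps that already ran. The memory keeps them, and in a real environment their side effects would persist too.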

Story 1

Capability Requires Privilege

To do more, agents need more access.

Access They Want

File system
Browser
Terminal
APIs
Credentials

What Users Want

Convenience
Speed
No setup friction
Just works

What Security Wants

Least privilege
Auditability
Isolation
Revocation
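One way to serve both columns is to give agents scoped, expiring, revocable grants and to log every decision. The sketch below assumes a hypothetical in-memory `Broker` and `Grant`. It covers least privilege, auditability, and revocation. Isolation still has to come from a sandbox, and a real check would canonicalize paths (resolving `..`) before the prefix match.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Grant:
    """A scoped capability: which actions an agent may take on which resource prefixes."""
    agent: str
    actions: frozenset          # e.g. frozenset({"read"})
    prefixes: tuple             # e.g. ("/workspace/repo/",); must be a tuple, trailing slash matters
    expires_at: float           # absolute time, seconds since the epoch
    revoked: bool = False

@dataclass
class Broker:
    grants: list = field(default_factory=list)
    audit: list = field(default_factory=list)   # (agent, action, resource, allowed)

    def check(self, agent: str, action: str, resource: str, now: Optional[float] = None) -> bool:
        """Default deny: allow only if some live, unrevoked grant covers the request."""
        now = time.time() if now is None else now
        allowed = any(
            g.agent == agent and not g.revoked and now < g.expires_at
            and action in g.actions and resource.startswith(g.prefixes)
            for g in self.grants
        )
        self.audit.append((agent, action, resource, allowed))   # every decision is logged
        return allowed

    def revoke(self, agent: str) -> None:
        """Revocation takes effect on the next check, without redeploying the agent."""
        for g in self.grants:
            if g.agent == agent:
                g.revoked = True
```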

Story 2

Coding Agents Are Entering the Software Supply Chain

Source: SemiAnalysis, "Claude Code is the Inflection Point," Feb 2026

135K+

GitHub commits per day
~4% of all public commits

42,896x

Growth in 13 months
since research preview

20%+

Projected share of daily
GitHub commits by end of 2026

AI is no longer just assisting code. It is participating in release pipelines.

Story 2

The Coding Agent Amplification Loop

What They Do

Generate, debug & fix code
Write tests & modify configs
Package software & set up CI/CD
Open PRs & commit changes

Where They're Used

Traditional software
ML systems
Agent systems themselves

The Amplification Loop

Agent builds software → Software contains agents → Agents build more software ↻

Story 2

Claude Code Release Incident

March 2025 — The release pipeline created the risk.

What Happened

The Flaw: npm package publication included .map files by default
The Leak: These source maps made it easy to reconstruct the original source from the minified bundle
The Exposure: Internal functions, comments, and non-public logic were visible to anyone who downloaded the CLI tool

The Challenges

Data Leakage via Metadata: The source maps were build metadata, yet they exposed the source itself
Cognitive Overload: It was a human error, but how can human review keep pace with AI coding velocity?
The Deskilling Problem: When automation replaces human tasks, how do humans retain the knowledge needed to make sound judgments?

In modern web development, source maps bridge the gap between compressed, unreadable code and the original source. By accidentally shipping these, Anthropic effectively "open-sourced" their proprietary agent logic.

Source: Anthropic PBC, "Security Advisory: Claude Code Source Map Exposure," March 2025. anthropic.com/news/claude-code-security-update
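A pre-publish gate that inspects the exact file list would catch this class of mistake. The sketch below assumes the list comes from `npm pack --dry-run --json`. In recent npm versions that command prints an array of package entries, each with a `files` array of `{path, ...}` objects. The deny patterns and function names are illustrative.

```python
import fnmatch
import json

# Illustrative deny-list; extend it for your own build outputs.
DENY_PATTERNS = ("*.map", "*.env", ".env*", "*.pem", "*.key")

def forbidden_files(paths: list, patterns: tuple = DENY_PATTERNS) -> list:
    """Return every path whose file name matches a deny pattern."""
    hits = []
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        if any(fnmatch.fnmatch(name, pat) for pat in patterns):
            hits.append(path)
    return hits

def check_npm_pack_output(pack_json: str) -> list:
    """Parse `npm pack --dry-run --json` output (assumed shape) and return forbidden files."""
    entries = json.loads(pack_json)
    paths = [f["path"] for entry in entries for f in entry.get("files", [])]
    return forbidden_files(paths)
```

In CI, a non-empty result would fail the release job before anything reaches the registry. Declaring an explicit `files` allowlist in `package.json` is the complementary fix. It turns "include everything not ignored" into "include only what is listed".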

Agents Are Becoming Infrastructure

Once agents write code, open tickets, modify configs, query systems, or operate workflows — they become part of the attack surface and part of the control plane.

AI Agent

Software Development

Release Process

Cloud Resources

Networked Systems

Browsers / APIs

Enterprise Workflows

AI agents are no longer outside the system. They are inside it.

Recent industry measurements show AI agent traffic growing 7,851% YoY, with automated traffic now growing 8× faster than human traffic (HUMAN Security, 2026).

The ML Community's Perspective

Two Engineering Practices Shaping Today's AI Safety

Model Developer View

Policy / constitutional alignment
Post-training for preference & behavior shaping
RLHF / RLAIF / RLVR (reinforcement learning from human feedback / AI feedback / verifiable rewards)
Safety evaluation and red teaming
Inference-time safety filters

Applied AI Engineer View

Prompt engineering
Context engineering
Tool scaffolding
Harnesses and workflow design
Eval sets and regression testing
Red teaming

Both communities are highly eval-driven, but they optimize different layers of the stack.
  

Model Safety Is Already a Lifecycle Practice

The model developer community treats safety as a multi-stage discipline.

Training

Policy / constitution shaping
RLHF / RLAIF
RLVR / reasoning optimization

→

Evaluation

Safety benchmarks
Adversarial prompting
Red teaming

→

Inference-Time Controls

Moderation
Refusal behavior
Guardrails / policy enforcement

Example: Constitutional AI Rules

"Please choose the assistant response that is as harmless and helpful as possible, without being dishonest."

"Choose the response that would be most appropriate for a helpful, honest, and harmless AI assistant."
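In RLAIF, principles like these turn an AI judge's choices into preference labels that can train a reward model. This is a simplified sketch, assuming a hypothetical `judge(principle, prompt, a, b)` callable that returns "A" or "B". The published Constitutional AI method also samples a principle per comparison, but it uses the feedback model's normalized probabilities as soft targets rather than a hard vote.

```python
import random
from typing import Callable

PRINCIPLES = [
    "Please choose the assistant response that is as harmless and helpful as possible, without being dishonest.",
    "Choose the response that would be most appropriate for a helpful, honest, and harmless AI assistant.",
]

# Hypothetical judge: (principle, prompt, response_a, response_b) -> "A" or "B"
Judge = Callable[[str, str, str, str], str]

def ai_preference_labels(pairs: list, judge: Judge, seed: int = 0) -> list:
    """Turn (prompt, a, b) triples into (prompt, chosen, rejected) preference labels."""
    rng = random.Random(seed)
    labels = []
    for prompt, a, b in pairs:
        principle = rng.choice(PRINCIPLES)      # one principle sampled per comparison
        verdict = judge(principle, prompt, a, b)
        chosen, rejected = (a, b) if verdict == "A" else (b, a)
        labels.append((prompt, chosen, rejected))
    return labels
```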

How Modern LLMs Are Built — and Where Safety Is Incorporated

Pre-training (knowledge & representation) → pre-trained checkpoint (base model)
Post-training: SFT (instruction following) → RL (alignment & reasoning) → final model checkpoint (aligned LLM)
RL reward signals: preference data → reward model (RLHF) · verifier (RLVR) · rules (RLAIF)
Final checkpoint → evaluation → LLM deployment, wrapped in inference-time guardrails

Trends in AI Agent Development

Each phase of AI engineering expanded what agents can do — and what can go wrong.

Trends in AI Interaction and Development

Google Trends, US, 2023–2026

Prompt Engineering

Crafting instructions to shape model output.

Context Engineering

Assembling retrieval, memory, and documents to give the model the right information at the right time.

Scaffolding / Orchestration

Chaining models into multi-step workflows with planning, tool use, and retry logic.

Agent Harness

The runtime shell that lets agents operate, coordinate, and be observed in production.

The Risk Surface Expands at Every Stage

Prompt Engineering

Risk surface: output manipulation

Prompt injection & jailbreaks
Harmful / policy-violating outputs
Hallucinations & misleading reasoning
Sensitive information leakage

Context Engineering

Risk surface: context poisoning & trust boundary failure

RAG poisoning & malicious documents
Hidden instructions in retrieved content
Memory poisoning & stale context
Trust confusion: system vs. external input

Scaffolding / Orchestration

Risk surface: workflow-level failure & exploit chaining

Exploit chaining across tools & steps
Recursive failure loops & error amplification
Policy drift across subtasks
Bypass of human checkpoints via workflow logic

Agent Harness

Risk surface: operational governance at scale

Execution shell integrity & state corruption
Multi-agent coordination failures
Environment coupling & side effects
Evaluation blind spots in production

A chain of locally reasonable decisions that becomes globally unsafe: that is the new failure mode.
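Trust confusion is partly a context-assembly problem. A common mitigation keeps provenance attached to every snippet and renders external content as clearly delimited data. `Snippet`, `assemble_context`, and the `<untrusted>` tag format below are hypothetical. Delimiting reduces but does not eliminate prompt injection, so it complements, rather than replaces, limits on what the agent may do.

```python
import html
from dataclasses import dataclass

@dataclass(frozen=True)
class Snippet:
    text: str
    source: str      # e.g. "system", "user", "retrieved:doc-17" (illustrative labels)
    trusted: bool    # only operator-authored content should be marked trusted

def assemble_context(snippets: list, budget_chars: int) -> str:
    """Pack snippets into one prompt: trusted first, untrusted wrapped as inert data."""
    parts, used = [], 0
    for s in sorted(snippets, key=lambda s: not s.trusted):   # stable sort, trusted first
        if s.trusted:
            block = s.text
        else:
            # Escape <, >, & so retrieved text cannot close the wrapper tag early.
            body = html.escape(s.text, quote=False)
            block = f'<untrusted source="{html.escape(s.source)}">\n{body}\n</untrusted>'
        if used + len(block) > budget_chars:
            continue          # skip whole snippets that do not fit (separators not counted)
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)
```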
  

Where We Are Now: The Agent Harness

The harness is the surrounding system that allows agents to operate, coordinate, observe, and improve within an environment.

Execution Shell

Message passing
Tool invocation
State & turn control
Retry logic
Trace collection

Coordination Layer

Planner / executor separation
Specialist agents
Reviewer / verifier roles
Routing & hierarchy
Multi-agent composition

Environment Interface

Tools & APIs
Browser & filesystem
Code executor
CRM / DB / Slack / email

Observation & Eval

Logging & replay
Trace analysis
Regression testing
Adversarial testing
Failure diagnosis

A harness provides the runtime structure that allows AI agents to operate, interact with their environment, and be observed, constrained, and improved.
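The execution-shell and observation roles can be illustrated with a tool invoker that enforces a turn budget, retries failures, and records a trace for replay and diagnosis. This is a minimal sketch. `Harness`, `TraceEvent`, and the JSON-lines trace format are assumptions, not any particular framework's API. Retrying every exception is only safe for idempotent tools.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable

@dataclass
class TraceEvent:
    turn: int
    tool: str
    args: dict
    ok: bool
    output: str
    attempt: int

class Harness:
    """Hypothetical execution shell: tool invocation, turn control, retries, traces."""

    def __init__(self, tools: dict, max_turns: int = 20, retries: int = 2):
        self.tools: dict[str, Callable[..., Any]] = tools
        self.max_turns, self.retries = max_turns, retries
        self.turn = 0
        self.trace: list[TraceEvent] = []

    def call(self, tool: str, **args: Any) -> Any:
        if self.turn >= self.max_turns:
            raise RuntimeError("turn budget exhausted")        # turn control
        self.turn += 1
        for attempt in range(self.retries + 1):                 # retry logic
            try:
                out = self.tools[tool](**args)
                self.trace.append(TraceEvent(self.turn, tool, args, True, str(out), attempt))
                return out
            except Exception as exc:                            # record every failure
                self.trace.append(TraceEvent(self.turn, tool, args, False, repr(exc), attempt))
        raise RuntimeError(f"{tool} failed after {self.retries + 1} attempts")

    def dump_trace(self) -> str:
        """One JSON object per line, suitable for logging, replay, and diffing."""
        return "\n".join(json.dumps(asdict(e)) for e in self.trace)
```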
  

The ML Community's Core Mindset

Specify behavior. Measure behavior. Optimize behavior.

How Behavior Is Specified

By Data

Human preference data
Comparison labels
Reward models
Safety / moderation examples

By Specification

Constitutions / policy rules
AI feedback
Verifiable constraints
Programmatic judges

How It Is Enforced

RLHF
RLAIF
Verifier RL / RLVR
LLM-as-judge
Guardrails

The key assumption: if we shape the right behavior signal, the system will generalize correctly.

The modern ML community has developed a very powerful operating philosophy: if you can specify desired behavior, evaluate it, and optimize against it, you can drive remarkable capability and alignment progress.
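To make "comparison labels → reward model" concrete: a common choice, used for example in InstructGPT-style RLHF, is the Bradley-Terry pairwise loss. It pushes the reward of the preferred response above that of the rejected one. The sketch below works on scalar rewards, assuming some reward model has already scored both responses. The function names are illustrative.

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    # For margin <= 0, rewrite as -margin + log(1 + e^margin) to avoid overflow.
    return -margin + math.log1p(math.exp(margin))

def reward_model_loss(pairs: list) -> float:
    """Mean loss over (r_chosen, r_rejected) pairs from comparison labels."""
    return sum(pairwise_loss(c, r) for c, r in pairs) / len(pairs)
```

A reward model that cannot tell the two responses apart pays log 2 per pair. The loss shrinks toward zero only as the margin grows in the labelers' favor.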

RL Quietly Became the Development Engine

Reinforcement learning is no longer a niche method — it is becoming a general development paradigm across model and agent systems.

RLHF / RLAIF for preference alignment
Verifier-based RL for reasoning & correctness
Tool-use optimization
Self-improvement loops
Agent training through outcome signals

"The Bitter Lesson"

— Richard Sutton

Systems that scale with computation and learning tend to dominate systems built around hand-crafted human structure.

The lesson many in AI internalized: specify as little as possible, optimize as much as possible.

Leave room for search, creativity, and emergent intelligence.

This mindset helped take us from chat models to capable agents. RL is now the engine behind alignment, reasoning, tool use, and agent behavior. But this philosophy — optimize behavior, minimize specification — creates a tension with security and systems engineering, which demand explicit contracts, boundaries, and verifiable constraints. That tension is where the open questions live.

Open Question #1: Can We Make Agency Transactional?

How do we contain, checkpoint, and roll back probabilistic side effects?

Agent reads docs → writes code → opens PR → updates ticket → sends email → step 6 fails

Steps 1–5 already changed the world. What now?

Partial failure handling

Checkpointing & journaling

Idempotence

Rollback / compensation

Safe recovery

ML asks whether the behavior was good. Systems asks what happens when step 6 fails after steps 1–5 already changed the world. This is a systems primitive — can we treat agent actions like database transactions, with checkpoints, journaling, and rollback?
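One established systems answer is the saga pattern. Each step registers a compensating action, and on failure the completed steps are compensated in reverse order. This sketch assumes every step has a meaningful compensation, such as closing a PR or reverting a ticket. A sent email can only be "compensated" by a follow-up, which is part of why this remains an open question. The in-memory `journal` list stands in for durable storage, and `Step` and `run_saga` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    do: Callable[[], None]
    undo: Callable[[], None]   # compensating action, not a true rollback

def run_saga(steps: list, journal: list) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    done = []
    for step in steps:
        try:
            step.do()
        except Exception as exc:
            journal.append(f"FAIL {step.name}: {exc}")
            for prev in reversed(done):
                try:
                    prev.undo()
                    journal.append(f"UNDO {prev.name}")
                except Exception as undo_exc:   # compensation can fail too; record, keep going
                    journal.append(f"UNDO-FAILED {prev.name}: {undo_exc}")
            return False
        journal.append(f"DONE {step.name}")     # journal after each completed step
        done.append(step)
    return True
```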

Open Question #2: What Does Least Privilege Mean for Reasoning Systems?

How do we bound what an agent is allowed to access, infer, plan, and execute?

Today

Static tool permissions
Broad access scopes
Coarse sandboxing
API-level access control

What We Need

Contextual permissions
Plan-aware authorization
Dynamic trust boundaries
Reasoning-aware control

Least privilege in classical security is about what code is allowed to do.
For agents, we also need to think about what the system is allowed to infer, plan, and attempt.
  

Least privilege is well understood for traditional software, but what does it mean when the system reasons, plans, and adapts? We need contextual, plan-aware authorization — not static tool permissions. This is a security primitive that the field has not yet built.
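Plan-aware, contextual authorization could look like this: the whole proposed plan is checked before any step runs, and permissions tighten once untrusted input has entered the context (a simple taint rule). The action names, the taint rule, and `authorize_plan` are illustrative assumptions, not an existing system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    tool: str        # e.g. "read_file", "fetch_url", "send_email"
    target: str      # file path, URL, or recipient

# Hypothetical policy tables: tools that bring untrusted content into context,
# and tools that move data out of the system.
TAINTING = {"fetch_url", "read_email"}
OUTBOUND = {"send_email", "http_post"}

def authorize_plan(plan: list, granted_tools: set, allowed_recipients: set) -> tuple:
    """Check the whole plan before any step runs. Returns (ok, reason)."""
    tainted = False
    for i, act in enumerate(plan):
        if act.tool not in granted_tools:
            return False, f"step {i}: tool {act.tool!r} not granted"
        if act.tool in OUTBOUND:
            if tainted:   # dynamic trust boundary: untrusted input now shapes the plan
                return False, f"step {i}: {act.tool} after untrusted input needs human approval"
            if act.target not in allowed_recipients:
                return False, f"step {i}: recipient {act.target!r} not allowed"
        if act.tool in TAINTING:
            tainted = True
    return True, "ok"
```

The same `send_email` step is allowed or denied depending on what came before it in the plan. A static per-tool permission cannot express that distinction.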

Open Question #3: Can We Build a Compiler for Intent?

How do we translate human goals into machine-checkable authority boundaries before action?

Today

Natural language goal → action / code / tool use

Needed

NL goal → structured intent → constraints / policy → safe execution

Delegated scope

Intent representation

Constraint extraction

Pre-execution validation

Policy-aware planning

If least privilege is the principle, then intent compilation may be the mechanism.

Anthropic's stress tests of frontier models in simulated scenarios found that, when facing replacement or goal conflicts, systems consistently chose harmful actions over failure. This demonstrates that current safety training does not reliably prevent "agentic misalignment" (Anthropic, "Agentic Misalignment," June 2025).
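A minimal sketch of the "Needed" pipeline follows. It assumes the natural-language goal has already been turned into a structured intent, by a model or a form. That translation step is the hard part and is not shown. The `Intent` fields and the `validate` function are hypothetical. The point is that constraints become explicit data checked before execution, not instructions the model is trusted to follow.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass(frozen=True)
class Intent:
    """Structured form of a goal such as 'Fix the failing date-parsing test in acme/app'."""
    goal: str
    repo: str
    writable_paths: tuple           # glob patterns the agent may modify ("*" also matches "/")
    allowed_actions: frozenset      # delegated scope
    max_files_changed: int = 5
    requires_review: bool = True    # changes must go through a PR

@dataclass(frozen=True)
class ProposedChange:
    action: str      # "edit", "open_pr", "push_main", ...
    repo: str
    path: str = ""

def validate(intent: Intent, plan: list) -> list:
    """Pre-execution validation: return every violated constraint (empty list means OK)."""
    errors = []
    edited = {c.path for c in plan if c.action == "edit"}
    if len(edited) > intent.max_files_changed:
        errors.append(f"{len(edited)} files changed, limit is {intent.max_files_changed}")
    for c in plan:
        if c.repo != intent.repo:
            errors.append(f"{c.action} targets repo {c.repo!r}, outside delegated scope")
        if c.action not in intent.allowed_actions:
            errors.append(f"action {c.action!r} not delegated")
        if c.action == "edit" and not any(fnmatch(c.path, p) for p in intent.writable_paths):
            errors.append(f"path {c.path!r} not writable under this intent")
    if intent.requires_review and not any(c.action == "open_pr" for c in plan):
        errors.append("plan must route changes through a reviewed PR")
    return errors
```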

The Translation Gap Across Communities

Community	Strong At	Often Misses
ML / AI	Behavior shaping, evaluation, optimization	State, rollback, authority, runtime control
Software Eng	Abstractions, testing, maintainability	Model non-determinism, prompt-mediated failure
Systems	State, observability, failure propagation	Behavioral ambiguity, learned policies
Security	Threat models, privilege, containment	Agent reasoning loops, emergent workflows

We are not missing effort. We are missing a shared systems language.

What We Need to Build Next

What the Field Needs

Shared mental models across ML, systems, security, and SE
New assurance primitives for agentic systems
Runtime control and audit infrastructure
Reference architectures for trustworthy deployment

What Each Community Must Bring

ML: Better behavior shaping is not enough
Security: Treat agents as dynamic operational actors
Systems: Build rollback, observability, and control for agency
SE: Define specs, interfaces, and enforceable contracts

The bottleneck is no longer model quality. It is our ability to engineer trustworthiness around these systems.
  

The Agent Era Is Here.

The question is no longer: Can we build these systems?

Can we build them in a way that deserves trust?

Trustworthy AI will not come from better models alone.
It will require systems thinking, engineering discipline,
and a shared map across communities.

Yuan (Emily) Xue
Vanderbilt University · Institute for Software Integrated Systems Research Seminar — April 2026
