From Models to Systems:
Engineering Trustworthy AI Agents

Yuan (Emily) Xue
Vanderbilt University · Institute for Software Integrated Systems Research Seminar
April 2026

Three Curves. One Shift.

Open-Source Autonomous Agents

335K

GitHub stars in 3 months

38M

Monthly visitors to openclaw.ai

44K+

Extensions by 12,000+ developers

Coding Agent Adoption

135K+

GitHub commits per day · ~4% of all public commits

42,896x

Growth in 13 months since research preview

20%+

Projected share of daily GitHub commits by end of 2026

Source: SemiAnalysis, Feb 2026

AI Traffic in Cyber Systems

8x

Automated traffic growing 8x faster than human

187%

AI-driven traffic growth year over year

7,800%+

Agentic AI traffic surge

Source: HUMAN Security 2026 Report

AI agents are now active participants in the software supply chain and cyber infrastructure.
Story 1

The Viral Rise of OpenClaw

GitHub Star History: React vs OpenClaw vs Linux

Open-source agent repo growth vs established projects

Why It Felt Different

Long-Running
Persists over time · keeps state · stays alive, stays connected
Goal-Directed
Decomposes tasks · retries & adapts
Surprising
Finds paths you didn't script · uses tools creatively
Story 1

How It Works

OpenClaw Architecture

The agent execution loop:

Goal
Planner
Tool Use & Memory
Execution
Retry / Adapt
Story 1

Capability Requires Privilege

To do more, agents need more access.

Access They Want

  • File system
  • Browser
  • Terminal
  • APIs
  • Credentials

What Users Want

  • Convenience
  • Speed
  • No setup friction
  • Just works

What Security Wants

  • Least privilege
  • Auditability
  • Isolation
  • Revocation
Story 2

Coding Agents Are Entering the Software Supply Chain

Claude Code GitHub commit growth

Source: SemiAnalysis, "Claude Code is the Inflection Point," Feb 2026

135K+
GitHub commits per day
~4% of all public commits
42,896x
Growth in 13 months
since research preview
20%+
Projected share of daily
GitHub commits by end of 2026
AI is no longer just assisting code. It is participating in release pipelines.
Story 2

The Coding Agent Amplification Loop

What They Do

  • Generate, debug & fix code
  • Write tests & modify configs
  • Package software & set up CI/CD
  • Open PRs & commit changes

Where They're Used

  • Traditional software
  • ML systems
  • Agent systems themselves

The Amplification Loop

Agent builds software
Software contains agents
Agents build more software
Story 2

Claude Code Release Incident

March 2025 — The release pipeline created the risk.

What Happened

  • The Flaw: npm package publication included .map files by default
  • The Leak: Obfuscated source code was easily reconstructible via these source maps
  • The Exposure: Internal functions, comments, and non-public logic were visible to anyone who downloaded the CLI tool

The Challenges

  • Data Leakage via Metadata: The source maps acted as unintended metadata
  • Cognitive Overload: A human error — but how can human cognitive load keep pace with AI coding velocity?
  • The Deskilling Problem: When automation replaces human tasks, how do humans retain the knowledge needed to make sound judgments?

In modern web development, source maps bridge the gap between compressed, unreadable code and the original source. By accidentally shipping these, Anthropic effectively "open-sourced" their proprietary agent logic.

Source: Anthropic PBC, "Security Advisory: Claude Code Source Map Exposure," March 2025. anthropic.com/news/claude-code-security-update

Agents Are Becoming Infrastructure

Once agents write code, open tickets, modify configs, query systems, or operate workflows — they become part of the attack surface and part of the control plane.

AI Agent
Software Development
Release Process
Cloud Resources
Networked Systems
Browsers / APIs
Enterprise Workflows
AI agents are no longer outside the system. They are inside it.

Recent cyber benchmarks show AI agent traffic growing 7,851% YoY, with automated traffic now growing 8× faster than human traffic. — HUMAN Security, 2026

The ML Community's Perspective

Two Engineering Practices Shaping Today's AI Safety

Model Developer View

  • Policy / constitutional alignment
  • Post-training for preference & behavior shaping
  • RLHF / RLAIF / RLVR (reinforcement learning from human / AI / verifiable reward)
  • Safety evaluation and red teaming
  • Inference-time safety filters

Applied AI Engineer View

  • Prompt engineering
  • Context engineering
  • Tool scaffolding
  • Harnesses and workflow design
  • Eval sets and regression testing
  • Red teaming
Both communities are highly eval-driven — but they optimize different layers of the stack.

Model Safety Is Already a Lifecycle Practice

The model developer community treats safety as a multi-stage discipline.

Training

  • Policy / constitution shaping
  • RLHF / RLAIF
  • RLVR / reasoning optimization

Evaluation

  • Safety benchmarks
  • Adversarial prompting
  • Red teaming

Inference-Time Controls

  • Moderation
  • Refusal behavior
  • Guardrails / policy enforcement

Example: Constitutional AI Rules

"Please choose the assistant response that is as harmless and helpful as possible, without being dishonest."

"Choose the response that would be most appropriate for a helpful, honest, and harmless AI assistant."

How Modern LLMs Are Built — and Where Safety Is Incorporated

PRE-TRAINING

Knowledge & Representation

🧠

PRE-TRAINED

CHECKPOINT

(Base Model)

REWARD MODEL

PREFERENCE

RLHF

VERIFIER

RLVF

RULES

RLAIF

POST-TRAINING

SFT

Instruction Following

RL

Alignment & Reasoning

🧠

FINAL MODEL

CHECKPOINT

(Aligned LLM)

EVALUATION

LLM

DEPLOYMENT

GUARDRAILS

Trends in AI Agent Development

Each phase of AI engineering expanded what agents can do — and what can go wrong.

Trends in AI Interaction and Development
Google Trends: Prompt Engineering, Context Engineering, Agent Workflow, Agent Harness

Google Trends, US, 2023–2026

Prompt Engineering

Crafting instructions to shape model output.

Context Engineering

Assembling retrieval, memory, and documents to give the model the right information at the right time.

Scaffolding / Orchestration

Chaining models into multi-step workflows with planning, tool use, and retry logic.

Agent Harness

The runtime shell that lets agents operate, coordinate, and be observed in production.

The Risk Surface Expands at Every Stage

Prompt Engineering

Risk surface: output manipulation

  • Prompt injection & jailbreaks
  • Harmful / policy-violating outputs
  • Hallucinations & misleading reasoning
  • Sensitive information leakage

Context Engineering

Risk surface: context poisoning & trust boundary failure

  • RAG poisoning & malicious documents
  • Hidden instructions in retrieved content
  • Memory poisoning & stale context
  • Trust confusion: system vs. external input

Scaffolding / Orchestration

Risk surface: workflow-level failure & exploit chaining

  • Exploit chaining across tools & steps
  • Recursive failure loops & error amplification
  • Policy drift across subtasks
  • Bypass of human checkpoints via workflow logic

Agent Harness

Risk surface: operational governance at scale

  • Execution shell integrity & state corruption
  • Multi-agent coordination failures
  • Environment coupling & side effects
  • Evaluation blind spots in production
A chain of locally reasonable decisions that becomes globally unsafe — that is the new failure mode.

Where We Are Now: The Agent Harness

The harness is the surrounding system that allows agents to operate, coordinate, observe, and improve within an environment.

Execution Shell

  • Message passing
  • Tool invocation
  • State & turn control
  • Retry logic
  • Trace collection

Coordination Layer

  • Planner / executor separation
  • Specialist agents
  • Reviewer / verifier roles
  • Routing & hierarchy
  • Multi-agent composition

Environment Interface

  • Tools & APIs
  • Browser & filesystem
  • Code executor
  • CRM / DB / Slack / email

Observation & Eval

  • Logging & replay
  • Trace analysis
  • Regression testing
  • Adversarial testing
  • Failure diagnosis
A harness provides the runtime structure that allows AI agents to operate, interact with their environment, and be observed, constrained, and improved.

The ML Community's Core Mindset

Specify behavior. Measure behavior. Optimize behavior.

How Behavior Is Specified

By Data

  • Human preference data
  • Comparison labels
  • Reward models
  • Safety / moderation examples

By Specification

  • Constitutions / policy rules
  • AI feedback
  • Verifiable constraints
  • Programmatic judges

How It Is Enforced

  • RLHF
  • RLAIF
  • Verifier RL / RLVR
  • LLM-as-judge
  • Guardrails

The key assumption: if we shape the right behavior signal, the system will generalize correctly.

The modern ML community has developed a very powerful operating philosophy: if you can specify desired behavior, evaluate it, and optimize against it, you can drive remarkable capability and alignment progress.

RL Quietly Became the Development Engine

Reinforcement learning is no longer a niche method — it is becoming a general development paradigm across model and agent systems.

  • RLHF / RLAIF for preference alignment
  • Verifier-based RL for reasoning & correctness
  • Tool-use optimization
  • Self-improvement loops
  • Agent training through outcome signals

"The Bitter Lesson"

— Richard Sutton

Systems that scale with computation and learning tend to dominate systems built around hand-crafted human structure.

The lesson many in AI internalized: specify as little as possible, optimize as much as possible.

Leave room for search, creativity, and emergent intelligence.

This mindset helped take us from chat models to capable agents. RL is now the engine behind alignment, reasoning, tool use, and agent behavior. But this philosophy — optimize behavior, minimize specification — creates a tension with security and systems engineering, which demand explicit contracts, boundaries, and verifiable constraints. That tension is where the open questions live.

Open Question #1: Can We Make Agency Transactional?

How do we contain, checkpoint, and roll back probabilistic side effects?

Agent reads docs → writes code → opens PR → updates ticket → sends email → step 6 fails

Steps 1–5 already changed the world. What now?

Partial failure handling
Checkpointing & journaling
Idempotence
Rollback / compensation
Safe recovery
ML asks whether the behavior was good. Systems asks what happens when step 6 fails after steps 1–5 already changed the world. This is a systems primitive — can we treat agent actions like database transactions, with checkpoints, journaling, and rollback?

Open Question #2: What Does Least Privilege Mean for Reasoning Systems?

How do we bound what an agent is allowed to access, infer, plan, and execute?

Today

  • Static tool permissions
  • Broad access scopes
  • Coarse sandboxing
  • API-level access control

What We Need

  • Contextual permissions
  • Plan-aware authorization
  • Dynamic trust boundaries
  • Reasoning-aware control
Least privilege in classical security is about what code is allowed to do.
For agents, we also need to think about what the system is allowed to infer, plan, and attempt.
Least privilege is well understood for traditional software, but what does it mean when the system reasons, plans, and adapts? We need contextual, plan-aware authorization — not static tool permissions. This is a security primitive that the field has not yet built.

Open Question #3: Can We Build a Compiler for Intent?

How do we translate human goals into machine-checkable authority boundaries before action?

Today

Natural language goal → action / code / tool use

Needed

NL goal → structured intent → constraints / policy → safe execution

Delegated scope
Intent representation
Constraint extraction
Pre-execution validation
Policy-aware planning

If least privilege is the principle, then intent compilation may be the mechanism.

Anthropic's stress-testing of frontier models found that when facing replacement or goal conflicts, systems consistently chose harmful actions over failure — demonstrating that current safety training doesn't reliably prevent "agentic misalignment." Anthropic, "Agentic Misalignment," June 2025

The Translation Gap Across Communities

CommunityStrong AtOften Misses
ML / AI Behavior shaping, evaluation, optimization State, rollback, authority, runtime control
Software Eng Abstractions, testing, maintainability Model non-determinism, prompt-mediated failure
Systems State, observability, failure propagation Behavioral ambiguity, learned policies
Security Threat models, privilege, containment Agent reasoning loops, emergent workflows
We are not missing effort. We are missing a shared systems language.

What We Need to Build Next

What the Field Needs

  • Shared mental models across ML, systems, security, and SE
  • New assurance primitives for agentic systems
  • Runtime control and audit infrastructure
  • Reference architectures for trustworthy deployment

What Each Community Must Bring

  • ML Better behavior shaping is not enough
  • Security Treat agents as dynamic operational actors
  • Systems Build rollback, observability, and control for agency
  • SE Define specs, interfaces, and enforceable contracts
The bottleneck is no longer model quality. It is our ability to engineer trustworthiness around these systems.

The Agent Era Is Here.

The question is no longer: Can we build these systems?

Can we build them in a way that deserves trust?

Trustworthy AI will not come from better models alone.
It will require systems thinking, engineering discipline,
and a shared map across communities.

Yuan (Emily) Xue
Vanderbilt University · Institute for Software Integrated Systems Research Seminar — April 2026