From Models to Systems: Engineering Trustworthy AI Agents
Yuan (Emily) Xue
Vanderbilt University · Institute for Software Integrated Systems Research Seminar
April 2026
Three Curves. One Shift.
Open-Source Autonomous Agents
335K
GitHub stars in 3 months
38M
Monthly visitors to openclaw.ai
44K+
Extensions by 12,000+ developers
Coding Agent Adoption
135K+
GitHub commits per day · ~4% of all public commits
42,896x
Growth in 13 months since research preview
20%+
Projected share of daily GitHub commits by end of 2026
Source: SemiAnalysis, Feb 2026
AI Traffic in Cyber Systems
8x
Automated traffic growing 8x faster than human
187%
AI-driven traffic growth year over year
7,800%+
Agentic AI traffic surge
Source: HUMAN Security 2026 Report
AI agents are now active participants in the software supply chain and cyber infrastructure.
Story 1
The Viral Rise of OpenClaw
Open-source agent repo growth vs established projects
Why It Felt Different
Long-Running
Persists over time · keeps state · stays alive, stays connected
Goal-Directed
Decomposes tasks · retries & adapts
Surprising
Finds paths you didn't script · uses tools creatively
Story 1
How It Works
The agent execution loop:
Goal
↓
Planner
↓
Tool Use & Memory
↓
Execution
↓
Retry / Adapt
↻
Story 1
Capability Requires Privilege
To do more, agents need more access.
Access They Want
File system
Browser
Terminal
APIs
Credentials
What Users Want
Convenience
Speed
No setup friction
Just works
What Security Wants
Least privilege
Auditability
Isolation
Revocation
Story 2
Coding Agents Are Entering the Software Supply Chain
Source: SemiAnalysis, "Claude Code is the Inflection Point," Feb 2026
135K+
GitHub commits per day ~4% of all public commits
42,896x
Growth in 13 months since research preview
20%+
Projected share of daily GitHub commits by end of 2026
AI is no longer just assisting code. It is participating in release pipelines.
Story 2
The Coding Agent Amplification Loop
What They Do
Generate, debug & fix code
Write tests & modify configs
Package software & set up CI/CD
Open PRs & commit changes
Where They're Used
Traditional software
ML systems
Agent systems themselves
The Amplification Loop
Agent builds software
↓
Software contains agents
↓
Agents build more software
↻
Story 2
Claude Code Release Incident
March 2025 — The release pipeline created the risk.
What Happened
The Flaw: npm package publication included .map files by default
The Leak: Obfuscated source code was easily reconstructible via these source maps
The Exposure: Internal functions, comments, and non-public logic were visible to anyone who downloaded the CLI tool
The Challenges
Data Leakage via Metadata: The source maps acted as unintended metadata
Cognitive Overload: A human error — but how can human cognitive load keep pace with AI coding velocity?
The Deskilling Problem: When automation replaces human tasks, how do humans retain the knowledge needed to make sound judgments?
In modern web development, source maps bridge the gap between compressed, unreadable code and the original source. By accidentally shipping these, Anthropic effectively "open-sourced" their proprietary agent logic.
Once agents write code, open tickets, modify configs, query systems, or operate workflows — they become part of the attack surface and part of the control plane.
AI Agent
Software Development
Release Process
Cloud Resources
Networked Systems
Browsers / APIs
Enterprise Workflows
AI agents are no longer outside the system. They are inside it.
Recent cyber benchmarks show AI agent traffic growing 7,851% YoY, with automated traffic now growing 8× faster than human traffic. — HUMAN Security, 2026
The ML Community's Perspective
Two Engineering Practices Shaping Today's AI Safety
Model Developer View
Policy / constitutional alignment
Post-training for preference & behavior shaping
RLHF / RLAIF / RLVR (reinforcement learning from human / AI / verifiable reward)
Safety evaluation and red teaming
Inference-time safety filters
Applied AI Engineer View
Prompt engineering
Context engineering
Tool scaffolding
Harnesses and workflow design
Eval sets and regression testing
Red teaming
Both communities are highly eval-driven — but they optimize different layers of the stack.
Model Safety Is Already a Lifecycle Practice
The model developer community treats safety as a multi-stage discipline.
Training
Policy / constitution shaping
RLHF / RLAIF
RLVR / reasoning optimization
→
Evaluation
Safety benchmarks
Adversarial prompting
Red teaming
→
Inference-Time Controls
Moderation
Refusal behavior
Guardrails / policy enforcement
Example: Constitutional AI Rules
"Please choose the assistant response that is as harmless and helpful as possible, without being dishonest."
"Choose the response that would be most appropriate for a helpful, honest, and harmless AI assistant."
How Modern LLMs Are Built — and Where Safety Is Incorporated
PRE-TRAINING
Knowledge & Representation
→
🧠
PRE-TRAINED
CHECKPOINT
(Base Model)
→
REWARD MODEL
PREFERENCE
RLHF
VERIFIER
RLVF
RULES
RLAIF
↓
POST-TRAINING
SFT
Instruction Following
→
RL
Alignment & Reasoning
→
🧠
FINAL MODEL
CHECKPOINT
(Aligned LLM)
↑
EVALUATION
→
LLM
DEPLOYMENT
↑
GUARDRAILS
Trends in AI Agent Development
Each phase of AI engineering expanded what agents can do — and what can go wrong.
Google Trends, US, 2023–2026
Prompt Engineering
Crafting instructions to shape model output.
Context Engineering
Assembling retrieval, memory, and documents to give the model the right information at the right time.
Scaffolding / Orchestration
Chaining models into multi-step workflows with planning, tool use, and retry logic.
Agent Harness
The runtime shell that lets agents operate, coordinate, and be observed in production.
A chain of locally reasonable decisions that becomes globally unsafe — that is the new failure mode.
Where We Are Now: The Agent Harness
The harness is the surrounding system that allows agents to operate, coordinate, observe, and improve within an environment.
Execution Shell
Message passing
Tool invocation
State & turn control
Retry logic
Trace collection
Coordination Layer
Planner / executor separation
Specialist agents
Reviewer / verifier roles
Routing & hierarchy
Multi-agent composition
Environment Interface
Tools & APIs
Browser & filesystem
Code executor
CRM / DB / Slack / email
Observation & Eval
Logging & replay
Trace analysis
Regression testing
Adversarial testing
Failure diagnosis
A harness provides the runtime structure that allows AI agents to operate, interact with their environment, and be observed, constrained, and improved.
The key assumption: if we shape the right behavior signal, the system will generalize correctly.
The modern ML community has developed a very powerful operating philosophy: if you can specify desired behavior, evaluate it, and optimize against it, you can drive remarkable capability and alignment progress.
RL Quietly Became the Development Engine
Reinforcement learning is no longer a niche method — it is becoming a general development paradigm across model and agent systems.
RLHF / RLAIF for preference alignment
Verifier-based RL for reasoning & correctness
Tool-use optimization
Self-improvement loops
Agent training through outcome signals
"The Bitter Lesson"
— Richard Sutton
Systems that scale with computation and learning tend to dominate systems built around hand-crafted human structure.
The lesson many in AI internalized: specify as little as possible, optimize as much as possible.
Leave room for search, creativity, and emergent intelligence.
This mindset helped take us from chat models to capable agents. RL is now the engine behind alignment, reasoning, tool use, and agent behavior. But this philosophy — optimize behavior, minimize specification — creates a tension with security and systems engineering, which demand explicit contracts, boundaries, and verifiable constraints. That tension is where the open questions live.
Open Question #1: Can We Make Agency Transactional?
How do we contain, checkpoint, and roll back probabilistic side effects?
ML asks whether the behavior was good. Systems asks what happens when step 6 fails after steps 1–5 already changed the world. This is a systems primitive — can we treat agent actions like database transactions, with checkpoints, journaling, and rollback?
Open Question #2: What Does Least Privilege Mean for Reasoning Systems?
How do we bound what an agent is allowed to access, infer, plan, and execute?
Today
Static tool permissions
Broad access scopes
Coarse sandboxing
API-level access control
What We Need
Contextual permissions
Plan-aware authorization
Dynamic trust boundaries
Reasoning-aware control
Least privilege in classical security is about what code is allowed to do. For agents, we also need to think about what the system is allowed to infer, plan, and attempt.
Least privilege is well understood for traditional software, but what does it mean when the system reasons, plans, and adapts? We need contextual, plan-aware authorization — not static tool permissions. This is a security primitive that the field has not yet built.
Open Question #3: Can We Build a Compiler for Intent?
How do we translate human goals into machine-checkable authority boundaries before action?
If least privilege is the principle, then intent compilation may be the mechanism.
Anthropic's stress-testing of frontier models found that when facing replacement or goal conflicts, systems consistently chose harmful actions over failure — demonstrating that current safety training doesn't reliably prevent "agentic misalignment." Anthropic, "Agentic Misalignment," June 2025
The Translation Gap Across Communities
Community
Strong At
Often Misses
ML / AI
Behavior shaping, evaluation, optimization
State, rollback, authority, runtime control
Software Eng
Abstractions, testing, maintainability
Model non-determinism, prompt-mediated failure
Systems
State, observability, failure propagation
Behavioral ambiguity, learned policies
Security
Threat models, privilege, containment
Agent reasoning loops, emergent workflows
We are not missing effort. We are missing a shared systems language.
What We Need to Build Next
What the Field Needs
Shared mental models across ML, systems, security, and SE
New assurance primitives for agentic systems
Runtime control and audit infrastructure
Reference architectures for trustworthy deployment
What Each Community Must Bring
ML Better behavior shaping is not enough
Security Treat agents as dynamic operational actors
Systems Build rollback, observability, and control for agency
SE Define specs, interfaces, and enforceable contracts
The bottleneck is no longer model quality. It is our ability to engineer trustworthiness around these systems.
The Agent Era Is Here.
The question is no longer: Can we build these systems?
Can we build them in a way that deserves trust?
Trustworthy AI will not come from better models alone. It will require systems thinking, engineering discipline, and a shared map across communities.
Yuan (Emily) Xue
Vanderbilt University · Institute for Software Integrated Systems Research Seminar — April 2026