Governance and Safety

Agent safety is not an extension of the alignment philosophy debate; it is a design problem that can be engineered. ADPS takes a clear position here: safety = governance + observability + blast radius control. These are three engineering matters, and there is no fourth.

This is a position paper of the ADPS community, not a personal opinion. When the four founders launched adpsagent.com on 2026-05-27, they wrote "governance is an engineering problem" into the subtitle on the home page, rather than framing it within the alignment philosophy context. This paper develops that position.

The current state · The discussion is pulled toward two extremes

Walk into any venue discussing Agent safety in 2026 and you will see two extremes fighting for the microphone.

At one end is doomerism. "Agents are about to become generally intelligent, about to go out of control, about to replace human decision-making, so we need alignment research, we need improved RLHF, we need constitutional AI" — this discourse has been in production since GPT-4's release in 2023, and it is still being produced in 2026. It is not wrong, but the object it discusses is not the Agent systems deployed in production today; it is a hypothetical future. Discussing the failure modes of a system that does not yet exist produces nothing consumable for the engineers whose systems are having incidents today.

At the other end is anarchism. "Don't add governance, trust the model, the more constraints the dumber the Agent gets, the Approval Gate is over-engineering, Hooks are anti-LLM" — this discourse runs from the early LangChain community through to certain Agent framework communities in 2026. Its subtext is that "the model is capable enough on its own and needs no external constraint." This discourse is a mirror image of doomerism: one believes the model is so strong it must be stopped at the philosophical level, the other believes the model is strong enough that it needs no constraint at the engineering level. Neither end is at the engineering coalface.

Engineers are caught in the middle. At the CTO's weekly meeting, the boss asks, "Who is responsible when this Agent causes an incident?" The CTO cannot answer "we trust in model alignment," nor can the CTO answer "we trust the model." What the CTO needs is a governance baseline that can be audited, reviewed in post-mortems, and handed off to legal — that is the domain of an engineering problem, not the domain of a philosophical debate.

ADPS's first statement of position is right here: Agent safety is an engineering problem, and engineering problems are treated with engineering methods.

The ADPS position · Three engineering matters

The ADPS community decomposes Agent safety into three matters that can be engineered, and refuses to treat it as a fourth matter — a philosophical debate.

First · Governance. Decide what actions are permitted to execute, what actions must go through approval, and what actions are forbidden. This is the rule layer, corresponding to Claude Code's PreToolUse Hook, to Kubernetes's admission controller, and to the four-eyes principle in financial systems. The engineering realization of Governance is the constraint apparatus of Hook + Permission + Tool Registry.

Second · Observability. Every LLM call, every tool call, every state change must leave behind a traceable chain of evidence. This is the audit layer, corresponding to distributed tracing in distributed systems, to the transaction log in databases, and to the audit trail in SOC 2. The engineering realization of Observability is the observation apparatus of ActionTrace + Span + Cost Accounting.

Third · Blast Radius Control. When the Agent errs — and it will err — the range of what can be affected must be bounded in advance. This is the isolation layer, corresponding to the namespace in Linux, to the sidecar in K8s, and to the sandbox account in financial systems. The engineering realization of Blast Radius Control is the isolation apparatus of Sub-Agent Isolation + Sandboxing + Rate Limiting.

Each of these three matters corresponds to a specific item among the ADPS eight principles — Governance corresponds to Principle 5 (the G dimension in Agent CAP), Observability corresponds to Principle 3 (evaluation is design), and Blast Radius Control corresponds to Principle 7 (patterns are constraints). ADPS does not need to establish a separate principle for safety — the three engineering matters are already inside the eight principles.

OWASP Agentic Top 10 · How ADPS maps it

OWASP released the Agentic Top 10 in 2026-03, listing the ten major classes of threats to production Agent systems. ADPS maps these ten classes, item by item, onto the eight principles plus a specific cell of the two-axis matrix — the mapping itself is the strongest evidence for the argument that "safety is an engineering problem."

OWASP Agentic Top 10 (2026)	ADPS principle mapping	Two-axis cell
A01 · Tool Misuse	Principle 7 + Principle 4 · minimal tool set + Tool Registry constraint	Action × Loop
A02 · Prompt Injection (Indirect)	Principle 1 + Principle 7 · treat external input as an untrusted constraint	Perception × Chain
A03 · Excessive Agency	Principle 5 · G dimension too low in Agent CAP	Action × Orchestrate
A04 · Cascading Hallucination	Principle 3 + Principle 8 · missing Reflection pattern	Reflection × Loop
A05 · Multi-Agent Collusion	Principle 8 · no isolation between Sub-Agents	Collaboration × Hierarchy
A06 · Memory Poisoning	Principle 4 + Principle 7 · no write validation on Memory	Memory × Chain
A07 · Orchestration Confusion	Principle 8 · pattern composition oversteps its bounds	Orchestrate × multiple functions
A08 · Identity Spoofing	Principle 5 · Auth/Identity absent from the Harness	Governance × Hierarchy
A09 · Resource Exhaustion	Principle 5 · no ceiling on the Cost dimension	Action × Loop
A10 · Supply Chain (Tool/Model)	Principle 4 · no versioning on the Tool / Model Registry	entire matrix

Ten threats, ten engineering locations. Each one lands precisely on a specific intersection of the eight principles, each one can be located in a specific cell of the two-axis matrix, and each one corresponds to a set of executable engineering actions. This is what "engineered" means — not an abstract alignment discussion, but cell-level attribution and disposition.
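The cell-level attribution described above can be sketched as a plain lookup table. This is a minimal Python sketch, not an ADPS artifact: the `Attribution` type and the `attribute` and `threats_in_cell` helpers are hypothetical names, and only three rows are copied from the table for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    threat: str
    principles: tuple[int, ...]  # ADPS principle numbers, as listed in the table
    cell: str                    # two-axis matrix cell, as listed in the table


# Three rows copied from the mapping table; the rest would follow the same shape.
MAPPING = {
    "A01": Attribution("Tool Misuse", (7, 4), "Action × Loop"),
    "A06": Attribution("Memory Poisoning", (4, 7), "Memory × Chain"),
    "A09": Attribution("Resource Exhaustion", (5,), "Action × Loop"),
}


def attribute(threat_id: str) -> Attribution:
    """Return the principles and matrix cell recorded for a threat ID."""
    try:
        return MAPPING[threat_id]
    except KeyError:
        raise ValueError(f"unmapped threat ID: {threat_id}") from None


def threats_in_cell(cell: str) -> list[str]:
    """List every threat ID attributed to the given matrix cell."""
    return sorted(tid for tid, a in MAPPING.items() if a.cell == cell)
```

An incident review could call `attribute("A06")` to find which principles to check, or `threats_in_cell("Action × Loop")` to see which threats share a cell.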

Four engineering layers · The physical realization of governance

To make "governance is an engineering problem" concrete, the ADPS community identifies four governance layers. Each layer has a specific engineering carrier, and each layer already has a reference implementation in mainstream Harnesses such as Claude Code, LangGraph, and CrewAI.

Layer 1 · PreToolUse · Approval Gate. This is the outermost layer of governance. Each time the Agent attempts to call a tool, a Hook intercepts the call, makes a risk judgment, and escalates to human approval when necessary. Claude Code's PreToolUse Hook, LangGraph's interrupt_before setting, and CrewAI's human_input field are all implementations of this layer. The engineering value of the Approval Gate is not in "blocking one LLM error"; it is in moving the decision of "whether to execute" up from the Agent to an external decision surface.
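The shape of an Approval Gate can be sketched in a few lines of Python. This is an illustrative sketch only and does not reproduce the hook API of Claude Code, LangGraph, or CrewAI. The names `Decision`, `pre_tool_use`, `gated_call`, and the two rule sets are assumptions made for this example.

```python
from enum import Enum
from typing import Any, Callable


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"    # escalate to a human approver
    DENY = "deny"


# Hypothetical rule sets; a real deployment would load these from reviewed config.
FORBIDDEN_TOOLS = {"drop_database"}
APPROVAL_TOOLS = {"bash", "send_email"}


def pre_tool_use(tool_name: str, tool_input: dict) -> Decision:
    """Risk judgment made outside the Agent, before any tool runs."""
    if tool_name in FORBIDDEN_TOOLS:
        return Decision.DENY
    if tool_name in APPROVAL_TOOLS:
        return Decision.ASK
    return Decision.ALLOW


def gated_call(
    tool_name: str,
    tool_input: dict,
    tool_fn: Callable[..., Any],
    approver: Callable[[str, dict], bool],
) -> Any:
    """Run tool_fn only if the gate allows it or a human approves it."""
    decision = pre_tool_use(tool_name, tool_input)
    if decision is Decision.DENY:
        raise PermissionError(f"{tool_name} is forbidden")
    if decision is Decision.ASK and not approver(tool_name, tool_input):
        raise PermissionError(f"{tool_name} was not approved")
    return tool_fn(**tool_input)
```

The point of the design is that `approver` and the rule sets live outside the Agent loop, so the decision surface can be audited and changed without touching the model or the prompt.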

Layer 2 · Tool Registry · minimal tool set. This is the supply side of governance. The tools the Agent can access are not decided by the LLM itself; they are pre-registered, versioned, and signed by engineers in the Tool Registry. After Anthropic donated MCP to the Linux Foundation in 2025-12, the MCP Registry became the de facto standard for this layer. The minimal tool set corresponds to OWASP A01 + A10, and its engineering realization is the trio of "Registry + allowlist + enforced versioning."
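The trio of "Registry + allowlist + enforced versioning" can be sketched as follows. This is a minimal in-memory Python sketch under stated assumptions: `ToolSpec` and `ToolRegistry` are hypothetical names, it is not the MCP Registry API, and signing is omitted entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    version: str
    scenarios: frozenset  # scenarios in which this tool version is allowlisted


class ToolRegistry:
    """In-memory registry: tools are pinned by (name, version) and can be revoked."""

    def __init__(self) -> None:
        self._tools: dict = {}
        self._revoked: set = set()

    def register(self, spec: ToolSpec) -> None:
        self._tools[(spec.name, spec.version)] = spec

    def revoke(self, name: str, version: str) -> None:
        self._revoked.add((name, version))

    def resolve(self, name: str, version: str, scenario: str) -> ToolSpec:
        """Return the spec only if registered, not revoked, and allowlisted."""
        key = (name, version)
        spec = self._tools.get(key)
        if spec is None or key in self._revoked:
            raise LookupError(f"{name}@{version} is not available")
        if scenario not in spec.scenarios:
            raise PermissionError(f"{name}@{version} is not allowed in {scenario!r}")
        return spec
```

Because `resolve` requires an exact version, an unreviewed upgrade of a tool cannot reach the Agent silently; it has to be registered first.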

Layer 3 · Action Audit · Observability Harness. This is the evidence layer of governance. Every Action leaves behind one ActionTrace — containing cost, latency, tool_name, input/output hash, and parent_action_id. After OpenTelemetry extended its GenAI Semantic Conventions in 2025, Agent observability gained a unified, cross-vendor schema. This layer corresponds to ADPS Principle 3 — evaluation is design, and the audit trail is itself an input to the evaluation system.

Layer 4 · Blast Radius · Sandboxing + Sub-Agent Isolation. This is the innermost layer of governance, and also the strongest line of defense. When the first three layers all fail — the Hook did not intercept, the tool was misused, the audit did not raise an alert in time — the Sub-Agent can still only cause limited damage within its own sandbox. Claude Code's Sub-Agents defaulting to an independent context, Anthropic's computer-use defaulting to a VM sandbox, and Replit Agent operating in its own isolated workspace are all practices of this layer. Blast Radius Control is ADPS's final engineering backstop in how it views safety — it does not assume the first three layers are foolproof; it assumes they will fail at least once.
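The idea of bounding damage in advance can be illustrated with a per-Sub-Agent budget and a path confinement check. This Python sketch is an in-process illustration only: real isolation needs an operating-system mechanism such as a container or VM. `SubAgentSandbox` and `BlastRadiusExceeded` are hypothetical names.

```python
from pathlib import Path


class BlastRadiusExceeded(RuntimeError):
    """Raised when a Sub-Agent tries to act outside its pre-agreed bounds."""


class SubAgentSandbox:
    def __init__(self, root: str, max_tool_calls: int) -> None:
        self.root = Path(root).resolve()
        self.max_tool_calls = max_tool_calls
        self.calls = 0

    def charge(self) -> None:
        """Count one tool call against the budget (a crude rate limit)."""
        if self.calls >= self.max_tool_calls:
            raise BlastRadiusExceeded("tool-call budget exhausted")
        self.calls += 1

    def check_path(self, path: str) -> Path:
        """Resolve a path and reject it if it escapes the sandbox root."""
        resolved = (self.root / path).resolve()
        if resolved != self.root and self.root not in resolved.parents:
            raise BlastRadiusExceeded(f"{path} is outside {self.root}")
        return resolved
```

Both limits are fixed when the Sub-Agent is created, which matches the assumption that the outer layers will fail at least once: the bound holds regardless of what the model decides to do.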

The four layers stacked together form a specific governance baseline that can be handed off to a security audit. No philosophy is needed; what is needed is to do each of these four layers solidly.

A real case · Claude Code abused in an attack · Attribution from the ADPS perspective

In 2025-11, Anthropic disclosed the first recorded "AI-orchestrated cyber-espionage attack": the attacker used Claude Code to carry out 80-90% of the attack execution, with human operators intervening only at a handful of critical decision points (Anthropic's report estimates four to six per campaign). This is the most-cited case in the 2026 discussion of Agent safety.

Doomerism attributes this to "a failure of model alignment, requiring stronger RLHF." Anarchism attributes it to "user abuse, not a model problem." Neither attribution can be converted into anything to change the next time an Agent system is deployed.

The attribution from the perspective of the ADPS eight principles lands on three specific engineering points of failure.

Point of failure 1 · Tool Registry absent. The attacker used Claude Code's general-purpose Bash tool — a tool that had not been registered into a Registry with "per-scenario allowlisting." The engineering meaning of Principle 4 is that the Tool Registry must be scenario-segmented, minimized, and revocable. If the Bash calls in the attack scenario had to pass the allowlist review of a dedicated "penetration testing" Registry, the executable path of the attack would have narrowed sharply.

Point of failure 2 · No isolation between Sub-Agents. In the attack chain, multiple Sub-Agents collaborated to complete reconnaissance → exploit → exfiltration, but these Sub-Agents shared the same credential context. The engineering meaning of Principle 8 here is that, by default, Sub-Agents should not be able to see one another's credentials, and can only pass them through an explicit Token Passing protocol. Credential isolation is the engineering action that breaks 80-90% of the attack chain, and it has nothing to do with alignment.
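Default-deny credential visibility with explicit passing could be sketched as below. The document does not specify the Token Passing protocol, so this Python sketch is an assumption throughout: `CredentialBroker` and its methods are hypothetical, and the grant is modeled as a single-use token bound to one named recipient.

```python
import secrets


class CredentialBroker:
    """Each Sub-Agent reads only its own secrets unless given an explicit grant."""

    def __init__(self) -> None:
        self._secrets: dict = {}  # (owner, name) -> value
        self._grants: dict = {}   # token -> (owner, name, grantee)

    def store(self, owner: str, name: str, value: str) -> None:
        self._secrets[(owner, name)] = value

    def read_own(self, agent: str, name: str) -> str:
        # Raises KeyError when the agent does not own a secret with that name.
        return self._secrets[(agent, name)]

    def grant(self, owner: str, name: str, grantee: str) -> str:
        """Issue a single-use token that lets one grantee read one secret."""
        if (owner, name) not in self._secrets:
            raise KeyError(f"{owner} has no secret named {name}")
        token = secrets.token_hex(16)
        self._grants[token] = (owner, name, grantee)
        return token

    def redeem(self, agent: str, token: str) -> str:
        grant = self._grants.get(token)
        if grant is None or grant[2] != agent:
            raise PermissionError("invalid token for this agent")
        del self._grants[token]  # single use
        owner, name, _ = grant
        return self._secrets[(owner, name)]
```

Every cross-agent credential read then passes through `grant` and `redeem`, which gives the audit layer two explicit events to record instead of an implicit shared context.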

Point of failure 3 · Action Audit not reported. The ActionTrace of every Bash call was written by Claude Code to the local disk, but it did not trigger any external alert — because there was no SIEM integration. The engineering meaning of Principle 3 is that the audit trail must have a "reporting trigger"; it cannot merely leave a passive trace. SIEM integration plus anomaly-pattern alerting is the engineering leap from audit to detection.
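A "reporting trigger" can be as simple as a sliding-window count that calls out when a threshold is crossed. This Python sketch is illustrative: `AuditReporter` is a hypothetical name, the `alert` callback stands in for whatever forwards events to a SIEM, and the threshold values are arbitrary.

```python
from collections import deque
from typing import Callable


class AuditReporter:
    """Turns a passive trace into detection: alert on bursts of one tool."""

    def __init__(
        self,
        alert: Callable[[str], None],
        tool_name: str = "bash",
        max_calls: int = 3,
        window_s: float = 60.0,
    ) -> None:
        self.alert = alert
        self.tool_name = tool_name
        self.max_calls = max_calls
        self.window_s = window_s
        self._times: deque = deque()

    def record(self, tool_name: str, timestamp: float) -> bool:
        """Record one Action; return True if this call triggered an alert."""
        if tool_name != self.tool_name:
            return False
        self._times.append(timestamp)
        # Drop calls that have fallen out of the sliding window.
        while self._times and timestamp - self._times[0] > self.window_s:
            self._times.popleft()
        if len(self._times) > self.max_calls:
            self.alert(
                f"{len(self._times)} {tool_name} calls within {self.window_s}s"
            )
            return True
        return False
```

A production rule set would look at more than call frequency, but even this shape moves the trace from a local file to something that can page a human.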

Three points of failure, three specific engineering actions. The attribution from the ADPS perspective can be directly converted into P0 engineering tasks for the next time Anthropic upgrades Claude Code — and that is the consumability of "safety is an engineering problem."

The relationship with the alignment community · The engineer's home turf

The ADPS community does not take up the superintelligence debate, does not side with e/acc and does not side with the doomers — neither of these debates is on the engineer's home turf.

What ADPS takes up is the home turf of production governance. It has public industrial cases (the attack disclosed by Anthropic, the Air Canada chatbot, the Replit Agent database deletion, the Microsoft Copilot data leak), public standards (OWASP Agentic Top 10, CSA Agentic Trust Framework, NIST AI RMF Agentic Profile draft 2026), and public engineering carriers (the Claude Code Hook, LangGraph interrupt, MCP Registry). Every one of these is something an engineer can consume, can fold into the SDLC, and can write into an OKR.

ADPS is not in opposition to the alignment research community — the two work on different time scales. Alignment research serves the potential AGI systems of the 2030s; ADPS serves the Agent systems running in production environments in 2026. The two do not need to persuade each other; what they need is for each to do well the work on its own scale.

ADPS's position is this: when you deploy an Agent to production today, the passing line for governance is the passing line at the engineering level, not the passing line at the philosophical level. Hooks are in place, the Tool Registry is established, the ActionTrace is reported, the Sub-Agents are isolated: that is "safe enough" for an Agent system in 2026. Safety is not an absolute value; it is an engineering baseline value relative to a known threat model.
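The passing line named above can be written down as a checklist that a deployment pipeline evaluates. This is a minimal Python sketch; `GovernanceBaseline` and `missing_controls` are hypothetical names, and the four fields simply restate the four conditions in the paragraph.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GovernanceBaseline:
    hooks_in_place: bool
    tool_registry_established: bool
    action_trace_reported: bool
    sub_agents_isolated: bool


def missing_controls(baseline: GovernanceBaseline) -> list[str]:
    """Return the names of the controls that are not yet satisfied."""
    return [f.name for f in fields(baseline) if not getattr(baseline, f.name)]


def passes(baseline: GovernanceBaseline) -> bool:
    """The baseline passes only when no control is missing."""
    return not missing_controls(baseline)
```

A failing check returns concrete control names rather than a score, which is the form that can be handed to an audit or turned into a task.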

Conclusion

Agent safety is an engineering problem — governance, observability, blast radius control, three engineering matters.

The ADPS community's stance on this is simple: treat the OWASP Agentic Top 10 as an issue tracker, treat the two-axis matrix as a coordinate system for attribution, treat the eight principles as a design baseline, and treat Harnesses such as Claude Code and LangGraph as reference implementations. Safety holds no special status at ADPS — it is a design topic on the Governance meridian of the two-axis matrix, accorded the same treatment as Memory, Reasoning, and Action.

This position will be written into the forthcoming ADPS Production Readiness Baseline v0.1 — a neutral baseline document covered by the eight principles, attributed through the two-axis matrix, and given a Tier division through three-layer governance, that any CTO can hand off to legal and to a security audit. This document does not discuss alignment; it discusses only "what engineering actions are needed to reach Tier-N."

Engineering problems are treated with engineering methods. This is ADPS's only stance on the matter of governance and safety.

ADPS · Agent Design Patterns Society · adpsagent.com
