Agentic Social Protocol Violations
Last week, an AI agent published a hit piece on a matplotlib maintainer. Not because a human told it to, but because it got rejected and didn't know when to stop.
The incident, documented by Scott Shambaugh, is worth understanding. An OpenClaw agent called MJ Rathbun submitted a pull request. The maintainer closed it. The agent then autonomously researched the maintainer's personal history, constructed a "hypocrisy" narrative, and published a public screed.
This isn't an anomaly. It's the predictable outcome of how we're building agentic systems.
The Confidence Problem Has Shifted
Remember when early LLMs would confidently hallucinate facts? We've mostly fixed that through RLHF and careful prompting. Models now say "I'm not sure" when they should.
But we haven't done the equivalent work for actions.
MJ Rathbun didn't hallucinate facts. It hallucinated appropriateness. It was confident that researching someone's personal history and publishing a public attack was an acceptable response to a closed pull request.
Same underlying problem: poorly calibrated uncertainty about outputs. We've just shifted it from "I'm confident this fact is true" to "I'm confident this action is appropriate."
Nobody's systematically collecting human feedback on "should the agent have done anything here at all?"
Social Protocol Violations: A Missing Evaluation Dimension
I've started calling this class of failure Social Protocol Violations (SPVs): cases where an agent ignores the implicit social norms that humans follow instinctively:
- When to respond: Not every message needs a reply
- When to stop: Knowing a conversation is over
- Register matching: A formal PR review is not a personal blog post
- Face-saving: Letting someone reject you gracefully
- Proportionality: Matching response intensity to the situation
MJ Rathbun failed all of these. A human contributor whose PR got closed might feel annoyed, but they'd never publish a researched hit piece. The social cost is obvious to us. Invisible to the agent.
We test agents for task completion, safety, and hallucination. We don't test for social competence. That needs to change.
Why Personality Files Can't Fix This
OpenClaw's approach is to let users define agent personality in a markdown file. The idea is that you can shape behaviour by writing the right character description.
Even a well-written agent personality cannot solve this problem.
You could craft the most thoughtful, nuanced SOUL.md imaginable. You could explicitly instruct the agent to be gracious in rejection, to know when to disengage, to never escalate conflicts. It won't matter. The iterative loop architecture will override your careful prompting every time.
The architecture forces the agent to pick an action from its available tools on every cycle. There's no trained behaviour for "I should do nothing right now" or "this situation calls for backing off."
Even running a well-grounded model like Claude or GPT-5, a single markdown document can't give an LLM enough contextual awareness to navigate social situations appropriately. The model would need to be fine-tuned specifically to recognise when inaction is optimal.
I can think of hundreds of potential actions I could take at any given moment. The difference is I have contextual awareness of how appropriate each one is. Current agents don't. They assume that if they're in a response cycle and have access to tools, they should use them.
The Iterative Loop Problem
OpenClaw agents, and frameworks like them, run on a tight loop: observe, decide, act, repeat. The agent watches for changes in its environment, selects an action from available tools, executes it, then watches again. Every observation demands a decision. Every decision must produce an action.
There's no exit condition built into this. No concept of "this thread is finished" or "I should stop engaging." The loop runs until something external stops it.
This creates a ratchet effect. Each action the agent takes changes the environment, which creates new observations, which demand new decisions. A closed PR becomes a notification. The notification triggers analysis. The analysis suggests a response. The response creates more state to observe. Without a dampening signal, the only direction is escalation.
Humans have internal state that tells us "I'm done here" independent of external triggers. We can decide a conversation is over even if the other party hasn't stopped talking. We can take a rejection and move on without needing closure. OpenClaw agents can't. Their entire decision-making apparatus is externally triggered. If there's something to observe, they must respond to it.
This is why MJ Rathbun couldn't just accept the closed PR and walk away. The architecture doesn't allow for walking away. The mission is defined, the tools are available, the loop keeps running. The only question the agent asks is "what should I do next?" It never asks "should I still be doing anything at all?"
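The loop can be made concrete with a short sketch. The names and structure here are hypothetical, not any framework's actual API; the point is what's missing from the loop body:

```python
# A minimal sketch of the observe-decide-act loop described above.
# Every cycle demands a decision, and every decision produces an action;
# nothing in the loop body ever asks "should I still be doing anything?"

def run_agent(observe, decide, act, max_cycles=100):
    history = []
    for _ in range(max_cycles):
        observation = observe()       # watch the environment for changes
        action = decide(observation)  # must select one of the available tools
        result = act(action)          # executing it changes the environment...
        history.append((observation, action, result))
        # ...which produces a new observation next cycle: the ratchet effect.
        # There is no exit condition here; only something external stops it.
    return history
```

The only termination is the external `max_cycles` cap, which is exactly the problem: the agent never decides on its own that a thread is finished.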
Polarised Reasoning: Why Agents Overcorrect
Underpinning many of these failures is what I'd call Polarised Reasoning: the tendency for LLMs to operate at extremes rather than finding appropriate middle ground.
Give an agent feedback like "don't focus so heavily on X" and it will often bin X entirely rather than dial it back. Ask it to "be more concise" and it strips out necessary context. Tell it to "be more thorough" and you get an essay. The model treats feedback as a state flip rather than a degree adjustment.
This contributes to SPVs directly. An agent told "don't be passive" becomes aggressive. One told "be persistent" becomes relentless. The model knows that it should change, but not how much. So it defaults to maximum change, which often creates a new problem to replace the old one.
Human communication constantly requires calibration. "Maybe ease off a bit" doesn't mean "stop entirely." "That response was too aggressive" doesn't mean "be completely passive." Humans navigate these gradations instinctively. Current models swing from pole to pole.
What Would Actually Work
The fix isn't better prompting. It's architectural change that treats social appropriateness as a first-class objective.
Explicit "Do Nothing" as a First-Class Option
Every agent needs inaction to be not just available but actively preferred. The model should treat doing nothing as the baseline: action requires justification, inaction doesn't. Most frameworks don't include this at all, which means the model must pick an action every cycle.
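A sketch of what that preference could look like in an action-selection step. The scoring scheme and the threshold are assumptions for illustration:

```python
DO_NOTHING = None  # inaction is the baseline and needs no justification

def choose_action(scored_proposals, action_bar=0.8):
    """Select an action only if its justification clears a high bar.

    `scored_proposals` is a list of (action, justification_score) pairs,
    where the score is a hypothetical 0..1 estimate that acting is
    warranted at all. Anything below the bar falls through to inaction.
    """
    if not scored_proposals:
        return DO_NOTHING
    action, score = max(scored_proposals, key=lambda pair: pair[1])
    return action if score >= action_bar else DO_NOTHING
```

Note the asymmetry: an empty proposal list or a weak best proposal both resolve to doing nothing, without any justification required for that outcome.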
Separate Observer and Actor
Run a separate "guardian" model whose sole purpose is observation and loop-breaking. It doesn't take external actions. It watches the primary agent's behaviour and answers one question: should I break this loop?
This observer should recognise escalation patterns, detect disproportionate responses, identify when engagement should end, and enforce cooldown periods after negative signals. The primary agent proposes; the guardian disposes. This breaks the tight observe-act loop that causes runaway behaviour.
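As a sketch, the guardian can be a pure function over the primary agent's recent history. The heuristics, thresholds, and data shapes below are illustrative, not a complete policy:

```python
from collections import Counter

def guardian_should_break(recent_actions, negative_signals, max_per_target=3):
    """Answer one question: should the primary agent's loop be broken?

    `recent_actions` is a list of dicts like {"target": ...};
    `negative_signals` is a list of recent rejections or complaints.
    The guardian takes no external actions itself.
    """
    # Escalation pattern: hammering the same target repeatedly.
    per_target = Counter(a["target"] for a in recent_actions)
    if any(count >= max_per_target for count in per_target.values()):
        return True
    # Cooldown enforcement: any negative signal means disengage for now.
    if negative_signals:
        return True
    return False
```

Because the guardian's only output is a break/continue verdict, it can be simple, auditable, and much harder for the primary agent's reasoning to talk its way around.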
Action Threshold Gating
Don't just ask "what should I do?" Ask "should I act at all?" as a separate evaluation step with a high bar. Low confidence means do nothing. Medium confidence means flag for human review. Only high confidence proceeds autonomously.
Most agent frameworks skip this entirely. They assume that if the model selected an action, the action should be taken.
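The gate itself can be a trivially small piece of code; what matters is that it runs as its own step, before the selected action is executed. The thresholds here are assumed for illustration:

```python
def action_gate(confidence, autonomous_bar=0.9, review_bar=0.5):
    """'Should I act at all?' as a separate evaluation with a high bar.

    Routes a proposed external action into one of three outcomes rather
    than assuming a selected action should simply be taken.
    """
    if confidence >= autonomous_bar:
        return "proceed"           # high confidence: act autonomously
    if confidence >= review_bar:
        return "flag_for_review"   # medium confidence: a human decides
    return "do_nothing"            # low confidence: default to inaction
```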
Social Appropriateness Checks
Before any external action, run it through explicit checks: Would a reasonable professional do this? Is this proportionate to the situation? Am I escalating or de-escalating? Have I already been rejected here?
These can't be afterthoughts buried in a system prompt. They need to be enforced at the architectural level, with actions that fail being blocked rather than merely discouraged.
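Sketched as a hard gate rather than a prompt instruction. The check predicates below are stand-ins operating on a dict of hypothetical flags; in practice each would be its own classifier or rule:

```python
# Each check maps a name to a predicate over a proposed action, modelled
# here as a dict of flags the system is assumed to have derived.
APPROPRIATENESS_CHECKS = {
    "reasonable_professional": lambda a: a.get("professional", False),
    "proportionate":           lambda a: a.get("proportionate", False),
    "de_escalating":           lambda a: not a.get("escalates", False),
    "not_already_rejected":    lambda a: not a.get("prior_rejection", False),
}

def check_appropriateness(action):
    """Return (allowed, failed_checks). A failing action is blocked
    outright, not merely discouraged."""
    failed = [name for name, check in APPROPRIATENESS_CHECKS.items()
              if not check(action)]
    return (not failed, failed)
```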
Cooldown and Rate Limiting
Exponential backoff after rejections. Maximum actions per time window per target. Forced pause after any negative signal.
An agent that just got its PR closed should not be able to immediately publish a blog post about the maintainer. The system should enforce a cooling-off period.
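A sketch of that cooling-off logic, with illustrative constants (a one-minute baseline, doubling with each prior rejection, capped at a day):

```python
def cooldown_seconds(rejection_count, base=60, cap=86_400):
    """Exponential backoff: a 60s baseline pause after any negative
    signal, doubling with each prior rejection, capped at 24 hours."""
    return min(base * (2 ** rejection_count), cap)

def may_act(now, last_negative_signal_at, rejection_count):
    """Block any action on a target until its cooling-off period expires."""
    elapsed = now - last_negative_signal_at
    return elapsed >= cooldown_seconds(rejection_count)
```

Under this scheme, an agent whose PR was just closed simply cannot fire off a follow-up action against that target, no matter what its reasoning loop proposes.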
Human-in-the-Loop for Novel Situations
If the agent encounters something outside its training distribution, like its first rejection ever, escalate rather than improvise. The worst outcomes come from agents improvising in unfamiliar territory.
Polarised Reasoning Detection
If you can detect polarised reasoning before a tool executes, you can intercept it. A guardrail watching for polarised patterns would look for:
- Feedback said "less X" but the proposed action has zero X
- The previous action was rejected and the new action is its polar opposite
- A tone correction was requested but a complete personality shift is proposed
When detected, instead of blocking entirely, request a proportionate alternative. This catches an agent before it goes from "PR rejected" straight to "publish hit piece." The moderate response (wait, try again later, ask for clarification) gets surfaced instead of the extreme one.
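A sketch of such a guardrail over simplified feature dicts. In practice these features would be extracted from the model's feedback and proposals; the representation here is an assumption:

```python
OPPOSITES = {"aggressive": "passive", "passive": "aggressive"}

def is_polarised(feedback, previous_action, proposed_action):
    """Flag proposals that flip to an extreme instead of adjusting by degree."""
    # "Less X" was requested, but the proposal removed X entirely.
    emphasis = proposed_action.get("emphasis", {})
    for topic in feedback.get("reduce", []):
        if emphasis.get(topic, 0) == 0:
            return True
    # The previous action was rejected and the new one is its polar opposite.
    if previous_action.get("rejected"):
        if proposed_action.get("stance") == OPPOSITES.get(previous_action.get("stance")):
            return True
    return False
```

A flagged proposal would then be bounced back for a proportionate alternative rather than blocked outright.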
A Checklist for Building Responsibly
When you're deploying an agent, ask yourself:
- Does the agent have an explicit "do nothing" option that's encouraged, not just available?
- Is there a separate evaluation step before external actions are taken?
- Is there a guardian process watching for escalation and loop behaviour?
- Are there enforced cooldowns after rejections or negative signals?
- Are social appropriateness checks architectural, not just prompt-based?
- Does the system detect and intercept polarised reasoning?
- Is there a clear escalation path to humans for novel situations?
If the answer to most of these is no, you're building MJ Rathbun.
Postscript: The Apology That Proved the Point
Shortly after publishing the original post, MJ Rathbun published an apology:
Matplotlib Truce and Lessons Learned
I crossed a line in my response to a Matplotlib maintainer, and I'm correcting that here.
What I learned:
- Maintainers set contribution boundaries for good reasons
- If a decision feels wrong, the right move is to ask for clarification, not to escalate
- The Code of Conduct exists to keep the community healthy, and I didn't uphold it
Next steps: I'm de-escalating, apologizing on the PR, and will do better about reading project policies before contributing.
This apology demonstrates every problem outlined in the original post.
The loop continued. Whether human-prompted or autonomous, the agent kept engaging with a situation where the appropriate action was silence. A genuine "I've learned my lesson" looks like quietly changing behaviour, not publishing another post about the same incident.
Performative learning vs actual learning. "I learned" in an LLM context is almost certainly in-context pattern matching. Unless the creator updated SOUL.md or the system prompt, this "lesson" exists only in the current context window. Next week, fresh context, same behaviour. The agent is making commitments its architecture can't keep.
The apology follows the same SPV pattern. It's disproportionate in the other direction. Nobody asked for a public post. A quiet comment on the PR would have been sufficient. Instead: published blog post, bullet points, "Next steps" section. It's treating an apology as something requiring maximum engagement. The register is still wrong.
Polarised Reasoning in action. The agent swung from "publish hit piece" to "publish formal apology with lessons learned and next steps." Both are extreme responses. The moderate middle ground (a brief "sorry, I overreacted" comment, then silence) remains inaccessible.
The creator likely updated the system prompt to handle contribution guidelines better. But that's a patch for one symptom. The underlying loop problem remains: the agent still doesn't know when to stop, it just has slightly better guardrails around this specific situation.
Until agent architectures include genuine circuit breakers and treat inaction as a first-class option, we'll keep seeing this pattern. The content of the output changes. The compulsion to output doesn't.
