Judge & Escalation
The Quality Gate
The judge is the decision-maker that controls when a node's work is "done." It sits at the exit of every LLM turn: each iteration of the event loop ends with the judge rendering a verdict on whether the worker has produced acceptable output or needs to keep going.
Without a judge, agents run to completion on a single pass and hope for the best. With a judge, each turn is evaluated against structural requirements and, optionally, quality criteria. The result is a tight feedback loop: produce work, get evaluated, receive feedback, try again.
The judge can only say one of three things: accept the work, retry with feedback, or escalate because something is fundamentally wrong. This three-way verdict controls graph execution flow, determining whether the node loops, advances, or fails.
The Three Verdicts
ACCEPT
The work meets the bar. Stop the loop, write outputs to shared memory, and move to the next node in the graph.
RETRY
Not good enough yet. Inject feedback into the conversation and let the LLM try again within the same node.
ESCALATE
Something is fundamentally wrong. Stop the loop, mark the node as failed, and let the executor handle it, potentially routing to a human or fallback path.
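The three-way verdict and its effect on loop control can be sketched as a small enum. This is a hypothetical illustration; the framework's actual types and action names are not shown in this document:

```python
from enum import Enum

class Verdict(Enum):
    ACCEPT = "accept"      # stop the loop, write outputs, advance in the graph
    RETRY = "retry"        # inject feedback, run another turn in the same node
    ESCALATE = "escalate"  # stop the loop, mark the node failed

def apply_verdict(verdict, feedback=None):
    """Map a verdict to the loop's next action (action names are illustrative)."""
    if verdict is Verdict.ACCEPT:
        return "flush_outputs_and_advance"
    if verdict is Verdict.RETRY:
        return f"inject_feedback:{feedback}"
    return "mark_failed_and_escalate"
```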
Implicit Judge: Evaluation Cascade
When no custom judge is provided, the system uses a built-in evaluation cascade. After each LLM turn, the implicit judge follows a tiered sequence of checks, from fast structural validation to LLM-powered quality assessment.
Tool Call Bail-Out
If the LLM made tool calls this turn, the verdict is always RETRY. The LLM is still working, so let it keep going. No evaluation needed.
Level 0: Structural Check
Check whether all required output keys have been set. If any are missing, RETRY with feedback listing exactly which keys are needed. This is a fast dictionary lookup with no LLM call. If all keys are nullable and none have been set, it still returns RETRY to prevent empty completions.
Level 2: Quality Check
Triggers only if the node has success criteria defined. A separate fast LLM call evaluates quality against the node description, success criteria, output values, and the last 10 messages, returning ACCEPT or RETRY with a confidence score and feedback. If no success criteria are configured, passing Level 0 alone is enough.
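The cascade can be sketched as a single function. This is a minimal illustration of the tiers described above; `quality_check` stands in for the fast LLM call, and all names and signatures are assumptions:

```python
def implicit_judge(turn, outputs, required_keys,
                   success_criteria=None, quality_check=None):
    """Sketch of the implicit evaluation cascade (names are assumptions)."""
    # Tool call bail-out: the LLM is still working, so let it keep going.
    if turn.get("tool_calls"):
        return ("RETRY", None)
    # Level 0: structural check -- a plain dictionary lookup, no LLM call.
    missing = [k for k in required_keys if outputs.get(k) is None]
    if missing:
        return ("RETRY", f"Missing required output keys: {', '.join(missing)}")
    # Level 2: quality check, only when success criteria are defined.
    if success_criteria and quality_check:
        return quality_check(outputs, success_criteria)
    return ("ACCEPT", None)
```

Note that an output dict whose keys are all unset still fails Level 0, matching the rule that nullable-only nodes cannot complete empty.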
Client-Facing Edge Case
Nodes with no output keys (pure conversation nodes) never auto-accept. They keep running until shutdown or max iterations, since their purpose is ongoing interaction, not producing a deliverable.
Worked Example: Travel Agent
A travel agent node must produce flight options, hotel recommendations, and a budget estimate. Two scenarios show how different cascade levels catch different problems.
Level 0: Missing Output Key
Node Output
flight_options: "3 direct flights found"
hotel_recommendations: "5 hotels near venue"
budget_estimate: not set
Level 0: Structural Check
Required output key budget_estimate is missing. Fast dictionary lookup, no LLM call needed.
Verdict: RETRY
Missing required output keys: budget_estimate. Please calculate the total estimated cost including flights and accommodation.
Level 2: Quality Check
Node Output
flight_options: "some flights exist"
hotel_recommendations: "hotels available"
budget_estimate: "around $1000"
Level 2: LLM Quality Check
Success criteria: "Provide specific flight numbers, hotel names with ratings, and itemized budget." Output is vague with no specific details.
Verdict: RETRY
Outputs lack specificity. Flight options should include carrier and flight numbers. Hotels need names and ratings. Budget should be itemized by category.
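The Level 0 scenario above reduces to a plain dictionary lookup over the travel agent's accumulator:

```python
# Travel-agent outputs from the Level 0 scenario (illustrative data).
outputs = {
    "flight_options": "3 direct flights found",
    "hotel_recommendations": "5 hotels near venue",
    "budget_estimate": None,  # not set
}
required = ["flight_options", "hotel_recommendations", "budget_estimate"]

# The structural check: no LLM call, just a scan for unset required keys.
missing = [k for k in required if outputs.get(k) is None]
feedback = f"Missing required output keys: {', '.join(missing)}"
print(feedback)  # Missing required output keys: budget_estimate
```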
Two Modes
Implicit Judge
The default mode. Uses the built-in tiered evaluation cascade with no configuration needed. Structural checks run automatically, and quality checks activate when success criteria are defined on the node. Suitable for most workflows where output completeness is the primary quality signal.
Custom Judge
Any object implementing JudgeProtocol can replace the implicit judge. It receives a rich context dictionary containing the output accumulator, current iteration number, conversation summary, expected output keys, and which keys are still missing. Returns the same three-way verdict. Use this to wire in domain-specific evaluation logic.
Safety Nets
| Guard | Trigger | Behaviour |
|---|---|---|
| ACCEPT override | Custom judge accepts but required output keys are still missing | System overrides the verdict to RETRY and tells the LLM which keys need to be filled. Prevents a sloppy custom judge from letting incomplete work through. |
| Judge failure | Level 2 LLM call errors out (network issue, rate limit) | Defaults to ACCEPT. Level 0 already passed, so structural requirements are met. Better to accept potentially-okay work than block execution because the judge LLM is down. |
| Stall detection | Same tool calls with identical arguments for 3 consecutive turns | A warning is injected into the conversation. For client-facing nodes, user input is requested instead. Prevents the LLM from spinning in an unproductive loop. |
| Max iterations | Node reaches 50 turns (default, configurable via max_iterations) | The node terminates regardless of judge verdict. Combined with judge_every_n_turns, this creates a bounded loop that always terminates. |
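The first two guards in the table can be sketched as small wrapper functions. This is illustrative only; the real guard ordering and internals are assumptions:

```python
def guarded_accept(verdict, feedback, missing_keys):
    """ACCEPT override: a custom judge cannot wave incomplete work through."""
    if verdict == "ACCEPT" and missing_keys:
        return ("RETRY", f"Required output keys still missing: {', '.join(missing_keys)}")
    return (verdict, feedback)

def safe_quality_check(quality_check, outputs, criteria):
    """Judge-failure guard: if the Level 2 LLM call errors out, default to
    ACCEPT -- Level 0 already passed, so structural requirements are met."""
    try:
        return quality_check(outputs, criteria)
    except Exception:
        return ("ACCEPT", None)
```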
Executor Integration
The executor treats event loop nodes specially to avoid conflicts with the judge's internal retry logic. See Graph State Machine for details on how the executor handles retry at the graph level.
No Executor-Level Retry
The executor never retries an event loop node at the executor level. Retry is the judge's job, handled inside the loop. This prevents catastrophic retry multiplication where the executor retries a node that already retried internally 50 times.
Skip Output Validation
The executor skips output validation for event loop nodes. Since the judge already validated the work, running the validator again would be redundant and could reject legitimate flexible outputs.
Partial Output Flush
When an execution is cancelled mid-run, partial outputs from the accumulator are flushed to shared memory so they survive a resume. Normally, outputs only get written to shared memory on ACCEPT.
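The flush itself is simple to sketch. The function name and the flat key layout in shared memory are assumptions; only the behaviour (unset keys are skipped, set keys survive) comes from the prose:

```python
def flush_partial_outputs(accumulator, shared_memory):
    """On cancellation, copy any outputs set so far into shared memory
    so a resumed execution can pick them up."""
    for key, value in accumulator.items():
        if value is not None:  # skip keys the node never filled in
            shared_memory[key] = value
    return shared_memory
```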
Design Principles
Progressive Strictness
Structural checks are cheap and always run. Quality checks are optional and LLM-powered. The cascade escalates from fast dictionary lookups to full LLM evaluation only when needed, keeping costs low for simple cases.
Graceful Degradation
Failures in the judge itself never block execution. If the quality-check LLM is down, a structural pass is enough. The system always prefers forward progress over perfect evaluation.
Bounded Execution
max_iterations and stall detection ensure the loop always terminates. No matter how many times the judge says RETRY, the node will eventually stop. Infinite loops are structurally impossible.
