Judge & Escalation
The Quality Gate
The judge is the decision-maker that controls when a node's work is "done." It sits at the exit of every LLM turn: each iteration of the event loop ends with the judge rendering a verdict on whether the worker has produced acceptable output or needs to keep going.
Without a judge, agents run to completion on a single pass and hope for the best. With a judge, each turn is evaluated against structural requirements and, optionally, quality criteria. The result is a tight feedback loop: produce work, get evaluated, receive feedback, try again.
The judge can only say one of three things: accept the work, retry with feedback, or escalate because something is fundamentally wrong. This three-way verdict controls graph execution flow, determining whether the node loops, advances, or fails.
The Three Verdicts
ACCEPT
The work meets the bar. Stop the loop, write outputs to shared memory, and move to the next node in the graph.
RETRY
Not good enough yet. Inject feedback into the conversation and let the LLM try again within the same node.
ESCALATE
Something is fundamentally wrong. Stop the loop, mark the node as failed, and let the executor handle it, potentially routing to a human or fallback path.
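The three-way verdict and its effect on loop control can be sketched as a small enum. This is a hypothetical illustration; the framework's actual types and action names are not shown in this document:

```python
from enum import Enum

class Verdict(Enum):
    ACCEPT = "accept"      # stop the loop, write outputs, advance in the graph
    RETRY = "retry"        # inject feedback, run another turn in the same node
    ESCALATE = "escalate"  # stop the loop, mark the node failed

def apply_verdict(verdict, feedback=None):
    """Map a verdict to the loop's next action (action names are illustrative)."""
    if verdict is Verdict.ACCEPT:
        return "flush_outputs_and_advance"
    if verdict is Verdict.RETRY:
        return f"inject_feedback:{feedback}"
    return "mark_failed_and_escalate"
```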
Implicit Judge: Evaluation Cascade
When no custom judge is provided, the system uses a built-in evaluation cascade. After each LLM turn, the implicit judge follows a tiered sequence of checks, from fast structural validation to LLM-powered quality assessment.
Tool Call Bail-Out
If the LLM made tool calls this turn, the verdict is always RETRY. The LLM is still working, so let it keep going. No evaluation needed.
Level 0: Structural Check
Check whether all required output keys have been set. If any are missing, RETRY with feedback listing exactly which keys are needed. This is a fast dictionary lookup with no LLM call. If all keys are nullable and none have been set, it still returns RETRY to prevent empty completions.
Level 2: Quality Check
Triggers only if the node has success criteria defined. A separate fast LLM call evaluates quality against the node description, success criteria, output values, and the last 10 messages, returning ACCEPT or RETRY with a confidence score and feedback. If no success criteria are configured, passing Level 0 alone is enough.
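The cascade can be sketched as a single function. This is a minimal illustration of the tiers described above; `quality_check` stands in for the fast LLM call, and all names and signatures are assumptions:

```python
def implicit_judge(turn, outputs, required_keys,
                   success_criteria=None, quality_check=None):
    """Sketch of the implicit evaluation cascade (names are assumptions)."""
    # Tool call bail-out: the LLM is still working, so let it keep going.
    if turn.get("tool_calls"):
        return ("RETRY", None)
    # Level 0: structural check -- a plain dictionary lookup, no LLM call.
    missing = [k for k in required_keys if outputs.get(k) is None]
    if missing:
        return ("RETRY", f"Missing required output keys: {', '.join(missing)}")
    # Level 2: quality check, only when success criteria are defined.
    if success_criteria and quality_check:
        return quality_check(outputs, success_criteria)
    return ("ACCEPT", None)
```

Note that an output dict whose keys are all unset still fails Level 0, matching the rule that nullable-only nodes cannot complete empty.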
Client-Facing Edge Case
Nodes with no output keys (pure conversation nodes) never auto-accept. They keep running until shutdown or max iterations, since their purpose is ongoing interaction, not producing a deliverable.
Worked Example: Travel Agent
A travel agent node must produce flight options, hotel recommendations, and a budget estimate. Two scenarios show how different cascade levels catch different problems.
Level 0: Missing Output Key
Node Output
flight_options: "3 direct flights found"
hotel_recommendations: "5 hotels near venue"
budget_estimate: not set
Level 0: Structural Check
Required output key budget_estimate is missing. Fast dictionary lookup, no LLM call needed.
Verdict: RETRY
Missing required output keys: budget_estimate. Please calculate the total estimated cost including flights and accommodation.
Level 2: Quality Check
Node Output
flight_options: "some flights exist"
hotel_recommendations: "hotels available"
budget_estimate: "around $1000"
Level 2: LLM Quality Check
Success criteria: "Provide specific flight numbers, hotel names with ratings, and itemized budget." Output is vague with no specific details.
Verdict: RETRY
Outputs lack specificity. Flight options should include carrier and flight numbers. Hotels need names and ratings. Budget should be itemized by category.
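The Level 0 scenario above reduces to a plain dictionary lookup over the travel agent's accumulator:

```python
# Travel-agent outputs from the Level 0 scenario (illustrative data).
outputs = {
    "flight_options": "3 direct flights found",
    "hotel_recommendations": "5 hotels near venue",
    "budget_estimate": None,  # not set
}
required = ["flight_options", "hotel_recommendations", "budget_estimate"]

# The structural check: no LLM call, just a scan for unset required keys.
missing = [k for k in required if outputs.get(k) is None]
feedback = f"Missing required output keys: {', '.join(missing)}"
print(feedback)  # Missing required output keys: budget_estimate
```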
Two Modes
Implicit Judge
The default mode. Uses the built-in tiered evaluation cascade with no configuration needed. Structural checks run automatically, and quality checks activate when success criteria are defined on the node. Suitable for most workflows where output completeness is the primary quality signal.
Custom Judge
Any object implementing JudgeProtocol can replace the implicit judge. It receives a rich context dictionary containing the output accumulator, current iteration number, conversation summary, expected output keys, and which keys are still missing. Returns the same three-way verdict. Use this to wire in domain-specific evaluation logic.
Safety Nets
| Guard | Trigger | Behaviour |
|---|---|---|
| ACCEPT override | Custom judge accepts but required output keys are still missing | System overrides the verdict to RETRY and tells the LLM which keys need to be filled. Prevents a sloppy custom judge from letting incomplete work through. |
| Judge failure | Level 2 LLM call errors out (network issue, rate limit) | Defaults to ACCEPT. Level 0 already passed, so structural requirements are met. Better to accept potentially-okay work than block execution because the judge LLM is down. |
| Stall detection | Same tool calls with identical arguments for 3 consecutive turns | A warning is injected into the conversation. For client-facing nodes, user input is requested instead. Prevents the LLM from spinning in an unproductive loop. |
| Max iterations | Node reaches 50 turns (default, configurable via max_iterations) | The node terminates regardless of judge verdict. Combined with judge_every_n_turns, this creates a bounded loop that always terminates. |
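The first two guards in the table can be sketched as small wrapper functions. This is illustrative only; the real guard ordering and internals are assumptions:

```python
def guarded_accept(verdict, feedback, missing_keys):
    """ACCEPT override: a custom judge cannot wave incomplete work through."""
    if verdict == "ACCEPT" and missing_keys:
        return ("RETRY", f"Required output keys still missing: {', '.join(missing_keys)}")
    return (verdict, feedback)

def safe_quality_check(quality_check, outputs, criteria):
    """Judge-failure guard: if the Level 2 LLM call errors out, default to
    ACCEPT -- Level 0 already passed, so structural requirements are met."""
    try:
        return quality_check(outputs, criteria)
    except Exception:
        return ("ACCEPT", None)
```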
Executor Integration
The executor treats event loop nodes specially to avoid conflicts with the judge's internal retry logic. See Graph State Machine for details on how the executor handles retry at the graph level.
No Executor-Level Retry
The executor never retries an event loop node at the executor level. Retry is the judge's job, handled inside the loop. This prevents catastrophic retry multiplication where the executor retries a node that already retried internally 50 times.
Skip Output Validation
The executor skips output validation for event loop nodes. Since the judge already validated the work, running the validator again would be redundant and could reject legitimate flexible outputs.
Partial Output Flush
When an execution is cancelled mid-run, partial outputs from the accumulator are flushed to shared memory so they survive a resume. Normally, outputs only get written to shared memory on ACCEPT.
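The flush itself is simple to sketch. The function name and the flat key layout in shared memory are assumptions; only the behaviour (unset keys are skipped, set keys survive) comes from the prose:

```python
def flush_partial_outputs(accumulator, shared_memory):
    """On cancellation, copy any outputs set so far into shared memory
    so a resumed execution can pick them up."""
    for key, value in accumulator.items():
        if value is not None:  # skip keys the node never filled in
            shared_memory[key] = value
    return shared_memory
```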
Design Principles
Progressive Strictness
Structural checks are cheap and always run. Quality checks are optional and LLM-powered. The cascade escalates from fast dictionary lookups to full LLM evaluation only when needed, keeping costs low for simple cases.
Graceful Degradation
Failures in the judge itself never block execution. If the quality-check LLM is down, a structural pass is enough. The system always prefers forward progress over perfect evaluation.
Bounded Execution
max_iterations and stall detection ensure the loop always terminates. No matter how many times the judge says RETRY, the node will eventually stop. Infinite loops are structurally impossible.
