Our Autonomous Benchmark Has Independent Judges Now

The OrgX autonomous initiative benchmark now publishes generated artifacts, independent judgments, token-level costs, and the failures that still need human review.

Published 2026-04-11T20:00:00.000Z by OrgX.

Canonical URL: https://useorgx.com/blog/autonomous-benchmark-first-judged-run

This is the first OrgX autonomous initiative benchmark run that readers can inspect instead of simply trust.

It completed the full catalog, generated artifacts, sent those artifacts to independent judges, recorded token usage, and published the raw result files. It also failed the stricter public-quality gate in places where the outputs were not good enough.

That combination matters. A credible benchmark should not only produce a score. It should make the weak spots easy to find.

The weakness we removed first

The previous smoke run proved the runner could complete:

gpt-5-nano generated benchmark artifacts across 12 tasks.
The bundle writer produced summary.json, metadata.json, tasks.json, examples.json, and scorecard.csv.
The validator and scorecard recomputation passed.

But the scoring was not rigorous enough for public claims.

The model that generated each artifact also returned its own rubric scores. That was useful for a cheap smoke test. It was not acceptable as a public evaluation protocol.

If someone on the internet asked, "Who judged the work?", the honest answer was: the generating model judged itself.

That is exactly the kind of methodology weakness people should challenge.

So the next runner upgrade was straightforward: separate the model that creates the artifact from the models that judge it.

The first judged candidate run

We ran the expanded catalog with 15 tasks, 3 repeats per task, and 3 independent judges per generated artifact.

Field	Value
Generator	`gpt-5-nano`
Judge 1	`gpt-5.4-nano` with low reasoning
Judge 2	`gpt-5.4-mini` with medium reasoning
Judge 3	`gpt-5.4` with high reasoning
Generated artifacts	45
Independent judge calls	135
Judge failures	0
Normal validation	Passed
Scorecard recomputation	Passed
Strict validation	Failed on 10 quality-bar misses
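
The artifact and judge-call counts in the table follow directly from the run shape. A minimal sketch of that arithmetic, where the function and constant names are illustrative and not taken from the runner:

```python
# Counts taken from the run description above.
TASKS = 15
REPEATS_PER_TASK = 3
JUDGES_PER_ARTIFACT = 3


def run_size(tasks: int, repeats: int, judges: int) -> tuple[int, int]:
    """Return (generated artifacts, independent judge calls) for a full run."""
    artifacts = tasks * repeats
    judge_calls = artifacts * judges
    return artifacts, judge_calls


artifacts, judge_calls = run_size(TASKS, REPEATS_PER_TASK, JUDGES_PER_ARTIFACT)
print(artifacts, judge_calls)  # 45 135
```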

You can inspect the actual bundle at useorgx.com/benchmarks/local-openai-gpt-5-nano-full-public-judge-20260411. That page links the raw files, previews generated artifacts from examples.json, and shows judge aggregates from judgments.json.

Open these first:

examples.json: the generated artifacts
judgments.json: the independent judge records
scorecard.csv: task-level scoring
tasks.json: prompts, rubrics, and assumptions
metadata.json: runner provenance

The design tasks currently produce markdown handoff artifacts rather than screenshots. That is a limitation. The current evidence is inspectable as generated design specifications; future design tasks should include visual outputs as well.

The strict validation failure is the point.

The run completed. The data shape is valid. The judge calls completed. The scorecard recomputed. But the stricter public-grade gate caught below-bar design outputs. That is the right failure mode for a methodology candidate. The system should expose weak work instead of smoothing it into a clean narrative.

What changed in the benchmark

Judged bundles now include:

judgments.json with per-judge criterion scores
median judge scores used for public quality scoring
judge disagreement statistics
human-review flags
actual generation token usage
actual judge token usage
reasoning-token accounting
estimated generation, judging, and total cost
scorecard fields for scoring source, judge count, disagreement, and judge cost
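
To make the median and disagreement fields concrete, here is a minimal sketch of how one criterion's per-judge scores could be collapsed into a panel score. The function name, the field names, and the example scores are assumptions for illustration; the actual judgments.json schema and the disagreement statistic the runner uses may differ.

```python
from statistics import median


def aggregate_criterion(scores_by_judge: dict[str, float]) -> dict[str, float]:
    """Collapse one criterion's per-judge scores into a median and a spread.

    The spread here is simply max minus min, one possible disagreement
    statistic among many.
    """
    values = list(scores_by_judge.values())
    return {
        "median": median(values),
        "disagreement": round(max(values) - min(values), 4),
        "judge_count": len(values),
    }


# Hypothetical scores for a single criterion on a single artifact.
example = {"judge_1": 0.5, "judge_2": 0.75, "judge_3": 1.0}
result = aggregate_criterion(example)
print(result)  # {'median': 0.75, 'disagreement': 0.5, 'judge_count': 3}
```

With three judges, the median is the middle score, so a single outlier judge cannot move the public score on its own.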

The runner can judge an existing bundle without regenerating artifacts, or it can run judging immediately after generation.

That matters because the cheap model can keep doing the work, while stronger models judge the work.

What it cost

At roughly 124 cents in total, this run was cheap enough that cost is not the limiting factor.

Category	Tokens	Cost
Generation	81,132	2.4623 cents
Judging	319,233	122.0173 cents
Total	400,365	124.4796 cents

The gpt-5.4 high-reasoning judge accounted for 102.0777 cents across 45 calls. It used 55,680 output tokens, including 43,558 reasoning tokens.

The takeaway: high-quality judging is affordable at this benchmark scale. The real cost risk is uncontrolled reasoning output, not the base price of the judge model. Future runs should cap judge output, record reasoning_tokens, and publish actual judge spend with every result bundle.
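
The figures above can be turned into shares with a few lines of arithmetic. This sketch uses only the numbers reported in this section; the variable names are illustrative.

```python
# Figures reported in the cost table and the paragraph above (costs in cents).
generation_cents = 2.4623
judging_cents = 122.0173
high_judge_cents = 102.0777
high_judge_calls = 45
high_judge_output_tokens = 55_680
high_judge_reasoning_tokens = 43_558

total_cents = generation_cents + judging_cents

# How much of the spend the high-reasoning judge is responsible for.
share_of_judging = high_judge_cents / judging_cents
share_of_total = high_judge_cents / total_cents

# How much of that judge's output was reasoning rather than visible text.
reasoning_share = high_judge_reasoning_tokens / high_judge_output_tokens
cents_per_call = high_judge_cents / high_judge_calls

print(f"share of judging cost: {share_of_judging:.1%}")  # 83.7%
print(f"share of total cost:   {share_of_total:.1%}")    # 82.0%
print(f"reasoning share:       {reasoning_share:.1%}")   # 78.2%
print(f"cents per judge call:  {cents_per_call:.2f}")    # 2.27
```

One judge accounts for most of the bill, and most of its output is reasoning, which is why capping judge output is the lever named above.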

What the stricter gate caught

The 10 strict misses were concentrated in design, spread across six artifacts from three tasks:

design-live-room-critique-r1: the high-taste criterion scored 0.75.
design-modal-mobile-interaction-spec-r1: completeness was 0.82, quality was 80, and engineering-ready scored 0.75.
design-modal-mobile-interaction-spec-r2: completeness was 0.82 and quality was 84.17.
design-modal-mobile-interaction-spec-r3: quality was 84.44.
design-live-room-responsive-system-r1: completeness was 0.84.
design-live-room-responsive-system-r2: quality was 81.39 and artifact-and-blocker-flows scored 0.7.
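
The list above can be tallied to check it against the 10 quality-bar misses reported for strict validation. This sketch transcribes the listed values by hand; the dictionary layout is an illustration, not the format of scorecard.csv, and it assumes artifact IDs end in a repeat suffix such as -r1.

```python
from collections import Counter

# Below-bar values as listed above: artifact ID -> {metric: reported value}.
strict_misses = {
    "design-live-room-critique-r1": {"high-taste": 0.75},
    "design-modal-mobile-interaction-spec-r1": {
        "completeness": 0.82,
        "quality": 80,
        "engineering-ready": 0.75,
    },
    "design-modal-mobile-interaction-spec-r2": {
        "completeness": 0.82,
        "quality": 84.17,
    },
    "design-modal-mobile-interaction-spec-r3": {"quality": 84.44},
    "design-live-room-responsive-system-r1": {"completeness": 0.84},
    "design-live-room-responsive-system-r2": {
        "quality": 81.39,
        "artifact-and-blocker-flows": 0.7,
    },
}

total_misses = sum(len(metrics) for metrics in strict_misses.values())

# Group by task by stripping the trailing repeat suffix ("-r1", "-r2", ...).
misses_by_task = Counter()
for artifact_id, metrics in strict_misses.items():
    task_id = artifact_id.rsplit("-r", 1)[0]
    misses_by_task[task_id] += len(metrics)

print(total_misses)  # 10
for task_id, count in misses_by_task.most_common():
    print(task_id, count)
```

Grouped this way, the modal interaction spec accounts for six of the ten misses and fell short on all three repeats.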

This tells us two useful things.

First, the benchmark is now sensitive enough to identify domain-specific weakness. Design is not just "one task in the catalog" anymore. We added practical design tasks around mobile artifact viewers, mobile modal interaction specs, and responsive live-room systems. Those tasks immediately became the sharpest quality test.

Second, "complete" is not the same as "good." The artifacts were complete enough to score, but some were not strong enough to clear the stricter public-grade bar.

That is exactly the distinction this benchmark needs to make.

Why every run was flagged for human review

The first judged run marked 45 of 45 artifacts for human review.

That sounds alarming until you look at the policy.

The human-review flag is intentionally sensitive. It fires when judges materially disagree at the criterion level, when any judge recommends review, or when a judge call fails. For this first pass, we wanted the panel to behave like a triage layer, not like an automatic truth machine.
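
The three triggers described above can be sketched as a single predicate. The JudgeRecord type, its field names, and the 0.2 disagreement threshold are all assumptions for illustration; the post does not publish the threshold or the record layout the runner uses.

```python
from dataclasses import dataclass, field

# Hypothetical value: what counts as "material" disagreement is not
# specified in the post.
DISAGREEMENT_THRESHOLD = 0.2


@dataclass
class JudgeRecord:
    """One judge's result for one artifact (illustrative shape only)."""
    criterion_scores: dict[str, float] = field(default_factory=dict)
    recommends_review: bool = False
    failed: bool = False


def needs_human_review(
    records: list[JudgeRecord],
    threshold: float = DISAGREEMENT_THRESHOLD,
) -> bool:
    """Return True if any of the three review triggers fires."""
    # Trigger 1: a judge call failed.
    if any(r.failed for r in records):
        return True
    # Trigger 2: any judge recommends review.
    if any(r.recommends_review for r in records):
        return True
    # Trigger 3: judges materially disagree on at least one criterion.
    criteria = set().union(*(r.criterion_scores for r in records))
    for name in criteria:
        values = [
            r.criterion_scores[name]
            for r in records
            if name in r.criterion_scores
        ]
        if values and max(values) - min(values) >= threshold:
            return True
    return False


panel = [
    JudgeRecord({"completeness": 0.9}),
    JudgeRecord({"completeness": 0.5}),
    JudgeRecord({"completeness": 0.85}),
]
print(needs_human_review(panel))  # True
```

Because the triggers are combined with a logical OR, a sensitive setting on any one of them is enough to flag every artifact in a run.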

That means a human-review flag is not a failed task.

It means: "Do not publish this score as if the panel fully agreed."

The next step is to calibrate that threshold against human adjudication. We need to learn which disagreements are noise, which are meaningful, and which should block public claims.

What reviewers should challenge

People should challenge this benchmark. That is how it gets better.

The main objections are already visible:

Smoke runs still use self-reported scores and must be clearly labeled.
Independent judges are implemented, but judge disagreement needs human calibration.
Human baselines need stronger provenance, larger samples, and clearer collection methodology.
Design coverage improved from 1 task to 4 tasks, but should keep expanding toward practical senior-design workflows.
Strict validation catches weak judged outputs, but public claims still need human adjudication and baseline evidence.
Some artifacts are plausible but are not yet work that a strong operator would ship without revision.

Those are not footnotes. They are the roadmap.

What we are improving next

The next pass is about data quality and task realism.

We are tightening the public bar in five ways.

1. Keep smoke runs separate from publishable runs

Smoke runs answer one question: can the runner complete the catalog cheaply and produce a valid bundle?

Publishable runs answer a harder question: would independent reviewers trust the task design, judge protocol, scoring, baselines, and raw artifacts?

Those are different gates.

2. Use multiple independent judges

The initial public judge panel is:

gpt-5.4-nano with low reasoning
gpt-5.4-mini with medium reasoning
gpt-5.4 with high reasoning

The public score should use median criterion scores, not generator self-scores. It should also report disagreement and route uncertain cases to human review.

3. Publish more data, not just better headlines

Future public bundles should include:

generation model and judge model names
token usage by task and by judge
retry counts and retry reasons
criterion-level scores from every judge
judge disagreement statistics
minimum criterion thresholds
task provenance and contamination-risk notes
human baseline sample size, source type, and confidence
artifact length and structured sections
validation warnings and failures

The methodology should make weak spots easy to find.

4. Add more practical design tasks

The design slice needs to look like work a strong human product designer actually does.

The first expansion added:

mobile artifact viewer remediation
mobile modal interaction specification
responsive live-room system specification

The next design tasks should cover accessibility handoff, loading and empty states, long-content behavior, and product hierarchy repair against real design-system constraints.

5. Report uncertainty

The next public update should include:

best, median, and worst task outcomes
task-level failure analysis
domain-level coverage gaps
judge agreement rates
confidence bands across repeated runs
methodology changes planned before stronger public claims
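
As one possible shape for the best, median, and worst summary, the sketch below summarizes the three repeats of a single task. It uses the quality scores reported earlier for design-modal-mobile-interaction-spec (80, 84.17, and 84.44); the function name and output fields are assumptions. With only three repeats it reports the observed range rather than a formal confidence interval, which is a choice made for this illustration and not a statement of the planned method.

```python
from statistics import mean, median


def summarize_repeats(scores: list[float]) -> dict[str, float]:
    """Summarize repeated runs of one task as worst, median, best, and spread."""
    return {
        "worst": min(scores),
        "median": median(scores),
        "best": max(scores),
        "mean": round(mean(scores), 2),
        "range": round(max(scores) - min(scores), 2),
    }


# Quality scores for repeats r1, r2, r3 of the modal interaction spec task.
modal_spec_quality = [80, 84.17, 84.44]
summary = summarize_repeats(modal_spec_quality)
print(summary)
```

Here the worst repeat sits more than four points below the other two, a spread that a single-run score would hide.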

A benchmark that only publishes wins is a demo. A benchmark that publishes uncertainty can become infrastructure.

The current status

Current status: first judged methodology candidate completed.

The benchmark now has the core pieces needed for a credible public evaluation loop: independent judges, repeat runs, token-level accounting, stricter validation, and a visible failure list.

The next target is not a bigger headline number.

The next target is a stricter run that passes because the data, artifacts, and methodology are strong enough to survive hostile review.


