This is the first OrgX autonomous initiative benchmark run that readers can inspect instead of simply trust.
It completed the full catalog, generated artifacts, sent those artifacts to independent judges, recorded token usage, and published the raw result files. It also failed the stricter public-quality gate in places where the outputs were not good enough.
That combination matters. A credible benchmark should not only produce a score. It should make the weak spots easy to find.
The weakness we removed first
The previous smoke run proved the runner could complete:
- gpt-5-nano generated benchmark artifacts across 12 tasks.
- The bundle writer produced summary.json, metadata.json, tasks.json, examples.json, and scorecard.csv.
- The validator and scorecard recomputation passed.
But the scoring was not rigorous enough for public claims.
The model that generated each artifact also returned its own rubric scores. That was useful for a cheap smoke test. It was not acceptable as a public evaluation protocol.
If someone on the internet asked, "Who judged the work?", the honest answer was: the generating model judged itself.
That is exactly the kind of methodology weakness people should challenge.
So the next runner upgrade was straightforward: separate the model that creates the artifact from the models that judge it.
The first judged candidate run
We ran the expanded catalog with 15 tasks, 3 repeats per task, and 3 independent judges per generated artifact.
| Field | Value |
|---|---|
| Generator | gpt-5-nano |
| Judge 1 | gpt-5.4-nano with low reasoning |
| Judge 2 | gpt-5.4-mini with medium reasoning |
| Judge 3 | gpt-5.4 with high reasoning |
| Generated artifacts | 45 |
| Independent judge calls | 135 |
| Judge failures | 0 |
| Normal validation | Passed |
| Scorecard recomputation | Passed |
| Strict validation | Failed on 10 quality-bar misses |
You can inspect the actual bundle at useorgx.com/benchmarks/local-openai-gpt-5-nano-full-public-judge-20260411. That page links the raw files, previews generated artifacts from examples.json, and shows judge aggregates from judgments.json.
Open these first:
- examples.json: the generated artifacts
- judgments.json: the independent judge records
- scorecard.csv: task-level scoring
- tasks.json: prompts, rubrics, and assumptions
- metadata.json: runner provenance
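If you download the bundle, a quick completeness check is a reasonable first step before reading anything. The sketch below assumes the five files sit together in one local directory. The function name and the directory layout are our illustration here, not part of the runner.

```python
from pathlib import Path

# File names are the ones listed above; keeping them in one flat
# directory is an assumption made for this sketch.
EXPECTED_FILES = (
    "examples.json",
    "judgments.json",
    "scorecard.csv",
    "tasks.json",
    "metadata.json",
)


def missing_bundle_files(bundle_dir):
    """Return the expected bundle files that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # "./bundle" is a placeholder path for wherever the files were saved.
    missing = missing_bundle_files("./bundle")
    print("bundle complete" if not missing else f"missing: {missing}")
```

An empty result only means the files are present. It says nothing about whether their contents validate.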
The design tasks currently produce markdown handoff artifacts rather than screenshots. That is a limitation. The current evidence is inspectable as generated design specifications; future design tasks should include visual outputs as well.
The strict validation failure is the point.
The run completed. The data shape is valid. The judge calls completed. The scorecard recomputed. But the stricter public-grade gate caught below-bar design outputs. That is the right failure mode for a methodology candidate. The system should expose weak work instead of smoothing it into a clean narrative.
What changed in the benchmark
Judged bundles now include:
- judgments.json with per-judge criterion scores
- median judge scores used for public quality scoring
- judge disagreement statistics
- human-review flags
- actual generation token usage
- actual judge token usage
- reasoning-token accounting
- estimated generation, judging, and total cost
- scorecard fields for scoring source, judge count, disagreement, and judge cost
The runner can judge an existing bundle without regenerating artifacts, or it can run judging immediately after generation.
That matters because the cheap model can keep doing the work, while stronger models judge the work.
What it cost
The whole run cost about 124 cents, so cost is not the limiting factor.
| Category | Tokens | Cost |
|---|---|---|
| Generation | 81,132 | 2.4623 cents |
| Judging | 319,233 | 122.0173 cents |
| Total | 400,365 | 124.4796 cents |
The gpt-5.4 high-reasoning judge accounted for 102.0777 cents across 45 calls. It used 55,680 output tokens, including 43,558 reasoning tokens.
The takeaway: high-quality judging is affordable for this benchmark scale. The real cost risk is uncontrolled reasoning output, not the base price of the judge model. Future runs should cap judge output, record reasoning_tokens, and publish actual judge spend with every result bundle.
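The arithmetic behind that takeaway is easy to recheck. The sketch below uses only the figures reported above. The variable names are ours, and the two ratios are derived here rather than taken from the bundle.

```python
# Figures copied from the cost table and the paragraph above (costs in cents).
GENERATION = {"tokens": 81_132, "cents": 2.4623}
JUDGING = {"tokens": 319_233, "cents": 122.0173}
HIGH_REASONING_JUDGE = {
    "cents": 102.0777,
    "calls": 45,
    "output_tokens": 55_680,
    "reasoning_tokens": 43_558,
}

# Totals should match the "Total" row of the table.
total_tokens = GENERATION["tokens"] + JUDGING["tokens"]
total_cents = round(GENERATION["cents"] + JUDGING["cents"], 4)

# Share of judging spend attributable to the high-reasoning judge.
judge_share = HIGH_REASONING_JUDGE["cents"] / JUDGING["cents"]

# Share of that judge's output tokens that were reasoning tokens.
reasoning_share = (
    HIGH_REASONING_JUDGE["reasoning_tokens"] / HIGH_REASONING_JUDGE["output_tokens"]
)

print(f"total: {total_tokens} tokens, {total_cents} cents")
print(f"high-reasoning judge share of judging cost: {judge_share:.1%}")
print(f"reasoning share of that judge's output: {reasoning_share:.1%}")
```

By these numbers, one judge accounts for roughly 84% of judging spend, and about 78% of its output tokens are reasoning tokens. That is why capping judge output matters more than the judge's base price.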
What the stricter gate caught
The strict misses were concentrated in design:
- design-live-room-critique-r1: the high-taste criterion scored 0.75.
- design-modal-mobile-interaction-spec-r1: completeness was 0.82, quality was 80, and engineering-ready scored 0.75.
- design-modal-mobile-interaction-spec-r2: completeness was 0.82 and quality was 84.17.
- design-modal-mobile-interaction-spec-r3: quality was 84.44.
- design-live-room-responsive-system-r1: completeness was 0.84.
- design-live-room-responsive-system-r2: quality was 81.39 and artifact-and-blocker-flows scored 0.7.
This tells us two useful things.
First, the benchmark is now sensitive enough to identify domain-specific weakness. Design is not just "one task in the catalog" anymore. We added practical design tasks around mobile artifact viewers, mobile modal interaction specs, and responsive live-room systems. Those tasks immediately became the sharpest quality test.
Second, "complete" is not the same as "good." The artifacts were complete enough to score, but some were not strong enough to clear the stricter public-grade bar.
That is exactly the distinction this benchmark needs to make.
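To make the "complete versus good" distinction concrete, here is a minimal sketch of a strict gate that checks three separate floors. This post does not state the actual thresholds, so the defaults below (0.85 completeness, 85 quality, 0.80 per criterion) are placeholders chosen only to be consistent with the misses listed above. The TaskResult shape is also hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical record shape; the real scorecard fields may differ."""
    run_id: str
    completeness: float  # 0..1
    quality: float       # 0..100
    criteria: dict = field(default_factory=dict)  # criterion name -> 0..1


def strict_misses(result, min_completeness=0.85, min_quality=85.0,
                  min_criterion=0.80):
    """List every quality-bar miss for one result.

    All three thresholds are ASSUMED placeholder values, not the
    benchmark's published gate.
    """
    misses = []
    if result.completeness < min_completeness:
        misses.append(f"completeness {result.completeness} < {min_completeness}")
    if result.quality < min_quality:
        misses.append(f"quality {result.quality} < {min_quality}")
    for name, score in sorted(result.criteria.items()):
        if score < min_criterion:
            misses.append(f"{name} {score} < {min_criterion}")
    return misses


# Values for this run are the ones reported in the list above.
r1 = TaskResult(
    run_id="design-modal-mobile-interaction-spec-r1",
    completeness=0.82,
    quality=80.0,
    criteria={"engineering-ready": 0.75},
)
for miss in strict_misses(r1):
    print(r1.run_id, "->", miss)
```

One artifact can contribute several misses. Under these assumed floors, the r1 modal spec above accounts for three of them.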
Why every run was flagged for human review
The first judged run marked 45 of 45 artifacts for human review.
That sounds alarming until you look at the policy.
The human-review flag is intentionally sensitive. It fires when judges materially disagree at the criterion level, when any judge recommends review, or when a judge call fails. For this first pass, we wanted the panel to behave like a triage layer, not like an automatic truth machine.
That means a human-review flag is not a failed task.
It means: "Do not publish this score as if the panel fully agreed."
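The policy can be sketched as a single predicate over the judge records for one artifact. The record keys (failed, recommends_review, scores) and the 0.2 spread threshold are assumptions for illustration. What counts as "material" disagreement is exactly the thing that still needs calibration.

```python
def needs_human_review(judge_records, max_criterion_spread=0.2):
    """Return True if any of the three triggers described above fires.

    judge_records: list of dicts with hypothetical keys
      "failed" (bool), "recommends_review" (bool),
      "scores" (dict of criterion name -> score in 0..1).
    max_criterion_spread is an ASSUMED stand-in for "material disagreement".
    """
    # Trigger 1: a judge call failed.
    if any(record["failed"] for record in judge_records):
        return True
    # Trigger 2: any judge recommends review.
    if any(record["recommends_review"] for record in judge_records):
        return True
    # Trigger 3: judges disagree on at least one criterion.
    criteria = set().union(*(record["scores"] for record in judge_records))
    for criterion in criteria:
        values = [r["scores"][criterion] for r in judge_records
                  if criterion in r["scores"]]
        if max(values) - min(values) > max_criterion_spread:
            return True
    return False


# Illustrative scores, not taken from judgments.json.
panel = [
    {"failed": False, "recommends_review": False, "scores": {"high-taste": 0.9}},
    {"failed": False, "recommends_review": False, "scores": {"high-taste": 0.6}},
    {"failed": False, "recommends_review": False, "scores": {"high-taste": 0.8}},
]
print(needs_human_review(panel))
```

Because the triggers are combined with "or", a single cautious judge is enough to raise the flag. That is consistent with a deliberately sensitive triage layer.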
The next step is to calibrate that threshold against human adjudication. We need to learn which disagreements are noise, which are meaningful, and which should block public claims.
What reviewers should challenge
People should challenge this benchmark. That is how it gets better.
The main objections are already visible:
- Smoke runs still use self-reported scores and must be clearly labeled.
- Independent judges are implemented, but judge disagreement needs human calibration.
- Human baselines need stronger provenance, larger samples, and clearer collection methodology.
- Design coverage improved from 1 task to 4 tasks, but should keep expanding toward practical senior-design workflows.
- Strict validation catches weak judged outputs, but public claims still need human adjudication and baseline evidence.
- Some artifacts are plausible but are not yet work that a strong operator would ship without revision.
Those are not footnotes. They are the roadmap.
What we are improving next
The next pass is about data quality and task realism.
We are tightening the public bar in five ways.
1. Keep smoke runs separate from publishable runs
Smoke runs answer one question: can the runner complete the catalog cheaply and produce a valid bundle?
Publishable runs answer a harder question: would independent reviewers trust the task design, judge protocol, scoring, baselines, and raw artifacts?
Those are different gates.
2. Use multiple independent judges
The initial public judge panel is:
- gpt-5.4-nano with low reasoning
- gpt-5.4-mini with medium reasoning
- gpt-5.4 with high reasoning
The public score should use median criterion scores, not generator self-scores. It should also report disagreement and route uncertain cases to human review.
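A minimal sketch of that aggregation, assuming each judge returns one score per criterion. The input shape and the scores are illustrative and are not copied from judgments.json. The spread here is a simple max-minus-min, one of several ways disagreement could be reported.

```python
from statistics import median


def aggregate_judges(scores_by_judge):
    """Collapse per-judge criterion scores into a median plus a spread.

    scores_by_judge: {judge name: {criterion name: score}} (assumed shape).
    Returns {criterion: {"median": ..., "spread": ..., "judges": ...}}.
    """
    criteria = sorted({c for scores in scores_by_judge.values() for c in scores})
    aggregated = {}
    for criterion in criteria:
        values = [scores[criterion] for scores in scores_by_judge.values()
                  if criterion in scores]
        aggregated[criterion] = {
            "median": median(values),
            "spread": max(values) - min(values),
            "judges": len(values),
        }
    return aggregated


# Judge names match the panel above; the scores are made up for the example.
example = {
    "gpt-5.4-nano": {"high-taste": 0.9, "engineering-ready": 0.8},
    "gpt-5.4-mini": {"high-taste": 0.75, "engineering-ready": 0.8},
    "gpt-5.4": {"high-taste": 0.7, "engineering-ready": 0.75},
}
for criterion, stats in aggregate_judges(example).items():
    print(criterion, stats)
```

With three judges, the median is simply the middle score, so one outlier judge cannot move the public number on its own.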
3. Publish more data, not just better headlines
Future public bundles should include:
- generation model and judge model names
- token usage by task and by judge
- retry counts and retry reasons
- criterion-level scores from every judge
- judge disagreement statistics
- minimum criterion thresholds
- task provenance and contamination-risk notes
- human baseline sample size, source type, and confidence
- artifact length and structured sections
- validation warnings and failures
The methodology should make weak spots easy to find.
4. Add more practical design tasks
The design slice needs to look like work a strong human product designer actually does.
The first expansion added:
- mobile artifact viewer remediation
- mobile modal interaction specification
- responsive live-room system specification
The next design tasks should cover accessibility handoff, loading and empty states, long-content behavior, and product hierarchy repair against real design-system constraints.
5. Report uncertainty
The next public update should include:
- best, median, and worst task outcomes
- task-level failure analysis
- domain-level coverage gaps
- judge agreement rates
- confidence bands across repeated runs
- methodology changes planned before stronger public claims
A benchmark that only publishes wins is a demo. A benchmark that publishes uncertainty can become infrastructure.
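As a starting point for reporting spread across repeats, a per-task summary could look like the sketch below. It uses the three quality scores reported above for the mobile modal interaction spec. The function name is hypothetical. With only three repeats this is a min/median/max range, not a statistical confidence band.

```python
from statistics import median


def summarize_repeats(scores):
    """Summarize repeated runs of one task as best, median, worst, and range."""
    if not scores:
        raise ValueError("need at least one repeat")
    return {
        "best": max(scores),
        "median": median(scores),
        "worst": min(scores),
        "range": max(scores) - min(scores),
    }


# Quality scores for design-modal-mobile-interaction-spec r1, r2, r3,
# as reported in the strict-gate section above.
modal_spec_quality = [80.0, 84.17, 84.44]
print(summarize_repeats(modal_spec_quality))
```

A real confidence band would need more repeats per task than the three used in this run.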
The current status
Current status: first judged methodology candidate completed.
The benchmark now has the core pieces needed for a credible public evaluation loop: independent judges, repeat runs, token-level accounting, stricter validation, and a visible failure list.
The next target is not a bigger headline number.
The next target is a stricter run that passes because the data, artifacts, and methodology are strong enough to survive hostile review.