The question we're trying to answer
When a real business task — drafting a product spec, planning a launch campaign, analyzing churn, coordinating across engineering and marketing — gets executed as an OrgX initiative, does multi-agent orchestration actually beat a single strong AI agent?
Not theoretically. Measurably.
Does OrgX orchestration improve time-to-decision while keeping the quality bar intact — across engineering, product, marketing, design, sales, ops, and cross-functional work?
That's what this benchmark exists to answer. Every week, we run the same suite of initiative-shaped tasks through three execution modes, score the outputs against explicit acceptance criteria, and publish everything — wins, losses, and regressions.
Try it yourself
The fastest way to evaluate OrgX is to run the benchmark inside the product.
- Sign up for OrgX or open Benchmark Lab if you already have access.
- Choose the Starter benchmark for a quick all-domain pass, or the Full benchmark for publication-grade coverage.
- Inspect your scorecard, surfaced artifacts, and bundle links on the run detail page.
- Compare your results against the public corpus in the benchmark hub.
The public GitHub repo lets you inspect the methodology and published proof bundles. The product is where you run it.
When you run it in OrgX, the benchmark should feel like normal OrgX usage:
- benchmark tasks launch as ordinary initiatives with benchmark metadata attached
- single-domain tasks launch domain-specific initiatives and cross-functional tasks launch multi-domain initiatives
- the same execution contract is reused across Agent, API, CLI, and E2B-backed execution surfaces
- benchmark workspaces default to the highest autonomy level the platform policy allows
- if a human approval or decision is required, the run is preserved but marked non-autonomous
Three modes, head to head
Every benchmark task is evaluated in three modes:
| Mode | What it represents |
|---|---|
| `orgx_orchestrated` | Multi-step OrgX execution: decomposition, specialist phases, coordinator synthesis |
| `single_agent` | A strong single model (Claude, GPT) doing all the work directly — no orchestration |
| `human_baseline` | Curated human reference with explicit provenance and methodology |
We treat single_agent as the closest analogue to "just use Claude, ChatGPT, or Cursor directly." If OrgX can't beat that bar, we don't publish.
This three-mode design is grounded in how the field evaluates multi-agent systems. Du et al. (2023) showed that multiple LLM instances debating over rounds significantly improve reasoning and factual accuracy — even when individual agents initially give wrong answers. The key insight: coordination structure matters more than raw model capability. MetaGPT (Hong et al., 2023) demonstrated that encoding real-world Standard Operating Procedures into multi-agent workflows reduces hallucinations and produces structured deliverables that unstructured single-agent runs cannot match.
But multi-agent isn't always better. Kim et al. (2025) derived scaling laws across 180 agent configurations and found that multi-agent architectures incur 1.6x to 6.2x token overhead at matched performance — a direct parallel to Brooks's Law. Our benchmark exists precisely to measure whether OrgX's orchestration topology justifies that coordination cost for initiative-shaped work.
Benchmark architecture
The system has three layers, each with a distinct job:
Layer 1: Private OrgX repo — runs the benchmark
- Task catalog and benchmark runners
- Scoring, comparison, and weekly report generation
- Publishability classification and gating
Layer 2: Public repo — proves the benchmark
- Full methodology (this page)
- Public task catalog and bundle schemas
- Whitelisted weekly result bundles
- Lightweight validation tools anyone can run
Layer 3: OrgX product — lets anyone test it
- Benchmark Lab with self-serve runs
- Scorecards, surfaced artifacts, and comparison
- No setup required beyond signing in
The private repo runs it. The public repo proves it. The product lets you verify it yourself.
Where benchmarks run
There are two execution surfaces, each optimized for different goals:
The weekly public benchmark runs from the private OrgX harness with controlled runners and publication gating. This produces the stable, reviewable public scorecard and bundle.
Self-serve Benchmark Lab lets users run the same benchmark logic inside OrgX. These runs produce scorecards and artifacts optimized for evaluation and proof, but aren't published directly to the public corpus unless they meet the stricter weekly publication rules. The benchmark path is designed to mirror the real product path: standard initiatives, standard workstreams, standard auto-continue, and standard approval/blocker handling, with benchmark metadata layered on top.
That distinction is deliberate. The public benchmark needs to be stable and auditable. Benchmark Lab needs to be fast and accessible.
Task catalog
The benchmark catalog is versioned in-repo and grouped by tier and domain.
- Tier 1 — Short, decision-ready tasks with fast iteration loops (3 repeats default)
- Tier 2 — Medium-complexity tasks requiring decomposition (3 repeats default for the public weekly suite)
- Tier 3 — Deeper initiative-scale tasks, used sparingly
The current suite spans 12 tasks across all 7 OrgX agent domains: engineering, product, marketing, sales, design, ops, and cross-functional initiative work.
If even one domain is missing from a weekly run, the entire week is classified do-not-publish.
The task decomposition approach follows Zhou et al. (2022), who showed that breaking complex problems into simpler sub-problems solved sequentially enables generalization to harder problems — achieving 99%+ accuracy on tasks where chain-of-thought alone managed 16%. Our tier system reflects this: Tier 1 tasks test single-domain capability, Tier 2 tasks require cross-step decomposition, and Tier 3 tasks demand the kind of initiative-scale coordination that ChatDev (Qian et al., 2023) demonstrated with role-specialized agents collaborating through structured phases.
Every task in the catalog includes:
- Benchmark version and catalog version
- Weighted acceptance criteria with decision-ready thresholds
- Execution constraints and contamination risk level
- Seed-data provenance and known failure modes
- A human baseline with full methodology and provenance
Human baselines are real
Human baselines are not placeholders. Each published task includes six required provenance fields:
- `methodology` — how the baseline was collected
- `provenance` — where the data came from
- `sourceSummary` — what the human actually produced
- `sampleSize` — how many data points
- `collectedAt` — when it was gathered
- `operatorProfile` — who performed it
If any field is missing, the benchmark run is classified do-not-publish. We support four collection methods: expert_estimate, timed_human_run, historical_average, and hybrid.
The publication flow fails closed. Incomplete metadata blocks public release.
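The fail-closed check can be sketched in a few lines. This is illustrative, not the real runner: the six field names come from the list above, but `baselinePublishable` and the baseline object shape are assumptions.

```javascript
// Sketch of the fail-closed provenance gate. Field names mirror the
// methodology above; the function shape is hypothetical.
const REQUIRED_PROVENANCE_FIELDS = [
  "methodology", "provenance", "sourceSummary",
  "sampleSize", "collectedAt", "operatorProfile",
];

function baselinePublishable(baseline) {
  // Fail closed: any missing or empty field blocks public release.
  return REQUIRED_PROVENANCE_FIELDS.every(
    (field) => baseline[field] !== undefined && baseline[field] !== ""
  );
}
```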
What we measure
The public scorecard now tracks thirteen metrics: the core outcome metrics plus explicit overhead, stability, and loss accounting.
- Flow Multiplier — `human_time / orgx_time` (the headline number)
- Quality delta — quality score vs. human baseline
- Autonomous completion rate — % of tasks completed without human intervention
- Benchmark completeness — coverage across the full task catalog
- Time to first artifact — how fast the first deliverable appears
- Time to decision-ready — when the output crosses the quality threshold
- Initiative completion time — total wall-clock time
- Cost efficiency — quality-per-dollar relative to `single_agent`
- Pairwise win rate — blinded A/B comparisons, counting draws as half-wins
- Coordination token ratio — `coordination_tokens / execution_tokens`
- Topology token ratio — total OrgX tokens divided by single-agent tokens
- Stable task coverage — share of repeated tasks eligible for the headline flow metric
- Orchestration loss rate — share of tasks where orchestration lost on quality, speed, stability, or token cost
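Several of these metrics reduce to simple ratios. A minimal sketch, under assumed inputs (the real scorecard schema is not shown here):

```javascript
// Illustrative computation of a few scorecard metrics from the list above.
// Function names and argument shapes are assumptions, not the real schema.

// Flow Multiplier: human_time / orgx_time.
const flowMultiplier = (humanMinutes, orgxMinutes) => humanMinutes / orgxMinutes;

// Pairwise win rate: draws count as half-wins.
function pairwiseWinRate(outcomes) {
  // outcomes: array of "win" | "draw" | "loss" for orgx_orchestrated
  const score = outcomes.reduce(
    (sum, o) => sum + (o === "win" ? 1 : o === "draw" ? 0.5 : 0), 0);
  return score / outcomes.length;
}

// Token ratios from the split token accounting.
const coordinationTokenRatio = (coordinationTokens, executionTokens) =>
  coordinationTokens / executionTokens;
const topologyTokenRatio = (orgxTotalTokens, singleAgentTokens) =>
  orgxTotalTokens / singleAgentTokens;
```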
This multi-metric approach follows the principle established by HELM (Liang et al., 2022): evaluating across a slate of metrics rather than optimizing a single number. HELM showed that prior benchmarks covered only 17.9% of core evaluation scenarios — a single headline metric hides more than it reveals. Thomas & Uminsky (2022) formalized this further, demonstrating that unthinking metric optimization causes real-world harms and proposing that benchmarks use a slate of metrics with external audits.
Repeats and confidence
We do not treat a single benchmark run as proof.
The public weekly suite runs Tier 1 and Tier 2 tasks 3 times each. Benchmark Lab starter runs can be cheaper, but those runs are treated as exploratory and can carry stability_unmeasured caveats rather than feeding the public headline number directly.
We compute a coefficient of variation (CV) per repeated task and classify it as stable, moderate, or unstable. Tasks with CV > 0.30 stay visible in the weekly report, but they do not contribute to the headline flow multiplier. Stable-task coverage is reported explicitly so readers can see how much of the suite contributed to the headline metric.
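The stability check can be sketched as follows. The 0.30 unstable cutoff is stated above; the 0.15 stable/moderate boundary is an assumption for illustration, since the text does not disclose it.

```javascript
// Sketch of the repeat-stability classification. The 0.30 unstable
// threshold comes from the methodology; 0.15 is an assumed boundary.
function coefficientOfVariation(samples) {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / samples.length;
  return Math.sqrt(variance) / mean;
}

function classifyStability(samples) {
  const cv = coefficientOfVariation(samples);
  if (cv > 0.3) return "unstable";   // excluded from the headline flow multiplier
  if (cv > 0.15) return "moderate";  // assumed boundary, not from the text
  return "stable";
}
```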
Weekly posts report confidence intervals when the scorecard has enough support to compute them. No hand-waving. This follows the reproducibility standards established by Pineau et al. (2021) in their NeurIPS reproducibility program, which demonstrated that requiring code submission and reproducibility checklists raised the rate of reproducible papers significantly.
How scoring works
Outputs are scored against explicit acceptance criteria — not vibes.
Each criterion has an evaluator type, a weight, a pass/fail result, and a criterion score. The scoring layer combines deterministic checks where possible, LLM-judge scoring for qualitative criteria, a weighted overall quality score, and completeness derived from criterion coverage.
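The combine step described above can be sketched like this. The criterion shape (`weight`, `score`, `passed`) is an assumption made for illustration:

```javascript
// Minimal sketch of the weighted scoring combine. Criterion shape is
// hypothetical: { weight, score in [0,1], passed }.
function combineScores(criteria) {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const quality =
    criteria.reduce((sum, c) => sum + c.weight * c.score, 0) / totalWeight;
  // Completeness derived from criterion coverage: share of criteria passed.
  const completeness =
    criteria.filter((c) => c.passed).length / criteria.length;
  return { quality, completeness };
}
```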
We now report both absolute and relative judgment:
- Absolute scoring produces the weighted quality score and completeness.
- Pairwise scoring presents `orgx_orchestrated` and `single_agent` side by side in blinded, randomized order and asks the judge to pick a winner or a draw on the same benchmark criterion.
For LLM-judge scoring, we follow Zheng et al. (2023), who showed that strong LLM judges achieve over 80% agreement with human preferences — matching inter-human agreement levels — while also documenting systematic biases (verbosity, position, self-enhancement) that evaluators must account for.
Decision-ready timing uses the explicit decisionReadyCriteriaIds set for a task when it exists, and falls back to the full acceptance criteria set otherwise.
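That fallback is a small selection rule. A sketch, assuming a hypothetical task shape around the `decisionReadyCriteriaIds` field named above:

```javascript
// Sketch of the decision-ready criteria selection. Only the
// decisionReadyCriteriaIds field name comes from the catalog; the rest
// of the task shape is assumed.
function decisionReadyCriteria(task) {
  const ids = task.decisionReadyCriteriaIds;
  if (ids && ids.length > 0) {
    return task.acceptanceCriteria.filter((c) => ids.includes(c.id));
  }
  // Fall back to the full acceptance criteria set.
  return task.acceptanceCriteria;
}
```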
Orchestration transparency
orgx_orchestrated is measured as a multi-phase execution, not a parallel bag of calls. For every orchestrated result we record: model, provider, runtime, phase summaries, agent topology, and split token usage.
That token split matters. The benchmark now distinguishes:
- Execution tokens — specialist work that directly advances the task
- Coordination tokens — orchestration, routing, and inter-phase overhead
Public scorecards report both coordination_token_ratio and topology_token_ratio so readers can evaluate whether orchestration quality gains justify coordination cost.
The benchmark runner may use a benchmark-safe orchestration path, but must disclose the topology. Public bundles include this metadata so readers can inspect the execution pattern behind every result.
This transparency is informed by AutoGen (Wu et al., 2023), which showed that the structure of multi-agent conversation — not just the capability of individual agents — determines outcome quality. Different agent topologies produce different results on the same task. Without disclosing topology, benchmark results are not reproducible.
Contamination controls
Every public benchmark task discloses seed-data provenance, contamination risk (low, medium, or high), and a normalized contaminationScore in the task catalog and weekly reporting bundle. Deng et al. (2024) demonstrated that contamination in widely-used benchmarks like MMLU and GSM8K can inflate performance by inducing memorization rather than genuine generalization. Our contamination disclosure does not eliminate the risk, but it makes the risk legible instead of burying it in footnotes.
Failures are part of the benchmark
Failure cases are not hidden or pushed below the fold. Weekly benchmark briefs now lead with a dedicated "Where orchestration lost" section before the scorecard. This section calls out:
- tasks where `single_agent` beat `orgx_orchestrated` on quality
- tasks where orchestration was slower
- unstable repeated tasks
- tasks where token overhead rose without a material quality gain
We consider failure analysis part of the benchmark, not an appendix. That choice is informed both by HELM (Liang et al., 2022) and by newer evidence that multi-agent systems can underperform their strongest individual expert when coordination drags performance down.
Publication gates
Every weekly run ends in one of three labels:
- `publish-ready`
- `publish-with-caveats`
- `do-not-publish`
A weekly benchmark is publish-ready only when all of these are true:
- Orchestrated benchmark results are present
- Required public scorecard metrics are present
- Benchmark completeness exceeds the publication floor
- Repeat count meets the minimum threshold
- Task count meets the minimum threshold
- Task metadata and human-baseline provenance are complete
- All seven domains are represented
publish-with-caveats is used when the benchmark is complete enough to publish but still carries material warnings, such as:
- unstable task share above 30%
- any task with `contaminationScore > 0.5`
- fewer than 2 domains with orchestrated quality wins
If hard requirements fail, the internal report is still generated and the run is still recorded — but the week is labeled do-not-publish.
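Put together, the gating logic reads as a straightforward classifier. This is a sketch under assumed input names, not the real gate code; the thresholds come from the lists above.

```javascript
// Illustrative classifier for the three publication labels. Input field
// names are hypothetical; the gates and caveat thresholds mirror the text.
function classifyWeek(week) {
  const hardGates = [
    week.hasOrchestratedResults,
    week.hasRequiredMetrics,
    week.completeness >= week.publicationFloor,
    week.repeatCount >= week.minRepeats,
    week.taskCount >= week.minTasks,
    week.provenanceComplete,
    week.domainsCovered === 7,
  ];
  // Hard failures still produce an internal report, but the week
  // is labeled do-not-publish.
  if (!hardGates.every(Boolean)) return "do-not-publish";

  const caveats =
    week.unstableTaskShare > 0.3 ||
    week.maxContaminationScore > 0.5 ||
    week.domainsWithQualityWins < 2;
  return caveats ? "publish-with-caveats" : "publish-ready";
}
```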
Public proof artifacts
Every published benchmark week exposes:
- A weekly blog post with analysis
- A benchmark dataset page at `/benchmarks/<week>`
- A public raw-data bundle (whitelisted fields only)
- A link to the public benchmark repo
The bundle never includes raw private run transcripts, file-system paths, or internal identifiers outside the public contract.
Verify it yourself
The public verification repo: useorgx/autonomous-initiative-benchmark
Once a weekly bundle is published, validate it directly:
```shell
node runner/validate-bundle.mjs results/<week>
node runner/recompute-scorecard.mjs results/<week>
```
The goal is to let any outside reader inspect the data contract, recompute the scorecard, and understand exactly what was measured.
How this relates to other benchmarks
This is not SWE-bench (Jimenez et al., 2023) or GAIA (Mialon et al., 2023). SWE-bench measures whether models can resolve real GitHub issues — execution-based evaluation against repository test suites. GAIA measures general assistant capability on tasks requiring multi-step reasoning, web browsing, and tool use (humans score 92%, GPT-4 with plugins scores 15%).
This benchmark measures something narrower and more specific: whether orchestrating specialist agents across 7 business domains produces better initiative-shaped work than a single strong agent — faster, and at quality. The orchestration hypothesis draws from a long line of research on task decomposition and cognitive specialization, from Simon's hierarchical decomposition principle (1969) to Kitcher's division of cognitive labor (1990) to Malone & Crowston's coordination theory (1994), which showed that coordination can be understood as managing dependencies between activities — exactly what OrgX's orchestrator does across agent domains.
What a good weekly brief answers
Every weekly benchmark post should make it easy to answer:
- What was measured, and against what baseline?
- How many tasks, which domains, how many repeats?
- What assumptions and quality bar were applied?
- Where is the raw data bundle?
- How do I run this benchmark myself?
If a weekly post can't answer those questions, it's not a strong proof artifact.
Limitations
This is a controlled public benchmark, not customer-average telemetry. Human baselines are curated and versioned, not universal truths. Domain coverage improves over time — interpret each benchmark version in the context of its task catalog.
A benchmark can prove disciplined progress. It cannot prove every real-world outcome. That's why we publish every week, lead with the losses, and let you run it yourself.
References
The methodology behind this benchmark draws on the following research:
- Bai et al. Constitutional AI: Harmlessness from AI Feedback. 2022.
- Deng et al. Investigating Data Contamination in Modern Benchmarks for Large Language Models. NAACL 2024.
- Du et al. Improving Factuality and Reasoning in Language Models through Multiagent Debate. ICML 2024.
- Hong et al. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. ICLR 2024.
- Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. ICLR 2024.
- Kim et al. Towards a Science of Scaling Agent Systems. 2025.
- Kitcher. The Division of Cognitive Labor. The Journal of Philosophy, 1990.
- Liang et al. Holistic Evaluation of Language Models. TMLR 2023.
- Malone & Crowston. The Interdisciplinary Study of Coordination. ACM Computing Surveys, 1994.
- Mialon et al. GAIA: A Benchmark for General AI Assistants. ICLR 2024.
- Ouyang et al. Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022.
- Pappu et al. Multi-Agent Teams Hold Experts Back. 2025.
- Pineau et al. Improving Reproducibility in Machine Learning Research. JMLR 2021.
- Qian et al. ChatDev: Communicative Agents for Software Development. ACL 2024.
- Simon. The Sciences of the Artificial. MIT Press, 1969.
- Thomas & Uminsky. Reliance on Metrics is a Fundamental Challenge for AI. Patterns, 2022.
- Wu et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. 2023.
- Yao et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023.
- Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
- Zhou et al. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. ICLR 2023.