How We Prove OrgX Works

12 tasks. 7 domains. 3 execution modes. Every week, we run real initiative work through OrgX orchestration, a single AI agent, and a human baseline — then publish the results with full provenance. Here's exactly how.

The question we're trying to answer

When a real business task — drafting a product spec, planning a launch campaign, analyzing churn, coordinating across engineering and marketing — gets executed as an OrgX initiative, does multi-agent orchestration actually beat a single strong AI agent?

Not theoretically. Measurably.

Does OrgX orchestration improve time-to-decision while keeping the quality bar intact — across engineering, product, marketing, design, sales, ops, and cross-functional work?

That's what this benchmark exists to answer. Every week, we run the same suite of initiative-shaped tasks through three execution modes, score the outputs against explicit acceptance criteria, and publish everything — wins, losses, and regressions.

Try it yourself

The fastest way to evaluate OrgX is to run the benchmark inside the product.

  1. Sign up for OrgX or open Benchmark Lab if you already have access.
  2. Choose the Starter benchmark for a quick all-domain pass, or the Full benchmark for publication-grade coverage.
  3. Inspect your scorecard, surfaced artifacts, and bundle links on the run detail page.
  4. Compare your results against the public corpus in the benchmark hub.

The public GitHub repo lets you inspect the methodology and published proof bundles. The product is where you run it.

When you run it in OrgX, the benchmark should feel like normal OrgX usage:

  • benchmark tasks launch as ordinary initiatives with benchmark metadata attached
  • single-domain tasks launch domain-specific initiatives, and cross-functional tasks launch multi-domain initiatives
  • the same execution contract is reused across Agent, API, CLI, and E2B-backed execution surfaces
  • benchmark workspaces default to the highest autonomy level the platform policy allows
  • if a human approval or decision is required, the run is preserved but marked non-autonomous
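The metadata layering described above can be pictured with a small sketch. Every field name here is hypothetical, chosen only to illustrate the idea of a normal initiative payload with benchmark metadata attached on top; the real OrgX API shape may differ.

```javascript
// Illustrative shape only — not the real OrgX initiative schema.
// A benchmark task launches as an ordinary initiative; the `benchmark`
// object is the metadata layered on top of the standard payload.
const benchmarkInitiative = {
  title: "Draft churn-analysis brief",
  domain: "product",
  workstreams: ["analysis", "synthesis"],
  benchmark: {
    taskId: "product/churn-brief",      // hypothetical catalog identifier
    tier: 1,
    mode: "orgx_orchestrated",          // one of the three execution modes
    repeatIndex: 1,
    autonomyLevel: "max-allowed-by-policy",
  },
};
```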

Three modes, head to head

Every benchmark task is evaluated in three modes:

  • orgx_orchestrated — multi-step OrgX execution: decomposition, specialist phases, coordinator synthesis
  • single_agent — a strong single model (Claude, GPT) doing all the work directly, no orchestration
  • human_baseline — curated human reference with explicit provenance and methodology

We treat single_agent as the closest analogue to "just use Claude, ChatGPT, or Cursor directly." If OrgX can't beat that bar, we don't publish.

This three-mode design is grounded in how the field evaluates multi-agent systems. Du et al. (2023) showed that multiple LLM instances debating over rounds significantly improve reasoning and factual accuracy — even when individual agents initially give wrong answers. The key insight: coordination structure matters more than raw model capability. MetaGPT (Hong et al., 2023) demonstrated that encoding real-world Standard Operating Procedures into multi-agent workflows reduces hallucinations and produces structured deliverables that unstructured single-agent runs cannot match.

But multi-agent isn't always better. Kim et al. (2025) derived scaling laws across 180 agent configurations and found that multi-agent architectures incur 1.6x to 6.2x token overhead at matched performance — a direct parallel to Brooks's Law. Our benchmark exists precisely to measure whether OrgX's orchestration topology justifies that coordination cost for initiative-shaped work.

Benchmark architecture

The system has three layers, each with a distinct job:

Layer 1: Private OrgX repo — runs the benchmark

  • Task catalog and benchmark runners
  • Scoring, comparison, and weekly report generation
  • Publishability classification and gating

Layer 2: Public repo — proves the benchmark

  • Full methodology (this page)
  • Public task catalog and bundle schemas
  • Whitelisted weekly result bundles
  • Lightweight validation tools anyone can run

Layer 3: OrgX product — lets anyone test it

  • Benchmark Lab with self-serve runs
  • Scorecards, surfaced artifacts, and comparison
  • No setup required beyond signing in

The private repo runs it. The public repo proves it. The product lets you verify it yourself.

Where benchmarks run

There are two execution surfaces, each optimized for different goals:

Weekly public benchmark runs from the private OrgX harness with controlled runners and publication gating. This produces the stable, reviewable public scorecard and bundle.

Self-serve Benchmark Lab lets users run the same benchmark logic inside OrgX. These runs produce scorecards and artifacts optimized for evaluation and proof, but aren't published directly to the public corpus unless they meet the stricter weekly publication rules. The benchmark path is designed to mirror the real product path: standard initiatives, standard workstreams, standard auto-continue, and standard approval/blocker handling, with benchmark metadata layered on top.

That distinction is deliberate. The public benchmark needs to be stable and auditable. Benchmark Lab needs to be fast and accessible.

Task catalog

The benchmark catalog is versioned in-repo and grouped by tier and domain.

  • Tier 1 — Short, decision-ready tasks with fast iteration loops (3 repeats default)
  • Tier 2 — Medium-complexity tasks requiring decomposition (3 repeats default for the public weekly suite)
  • Tier 3 — Deeper initiative-scale tasks, used sparingly

The current suite spans 12 tasks across all 7 OrgX agent domains: engineering, product, marketing, sales, design, ops, and cross-functional initiative work.

If even one domain is missing from a weekly run, the entire week is classified do-not-publish.

The task decomposition approach follows Zhou et al. (2022), who showed that breaking complex problems into simpler sub-problems solved sequentially enables generalization to harder problems — achieving 99%+ accuracy on tasks where chain-of-thought alone managed 16%. Our tier system reflects this: Tier 1 tasks test single-domain capability, Tier 2 tasks require cross-step decomposition, and Tier 3 tasks demand the kind of initiative-scale coordination that ChatDev (Qian et al., 2023) demonstrated with role-specialized agents collaborating through structured phases.

Every task in the catalog includes:

  • Benchmark version and catalog version
  • Weighted acceptance criteria with decision-ready thresholds
  • Execution constraints and contamination risk level
  • Seed-data provenance and known failure modes
  • A human baseline with full methodology and provenance

Human baselines are real

Human baselines are not placeholders. Each published task includes six required provenance fields:

  • methodology — how the baseline was collected
  • provenance — where the data came from
  • sourceSummary — what the human actually produced
  • sampleSize — how many data points
  • collectedAt — when it was gathered
  • operatorProfile — who performed it

If any field is missing, the benchmark run is classified do-not-publish. We support four collection methods: expert_estimate, timed_human_run, historical_average, and hybrid.

The publication flow fails closed. Incomplete metadata blocks public release.
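A fail-closed check of this kind can be sketched as follows. This is an illustration of the rule, not the actual OrgX runner code; the field names come from the list above, but the validation structure is assumed.

```javascript
// Sketch of a fail-closed provenance gate (not the actual runner code).
// The six required fields mirror the provenance list above.
const REQUIRED_FIELDS = [
  "methodology",
  "provenance",
  "sourceSummary",
  "sampleSize",
  "collectedAt",
  "operatorProfile",
];

// Returns "ok" only when every field is present and non-empty;
// anything incomplete is classified do-not-publish.
function classifyBaseline(baseline) {
  const missing = REQUIRED_FIELDS.filter(
    (field) =>
      baseline[field] === undefined ||
      baseline[field] === null ||
      baseline[field] === ""
  );
  return missing.length === 0 ? "ok" : "do-not-publish";
}
```

The point of the shape is that the default outcome is refusal: the function never has to enumerate reasons to block, only reasons to pass.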

What we measure

The public scorecard now tracks thirteen metrics: the core outcome metrics plus explicit overhead, stability, and loss accounting.

  • Flow multiplier — human_time / orgx_time (the headline number)
  • Quality delta — quality score vs. human baseline
  • Autonomous completion rate — % of tasks completed without human intervention
  • Benchmark completeness — coverage across the full task catalog
  • Time to first artifact — how fast the first deliverable appears
  • Time to decision-ready — when the output crosses the quality threshold
  • Initiative completion time — total wall-clock time
  • Cost efficiency — quality-per-dollar relative to single_agent
  • Pairwise win rate — blinded A/B comparisons, counting draws as half-wins
  • Coordination token ratio — coordination_tokens / execution_tokens
  • Topology token ratio — total OrgX tokens divided by single-agent tokens
  • Stable task coverage — share of repeated tasks eligible for the headline flow metric
  • Orchestration loss rate — share of tasks where orchestration lost on quality, speed, stability, or token cost
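The three ratio metrics above are straightforward arithmetic over a run record. A minimal sketch, with illustrative field names (the real bundle schema may name these differently):

```javascript
// Sketch of the core ratio metrics (illustrative field names,
// not the published bundle schema).
function computeRatios(run) {
  return {
    // Flow multiplier: human baseline time over OrgX time (headline number).
    flowMultiplier: run.humanTimeSeconds / run.orgxTimeSeconds,
    // Coordination overhead relative to useful specialist work.
    coordinationTokenRatio: run.coordinationTokens / run.executionTokens,
    // Total OrgX tokens relative to a single-agent run of the same task.
    topologyTokenRatio:
      (run.coordinationTokens + run.executionTokens) / run.singleAgentTokens,
  };
}
```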

This multi-metric approach follows the principle established by HELM (Liang et al., 2022): evaluating across a slate of metrics rather than optimizing a single number. HELM showed that prior benchmarks covered only 17.9% of core evaluation scenarios — a single headline metric hides more than it reveals. Thomas & Uminsky (2022) formalized this further, demonstrating that unthinking metric optimization causes real-world harms and proposing that benchmarks use a slate of metrics with external audits.

Repeats and confidence

We do not treat a single benchmark run as proof.

The public weekly suite runs Tier 1 and Tier 2 tasks 3 times each. Benchmark Lab starter runs can be cheaper, but those runs are treated as exploratory and can carry stability_unmeasured caveats rather than feeding the public headline number directly.

We compute a coefficient of variation (CV) per repeated task and classify it as stable, moderate, or unstable. Tasks with CV > 0.30 stay visible in the weekly report, but they do not contribute to the headline flow multiplier. Stable-task coverage is reported explicitly so readers can see how much of the suite contributed to the headline metric.
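The CV computation and stability bands can be sketched like this. The CV > 0.30 cutoff is the one documented above; the stable/moderate boundary shown here is an illustrative assumption, not a published threshold.

```javascript
// Coefficient of variation over repeated runs of one task:
// standard deviation divided by the mean.
function coefficientOfVariation(samples) {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((acc, x) => acc + (x - mean) ** 2, 0) / samples.length;
  return Math.sqrt(variance) / mean;
}

// Sketch of the stability bands. Only the 0.30 cutoff is documented;
// the 0.15 boundary for "moderate" is an assumed placeholder.
function classifyStability(samples) {
  const cv = coefficientOfVariation(samples);
  if (cv > 0.3) return "unstable"; // reported, but excluded from the headline flow multiplier
  if (cv > 0.15) return "moderate";
  return "stable";
}
```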

Weekly posts report confidence intervals when the scorecard has enough support to compute them. No hand-waving. This follows the reproducibility standards established by Pineau et al. (2021) in their NeurIPS reproducibility program, which demonstrated that requiring code submission and reproducibility checklists raised the rate of reproducible papers significantly.

How scoring works

Outputs are scored against explicit acceptance criteria — not vibes.

Each criterion has an evaluator type, a weight, a pass/fail result, and a criterion score. The scoring layer combines deterministic checks where possible, LLM-judge scoring for qualitative criteria, a weighted overall quality score, and completeness derived from criterion coverage.
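Combining per-criterion results into an overall quality score and a completeness figure can be sketched as a weighted average. Field names here are illustrative, not the OrgX scoring schema.

```javascript
// Sketch of weighted quality scoring over acceptance criteria.
// Each criterion: { weight, score, passed } with score in [0, 1].
// Field names are illustrative, not the real schema.
function overallQuality(criteria) {
  const totalWeight = criteria.reduce((acc, c) => acc + c.weight, 0);
  const weighted = criteria.reduce((acc, c) => acc + c.weight * c.score, 0);
  return weighted / totalWeight;
}

// Completeness derived from criterion coverage: share of criteria passed.
function completeness(criteria) {
  return criteria.filter((c) => c.passed).length / criteria.length;
}
```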

We now report both absolute and relative judgment:

  • Absolute scoring produces the weighted quality score and completeness.
  • Pairwise scoring presents orgx_orchestrated and single_agent side by side in blinded, randomized order and asks the judge to pick a winner or a draw on the same benchmark criterion.
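The pairwise accounting, with draws counted as half-wins, reduces to a few lines. A minimal sketch, assuming the blinding and order randomization have already happened upstream and each judgment has been mapped back to an unblinded verdict:

```javascript
// Sketch of pairwise win-rate accounting (draws count as half-wins).
// `judgments` holds unblinded verdicts: "orgx" | "single_agent" | "draw".
function pairwiseWinRate(judgments) {
  const score = judgments.reduce((acc, verdict) => {
    if (verdict === "orgx") return acc + 1;
    if (verdict === "draw") return acc + 0.5;
    return acc; // single_agent win contributes nothing
  }, 0);
  return score / judgments.length;
}
```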

For LLM-judge scoring, we follow Zheng et al. (2023), who showed that strong LLM judges achieve over 80% agreement with human preferences — matching inter-human agreement levels — while also documenting systematic biases (verbosity, position, self-enhancement) that evaluators must account for.

Decision-ready timing uses the explicit decisionReadyCriteriaIds set for a task when it exists, and falls back to the full acceptance criteria set otherwise.

Orchestration transparency

orgx_orchestrated is measured as a multi-phase execution, not a parallel bag of calls. For every orchestrated result we record: model, provider, runtime, phase summaries, agent topology, and split token usage.

That token split matters. The benchmark now distinguishes:

  • Execution tokens — specialist work that directly advances the task
  • Coordination tokens — orchestration, routing, and inter-phase overhead

Public scorecards report both coordination_token_ratio and topology_token_ratio so readers can evaluate whether orchestration quality gains justify coordination cost.

The benchmark runner may use a benchmark-safe orchestration path, but must disclose the topology. Public bundles include this metadata so readers can inspect the execution pattern behind every result.

This transparency is informed by AutoGen (Wu et al., 2023), which showed that the structure of multi-agent conversation — not just the capability of individual agents — determines outcome quality. Different agent topologies produce different results on the same task. Without disclosing topology, benchmark results are not reproducible.

Contamination controls

Every public benchmark task discloses seed-data provenance, contamination risk (low, medium, or high), and a normalized contaminationScore in the task catalog and weekly reporting bundle. Deng et al. (2024) demonstrated that contamination in widely-used benchmarks like MMLU and GSM8K can inflate performance by inducing memorization rather than genuine generalization. Our contamination disclosure does not eliminate the risk, but it makes the risk legible instead of burying it in footnotes.

Failures are part of the benchmark

Failure cases are not hidden or pushed below the fold. Weekly benchmark briefs now lead with a dedicated Where orchestration lost section before the scorecard. This section calls out:

  • tasks where single_agent beat orgx_orchestrated on quality
  • tasks where orchestration was slower
  • unstable repeated tasks
  • tasks where token overhead rose without a material quality gain

We consider failure analysis part of the benchmark, not an appendix. That choice is informed both by HELM (Liang et al., 2022) and by newer evidence that multi-agent systems can underperform their strongest individual expert when coordination drags performance down.

Publication gates

Every weekly run ends in one of three labels:

  • publish-ready
  • publish-with-caveats
  • do-not-publish

A weekly benchmark is publish-ready only when all of these are true:

  • Orchestrated benchmark results are present
  • Required public scorecard metrics are present
  • Benchmark completeness exceeds the publication floor
  • Repeat count meets the minimum threshold
  • Task count meets the minimum threshold
  • Task metadata and human-baseline provenance are complete
  • All seven domains are represented

publish-with-caveats is used when the benchmark is complete enough to publish but still carries material warnings, such as:

  • unstable task share above 30%
  • any task with contaminationScore > 0.5
  • fewer than 2 domains with orchestrated quality wins

If hard requirements fail, the internal report is still generated and the run is still recorded — but the week is labeled do-not-publish.
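The three-way gate can be sketched as two checks in sequence: hard requirements first, then caveat triggers. Field names are illustrative; the caveat thresholds are the ones listed above.

```javascript
// Sketch of the publication gate (illustrative field names).
// Hard requirements fail closed; caveats only downgrade a passing week.
function classifyWeek(week) {
  const hardRequirementsMet =
    week.hasOrchestratedResults &&
    week.hasRequiredMetrics &&
    week.completeness >= week.publicationFloor &&
    week.repeatCount >= week.minRepeats &&
    week.taskCount >= week.minTasks &&
    week.provenanceComplete &&
    week.domainsCovered === 7; // every domain must be represented

  if (!hardRequirementsMet) return "do-not-publish";

  const hasCaveats =
    week.unstableTaskShare > 0.3 ||
    week.maxContaminationScore > 0.5 ||
    week.domainsWithQualityWins < 2;

  return hasCaveats ? "publish-with-caveats" : "publish-ready";
}
```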

Public proof artifacts

Every published benchmark week exposes:

  • A weekly blog post with analysis
  • A benchmark dataset page at /benchmarks/<week>
  • A public raw-data bundle (whitelisted fields only)
  • A link to the public benchmark repo

The bundle never includes raw private run transcripts, file-system paths, or internal identifiers outside the public contract.

Verify it yourself

The public verification repo: useorgx/autonomous-initiative-benchmark

Once a weekly bundle is published, validate it directly:

node runner/validate-bundle.mjs results/<week>
node runner/recompute-scorecard.mjs results/<week>

The goal is to let any outside reader inspect the data contract, recompute the scorecard, and understand exactly what was measured.

How this relates to other benchmarks

This is not SWE-bench (Jimenez et al., 2023) or GAIA (Mialon et al., 2023). SWE-bench measures whether models can resolve real GitHub issues — execution-based evaluation against repository test suites. GAIA measures general assistant capability on tasks requiring multi-step reasoning, web browsing, and tool use (humans score 92%, GPT-4 with plugins scores 15%).

This benchmark measures something narrower and more specific: whether orchestrating specialist agents across 7 business domains produces better initiative-shaped work than a single strong agent — faster, and at quality. The orchestration hypothesis draws from a long line of research on task decomposition and cognitive specialization, from Simon's hierarchical decomposition principle (1969) to Kitcher's division of cognitive labor (1990) to Malone & Crowston's coordination theory (1994), which showed that coordination can be understood as managing dependencies between activities — exactly what OrgX's orchestrator does across agent domains.

What a good weekly brief answers

Every weekly benchmark post should make it easy to answer:

  • What was measured, and against what baseline?
  • How many tasks, which domains, how many repeats?
  • What assumptions and quality bar were applied?
  • Where is the raw data bundle?
  • How do I run this benchmark myself?

If a weekly post can't answer those questions, it's not a strong proof artifact.

Limitations

This is a controlled public benchmark, not customer-average telemetry. Human baselines are curated and versioned, not universal truths. Domain coverage improves over time — interpret each benchmark version in the context of its task catalog.

A benchmark can prove disciplined progress. It cannot prove every real-world outcome. That's why we publish every week, lead with the losses, and let you run it yourself.

References

The methodology behind this benchmark draws on the following research: