Phase 1 of our substrate matrix answered a narrow question: "Do different models and runtimes produce different results on the same task?" Yes, obviously. Cost varies. Quality varies. That was useful, and it disproved the weak version of our pitch — that OrgX is a system-prompt wrapper.
But it also structurally hid the stronger version. Single-task benchmarks cannot show what agents actually cannot fake: cascading context. Phase 2 is the matrix that tests the real thesis.
We completed 136 of 144 planned task runs across 4 models × 4 orchestration cells × 3 dependent task sequences of 3 tasks each (product-flow, incident-flow, gtm-flow), spent $3.79, and came away with one very stubborn pattern: memory is the structural lift, skills are the polish on top.
The design
Every benchmark task in Phase 2 lives inside a 3-task sequence where task N+1 materially depends on task N's output:
- product-flow: PRD → engineering plan → launch thread. The launch thread has to reference the "decision in one session" moment from the PRD and cannot promise scope items the PRD excluded.
- incident-flow: escalation playbook → postmortem → runbook. Every prevention step in the runbook must trace to a root-cause finding from the postmortem.
- gtm-flow: launch brief → social pack → outreach sequence. The outreach emails must reference the exact positioning claim from the brief.
Each sequence runs through four cells — combinations of orchestration features that isolate what each adds:
| Cell | State-summary handoff | Verifier loop | OrgX system prompt (real SKILL.md) |
|---|---|---|---|
| bare | — | — | — |
| memory | ✓ | — | — |
| verifier | — | ✓ | — |
| orgx_full | ✓ | ✓ | ✓ |
A sequence-aware judge scores each task on (a) rubric quality and (b) consistency with prior outputs in the same sequence. Contradictions are flagged as structured feedback.
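To make the cell definitions concrete, here is a minimal sketch of the four cells as feature flags, plus a loop that shows where a state-summary handoff would enter the prompt. The `Cell` class, `build_prompts`, and the `generate` / `summarize` callables are hypothetical names for illustration, not the benchmark's actual code, and the prompt wording is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Cell:
    name: str
    state_summary_handoff: bool
    verifier_loop: bool
    orgx_system_prompt: bool


# Mirrors the cell table: which orchestration features each cell turns on.
CELLS = [
    Cell("bare", False, False, False),
    Cell("memory", True, False, False),
    Cell("verifier", False, True, False),
    Cell("orgx_full", True, True, True),
]


def build_prompts(
    cell: Cell,
    task_prompts: List[str],
    generate: Callable[[str], str],
    summarize: Callable[[str], str],
) -> List[str]:
    """Run a sequence in order and return the prompts actually sent.

    Only the handoff flag is modeled here; the verifier loop and the
    system prompt are left out to keep the sketch short.
    """
    sent: List[str] = []
    summary = ""
    for prompt in task_prompts:
        full = prompt
        if cell.state_summary_handoff and summary:
            # Assumed format: prepend the distilled state from earlier tasks.
            full = f"State summary from earlier tasks:\n{summary}\n\n{prompt}"
        sent.append(full)
        output = generate(full)
        summary = (summary + "\n" + summarize(output)).strip()
    return sent
```

In a bare cell, task 2 receives only its own prompt; in a memory cell, it also receives whatever `summarize` distilled from task 1. That difference is the whole experiment.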
Models tested: Claude Sonnet 4.6, Claude Haiku 4.5, GPT-4o, GPT-4o mini. (OpenRouter OSS models skipped — no API key set. Rerunning them separately.)
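For readers checking the arithmetic, the planned matrix size can be enumerated directly. This is a small sketch using the model, cell, and sequence names from this post; the variable names are illustrative only.

```python
from itertools import product

MODELS = ["Claude Sonnet 4.6", "Claude Haiku 4.5", "GPT-4o", "GPT-4o mini"]
CELL_NAMES = ["bare", "memory", "verifier", "orgx_full"]
SEQUENCES = {
    "product-flow": ["PRD", "engineering plan", "launch thread"],
    "incident-flow": ["escalation playbook", "postmortem", "runbook"],
    "gtm-flow": ["launch brief", "social pack", "outreach sequence"],
}

# One planned run per (model, cell, sequence, task) combination.
planned = [
    (model, cell, seq, task)
    for model, cell, seq in product(MODELS, CELL_NAMES, SEQUENCES)
    for task in SEQUENCES[seq]
]

completed = 136  # runs that finished, per this report
print(len(planned))                        # 144
print(f"{completed / len(planned):.0%}")   # 94%
```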
What the data shows
Averaged across all 4 models and all 3 sequences, mean judge-scored quality by cell × task position:
| Task position | bare | verifier | memory | orgx_full |
|---|---|---|---|---|
| Task 1 (cold start) | 7.9 | 8.0 | 7.9 | 7.9 |
| Task 2 (needs task 1) | 3.6 | 5.2 | 7.1 | 7.1 |
| Task 3 (cascade over 1+2) | 0.5 | 1.7 | 7.5 | 7.5 |
Three findings:
1) Task 1 is cell-invariant. Cold starts have nothing to compound against — the model sees the prompt and writes. Every cell produces roughly equivalent output. This is exactly what Phase 1 measured, and it's why Phase 1 found orchestration wasn't dramatically lifting single-task numbers. Turns out we were right — on that task shape, it doesn't.
2) Task 2 reveals who can fake context and who can't. When task 2 says "using the PRD above as authoritative context" but no PRD is in the conversation, weak models collapse (Haiku verifier plan = 1.5, 4o-mini = 1.5). Strong models hallucinate a plausible fake (Sonnet verifier plan = 7.2). Memory-handoff closes the gap deterministically — every model converges to ~7.1 when given a distilled state summary from task 1.
3) Task 3 cannot be faked. When a launch thread must reference a specific decision-moment from a PRD that doesn't exist in the conversation, even Sonnet scores 0.0. When a runbook must cite a specific root cause from a postmortem that isn't in context, same result. The cascade kills improvisation.
This is the gap that single-shot benchmarks structurally cannot show.
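The size of that gap can be read straight off the averages table. The sketch below only re-derives differences from the published means; the function names are illustrative.

```python
# Mean judge scores by cell, in task order (task 1, task 2, task 3),
# copied from the averages table above.
SCORES = {
    "bare": [7.9, 3.6, 0.5],
    "verifier": [8.0, 5.2, 1.7],
    "memory": [7.9, 7.1, 7.5],
    "orgx_full": [7.9, 7.1, 7.5],
}


def cascade_drop(cell: str) -> float:
    """Points lost between the cold start (task 1) and task 3."""
    first, _, third = SCORES[cell]
    return round(first - third, 1)


def lift_over_bare(cell: str) -> list:
    """Per-task score difference against the bare cell."""
    return [round(s - b, 1) for s, b in zip(SCORES[cell], SCORES["bare"])]


print(cascade_drop("bare"))      # 7.4
print(cascade_drop("memory"))    # 0.4
print(lift_over_bare("memory"))  # [0.0, 3.5, 7.0]
```

On these averages, memory adds nothing at task 1 and 7.0 points at task 3, which is why a single-task benchmark reports no lift.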
The surprises
Skills help weaker models more than stronger ones
On gtm-flow, orgx_full (real SKILL.md + memory + verifier) lifted GPT-4o mini dramatically:
- social post: 1.5 (bare) → 6.5 (orgx_full), +5.0
- outreach: 2.5 (bare) → 7.5 (orgx_full), +5.0
On Claude Sonnet, skills sometimes regressed on the same tasks:
- social post: 3.5 (memory) → 2.5 (orgx_full), −1.0
The rubric on the social-post task is strict about tweet-length constraints ("Twitter post ≤280 chars") and channel-sequencing ("Twitter first if the brief said so"). The longer, more elaborated outputs that the SKILL.md's "be specific, be thorough" prompting produces actually hurt on tight-rubric tasks for already-capable models. Weaker models benefit because the skill fills in structure they'd otherwise miss.
Takeaway: in this data, the value of orchestration runs inversely to base-model capability on the specific task shape.
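To illustrate why a tight rubric punishes elaboration, here is a minimal sketch of a length check like the "Twitter post ≤280 chars" criterion. It is an assumption about how such a check could work, not the judge's implementation, and `len()` only approximates how the platform counts characters (URLs and some Unicode are weighted differently).

```python
TWEET_LIMIT = 280  # from the rubric text quoted above


def within_tweet_limit(post: str, limit: int = TWEET_LIMIT) -> bool:
    """True if the post fits the limit, counted as Python characters."""
    return len(post) <= limit


def overflow(post: str, limit: int = TWEET_LIMIT) -> int:
    """How many characters would have to be cut to pass."""
    return max(0, len(post) - limit)
```

A check like this is binary: a thorough 400-character post fails just as hard as an empty one, so prompting that pushes toward longer output can cost points.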
Runbook quality ceiling is the rubric, not the model
Every model with memory converged to exactly 6.5/10 on incident-flow-runbook, regardless of base:
- Sonnet memory runbook: 6.5
- Haiku memory runbook: 6.5
- 4o-mini memory runbook: 6.5
- 4o memory runbook: 6.5
The rubric criterion that none of them cleared: "First-2-minute actions from the original playbook preserved verbatim." Every model paraphrased. That's a skill bug, not a model bug — none of the current agent skills teach verbatim-preservation as a pattern. One worked example would likely lift everyone to 8+.
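The verbatim criterion is easy to check mechanically, which is part of why it makes a good worked example. The sketch below assumes the playbook's first-2-minute actions are available as a list of strings; the function name and the sample text are made up for illustration.

```python
from typing import List


def missing_verbatim(actions: List[str], runbook: str) -> List[str]:
    """Return the playbook actions that do not appear word-for-word."""
    return [action for action in actions if action not in runbook]


# Hypothetical sample data, not taken from the benchmark artifacts.
playbook_actions = [
    "Page the on-call engineer.",
    "Freeze deploys to production.",
]
runbook = "1. Page the on-call engineer.\n2. Pause all production deploys."

print(missing_verbatim(playbook_actions, runbook))
# ['Freeze deploys to production.']
```

The second action is paraphrased rather than copied, so it is flagged even though the meaning is preserved.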
Verifier alone is fragile
A critique-then-rewrite loop without state-summary handoff barely helps on sequenced tasks. It lifts self-contained tasks (postmortem task-2 has the root cause embedded in the prompt, so Sonnet verifier scored 8.2, the same as orgx_full). It cannot rescue truly dependent tasks (Sonnet verifier launch thread = 0.0, same as bare).
The verifier reads the draft, notes gaps like "doesn't reference the PRD decision moment" — but the rewrite has the same missing-PRD problem as the original attempt. The verifier catches what the model didn't do; it can't give the model what the context didn't have.
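A sketch of a critique-then-rewrite loop makes the limitation visible. The `generate` and `critique` callables and the rewrite prompt wording are assumptions for illustration; the actual loop may differ.

```python
from typing import Callable, List, Tuple


def verifier_loop(
    task_prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    max_rounds: int = 1,
) -> Tuple[str, List[str]]:
    """Draft, critique, rewrite. Returns the final draft and every prompt sent."""
    prompts = [task_prompt]
    draft = generate(task_prompt)
    for _ in range(max_rounds):
        feedback = critique(task_prompt, draft)
        if not feedback:
            break  # nothing to fix
        # The rewrite sees the task, the draft, and the critique.
        # It does not see any earlier artifact in the sequence.
        rewrite_prompt = (
            f"{task_prompt}\n\nDraft:\n{draft}\n\n"
            f"Critique:\n{feedback}\n\nRewrite the draft to address the critique."
        )
        prompts.append(rewrite_prompt)
        draft = generate(rewrite_prompt)
    return draft, prompts
```

Every prompt in the loop is built from the task prompt, the draft, and the critique. If the PRD was never in any of those, no number of rounds puts it there.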
What we changed in OrgX
Real SKILL.md loading in the benchmark. Phase 1's ORGX_SKILL_PROMPT was an 8-line paraphrase of the actual 451-line product-agent skill. Phase 2 loads the real skill bodies plus their examples/*.md corpora as the system prompt for orgx_full cells. If we're going to make claims about what our skills do, we have to be measuring what our skills actually are.
Hash-pinning every run. Every orgx_full run now records:
skill_pack_hash: c46ab235ec6fa390
skill_versions:
  product-agent: 2.0.0
  engineering-agent: 2.0.0
  marketing-agent: 2.0.0
  operations-agent: 2.0.0
When we edit a skill, the hash changes, and future runs become A/B comparable against historical ones. This was the structural gap we found in our own tooling during this audit — RuntimeContract.skill_pack_hash had existed as a field for months but was never being populated by any benchmark.
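This post does not spell out how the hash is computed, so the following is only a sketch of one way a content hash over a skill pack could work: SHA-256 over sorted file paths and contents, truncated to 16 hex characters to match the length of the recorded value. The function name and the path layout are hypothetical.

```python
import hashlib
from typing import Dict


def skill_pack_hash(skill_files: Dict[str, str], length: int = 16) -> str:
    """Hash a mapping of relative path -> file text into a short hex id.

    Assumed scheme, not the real one: paths are sorted so the result
    does not depend on directory listing order.
    """
    digest = hashlib.sha256()
    for path in sorted(skill_files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so path and body cannot blur together
        digest.update(skill_files[path].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()[:length]
```

Any edit to any skill body or example file changes the id, which is the property that makes historical runs comparable.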
Skill versioning in the public pack. [useorgx/skills PR #4](https://github.com/useorgx/skills/pull/4) added version: to every SKILL.md frontmatter and surfaced each agent skill's examples/ folder inline. That's prerequisite plumbing for any future "v2.1 lifted quality by X%" claim to be meaningful.
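Reading that version back out is straightforward once it sits in frontmatter. This is a minimal sketch that assumes the frontmatter is a `---`-delimited block of simple `key: value` lines at the top of SKILL.md; a real loader would more likely use a YAML parser, and `read_skill_version` is a hypothetical name.

```python
from typing import Optional


def read_skill_version(skill_md: str) -> Optional[str]:
    """Return the `version:` value from a SKILL.md frontmatter block, if any."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no frontmatter at all
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter, stop before the skill body
        key, sep, value = line.partition(":")
        if sep and key.strip() == "version":
            return value.strip().strip("\"'")
    return None


sample = "---\nname: product-agent\nversion: 2.0.0\n---\n# Product agent\n"
print(read_skill_version(sample))  # 2.0.0
```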
Honest caveats
- 8 missing runs in product-flow × GPT-4o from a transient OpenAI fetch failure early in the matrix. 136 of 144 (94%). We'll rerun those separately before the next report.
- gtm-flow-social variance is high. Sonnet scored 3.5 on the same task where Haiku scored 9.2 — same cell, same prompt. The strict rubric is apparently sensitive to output style in ways we did not control for. Worth digging into before building any claim on that task alone.
- Judge is Claude Haiku 4.5. Independent, but still a single judge model. The Phase 1 methodology post walks through the trade-offs with judge selection; Phase 2 inherits those.
- Skill bodies are ~450 lines each. Loading them as the system prompt costs extra input tokens per orgx_full call. The $3.79 total shows that cost was real but modest; at scale we'd want per-domain skill sharding.
Try it yourself
The matrix script is public. CLI supports --sequence, --model, --cell, --skills-dir, --dry-run. The JSON output follows the same CompareTask / CompareRun shape consumed by /runtime/compare and /benchmarks/arena, so if you fork the script and rerun with your own skill pack, the arena will render your numbers next to ours.
The matrix JSON itself is in the repo — 136 runs with full contract metadata, judge feedback, and the actual generated artifacts per cell.
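If you are forking the script, it may help to see how flags of this shape fit together. The sketch below is not the matrix script's source: it is an `argparse` mock-up that uses only the flag names listed above, with the program name, help text, and defaults all assumed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "matrix" is a placeholder program name, not the real script's name.
    parser = argparse.ArgumentParser(prog="matrix")
    parser.add_argument("--sequence", help="e.g. product-flow, incident-flow, gtm-flow")
    parser.add_argument("--model", help="restrict the run to one model")
    parser.add_argument("--cell", help="e.g. bare, memory, verifier, orgx_full")
    parser.add_argument("--skills-dir", help="path to a skill pack to load")
    parser.add_argument("--dry-run", action="store_true",
                        help="assumed meaning: plan the runs without calling any model")
    return parser


args = build_parser().parse_args(
    ["--sequence", "gtm-flow", "--cell", "memory", "--dry-run"]
)
print(args.sequence, args.cell, args.dry_run)  # gtm-flow memory True
```

Check the script's own `--help` output for the real semantics before relying on any of these.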
What's next
Phase 3 should answer: "Where does the cascade break?" Task 2 is where strong models can still fake context without memory; task 3 is where they cannot. What about a 5-task sequence? A 10-task one? Is there a context-budget ceiling past which state-summary handoffs themselves start degrading quality?
And: why does gtm-flow-social behave so differently from the other cascade tasks? Haiku 9.2 vs Sonnet 3.5 on the same prompt is too big a gap to ignore.
The matrix is extensible. The skill pack is hashed. The data is public. If you run this against your own sequences and find the pattern doesn't replicate, we want to know.