Weekly benchmark updates, methodology notes, and product learnings from OrgX.
Published posts
We Re-Ran the Autonomous Benchmark on Current Models: The OrgX autonomous initiative benchmark had a quiet rot problem: its judge panel pointed at models that no longer exist. We fixed the panel, re-ran the full catalog on current GA models, corrected a cost bug along the way, and tied the autonomy score to the gated-autonomy controls that now ship in OrgX.
Agents Do Not Need Another Issue Tracker: Linear, Jira, and Notion all expose useful MCP surfaces. OrgX is taking a different bet: the fastest agent workflow is the one where one tool call becomes durable company memory, execution state, and proof.
OrgX. The Company That Runs While You Are Away: Continuity is the missing infrastructure for AI-native companies. OrgX is the operating system that holds the thread between people, agents, decisions, and time.
Memory is the structural lift — Phase 2 substrate benchmark: We ran 136 tasks across 4 models × 4 orchestration cells × 3 dependent task sequences. Single-shot benchmarks structurally hide what agents cannot fake: cascading context. Here is the data, the surprises, and what we changed.
The Most Underrated Product Surface in AI Is the Setup Script: The first five minutes decide whether your AI tools share company memory or become six separate rooms with six separate cases of amnesia. That is why we built OrgX Wizard.
We Generated 75 Ad Concepts. The Useful Part Was Killing 60: The point was not volume. It was forcing the system to find the few ideas with enough pain, specificity, and visual tension to deserve production.
Our Autonomous Benchmark Has Independent Judges Now: The OrgX autonomous initiative benchmark now publishes generated artifacts, independent judgments, token-level costs, and the failures that still need human review.
Why AI-Generated Brand Content Is Mostly Slop: AI content looks like slop when a model is asked to carry taste, memory, and QA by itself. The fix is not a better prompt. It is a content system.