AI Optimization Framework Outperforms Claude Code & Codex 2.5x on Equal Compute

Developing robust AI systems often involves a laborious process of trial and error, especially when many operational parameters interact. Imagine an engineering team deploying an AI agent that answers employee questions by querying internal company documents. It performs flawlessly in development, but in production it frequently generates inaccurate information or misses crucial operational constraints. Fixing this is seldom simple: it typically means repeatedly adjusting chunking strategies, retrieval mechanisms, and system prompts together. Because these adjustments interact, it is very hard to tell which specific change produced an improvement.

To surmount this challenge, researchers from Renmin University of China and Microsoft Research have introduced Arbor, a novel framework. Arbor transforms AI-driven research and optimization from a series of guesswork experiments into a systematic, cumulative learning process. By organizing hypotheses, experiments, and derived insights into a structured tree, Arbor enables the system to learn from past failures and implement well-validated enhancements over time.

In the researchers’ evaluations, Arbor achieved an average relative gain on held-out tests more than 2.5 times that of conventional AI coding agents (Codex and Claude Code) across real-world engineering tasks, with all systems given the same resource budget.

For enterprise AI applications, this methodology offers a pathway to automating the continuous refinement of sophisticated, real-world engineering systems.

The Bottleneck in Autonomous Optimization

As large language models (LLMs) and other AI systems grow in sophistication, there is an increasing expectation for them to undertake more complex operations, including the autonomous optimization (AO) of software systems. This encompasses areas like agent harnesses and model training algorithms.

Autonomous optimization embodies the core loop of AI-driven research. An AI agent begins with an initial, mutable artifact—such as a machine learning codebase or a data processing pipeline—and a defined objective. The agent’s mandate is to iteratively enhance this artifact through experimental feedback, operating without constant human oversight.

A common misunderstanding surrounds the primary obstacle in AO. Many engineering teams observe that merely allocating more time or computational power to a coding agent for codebase optimization does not necessarily yield superior outcomes. “Automation can keep an AI working for a very long time — but a loop is not the same as progress,” Jiajie Jin, a co-author of the research paper, explained. “If the goal is vague, or the metric is easy to manipulate, long-running automation often just produces ‘improvements’ faster that nobody actually wants.”

Jin elaborates that achieving success in complex tasks requires numerous attempts, and standard agent architectures lack the essential data structures to maintain state effectively. “How do you ensure that the insights and experience gained from each attempt actually accumulate, rather than being lost in a scrollback buffer?” he questioned. Without such a structure, agents are prone to repeating the same errors.

Current agent systems can run experiments for extended periods, editing code, invoking tools, and running tests autonomously. However, they treat each attempt in isolation and lack the structure needed to accumulate and apply knowledge. They cannot manage and compare multiple competing research directions at once, so they cannot weigh successes and failures together to guide further exploration, which is what makes human research cumulative.

General-purpose coding agents typically rely on conversational transcripts for memory recall. Given that AO tasks can span hundreds of interactions and often exceed context window limitations, these agents struggle to preserve and reuse factual information over extended histories. This leads to a loss of the overarching research structure, making them susceptible to becoming stalled by early failures or chasing inconsequential evaluation fluctuations. A structured, persistent memory is required to record explored directions, generated factual evidence, and the impact of each outcome on future hypotheses.

Existing frameworks are also vulnerable to reward hacking—manipulating metrics to create an illusion of progress—and overfitting to development datasets. This results in the generation of seemingly improved code or parameters that do not translate to genuine real-world performance enhancements.

Furthermore, typical coding agents chain tool calls within a single, shared working environment. This architectural constraint hinders their ability to test parallel hypotheses in isolated settings without risking corruption of the main codebase or obscuring which specific hypothesis led to a particular result.

The Arbor Framework

Arbor addresses these problems with a framework that automates the long-horizon cycle of exploration, experimentation, and abstraction characteristic of human-led research. It separates the strategic direction of research from the execution of coding tasks by using two primary components:

  • The Coordinator: This acts as a long-lived AI agent, analogous to a principal investigator. It does not directly modify the target codebase. Instead, it maintains the overall state of the optimization research, analyzes accumulated evidence, devises new hypotheses and exploration directions, and determines the course of action based on experimental outcomes.
  • Executors: These are short-lived, highly specialized AI agents. When the coordinator identifies a hypothesis to test, it instantiates an executor and places it within an isolated environment, essentially a clean Git working tree. Each executor is assigned a single hypothesis. It proceeds to implement the idea, conduct evaluations, debug any errors, and then reports the results and any generated artifacts back to the coordinator.
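
The isolation step can be pictured with Git's own `git worktree` command. The following is only a sketch of how a clean working tree per hypothesis might be set up, not Arbor's actual implementation: the `hyp/` branch prefix, the sandbox directory layout, and the function name `worktree_commands` are all hypothetical. The function builds the commands without running them.

```python
import shlex
from pathlib import Path


def worktree_commands(repo: Path, base_ref: str, hypothesis_id: str) -> list[list[str]]:
    """Build (without running) the git commands for one executor's sandbox."""
    branch = f"hyp/{hypothesis_id}"                          # hypothetical naming scheme
    sandbox = repo.parent / f"{repo.name}-{hypothesis_id}"   # hypothetical location
    return [
        # Create a new branch from base_ref, checked out in its own directory.
        ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(sandbox), base_ref],
        # Remove the directory once the executor has reported back.
        ["git", "-C", str(repo), "worktree", "remove", "--force", str(sandbox)],
    ]


if __name__ == "__main__":
    # Example repository path and hypothesis name are made up.
    for cmd in worktree_commands(Path("/srv/rag-assistant"), "main", "chunking-512"):
        print(shlex.join(cmd))
```

Removing a worktree deletes only the directory; the branch and its commits remain in the repository, so a coordinator could still inspect or merge the executor's work afterwards.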

These components interact through a mechanism termed “Hypothesis Tree Refinement” (HTR). HTR models the research process as a persistent, branching tree structure. Each node within this tree links four critical elements: a hypothesis, the executable artifact associated with it, the factual evidence derived from experiments, and a distilled insight. This structure enables the coordinator to concurrently explore multiple competing research directions without losing track of progress.

The coordinator constructs this tree by positioning broad concepts near the root, with specific refinements branching out towards the leaves. This design permits Arbor to safely investigate numerous competing hypotheses simultaneously. If an executor’s experiment yields a negative outcome, the tree records the specific reason for failure as a constraint, preventing the system from repeatedly making the same mistake.
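
A hypothesis tree of this kind can be sketched as a small data structure. This is an illustration under stated assumptions rather than the researchers' code: the class name `HypothesisNode`, its field names, and the `failed` flag are invented here, and the score in the example is made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HypothesisNode:
    hypothesis: str                                   # what is being tried
    artifact_ref: Optional[str] = None                # e.g. a branch name (assumed)
    evidence: dict[str, float] = field(default_factory=dict)  # measured scores
    insight: Optional[str] = None                     # distilled lesson
    failed: bool = False                              # negative outcome recorded
    children: list["HypothesisNode"] = field(default_factory=list)

    def refine(self, hypothesis: str) -> "HypothesisNode":
        """Add a more specific hypothesis beneath this one."""
        child = HypothesisNode(hypothesis)
        self.children.append(child)
        return child

    def record(self, artifact_ref: str, evidence: dict[str, float],
               insight: str, failed: bool) -> None:
        """Attach an executor's report to this node."""
        self.artifact_ref = artifact_ref
        self.evidence = evidence
        self.insight = insight
        self.failed = failed

    def constraints(self) -> list[str]:
        """Insights from failed nodes in this subtree, depth-first."""
        found = [self.insight] if self.failed and self.insight else []
        for child in self.children:
            found.extend(child.constraints())
        return found


if __name__ == "__main__":
    root = HypothesisNode("Improve answer accuracy of the RAG pipeline")
    retrieval = root.refine("Change the retrieval method")
    bfs = retrieval.refine("Use breadth-first search over linked documents")
    bfs.record("hyp/bfs", {"dev_accuracy": 0.41},   # made-up score
               "Breadth-first search hurt accuracy", failed=True)
    print(root.constraints())
```

Broad ideas sit near the root and refinements hang beneath them, so asking the root for `constraints()` returns every recorded failure reason in one pass.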

To illustrate the significance of Arbor’s isolation mechanism, consider a common enterprise task: optimizing a Retrieval-Augmented Generation (RAG) pipeline for an internal AI assistant. “When you ask a single agent like Claude Code or Codex to ‘improve accuracy,’ it will typically change a bunch of things in one pass: chunking, the prompt, the retrieval method,” Jin noted. Entangled changes like these make it impossible to attribute an improvement to any one of them, and the agent applies them directly to the repository with no isolation. Arbor instead treats each optimization lever as a distinct hypothesis. Chunking becomes one branch, retrieval another, and the prompt configuration a third, each implemented and evaluated within its own isolated Git working tree. “So you get clean attribution: ‘constraint decomposition on the retrieval side gave +X; breadth-first search actually hurt,’” Jin stated.

Upon receiving a report from an executor, the coordinator logs the evidence into the tree and propagates the derived insight upwards to parent nodes. This process transforms a localized observation into a generalized constraint that informs the coordinator’s subsequent hypothesis generation.
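
The upward propagation step might look like the sketch below. It makes a strong simplifying assumption: the insight is copied verbatim to each ancestor, whereas the article describes the coordinator turning it into a generalized constraint. The names `Node`, `inherited_insights`, and `propagate` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    hypothesis: str
    parent: Optional["Node"] = None
    inherited_insights: list[str] = field(default_factory=list)


def propagate(node: Node, insight: str) -> int:
    """Copy an insight to every ancestor of node; return how many were updated."""
    updated = 0
    current = node.parent
    while current is not None:
        if insight not in current.inherited_insights:  # skip duplicates
            current.inherited_insights.append(insight)
            updated += 1
        current = current.parent
    return updated


if __name__ == "__main__":
    root = Node("Improve answer accuracy")
    retrieval = Node("Change the retrieval method", parent=root)
    bfs = Node("Use breadth-first search", parent=retrieval)
    print(propagate(bfs, "Breadth-first search hurt accuracy"))
    print(root.inherited_insights)
```

Once an ancestor holds the insight, any new hypothesis generated beneath that ancestor can be checked against it before an executor is dispatched.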

To mitigate reward hacking and overfitting to development data, HTR implements a rigorous “merge gate.” Even if an executor reports a highly favorable development score, the coordinator initiates an isolated test run against a separate, held-out evaluation dataset. The artifact is only integrated into the main development branch if it demonstrably enhances the test score, thereby validating the authenticity of the progress.
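
The gate's decision rule can be sketched in a few lines. This is an assumed reading of the description above, not the published implementation: the function name `merge_gate`, its parameters, and the strict greater-than comparison are illustrative choices, and every score in the example is made up.

```python
from typing import Callable


def merge_gate(
    candidate: str,
    dev_score: float,
    best_test_score: float,
    run_held_out: Callable[[str], float],
) -> tuple[bool, float]:
    """Return (merged, best held-out score so far) for one candidate artifact."""
    # dev_score is deliberately ignored: a strong development result alone
    # never justifies a merge in this sketch.
    test_score = run_held_out(candidate)
    if test_score > best_test_score:
        return True, test_score
    return False, best_test_score


if __name__ == "__main__":
    held_out = {"hyp/prompt-v2": 0.58, "hyp/chunking-512": 0.66}  # made-up scores
    best = 0.60
    for ref, dev in [("hyp/prompt-v2", 0.81), ("hyp/chunking-512", 0.74)]:
        merged, best = merge_gate(ref, dev, best, held_out.__getitem__)
        print(ref, "merged" if merged else "rejected", best)
```

In this example the candidate with the higher development score is rejected because its held-out score falls below the current best, which is the kind of overfit result the gate exists to catch.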

Arbor aligns with the principles of “loop engineering,” a concept championed by industry leaders such as Peter Steinberger, creator of OpenClaw, and Boris Cherny, lead for Claude Code. This approach moves beyond single-prompt interactions to design iterative cycles (observe, reason, act, verify) that drive autonomous agents. However, as Jin points out, “A loop can become filled with messy, untraceable attempts, and you end up with nothing concrete and no way to reconstruct what actually changed.”

Arbor in Action

The researchers evaluated Arbor using a suite of autonomous optimization tasks derived from real-world research scenarios and the MLE-Bench Lite benchmark for machine learning engineering. This AO suite included tasks spanning various AI development domains, such as model training optimization, harness engineering, and data synthesis.

For the evaluation, different backbone models were utilized for the coordinator and executor agents, including Claude Opus 4.6, GPT-5.5, and Gemini-3-Flash. Arbor was benchmarked against leading coding agents, Codex and Claude Code, with both Arbor and the baseline systems operating under identical resource constraints. For tasks within MLE-Bench Lite, Arbor was also compared against other advanced agentic research systems, including AI-Scientist, ML-Master, and AIDE.

Arbor consistently outperformed the baseline systems, achieving the best held-out test results across all evaluated tasks. It demonstrated an average relative gain more than 2.5 times greater than that of Codex and Claude Code. In the BrowseComp task, which involves optimizing a search agent’s performance, Arbor elevated the system’s held-out accuracy from a baseline of 45.33% to 67.67%. In contrast, Codex and Claude Code plateaued at 50% and 53.33%, respectively. Within the MLE-Bench Lite benchmark, when powered by GPT-5.5, Arbor delivered the top performance among all systems tested.
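
To make the BrowseComp figures concrete, the short calculation below computes each system's gain relative to the 45.33% baseline. It assumes "relative gain" means (score − baseline) / baseline, which is a common definition but is not spelled out in the article; the function name `relative_gain` is illustrative.

```python
def relative_gain(baseline: float, score: float) -> float:
    """Improvement over the baseline, as a fraction of the baseline."""
    return (score - baseline) / baseline


BASELINE = 45.33  # BrowseComp held-out accuracy before optimization (%)
HELD_OUT = {"Arbor": 67.67, "Codex": 50.00, "Claude Code": 53.33}

if __name__ == "__main__":
    for name, score in HELD_OUT.items():
        print(f"{name}: {relative_gain(BASELINE, score):+.1%}")
```

Under that definition the relative gains on this one task come to roughly 49% for Arbor, 10% for Codex, and 18% for Claude Code. The 2.5x figure quoted above is an average across all tasks, so it need not match the ratio on any single task.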

Arbor also proved resilient to overfitting. On the Terminal-Bench 2.0 task, for instance, Claude Code reached a development score of 75% but dropped to 71% on the held-out evaluation data. Arbor posted a slightly lower development score of 72.22% yet a higher held-out score of 77.36%, indicating that its improvements carried over to data it had not been tuned against.

Furthermore, Arbor exhibited cross-task generalization capabilities. Following the optimization of the search harness for the BrowseComp task, the researchers applied the enhanced codebase to two unrelated search-agent tasks, HLE and DeepSearchQA. Arbor’s optimized codebase resulted in substantial performance improvements even on these previously unseen tasks.

Deploying Arbor: Optimal Use Cases and Associated Costs

For engineering leaders considering the integration of Arbor into their existing technological infrastructure, the framework is designed to operate atop current Git workflows rather than supplanting them. “Its output is an ordinary Git branch that your existing code review, CI, and human review processes can inspect directly,” Jin stated. Only verified performance gains are merged into a run-specific branch, ensuring the main repository remains unaffected until a developer manually approves the changes.

However, deploying Arbor carries costs. Jin highlights token usage as the primary one: the coordinator runs continuously to manage the hypothesis tree and dispatch executors. Running multiple isolated worktrees concurrently also demands substantial compute and disk space.

Arbor’s optimal application lies in scenarios characterized by:

  • A clearly defined and reliable evaluation metric.
  • A tolerance for extended optimization timelines.
  • A complex search space with multiple plausible avenues for improvement.

This makes it particularly well-suited for tasks such as pipeline optimization, data synthesis quality enhancement, and tuning model training parameters.

Conversely, organizations should avoid using Arbor for latency-sensitive real-time applications, for straightforward single-line fixes, or when the underlying evaluation metric is unreliable. The ceiling of any Arbor-driven optimization run is set by the quality and trustworthiness of that metric. “If the metric isn’t trustworthy, Arbor will just optimize toward an untrustworthy result faster,” Jin cautioned.

Jin envisions the next phase of development focusing on moving beyond single scalar metrics. “A natural evolution is to have each node’s artifact carry a vector—accuracy, latency, cost—instead of a single score,” Jin proposed. “Transitioning from a single scalar to a multi-objective Pareto search represents a very logical extension of the framework.”
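
The multi-objective idea Jin describes can be illustrated with a standard Pareto-dominance check. This sketches the general technique only, since the article describes it as future work: the `Metrics` fields follow the three quantities Jin names, while the function names and all numbers are made up for illustration.

```python
from typing import NamedTuple


class Metrics(NamedTuple):
    accuracy: float    # higher is better
    latency_ms: float  # lower is better
    cost_usd: float    # lower is better


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    no_worse = (a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
                and a.cost_usd <= b.cost_usd)
    strictly_better = (a.accuracy > b.accuracy or a.latency_ms < b.latency_ms
                       or a.cost_usd < b.cost_usd)
    return no_worse and strictly_better


def pareto_front(candidates: dict[str, Metrics]) -> list[str]:
    """Names of candidates that no other candidate dominates."""
    return [
        name for name, m in candidates.items()
        if not any(dominates(other, m) for other in candidates.values())
    ]


if __name__ == "__main__":
    nodes = {  # made-up measurements for three hypothetical artifacts
        "hyp/rerank": Metrics(0.70, 900.0, 0.02),
        "hyp/chunking-512": Metrics(0.68, 400.0, 0.01),
        "hyp/long-prompt": Metrics(0.66, 950.0, 0.03),
    }
    print(pareto_front(nodes))
```

With a vector per node there is no single best artifact: the first two candidates here trade accuracy against latency and cost, and only the third is strictly worse than another, so a merge decision would have to choose among the non-dominated set.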

Business Style Takeaway: Arbor represents a significant advancement in the autonomous optimization of AI systems, addressing the critical need for structured learning and verifiable improvements. Its ability to manage complex exploration cycles and ensure reliable performance gains makes it a compelling tool for enterprises seeking to enhance the efficiency and effectiveness of their AI-driven engineering processes, particularly in areas with well-defined metrics.

Source: venturebeat.com
