
Test-time scaling (TTS) has proven effective in enhancing the performance of large language models (LLMs) for practical applications by allocating additional computational resources during the inference phase. However, traditional TTS strategies have largely been manually engineered, relying on human intuition to define the model’s reasoning protocols.
To overcome this limitation, researchers from Meta, Google, and several academic institutions have developed AutoTTS, a novel framework designed to automatically discover optimal TTS strategies. This automated approach empowers organizations to dynamically optimize compute allocation, circumventing the need for manual heuristic tuning.
By deploying the optimal strategies identified by AutoTTS, businesses can significantly reduce token consumption and operational expenses associated with deploying advanced reasoning models in production. In experimental evaluations, AutoTTS demonstrated efficient management of inference budgets, achieving up to a 69.5% reduction in token usage without compromising accuracy.
The Manual Bottleneck in Test-Time Scaling
Test-time scaling improves LLM capabilities by providing extra compute cycles during response generation. This additional processing power enables the model to explore multiple reasoning pathways or refine intermediate steps before finalizing a response.
The primary challenge in developing effective TTS strategies lies in optimally allocating this supplementary computation. Historically, these strategies have been designed manually, based on educated guesses and rigid heuristics. Engineers must formulate hypotheses regarding the rules and thresholds that govern when a model should branch into new lines of reasoning, delve deeper into an existing path, prune less promising avenues, or cease its reasoning process altogether.
This reliance on human intuition inherently limits the scope of potential strategies, leaving many viable approaches unexplored. The consequence is often a compromise between model accuracy and computational cost that is less than ideal.
Existing TTS methodologies can be conceptualized within a width-depth control spectrum, where “width” refers to the number of reasoning branches explored and “depth” pertains to how far each branch is developed. Techniques like Self-Consistency (SC) sample a fixed number of trajectories and determine the answer by majority vote. Adaptive-Consistency (ASC) conserves compute by terminating early once a predefined confidence threshold is met. Parallel-Probe offers a more granular approach, pruning suboptimal branches while expanding others. Each of these methods is hand-crafted, representing the very constraint AutoTTS aims to dismantle.
While some advanced techniques incorporate more complex structures like tree search or external verifiers, they all share the common characteristic of being meticulously hand-engineered. This manual process restricts the discovery of strategies, leaving a substantial portion of the potential resource allocation space untapped.
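To make the width end of that spectrum concrete, here is a minimal Python sketch of fixed-budget majority voting in the style of Self-Consistency, next to an early-stopping variant. The function names, the `sample_answer` callable, and the vote-share stopping rule are illustrative assumptions; the article does not spell out the exact confidence criterion ASC uses, so a simple threshold stands in for it.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistency(sample_answer: Callable[[], str], n: int) -> Tuple[str, int]:
    """Sample a fixed number of trajectories and return (majority answer, samples used)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [sample_answer() for _ in range(n)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, n


def early_stop_consistency(
    sample_answer: Callable[[], str],
    max_n: int,
    threshold: float = 0.8,   # assumed vote share that counts as "confident"
    min_samples: int = 4,     # assumed minimum before stopping is allowed
) -> Tuple[str, int]:
    """Sample one trajectory at a time and stop once the leader's vote share
    reaches the threshold. This is a stand-in rule, not the published ASC criterion."""
    if max_n < 1:
        raise ValueError("max_n must be at least 1")
    counts: Counter = Counter()
    for used in range(1, max_n + 1):
        counts[sample_answer()] += 1
        leader, votes = counts.most_common(1)[0]
        if used >= min_samples and votes / used >= threshold:
            return leader, used
    return counts.most_common(1)[0][0], max_n
```

Both rules are fixed in advance by whoever writes them, which is the kind of hand-set threshold the article describes as the manual bottleneck.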
Automating Strategy Discovery with AutoTTS
AutoTTS takes a different approach to optimizing test-time scaling. Instead of treating strategy design as a human-driven task, it frames it as an algorithmic search problem within a controlled experimental setting.
This framework redefines the roles of both the human engineer and the AI model. Rather than manually defining specific rules for when an LLM should branch, prune, or halt its reasoning, the engineer’s focus shifts to constructing the discovery environment. This involves setting the parameters, including the state and action control space, the optimization objectives that balance accuracy against cost, and the specific feedback mechanisms to be employed.

An exploratory LLM, such as Claude Code, generates the strategies. This explorer functions as an autonomous agent that iteratively proposes TTS “controllers”—policies or algorithms defined in code that govern how an AI model allocates its computational budget during inference. The explorer tests and refines these controllers based on feedback until it discovers an optimal resource allocation policy.
To ensure the computational feasibility of this automated search process, AutoTTS utilizes an “offline replay environment.” If the explorer LLM were required to invoke a base reasoning model for new token generation with every tested strategy, the computational cost would be prohibitive. Instead, it leverages thousands of reasoning trajectories pre-recorded from the base LLM. These trajectories include “probe signals”—intermediate outputs that assist the controller in evaluating progress across various reasoning branches.
Within the discovery loop, the explorer agent proposes a controller and evaluates it against the offline data. The agent observes the execution traces of the proposed controller, which detail how it allocated compute over time. By analyzing these traces, the agent can diagnose specific failure modes, such as a controller that pruned branches prematurely in a particular scenario, which a final accuracy score alone would not reveal. The agent then iteratively refines its code to improve the accuracy-cost trade-off.
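The replay idea can be sketched in a few lines of Python. Everything here is a hypothetical illustration rather than the AutoTTS codebase: the `Trajectory`, `Problem`, and `ReplayResult` types, the three action names, and the `cost_weight` in the objective are all assumptions. The point is only that a candidate controller can be scored against recorded probe signals without generating a single new token, and that the returned trace exposes what the controller did step by step.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    probes: List[str]      # recorded intermediate answer at each checkpoint
    tokens_per_step: int   # tokens consumed between checkpoints


@dataclass
class Problem:
    trajectories: List[Trajectory]  # pre-recorded pool for this question
    gold: str                       # reference answer


@dataclass
class ReplayResult:
    correct: bool
    tokens: int
    trace: List[str]       # sequence of actions, for diagnosing failures


# A controller sees the latest probe per open branch and returns an action name.
Controller = Callable[[List[str], List[int]], str]


def latest_probes(problem: Problem, depths: List[int]) -> List[str]:
    return [problem.trajectories[i].probes[d - 1] for i, d in enumerate(depths)]


def replay(problem: Problem, controller: Controller, max_actions: int = 64) -> ReplayResult:
    depths = [1]  # one branch open, advanced to its first checkpoint
    trace: List[str] = []
    for _ in range(max_actions):
        action = controller(latest_probes(problem, depths), depths)
        trace.append(action)
        if action == "stop":
            break
        if action == "widen" and len(depths) < len(problem.trajectories):
            depths.append(1)  # open the next recorded trajectory
        elif action == "deepen":
            depths = [min(d + 1, len(problem.trajectories[i].probes))
                      for i, d in enumerate(depths)]
    answer = Counter(latest_probes(problem, depths)).most_common(1)[0][0]
    tokens = sum(d * problem.trajectories[i].tokens_per_step for i, d in enumerate(depths))
    return ReplayResult(answer == problem.gold, tokens, trace)


def score(problems: List[Problem], controller: Controller, cost_weight: float = 1e-4) -> float:
    """Assumed objective: mean accuracy minus a penalty on mean token use."""
    results = [replay(p, controller) for p in problems]
    accuracy = sum(r.correct for r in results) / len(results)
    mean_tokens = sum(r.tokens for r in results) / len(results)
    return accuracy - cost_weight * mean_tokens


def simple_controller(probes: List[str], depths: List[int]) -> str:
    """Toy hand-written baseline: open three branches, deepen until they agree."""
    if len(probes) < 3:
        return "widen"
    return "deepen" if len(set(probes)) > 1 else "stop"
```

In a search loop of this shape, the explorer would rewrite `simple_controller`, call `score`, inspect the traces of failing problems, and repeat.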
Inside the AI-Designed Controller
Unburdened by human intuition, the explorer agent can uncover highly coordinated and complex rules that a human engineer would likely never devise. One such optimized controller discovered by AutoTTS, termed the Confidence Momentum Controller (CMC), employs several non-intuitive mechanisms to manage computational resources:
- Trend-Based Stopping: Manual strategies often direct the model to cease reasoning once an immediate confidence threshold is reached. The AutoTTS agent recognized that instantaneous confidence can be misleading due to temporary fluctuations. Consequently, the CMC tracks an exponential moving average (EMA) of confidence, halting reasoning only when the overall confidence level is high and the trend is not actively decreasing.
- Coupled Width-Depth Control: Manually designed algorithms typically treat the expansion of new reasoning paths (“widening”) and the progression of current paths (“deepening”) as separate decisions. AutoTTS identified a closed feedback loop where these two actions are intrinsically linked. If the confidence across active branches plateaus or declines, the controller automatically initiates the spawning of new branches.
- Alignment-Aware Depth Allocation: Rather than distributing computational budget equally among all active reasoning branches, the controller dynamically identifies branches that align with the current leading answer. It then prioritizes these branches with “bursts” of additional computation, thereby concentrating resources on the emerging consensus to expedite verification.
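The three mechanisms above can be combined in a short Python sketch. This is not the released CMC code: the class name `MomentumSketch`, every parameter value, and the choice to measure confidence as the leading answer's vote share are assumptions made for illustration, since the article does not define them.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MomentumSketch:
    alpha: float = 0.5          # EMA smoothing factor (assumed)
    stop_level: float = 0.8     # EMA confidence required to stop (assumed)
    plateau_eps: float = 0.01   # smallest EMA gain that counts as progress (assumed)
    burst: int = 3              # steps granted to branches aligned with the leader (assumed)
    ema: Optional[float] = None
    prev_ema: Optional[float] = None

    def decide(self, answers: List[str]) -> Dict[str, object]:
        """Take the current answer of each active branch and return one decision."""
        if not answers:
            raise ValueError("at least one active branch is required")
        leader, votes = Counter(answers).most_common(1)[0]
        confidence = votes / len(answers)  # assumed proxy: leader's vote share

        # Update the exponential moving average of confidence.
        self.prev_ema = self.ema
        if self.ema is None:
            self.ema = confidence
        else:
            self.ema = self.alpha * confidence + (1 - self.alpha) * self.ema

        has_trend = self.prev_ema is not None
        trend = self.ema - self.prev_ema if has_trend else 0.0

        # Trend-based stopping: smoothed confidence is high and not falling.
        if has_trend and self.ema >= self.stop_level and trend >= 0:
            return {"stop": True, "widen": False, "steps": [0] * len(answers)}

        # Coupled width-depth control: a plateau or decline triggers a new branch.
        widen = has_trend and trend < self.plateau_eps

        # Alignment-aware depth: branches agreeing with the leader get a burst.
        steps = [self.burst if a == leader else 1 for a in answers]
        return {"stop": False, "widen": widen, "steps": steps}
```

Requiring at least two observations before stopping is a design choice of this sketch, so that a single early high reading cannot end reasoning before any trend exists.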
Cost Savings and Accuracy Gains in Real-World Benchmarks
To validate the efficacy of AI-driven autonomous discovery of test-time scaling strategies, researchers established a rigorous evaluation framework. The core experiments utilized Qwen models ranging from 0.6 billion to 8 billion parameters. The system’s generalization capability was further tested on a distilled 8 billion parameter version of the DeepSeek-R1 model.
The explorer AI agent was initially tasked with discovering an optimal strategy using the AIME24 mathematical reasoning benchmark. The derived strategy was subsequently evaluated on two held-out math benchmarks (AIME25 and HMMT25) and the graduate-level general reasoning benchmark, GPQA-Diamond.
The AutoTTS-discovered controller was benchmarked against four industry-standard, manually designed test-time scaling algorithms. These included Self-Consistency with 64 parallel reasoning paths (SC@64), Adaptive-Consistency (ASC), Parallel-Probe, and Early-Stopping Self-Consistency (ESC)—a hybrid approach that generates trajectories in parallel and stops early when an answer achieves stability.

In a balanced, cost-conscious configuration, the AutoTTS-discovered controller reduced total token consumption by approximately 69.5% compared to SC@64, while maintaining equivalent average accuracy across the four Qwen models. When the inference budget was increased, AutoTTS achieved peak accuracy surpassing all handcrafted baselines in five out of eight test scenarios.
This efficiency translated effectively to other tasks. On the GPQA-Diamond benchmark, the balanced AutoTTS variant reduced inference token costs from 510K to just 151K tokens, with a slight improvement in overall accuracy. For the DeepSeek model, AutoTTS achieved the highest overall accuracy on the HMMT25 benchmark while nearly halving token expenditure.
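As a quick check on the GPQA-Diamond figures, the relative saving can be computed directly from the two token counts quoted above. The helper name `token_reduction` is just for illustration.

```python
def token_reduction(baseline_tokens: float, new_tokens: float) -> float:
    """Fraction of tokens saved relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - new_tokens) / baseline_tokens


gpqa_saving = token_reduction(510_000, 151_000)
print(f"GPQA-Diamond token reduction: {gpqa_saving:.1%}")  # prints 70.4%
```

That is close to the roughly 69.5% reduction reported for the math benchmarks.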
For professionals developing enterprise AI applications, these findings highlight two significant operational advantages:
- Elevated Peak Performance: AutoTTS not only optimizes token consumption but also actively enhances the base model’s maximum achievable performance. The AI-designed controller excels at identifying unproductive reasoning branches in real-time and dynamically redirecting computational resources toward pathways yielding the most valuable reasoning signals.
- Cost-Effective Custom Development: The discovery process, leveraging an offline replay environment, incurred a cost of only $39.90 and took 160 minutes. This makes optimized reasoning strategies tailored to proprietary models and internal tasks accessible to enterprise teams without requiring a substantial research budget.
Both the AutoTTS framework and the Confidence Momentum Controller are publicly available on GitHub, with the CMC designed as a direct substitute for existing TTS controllers.
Business Takeaway: Automating the design of test-time scaling strategies with frameworks like AutoTTS gives enterprises a way to reduce operational costs and improve LLM performance. By moving beyond manual heuristics, businesses can allocate inference compute more efficiently, cutting token spend while maintaining or improving accuracy, which makes deploying advanced reasoning models more economically viable.
Information compiled from materials: venturebeat.com
