RecursiveMAS: 2.4x Faster Multi-Agent Inference, 75% Fewer Tokens

Multi-agent AI systems typically collaborate by passing sequentially generated text between agents. This adds latency, drives up token costs, and makes it difficult to train the whole system as a single unit.

To address these limitations, researchers from the University of Illinois Urbana-Champaign and Stanford University have introduced RecursiveMAS, a novel framework that enables agents to communicate and share information within their embedding space rather than through text. This architectural shift yields substantial improvements in both efficiency and performance.

Experimental results demonstrate that RecursiveMAS enhances accuracy across complex tasks, including code generation, medical reasoning, and information retrieval, while simultaneously boosting inference speed and drastically reducing token consumption.

Furthermore, RecursiveMAS proves to be significantly more cost-effective to train compared to standard methods like full fine-tuning or LoRA, positioning it as a scalable and economical blueprint for developing bespoke multi-agent systems.

The inherent challenges in enhancing multi-agent systems

Multi-agent systems offer a powerful approach to tackling intricate problems that often elude the capabilities of single-agent configurations. However, scaling these systems for practical, real-world applications presents a significant challenge: ensuring they can continuously evolve, improve, and adapt to diverse and dynamic scenarios.

One method for improving agent interaction involves prompt-based adaptation, where the shared context provided to agents is iteratively refined. In this model, the system acts as a director, modifying prompts to guide agents toward generating responses that better align with the overall objective. The fundamental constraint here is that the underlying capabilities of the models powering each agent remain static.

A more advanced strategy is to train the agents by directly updating the weights of their underlying models. However, training an entire multi-agent system this way is computationally demanding, because it means updating every parameter across several models.

Even when teams commit to model training, the conventional method of agent communication via text-based interactions creates substantial bottlenecks. Because agents depend on sequential text generation, latency becomes an issue as each model must await the completion of the preceding one’s text output before initiating its own processing.

Requiring models to spell out their intermediate reasoning token by token, solely so the next model can read it, is inherently inefficient. It inflates token usage, raises computational costs, and slows iterative learning across the entire system, which hinders scalability.
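The waiting effect described above can be made concrete with a toy schedule. This is only a sketch: the message lengths and the per-token decode time are made-up inputs, not measurements from the paper, and `relay_schedule` is a hypothetical helper.

```python
def relay_schedule(message_tokens, ms_per_token):
    """Return (start_ms, end_ms) for each agent in a text relay.

    Each agent can only start once the previous agent has finished
    decoding its entire message, one token at a time.
    """
    schedule = []
    clock = 0
    for tokens in message_tokens:
        start = clock
        clock += tokens * ms_per_token
        schedule.append((start, clock))
    return schedule


# Illustrative inputs only: three agents, 20 ms per decoded token.
message_tokens = [200, 150, 100]
schedule = relay_schedule(message_tokens, ms_per_token=20)

for index, (start, end) in enumerate(schedule, start=1):
    print(f"agent {index}: starts at {start} ms, finishes at {end} ms")

total_ms = schedule[-1][1]
print("end-to-end latency:", total_ms, "ms")
```

In this model the end-to-end latency is proportional to the total number of tokens written by all agents, which is why intermediate text is the expensive part of the pipeline.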

The operational mechanics of RecursiveMAS

Unlike approaches that focus on improving individual agents as isolated components, RecursiveMAS is engineered to facilitate the co-evolution and scaling of the entire multi-agent system as a singular, integrated entity.

The framework draws inspiration from recursive language models (RLMs). In a typical language model, data progresses linearly through a series of distinct layers. In contrast, an RLM reuses a shared set of layers, processing the data and feeding it back into the same layers. This computational looping enables the model to deepen its reasoning capabilities without necessitating an increase in parameters.
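A minimal sketch of the contrast, using a toy layer rather than a real transformer block: the helper names (`make_layer`, `apply_layer`, `count_params`) and the tanh-plus-residual layer are assumptions made for illustration, not the architecture of any particular recursive model.

```python
import math
import random


def make_layer(dim, rng):
    """Random dim x dim weight matrix standing in for one model layer."""
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, x):
    """Toy layer: linear map, tanh non-linearity, residual connection."""
    return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
            for row, xi in zip(weights, x)]


def count_params(layers):
    return sum(len(row) for layer in layers for row in layer)


rng = random.Random(0)
dim, depth = 8, 4
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Standard stack: `depth` distinct layers, each applied once.
stack = [make_layer(dim, rng) for _ in range(depth)]
h_stack = x
for layer in stack:
    h_stack = apply_layer(layer, h_stack)

# Recursive variant: one shared layer applied `depth` times.
shared = make_layer(dim, rng)
h_loop = x
for _ in range(depth):
    h_loop = apply_layer(shared, h_loop)

print("stacked params:  ", count_params(stack))      # 4 * 8 * 8 = 256
print("recursive params:", count_params([shared]))   # 8 * 8 = 64
```

Both variants perform four layer applications, but the looped version does so with a quarter of the weights. Adding more loops increases compute depth without adding parameters.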

RecursiveMAS extends this principle of recursive scaling from a single model to a multi-agent architecture that functions as a cohesive recursive system. Within this framework, each agent operates analogously to a layer in a recursive language model. Instead of generating text, agents iteratively transmit their continuous latent representations to the subsequent agent in the sequence, establishing a looped stream of information flow within the system’s hidden state.

This exchange of latent information progresses sequentially through all agents. Upon completion of processing by the final agent, its latent outputs are fed directly back to the initial agent, initiating a new cycle of recursion.

This architecture enables the entire multi-agent system to interact, reflect, and refine its collective reasoning over multiple rounds entirely within the latent space, with only the final agent producing a textual output in the concluding round. It effectively mimics agents communicating telepathically as a unified consciousness, with the final agent articulating the synthesized response in text.
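The control flow of that loop can be sketched in a few lines. Everything here is a stand-in: `make_agent` builds a hypothetical agent as a simple function on a list of floats, and `decode` is a placeholder for the final agent's text generation, not the released RecursiveMAS code.

```python
import math


def make_agent(shift):
    """Hypothetical agent: maps a latent vector to a new latent vector."""
    def step(latent):
        return [math.tanh(value + shift) for value in latent]
    return step


decode_calls = []


def decode(latent):
    """Placeholder for the final agent turning its latent state into text."""
    decode_calls.append(1)
    return " ".join(f"{value:.3f}" for value in latent)


def run_recursive_mas(agents, latent, rounds, decode_fn):
    for _ in range(rounds):
        # The latent state passes through every agent in order. The last
        # agent's output becomes the first agent's input in the next round.
        for agent in agents:
            latent = agent(latent)
    # Text is produced exactly once, after the final round.
    return decode_fn(latent)


agents = [make_agent(0.1), make_agent(-0.2), make_agent(0.3)]
answer = run_recursive_mas(agents, [0.5, -0.5, 1.0], rounds=3, decode_fn=decode)
print(answer)
print("decode calls:", len(decode_calls))
```

With three agents and three rounds there are nine latent hand-offs but only a single decoding step, which is the property the text-based variant lacks.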

The architecture enabling latent collaboration

To facilitate seamless collaboration in the continuous latent space, the researchers have developed a specialized architectural component termed the RecursiveLink. This module, characterized by its lightweight design and two-layer structure, is engineered to transmit and refine a model’s latent states, thereby bypassing the necessity of text decoding.

The hidden states of a language model’s final layer encapsulate a rich, semantic representation of its reasoning process. The RecursiveLink is specifically designed to preserve and transfer this high-dimensional information between different embedding spaces.

To mitigate the computational costs associated with updating every parameter across multiple large language models (LLMs), the framework maintains the parameters of the core models in a frozen state. Instead, system optimization is achieved by training solely the parameters of the RecursiveLink modules.

The system employs two variations of the RecursiveLink module to manage both internal reasoning and external communication. The inner RecursiveLink operates within an agent during its reasoning phase. It receives the model’s newly generated embeddings and maps them directly back into its own input embedding space, enabling the agent to continuously produce a stream of latent thoughts without generating discrete text tokens.

The outer RecursiveLink serves as the crucial interface between agents. Given that agents in a production system may utilize disparate model architectures and sizes, their internal embedding spaces often have different dimensionalities. The outer RecursiveLink incorporates an additional layer specifically designed to align the embeddings from one agent’s hidden dimension with the embedding space of the next agent in the sequence.
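The difference between the two link types can be sketched as follows. The article only states that the link has two layers and that the outer variant adds an alignment layer, so the tanh activation, the layer widths, the ordering of the layers, and the class names `InnerLink` and `OuterLink` are all assumptions made for this illustration.

```python
import math
import random


def linear(rng, d_in, d_out):
    """Random weight matrix with d_out rows and d_in columns."""
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


def matvec(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class InnerLink:
    """Two-layer map from an agent's hidden state back to its own input space."""

    def __init__(self, rng, dim):
        self.w1 = linear(rng, dim, dim)
        self.w2 = linear(rng, dim, dim)

    def __call__(self, hidden_state):
        mid = [math.tanh(v) for v in matvec(self.w1, hidden_state)]
        return matvec(self.w2, mid)


class OuterLink(InnerLink):
    """Same two layers plus a projection into the next agent's embedding size."""

    def __init__(self, rng, src_dim, dst_dim):
        super().__init__(rng, src_dim)
        self.align = linear(rng, src_dim, dst_dim)

    def __call__(self, hidden_state):
        return matvec(self.align, super().__call__(hidden_state))


rng = random.Random(0)
hidden = [rng.gauss(0.0, 1.0) for _ in range(6)]   # agent A has hidden size 6

inner = InnerLink(rng, dim=6)
outer = OuterLink(rng, src_dim=6, dst_dim=10)      # agent B expects size 10

print(len(inner(hidden)))   # 6: fed back into agent A
print(len(outer(hidden)))   # 10: handed to agent B
```

The inner link keeps the dimensionality unchanged so its output can be fed straight back into the same agent, while the outer link changes it to whatever the next agent's embedding space requires.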

During the training process, the inner links are initially trained independently to acclimatize each agent’s capacity for thinking in continuous latent embeddings. Subsequently, the system enters an outer-loop training phase, where the diverse, frozen models are interconnected in a loop, and the system’s performance is assessed based on the final textual output generated by the last agent.

The training process exclusively updates the parameters of the RecursiveLink modules, leaving the original model weights untouched—a methodology akin to low-rank adaptation (LoRA). An additional benefit arises when multiple agents are built upon the same foundational model.

In scenarios where two agents share the exact same backbone model but assume distinct roles, there is no need to load duplicate instances of the model into GPU memory or train them separately. The agents effectively share the same underlying model as their cognitive core, utilizing the RecursiveLink as the connective tissue for communication.
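A rough way to picture the saving is to count the parameters that actually have to be resident in memory. The sizes below are arbitrary illustrative numbers, and `Backbone`, `Agent`, and `loaded_params` are hypothetical names, not part of the released code.

```python
class Backbone:
    """Stand-in for a frozen LLM; num_params is an illustrative size."""

    def __init__(self, name, num_params):
        self.name = name
        self.num_params = num_params


class Agent:
    def __init__(self, role, backbone, link_params):
        self.role = role
        self.backbone = backbone        # a reference, not a copy
        self.link_params = link_params  # size of this agent's link module


def loaded_params(agents):
    """Parameters held in memory: each distinct backbone counted once."""
    unique = {id(agent.backbone): agent.backbone for agent in agents}
    backbone_total = sum(b.num_params for b in unique.values())
    link_total = sum(agent.link_params for agent in agents)
    return backbone_total + link_total


shared = Backbone("shared-llm", 1_000_000)
shared_agents = [Agent("planner", shared, 5_000),
                 Agent("critic", shared, 5_000)]

duplicated_agents = [Agent("planner", Backbone("copy-1", 1_000_000), 5_000),
                     Agent("critic", Backbone("copy-2", 1_000_000), 5_000)]

print("shared backbone: ", loaded_params(shared_agents))      # 1010000
print("separate copies: ", loaded_params(duplicated_agents))  # 2010000
```

Only the per-agent link modules differ between the two roles, so adding another agent on the same backbone costs the size of one more link rather than one more model.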

RecursiveMAS in application

The researchers rigorously evaluated RecursiveMAS across nine benchmarks encompassing mathematics, science and medicine, code generation, and search-based question answering. For these evaluations, they constructed multi-agent systems using open-weight models such as Qwen, Llama-3, Gemma3, and Mistral. These models were assigned specific roles to facilitate diverse agent collaboration patterns, including sequential reasoning and mixture-of-experts architectures.

RecursiveMAS was benchmarked against several baseline approaches. These included standalone models augmented with LoRA or full supervised fine-tuning, alternative multi-agent frameworks such as Mixture-of-Agents and TextGrad, and recursive baselines like LoopLM. Performance was also compared against Recursive-TextMAS, which utilizes the same recursive looping structure as RecursiveMAS but mandates explicit text-based communication between agents.

Across the evaluated benchmarks, RecursiveMAS achieved an average accuracy improvement of 8.3% compared to the most effective baseline methods. It demonstrated particularly strong performance on tasks requiring complex reasoning, outperforming text-based optimization methods like TextGrad by 18.1% on the AIME2025 benchmark and 13% on AIME2026.

By circumventing the generation of text at each intermediate step, RecursiveMAS achieved an end-to-end inference speedup ranging from 1.2x to 2.4x.

The system also demonstrates remarkable token efficiency. When compared to the text-based Recursive-TextMAS, RecursiveMAS reduced token usage by 34.6% in the first recursion round, and by the third round, this reduction increased to 75.6%.

Furthermore, RecursiveMAS proved exceptionally cost-effective for training. It only updates the relatively small RecursiveLink modules (approximately 13 million parameters, or about 0.31% of the parameters in the frozen models), so it requires minimal peak GPU memory and reduces training costs by over 50% compared to full fine-tuning.
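A quick back-of-the-envelope check puts these figures in perspective. The inputs are the numbers reported above; the derived values, in particular the implied combined backbone size, are inferences from those numbers and are not stated in the article.

```python
# Reported figures from the article.
link_params = 13_000_000       # trainable RecursiveLink parameters
link_share = 0.0031            # 0.31% of the frozen models' parameters
token_reduction_round1 = 0.346
token_reduction_round3 = 0.756
speedups = (1.2, 2.4)

# Inference, not a reported number: total size of the frozen backbones.
implied_total_params = link_params / link_share
print(f"implied frozen parameters: {implied_total_params / 1e9:.2f}B")  # 4.19B

# A reduction of r means the text baseline uses 1 / (1 - r) times the tokens.
ratio_round1 = 1.0 / (1.0 - token_reduction_round1)
ratio_round3 = 1.0 / (1.0 - token_reduction_round3)
print(f"round 1: baseline uses {ratio_round1:.2f}x the tokens")  # 1.53x
print(f"round 3: baseline uses {ratio_round3:.2f}x the tokens")  # 4.10x

# A speedup of s means latency falls to 1 / s of the baseline.
for s in speedups:
    print(f"{s}x speedup -> {100.0 / s:.1f}% of baseline latency")
```

Read this way, a 75.6% reduction means the text-based system spends roughly four times as many tokens by the third round, and the 2.4x speedup corresponds to a little over 40% of the baseline latency.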

Considerations for enterprise adoption

The substantial efficiency gains—encompassing reduced token consumption, lower GPU memory demands, and accelerated inference speeds—are poised to make complex, multi-step agent workflows feasible in production environments. This circumvents the prohibitive computational overhead that currently limits the widespread deployment of agentic systems in enterprise settings. The researchers have made the code and trained model weights publicly available under the Apache 2.0 license, facilitating broader adoption and development.

Business Style Takeaway: The development of RecursiveMAS represents a significant advancement in multi-agent AI efficiency, moving communication from costly text tokens to faster, internal latent spaces. This innovation is crucial for enterprises looking to scale complex AI workflows, promising reduced operational costs and enhanced performance in areas like automated customer service, complex data analysis, and sophisticated content generation.

Original article: venturebeat.com
