MiniMax Unveils M3 with Lightning-Fast Sparse Attention for 15.6X Longer Context Responses

Amidst the intense global competition among AI developers, Chinese AI firm MiniMax has garnered significant attention for its dedication to advancing frontier intelligence across text, coding, and video generation, notably through its Hailuo model series. What sets MiniMax apart is its commitment to providing these powerful capabilities often under permissive, enterprise-friendly open-source licenses.

The company is once again capturing the interest of AI practitioners and developers worldwide with the release of a detailed technical report on its M2 series of language models (M2, M2.5, and M2.7). This report illuminates numerous engineering innovations and ingenious strategies employed in their development. Simultaneously, MiniMax has teased its upcoming M3 series, which will feature a novel sparse attention mechanism. This new approach promises up to a 15.6-fold increase in decoding speed for extended contexts (up to one million tokens) by utilizing a custom sub-quadratic framework, thereby making ultra-long-context AI agent deployment economically feasible.

The M2 report is particularly valuable for enterprises engaged with AI models, especially those focused on in-house fine-tuning and training. When initially released, the M2 series models consistently achieved top-tier performance benchmarks among open-source AI solutions.

While subsequent models from other Chinese AI labs, such as DeepSeek and Xiaomi, have since surpassed these benchmarks, MiniMax’s latest report offers a foundational blueprint that can significantly enhance AI model and agent performance for organizations globally.

As Adina Yakup from Hugging Face commented on X, “Beyond the benchmarks, they’ve done some really solid work on MoE efficiency and agent oriented design. Excited to see where M3 goes next!”

The Attention Dilemma in Large Language Models

The core architecture of the M2 series is built upon a sparse Mixture-of-Experts (MoE) decoder-only Transformer design, a framework common to many leading large language models (LLMs). This foundational structure comprises a total of 229.9 billion parameters. However, it maintains an efficient operational footprint by activating only 9.8 billion parameters per token, distributed across 256 specialized experts.

To optimize the routing of information and mitigate common load-balancing issues, MiniMax implemented sigmoid gating combined with learnable, expert-specific bias terms. This approach significantly reduces the reliance on restrictive auxiliary loss functions typically used in such architectures.

A critical engineering decision highlighted in the M2 paper was the consistent use of full multi-head attention with Grouped Query Attention (GQA) across all 62 layers. In the context of LLMs, “quadratic scaling” refers to the computationally intensive nature of standard full attention mechanisms, where every token in a sequence must establish a mathematical connection with every other token. This process, while yielding comprehensive context, requires processing power and memory that escalates exponentially with the input sequence length, creating substantial hardware bottlenecks when processing lengthy documents.

Addressing the Trade-offs of Sub-Quadratic Scaling

“Sub-quadratic” scaling introduces architectural optimizations designed to circumvent this exponential computational burden. Instead of establishing all possible connections, these methods, such as Sliding Window Attention or compressed linear attention, may focus on localized segments of text or generate compressed summaries of broader content. While these efficient techniques drastically lower hardware costs and enable high-speed processing of large documents, they historically compromise accuracy. This often leads to AI models missing crucial “big picture” details or losing track of information distant within the text.

This inherent challenge defined the architectural progression from MiniMax’s M2 to its forthcoming M3 series. During the M2 development phase, researchers extensively evaluated sub-quadratic shortcuts but found they severely impaired the model’s “multi-hop reasoning”—its capability to connect disparate pieces of information across lengthy documents. This limitation necessitated the acceptance of the substantial computational overhead associated with full quadratic attention to preserve state-of-the-art intelligence.

The team conducted rigorous benchmarking of efficient attention alternatives during pre-training but ultimately discarded them. They experimented extensively with hybrid configurations, interleaving full attention with sub-quadratic architectures like Lightning Attention or hybrid Sliding Window Attention (SWA) setups. The empirical findings were clear: at scale, linear and windowed attention variants demonstrated significant reasoning deficits. On evaluations exceeding 32,000 context windows, SWA variants underperformed full attention, with scores dropping from a baseline of 90.0 to 72.0 on the RULER 128K complex word extraction task. Furthermore, sub-quadratic configurations were prone to memory constraints during training, lacked native prefix caching support, and did not integrate smoothly with Multi-Token Prediction (MTP) modules used for speculative decoding. Full attention was deemed indispensable for maintaining multi-hop reasoning capabilities.

Recognizing that physical hardware limitations preclude indefinite quadratic scaling, MiniMax is now designing the M3 series around an innovative sub-quadratic framework intended to deliver both high-speed processing and uncompromising reasoning integrity.

Introducing MiniMax Sparse Attention (MSA) and Sub-Quadratic Scaling

The forthcoming MiniMax-M3 model represents a significant departure from the computationally demanding constraints of its predecessor. As revealed by MiniMax’s engineering team under the teaser “Something BIG is coming,” the M3 series integrates “MiniMax Sparse Attention” (MSA). Unlike approaches such as DeepSeek’s Multi-head Latent Attention (MLA), which compresses keys and values into a lower-dimensional latent space, MSA operates on a standard Grouped Query Attention (GQA) backbone but employs block-level selection on actual, uncompressed Key-Value pairs.

Elie Bakouch at Prime Intellect, an AI training infrastructure and platform lab, noted on X that the key innovation involves “block level selection like in CSA but attention is done on the real KV, not in [compressed space].” This approach effectively resolves the precision loss and prefix-caching challenges identified in the M2 paper. By dynamically filtering and selecting block-level sequences, MSA achieves a significant architectural advancement. Preliminary hardware profiling indicates a 9.7x speedup in prefilling latency and a substantial 15.6x speedup during decoding at a one-million token sequence length compared to the full-attention M2 architecture.

Understanding the importance of the “decoding phase” speedup requires examining how AI models process information. When a user interacts with an AI, processing occurs in two distinct stages: prefilling and decoding. “Prefilling” is the initial phase where the AI processes the entire input prompt—whether brief or extensive—in parallel to build an initial understanding and establish context. The “decoding phase” is dedicated to generating the response. To predict each subsequent token (word or sub-word), the AI must consider the original prompt plus all previously generated tokens. This means the computational load increases progressively as the response grows longer, making the generation of the final tokens the most demanding part of the process.

For instance, consider the analogy of reading a complex legal document (prefilling) and then needing to write a summary. For every new word written in the summary, you must re-read the entire legal document along with everything previously written in the summary to ensure coherence. This backward-looking dependency makes the decoding phase the primary computational bottleneck in text generation, explaining why AI models often generate responses token by token and slow down considerably with longer interactions.

Therefore, the reported 15.6x speedup in the decoding phase at a one-million token sequence length signifies that the M3 model has discovered a structural optimization that enables it to generate its response token by token nearly 16 times faster. This directly addresses the critical bottleneck that typically causes AI chatbots to falter or freeze when processing vast amounts of information.

The Evolution of the MiniMax M Series and the “Forge” System

On a product level, MiniMax has consistently evolved its models from basic text generation tools into sophisticated autonomous agents. The M2 series introduced an “interleaved thinking” protocol, where the model strategically alternates between natural language planning and explicit tool invocations within a single operational sequence. Instead of omitting intermediate “chain-of-thought” steps between execution turns, M2 incorporates the complete thinking history directly into the conversation context. This persistent planning allows the model to recover from runtime errors more effectively and adapt its strategies based on feedback from its operational environment.

To facilitate the training of these long-horizon workflows, MiniMax developed “Forge,” a scalable reinforcement learning system specifically designed for agents. Forge architecturally separates execution into three distinct modules: the Agent Side, an abstraction middleware layer (comprising a Gateway Server and Data Pool), and the Training/Inference engines. As MiniMax engineer Olive Song explained on the ThursdAI podcast, the team focused on training smaller models with reinforcement learning across a vast number of environments and agents, a complex but rewarding endeavor. To manage the significant variance in trajectory lengths common in multi-step agent environments, Forge incorporates two crucial engineering solutions:

Windowed FIFO Scheduling: This training scheduler employs a sliding window over the generation queue. It enables high-throughput, greedy task fetching within the window to minimize cluster idle time while strictly adhering to First-In, First-Out (FIFO) principles to maintain distributional stability and prevent gradient oscillations.
Prefix Tree Merging: This optimization restructures batch training into a tree computation. Conversations sharing identical initial prefixes are processed concurrently in the forward pass before branching. This eliminates redundant calculations, potentially achieving up to a 40x speedup in training with no approximation error.

This reinforcement learning infrastructure was instrumental in the development of the M2.7 checkpoint, propelling the series towards “self-evolution.” Operating within an automated agent harness, M2.7 functions as an independent machine learning engineer. It monitors its own training runs, identifies anomalies, analyzes logs, and automatically modifies its codebase and configurations. MiniMax reports that M2.7 successfully managed between 30% and 50% of its own development workflow.

On OpenAI’s rigorous MLE Bench Lite suite, which assesses autonomous machine learning research capabilities, M2.7 achieved a 66.6% medal rate across independent 24-hour trials, effectively matching Google’s closed-weight Gemini 3.1 Pro.

The continuous iteration from M2 to M2.5—which notably completed 30% of internal tasks and 80% of newly committed code at MiniMax headquarters—underscores a broader strategic vision. The MiniMax team commented during that deployment phase, “we believe that M2.5 provides virtually limitless possibilities for the development and operation of agents in the economy.” With the recent technical report codifying the M2 generation’s achievements and the upcoming MSA tech blog, MiniMax is signaling its intent to push the boundaries of AI, focusing on translating minimal activation footprints into maximum real-world intelligence.

Business Style Takeaway: MiniMax’s advancements in efficient AI architectures, particularly their sparse attention mechanism for extended contexts, directly address critical bottlenecks in AI deployment. This innovation promises to make complex, long-form AI applications economically viable, opening new avenues for enterprise automation and intelligent agent development.

Learn more at : venturebeat.com

No votes yet.

Please wait...

The Attention Dilemma in Large Language Models

Addressing the Trade-offs of Sub-Quadratic Scaling

Introducing MiniMax Sparse Attention (MSA) and Sub-Quadratic Scaling

The Evolution of the MiniMax M Series and the “Forge” System

Leave a ReplyCancel Reply