On-device AI models, such as those running on smartphones, have historically been limited by the capacity of DRAM (Dynamic Random-Access Memory). That limit has kept practical parameter counts (parameters being the values that store what a model has learned) well below what server-side deployments achieve. As a result, enterprise architects assessing agent-based AI workloads have faced a trade-off: highly capable but cloud-dependent models, or less powerful on-device ones. Apple’s latest generation of foundation models, unveiled at WWDC26, aims to overcome this barrier by keeping the model’s full weight set in flash storage rather than DRAM.
Advancements in Apple’s Foundation Models
The new AFM 3 family, developed in collaboration with Google, comprises five distinct models. Two are designed for on-device operation, while three are server-based. All models operate within Apple’s secure Private Cloud Compute framework. The server-side models, including the powerful AFM 3 Cloud Pro, which is optimized for agentic tool use and complex reasoning tasks, leverage Nvidia GPUs hosted on Google Cloud infrastructure. The architecture for the on-device models, however, is entirely proprietary to Apple. Notably, AFM 3 Core Advanced, a 20-billion-parameter model, stores its weights in NAND flash memory instead of relying on DRAM.
Researchers from Apple explained that this architectural shift allows the “full model to be stored in flash memory” rather than being confined to DRAM. They further elaborated that since the bandwidth between NAND flash and DRAM is insufficient for the rapid, token-by-token weight swapping required by conventional Mixture of Experts (MoE) models, AFM 3 Core Advanced implements a strategy of making routing decisions on a per-prompt basis.
Decoding the Novel Architecture
The memory constraint that Apple is addressing is a well-known challenge for developers working with local AI processing. As Awni Hannun, a researcher at Anthropic and former Apple research scientist, noted on X, “You can’t put 20B parameters in RAM at any reasonable precision.” He further commented that Apple’s solution employs an “exotic architecture by today’s standards,” where a smaller, auxiliary model predicts which specific “experts” (sub-modules of the larger AI model) need to be loaded from NAND flash into RAM for a given query.
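Hannun’s point can be checked with back-of-the-envelope arithmetic. The sketch below estimates weight storage alone at a few bit widths and compares the full 20B pool with the 1B to 4B active range reported for AFM 3 Core Advanced. The bit widths are illustrative assumptions, since the article does not state the precision Apple uses, and the figures ignore activations, KV cache, and runtime overhead.

```python
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal GB (weights only)."""
    # 1e9 parameters * (bits / 8) bytes each = params_billions * bits / 8 GB
    return params_billions * bits_per_weight / 8


# Bit widths are assumptions for illustration, not Apple's published figures.
for bits in (16, 8, 4):
    full = weights_gb(20, bits)        # entire 20B pool kept in flash
    active_low = weights_gb(1, bits)   # 1B active parameters
    active_high = weights_gb(4, bits)  # 4B active parameters
    print(f"{bits:>2}-bit: full model {full:.1f} GB, "
          f"active set {active_low:.1f} to {active_high:.1f} GB")
```

Even at 4 bits per weight, the full pool needs about 10 GB for weights alone, while the active set needs roughly 0.5 to 2 GB.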
The prediction-and-load mechanism Hannun describes rests on three components, each tailored to the hardware realities of consumer-grade silicon:
- Full 20B Weight Set Resides in Flash, Not DRAM: AFM 3 Core Advanced retains its entire parameter set within NAND flash storage, bypassing the need to load everything into active memory. Unlike typical on-device AI deployments, where the entire model must fit within DRAM (hence the parameter count limitation), Apple’s approach designates flash as the model’s permanent repository. DRAM serves as a dynamic buffer, holding only the specific experts required for the current prompt. This strategy, termed Instruction-Following Pruning (IFP) by Apple researchers, treats flash as the primary storage and DRAM as a temporary workspace.
- Expert Routing Occurs Once Per Prompt, Not Per Token: In traditional MoE architectures, a routing mechanism selects different experts for each generated token. This constant selection necessitates continuous data transfer between flash and DRAM, a process that exceeds the capabilities of NAND-to-DRAM bandwidth at inference speeds. AFM 3 Core Advanced circumvents this by performing routing only once at the beginning of a prompt. A fixed set of experts is then selected and loaded into DRAM alongside perpetually active shared experts, allowing all subsequent tokens to be generated using this stable configuration. As Hannun observed, “The key distinction from a typical MoE is that you do this once per query and then generate all the tokens with the same experts.”
- Active Parameter Count Dynamically Scales from 1B to 4B: Instead of operating with a fixed model size for every task, AFM 3 Core Advanced dynamically adjusts the number of active parameters based on the complexity of the request. For simpler operations, it utilizes 1 billion parameters, scaling up to 4 billion for more demanding tasks, all drawn from the extensive 20-billion-parameter pool stored in flash memory.
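The difference between per-token and per-prompt routing can be illustrated with a toy simulation that counts how many experts must be pulled from flash. Everything in it is hypothetical: the expert count, the shared-expert set, the `toy_router` function (a deterministic stand-in for a learned router), and a DRAM cache that fits only the shared experts plus one routed set. It is a sketch of the idea, not Apple’s implementation.

```python
NUM_EXPERTS = 16                 # hypothetical size of the expert pool in flash
SHARED = frozenset({0, 1})       # hypothetical always-active shared experts
ROUTED = [e for e in range(NUM_EXPERTS) if e not in SHARED]


def toy_router(key: int, k: int) -> frozenset:
    """Deterministic stand-in for a learned router: pick k routed experts."""
    return frozenset(ROUTED[(key + i) % len(ROUTED)] for i in range(k))


class DramCache:
    """Holds shared experts plus the current routed set; counts flash loads."""

    def __init__(self) -> None:
        self.resident = set(SHARED)   # shared experts assumed preloaded
        self.flash_loads = 0

    def ensure(self, experts: frozenset) -> None:
        missing = experts - self.resident
        self.flash_loads += len(missing)
        # DRAM is assumed to fit only shared experts plus one routed set,
        # so anything else is evicted.
        self.resident = set(SHARED) | set(experts)


def loads_per_token_routing(prompt_key: int, num_tokens: int, k: int) -> int:
    """Conventional MoE style: the router may pick new experts for every token."""
    cache = DramCache()
    for t in range(num_tokens):
        cache.ensure(toy_router(prompt_key + 5 * t, k))
    return cache.flash_loads


def loads_per_prompt_routing(prompt_key: int, num_tokens: int, k: int) -> int:
    """Per-prompt style: route once, then reuse the same experts for all tokens."""
    cache = DramCache()
    experts = toy_router(prompt_key, k)
    for _ in range(num_tokens):
        cache.ensure(experts)   # no flash traffic after the first token
    return cache.flash_loads


print(loads_per_token_routing(prompt_key=3, num_tokens=100, k=4))   # 400
print(loads_per_prompt_routing(prompt_key=3, num_tokens=100, k=4))  # 4
```

The toy router is deliberately adversarial (consecutive tokens never share routed experts), so the gap is exaggerated. The point is only that per-prompt routing makes flash traffic independent of output length. Raising `k` loosely mirrors activating more parameters for harder requests.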
Undisclosed Details and Enterprise Implications
While Apple’s architectural paper provides substantial detail on the memory design and sparse activation mechanisms, it offers less clarity on practical deployment constraints crucial for enterprise adoption.
Marco Abis, who is developing Ziraph, a performance profiler for AI on Apple silicon, highlighted on X that Apple’s profiling tools lack critical metrics for production viability, such as energy consumption, memory bandwidth usage, and thermal performance. He noted, “A notable gap, given those decide most of on-device performance.”
Furthermore, Abis observed that Apple’s documentation does not specify when an on-device request might transparently offload to the cloud, nor whether this routing behavior is discernible to developers or end-users. For organizations in regulated industries that must meticulously document where data inference occurs, this lack of transparency presents a significant compliance challenge.
Apple has indicated that a comprehensive technical report, including detailed benchmarks, is slated for release later this summer, which is expected to address these outstanding questions.
Strategic Considerations for Enterprise Architects
For regulated industries evaluating agentic AI deployments, these on-device advances raise three architectural considerations:
- The DRAM Barrier for On-Device Agents Has Been Raised: Businesses considering AI agents that require continuous operation without reliance on cloud connectivity now have a viable 20-billion-parameter local option. The primary constraint shifts from the model’s inherent capability to the hardware specifications of the end-user device.
- The Private vs. Cloud Boundary Becomes an Architectural Choice: The AFM 3 family introduces a nuanced approach where simpler requests are processed locally on the device, while complex, agentic tasks are seamlessly routed to the AFM 3 Cloud Pro model operating within the Private Cloud Compute environment. The critical unknown for enterprises is Apple’s methodology for determining when a request is offloaded, and whether this process is visible, which directly impacts compliance and data governance policies.
- Server-Tier Agentic Capabilities Depend on Google Cloud: The server-side operations for AFM 3 Cloud Pro are powered by Nvidia GPUs hosted on Google Cloud. While Apple’s Private Cloud Compute framework ensures data privacy, the underlying infrastructure for server-side inference remains dependent on Google’s platform.
The AFM 3 Core Advanced model represents a significant leap, offering enterprises an on-device AI capability with 20 billion parameters that was previously unavailable. The ultimate scalability and practical deployability of this technology will hinge on the detailed performance metrics and operational transparency that Apple is expected to provide in its forthcoming technical report.
Business Style Takeaway: Apple’s new AFM 3 foundation models redefine the capabilities of on-device AI by moving model weights to flash memory, significantly expanding parameter counts without direct DRAM dependence. This innovation offers enterprises a compelling alternative for agentic workloads demanding both performance and local processing, potentially impacting the design of secure, privacy-focused AI applications and shifting the competitive landscape for edge AI solutions.
Information compiled from materials: venturebeat.com
