AI Agents Enter Rebuild Era Tackling Enterprise Reliability

As enterprise-grade AI agents transition from development to production, organizations are encountering a critical reliability challenge. The success of these agents in real-world applications hinges not just on the performance of large language models (LLMs), but also on their ability to handle complex, long-running workflows. This necessitates robust mechanisms for surviving system crashes, preserving operational state, recovering from failures, managing inference costs, and orchestrating interactions across various APIs, tools, and enterprise systems.

Preeti Somal, Senior Vice President of Engineering at Temporal Technologies, highlighted during a recent AI Impact Series event in New York that after an initial phase focused on rapid deployment, many companies are now compelled to re-evaluate and re-architect their early agent implementations. This redesign prioritizes workflow orchestration, system observability, governance, and resilient recovery protocols.

“We see a significant number of customers approaching us for what they term version 2.0 of an agent they’ve already built,” Somal stated. “Their initial imperative was speed, often at the expense of foundational infrastructure. When these systems inevitably falter, they find themselves needing to rebuild on a more dependable base.”

Temporal, a workflow orchestration company whose infrastructure predates the current wave of agentic AI, observes that this trend reflects a broader industry realization: production AI systems demand durable execution, sophisticated state management, transparent workflow visibility, and effective recovery strategies for when models or dependent systems experience disruptions.

Agentic AI Intensifies Existing Engineering Challenges

“These are not entirely new engineering patterns,” Somal remarked. “Artificial intelligence simply amplifies their complexity.”

Agentic systems add complexity because they run long, multi-stage processes that interface with numerous services, models, APIs, and external tools. A single agent workflow might involve multiple LLM calls, queries to retrieval-augmented generation (RAG) systems, triggers for external applications, and state tracking that can span hours or even days. Somal noted that the fundamental engineering questions surrounding reliability often surface only after deployment.

“Developers create agents without fully considering the implications of a system crash,” she explained. “They haven’t planned for whether the entire agent process will need to be re-executed from scratch.”

For enterprises operating under stringent budget controls, the distinction is significant. Rerunning workflows after failures can dramatically increase inference expenditures, extend latency, and ultimately degrade the customer experience.

Somal drew a parallel to an earlier period in enterprise cloud adoption, when many organizations rushed to migrate existing workloads to the cloud without fundamentally redesigning their architectures, leading to unforeseen cost overruns and suboptimal performance.

“This haste to implement AI without modernizing the underlying application architecture is reminiscent of the initial ‘lift-and-shift’ approach to cloud migration,” she observed. “Many companies discovered they were incurring higher cloud costs without realizing the expected value.”

Why Extended Agent Lifecycles Necessitate New Architectures

Enterprise workflows increasingly involve agents that run for extended periods, sometimes many hours, while actively interacting with various tools and systems. Reliability becomes harder to guarantee the longer a workflow runs, and the difficulty affects both state management and contextual memory, two concepts often conflated in AI discussions.

State management pertains to the progress of a workflow: its current stage, which actions have been successfully completed, and the precise point from which recovery should resume following an interruption. Memory, or context, refers to the information an agent retains and utilizes across successive interactions or tasks.

“The state of an agent is defined by its current step and completed actions, dictating where recovery should initiate after a failure. Context and memory, conversely, are about the information carried forward,” Somal elaborated.
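The distinction can be made concrete with a minimal Python sketch. It is an illustration only, not Temporal's API: the class names `WorkflowState` and `AgentMemory`, the step names, and the stored text are all hypothetical. State answers "where do we resume?", while memory holds the information the agent carries forward.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WorkflowState:
    """Progress of the workflow: what is done and where to resume (hypothetical)."""
    steps: List[str]
    completed: List[str] = field(default_factory=list)

    def mark_done(self, step: str) -> None:
        self.completed.append(step)

    def resume_point(self) -> Optional[str]:
        """Return the first step not yet completed, or None if all are done."""
        for step in self.steps:
            if step not in self.completed:
                return step
        return None


@dataclass
class AgentMemory:
    """Information carried forward between steps (hypothetical)."""
    facts: Dict[str, str] = field(default_factory=dict)


state = WorkflowState(steps=["transcribe", "summarize", "report"])
memory = AgentMemory()

# Finishing a step updates both, but for different reasons.
state.mark_done("transcribe")                           # progress
memory.facts["transcript"] = "placeholder transcript"   # context

print(state.resume_point())
```

If the process crashed at this point, recovery would consult the state to restart at the summarization step and would consult memory for the transcript that step needs.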

This distinction becomes critically important as enterprises move beyond simple conversational interfaces toward more complex, long-running business processes. Somal cited the example of Abridge, a healthcare technology company. Abridge utilizes complex workflows to process physician visits through multiple stages, including audio transcription, summarization, LLM inference, and the generation of post-visit summaries.

“A single workflow involves many components,” Somal stated. “It includes processing audio and video segments, generating summaries, making calls to LLMs, and producing the final after-visit report. All of these steps require sophisticated orchestration.”

The implication for enterprises is clear: the successful deployment of advanced agents increasingly relies on systems capable of withstanding interruptions, coordinating across distributed services, and maintaining operational continuity over extended durations.

The Emergence of the Deterministic Spine

Somal advocates for the adoption of a “deterministic spine” framework for enterprise AI design, which aligns with Temporal’s operational philosophy. This approach defines a predictable execution path.

“It establishes the intended sequence of operations,” she explained. “It initiates a call to the AI model—the ‘brain’—and if the response is delayed or fails, it will retry. Even if the initial response is received but the subsequent step encounters an error, the system can resume execution from that specific failure point.”

Within this paradigm, the LLM functions as a probabilistic component, capable of producing variable outputs, while the orchestration software ensures the reliability and determinism of the overall execution flow. This concept is vital because enterprise systems require unwavering consistency, even when the underlying AI models remain inherently non-deterministic. Critical business processes, such as procurement, healthcare reporting, customer support escalations, or compliance procedures, cannot afford to fail silently due to a model timeout or an external system dependency issue.
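A rough sketch of the idea, under stated assumptions: the code below is plain Python and does not use Temporal's SDK. `FlakyModel`, `run_spine`, and the step names are hypothetical, and the `results` dictionary stands in for state that a real system would persist durably. The spine fixes the order of steps, retries a failing model call, and skips any step that has already completed.

```python
class FlakyModel:
    """Hypothetical stand-in for an LLM client that times out a few times."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining_failures = failures_before_success
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TimeoutError("model timed out")
        return f"draft for: {prompt}"


def run_spine(steps, results, max_attempts=3):
    """Run steps in a fixed order, retrying failures and skipping finished steps.

    `results` plays the role of durable state: a step whose name is already
    present in it is never executed again.
    """
    for name, action in steps:
        if name in results:
            continue  # completed in an earlier run; resume past it
        for attempt in range(1, max_attempts + 1):
            try:
                results[name] = action(results)
                break
            except TimeoutError:
                if attempt == max_attempts:
                    raise  # give up, but earlier results stay intact
    return results


model = FlakyModel(failures_before_success=2)
steps = [
    ("draft", lambda r: model("purchase order for 10 laptops")),
    ("submit", lambda r: f"submitted: {r['draft']}"),
]

results = run_spine(steps, {})
calls_after_first_run = model.calls

# Running the spine again with the same results re-executes nothing.
run_spine(steps, results)
print(results["submit"], model.calls)
```

The model's output may vary from call to call, but the sequence of steps, the retry policy, and the skip-if-done rule do not, which is the sense in which the spine is deterministic.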

“The paramount concern is ensuring recoverability and avoiding unnecessary token expenditure when errors occur,” Somal emphasized.

Reliability, Visibility, and the Economics of Token Consumption

As business leaders assess the return on investment for AI initiatives, the transparency of associated costs has become a significant concern. Long-running agents, often involved in intricate workflows with multiple model calls, can lead to opaque spending patterns. Somal highlighted that a key operational benefit of robust orchestration is enhanced visibility into cost accumulation points. By making workflows observable at each step, development teams can precisely track token consumption across the entire agent process.

“You gain a comprehensive view of the entire workflow through a single interface,” she stated. “This allows you to pinpoint exactly where tokens are being consumed within an agent that interacts with multiple systems and executes numerous steps.”
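As an illustration of step-level visibility, the sketch below keeps a per-step token ledger. It is an assumption-laden example rather than any vendor's interface: `TokenLedger`, the step names, and the token counts are invented for the demonstration.

```python
from collections import defaultdict


class TokenLedger:
    """Hypothetical per-step token counter for one agent workflow."""

    def __init__(self) -> None:
        self.by_step = defaultdict(int)

    def record(self, step: str, prompt_tokens: int, completion_tokens: int) -> None:
        """Add the tokens used by one model call to the step's running total."""
        self.by_step[step] += prompt_tokens + completion_tokens

    def total(self) -> int:
        return sum(self.by_step.values())

    def top_consumer(self) -> str:
        """Return the step that has consumed the most tokens so far."""
        return max(self.by_step, key=self.by_step.get)


ledger = TokenLedger()
ledger.record("retrieve_context", 900, 0)
ledger.record("summarize", 4200, 600)
ledger.record("draft_report", 1500, 1100)
ledger.record("summarize", 300, 50)  # a retry adds to the same step

print(ledger.total())         # 8650
print(ledger.top_consumer())  # summarize
```

Recording usage at the step boundary, rather than once per workflow, is what makes it possible to see which part of a multi-system agent is driving spend.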

Furthermore, workflow recovery mechanisms directly influence cost efficiency. Without durable orchestration, a failure late in a complex process can force organizations to restart the entire sequence from the beginning, incurring redundant model call costs. Systems designed with recovery capabilities, Somal explained, can resume execution precisely from the point of interruption.

“Execution resumes exactly where the failure occurred,” she noted. “This saves significant costs that would otherwise be incurred by rerunning the agent from the initial step.”
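The saving can be shown with simple arithmetic. The per-step token counts below are hypothetical, and the sketch assumes the failed step consumes its full token budget when it is finally rerun in either scenario.

```python
def tokens_to_finish(step_tokens, failed_index, resume):
    """Tokens spent after a failure in order to complete the workflow.

    With resume=True, work restarts at the failed step; otherwise every
    step is rerun from the beginning.
    """
    start = failed_index if resume else 0
    return sum(step_tokens[start:])


# Hypothetical token cost of each of four steps; the last one fails.
step_tokens = [1200, 3500, 800, 2500]
failed_index = 3

restart_cost = tokens_to_finish(step_tokens, failed_index, resume=False)
resume_cost = tokens_to_finish(step_tokens, failed_index, resume=True)

print(restart_cost)                # 8000
print(resume_cost)                 # 2500
print(restart_cost - resume_cost)  # 5500 tokens not spent twice
```

The later the failure occurs in the workflow, the larger the gap between the two figures.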

Enterprises Need to Establish Standardized Frameworks and Leverage Partner Expertise

Governance is emerging as another critical consideration as agentic AI gains traction. Somal suggests that rather than adopting fully managed, off-the-shelf agent systems, enterprises are increasingly inclined to develop standardized internal frameworks. These frameworks provide necessary guardrails while maintaining operational flexibility, incorporating essential features such as governance controls, model selection policies, identity management, cost oversight, and comprehensive observability.

“Organizations are focused on building these ‘paved paths’,” Somal commented. “A generic, ready-made solution may not suffice due to the diverse and specific requirements of their operations.”

As companies revisit their initial AI agent deployments, they increasingly treat reliability, cost, and governance challenges as systems engineering problems rather than purely model-centric ones. Temporal is well positioned to assist enterprises in this phase, partly because its platform was already integrated into many organizations’ broader modernization programs before AI became a primary strategic focus.

“Temporal is already an established presence within the enterprise,” Somal concluded. “Extending its capabilities to support AI and agent platforms feels like a very natural progression.”

Business takeaway: The shift from rapid AI deployment to production-level reliability marks a strategic pivot. To manage costs, ensure uptime, and achieve sustainable ROI from agentic AI systems, companies must prioritize robust orchestration and recovery mechanisms rather than focusing only on LLM performance.

Based on materials from: venturebeat.com
