
Our system was designed with a singular, highly effective function: translating natural language queries into executable API calls.
The intended users were business analysts, account managers, and operations leads. These professionals understood the data they required but faced the cumbersome task of manually aggregating it from disparate sources: four dashboards, two business intelligence tools, and a Salesforce report builder. Our solution simplified this process by letting users state their requests in plain English. For instance, a query like, “Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city,” was translated into a structured API call:
    {
      "description": "User requested sales volume for the given date range, here is the API call to get the response",
      "api_call": "/api/sales_volume",
      "post_body": {
        "start_date": "2026-01-01",
        "end_date": "2026-03-31",
        "region": "northeast"
      }
    }
The remainder of the operational pipeline used conventional engineering practices. The system routed the API call to the appropriate backend service, integrating with internal reporting portals, Salesforce, and proprietary systems. It then applied a JSON query generated by a large language model (LLM) to refine and shape the response, and delivered the final output by email, as a document in cloud storage, or as a chart rendered in a web browser.
By mid-2025, this system was instrumental in generating several hundred reports monthly. These reports were vital for leadership and analysts and were frequently shared with external stakeholders, establishing the system as the de facto standard for ad-hoc data retrieval across most teams.
The interface between the LLM and the rest of the system was defined by a structured JSON object, as illustrated in the example above.
We initially built this solution using Claude Sonnet 3.5 in early 2025. Subsequent upgrades to versions 3.7 and 4.0 proceeded without any issues. By the time Sonnet 4.5 was released, we had developed a considerable degree of confidence in the stability and predictability of LLMs for what we perceived as a straightforward problem. Model updates had become a routine part of our development cycle, akin to minor version increments of a stable software library.
However, after deploying Sonnet 4.5, we saw a significant percentage of requests processed incorrectly. We traced the errors to two critical failure modes:
Failure Mode 1: Incomplete API Payloads
The model began writing the content of the `post_body` field into the `description` field and returning `post_body` empty. Our system treated `post_body` as the definitive source for the request payload, so the essential filter parameters, such as date ranges and region, never reached the intended API, and the call proceeded without them. Depending on the endpoint, the backend either returned aggregate data for all time or all regions, or failed outright with a 500 Internal Server Error.
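One way to keep an empty `post_body` from reaching a backend is a guard that runs before dispatch. The sketch below is a minimal illustration, not the code our system ran: the `REQUIRED_FILTERS` table, the `IncompletePayloadError` class, and the `validate_payload` helper are hypothetical, and the required fields for `/api/sales_volume` are assumed from the example request.

```python
# Hypothetical per-endpoint table of filters that must be present before dispatch.
# The field names come from the example request; the table itself is an assumption.
REQUIRED_FILTERS = {
    "/api/sales_volume": {"start_date", "end_date", "region"},
}


class IncompletePayloadError(ValueError):
    """Raised when a model response lacks filters the endpoint needs."""


def validate_payload(response: dict) -> dict:
    """Return post_body if it carries every required filter, else raise."""
    endpoint = response.get("api_call")
    body = response.get("post_body")
    if not isinstance(body, dict):
        raise IncompletePayloadError(f"post_body missing or not an object for {endpoint}")
    required = REQUIRED_FILTERS.get(endpoint, set())
    # Treat absent, null, and empty-string values as missing.
    missing = sorted(f for f in required if body.get(f) in (None, ""))
    if missing:
        raise IncompletePayloadError(f"{endpoint} is missing filters: {missing}")
    return body
```

Failing loudly at this point turns a silent all-time, all-region query into an error that can be logged and retried.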
Failure Mode 2: Unanticipated Model Responses
A more novel issue emerged as the model started posing clarifying questions within its responses. Earlier versions consistently attempted to fulfill ambiguous requests with a best-effort structured output. Sonnet 4.5, exhibiting a more cautious behavior, sometimes responded with a question instead. Our system lacked any mechanism to handle such conversational responses. It was architected under the strict assumption that every model invocation would yield a definitive API call. There was no provision for human intervention (a human-in-the-loop component) nor any state management to handle partially completed requests. This fundamental incompatibility caused cascading failures across downstream systems.
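A parsing layer could separate a usable API call from a conversational reply, so that a question is routed somewhere deliberate instead of falling through. The following is a sketch under assumptions: the `ParsedCall` and `NeedsClarification` types and the `classify_model_output` function are hypothetical names introduced for illustration, and our system at the time had no such path.

```python
import json
from dataclasses import dataclass
from typing import Union

EXPECTED_KEYS = {"description", "api_call", "post_body"}


@dataclass
class ParsedCall:
    api_call: str
    post_body: dict
    description: str


@dataclass
class NeedsClarification:
    raw_text: str  # the model's reply, to be shown to the user or logged


def classify_model_output(raw: str) -> Union[ParsedCall, NeedsClarification]:
    """Return a ParsedCall only when the reply is the expected JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Free text, such as a clarifying question, is not valid JSON.
        return NeedsClarification(raw_text=raw)
    if not isinstance(data, dict) or not EXPECTED_KEYS.issubset(data):
        return NeedsClarification(raw_text=raw)
    if not isinstance(data["api_call"], str) or not isinstance(data["post_body"], dict):
        return NeedsClarification(raw_text=raw)
    return ParsedCall(data["api_call"], data["post_body"], str(data["description"]))
```

This sketch only catches replies that break the JSON shape. A question placed inside a well-formed `description` field would still need a separate semantic check.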
We initiated a rollback to version 4.0, a process that proved more complex than anticipated. Between the deployment of 4.0 and the subsequent 4.5 release, our team had integrated new APIs, all of which had been qualified and tested against the behavior of 4.5. Reverting the LLM version necessitated re-qualifying every one of these new integrations against 4.0 under significant time pressure.
Why Traditional Engineering Discipline Falls Short in LLM Systems
Core tenets of software engineering rely on the ability to predictably bound the impact of any given change. When updating a driver or a library, engineers typically consult release notes for potential breaking changes and rely on unit tests to define the scope of what might be affected. This methodology is predicated on the assumption that the system undergoing modification is sufficiently deterministic, allowing for predictable behavior or at least dense sampling to build confidence. The “blast radius”—the potential scope of negative impact—is inherently contained by design.
LLM-integrated systems break this assumption. The component that generates the system’s output is no longer under direct developer control. An upgrade from Sonnet 4.0 to 4.5 cannot be assessed with a code diff; it is a wholesale replacement of the core functionality on which the system depends.
This leads to what we term an infinite blast radius: a scenario where the downstream effects of a change are impossible to enumerate prospectively due to the inherently unbounded nature of both the input space (natural language) and the potential failure modes (any unexpected behavior exhibited by the model).
Anatomy of the Failure Incident
A post-mortem analysis revealed that our initial prompt engineering was critically under-specified. We had instructed the model to produce a JSON object with three distinct fields and described the purpose of each. However, we failed to explicitly mandate that the `description` field must be a natural-language string and must not contain serialized representations of other data fields.
Earlier iterations of the model had inferred this constraint implicitly from the surrounding context. Sonnet 4.5, apparently inclined to be more “helpful” in its formatting choices, resolved the ambiguity differently: it embedded the request body within the description or asked a clarifying question. From the model’s perspective, each was a reasonable reading of an under-specified instruction. Both, however, violated the foundational assumptions on which our system was architected.
The root cause was not an inherent flaw in the model itself but rather our flawed assumption that the model would continue to bridge the specification gaps as it had in previous versions. Three successful upgrades had fostered a sense of complacency, leading us to believe these implicit assumptions were secure.
Structured output modes and tool-use APIs could potentially have flagged this specific failure at the schema level. We did not use them, for reasons beyond the scope of this discussion. Even so, schemas primarily constrain syntax, not semantics. A schema cannot dictate that a clarifying question must not appear in a system incapable of handling it, or that a date range should never default to all time without explicit instruction. Schemas address the more straightforward part of the problem.
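To make the syntax-versus-semantics distinction concrete, the sketch below runs a structural check and a semantic check against the same response. Both helper functions are hypothetical and hand-rolled for illustration; they do not represent any particular schema library. The response has three fields of the right types, yet it carries a question in `description` and an empty `post_body`.

```python
def passes_shape_check(resp: dict) -> bool:
    """Structural check: the three fields exist and have the expected types."""
    return (
        isinstance(resp.get("description"), str)
        and isinstance(resp.get("api_call"), str)
        and isinstance(resp.get("post_body"), dict)
    )


def passes_semantic_check(resp: dict) -> bool:
    """Semantic check: no question in the description, and the date filters are set."""
    asks_question = resp["description"].rstrip().endswith("?")
    body = resp["post_body"]
    has_date_range = bool(body.get("start_date")) and bool(body.get("end_date"))
    return not asks_question and has_date_range


response = {
    "description": "Which quarter did you mean?",
    "api_call": "/api/sales_volume",
    "post_body": {},
}

shape_ok = passes_shape_check(response)        # True: the structure is valid
semantic_ok = passes_semantic_check(response)  # False: it is not a usable call
```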
The Evals-First Architecture as a Solution
The engineering discipline required to bridge this gap is to treat the evaluation suite, rather than the prompt, as the definitive specification of the system. The prompt becomes an implementation detail, and the model serves as the interpreter. Any change to the model or the prompt is valid only if it passes every relevant evaluation.
In practice, an evaluation consists of a three-part structure: an input, a property that the output must satisfy, and a scoring function. For our system, an evaluation designed to catch the regression observed with version 4.5 would resemble the following Python code snippet:
    def test_description_contains_no_serialized_payload(response):
        desc = response["description"].lower()
        forbidden = ["curl", "post_body", "{", "http://", "https://"]
        assert not any(token in desc for token in forbidden), (
            f"description leaked structured content: {response['description']}"
        )
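The three-part structure can also be written down as a small data type. This is a sketch under assumptions: `EvalCase`, `run_case`, and the stubbed `fake_model` are hypothetical names, and a binary pass/fail scorer is only one possible scoring function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    user_input: str                          # part 1: the input
    check: Callable[[dict], bool]            # part 2: the property the output must satisfy
    score: Callable[[bool], float] = float   # part 3: scoring; float(True) is 1.0, float(False) is 0.0


def run_case(case: EvalCase, model: Callable[[str], dict]) -> float:
    """Call the model on the case's input and score the property result."""
    response = model(case.user_input)
    return case.score(case.check(response))


def fake_model(prompt: str) -> dict:
    # Stand-in for a real model call, so the sketch runs offline.
    return {
        "description": "Sales volume for the Northeast, January through March 2026",
        "api_call": "/api/sales_volume",
        "post_body": {"start_date": "2026-01-01", "end_date": "2026-03-31", "region": "northeast"},
    }


case = EvalCase(
    name="post_body_has_region",
    user_input="Sales volume for January through March 2026 for the Northeast region",
    check=lambda resp: resp["post_body"].get("region") == "northeast",
)
result = run_case(case, fake_model)
```

Keeping the scorer separate from the property leaves room to swap in a graded score, for example from an LLM-as-judge, without rewriting the case.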
A comprehensive suite comprising several hundred such properties—some meticulously crafted to address known critical invariants, others automatically generated as regression tests from production traffic, and still others scored by an LLM-as-judge for more nuanced qualities like tone—establishes a robust quality gate. Model upgrades and prompt modifications should be managed as pull requests that must achieve a “green” status across the entire evaluation suite before being merged.
Building and maintaining such evaluation suites is resource-intensive. They require ongoing updates as the product evolves, and the use of LLM-as-judge scoring introduces its own inherent variance. Furthermore, an evaluation suite can only identify failure modes that have been explicitly anticipated and specified; it cannot guarantee safety against unforeseen categories of failure. This was a critical lesson learned: no one on our team had ever formulated an assertion like “the description field must not contain a curl command” because the possibility of the model embedding one had never been considered.
Evaluations are not a panacea. However, they provide the essential capability to bound the blast radius of changes in systems where the underlying processing logic is a black box. This is achieved by densely sampling the input-output responses that are critical to the system’s function and by implementing deployment gates that prevent changes when this behavior deviates.
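A deployment gate of this kind might look like the following sketch. Because model output can vary between calls, it samples each check several times and compares the pass rate to a threshold instead of relying on a single run. The function name `gate_release`, the `samples` and `min_pass_rate` parameters, and their defaults are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

# A check pairs an input prompt with a property the response must satisfy.
Check = Tuple[str, Callable[[dict], bool]]


def gate_release(
    model: Callable[[str], dict],
    checks: Dict[str, Check],
    samples: int = 5,
    min_pass_rate: float = 1.0,
) -> Tuple[bool, List[str]]:
    """Return (ok_to_deploy, names of checks whose pass rate fell below the threshold)."""
    failing = []
    for name, (prompt, prop) in checks.items():
        passes = 0
        for _ in range(samples):
            try:
                passes += bool(prop(model(prompt)))
            except Exception:
                pass  # a crash in the model call or the property counts as a failure
        if passes / samples < min_pass_rate:
            failing.append(name)
    return (not failing, failing)
```

A model upgrade or prompt change would be merged only when the first element of the returned tuple is true. The list of failing check names gives the reviewer a starting point when it is not.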
The Path Forward
The engineering community is still in the nascent stages of developing comprehensive methodologies for constructing effective evaluation suites. Widely accepted standards for defining “coverage” within natural language input spaces are lacking. Current CI/CD systems were not designed to manage the gating of probabilistic test outcomes. As AI agents assume increasingly autonomous responsibilities—writing code, executing financial transactions, orchestrating infrastructure changes—the disparity between passing basic smoke tests and possessing genuine confidence in a system’s production behavior will become the paramount engineering challenge of the coming years.
Organizations that successfully navigate this challenge will be those that transition from treating evaluations as a post-deployment quality assurance step to recognizing them as the fundamental specification defining their systems’ true operational parameters.
Vijay Sagar Gullapalli is Founding AI Engineer at Adopt AI and a USPTO-patented inventor.
Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.
Business Style Takeaway: The integration of LLMs into business processes introduces an “infinite blast radius” due to the unpredictable nature of model outputs, challenging traditional software engineering paradigms. Businesses must shift from viewing prompts as specifications to treating comprehensive evaluation suites as the authoritative definition of system behavior, ensuring robustness and predictability in AI-driven operations.
Details can be found on the website: venturebeat.com
