<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Dwar-AI Blog | Blog</title><description/><link>https://blog.dwar-ai.org/</link><language>en</language><item><title>Building Reliable Software on Top of Unreliable LLMs: A Deep Dive into Non-Determinism</title><link>https://blog.dwar-ai.org/b/blog-llm-nondeterminism/</link><guid isPermaLink="true">https://blog.dwar-ai.org/b/blog-llm-nondeterminism/</guid><pubDate>Sun, 24 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This post draws from real engineering experience building a production CI pipeline that uses LLMs as a core processing component inside an automated, incremental analysis system. The pipeline runs on every commit, makes structural decisions using LLMs, synthesizes structured output, and publishes incremental diffs to object storage. Everything in this post comes from actual incidents, debugging sessions, and architectural decisions made while fighting non-determinism at production scale.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;#1-what-is-llm-non-determinism&quot;&gt;What Is LLM Non-Determinism?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#2-why-this-matters-more-than-you-think&quot;&gt;Why This Matters More Than You Think&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#3-types-of-software-that-are-affected&quot;&gt;Types of Software That Are Affected&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#4-types-of-software-that-are-not-affected&quot;&gt;Types of Software That Are NOT Affected&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#5-a-real-world-case-study-an-llm-powered-ci-analysis-pipeline&quot;&gt;A Real-World Case Study: An LLM-Powered CI Analysis Pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#6-the-taxonomy-of-non-determinism-in-llm-pipelines&quot;&gt;The Taxonomy of Non-Determinism in LLM Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#7-how-non-determinism-cascades&quot;&gt;How Non-Determinism Cascades&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#8-real-incidents-and-their-root-causes&quot;&gt;Real Incidents and Their Root Causes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#9-the-core-design-philosophy-code-owns-identity&quot;&gt;The Core Design Philosophy: Code Owns Identity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#10-mitigation-strategies-in-depth&quot;&gt;Mitigation Strategies in Depth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#11-operational-patterns-for-llm-reliability&quot;&gt;Operational Patterns for LLM Reliability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#12-architecture-patterns-for-deterministic-llm-pipelines&quot;&gt;Architecture Patterns for Deterministic LLM Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#13-what-cannot-be-fixed&quot;&gt;What Cannot Be Fixed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#14-summary-and-checklist&quot;&gt;Summary and Checklist&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;1-what-is-llm-non-determinism&quot;&gt;1. What Is LLM Non-Determinism?&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;A deterministic system, given the same inputs, always produces the same outputs. A Postgres query, a sorting algorithm, a hash function — these are deterministic. You can test them, cache them, replay them. You can build on them.&lt;/p&gt;
&lt;p&gt;LLMs are not deterministic. Even with &lt;code dir=&quot;auto&quot;&gt;temperature=0&lt;/code&gt; and a fixed model version, an LLM may produce subtly different outputs across calls with identical prompts. Some of this is by design (sampling at non-zero temperature), some is a side effect of model serving infrastructure (request batching, parallel floating-point arithmetic, hardware differences), and some is intrinsic to the probabilistic nature of the underlying architecture.&lt;/p&gt;
&lt;p&gt;This is well-known in the context of user-facing chatbots. What is less well-understood is what it means when you embed an LLM as a component inside a production software pipeline — one that is expected to behave like any other service: stable, cacheable, testable, incrementally correct.&lt;/p&gt;
&lt;p&gt;The challenge is not just that the LLM might say something different. The challenge is that &lt;strong&gt;different prose produces different downstream state&lt;/strong&gt;, and that downstream state is what your pipeline acts on.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;the-deceptive-simplicity-of-just-call-the-llm&quot;&gt;The Deceptive Simplicity of “Just Call the LLM”&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;When you first embed an LLM call into a pipeline, it feels simple:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;flowchart LR&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Input --&gt; LLM[&quot;[LLM]&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;LLM --&gt; Output&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;But production pipelines look more like this:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;flowchart TD&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Input --&gt; A[&quot;LLM decision A&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;A --&gt; IS[&quot;intermediate state&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;IS --&gt; CL[&quot;Cache lookup using state&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;CL --&gt; B[&quot;LLM decision B\n(conditioned on A)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;B --&gt; ID[&quot;Identity derived from B&apos;s output&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;ID --&gt; PA[&quot;Persisted artifact keyed by identity&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;PA --&gt; IDD[&quot;Incremental diff against previous artifact&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;IDD --&gt; C[&quot;LLM decision C\n(conditioned on diff)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;C --&gt; FO[&quot;Final output&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Each LLM call in this chain is a source of variance. And variance at step A propagates through every downstream step. By the time you reach the final output, you may have no idea whether the result changed because the real world changed, or because the LLM phrased something differently.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;2-why-this-matters-more-than-you-think&quot;&gt;2. Why This Matters More Than You Think&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;non-determinism-breaks-incremental-systems&quot;&gt;Non-Determinism Breaks Incremental Systems&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Most production pipelines are not full-recompute systems. They are incremental: they track what changed, what was processed, what is stale. Git diffs, database change streams, event queues — incremental processing is how large systems stay affordable.&lt;/p&gt;
&lt;p&gt;LLM non-determinism attacks the most critical assumption of any incremental system: &lt;strong&gt;that unchanged inputs produce unchanged outputs&lt;/strong&gt;. When this assumption breaks, you get:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;False positives: “nothing changed but the system thinks something did”&lt;/li&gt;
&lt;li&gt;False negatives: “something changed but the system did not notice”&lt;/li&gt;
&lt;li&gt;Cascading re-computation that defeats the whole purpose of incrementalism&lt;/li&gt;
&lt;li&gt;Runaway costs in pay-per-call LLM APIs&lt;/li&gt;
&lt;li&gt;Continuous deployment loops triggered by phantom changes&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;&lt;h3 id=&quot;non-determinism-is-silent&quot;&gt;Non-Determinism Is Silent&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Most bugs make something crash or produce obviously wrong output. Non-determinism usually produces output that looks correct. The analysis looks fine. The test passes. The CI is green. But on the next run, with identical inputs, you get slightly different output — and that tiny difference causes downstream components to do unnecessary work.&lt;/p&gt;
&lt;p&gt;This kind of bug is extremely hard to find in code review. No single-run test reliably catches it, because any test that exercises the LLM inherits its non-determinism and may pass on one run while the drift only appears on the next.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;non-determinism-at-scale-is-expensive&quot;&gt;Non-Determinism at Scale Is Expensive&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;When a non-deterministic component sits inside a pipeline that runs on every commit, the cost compounds fast. Thirty unnecessary LLM calls per CI run, at $0.003-$0.006 per call, is $0.09-$0.18 per run. Across hundreds of commits per month, across multiple repositories, this is real money, and that is before you count the engineering time spent debugging “why did this change?”&lt;/p&gt;
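&lt;p&gt;To make the arithmetic concrete, here is a rough sketch. The calls per run and per-call prices are the figures quoted above; &lt;code dir=&quot;auto&quot;&gt;commits_per_month&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;repositories&lt;/code&gt; are illustrative assumptions, not measurements from the pipeline.&lt;/p&gt;

```python
# Back-of-the-envelope cost of unnecessary LLM calls.
# calls_per_run and the price range come from the text above;
# commits_per_month and repositories are assumed for illustration.

def wasted_cost(calls_per_run, price_per_call, runs):
    """Return the dollar cost of redundant calls across a number of CI runs."""
    return calls_per_run * price_per_call * runs

calls_per_run = 30
low_price, high_price = 0.003, 0.006
commits_per_month = 200  # assumption
repositories = 5         # assumption
runs = commits_per_month * repositories

per_run_low = wasted_cost(calls_per_run, low_price, 1)
per_run_high = wasted_cost(calls_per_run, high_price, 1)
monthly_low = wasted_cost(calls_per_run, low_price, runs)
monthly_high = wasted_cost(calls_per_run, high_price, runs)

print(f"per run:   ${per_run_low:.2f} to ${per_run_high:.2f}")
print(f"per month: ${monthly_low:.2f} to ${monthly_high:.2f}")
```

&lt;p&gt;Under these assumed volumes the waste lands between roughly $90 and $180 per month; substitute your own commit and repository counts to get a figure that applies to you.&lt;/p&gt;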
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;3-types-of-software-that-are-affected&quot;&gt;3. Types of Software That Are Affected&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Non-determinism is a concern whenever an LLM is used as more than a one-shot question-answerer. Specifically, it bites hardest in:&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;31-automated-analysis-and-report-generation&quot;&gt;3.1 Automated Analysis and Report Generation&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Systems that automatically generate and maintain structured reports from source code or other structured inputs are particularly vulnerable. Reports have identity: a report named &lt;code dir=&quot;auto&quot;&gt;variant-a3f8.md&lt;/code&gt; is expected to remain &lt;code dir=&quot;auto&quot;&gt;variant-a3f8.md&lt;/code&gt; across runs unless something actually changed. If the LLM renames an internal concept from “error path” to “failure branch,” and the report ID is derived from that prose, you now have a stale old report and a spurious new one.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;32-cicd-pipelines-with-llm-gates&quot;&gt;3.2 CI/CD Pipelines With LLM Gates&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Any CI/CD pipeline that uses an LLM to make decisions (code review, test generation, diff analysis, deployment approvals) is exposed. A non-deterministic approval generates a non-deterministic pipeline state, which can cause flaky builds, unnecessary rollbacks, or missed deployments.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;33-code-analysis-and-intelligence-tools&quot;&gt;3.3 Code Analysis and Intelligence Tools&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Tools that statically analyze code and use LLMs to resolve ambiguity (e.g., which concrete implementation is called through an interface) must cache their decisions. If the same ambiguity is re-resolved on every run and the answer varies slightly, downstream graph structures change, and the entire analysis becomes unstable.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;34-data-labeling-and-classification-pipelines&quot;&gt;3.4 Data Labeling and Classification Pipelines&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Pipelines that use LLMs to classify or label data incrementally face the problem that re-classification of unchanged data can silently flip labels. If downstream models are trained or evaluated on these labels, model performance becomes non-reproducible.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;35-agentic-and-multi-step-llm-workflows&quot;&gt;3.5 Agentic and Multi-Step LLM Workflows&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Agents that take actions across multiple steps — web browsing, tool use, code execution — amplify non-determinism geometrically. A different choice in step 2 produces a different context for step 3, which produces different tool calls in step 4. The final state of a multi-step agent workflow is highly sensitive to early variation.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;36-rag-retrieval-augmented-generation-systems&quot;&gt;3.6 RAG (Retrieval-Augmented Generation) Systems&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;RAG systems retrieve documents and synthesize answers. The retrieved documents depend on an embedding match, and the synthesis depends on which documents were retrieved. If the retrieval order or content changes slightly, the synthesized answer changes. If the system stores those answers as canonical truths, you have silent drift.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;4-types-of-software-that-are-not-affected&quot;&gt;4. Types of Software That Are NOT Affected&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Non-determinism is a real and costly problem — but it is not a universal problem. Many categories of software use LLMs heavily and are genuinely unaffected by output variance. Understanding why they are immune helps clarify what actually makes the affected cases hard.&lt;/p&gt;
&lt;p&gt;The underlying pattern: &lt;strong&gt;non-determinism only matters when the LLM’s output is used as input to a system that has memory, identity, or state&lt;/strong&gt;. If the output is ephemeral and consumed immediately without being stored, compared, or built upon, variance is harmless.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;41-one-shot-question-answering&quot;&gt;4.1 One-Shot Question Answering&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;A user asks a question, the LLM answers, the answer is shown to the user, and the conversation ends. There is no downstream state that persists. The next time the same question is asked, no system is comparing the new answer to the old one. The user might notice the answer is different, but no pipeline breaks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; customer support chatbots, developer assistants, knowledge base Q&amp;#x26;A, search-augmented chat.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it is safe:&lt;/strong&gt; The answer is the final product. It is not a key, an identifier, a cache lookup, or the input to another automated system. Variance in prose is the expected behavior of a conversational interface.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;42-content-generation-for-human-review&quot;&gt;4.2 Content Generation for Human Review&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Systems that use LLMs to draft content that a human then reviews, edits, and approves before publication are effectively insulated from non-determinism. The human is the reconciliation layer. They see the draft, decide if it is good enough, and take responsibility for the output.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; marketing copy generators, email drafting tools, blog post drafters, PR description generators, changelog summarizers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it is safe:&lt;/strong&gt; The human review step converts probabilistic output into a deliberate, deterministic human decision. The system never has to compare two LLM-generated drafts and decide whether they represent the same intent.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;43-creative-and-generative-applications&quot;&gt;4.3 Creative and Generative Applications&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Applications where the explicit design goal is variety — where users want different outputs each time — are not just unaffected by non-determinism, they depend on it. Controlled randomness is a feature, not a bug.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; story generators, image prompt expanders, brainstorming tools, game dialogue systems, name generators, creative writing assistants.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it is safe:&lt;/strong&gt; There is no baseline to drift from. Each generation is independent and its quality is judged on its own merits, not compared to previous generations.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;44-search-and-ranking-result-presentation-only&quot;&gt;4.4 Search and Ranking (Result Presentation Only)&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Systems that use an LLM to summarize or rerank search results for a user, without storing those summaries or feeding them into further automated processing, are effectively stateless. Each search is a fresh query. The user gets an answer and moves on.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; AI-augmented search interfaces, document summarization for search results, relevance explanation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it is safe:&lt;/strong&gt; The LLM output is presentation only. It is never written to a database, compared against a previous version, or used as a key in any system.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;45-translation-and-transcription-display-only&quot;&gt;4.5 Translation and Transcription (Display Only)&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Translation and transcription systems that display output to a user but do not store it in a structured system are low-risk. Small variations in phrasing across translations are acceptable — they are expected in natural language.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; real-time captioning, document translation for reading, meeting transcription.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it is safe:&lt;/strong&gt; Translation variance is a known property of natural language that users and downstream humans tolerate. The output is not parsed, hashed, or used as an identifier.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;46-classification-with-human-in-the-loop-correction&quot;&gt;4.6 Classification With Human-in-the-Loop Correction&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Pipelines where LLM classification feeds into a human review queue — where every output is reviewed before being acted upon — have the same insulation as content generation for human review. The human catches inconsistencies.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; content moderation with human review, medical coding with clinician review, legal document classification with attorney review.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it is safe:&lt;/strong&gt; The human step is the system’s source of truth. The LLM is a tool to reduce human workload, not an authoritative decision maker.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;47-summarization-of-streaming-data-no-state-persistence&quot;&gt;4.7 Summarization of Streaming Data (No State Persistence)&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Real-time summarization of logs, events, or telemetry that is displayed in a dashboard and discarded does not accumulate state. Each window of data produces a fresh summary. There is no previous summary to compare against.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt; log anomaly narration, real-time event stream summaries, live dashboard insights.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it is safe:&lt;/strong&gt; The summary is a transient view of current state, not a persisted record. No downstream system depends on the summary being identical across repeated calls.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h3 id=&quot;the-dividing-line-does-output-become-state&quot;&gt;The Dividing Line: Does Output Become State?&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;The clearest way to assess risk is to ask a single question:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;flowchart TD&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Q[&quot;Does the LLM&apos;s output become state that another system\nwill read, compare, or build upon in a future operation?&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Q --&gt;|YES| Y[&quot;Non-determinism is a design concern.&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Q --&gt;|NO| N[&quot;Non-determinism is probably fine.&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;“State” here includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Files written to disk or object storage&lt;/li&gt;
&lt;li&gt;Database records&lt;/li&gt;
&lt;li&gt;Cache entries, and cache keys derived from LLM output&lt;/li&gt;
&lt;li&gt;Identifiers or IDs derived from LLM output&lt;/li&gt;
&lt;li&gt;Flags or decisions that trigger downstream automation&lt;/li&gt;
&lt;li&gt;Embeddings stored for later retrieval and comparison&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the LLM output is consumed immediately by a human or displayed ephemerally, you are in the safe zone. If it is persisted anywhere and later compared to new LLM output to determine “what changed,” you are exposed.&lt;/p&gt;
&lt;p&gt;A useful heuristic: &lt;strong&gt;if you could run the same LLM call twice and the system would behave differently on the second call based on what the first call produced, non-determinism is a risk you need to manage.&lt;/strong&gt;&lt;/p&gt;
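&lt;p&gt;One way to apply this heuristic is a small probe that runs the same step twice on identical input and compares what each run would persist. The sketch below is hypothetical: &lt;code dir=&quot;auto&quot;&gt;make_flaky_llm&lt;/code&gt; is a stand-in that cycles through paraphrases, and the two step functions are illustrative, not part of any real pipeline.&lt;/p&gt;

```python
import itertools

def make_flaky_llm():
    """Stand-in for an LLM call that cycles through paraphrases of one answer."""
    phrasings = itertools.cycle([
        "when the message fails to unmarshal",
        "on unmarshal error",
    ])

    def call(prompt):
        return next(phrasings)

    return call

llm = make_flaky_llm()

def prose_keyed_step(inputs):
    """Persists one record per input, keyed by the LLM's wording."""
    return {f"{name}:{llm(name)}" for name in inputs}

def code_keyed_step(inputs):
    """Persists one record per input, keyed only by the input itself."""
    return set(inputs)

def is_stable(step, inputs):
    """True if two runs over identical inputs would persist identical keys."""
    first = step(inputs)
    second = step(inputs)
    return first == second

inputs = ["entrypoint.Handle"]
print(is_stable(prose_keyed_step, inputs))  # False: keys depend on phrasing
print(is_stable(code_keyed_step, inputs))   # True: keys depend only on input
```

&lt;p&gt;A real LLM will not alternate this predictably, so a probe like this can only demonstrate instability, never prove stability; several repetitions give more signal than two.&lt;/p&gt;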
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;5-a-real-world-case-study-an-llm-powered-ci-analysis-pipeline&quot;&gt;5. A Real-World Case Study: An LLM-Powered CI Analysis Pipeline&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;The system that motivates this post is a production CI pipeline for a large Go codebase. Its job is to automatically generate and maintain structured analysis reports for every entry point in the codebase — incrementally, on every commit. The pipeline combines static analysis with LLM-driven inference and persists its output to object storage, diffing against the previous run to determine what needs to be regenerated.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;51-what-the-pipeline-does&quot;&gt;5.1 What the Pipeline Does&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;flowchart TD&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;GR[&quot;Git Repo\n(Go source)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;CGB[&quot;Call Graph Builder&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;DR[&quot;Dispatch Resolver\n(CHA + LLM)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;ATB[&quot;Analysis Tree Builder&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;STORE[&quot;Artifact Store\nPrevious Snapshot&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;DIFF[&quot;Diff&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;PATCH[&quot;Patcher\nNew / Modified / Deleted&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;LLM1[&quot;IdentifyVariants\n(Pass 1)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;LLM2[&quot;SynthesizeOverview\n(Pass 2)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;LLM3[&quot;SynthesizeReport\n(Pass 3)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;PUB[&quot;Artifact Store\n(Publish)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;GR --&gt; CGB --&gt; DR --&gt; ATB&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;ATB --&gt; DIFF&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;STORE --&gt; DIFF&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;DIFF --&gt; PATCH --&gt; LLM1 --&gt; LLM2 --&gt; LLM3 --&gt; PUB&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;52-where-llm-calls-happen&quot;&gt;5.2 Where LLM Calls Happen&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;There are three distinct places where LLM calls are made:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Dispatch Resolution (Pass 1 + Pass 2)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Go uses interfaces extensively. When a method is called through an interface, there may be multiple concrete implementations. The pipeline uses Class Hierarchy Analysis (CHA) to enumerate candidates, then asks an LLM to resolve which concrete type is actually invoked in a given context.&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;Interface call: (BatchMessageProcessor).ProcessBatch&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;Candidates:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;  &lt;/span&gt;&lt;/span&gt;&lt;span&gt;- (*BatchHandler).ProcessBatch&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;  &lt;/span&gt;&lt;/span&gt;&lt;span&gt;- recoveryDecorator.ProcessBatch&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;  &lt;/span&gt;&lt;/span&gt;&lt;span&gt;- metricsDecorator.ProcessBatch&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;  &lt;/span&gt;&lt;/span&gt;&lt;span&gt;- loggingDecorator.ProcessBatch&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;LLM Decision → (*BatchHandler).ProcessBatch [confidence: high]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;2. Variant Identification (Pass 1)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For each entry point, the LLM is asked to identify distinct behavioral variants — “what are the different execution paths through this code?” It returns a list of variants with titles, descriptions, and the set of participating code nodes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Report Synthesis (Pass 2 + 3)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For each variant, the LLM synthesizes a human-readable overview and a detailed structured report.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;6-the-taxonomy-of-non-determinism-in-llm-pipelines&quot;&gt;6. The Taxonomy of Non-Determinism in LLM Pipelines&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;After building and operating this pipeline, the non-determinism we encountered falls into six distinct categories. Understanding the category is the first step to fighting it.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;61-prose-drift&quot;&gt;6.1 Prose Drift&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; The LLM uses different wording to describe the same concept across runs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example from our pipeline:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run 1: &lt;code dir=&quot;auto&quot;&gt;variant_condition: &quot;when the message fails to unmarshal&quot;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Run 2: &lt;code dir=&quot;auto&quot;&gt;variant_condition: &quot;on unmarshal error&quot;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Run 3: &lt;code dir=&quot;auto&quot;&gt;variant_condition: &quot;if JSON deserialization fails&quot;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All three describe the same code branch. But if the artifact identity is derived from &lt;code dir=&quot;auto&quot;&gt;variant_condition&lt;/code&gt;, each run produces a different artifact ID. The old report becomes stale (but is not deleted), and a near-duplicate new report is created.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Blast radius:&lt;/strong&gt; Medium — cosmetically annoying, but also causes real cost from variant re-synthesis.&lt;/p&gt;
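&lt;p&gt;A minimal sketch of the difference between a prose-derived ID and one derived from code-level facts. The function names and node names here are hypothetical, and the sketch assumes the participating node set has already been checked against the call graph; it illustrates the idea rather than the pipeline’s actual ID scheme.&lt;/p&gt;

```python
import hashlib

def prose_id(variant):
    """Fragile: the ID changes whenever the LLM rephrases the condition."""
    text = variant["variant_condition"]
    return hashlib.sha256(text.encode()).hexdigest()[:8]

def structural_id(entry_point, variant):
    """Sturdier: the ID depends only on the entry point and the node set."""
    nodes = sorted(set(variant["participating_nodes"]))
    payload = "\n".join([entry_point] + nodes)
    return hashlib.sha256(payload.encode()).hexdigest()[:8]

# Two runs describing the same branch with different wording (node names assumed).
run1 = {
    "variant_condition": "when the message fails to unmarshal",
    "participating_nodes": ["consumer.Handle", "codec.Unmarshal"],
}
run2 = {
    "variant_condition": "on unmarshal error",
    "participating_nodes": ["codec.Unmarshal", "consumer.Handle"],
}

same_prose_id = prose_id(run1) == prose_id(run2)
same_structural_id = (
    structural_id("consumer.Handle", run1) == structural_id("consumer.Handle", run2)
)
print(same_prose_id)       # False: wording differs
print(same_structural_id)  # True: same entry point, same nodes
```

&lt;p&gt;The prose still varies between runs, but it no longer decides which artifact the report belongs to.&lt;/p&gt;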
&lt;div&gt;&lt;h3 id=&quot;62-structural-drift&quot;&gt;6.2 Structural Drift&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; The LLM returns structurally different output across runs — different number of variants, different groupings, different field presence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run 1 returns 4 variants for an entry point&lt;/li&gt;
&lt;li&gt;Run 2 returns 3 variants (the LLM merged two)&lt;/li&gt;
&lt;li&gt;Run 3 returns 5 variants (the LLM split one further)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Blast radius:&lt;/strong&gt; High — variant count changes trigger full re-synthesis for the affected entry point.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;63-ordering-drift&quot;&gt;6.3 Ordering Drift&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; The LLM returns the same information but in a different order. If the pipeline derives any kind of index from position, this causes false positives.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;// Run 1:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;participating_nodes&quot;&lt;/span&gt;&lt;span&gt;: [&lt;/span&gt;&lt;span&gt;&quot;service.A&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;service.B&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;service.C&quot;&lt;/span&gt;&lt;span&gt;]}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;// Run 2:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;participating_nodes&quot;&lt;/span&gt;&lt;span&gt;: [&lt;/span&gt;&lt;span&gt;&quot;service.C&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;service.A&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;service.B&quot;&lt;/span&gt;&lt;span&gt;]}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;If the artifact ID is computed as &lt;code dir=&quot;auto&quot;&gt;hash(participating_nodes)&lt;/code&gt; without sorting, these hash to different values.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Blast radius:&lt;/strong&gt; Low if the pipeline normalizes order; high if it doesn’t.&lt;/p&gt;
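&lt;p&gt;The effect of normalizing order before hashing can be sketched as follows. The helper names are illustrative, and SHA-256 over a JSON encoding is an assumed choice; any stable hash over a canonical encoding would behave the same way.&lt;/p&gt;

```python
import hashlib
import json

def naive_hash(nodes):
    """Hash the list exactly as returned; sensitive to element order."""
    return hashlib.sha256(json.dumps(nodes).encode()).hexdigest()

def normalized_hash(nodes):
    """Hash a sorted, de-duplicated copy; insensitive to element order."""
    canonical = json.dumps(sorted(set(nodes)), separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

run1 = ["service.A", "service.B", "service.C"]
run2 = ["service.C", "service.A", "service.B"]

print(naive_hash(run1) == naive_hash(run2))            # False
print(normalized_hash(run1) == normalized_hash(run2))  # True
```

&lt;p&gt;Sorting is only safe when order carries no meaning. If the list represents a call sequence, order is part of the content and should be fixed by code rather than taken from the LLM.&lt;/p&gt;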
&lt;div&gt;&lt;h3 id=&quot;64-candidate-resolution-drift&quot;&gt;6.4 Candidate Resolution Drift&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; For the same interface call with the same candidate set, the LLM selects a different concrete implementation on different runs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run 1: resolves &lt;code dir=&quot;auto&quot;&gt;(BatchMessageProcessor).ProcessBatch&lt;/code&gt; → &lt;code dir=&quot;auto&quot;&gt;(*payments/handler.Handler).ProcessBatch&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Run 2: resolves the same call → &lt;code dir=&quot;auto&quot;&gt;(*inventory/handler.BatchHandler).ProcessBatch&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even if the LLM is “usually right,” inconsistency here changes the call graph structure, which changes the analysis tree hash, which marks entry points as modified even when no code changed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Blast radius:&lt;/strong&gt; Very high — affects the entire downstream pipeline.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;65-format-drift&quot;&gt;6.5 Format Drift&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; The LLM returns output in a subtly different format that the parser handles differently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example from our pipeline:&lt;/strong&gt; For a dispatch call involving &lt;code dir=&quot;auto&quot;&gt;(hash.Hash).Sum&lt;/code&gt;, the LLM consistently returned &lt;code dir=&quot;auto&quot;&gt;crypto/sha256.(*digest).Sum&lt;/code&gt;, the pre-Go 1.24 internal type name. After a Go toolchain upgrade, the actual FQN changed to &lt;code dir=&quot;auto&quot;&gt;(*crypto/internal/fips140/sha256.Digest).Sum&lt;/code&gt;. The LLM kept returning the old name, so its output failed validation on every run. Because the failure was silent, the site was retried on every CI run and never cached.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Blast radius:&lt;/strong&gt; Medium — causes persistent cache misses for specific sites, cumulative cost.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;66-external-non-determinism-call-graph-layer&quot;&gt;6.6 External Non-Determinism (Call Graph Layer)&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; Non-determinism that appears to come from the LLM but actually originates in upstream infrastructure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; The static call graph builder uses the Go type system to enumerate candidates. Some FQNs (particularly for stdlib internals, platform-specific types, and generic instantiations) can be generated differently depending on the Go toolchain version, build tags, or even iteration order over maps in the compiler. If the candidate list changes between runs, the dispatch cache key changes, the persisted entry is not reused, and the LLM is re-invoked.&lt;/p&gt;
&lt;p&gt;This is a case where the non-determinism appears in the LLM layer but is actually caused by the calling layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Blast radius:&lt;/strong&gt; High and deceptive — it looks like the LLM is behaving differently, but the LLM is actually consistent. The problem is the key that addresses its cache.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;7-how-non-determinism-cascades&quot;&gt;7. How Non-Determinism Cascades&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;The most important thing to understand about LLM non-determinism is that it doesn’t stay local. It cascades.&lt;/p&gt;
&lt;p&gt;Here is a concrete cascade from our pipeline:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;flowchart TD&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;S1[&quot;Step 1 — Dispatch resolution\nSame LLM target on both runs, but phrased differently on run 2\n→ cache key doesn&apos;t match → cache miss&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;S2[&quot;Step 2 — Call graph construction\nTarget A included in both runs (same result)\nbut map iteration order differs\n→ computeAnalysisTreeHash produces different hash&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;S3[&quot;Step 3 — Diff against previous analysis tree\nHash differs → entry point marked Modified\n→ IdentifyVariants called (1 LLM call)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;S4[&quot;Step 4 — Variant identification\nSame 4 variants on both runs\nbut different variant_condition text\n→ 4 old IDs gone, 4 new IDs appear&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;S5[&quot;Step 5 — Report synthesis\n4 new variant reports synthesized (4 LLM calls)\n4 old reports marked stale&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;S6[&quot;Step 6 — Publish\n4 new files uploaded, 4 old deprecated\nManifest updated&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;S7[&quot;Step 7 — PR creation\nBot opens PR: 14 entry points modified\nPR merged → next CI run → same drift → infinite loop&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;S1 --&gt; S2 --&gt; S3 --&gt; S4 --&gt; S5 --&gt; S6 --&gt; S7&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;This is not a hypothetical. It happened in production.&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;flowchart TD&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;A[&quot;LLM Prose Drift&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;B[&quot;Different variant_condition text&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;C[&quot;Different variant ID\n(hash of text)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;D[&quot;Old variant treated as deleted&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;E[&quot;New variant treated as new&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;F[&quot;Re-synthesis cost&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;G[&quot;Bot PR opened&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;H[&quot;Merge bot PR&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;I[&quot;Trigger next CI run&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;J[&quot;Same drift on next run&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;A --&gt; B --&gt; C --&gt; D --&gt; E --&gt; F --&gt; G --&gt; H --&gt; I --&gt; J --&gt; A&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;8-real-incidents-and-their-root-causes&quot;&gt;8. Real Incidents and Their Root Causes&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;81-incident-14-entry-points-flagged-modified-with-no-code-change&quot;&gt;8.1 Incident: 14 Entry Points Flagged Modified With No Code Change&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Observed:&lt;/strong&gt; A CI run reported &lt;code dir=&quot;auto&quot;&gt;modified: 14, unchanged: 24&lt;/code&gt; even though the target repository had no new source file changes between this run and the previous one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Investigation:&lt;/strong&gt; The pipeline does not use &lt;code dir=&quot;auto&quot;&gt;git diff&lt;/code&gt; to decide &lt;code dir=&quot;auto&quot;&gt;Modified&lt;/code&gt; vs &lt;code dir=&quot;auto&quot;&gt;Unchanged&lt;/code&gt;. Instead, it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Loads the previous analysis snapshot from the artifact store&lt;/li&gt;
&lt;li&gt;Rebuilds fresh analysis trees from the current source&lt;/li&gt;
&lt;li&gt;Compares analysis tree hashes&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When the dispatch cache was not fully reused across runs (due to validation failures we’ll discuss in a moment), some interface calls were re-resolved. Even though the LLM chose the same targets, the rebuilt call graphs had slightly different internal structure, causing analysis tree hash mismatches.&lt;/p&gt;
&lt;p&gt;The critical insight: &lt;strong&gt;the pipeline concluded “modified” from analysis output drift, not from source-code drift.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; Dispatch cache reuse failures caused partial call graph non-determinism, which caused analysis tree hash drift, which caused false-positive &lt;code dir=&quot;auto&quot;&gt;Modified&lt;/code&gt; classifications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; ~14 entry points × (~3 variants × 2 synthesis calls) ≈ &lt;strong&gt;84 unnecessary LLM calls per run&lt;/strong&gt;, plus one overview call per entry point (≈ 98 in total).&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h3 id=&quot;82-incident-15-false-positive-re-syntheses-from-one-line-change&quot;&gt;8.2 Incident: 15 False-Positive Re-Syntheses From One-Line Change&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Observed:&lt;/strong&gt; A one-line string literal change in a single handler function (&lt;code dir=&quot;auto&quot;&gt;&quot;unmarshaling message&quot;&lt;/code&gt; → &lt;code dir=&quot;auto&quot;&gt;&quot;unmarshaling inventory v2 message&quot;&lt;/code&gt;) triggered re-synthesis for 15 entry points. Only 1 was genuinely affected; the other 14 were false positives.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The system is supposed to be incremental:&lt;/strong&gt; it has a change detection layer that computes a &lt;code dir=&quot;auto&quot;&gt;CanonicalSourceHash&lt;/code&gt; for each function (AST text, comments stripped) and only regenerates reports for variants whose &lt;code dir=&quot;auto&quot;&gt;participating_nodes&lt;/code&gt; intersect the changed nodes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Root cause: The dispatch cache key is context-free.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The dispatch key was:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;func&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;dispatchKey&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;invokedMethodFQN&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;candidates&lt;/span&gt;&lt;span&gt; []&lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt;) &lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;sorted &lt;/span&gt;&lt;span&gt;:=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;append&lt;/span&gt;&lt;span&gt;([]&lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;nil&lt;/span&gt;&lt;span&gt;), candidates&lt;/span&gt;&lt;span&gt;...&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;sort.&lt;/span&gt;&lt;span&gt;Strings&lt;/span&gt;&lt;span&gt;(sorted)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; invokedMethodFQN &lt;/span&gt;&lt;span&gt;+&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;::&quot;&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;+&lt;/span&gt;&lt;span&gt; strings.&lt;/span&gt;&lt;span&gt;Join&lt;/span&gt;&lt;span&gt;(sorted, &lt;/span&gt;&lt;span&gt;&quot;|&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;This key contains the interface method and the sorted candidate set, but &lt;strong&gt;not the caller&lt;/strong&gt;. This means every call site in the entire codebase that invokes the same interface method against the same candidates shares one cache entry.&lt;/p&gt;
&lt;p&gt;The shared infrastructure pattern made this catastrophic:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;internal/consumer/v2/consumer_pull.go&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;│&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;│  c.processor.ProcessBatch(ctx, batch)   ← one call site&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;│&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;├── inventory/consumer/start_batch.go     ← wires BatchHandler&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;├── inventory/consumer/start_v2.go        ← wires HandlerV2&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;├── accounts/consumer/start_main.go       ← wires AccountsHandler&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;├── accounts/consumer/start_secondary.go&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;├── payments/consumer/start_primary.go&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;└── payments/consumer/start_events.go&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Six entry points, six different concrete types, one shared call site, one shared dispatch resolution. When &lt;code dir=&quot;auto&quot;&gt;BatchHandler.ProcessBatch&lt;/code&gt; changed, its &lt;code dir=&quot;auto&quot;&gt;CanonicalSourceHash&lt;/code&gt; changed, and the changed node appeared in every entry point that included the shared call site — all six entry points and their variants.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Expected result:&lt;/strong&gt; &lt;code dir=&quot;auto&quot;&gt;new:0, modified:1, deleted:0, unchanged:37&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Actual result:&lt;/strong&gt; &lt;code dir=&quot;auto&quot;&gt;new:0, modified:15, deleted:0, unchanged:23&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Extra LLM calls:&lt;/strong&gt; ~84 unnecessary calls&lt;/p&gt;
&lt;p&gt;The fix: include the caller FQN in the dispatch key, so each call site gets its own resolution scoped to its context.&lt;/p&gt;
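&lt;p&gt;A sketch of what a caller-scoped key could look like, extending the &lt;code dir=&quot;auto&quot;&gt;dispatchKey&lt;/code&gt; function shown earlier. The name &lt;code dir=&quot;auto&quot;&gt;dispatchKeyWithCaller&lt;/code&gt;, the key layout, and the caller FQNs in &lt;code dir=&quot;auto&quot;&gt;main&lt;/code&gt; are hypothetical and chosen only for illustration:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// dispatchKeyWithCaller prefixes the original key with the caller FQN so that
// call sites in different callers no longer share one cache entry.
// The name and the "::" layout are assumptions for this sketch.
func dispatchKeyWithCaller(callerFQN, invokedMethodFQN string, candidates []string) string {
	// Sort a copy so candidate order never influences the key.
	sorted := append([]string(nil), candidates...)
	sort.Strings(sorted)
	return callerFQN + "::" + invokedMethodFQN + "::" + strings.Join(sorted, "|")
}

func main() {
	method := "(BatchMessageProcessor).ProcessBatch"
	candidates := []string{
		"(*payments/handler.Handler).ProcessBatch",
		"(*inventory/handler.BatchHandler).ProcessBatch",
	}

	// Hypothetical caller FQNs: same method, same candidates, different context.
	a := dispatchKeyWithCaller("example/callerA.Run", method, candidates)
	b := dispatchKeyWithCaller("example/callerB.Run", method, candidates)
	fmt.Println(a != b) // true
}
```

&lt;p&gt;The trade-off is a larger key space: resolutions that used to be shared are now stored once per caller, which costs some extra first-run LLM calls in exchange for tighter invalidation.&lt;/p&gt;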
&lt;hr&gt;
&lt;div&gt;&lt;h3 id=&quot;83-incident-30-dispatch-sites-missing-cache-on-every-run&quot;&gt;8.3 Incident: 30 Dispatch Sites Missing Cache on Every Run&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Observed:&lt;/strong&gt; Two consecutive CI runs on &lt;code dir=&quot;auto&quot;&gt;main&lt;/code&gt; (no code changes between them) showed identical cache miss patterns:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;dispatch resolver: collected ambiguities  sites=174  unique_keys=174&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;dispatch resolver: reused persisted cache  persisted_reused=144&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;dispatch resolver: pass 1 complete         pass1_calls=30&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;30 sites miss the cache on every single run, resulting in 30 unnecessary LLM calls per commit. This creates a permanent cost floor that the cache, as implemented, cannot reduce.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Root cause split into two categories:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Category 1 (~2 sites):&lt;/strong&gt; Parse failure — result never written to cache.&lt;/p&gt;
&lt;p&gt;For &lt;code dir=&quot;auto&quot;&gt;(hash.Hash).Sum&lt;/code&gt;, the LLM consistently returned &lt;code dir=&quot;auto&quot;&gt;crypto/sha256.(*digest).Sum&lt;/code&gt;, the pre-Go 1.24 internal type name. After a Go toolchain upgrade, the actual FQN changed to &lt;code dir=&quot;auto&quot;&gt;(*crypto/internal/fips140/sha256.Digest).Sum&lt;/code&gt;. The LLM did not know about this change. Its response failed validation, was never written to cache, and the same call was repeated on every subsequent run with the same result.&lt;/p&gt;
&lt;p&gt;Permanent loop:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;Run N:   LLM call → parse failure → not written to cache&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;Run N+1: LLM call → parse failure → not written to cache&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;Run N+2: ...&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Category 2 (~28 sites):&lt;/strong&gt; Written to cache, but fails validation on next read.&lt;/p&gt;
&lt;p&gt;Sites would parse successfully and be written to the persisted cache. But the following run would still miss them. Investigation pointed to two sub-causes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Key mismatch:&lt;/strong&gt; The &lt;code dir=&quot;auto&quot;&gt;constructorHash&lt;/code&gt; or &lt;code dir=&quot;auto&quot;&gt;candidatesHash&lt;/code&gt; component of the cache key was computed differently on read vs write (due to non-deterministic candidate enumeration for certain types).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graph lookup failure:&lt;/strong&gt; A selected FQN in the persisted entry was absent from the current call graph’s &lt;code dir=&quot;auto&quot;&gt;byFQN&lt;/code&gt; map — meaning the FQN format changed subtly between toolchain invocations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Blast radius beyond cost:&lt;/strong&gt; Because ~30 sites resolved differently on each run, 3 entry points were classified as &lt;code dir=&quot;auto&quot;&gt;Modified&lt;/code&gt; on every CI run, triggering a bot PR on every commit merge — including merges of the bot PR itself. &lt;strong&gt;Infinite bot PR loop.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;9-the-core-design-philosophy-code-owns-identity&quot;&gt;9. The Core Design Philosophy: Code Owns Identity&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Every incident above has a common thread: &lt;strong&gt;the system let LLM-generated prose determine the identity of artifacts&lt;/strong&gt;. The fix, in every case, is the same:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;LLM output may propose candidates, but code-derived anchors must own identity.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is the single most important architectural principle for building reliable software on top of LLMs.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;91-the-identity-principle&quot;&gt;9.1 The Identity Principle&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;flowchart LR&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;subgraph LLM_OWNS[&quot;LLM OWNS&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;direction TB&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;L1[&quot;Prose quality&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;L2[&quot;Title text&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;L3[&quot;Description wording&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;L4[&quot;Condition phrasing&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;L5[&quot;Grouping hints&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;L6[&quot;Explanation of concepts&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;end&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;subgraph CODE_OWNS[&quot;CODE OWNS&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;direction TB&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;C1[&quot;Artifact identity&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;C2[&quot;Cache keys&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;C3[&quot;Analysis tree hashes&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;C4[&quot;Variant IDs&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;C5[&quot;Dispatch cache keys&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;C6[&quot;Stale/fresh classification&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;C7[&quot;Deduplication logic&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;end&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;92-stable-identity-anchors&quot;&gt;9.2 Stable Identity Anchors&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;In our pipeline, we identified the following code-derived anchors that remain stable across LLM runs:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Artifact&lt;/th&gt;&lt;th&gt;LLM-Derived (unstable)&lt;/th&gt;&lt;th&gt;Code-Derived (stable)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Entry point ID&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;module_name&lt;/code&gt; (LLM-named)&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;hash(entry_point_fqn + entry_kind + trigger)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Variant ID&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;hash(variant_condition_text)&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;hash(entry_id + owner_fqn + branch_fqn + sorted_participating_nodes)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Dispatch cache key&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;(method, candidates)&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;(caller_fqn + method + sorted_candidates + wiring_hash + resolver_version)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Analysis tree hash&lt;/td&gt;&lt;td&gt;Not applicable&lt;/td&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;sha256(canonical JSON of tree structure and node source hashes)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;div&gt;&lt;h3 id=&quot;93-the-reconciliation-pattern&quot;&gt;9.3 The Reconciliation Pattern&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;The key insight for variant identity: you don’t need the LLM to produce consistent IDs. You need to reconcile LLM output against code-derived anchors before any identity-sensitive operation.&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;flowchart TD&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;IS[&quot;IdentifyVariants\n(LLM)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;NPN[&quot;Normalize participating_nodes\n(sort, deduplicate)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;IBF[&quot;Infer branch_fqn from code structure&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;CIK[&quot;Compute identity key from code anchors&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;MAT[&quot;Match against previous variant index&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;SM[&quot;Strong match → reuse previous ID&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;MM[&quot;Medium match → reuse previous ID\n(if unambiguous)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;NM[&quot;No match → generate new stable ID\nfrom anchors&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;DED[&quot;Deduplicate by identity key\n(keep canonical, drop duplicates)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;SYN[&quot;Synthesize via LLM\nusing reconciled IDs&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;IS --&gt; NPN --&gt; IBF --&gt; CIK --&gt; MAT&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;MAT --&gt; SM --&gt; DED&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;MAT --&gt; MM --&gt; DED&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;MAT --&gt; NM --&gt; DED&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;DED --&gt; SYN&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;This way, even if the LLM describes the same variant with different words every run, the variant ID is stable because it’s derived from the code structure, not the prose.&lt;/p&gt;
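&lt;p&gt;The strong-match and no-match branches of this flow can be sketched in Go as follows. This is an illustration under stated assumptions: the &lt;code dir=&quot;auto&quot;&gt;Variant&lt;/code&gt; type, its field names, the helper names, and the ID format are hypothetical, and the medium-match and deduplication steps are left out for brevity:&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Variant holds the code-derived anchors plus the LLM prose for one variant.
// All field names are hypothetical; only the anchors feed the identity key.
type Variant struct {
	EntryID            string
	OwnerFQN           string
	BranchFQN          string
	ParticipatingNodes []string
	ConditionText      string // LLM prose, deliberately excluded from identity
}

// identityKey hashes the anchors only, with nodes sorted for order independence.
func identityKey(v Variant) string {
	nodes := append([]string(nil), v.ParticipatingNodes...)
	sort.Strings(nodes)
	parts := []string{v.EntryID, v.OwnerFQN, v.BranchFQN, strings.Join(nodes, ",")}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:])
}

// reconcile reuses the previous ID when the identity key is already known
// (strong match) and otherwise derives a new ID from the key (no match).
func reconcile(previous map[string]string, v Variant) string {
	key := identityKey(v)
	if id, ok := previous[key]; ok {
		return id
	}
	return "variant-" + key[:12]
}

func main() {
	run1 := Variant{
		EntryID: "entry-1", OwnerFQN: "pkg.Handler", BranchFQN: "pkg.Handler.retry",
		ParticipatingNodes: []string{"service.A", "service.B"},
		ConditionText:      "when the batch is retried",
	}
	// Run 2: same anchors, different node order and different prose.
	run2 := run1
	run2.ParticipatingNodes = []string{"service.B", "service.A"}
	run2.ConditionText = "on retry of a failed batch"

	previous := map[string]string{identityKey(run1): "variant-0001"}
	fmt.Println(reconcile(previous, run2)) // variant-0001
}
```

&lt;p&gt;Because &lt;code dir=&quot;auto&quot;&gt;ConditionText&lt;/code&gt; never enters the key, rewording alone cannot create a new variant ID in this sketch.&lt;/p&gt;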
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;10-mitigation-strategies-in-depth&quot;&gt;10. Mitigation Strategies in Depth&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;101-structural-caching-the-dispatch-cache&quot;&gt;10.1 Structural Caching: The Dispatch Cache&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;The most impactful mitigation we implemented was persisting dispatch resolution decisions to object storage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; Every CI run asked the LLM to resolve all interface call ambiguities from scratch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt; Dispatch decisions are persisted with a rich cache key:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;type&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;PersistedDispatchEntry&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;struct&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Key              &lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt;         &lt;/span&gt;&lt;span&gt;`json:&quot;key&quot;`&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;InvokedMethodFQN &lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt;         &lt;/span&gt;&lt;span&gt;`json:&quot;invoked_method_fqn&quot;`&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Candidates       []&lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt;       &lt;/span&gt;&lt;span&gt;`json:&quot;candidates&quot;`&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;CandidatesHash   &lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt;         &lt;/span&gt;&lt;span&gt;`json:&quot;candidates_hash&quot;`&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;WiringHash       &lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt;         &lt;/span&gt;&lt;span&gt;`json:&quot;wiring_hash&quot;`&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;CallsiteHash     &lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt;         &lt;/span&gt;&lt;span&gt;`json:&quot;callsite_hash,omitempty&quot;`&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Result           &lt;/span&gt;&lt;span&gt;DispatchResult&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;`json:&quot;result&quot;`&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;The cache key includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The invoked method FQN&lt;/li&gt;
&lt;li&gt;The sorted candidate set&lt;/li&gt;
&lt;li&gt;A hash of the dependency injection wiring (the context the LLM uses to make its decision)&lt;/li&gt;
&lt;li&gt;A resolver version (for explicit invalidation when the LLM or prompt changes)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A persisted entry is only reused if all of the following are true:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code dir=&quot;auto&quot;&gt;schema_version&lt;/code&gt; matches&lt;/li&gt;
&lt;li&gt;&lt;code dir=&quot;auto&quot;&gt;resolver_version&lt;/code&gt; matches&lt;/li&gt;
&lt;li&gt;&lt;code dir=&quot;auto&quot;&gt;wiring_hash&lt;/code&gt; matches (DI context hasn’t changed)&lt;/li&gt;
&lt;li&gt;&lt;code dir=&quot;auto&quot;&gt;candidates_hash&lt;/code&gt; matches (the candidate set hasn’t changed)&lt;/li&gt;
&lt;li&gt;All selected targets still exist in the current call graph&lt;/li&gt;
&lt;/ul&gt;
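&lt;p&gt;The reuse conditions above could be checked along these lines. This is a hedged sketch: the &lt;code dir=&quot;auto&quot;&gt;Entry&lt;/code&gt; type, its fields, and the &lt;code dir=&quot;auto&quot;&gt;reusable&lt;/code&gt; function are assumptions made for illustration and do not reproduce the pipeline’s actual types:&lt;/p&gt;

```go
package main

import "fmt"

// Entry carries the fields the reuse check compares. The struct and its
// field names are assumptions for this sketch.
type Entry struct {
	SchemaVersion   string
	ResolverVersion string
	WiringHash      string
	CandidatesHash  string
	SelectedTargets []string
}

// reusable reports whether a persisted entry may be reused for the current
// run. When it may not, the second return value names the failed condition.
func reusable(persisted, current Entry, graph map[string]bool) (bool, string) {
	if persisted.SchemaVersion != current.SchemaVersion {
		return false, "schema_version mismatch"
	}
	if persisted.ResolverVersion != current.ResolverVersion {
		return false, "resolver_version mismatch"
	}
	if persisted.WiringHash != current.WiringHash {
		return false, "wiring_hash mismatch"
	}
	if persisted.CandidatesHash != current.CandidatesHash {
		return false, "candidates_hash mismatch"
	}
	// Every previously selected target must still exist in the call graph.
	for _, fqn := range persisted.SelectedTargets {
		if !graph[fqn] {
			return false, "target missing from call graph: " + fqn
		}
	}
	return true, ""
}

func main() {
	current := Entry{SchemaVersion: "1", ResolverVersion: "r1", WiringHash: "w1", CandidatesHash: "c1"}
	persisted := current
	persisted.SelectedTargets = []string{"(*crypto/sha256.digest).Sum"}

	// The graph only knows the post-upgrade FQN, so the old entry is rejected.
	graph := map[string]bool{"(*crypto/internal/fips140/sha256.Digest).Sum": true}
	ok, reason := reusable(persisted, current, graph)
	fmt.Println(ok, reason)
}
```

&lt;p&gt;Returning a reason alongside the boolean makes rejected entries easy to log, which helps when a miss would otherwise be silent.&lt;/p&gt;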
&lt;p&gt;This brought dispatch cache hit rate from 0% (first run) to 83% (144/174 sites on subsequent runs), with the remaining 30 misses explained by the incidents above.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The key design decision:&lt;/strong&gt; tie the cache key to the context that informs the LLM’s decision. If the DI wiring changes, the decision may change too — so the cache must be invalidated. This is similar to how a database query cache invalidates when its input tables change.&lt;/p&gt;
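&lt;p&gt;The key inputs and reuse conditions listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline’s actual code: the trimmed-down &lt;code dir=&quot;auto&quot;&gt;entry&lt;/code&gt; type, the helper names, the version constants, and the NUL-separated hashing scheme are all hypothetical.&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Hypothetical version constants; bump resolverVersion when the prompt or model changes.
const (
	schemaVersion   = 1
	resolverVersion = 3
)

// entry is a trimmed-down, hypothetical view of a persisted dispatch entry.
type entry struct {
	SchemaVersion   int
	ResolverVersion int
	WiringHash      string
	CandidatesHash  string
	Selected        []string
}

// hashStrings hashes its parts with a NUL separator so that
// ("ab", "c") and ("a", "bc") cannot collide.
func hashStrings(parts ...string) string {
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:])
}

// candidatesHash sorts a copy of the candidates so ordering never affects the hash.
func candidatesHash(candidates []string) string {
	sorted := append([]string(nil), candidates...)
	sort.Strings(sorted)
	return hashStrings(sorted...)
}

// cacheKey combines the four inputs listed above.
func cacheKey(invokedFQN string, candidates []string, wiringHash string) string {
	return hashStrings(invokedFQN, candidatesHash(candidates), wiringHash, fmt.Sprint(resolverVersion))
}

// reusable applies the reuse conditions; any mismatch forces a fresh LLM call.
func reusable(e entry, wiringHash string, candidates []string, inGraph map[string]bool) bool {
	if e.SchemaVersion != schemaVersion || e.ResolverVersion != resolverVersion {
		return false
	}
	if e.WiringHash != wiringHash || e.CandidatesHash != candidatesHash(candidates) {
		return false
	}
	for _, target := range e.Selected {
		if !inGraph[target] {
			return false // a selected target vanished from the call graph
		}
	}
	return true
}

func main() {
	cands := []string{"pkg.B", "pkg.A"}
	e := entry{
		SchemaVersion:   schemaVersion,
		ResolverVersion: resolverVersion,
		WiringHash:      "w1",
		CandidatesHash:  candidatesHash(cands),
		Selected:        []string{"pkg.A"},
	}
	graph := map[string]bool{"pkg.A": true, "pkg.B": true}
	fmt.Println(reusable(e, "w1", []string{"pkg.A", "pkg.B"}, graph)) // true: candidate order is irrelevant
	fmt.Println(reusable(e, "w2", cands, graph))                      // false: wiring changed
}
```

&lt;p&gt;Sorting before hashing is what makes the key independent of the order in which candidates happen to be enumerated.&lt;/p&gt;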
&lt;div&gt;&lt;h3 id=&quot;102-hash-based-change-detection&quot;&gt;10.2 Hash-Based Change Detection&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Instead of comparing LLM-generated prose to detect changes, use code-derived hashes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Analysis tree hash:&lt;/strong&gt;&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;func&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;computeAnalysisTreeHash&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;tree&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;*&lt;/span&gt;&lt;span&gt;AnalysisTree&lt;/span&gt;&lt;span&gt;) &lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;// Include: entry kind, entry trigger, entry FQN,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;//          node FQNs, node source hashes,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;//          boundary/recursion/truncated/dispatch_uncertain flags,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;//          ordered child relationships&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;// Exclude: generation timestamp, LLM prose, absolute paths&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;sha256&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;canonicalJSON&lt;/span&gt;&lt;span&gt;(treeStructure))&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;An entry point is &lt;code dir=&quot;auto&quot;&gt;Unchanged&lt;/code&gt; if and only if its analysis tree hash matches the previous hash. This is deterministic, fast, and immune to prose drift.&lt;/p&gt;
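&lt;p&gt;One way to get a canonical serialization in Go is to marshal a struct that contains only the structural fields. The sketch below assumes a hypothetical &lt;code dir=&quot;auto&quot;&gt;node&lt;/code&gt; type with illustrative field names; it relies on &lt;code dir=&quot;auto&quot;&gt;encoding/json&lt;/code&gt; emitting struct fields in declaration order and skipping fields tagged &lt;code dir=&quot;auto&quot;&gt;json:&quot;-&quot;&lt;/code&gt;.&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// node is a hypothetical analysis-tree node; the field names are illustrative.
type node struct {
	FQN        string `json:"fqn"`
	SourceHash string `json:"source_hash"`
	Boundary   bool   `json:"boundary"`
	Children   []node `json:"children,omitempty"` // order is significant
	Prose      string `json:"-"`                  // LLM text, excluded from the hash
	Generated  string `json:"-"`                  // timestamp, excluded from the hash
}

// analysisTreeHash hashes only the structural fields of the tree.
func analysisTreeHash(root node) (string, error) {
	// Struct fields are emitted in declaration order, so the same structure
	// always serializes to the same bytes. omitempty makes a nil and an empty
	// child slice serialize identically.
	b, err := json.Marshal(root)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	leaf := node{FQN: "pkg.Helper", SourceHash: "h1"}
	before := node{FQN: "pkg.Entry", Children: []node{leaf}, Prose: "first wording"}
	after := node{FQN: "pkg.Entry", Children: []node{leaf}, Prose: "reworded by the LLM"}

	h1, _ := analysisTreeHash(before)
	h2, _ := analysisTreeHash(after)
	fmt.Println(h1 == h2) // true: prose drift does not change the hash
}
```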
&lt;p&gt;&lt;strong&gt;Node source hash:&lt;/strong&gt;&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;func&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;computeCanonicalSourceHash&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;funcBody&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt;) &lt;/span&gt;&lt;span&gt;string&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;// Strip comments, normalize whitespace&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;sha256&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;normalizeAST&lt;/span&gt;&lt;span&gt;(funcBody))&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;This gives us per-function change detection without relying on line numbers or file modification times, both of which are brittle in CI environments.&lt;/p&gt;
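&lt;p&gt;As a rough, runnable approximation of this idea, the sketch below hashes the Go token stream instead of a normalized AST. That is a swapped-in technique, not the &lt;code dir=&quot;auto&quot;&gt;normalizeAST&lt;/code&gt; helper referenced above: the standard &lt;code dir=&quot;auto&quot;&gt;go/scanner&lt;/code&gt; package skips comments by default and never emits whitespace as a token. The function name &lt;code dir=&quot;auto&quot;&gt;canonicalSourceHash&lt;/code&gt; is hypothetical.&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"go/scanner"
	"go/token"
)

// canonicalSourceHash hashes the Go token stream of src. Comments are skipped
// by the scanner (mode 0) and whitespace never becomes a token, so comment
// edits, indentation changes, and blank lines leave the hash unchanged.
func canonicalSourceHash(src string) string {
	fset := token.NewFileSet()
	file := fset.AddFile("", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)

	h := sha256.New()
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok == token.SEMICOLON {
			lit = "" // automatic and explicit semicolons should hash the same
		}
		h.Write([]byte(tok.String()))
		h.Write([]byte{0})
		h.Write([]byte(lit))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := "func f() int {\n\t// add one\n\treturn 1 + 1\n}\n"
	b := "func f() int {\n\n    return 1+1 // sum\n}\n"
	c := "func f() int {\n\treturn 1 + 2\n}\n"
	fmt.Println(canonicalSourceHash(a) == canonicalSourceHash(b)) // true: only comments and spacing differ
	fmt.Println(canonicalSourceHash(a) == canonicalSourceHash(c)) // false: the code changed
}
```

&lt;p&gt;A token-level hash is still sensitive to where line breaks trigger automatic semicolon insertion, so a true AST-based normalization would be stricter about ignoring layout.&lt;/p&gt;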
&lt;div&gt;&lt;h3 id=&quot;103-the-git-precheck-deterministic-early-gate&quot;&gt;10.3 The Git Precheck: Deterministic Early Gate&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Before any LLM work starts, a Git-based precheck can determine whether the commit range contains any code changes that could plausibly affect the generated output.&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;flowchart TD&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;CP[&quot;Load checkpoint\nanalysis/pipeline_checkpoint.json\nlast_processed_sha: abc123&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;GD[&quot;git diff --name-only abc123..HEAD&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;subgraph CLASS[&quot;Changed files classification&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;subgraph SAFE[&quot;Safe to skip&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;S1[&quot;**/*_test.go&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;S2[&quot;docs/**&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;S3[&quot;**/*.md&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;end&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;subgraph MUST[&quot;Must run&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;R1[&quot;cmd/**&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;R2[&quot;internal/**&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;R3[&quot;go.mod / go.sum&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;R4[&quot;ci-pipeline.yml&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;end&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;end&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Q{&quot;All changes\nsafe to skip?&quot;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;SKIP[&quot;Skip all LLM work\nadvance checkpoint&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;CONT[&quot;Continue with full pipeline&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;CP --&gt; GD --&gt; CLASS --&gt; Q&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Q --&gt;|YES| SKIP&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;Q --&gt;|NO| CONT&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;The precheck is designed to &lt;strong&gt;fail open&lt;/strong&gt;: if any uncertainty exists (no checkpoint, unreachable SHA, inconclusive classification), it proceeds to the full pipeline. The optimization is only applied when confidence is high.&lt;/p&gt;
&lt;p&gt;Critically, the precheck uses the &lt;strong&gt;full range&lt;/strong&gt; from last processed SHA to HEAD — not just the current commit. This ensures that skipped intermediate commits are still considered when evaluating whether to run.&lt;/p&gt;
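&lt;p&gt;The classification step might look roughly like the sketch below. It takes the output of &lt;code dir=&quot;auto&quot;&gt;git diff --name-only&lt;/code&gt; as a string rather than running Git itself. The function names are hypothetical, and checking the safe patterns first is an assumption: the diagram does not say how a path that matches both lists (for example a test file under &lt;code dir=&quot;auto&quot;&gt;internal/&lt;/code&gt;) is resolved.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// isSafePath mirrors the "safe to skip" patterns from the diagram:
// **/*_test.go, docs/**, and **/*.md. Everything else counts as "must run".
func isSafePath(p string) bool {
	return strings.HasSuffix(p, "_test.go") ||
		strings.HasSuffix(p, ".md") ||
		strings.HasPrefix(p, "docs/")
}

// safeToSkip reports whether all LLM work can be skipped. diffOutput is the
// output of `git diff --name-only` over the full range from lastSHA to HEAD;
// diffErr is the error from running that command.
func safeToSkip(lastSHA, diffOutput string, diffErr error) bool {
	if lastSHA == "" || diffErr != nil {
		return false // fail open: no checkpoint or unreachable SHA
	}
	for _, line := range strings.Split(diffOutput, "\n") {
		p := strings.TrimSpace(line)
		if p == "" {
			continue
		}
		if !isSafePath(p) {
			return false // one unclassified or must-run file is enough
		}
	}
	return true
}

func main() {
	fmt.Println(safeToSkip("abc123", "docs/intro.md\ninternal/a_test.go\n", nil)) // true
	fmt.Println(safeToSkip("abc123", "README.md\ncmd/main.go\n", nil))            // false
	fmt.Println(safeToSkip("", "README.md\n", nil))                               // false: no checkpoint
}
```

&lt;p&gt;Because unknown paths default to “must run”, adding a new directory to the repository can only make the precheck more conservative, never less.&lt;/p&gt;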
&lt;div&gt;&lt;h3 id=&quot;104-variant-deduplication-by-code-anchors&quot;&gt;10.4 Variant Deduplication by Code Anchors&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;When the LLM is asked to identify variants for an entry point, it may return semantically identical variants worded differently. Code-anchor deduplication catches these before they reach synthesis:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Deduplication key:&lt;/strong&gt;&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;entry_id + owner_fqn + branch_fqn + ordered_participating_nodes&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;When two variants share this key, keep exactly one, chosen by the following rules in priority order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Prefer the variant that reused a previous variant ID&lt;/li&gt;
&lt;li&gt;Prefer the variant with non-empty &lt;code dir=&quot;auto&quot;&gt;branch_fqn&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Prefer the variant with a more complete participating node set&lt;/li&gt;
&lt;li&gt;Prefer first occurrence as tie-breaker&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This prevents the LLM from generating 2x or 3x the expected number of reports just because it phrased the same variant differently in each.&lt;/p&gt;
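&lt;p&gt;A minimal sketch of the deduplication pass follows. The &lt;code dir=&quot;auto&quot;&gt;variant&lt;/code&gt; type and all function names are hypothetical, and “more complete participating node set” is approximated by node count, which is an assumption. The key function is passed in as a parameter because, with the exact key shown above, rules 2 and 3 can only differ if the key is computed on normalized anchors.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// variant is a hypothetical, trimmed-down LLM-identified variant.
type variant struct {
	ID        string
	EntryID   string
	OwnerFQN  string
	BranchFQN string
	Nodes     []string // ordered participating nodes
	ReusedID  bool     // true if the ID was carried over from a previous run
}

// anchorKey builds the deduplication key from code anchors only.
func anchorKey(v variant) string {
	parts := append([]string{v.EntryID, v.OwnerFQN, v.BranchFQN}, v.Nodes...)
	return strings.Join(parts, "\x00")
}

// prefer reports whether b should replace the currently kept variant a.
func prefer(a, b variant) bool {
	if a.ReusedID != b.ReusedID {
		return b.ReusedID // rule 1: reused previous ID wins
	}
	if (a.BranchFQN != "") != (b.BranchFQN != "") {
		return b.BranchFQN != "" // rule 2: non-empty branch_fqn wins
	}
	if len(a.Nodes) != len(b.Nodes) {
		return len(b.Nodes) > len(a.Nodes) // rule 3: larger node set wins
	}
	return false // rule 4: tie, keep the first occurrence
}

// dedup keeps one variant per key while preserving first-seen order.
func dedup(vs []variant, key func(variant) string) []variant {
	index := map[string]int{} // key -> position in out
	var out []variant
	for _, v := range vs {
		k := key(v)
		i, seen := index[k]
		if !seen {
			index[k] = len(out)
			out = append(out, v)
			continue
		}
		if prefer(out[i], v) {
			out[i] = v
		}
	}
	return out
}

func main() {
	vs := []variant{
		{ID: "new-1", EntryID: "e", OwnerFQN: "o", BranchFQN: "b", Nodes: []string{"n1"}},
		{ID: "old-7", EntryID: "e", OwnerFQN: "o", BranchFQN: "b", Nodes: []string{"n1"}, ReusedID: true},
	}
	for _, v := range dedup(vs, anchorKey) {
		fmt.Println(v.ID) // old-7
	}
}
```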
&lt;div&gt;&lt;h3 id=&quot;105-parse-failure-persistence&quot;&gt;10.5 Parse-Failure Persistence&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;When an LLM response fails validation (e.g., it returns an unrecognized FQN), don’t silently drop it. Persist a low-confidence fallback:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;result, perr &lt;/span&gt;&lt;span&gt;:=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;parseDispatchResponse&lt;/span&gt;&lt;span&gt;(respText, ambig.candidates, ambig.invokedMethodFQN)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; perr &lt;/span&gt;&lt;span&gt;!=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;nil&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;log.&lt;/span&gt;&lt;span&gt;Warn&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;dispatch resolve: parse failed; persisting low-conf fallback&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;&quot;method&quot;&lt;/span&gt;&lt;span&gt;, ambig.invokedMethodFQN, &lt;/span&gt;&lt;span&gt;&quot;error&quot;&lt;/span&gt;&lt;span&gt;, perr)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;// Low-confidence → all candidates used (CHA pass-through)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;// But the result IS written to cache, breaking the retry loop&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;DispatchResult&lt;/span&gt;&lt;span&gt;{Confidence: &lt;/span&gt;&lt;span&gt;&quot;low&quot;&lt;/span&gt;&lt;span&gt;}, &lt;/span&gt;&lt;span&gt;true&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;A low-confidence result has the same runtime behavior as a parse failure (all CHA candidates are used), but it stops the infinite retry loop. The LLM is not re-invoked for this site on the next run.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;11-operational-patterns-for-llm-reliability&quot;&gt;11. Operational Patterns for LLM Reliability&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Beyond the pipeline-specific mitigations, there are operational patterns that apply to any production LLM system.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;111-per-attempt-timeout&quot;&gt;11.1 Per-Attempt Timeout&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;LLM providers can take arbitrarily long for individual responses, especially during degraded service. Without a per-attempt timeout, one slow call can block your entire pipeline.&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;// Instead of:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;resp, err &lt;/span&gt;&lt;span&gt;:=&lt;/span&gt;&lt;span&gt; c.client.&lt;/span&gt;&lt;span&gt;GenerateContent&lt;/span&gt;&lt;span&gt;(ctx, &lt;/span&gt;&lt;span&gt;...&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;// Do:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;attemptCtx, cancel &lt;/span&gt;&lt;span&gt;:=&lt;/span&gt;&lt;span&gt; context.&lt;/span&gt;&lt;span&gt;WithTimeout&lt;/span&gt;&lt;span&gt;(ctx, c.attemptTimeout)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;defer&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;cancel&lt;/span&gt;&lt;span&gt;()&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;resp, err &lt;/span&gt;&lt;span&gt;:=&lt;/span&gt;&lt;span&gt; c.client.&lt;/span&gt;&lt;span&gt;GenerateContent&lt;/span&gt;&lt;span&gt;(attemptCtx, &lt;/span&gt;&lt;span&gt;...&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Typical value: 60-120 seconds per attempt. With 3 retries and exponential backoff, the worst-case time for a single call is bounded by the sum of the attempt timeouts plus the backoff delays, roughly 8-16 minutes, rather than by the provider’s connection timeout (often 10+ minutes per attempt).&lt;/p&gt;
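&lt;p&gt;The bound is simple arithmetic and worth computing explicitly for your own settings. The sketch below assumes that “3 retries” means 4 attempts in total and that backoff starts at 30 seconds and doubles each time; neither the backoff schedule nor the function name &lt;code dir=&quot;auto&quot;&gt;worstCase&lt;/code&gt; comes from the pipeline itself.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"time"
)

// worstCase returns the longest a single logical call can take when every
// attempt runs into its timeout and backoff doubles between attempts.
func worstCase(attempts int, timeout, baseBackoff time.Duration) time.Duration {
	total := time.Duration(0)
	backoff := baseBackoff
	for i := 0; i != attempts; i++ {
		total += timeout
		if i != attempts-1 { // no sleep after the final attempt
			total += backoff
			backoff *= 2
		}
	}
	return total
}

func main() {
	// 4 x 120s of attempts plus 30s + 60s + 120s of backoff.
	fmt.Println(worstCase(4, 120*time.Second, 30*time.Second)) // 11m30s
	// 4 x 60s of attempts plus the same backoff.
	fmt.Println(worstCase(4, 60*time.Second, 30*time.Second)) // 7m30s
}
```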
&lt;div&gt;&lt;h3 id=&quot;112-global-concurrency-control&quot;&gt;11.2 Global Concurrency Control&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;A common mistake is using nested concurrency limiters. If you have an outer limit of 4 (for clusters of work) and an inner limit of 4 (for items within each cluster), you can have 4×4=16 concurrent LLM calls when you intended to have 4.&lt;/p&gt;
&lt;p&gt;Use a &lt;strong&gt;single shared semaphore&lt;/strong&gt; for all LLM-backed operations:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;type&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;synthesisLimiter&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;struct&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;tokens &lt;/span&gt;&lt;span&gt;chan&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;struct&lt;/span&gt;&lt;span&gt;{}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;func&lt;/span&gt;&lt;span&gt; (&lt;/span&gt;&lt;span&gt;l &lt;/span&gt;&lt;span&gt;*&lt;/span&gt;&lt;span&gt;synthesisLimiter&lt;/span&gt;&lt;span&gt;) &lt;/span&gt;&lt;span&gt;run&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;ctx&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;context&lt;/span&gt;&lt;span&gt;.&lt;/span&gt;&lt;span&gt;Context&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;fn&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;func&lt;/span&gt;&lt;span&gt;() &lt;/span&gt;&lt;span&gt;error&lt;/span&gt;&lt;span&gt;) &lt;/span&gt;&lt;span&gt;error&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;select&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;case&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&amp;#x3C;-&lt;/span&gt;&lt;span&gt;ctx.&lt;/span&gt;&lt;span&gt;Done&lt;/span&gt;&lt;span&gt;():&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; ctx.&lt;/span&gt;&lt;span&gt;Err&lt;/span&gt;&lt;span&gt;()&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;case&lt;/span&gt;&lt;span&gt; l.tokens &lt;/span&gt;&lt;span&gt;&amp;#x3C;-&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;struct&lt;/span&gt;&lt;span&gt;{}{}:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;defer&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;func&lt;/span&gt;&lt;span&gt;() { &lt;/span&gt;&lt;span&gt;&amp;#x3C;-&lt;/span&gt;&lt;span&gt;l.tokens }()&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;fn&lt;/span&gt;&lt;span&gt;()&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Pass this limiter through your entire call stack. The configured concurrency limit should mean exactly one thing: the maximum number of concurrent LLM calls across the entire pipeline, not per level.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;113-circuit-breaker--failure-budget&quot;&gt;11.3 Circuit Breaker / Failure Budget&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;When the LLM service is degraded, you’ll get a flood of 503 or deadline errors. Without a circuit breaker, your pipeline will exhaust its retry budget on every call, waste 10+ minutes, and eventually fail.&lt;/p&gt;
&lt;p&gt;A failure budget is simpler than a full circuit breaker and sufficient for most batch pipelines:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;type&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;llmFailureBudget&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;struct&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;maxFailures &lt;/span&gt;&lt;span&gt;int&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;count       &lt;/span&gt;&lt;span&gt;atomic&lt;/span&gt;&lt;span&gt;.&lt;/span&gt;&lt;span&gt;Int64&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;func&lt;/span&gt;&lt;span&gt; (&lt;/span&gt;&lt;span&gt;b &lt;/span&gt;&lt;span&gt;*&lt;/span&gt;&lt;span&gt;llmFailureBudget&lt;/span&gt;&lt;span&gt;) &lt;/span&gt;&lt;span&gt;record&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;err&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;error&lt;/span&gt;&lt;span&gt;) &lt;/span&gt;&lt;span&gt;error&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;!&lt;/span&gt;&lt;span&gt;isLLMAvailabilityFailure&lt;/span&gt;&lt;span&gt;(err) {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; err  &lt;/span&gt;&lt;span&gt;// Don&apos;t count parse errors, validation errors&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; b.count.&lt;/span&gt;&lt;span&gt;Add&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;1&lt;/span&gt;&lt;span&gt;) &lt;/span&gt;&lt;span&gt;&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;int64&lt;/span&gt;&lt;span&gt;(b.maxFailures) {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; fmt.&lt;/span&gt;&lt;span&gt;Errorf&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;llm failure budget exhausted: &lt;/span&gt;&lt;span&gt;%w&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, err)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; err&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Count: HTTP 503, &lt;code dir=&quot;auto&quot;&gt;Unavailable&lt;/code&gt;, &lt;code dir=&quot;auto&quot;&gt;DeadlineExceeded&lt;/code&gt;, &lt;code dir=&quot;auto&quot;&gt;i/o timeout&lt;/code&gt;.&lt;br&gt;
Don’t count: JSON parse failures, validation errors, caller cancellation.&lt;/p&gt;
&lt;p&gt;Budget exhaustion should fail the run cleanly and immediately, not after waiting for every in-flight call to time out.&lt;/p&gt;
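&lt;p&gt;The classifier behind the budget could be sketched with the standard library alone, as below. The &lt;code dir=&quot;auto&quot;&gt;statusError&lt;/code&gt; type is hypothetical: real client libraries expose HTTP or gRPC status codes in their own ways, so that branch would need adapting to whichever SDK you use.&lt;/p&gt;

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
)

// statusError is a hypothetical error type carrying an HTTP status code.
type statusError struct{ Code int }

func (e statusError) Error() string { return fmt.Sprintf("http status %d", e.Code) }

// isLLMAvailabilityFailure reports whether err should count against the budget.
func isLLMAvailabilityFailure(err error) bool {
	if err == nil || errors.Is(err, context.Canceled) {
		return false // success or caller cancellation: not a provider outage
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return true // per-attempt timeout fired
	}
	se := new(statusError)
	if errors.As(err, se) {
		return se.Code == http.StatusServiceUnavailable
	}
	ne := new(net.Error)
	if errors.As(err, ne) {
		return (*ne).Timeout() // covers "i/o timeout" from the network stack
	}
	return false // parse and validation errors land here
}

func main() {
	fmt.Println(isLLMAvailabilityFailure(fmt.Errorf("call failed: %w", context.DeadlineExceeded))) // true
	fmt.Println(isLLMAvailabilityFailure(context.Canceled))                                       // false
	fmt.Println(isLLMAvailabilityFailure(statusError{Code: 503}))                                 // true
	fmt.Println(isLLMAvailabilityFailure(errors.New("invalid JSON in response")))                 // false
}
```

&lt;p&gt;Checking for cancellation before anything else keeps a deliberate shutdown from being misread as a provider outage.&lt;/p&gt;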
&lt;div&gt;&lt;h3 id=&quot;114-prompt-versioning&quot;&gt;11.4 Prompt Versioning&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;The LLM’s interpretation of a prompt changes when the prompt changes. This is obvious, but the consequence is less obvious: when you update a prompt, you should invalidate all cached decisions that were made with the old prompt.&lt;/p&gt;
&lt;p&gt;Embed a &lt;code dir=&quot;auto&quot;&gt;resolver_version&lt;/code&gt; in your cache keys and increment it whenever your prompt changes meaningfully. This ensures that the first run after a prompt change re-resolves everything from scratch (acceptable), rather than mixing old and new resolutions in the same artifact (not acceptable).&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;115-observability-what-to-measure&quot;&gt;11.5 Observability: What to Measure&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;For any LLM pipeline, instrument the following:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Why It Matters&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;cache_hits&lt;/code&gt; / &lt;code dir=&quot;auto&quot;&gt;cache_misses&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Catch cache key instability&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;llm_calls_per_run&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Catch non-determinism-driven cost growth&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;parse_failures&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Catch prompt/format regressions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;false_positive_modified_count&lt;/code&gt;&lt;/td&gt;&lt;td&gt;The ultimate integration test&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;p50/p95/p99 latency per LLM call&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Catch service degradation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;variants_per_entry_point&lt;/code&gt; (variance)&lt;/td&gt;&lt;td&gt;Catch structural drift&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;code dir=&quot;auto&quot;&gt;artifact_churn_rate&lt;/code&gt; (new/deprecated per run)&lt;/td&gt;&lt;td&gt;Catch identity instability&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;A sudden spike in &lt;code dir=&quot;auto&quot;&gt;cache_misses&lt;/code&gt; or &lt;code dir=&quot;auto&quot;&gt;llm_calls_per_run&lt;/code&gt; with no code changes is almost always a sign of non-determinism. Set alerts on these.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;12-architecture-patterns-for-deterministic-llm-pipelines&quot;&gt;12. Architecture Patterns for Deterministic LLM Pipelines&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Putting all of the above together, here are the architectural patterns that distinguish a stable LLM pipeline from a flaky one.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;121-the-two-layer-architecture&quot;&gt;12.1 The Two-Layer Architecture&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;flowchart TD&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;subgraph DET[&quot;DETERMINISTIC LAYER&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;D1[&quot;Static analysis\n(CHA, AST)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;D2[&quot;Hash-based\nchange detection&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;D3[&quot;Code-anchor\nidentity derivation&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;D4[&quot;Stable cache keys\n(from code)&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;end&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;subgraph PROB[&quot;PROBABILISTIC LAYER — LLM&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;P1[&quot;Dispatch\nresolution&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;P2[&quot;Variant\nidentification&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;P3[&quot;Overview\nsynthesis&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;P4[&quot;Report\nsynthesis&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;end&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;subgraph REC[&quot;RECONCILIATION LAYER&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;R1[&quot;Validate LLM output\nagainst code facts&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;R2[&quot;Normalize field\norder and format&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;R3[&quot;Reconcile IDs\nvs code anchors&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;R4[&quot;Deduplicate\nby code anchors&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;end&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;DET --&gt;|&quot;decisions, not prose&quot;| PROB&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;PROB --&gt;|&quot;prose, not identity&quot;| REC&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;122-the-caching-strategy-pattern&quot;&gt;12.2 The Caching Strategy Pattern&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Not all LLM calls should be cached the same way. Distinguish between:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Call Type&lt;/th&gt;&lt;th&gt;Cache Strategy&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Structural decisions (dispatch)&lt;/td&gt;&lt;td&gt;Persist to durable storage; invalidate on context change&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Content synthesis (reports)&lt;/td&gt;&lt;td&gt;Cache by code hash; regenerate when code changes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;One-shot creative tasks&lt;/td&gt;&lt;td&gt;Don’t cache; accept variance&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The key insight: cache &lt;strong&gt;decisions&lt;/strong&gt;, not &lt;strong&gt;prose&lt;/strong&gt;. A dispatch decision (“this interface resolves to type X”) is a structural fact that should remain stable. The prose explanation of why X was chosen is cosmetic and doesn’t need to be cached.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;123-the-candidateselection-pattern&quot;&gt;12.3 The Candidate/Selection Pattern&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;When asking an LLM to select from a set of candidates, always:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Enumerate candidates deterministically (sorted, deduped)&lt;/li&gt;
&lt;li&gt;Validate the LLM’s selection against the candidate set (reject selections not in candidates)&lt;/li&gt;
&lt;li&gt;Store the candidates hash in the cache key (so candidate-set changes invalidate the cache)&lt;/li&gt;
&lt;li&gt;Handle selection failures gracefully (fall back to a low-confidence result instead of retrying forever)&lt;/li&gt;
&lt;/ol&gt;
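&lt;p&gt;The four steps above can be sketched as a single resolution function. The &lt;code dir=&quot;auto&quot;&gt;decision&lt;/code&gt; type and all function names here are hypothetical, and the fallback simply keeps every candidate, mirroring the low-confidence behaviour described in section 10.5.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// decision is a hypothetical result type for one dispatch site.
type decision struct {
	Targets    []string
	Confidence string // "high" or "low"
}

// sortedUnique enumerates candidates deterministically: deduplicated, then sorted.
func sortedUnique(in []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range in {
		if !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	sort.Strings(out)
	return out
}

// validateSelection rejects empty selections and anything outside the candidate set.
func validateSelection(candidates, selected []string) error {
	if len(selected) == 0 {
		return errors.New("empty selection")
	}
	allowed := make(map[string]bool, len(candidates))
	for _, c := range candidates {
		allowed[c] = true
	}
	for _, s := range selected {
		if !allowed[s] {
			return fmt.Errorf("selection %q is not a candidate", s)
		}
	}
	return nil
}

// resolve turns a raw LLM selection into a decision that is safe to persist.
func resolve(candidates, llmSelected []string) decision {
	sorted := sortedUnique(candidates)
	if err := validateSelection(sorted, llmSelected); err != nil {
		// Low-confidence fallback: keep all candidates and do not retry.
		return decision{Targets: sorted, Confidence: "low"}
	}
	return decision{Targets: sortedUnique(llmSelected), Confidence: "high"}
}

func main() {
	cands := []string{"pkg.B", "pkg.A", "pkg.B"}
	fmt.Println(resolve(cands, []string{"pkg.A"}))   // {[pkg.A] high}
	fmt.Println(resolve(cands, []string{"pkg.Old"})) // {[pkg.A pkg.B] low}
}
```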
&lt;div&gt;&lt;h3 id=&quot;124-separation-of-identity-and-content&quot;&gt;12.4 Separation of Identity and Content&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Anti-pattern:&lt;/strong&gt;&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;Artifact ID = hash(LLM-generated title)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Correct pattern:&lt;/strong&gt;&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;Artifact ID = hash(code-derived anchors: entry_fqn + branch_fqn + participating_nodes)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;Artifact content = LLM-generated prose (can vary without affecting ID)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;This separation means prose can be freely re-generated (for quality improvements) without causing identity churn. It also means the same artifact can be updated with new prose without creating a new file.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;125-the-validation-first-pattern&quot;&gt;12.5 The Validation-First Pattern&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Before using any LLM output:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;flowchart TD&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;LO[&quot;LLM output&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;PV[&quot;Parse and validate structure&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;VCR[&quot;Validate all code references\nexist in current graph&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;NCF[&quot;Normalize to canonical form&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;RECON[&quot;Reconcile against previous state\nusing code anchors&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;USE[&quot;Use validated, normalized,\nreconciled output&quot;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;LO --&gt; PV --&gt; VCR --&gt; NCF --&gt; RECON --&gt; USE&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Never pass raw LLM output directly into identity-sensitive operations. Always validate first.&lt;/p&gt;
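&lt;p&gt;The four stages in the flowchart can be sketched as a single function. This is a simplified illustration under stated assumptions: the LLM is assumed to reply with JSON containing &lt;code dir=&quot;auto&quot;&gt;entry_fqn&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;targets&lt;/code&gt; fields, and &lt;code dir=&quot;auto&quot;&gt;known_fqns&lt;/code&gt; and &lt;code dir=&quot;auto&quot;&gt;previous&lt;/code&gt; are hypothetical stand-ins for the current call graph and the prior run’s state:&lt;/p&gt;

```python
import json
from typing import Optional


def validate_llm_output(raw: str, known_fqns: set[str], previous: dict[str, dict]) -> Optional[dict]:
    """Return a validated, normalized, reconciled record, or None to reject."""
    # 1. Parse and validate structure (the field names are assumptions).
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("targets"), list):
        return None
    if not isinstance(data.get("entry_fqn"), str):
        return None
    if not all(isinstance(t, str) for t in data["targets"]):
        return None
    # 2. Every code reference must exist in the current graph.
    refs = [data["entry_fqn"], *data["targets"]]
    if any(ref not in known_fqns for ref in refs):
        return None
    # 3. Normalize to canonical form: sorted, de-duplicated targets.
    normalized = {
        "entry_fqn": data["entry_fqn"],
        "targets": sorted(set(data["targets"])),
    }
    # 4. Reconcile against previous state, keyed by the code anchor.
    prior = previous.get(normalized["entry_fqn"])
    if prior is not None and prior["targets"] == normalized["targets"]:
        return prior  # Unchanged: reuse the existing record.
    return normalized
```

&lt;p&gt;Rejecting with &lt;code dir=&quot;auto&quot;&gt;None&lt;/code&gt; keeps the decision about fallbacks in the caller, which is one possible design; a real system might instead return a typed error describing which stage failed.&lt;/p&gt;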
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;13-what-cannot-be-fixed&quot;&gt;13. What Cannot Be Fixed&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;It is worth being honest about what these mitigations do not solve.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;131-the-llms-knowledge-is-frozen&quot;&gt;13.1 The LLM’s Knowledge Is Frozen&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Whatever the LLM knows about your codebase and its dependencies was frozen at training time. Unless you supply the information in the prompt, it does not know about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;New packages added to your repository&lt;/li&gt;
&lt;li&gt;Renamed interfaces or restructured types&lt;/li&gt;
&lt;li&gt;Framework upgrades that change FQN conventions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means the LLM can produce plausible-looking but incorrect output for code patterns it hasn’t seen, and the incorrectness may be subtle enough to pass validation. In our case, the &lt;code dir=&quot;auto&quot;&gt;(hash.Hash).Sum&lt;/code&gt; incident showed the LLM consistently returning a pre-upgrade FQN that looked valid but failed against the current call graph.&lt;/p&gt;
&lt;p&gt;The mitigation is: validate LLM output against current code facts, not against what the LLM should know.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;132-fundamental-temperature-non-determinism&quot;&gt;13.2 Fundamental Temperature Non-Determinism&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Even with &lt;code dir=&quot;auto&quot;&gt;temperature=0&lt;/code&gt;, some providers do not guarantee identical outputs across invocations. Hardware differences, request batching in the serving infrastructure, and the order of floating-point operations can all produce different outputs for identical inputs. The only mitigation is caching, and caching only helps for calls you’ve made before.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;133-context-sensitive-dispatch-requires-context-sensitive-analysis&quot;&gt;13.3 Context-Sensitive Dispatch Requires Context-Sensitive Analysis&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;The false-positive re-synthesis incident (15 entry points from 1 change) is only partially fixable with a context-scoped dispatch key. The deeper fix is a context-sensitive analysis (k-CFA instead of CHA). CHA is cheap: resolving a call site needs only a lookup over the type hierarchy. k-CFA is far more expensive. Its cost grows steeply with k, it is EXPTIME-complete for k ≥ 1 in the functional-language setting, and it can be extremely slow on large codebases. For most production systems, CHA + LLM dispatch is the right tradeoff, but it cannot achieve the precision of a true context-sensitive analysis.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;134-prompt-changes-are-costly&quot;&gt;13.4 Prompt Changes Are Costly&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Every time you improve your prompt, you pay the cost of re-resolving all previously cached decisions. This is unavoidable — a better prompt may produce genuinely different and better results, so old cached results are legitimately stale. The mitigation is &lt;code dir=&quot;auto&quot;&gt;resolver_version&lt;/code&gt; in cache keys, which makes invalidation explicit and controllable.&lt;/p&gt;
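&lt;p&gt;One way to build such a key, as a sketch: the helper name and the choice of call-site FQN plus candidate set as the code-derived components are assumptions made for illustration. Only &lt;code dir=&quot;auto&quot;&gt;resolver_version&lt;/code&gt; comes from the text above:&lt;/p&gt;

```python
import hashlib
import json


def decision_cache_key(resolver_version: str, call_site_fqn: str, candidates: list[str]) -> str:
    """Build a cache key from code facts plus an explicit prompt version."""
    payload = json.dumps(
        {
            # Bump this whenever the prompt or resolver logic changes.
            "resolver_version": resolver_version,
            "call_site_fqn": call_site_fqn,
            # Sorted and de-duplicated so candidate order is irrelevant.
            "candidates": sorted(set(candidates)),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

&lt;p&gt;Bumping the version string changes every key at once, so old entries are never read again and invalidation becomes a deliberate, reviewable change rather than a side effect.&lt;/p&gt;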
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;14-summary-and-checklist&quot;&gt;14. Summary and Checklist&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Building software on top of LLMs that is reliable enough for production CI/CD requires treating the LLM as a &lt;strong&gt;probabilistic black box that produces proposals, not facts&lt;/strong&gt;.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;the-core-principles&quot;&gt;The Core Principles&lt;/h3&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Code owns identity.&lt;/strong&gt; Never derive artifact IDs, cache keys, or artifact names from LLM-generated prose. Use code-derived anchors (FQNs, hashes, structured types).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Validate, normalize, reconcile before using.&lt;/strong&gt; Every LLM response should pass through a validation layer that checks code-derived facts, a normalization layer that canonicalizes format, and a reconciliation layer that matches against previous state.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cache decisions, not prose.&lt;/strong&gt; Persist structural decisions (dispatch resolutions, classification results) to durable storage with rich invalidation keys. Don’t cache prose — it should be freely regenerable.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Measure non-determinism directly.&lt;/strong&gt; Instrument for cache hit rate, LLM calls per run, parse failures, and artifact churn rate. Non-determinism is invisible without these metrics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fail open safely.&lt;/strong&gt; When caching or validation fails, fall back to a deterministic default (e.g., CHA pass-through, all candidates used). Never retry forever; persist the failure result so the next run does not repeat the same failing call.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Separate early gates from semantic gates.&lt;/strong&gt; Git-based prechecks can save LLM cost for obviously irrelevant changes, but they do not replace semantic diffing. Use both, in sequence.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
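&lt;p&gt;The “cache decisions” and “fail open safely” principles can be combined in one small resolver. The following is a sketch under stated assumptions: &lt;code dir=&quot;auto&quot;&gt;call_llm&lt;/code&gt; is a hypothetical callable, the cache is a plain dictionary standing in for durable storage, and the &lt;code dir=&quot;auto&quot;&gt;confidence&lt;/code&gt; field is an illustrative way to mark fallback results:&lt;/p&gt;

```python
from typing import Callable


def resolve_dispatch(
    key: str,
    candidates: list[str],
    cache: dict[str, dict],
    call_llm: Callable[[list[str]], list[str]],
    max_attempts: int = 2,
) -> dict:
    """Resolve a dispatch decision, failing open to all candidates."""
    if key in cache:
        return cache[key]  # Includes persisted failures, so no re-asking.
    for _ in range(max_attempts):
        try:
            chosen = call_llm(candidates)
            # Keep only targets that are real candidates.
            valid = sorted(set(chosen).intersection(candidates))
        except Exception:
            continue  # Timeout, parse error, malformed reply: try again.
        if valid:
            result = {"targets": valid, "confidence": "high"}
            break
    else:
        # Bounded retries exhausted: deterministic pass-through default.
        result = {"targets": sorted(set(candidates)), "confidence": "low"}
    cache[key] = result  # Persist success and failure alike.
    return result
```

&lt;p&gt;Persisting the low-confidence fallback is the part that prevents churn: a later run with the same key reads the stored decision instead of asking the LLM again and possibly getting a different answer.&lt;/p&gt;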
&lt;div&gt;&lt;h3 id=&quot;production-readiness-checklist&quot;&gt;Production Readiness Checklist&lt;/h3&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; All cache keys include code-derived components, not LLM prose&lt;/li&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Cache keys include a &lt;code dir=&quot;auto&quot;&gt;resolver_version&lt;/code&gt; for prompt-change invalidation&lt;/li&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; All LLM responses are validated against current code facts before use&lt;/li&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Parse failures are persisted as low-confidence results, not dropped&lt;/li&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Artifact and report IDs are derived from code anchors, not LLM text&lt;/li&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Per-attempt LLM timeout is configured (not relying on provider’s connection timeout)&lt;/li&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Global concurrency limit is a single shared semaphore, not nested limits&lt;/li&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Circuit breaker or failure budget stops runaway retries on service degradation&lt;/li&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Metrics are in place for cache hit rate, LLM calls/run, and artifact churn rate&lt;/li&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; False-positive modified artifact count is monitored with alerting&lt;/li&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Git precheck or equivalent avoids LLM calls for clearly irrelevant changes&lt;/li&gt;
&lt;li&gt;&lt;input type=&quot;checkbox&quot; disabled&gt; Integration tests verify that two consecutive identical runs produce identical output&lt;/li&gt;
&lt;/ul&gt;
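&lt;p&gt;The last checklist item can be approximated with a directory digest: run the pipeline twice on the same commit into two output directories and compare. This is a sketch that assumes the pipeline writes its artifacts as files under a single root; the helper name is illustrative:&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash every file path and its content under root, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renames change the digest too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

&lt;p&gt;An integration test would then assert that &lt;code dir=&quot;auto&quot;&gt;tree_digest(run_a) == tree_digest(run_b)&lt;/code&gt;. If prose is deliberately allowed to vary between runs, the comparison should cover only the identity-bearing files, such as IDs and manifests.&lt;/p&gt;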
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;The patterns described in this post draw on established ideas from several fields:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Content-addressable storage&lt;/strong&gt; (CAS): the idea of keying artifacts by hash of their content, used in Git, Nix, and build systems like Bazel&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Incremental computation&lt;/strong&gt;: Salsa (the incremental computation framework used by rust-analyzer) and the paper “Build Systems à la Carte” provide frameworks for thinking about what must be recomputed when inputs change&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimistic concurrency control&lt;/strong&gt;: validating cached results against current state before use, rather than assuming they’re valid&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Circuit breakers&lt;/strong&gt;: from Michael Nygard’s “Release It!” — stopping cascading failures by detecting and short-circuiting failing dependencies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context-sensitive program analysis&lt;/strong&gt;: Andersen’s points-to analysis (context-insensitive) and k-CFA (context-sensitive), the theoretical foundation for understanding why CHA + LLM is an approximation&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;The incidents described in this post were observed in a production CI pipeline running on a large Go codebase with hundreds of entry points and thousands of LLM calls per week. All code examples are simplified for clarity. If you’re building a similar system, the most important investment is not in prompt engineering — it is in the deterministic infrastructure around your LLM calls.&lt;/em&gt;&lt;/p&gt;</content:encoded></item></channel></rss>