Agentic Modernization and the Compilation Conundrum

Written by DeeDee Walsh | Jun 17, 2026 10:37:43 PM

I watched the full .NET Day Agentic Modernization event, and it was fantastic. You should definitely watch it. The GitHub Copilot app modernization agent did legit work: AST-aware project-system migration (packages.config → PackageReference, SDK-style csproj), dependency graph resolution, API-incompatibility detection, and task-decomposed execution with a reviewable plan. Aspire's CLI tokenizing the docs into an LLMs.txt the agent can retrieve against is a great idea. Data API Builder collapsing a hand-rolled CRUD tier is the right call most of the time. All of that is super legit, but I did see one problem and I'm gonna write about it, cuz that's what I do...

The problem is the definition of "validation" that the entire day operated under. Watch the validation steps in any of the runs and they reduce to a small, specific set:

✓ dotnet build      → 0 errors (some warnings)
✓ vulnerable package scan → no flagged transitive refs
✓ sample app launches, loads data from DB
✓ "final validation" task → clean full build on .NET 10

That is a type-soundness and toolchain check, plus a smoke test. At one point the agent's final-validation step went looking for test projects, found none, and the run was reported as successful anyway. So let's be precise about what was and wasn't established.

What a green build proves

dotnet build succeeding proves: the program is syntactically valid, the type graph is consistent, references resolve, and the IL emits. The C# type system is sound-ish, so a clean build rules out a large class of static errors. It says nothing, by construction, about runtime semantics. The compiler does not know what your application is supposed to do. It only knows the code is well-formed.

Behavioral equivalence is a property of observable output over the input domain, not a property of the source. And the Framework → modern-.NET transition is dense with places where well-formed, compile-clean code produces different observable behavior. A few that bite real migrations, none of which a build catches:

Globalization moved from NLS to ICU. Since .NET 5, culture-sensitive string operations on Windows default to ICU rather than the OS NLS tables. Collation order, casing edge cases, and culture-sensitive comparisons can differ from .NET Framework.

 csharp
using System.Globalization;

// Same code, different result depending on the globalization stack.
// Sort order feeding a paged grid, a dedupe key, or a "starts-with" filter
// can silently reorder/regroup.
int order = string.Compare("cote", "coté", CultureInfo.GetCultureInfo("fr-FR"), CompareOptions.None);
// NLS (Framework) and ICU (modern .NET) do not guarantee the same ordering here.

If a comparison result ever became a key: a cache key, a dedupe bucket, a merge join, an "is this the same record" check, the divergence is now in your data, not your logs.
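
One common way to take keys out of the blast radius is to make the comparison rule explicit instead of inheriting it from the globalization stack. The sketch below is illustrative only: the strings and the HashSet are stand-ins, not anything from the demos. It contrasts a linguistic comparison with an ordinal one, and includes the "\n" inside "\r\n" search that Microsoft's .NET 5 globalization notes call out as a case where culture-sensitive IndexOf can return a different index under ICU.

```csharp
using System;
using System.Collections.Generic;

// Linguistic comparison: the result is computed by the globalization stack,
// so NLS and ICU are not guaranteed to agree on it.
int linguistic = string.Compare("cote", "cot\u00E9", StringComparison.InvariantCulture);
Console.WriteLine($"linguistic compare: {linguistic}");

// Ordinal comparison: compares UTF-16 code units, independent of NLS/ICU.
int ordinal = string.CompareOrdinal("cote", "cot\u00E9");
Console.WriteLine($"ordinal compare sign: {Math.Sign(ordinal)}");

// Keyed collections take the comparer explicitly, so the dedupe rule is pinned.
var dedupe = new HashSet<string>(StringComparer.Ordinal) { "cote", "cot\u00E9", "cote" };
Console.WriteLine($"distinct keys: {dedupe.Count}");

// Searching for "\n" inside "\r\n": the ordinal overload is stable across stacks.
string text = "Hello\r\nWorld!";
int ordinalIndex = text.IndexOf("\n", StringComparison.Ordinal);
Console.WriteLine($"ordinal index of \\n: {ordinalIndex}");
```

The catch is that switching a legacy culture-sensitive comparison to ordinal is itself a behavior change. It pins the rule going forward; it does not prove the new rule matches what the old system did, which is exactly what a behavioral baseline is for.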

Default floating-point formatting changed. .NET Core 3.0 switched double/float ToString() to shortest round-trippable, IEEE-754-compliant output.

 csharp
(0.1 + 0.2).ToString();
// .NET Framework:  "0.3"
// .NET Core 3.0+:  "0.30000000000000004"

Every place a double crosses a string boundary can change: a generated report, a serialized payload, a signature/hash input, a CSV export that a downstream system parses. And it all compiles identically.
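
If a string boundary matters, one option is to stop relying on the default format and state it. This is a minimal sketch, assuming "G15" is a close enough stand-in for the old Framework default (it matches for this value; I would not assume it matches for every input without a captured baseline).

```csharp
using System;
using System.Globalization;

double sum = 0.1 + 0.2;

// Explicit fixed precision: what a report or CSV column usually wants.
string fixed2 = sum.ToString("F2", CultureInfo.InvariantCulture);

// Explicit round-trip format: the string parses back to the same double.
string roundTrip = sum.ToString("R", CultureInfo.InvariantCulture);

// 15 significant digits: an approximation of the legacy default formatting.
string legacyLike = sum.ToString("G15", CultureInfo.InvariantCulture);

Console.WriteLine(fixed2);
Console.WriteLine(roundTrip);
Console.WriteLine(legacyLike);

// The round-trip string restores the exact value.
bool restored = double.Parse(roundTrip, CultureInfo.InvariantCulture) == sum;
Console.WriteLine($"round-trips: {restored}");
```

Pinning the format makes the output stable from here on. Whether the pinned format is the one downstream consumers were actually relying on is still a question only the legacy output can answer.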

Legacy code pages aren't registered by default. A modern .NET process that reads a Windows-1252 file a legacy ETL still produces will throw at runtime unless you opt in:

 csharp
// Throws NotSupportedException on modern .NET without this line:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var enc = Encoding.GetEncoding(1252);

The likely "fix" an agent reaches for, assuming UTF-8, doesn't throw and doesn't fail the build. It mojibakes the data and ships.
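
Here is what that silent corruption looks like, as a small sketch. The four bytes are a made-up sample ("café" as a Windows-1252 producer would write it), not data from any real ETL. The last part shows a strict UTF-8 decoder, which at least turns the silent failure into a loud one.

```csharp
using System;
using System.Text;

// Opt in to legacy code pages (not registered by default on modern .NET).
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// "café" as written by a Windows-1252 producer: 0xE9 is 'é' in that code page.
byte[] legacyBytes = { 0x63, 0x61, 0x66, 0xE9 };

// Correct decode, using the encoding the file was actually written in.
string correct = Encoding.GetEncoding(1252).GetString(legacyBytes);

// "Just assume UTF-8": no exception; the invalid byte becomes U+FFFD.
string assumedUtf8 = Encoding.UTF8.GetString(legacyBytes);

// A strict UTF-8 decoder throws on invalid bytes instead of replacing them.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
bool strictThrew = false;
try { strictUtf8.GetString(legacyBytes); }
catch (DecoderFallbackException) { strictThrew = true; }

Console.WriteLine(correct);
Console.WriteLine(assumedUtf8);
Console.WriteLine($"strict decoder threw: {strictThrew}");
```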

The ASP.NET synchronization context is gone. System.Web installed a SynchronizationContext; ASP.NET Core does not. Sync-over-async patterns that deadlocked (or were "load-bearing" in their ordering) on Framework now behave differently:

 csharp
var result = SomeAsync().Result;   // a classic Framework deadlock site
// On ASP.NET Core: no captured context → it may now complete,
// changing execution ordering and continuation behavior.

Code that "starts working" is not obviously a win when something elsewhere depended on the old timing.

Add the long tail: BinaryFormatter is obsolete, and from .NET 9 its implementation is removed from the runtime (any persisted blob or remoting payload that used it is now a runtime failure or a silent format change); ConfigurationManager/web.config semantics give way to IConfiguration; HttpContext.Current and Thread.CurrentPrincipal ambient access disappear; System.Web session/auth pipelines have no 1:1 port; and Web Forms has no forward path at all. (Note that the day's web demo used the MVC 5 Music Store, which sidesteps the hardest legacy ASP.NET surface entirely.)

Every one of these is invisible to dotnet build. Several are invisible to a smoke test that loads one screen of data.

Why LLM translation specifically produces this failure shape

This isn't a knock on the model; it's a property of the objective. An LLM doing translation is sampling the most probable .NET 10 idiom conditioned on the input. It optimizes for code that looks like canonical modern .NET because that's what the distribution rewards, not for code that preserves your application's idiosyncratic observable contract. The model has no oracle for your behavior. It has no way to know that this rounding, this culture, this null-vs-empty, this status code was load-bearing for a downstream consumer it cannot see.

So the divergence concentrates exactly where it's hardest to catch: the output compiles (it's idiomatic), passes the green-build gate (it's well-formed), and differs from the original only on the inputs nobody demoed. That's the hallucination tax stated mechanically:

 expected cost ≈ P(silent divergence) × blast_radius × time_to_detection

LLM translation raises all three terms relative to a deterministic transform: higher P (sampling, not a proven mapping), wider blast radius (drift can land anywhere in the surface), and longer time-to-detection (it survives every gate the demo showed and surfaces in production).

The oracle problem (not the db...)

Validation requires an oracle: a source of truth for "correct." Greenfield projects have a spec. Migration has exactly one oracle: the observable behavior of the legacy system. That's the asymmetry that makes this hard. You're not asserting against requirements; you're asserting against a running binary whose behavior is the only surviving specification and which nobody on the team can fully enumerate.

Which is why the agent finding no test projects is not an anomaly; it is the defining condition of the work. The validation hierarchy in play during the day topped out around here:

L0 — Compiles. Type soundness. (Shown.)
L1 — Builds + static/SCA checks. No analyzer errors, no flagged CVEs. (Shown.)
L2 — Existing tests pass. Requires tests to exist. (Legacy estates: ~none.)
L3 — Characterization / golden-master. Capture legacy I/O, assert new output matches.
L4 — Differential testing. Run old and new against the same inputs; diff observable outputs at the contract boundary.
L5 — Property-based / metamorphic. For input domains too large to enumerate, assert invariants that must hold across the migration.

The green-build gate lives at L1. The behavioral contract lives at L3–L5. The interesting fact is that L3 doesn't require pre-existing tests. You synthesize the oracle. Characterization testing means instrumenting the legacy system, capturing representative inputs and their observed outputs, and turning that capture into the assertion baseline the modernized system must satisfy. Differential testing (L4) is the gold standard for migration: shadow real traffic to both versions and diff the responses, using the Scientist-style "run both, compare, report" pattern or HTTP-level response diffing à la Diffy. None of that is exotic. It's just not what an assess→plan→translate→build agent does, and it's what determines whether the modernization is trustworthy.
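
To make "run both, compare, report" concrete, here is a minimal sketch of the core loop. Everything in it is hypothetical: LegacyFormat and ModernFormat are toy stand-ins (a real oracle is the running legacy system or a recorded capture of its output), and Diff is an illustrative helper, not an API from Scientist, Diffy, or any other library.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical stand-ins. In a real migration, "legacy" is the running old
// system or a recorded capture of its output, not a reimplementation.
static string LegacyFormat(double v) => v.ToString("G15", CultureInfo.InvariantCulture);
static string ModernFormat(double v) => v.ToString(CultureInfo.InvariantCulture);

// Runs both implementations over the same inputs and returns every input
// whose observable outputs differ.
static List<(T Input, string Legacy, string Modern)> Diff<T>(
    IEnumerable<T> cases, Func<T, string> legacy, Func<T, string> modern)
{
    var found = new List<(T Input, string Legacy, string Modern)>();
    foreach (var item in cases)
    {
        string expected = legacy(item);
        string actual = modern(item);
        if (!string.Equals(expected, actual, StringComparison.Ordinal))
            found.Add((item, expected, actual));
    }
    return found;
}

double[] inputs = { 1.5, 0.25, 0.1 + 0.2, 100.0 };
var divergences = Diff<double>(inputs, LegacyFormat, ModernFormat);

foreach (var d in divergences)
    Console.WriteLine($"legacy={d.Legacy} modern={d.Modern}");
Console.WriteLine($"{divergences.Count} of {inputs.Length} inputs diverged");
```

The loop is trivial on purpose. The hard parts are choosing inputs that cover the behavioral surface and deciding where the contract boundary is, and neither of those shows up in a build log.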

Two contract-level traps the day's stack introduces

A green build hides these too, and they're worth calling out because they came from choices the demos endorsed.

Data API Builder changes the HTTP contract. Replacing a bespoke controller with DAB is often correct, but DAB has its own REST/GraphQL conventions: response envelope, paging tokens, $filter/$orderby semantics, status-code and error-payload shapes. "Same behavior on the backend" (a phrase used in the demo) is not the same as "same contract at the boundary."

GET /api/orders?page=2
# legacy controller
200 { "items": [...], "total": 412, "page": 2 }
# DAB
200 { "value": [...], "nextLink": "/api/orders?$after=..." }   # different envelope, paging, errors

Every client of that endpoint is now running against a changed contract that compiled perfectly on both sides. This is precisely what an L4 differential check catches, and nothing in the build does.
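
Even a crude structural diff would flag this one. The sketch below compares only the top-level property names of two response bodies shaped like the illustration above (arrays left empty, the elided paging token left elided). TopLevelKeys is a hypothetical helper; a real check would also compare status codes, error payloads, and nested shapes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Returns the top-level property names of a JSON object response.
static SortedSet<string> TopLevelKeys(string json)
{
    using JsonDocument doc = JsonDocument.Parse(json);
    return new SortedSet<string>(
        doc.RootElement.EnumerateObject().Select(p => p.Name),
        StringComparer.Ordinal);
}

// Illustrative bodies, not captured traffic.
string legacyBody = "{ \"items\": [], \"total\": 412, \"page\": 2 }";
string dabBody = "{ \"value\": [], \"nextLink\": \"/api/orders?$after=...\" }";

var legacyKeys = TopLevelKeys(legacyBody);
var dabKeys = TopLevelKeys(dabBody);

// Properties clients used to read that are gone, and new ones they ignore.
string[] missing = legacyKeys.Except(dabKeys).ToArray();
string[] added = dabKeys.Except(legacyKeys).ToArray();

Console.WriteLine("missing from new contract: " + string.Join(", ", missing));
Console.WriteLine("new in contract: " + string.Join(", ", added));
```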

Aspire's default resilience can change runtime behavior. Bringing services under Aspire orchestration pulls in the standard resilience and telemetry defaults including transient-fault retries on outbound calls. For a non-idempotent operation that the legacy app issued exactly once, a default retry policy can turn one POST into two under transient conditions. The topology change is the point of Aspire; the behavioral side effect is the thing you validate for.
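
The mechanism is easy to reproduce without any framework. This is a toy simulation, not Aspire or Polly code: CreateOrder is a hypothetical non-idempotent operation whose write commits before the response is lost, and WithRetry is a hand-rolled loop standing in for a default transient-fault retry policy.

```csharp
using System;

int attempts = 0;
int ordersCreated = 0;

// Hypothetical non-idempotent operation: the write commits, but the first
// response is lost to a transient fault, so the caller sees a failure.
void CreateOrder()
{
    attempts++;
    ordersCreated++;
    if (attempts == 1)
        throw new TimeoutException("response lost after commit");
}

// Hand-rolled retry loop standing in for a default transient-fault policy.
static void WithRetry(Action operation, int maxAttempts)
{
    for (int attempt = 1; ; attempt++)
    {
        try { operation(); return; }
        catch (TimeoutException) when (attempt < maxAttempts) { }
    }
}

WithRetry(CreateOrder, maxAttempts: 3);
Console.WriteLine($"attempts={attempts}, ordersCreated={ordersCreated}");
```

The caller sees one successful call. The system of record sees two orders. A differential or soak run that counts side effects, not just responses, is what surfaces it.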

What validate has to mean and the architecture it implies

I was a PM on Visual Basic 1.0 and was in the room for the ArtinSoft deal that became the VB Upgrade Wizard. Technically, the Wizard was a deterministic, rule-based AST transform: provably repeatable, auditable mappings, same input, same output. Its hard lesson was the residual: the constructs with no sound 1:1 mapping, where the transform had to guess or punt, and where you needed a human plus a behavioral check. Agentic translation didn't eliminate that residual. It relocated it out of an inspectable rule table and into the model's probability mass, where it's harder to see and easier to ship.

That history is why our pipeline is built the way it is, and why I think the right architecture is explicitly hybrid:

Deterministic transforms where a sound mapping exists. Lower hallucination surface, auditable, repeatable. You don't sample what you can prove.
Agentic judgment where it doesn't. The residual, the places that genuinely require reasoning.
A validation layer that trusts neither blindly. In our pipeline that's the Quality agent, and its job is L3–L4: establish behavioral baselines from the legacy system and grade the modernized output against them. When we did a very large PowerBuilder → .NET conversion, the deliverable wasn't "it builds". It was an A-grade against the original's behavior, which only means anything because there's a differential baseline behind the letter. The stabilization pipeline is the soak/regression layer on top of that: behavioral validation under sustained real conditions, not a single launch.

It's also why we start with an Assessment that reports the percentage of the application we can stand behind, measured against the behavioral surface, before any translation runs, and why we deliver a complete, validated application at a fixed price rather than handing back a green build and a residual backlog. The honest scope is a validation of the plan; the graded output is a validation of the result. All of it sits on Microsoft Foundry, which is the point: this is the layer that makes the agentic on-ramp safe to actually drive, not a competitor to it.

My final thought (for this blog, not forever)

dotnet build returning zero errors is a legit signal about the static structure of your code and a non-signal about whether it does what the system it replaced did. Agentic modernization has made the translation step fast and way cheaper, which means the entire engineering value of a migration is collapsing onto the one step nobody demoed: establishing an oracle from the legacy system and proving behavioral equivalence against it. The estates that matter have no tests, no spec, and no tolerance for silent divergence. Closing the gap between "it compiles" and "it's provably equivalent" is the problem.

If you want to see a differential validation baseline built against your own legacy behavior and an honest number for how much of the app we can stand behind, that's what an Assessment is.
