April 29, 2025 · 9 min read · Gaurav

AI Case Study: Standardizing Logging Across a Large C# Codebase

Not every problem needs an autonomous agent. Sometimes orchestration is all you need.

In large enterprise systems, inconsistent logging becomes an incident response problem long before it is recognized as an engineering quality problem.

That was the situation we were dealing with at Icertis. Production issues such as unhandled exceptions, deadlocks, race conditions, and edge-case failures were consuming engineering time, and root-cause analysis was slower than it should have been. A major reason was inconsistent logging. Some methods had no logs. Some had legacy instrumentation. Some had logs that were too vague or too technical to help support or operations teams during an incident.

Leadership asked teams to improve logging across a large and constantly changing C# codebase. The requirement was reasonable. The obvious implementation path was not.

Manually reviewing millions of lines of code to decide where logging should exist, what it should say, and how to add it safely would have turned into an unbounded cleanup effort. It would also have competed directly with feature delivery. We needed a system, not a campaign.

We built an orchestrated workflow using Roslyn, GPT-4o, SQL-backed state tracking, and build validation. It scanned the codebase, identified methods that needed better logging, generated updated method bodies, wrote the changes back into source files, validated the output, and packaged the results into small reviewable branches.

This worked because the problem was treated as a systems problem. The value did not come from asking a model to edit code. It came from building a controlled pipeline around a narrow semantic task.

1. The hard part was judgment at scale

The core problem was not inserting log statements. The core problem was deciding where logging was needed and what the logging should communicate.

In a large codebase, that judgment cannot be handled well by simple pattern matching. A useful log is not defined by syntax. It is defined by whether it improves visibility into a meaningful business operation or a failure path that matters during debugging.

That requires context. A method may already have instrumentation, but the instrumentation may be obsolete. A method may contain no logging, but may also do nothing operationally important. Another method may execute a critical business path and still expose very little runtime signal when something goes wrong.

This is where GPT-4o was useful. Not as an autonomous coding agent, and not as a developer copilot, but as a bounded semantic component inside a larger workflow. We used it to answer two narrow questions:

  1. Does this method need better logging?
  2. If yes, what should the logging say?

Those questions sound small. At scale, they are the entire problem.
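One way to keep those two questions narrow in practice is to require a structured answer and reject anything that does not fit it. The field names below are illustrative assumptions, not the project's actual schema; the point is that a bounded semantic task can have a checkable contract.

```python
import json

# Hypothetical response contract for the two classification questions.
# Field names are assumptions for illustration, not the real schema.
REQUIRED_FIELDS = {"needs_logging", "reason"}

def parse_classification(raw: str) -> dict:
    """Parse and sanity-check a model classification response."""
    result = json.loads(raw)
    missing = REQUIRED_FIELDS - result.keys()
    if missing:
        raise ValueError(f"model response missing fields: {missing}")
    if result["needs_logging"] and not result.get("updated_body"):
        # Question 2 only applies when question 1 is answered yes.
        raise ValueError("needs_logging is true but no updated_body provided")
    return result

example = '{"needs_logging": true, "reason": "critical path, no failure logs", "updated_body": "..."}'
parsed = parse_classification(example)
```

A contract like this turns "did the model answer the question?" into a mechanical check, which matters when the same call runs across thousands of methods.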

2. Roslyn gave us the inventory and control surface

The first stage of the system was a scanner built on top of the Roslyn compiler SDK.

It walked the C# solutions and extracted method-level metadata, including file path, project name, method name, line boundaries, method size, and whether the method already used legacy instrumentation or the newer logging pattern. We stored that metadata in SQL.

This design choice was foundational.

Without a durable method inventory, the workflow would have had no stable unit of work. We would have been re-parsing files repeatedly, rediscovering the same methods, and guessing about progress based on the current state of source files. That does not hold up in a real codebase.

By materializing the codebase into method-level records, we created a control plane for the workflow. Each method could move through explicit states such as analyzed, queued, generated, skipped, failed, or completed. Failures became trackable. Retries became selective instead of global. Progress became measurable.

This is a general lesson for enterprise AI systems. If the state model is weak, the workflow becomes opaque. Once that happens, debugging the automation becomes harder than the original engineering task.

3. The model was constrained because unconstrained generation is operationally expensive

Once the scanner had mapped the codebase, we used GPT-4o to classify candidate methods and generate updated method bodies.

The prompts were deliberately restrictive. The model was instructed to preserve business logic, avoid structural rewrites, replace outdated instrumentation where relevant, and generate log messages that added business context rather than low-value technical noise.

That level of constraint was necessary.

Most failures in AI-assisted code transformation are not model failures in the abstract. They are control failures. The model is given too much room, changes too much, and produces output that is harder to validate, harder to merge, and harder to review.

In this workflow, we were not optimizing for elegance. We were optimizing for bounded change.

We also used the Batch API instead of synchronous generation. That decision mattered for both cost and throughput. This was not an interactive developer workflow. It was a large-scale transformation pipeline. Running the model one method at a time in a synchronous loop would have increased cost, increased latency, and made scheduling harder.

The model’s job was semantic interpretation. The system’s job was everything else.
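The Batch API consumes JSONL files of pre-built requests, which is what makes it a natural fit for a non-interactive pipeline. The sketch below shows that request shape; the prompt wording and model parameters are condensed and illustrative, not the project's actual prompts.

```python
import json

SYSTEM_PROMPT = (  # condensed from the constraints described above; wording is illustrative
    "Preserve business logic. Do not restructure the method. "
    "Replace legacy instrumentation where relevant. "
    "Add log messages with business context, not low-value technical noise."
)

def batch_line(method_id: int, method_source: str) -> str:
    """One JSONL line in the OpenAI Batch API request format."""
    return json.dumps({
        "custom_id": f"method-{method_id}",  # lets responses be joined back to SQL rows
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": method_source},
            ],
        },
    })

line = batch_line(42, "public void PlaceOrder() { /* ... */ }")
request = json.loads(line)
```

The `custom_id` is the load-bearing detail: it is how asynchronous responses get joined back to the method inventory when results arrive.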

4. Applying changes safely was the real engineering challenge

The most difficult part of the workflow was not generation. It was writing generated changes back into source files safely.

This is where many AI code transformation systems break down.

Generating a revised method body is straightforward. Applying that revised method body into a real file that contains multiple methods, existing formatting, shifting line numbers, and potentially multiple edits is not. Once one replacement is made, offsets move. If subsequent edits depend on stale spans or naive text replacement, files get corrupted quickly.

We had to reconstruct updated files carefully while preserving formatting, indentation, and encoding. This required treating edits as structured transformations rather than simple string substitutions.

This is a pattern worth calling out directly. Teams regularly overestimate the importance of generation quality and underestimate the complexity of deterministic application. In production systems, application logic usually matters more. If it is weak, the automation is not safe enough to use.
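The offset problem can be shown in a few lines. This is a simplified character-span model, assuming non-overlapping edits; the real system worked through structured syntax-level transformations rather than raw string slicing, but the ordering principle is the same.

```python
# Once one edit is applied, every later offset shifts. Applying edits from
# the bottom of the file upward keeps the remaining spans valid.

def apply_edits(source: str, edits: list[tuple[int, int, str]]) -> str:
    """Apply (start, end, replacement) character-span edits to source.

    Assumes spans do not overlap; a real implementation should reject
    overlapping edits instead of silently corrupting the file.
    """
    # Sort by start descending so each edit leaves earlier spans untouched.
    for start, end, replacement in sorted(edits, reverse=True):
        source = source[:start] + replacement + source[end:]
    return source

text = "aaa bbb ccc"
# Two edits expressed against the ORIGINAL offsets of `text`.
result = apply_edits(text, [(0, 3, "XXXXX"), (8, 11, "Y")])
# result == "XXXXX bbb Y"
```

Applied in ascending order with the same original offsets, the second edit would land on stale positions, which is exactly how naive multi-edit application corrupts files.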

5. Validation gates kept the system credible

We did not assume generated output was safe because it looked reasonable.

After file updates were applied, we ran Roslyn and MSBuild-based compilation checks. If a file failed validation, it was excluded from the output.

This gate was basic, but essential.

Any workflow that writes changes into a production codebase needs hard boundaries around failure. Without those boundaries, the system pushes low-quality output downstream and forces human reviewers to act as the cleanup layer. Once that happens, trust in the workflow collapses.

Compilation checks do not guarantee semantic correctness. They do remove a large class of avoidable breakage. More importantly, they create a clear contract. The pipeline can propose changes, but only validated changes move forward.
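The gate itself is simple once it is stated as a contract. In the sketch below the compile check is injected as a function so the gating logic is visible on its own; the real check ran Roslyn and MSBuild, and the exclusion of failed files was recorded rather than silent.

```python
from typing import Callable

def gate(changed_files: dict[str, str],
         compiles: Callable[[str], bool]) -> tuple[dict[str, str], list[str]]:
    """Split changed files into (validated, excluded).

    Only validated files move forward; excluded files are kept as an
    explicit list so failures stay attributable, not silently dropped.
    """
    validated, excluded = {}, []
    for path, content in changed_files.items():
        if compiles(content):
            validated[path] = content
        else:
            excluded.append(path)
    return validated, excluded

# Toy checker standing in for a Roslyn/MSBuild compilation pass.
files = {"A.cs": "ok", "B.cs": "broken"}
validated, excluded = gate(files, compiles=lambda src: src != "broken")
```

The shape matters more than the check: anything downstream of `gate` can trust that it only ever sees output that passed validation.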

6. Reviewability was part of the architecture, not an afterthought

A common mistake in large-scale automation is optimizing for total transformed output while ignoring how the output will actually be consumed.

We avoided that mistake by designing for review from the beginning.

Instead of creating one large branch with hundreds of changed files, we grouped updates into batches of 10 files. For each batch, we created a Git branch and an issue describing the affected files. Engineers could review a bounded unit, make adjustments if necessary, and submit a pull request without taking on an unreasonable review burden.

This mattered for adoption.

A workflow can be technically correct and still fail if it does not fit the operating model of the engineering organization. Teams review in small units. Ownership is distributed. Risk is easier to assess when the blast radius is narrow. Branch shape, batch size, and issue structure are not packaging details. They are part of the system design.

If the final output is not reviewable, the automation has not solved the actual problem.

7. Orchestration was a better fit than an agent

There is a tendency to frame every AI engineering problem as an agent problem. In this case, that would have been the wrong design.

The workflow was already known:

  1. Scan the codebase.
  2. Identify candidate methods.
  3. Generate logging changes.
  4. Apply changes safely.
  5. Validate the output.
  6. Package the results for review.
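The six steps above can be sketched as explicit stages in a fixed sequence. The stage bodies here are placeholders; the structure is the point: each stage is a named checkpoint, so progress and failures attach to a stage rather than to opaque runtime reasoning.

```python
# Orchestration, not an agent: the sequence is fixed and auditable.
# Stage implementations are stubs; only the control structure is real.

PIPELINE = [
    ("scan",     lambda ctx: ctx | {"methods": ["M1", "M2"]}),
    ("classify", lambda ctx: ctx | {"candidates": ctx["methods"][:1]}),
    ("generate", lambda ctx: ctx | {"edits": len(ctx["candidates"])}),
    ("apply",    lambda ctx: ctx),
    ("validate", lambda ctx: ctx),
    ("package",  lambda ctx: ctx | {"batches": 1}),
]

def run(ctx: dict) -> dict:
    for name, stage in PIPELINE:
        ctx = stage(ctx)
        ctx.setdefault("completed", []).append(name)  # auditable stage log
    return ctx

result = run({})
```

A ReAct-style agent would decide this sequence at runtime on every invocation; here it is data, which is what makes the pipeline cheap to debug and easy to audit.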

There was no real ambiguity in the sequence. The difficulty was operational: scale, cost control, checkpointing, failure recovery, and reviewability.

That made orchestration the correct abstraction.

A ReAct-style agent would have added overhead without adding useful capability. It would have increased tool calls, increased token usage, introduced more behavioral variability, and weakened auditability. It also would have made debugging harder, because execution would depend more heavily on runtime reasoning instead of explicit stage boundaries.

For bulk code transformation in an enterprise codebase, determinism is usually more valuable than autonomy.

The better design was to isolate the part that required semantic judgment and keep the rest of the workflow explicit. That allowed us to track progress in SQL, retry narrow failures, skip risky cases, control generation costs, and maintain clear observability into pipeline behavior.

This distinction matters beyond this project. Many enterprise AI use cases are not agent problems. They are workflow problems with one or two semantic decision points inside them.

8. The result was leverage, not just output

In about a week, we converted an open-ended logging initiative into a repeatable engineering workflow.

The obvious result was that more methods got better logging. The more important result was that the problem changed shape.

Before the workflow, the effort was hard to scope, hard to staff, and disruptive to normal delivery. After the workflow, it became measurable and tractable. We had explicit units of work, visible pipeline state, bounded review batches, and a controlled rollout path.

That is the real value of this kind of system.

The goal is not to use AI to produce code changes. The goal is to reduce the operational cost of improving a large codebase while keeping risk under control.

9. The same pattern extends to other code quality problems

Once the workflow was working for logging, the next set of applications was obvious.

Large codebases contain many classes of issues that are individually small, operationally relevant, and expensive to clean up manually. Examples include forgotten break statements, N+1 query patterns, redundant code, dead code, and other maintainability or reliability problems.

These use cases often follow the same structure:

  1. Use static analysis to produce a structured inventory.
  2. Use a model where deterministic rules are not enough.
  3. Apply changes through a controlled transformation path.
  4. Validate aggressively.
  5. Keep humans in the approval loop.

That pattern is more durable than treating every problem as a separate AI initiative. Once the workflow architecture is sound, new use cases become extensions of the same system rather than fresh experiments.

Practical takeaways

  1. Start with the workflow, not the model. Define the unit of work, the state model, and the validation gates first.
  2. Use models for bounded semantic tasks. Do not ask them to own control flow that can be expressed deterministically.
  3. Treat apply logic as a first-class engineering problem. Safe merge and reconstruction are often harder than generation.
  4. Build explicit error surfaces. Failures should be attributable to a stage, a unit of work, and a concrete reason.
  5. Design for reviewability early. Small review batches increase trust and reduce organizational friction.
  6. Prefer orchestration over agents when the path is already known. It will usually be cheaper, easier to debug, and easier to audit.
  7. Measure success in operational terms. The real metric is not how many edits were generated. It is whether the system reduced engineering effort without increasing downstream risk.

This project worked because the AI component was narrow and the engineering around it was disciplined. That is usually the right pattern for enterprise code transformation.

Have a similar problem to solve?

Use the contact page if you want help with agent architecture, evaluation, or implementation.
