GitHub Copilot Did a Code Review on the Code It Helped Me Write: Here's What Happened

Written by incompletedeveloper | Published 2026/03/18
Tech Story Tags: ai-coding | code-review | github-copilot | claude-sonnet-4.5 | ai-code-review | ai-coding-tools | ai-coding-assistants | ai-assisted-coding

TL;DR: Default AI code reviews are broken by design — LLMs are trained to be encouraging, so without guidance they will call a structurally bankrupt codebase "well-structured with clear domain modeling." The fix is a weighted scorecard: a markdown document that defines evaluation dimensions, assigns domain-specific weights, calibrates what good looks like at each score level, and forces the model to commit to numbers. Without it, you get flattery. With it, the same model scored a legacy .NET codebase 15/100 and recommended a full rewrite. The rewritten codebase scored substantially higher. Build the scorecard before you review, attach it as a file rather than pasting it into the prompt, require numerical output, and run it more than once — AI reviews are non-deterministic and the variance is meaningful.

Why default AI code reviews produce confidently wrong results — and how a weighted scorecard fixes that.

I used GitHub Copilot — Claude Sonnet 4.5 inside Visual Studio 2026 — to do a full rewrite of a .NET application I originally wrote in 2013. Copilot did a significant share of the work. Once the rewrite was done, I asked it to review what it had helped build.

The review process turned out to be more instructive than the rewrite itself.

Watch the video: https://youtu.be/omDvFGu8Vtc


Background: The Application

Lottron2000 is a lottery simulator. Originally written in 2013, untouched for over a decade. When I reopened it, the codebase had every problem you would expect from a solo .NET project from that era:

  • Static classes throughout — nothing injectable, nothing testable in isolation
  • No separation of concerns — data access and domain logic in the same project, often the same class
  • Pass-through layers — classes whose only job was to call the next layer down
  • Zero tests

The rewrite produced a clean architecture solution: separated projects for domain, application services, infrastructure, and presentation, with dependency injection throughout. That rewrite is not what this article is about. The review is.


The Problem With Default AI Code Reviews

Before the rewrite, I ran a simple open-ended Copilot review of the original codebase. The prompt: *Review this code and give me an assessment.*

Copilot's conclusion:

"Lottron2000 is a well-structured lottery simulation system with clear domain modeling and separation of concerns. The core business logic is sound."

Every part of that was wrong. No domain modeling. No separation of concerns. No architecture to speak of. And yet the review read like feedback written to avoid upsetting someone.

This is not a Copilot-specific failure. It is a direct consequence of how LLMs are trained — optimized to be helpful and non-demoralizing. In a code review context, this produces systematically inflated output. Three failure modes are consistently at work.

Failure Mode 1: The Politeness Bias

The model defaults to encouragement. It will frame a missing architecture as "a lean, focused structure." It will frame zero tests as "logic that appears straightforward to verify manually." The worse the codebase, the harder the model works to find something worth praising.

Failure Mode 2: No Domain Context

The model does not know what the application is supposed to be. A financial system has different quality priorities than a CRUD admin panel. A public API has different security requirements than an internal desktop tool. Without domain context, the model applies a generic rubric — and generic rubrics miss what actually matters.

Failure Mode 3: The Architecture Blindspot

Default reviews operate at the class level: method length, naming conventions, and null handling. They rarely examine how components relate to each other — whether projects are correctly separated, whether abstractions are leaking, whether the solution structure fits the problem. Architectural problems are the most expensive to fix and the most likely to be glossed over.


The Scorecard: What It Is and How to Build One

The fix is to give the model a structured evaluation framework before requesting the review. A scorecard is a markdown document that defines four things:

  • The dimensions being evaluated
  • The weight assigned to each dimension
  • What good looks like in each category
  • What the scoring scale means at each level

The model is then instructed to apply the scorecard and return a numerically scored assessment per dimension.
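Concretely, a minimal scorecard skeleton might look like the following. The dimensions and weights below come from later in this article; the calibration entries and output-format section are an illustrative sketch, not the exact document used in this project:

```markdown
# Code Review Scorecard

Score each dimension 0-100, then apply the weight. Report the per-dimension
score, the weighted contribution, and the weighted total out of 100.

## Dimensions and Weights
| Dimension       | Weight |
|-----------------|--------|
| Testing         | 20%    |
| Architecture    | 20%    |
| Code Quality    | 20%    |
| Maintainability | 15%    |
| Programming     | 15%    |
| Security        | 10%    |

## Calibration example: Testing
- 90 — comprehensive unit tests for all business logic, integration tests
  for data access, tests named to document intent.
- 50 — tests exist for core paths only; little protection during refactoring.
- 30 — happy-path tests only, tightly coupled to implementation rather
  than behavior.

## Output Format
For each dimension: a numerical score, a short justification, and the top
finding. End with the weighted total and an overall recommendation.
```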

Choosing Your Dimensions

Dimensions should map to the real cost categories in your specific project. For most .NET applications, these six cover the critical ground:

Testing — unit tests, integration tests, coverage meaningful to the domain, and whether tests are structured to survive refactoring. Not just whether tests exist, but whether they provide genuine confidence.

Architecture — solution structure, project separation, dependency direction, adherence to chosen patterns (clean architecture, vertical slice, etc.), and whether the structure will hold up as the codebase grows.

Code Quality — readability at the class and method level: naming, method length, cyclomatic complexity, and documentation where it adds value. This is what most developers think of first. It matters, but it is not the top priority.

Maintainability — distinct from code quality. A codebase can be readable in isolation, but expensive to change safely. This covers coupling, cohesion, and whether a change in one area predictably breaks things in another.

Programming — correct use of language features, framework conventions, and established patterns. In C#, this means proper use of async/await, LINQ, exception handling strategy, and avoiding unnecessary allocations.

Security — authentication surface, input validation, data exposure, dependency vulnerability hygiene. Weight this heavily for anything with a login or external API surface. Weight it lightly for an isolated desktop tool.

Setting the Weights

The weights must reflect the actual priorities of the project, not an abstract ideal. For Lottron2000 — a CRUD-heavy lottery simulator with no public surface area — the weights I used were:

  • Testing — 20% (untested code is unverifiable code)
  • Architecture — 20% (structural problems compound over time)
  • Code Quality — 20% (readability is a maintenance cost)
  • Maintainability — 15% (long-term cost of change)
  • Programming — 15% (correct use of language features and patterns)
  • Security — 10% (lower weight for a non-exposed standalone application)

The number to pay attention to is Programming at 15%. Most developers implicitly treat writing code as the primary quality dimension. This scorecard caps it at 15% — because syntactically clean code inside a broken architecture is still a broken architecture.

One honest limitation: the weight on data access felt too low for a CRUD application. Lottron2000 reads and writes lottery draw history — data access correctness and performance are real concerns. A more accurate scorecard would break out data access as its own dimension. This is the first refinement for the next iteration.
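The arithmetic behind the weighted total is simple but worth making explicit, since the model is expected to report it. A sketch, using the weights above — the per-dimension scores here are hypothetical, chosen to resemble a weak legacy codebase:

```python
# Weighted scorecard total: each dimension is scored 0-100 by the reviewer,
# then scaled by its weight so the final result is also out of 100.

WEIGHTS = {
    "Testing": 0.20,
    "Architecture": 0.20,
    "Code Quality": 0.20,
    "Maintainability": 0.15,
    "Programming": 0.15,
    "Security": 0.10,
}

def weighted_total(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-100) into a weighted total (0-100)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(scores[dim] * w for dim, w in WEIGHTS.items())

# Hypothetical per-dimension scores for a weak legacy codebase:
legacy = {
    "Testing": 0,           # zero tests
    "Architecture": 10,     # static classes, no separation of concerns
    "Code Quality": 30,
    "Maintainability": 15,
    "Programming": 25,
    "Security": 20,
}
print(f"{weighted_total(legacy):.1f}/100")  # prints 16.0/100
```

Note how a codebase can score 25-30 on individual dimensions and still land near the bottom overall: the heavy weights on Testing and Architecture dominate the total, which is exactly the point of weighting them.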

Defining What Good Looks Like

This step is as important as the weights. Without it, the model scores against its own implicit standard, which drifts between runs. Write a concrete description of what a 90%, 70%, 50%, and 30% score looks like in each category.

For Testing, a 90% score might mean: comprehensive unit tests for all business logic, integration tests for data access, tests named to document intent, and coverage meaningful to the domain.

A 30% score might mean: some tests exist but cover only the happy path, no integration tests, tests tightly coupled to implementation rather than behavior.

Write these calibration descriptions for each dimension. They are the difference between a scorecard that produces consistent, honest output and one that drifts.

The Prompting Technique

Once the scorecard is ready, prompt structure matters:

  1. Attach the scorecard as a file in Copilot's context — not pasted into the prompt. The model treats attached files as persistent reference material. Instructions in the prompt body are easier to drift away from.
  2. Attach a solution structure template alongside it. This gives the model a concrete standard for evaluating project organization rather than inferring one.
  3. Require explicit numerical scores per dimension in the output. Without this, the model will discuss dimensions qualitatively and avoid committing to numbers. A paragraph about "areas for improvement in testability" is easy to dismiss. A score of 11 out of 20 is not.
  4. Ask for refactoring recommendations with effort estimates. This converts the review from an assessment into an actionable backlog.
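Put together, the prompt itself can stay short because the attached files carry the detail. A hypothetical example — the file names are illustrative, not the exact files used in this project:

```text
Review this solution against the attached scorecard (scorecard.md) and the
attached solution structure template (solution-structure.md).

For each scorecard dimension, return:
1. A numerical score against the dimension's weighted maximum
2. The specific findings that drove the score
3. Refactoring recommendations, each with an effort estimate in days

End with the weighted total out of 100 and an overall recommendation.
```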

The Results

Old Codebase: 15 out of 100

Running the scorecard against the original 2013 codebase produced a score of 15 out of 100. The model's verdict when forced to commit to a number: technically bankrupt. Recommendations:

  • Do not attempt incremental remediation
  • Technical debt is structural, not cosmetic
  • Rewrite using modern architecture
  • Use the original as a reference only

Compare that to the open-ended review conclusion: "well-structured with clear domain modeling." Same model. Same codebase. Completely different output — because the scorecard gave the model a framework that made it impossible to default to encouragement.

New Codebase: A Strong Result

The rewritten codebase scored substantially higher across all six dimensions. The review produced scored assessments per dimension, specific refactoring recommendations, and effort estimates — total remaining work approximately two and a half weeks.


A Note on Determinism

The same scorecard prompt will not produce the same output twice. During this process, one run omitted numerical scores entirely and had to be re-prompted. Another run interpreted the architecture and code quality dimensions differently enough to shift the scores meaningfully.

This is not a bug. LLMs reason through problems rather than retrieving cached answers. Build your process around that reality:

  • Run the scorecard review more than once and synthesize the themes across runs
  • Do not depend on a consistent output format without explicit format instructions in the prompt
  • Treat the output as analytical input to a human decision, not as a definitive audit result
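One way to operationalize the "run it more than once" advice: collect the per-dimension scores from several runs and look at the spread as well as the mean. A sketch — all the scores here are hypothetical:

```python
from statistics import mean, pstdev

# Per-dimension scores (out of 100) from three independent scorecard runs.
# Values are hypothetical; the point is the spread, not the numbers.
runs = [
    {"Testing": 70, "Architecture": 80, "Code Quality": 75},
    {"Testing": 65, "Architecture": 60, "Code Quality": 78},
    {"Testing": 72, "Architecture": 85, "Code Quality": 74},
]

for dim in runs[0]:
    scores = [run[dim] for run in runs]
    avg, spread = mean(scores), pstdev(scores)
    # A large spread means the runs disagree: read those findings closely
    # rather than trusting any single number.
    flag = "  <- high variance, re-examine" if spread > 8 else ""
    print(f"{dim:14s} mean={avg:5.1f} stdev={spread:4.1f}{flag}")
```

In this fabricated data, Architecture swings from 60 to 85 across runs — exactly the kind of dimension where the qualitative findings from each run matter more than the averaged score.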

The Bottom Line

Default AI code reviews are not just unhelpful — they are confidently wrong in a specific direction. They are optimized to avoid demoralizing you, which means they will consistently understate architectural problems, ignore missing tests, and find something positive to say about a codebase that deserves none.

A scorecard does not fix the model. It fixes the context in which the model is operating. Define the dimensions, set domain-appropriate weights, calibrate what good looks like at each scoring level, and require numerical output. The model will do the rest — honestly, this time.


📺 Watch the Full Series

Episode 1: GitHub Copilot AI Code Review – Can AI Understand Legacy .NET Code? https://youtu.be/P26t5EVz70U

Episode 2: Creating .NET Projects and Solution Structure https://youtu.be/Vf0yULOHY3I

Episode 3: Legacy Code Rewrite – Random Number Generator https://youtu.be/6DuaW9VjQa8

Episode 4: Working Without Agent Skills in Visual Studio 2026 https://youtu.be/dznUGMNhqSU

Episode 5: Vibe Coding Razor Pages https://youtu.be/sQdByQML_w8

Episode 6: Code Review (this article) https://youtu.be/omDvFGu8Vtc


Written by incompletedeveloper | .NET C# developer writing about AI-assisted software development, focused on how modern tools change productivity
Published by HackerNoon on 2026/03/18