Testing whether AI can replace your Scrum Master (spoiler: it can't)
Here's a question that's been bouncing around tech Twitter: Can AI do sprint planning?
Not "can it generate user stories" or "can it estimate story points" — I mean actual, realistic sprint planning that a human Scrum Master would do. The kind where you account for developer velocity, context switching, technical debt, and all the messy reality of building software.
I decided to find out by throwing GitHub Copilot into the deep end.
The experiment: Give Copilot a legacy codebase, ask it to plan a complete rewrite using Scrum methodology, and see if the estimates match reality.
The result: It was like watching someone who read a book about Agile try to plan sprints without ever actually working in one.
Let me show you exactly where it went wrong.
Watch Video
https://youtu.be/ErwuATHHXw4?embedable=true
The Setup: A Fair(ish) Test
I gave Copilot in Visual Studio 2026 a real-world scenario:
The codebase:
- Legacy .NET application (older framework)
- Needs a complete rewrite to .NET 10 with Clean Architecture
- Domain entities, services, repositories, the whole stack
The constraints:
- 1 developer (me)
- 5 hours/day of actual coding time (realistic for senior devs with meetings)
- 2-week sprints where only 7 out of 10 days are development days
- No previous sprint velocity data (this is important later)
The task: Review the code using templates, then generate a complete sprint plan with effort estimates.
I tested two models:
- ChatGPT 5.1 Codex mini
- ChatGPT 5.1 Codex (full version)
Round 1: ChatGPT 5.1 Codex Mini — The Waterfall Disaster
The mini model gave me what it proudly called a "detailed Sprint and iterations plan."
What I got:
Sprint 1: Foundation & Domain Entities
- Set up .NET 10 project structure
- Migrate domain entities
- Configure dependency injection
Sprint 2: Repositories & Data Access
- Implement repository pattern
- Set up Entity Framework Core
- Database migrations
Sprint 3: Service Layer & Testing
- Build service layer
- Write unit tests
- Integration tests
Sprint 4: Documentation & Final Sign-Off
- API documentation
- Code documentation
- Final review and deployment
The problem?
This isn't Agile. This is textbook waterfall disguised as sprints.
Let me break down the anti-patterns:
Anti-Pattern #1: No Working Software Until Sprint 3
Sprint 1 builds entities. Sprint 2 builds repositories. Cool story — what does the user see?
Nothing. Zero. Nada.
In real Agile, every sprint should deliver something demonstrable. Even if it's just one feature end-to-end. This plan gives you infrastructure for two sprints before anything actually works.
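To make the vertical-slice idea concrete, here's a minimal sketch (in Python rather than .NET, purely illustrative, with made-up names like `Customer`): one small feature wired through all three layers the AI spread across separate sprints, so there's something demonstrable after Sprint 1.

```python
# One feature ("look up a customer") cut vertically through every layer,
# instead of building all entities, then all repositories, then all services.
from dataclasses import dataclass


@dataclass
class Customer:                  # domain entity (the AI's Sprint 1 layer)
    id: int
    name: str


class InMemoryCustomerRepo:      # repository (the AI's Sprint 2 layer)
    def __init__(self):
        self._rows = {1: Customer(1, "Ada")}

    def get(self, customer_id):
        return self._rows.get(customer_id)


class CustomerService:           # service (the AI's Sprint 3 layer)
    def __init__(self, repo):
        self._repo = repo

    def display_name(self, customer_id):
        customer = self._repo.get(customer_id)
        return customer.name if customer else "unknown"


# End-to-end and demoable after one sprint, not three:
service = CustomerService(InMemoryCustomerRepo())
print(service.display_name(1))  # prints "Ada"
```

It's a toy, but the point stands: every layer exists for one feature, so the sprint review has something to show.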
Anti-Pattern #2: Testing is a Separate Phase
"Sprint 3: Write unit tests."
Excuse me, what?
In Agile, tests are written during development, not after. This is literally the waterfall "testing phase" approach that Agile was invented to replace.
Anti-Pattern #3: Documentation Sprint
Sprint 4 is just documentation and sign-off. No new features, no bug fixes, just... paperwork.
This is what happens when you give AI a list of tasks and ask it to group them into timeboxes without understanding why we timebox in the first place.
Verdict: ChatGPT 5.1 Codex mini gets an F in Agile 101.
It doesn't understand the philosophy. It's just playing Mad Libs with sprint terminology.
Round 2: ChatGPT 5.1 Codex (Full) — Better, But Still Wrong
Okay, maybe the mini model was too... mini. Let's try the full version.
This time, I split it into two steps:
- Code review first
- Sprint planning second
The Code Review
Cost: 3 premium requests (approximately $0.45 in API credits)
The review was actually decent. It identified:
- Architectural patterns in use
- Technical debt hotspots
- Complexity metrics
- Areas needing refactoring
No complaints here. This part worked.
The Sprint Plan
This is where things got interesting. The full model generated:
Definition of Done:
"Code reviewed, tested, integrated, and deployed to staging."
Okay, that's... generic. Every sprint has the same DoD? What about feature-specific acceptance criteria?
Definition of Ready:
"Requirements clear, dependencies identified, estimates agreed upon."
Cool. Standard Scrum terminology. Sounds professional. But also completely useless without specifics.
Sprint 1 Breakdown: The Devil in the Details
Sprint Goal: "Establish .NET 10 Clean Architecture foundation."
Tasks:
- Set up project structure (4 hours)
- Configure dependency injection (3 hours)
- Implement base domain entities (8 hours)
- Set up logging framework (2 hours)
- Configure application settings (1 hour)
- Initial database schema (4 hours)
Total: 22 hours across 7 working days
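A quick sanity check on those numbers, using only the constraints stated earlier (5 focused hours/day, 7 development days per sprint):

```python
# Sprint capacity vs. the AI's Sprint 1 plan, using the constraints
# from this experiment: 1 dev, 5 hours/day, 7 dev days per 2-week sprint.
HOURS_PER_DAY = 5
DEV_DAYS_PER_SPRINT = 7

capacity = HOURS_PER_DAY * DEV_DAYS_PER_SPRINT  # 35 hours per sprint

# The AI's Sprint 1 task estimates, in hours
tasks = {
    "project structure": 4,
    "dependency injection": 3,
    "domain entities": 8,
    "logging framework": 2,
    "application settings": 1,
    "database schema": 4,
}
planned = sum(tasks.values())  # 22 hours

print(f"Capacity: {capacity} h, planned: {planned} h "
      f"({planned / capacity:.0%} utilization)")
```

So the plan fills about 63% of the available 35 hours. On paper that's a sensible buffer. In practice, as we'll see, the work took less than half the estimate anyway.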
At first glance, this looks reasonable. But let me tell you why it's not.
The Reality Check
I actually did this rewrite. Here's what the AI missed:
What the AI got right (about 80%):
- The basic tasks exist
- Hour estimates aren't completely insane
- Logical grouping of related work
What the AI completely missed:
- Automation wasn't factored in properly
  - I told Copilot some tasks would be automated
  - Actual time for project setup + DI + entities: ~10 hours max (by day 3)
  - AI estimated 22 hours (entire sprint)
- No actual domain logic
  - Sprint 1 and Sprint 2 were just "convert old entities to new format."
  - This is repetitive labor, not building features
  - Where's the business logic? Where's the value?
- The dreaded end-of-sprint testing phase
  - Milestone: "Service & Test Coverage" at the end of Sprint 3
  - There it is again — testing as a separate phase
  - This is Agile cosplay, not Agile
Sprint 2: More of the Same
Sprint 2 continued the entity conversion work. Still no working features. Still no demonstrable value.
The pattern: The AI treated this like a checklist migration, not a product rewrite.
The Fundamental Problem: AI Doesn't Understand Context
Here's what became painfully obvious:
AI is great at:
- Listing tasks
- Using correct Scrum terminology
- Making estimates that sound plausible
- Generating structured output
AI is terrible at:
- Understanding why you're rewriting the code
- Knowing the business domain
- Judging the complexity of logic vs. boilerplate
- Factoring in real-world developer constraints
- Planning for value delivery instead of task completion
The Missing Pieces
Even a human developer wouldn't know all the details without poking around the codebase for a few days. But a human would:
1. Ask clarifying questions
   - "What's the most critical feature to deliver first?"
   - "Are there any risky technical assumptions?"
   - "What's the user-facing priority?"
2. Plan for vertical slices
   - Pick one feature end-to-end
   - Build it through all layers
   - Get feedback
3. Adjust based on sprint velocity
   - Oh, wait, I didn't give the AI historical velocity data
   - (This was intentional — most teams don't have clean velocity metrics anyway)
4. Account for unknowns
   - Buffer time for unexpected complexity
   - Technical debt discovered mid-sprint
   - Dependency issues
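Points 3 and 4 boil down to a simple heuristic a human planner applies almost unconsciously: average what the team actually delivered recently, then hold some of it back for surprises. A toy sketch (the velocities and 20% buffer are hypothetical, not from this experiment):

```python
# Toy velocity-based commitment: average recent sprint throughput,
# then reserve a buffer for unknowns discovered mid-sprint.
def commit_capacity(recent_velocities, buffer_fraction=0.2):
    """Return how much work to commit to next sprint."""
    average = sum(recent_velocities) / len(recent_velocities)
    return average * (1 - buffer_fraction)


# Three prior sprints delivered 30, 26, and 34 units of work;
# average is 30, and 20% is held back for surprises.
print(commit_capacity([30, 26, 34]))
```

None of this is sophisticated. The point is that the AI, with no velocity history and no prompting to ask for one, never even attempted it.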
Was This Test Fair?
Short answer: No.
Long answer: That's the point.
Real-world sprint planning is done by:
- Developers who know the codebase
- Teams with historical velocity data
- Planning poker sessions where multiple people debate estimates
- Scrum Masters who understand team capacity and context
I gave Copilot:
- Code it had never seen before
- No velocity data
- No team context
- No business domain knowledge
But guess what? That's actually a realistic scenario for:
- New teams
- New projects
- Consultants brought in mid-project
- Startups without established processes
And in those scenarios, human developers still do better than AI because they:
- Ask questions
- Make assumptions explicit
- Plan iteratively
- Adjust based on feedback
The Fluff Factor: AI's Secret Weapon
Here's something I noticed across both models:
AI generates a LOT of impressive-sounding fluff.
- "Definition of Done"
- "Definition of Ready"
- "Sprint Goals"
- "Milestones"
- "Acceptance Criteria"
It all looks professional. It sounds like someone who knows Agile wrote it.
But when you actually analyze the content:
- Definitions are too vague to be useful
- Goals don't align with value delivery
- Milestones reveal waterfall thinking
- Acceptance criteria are missing or generic
It's the software development equivalent of a student padding a paper to hit the word count.
What AI Actually Did Well
To be fair, there were useful outputs:
1. Task Decomposition
Breaking "rewrite application" into concrete steps is helpful, even if the sequencing was wrong.
2. Hour Estimates (Sort Of)
Individual task estimates weren't terrible — they were just missing automation factors and developer experience.
3. Structured Output
Having everything in a standardized format is better than nothing.
4. Starting Point for Discussion
If a human reviewed this plan, they could quickly identify and fix the issues.
So AI sprint planning isn't useless — it's just not autonomous.
The Real Use Case: AI as a Junior PM
Here's where I landed:
Don't ask AI to do sprint planning.
Ask AI to draft a sprint plan that a human will review and fix.
The workflow should be:
- AI generates an initial plan
- Human identifies anti-patterns
- Human adjusts for reality
- Team discusses and commits
This is basically how you'd work with a junior PM who:
- Knows the theory
- Doesn't have domain experience
- Needs supervision
And that's fine! Junior PMs are valuable. They do the grunt work of structuring information. Then senior people refine it.
My Recommendations
If you're considering using AI for sprint planning:
DO:
- Use it to generate task lists
- Let it estimate individual tasks
- Have it structure a backlog of items
- Generate templates and frameworks
DON'T:
- Trust its understanding of Agile philosophy
- Accept sprint sequencing without review
- Assume estimates account for your context
- Skip human validation
DEFINITELY DON'T:
- Use it as a replacement for experienced PMs
- Let it make the final decisions on priorities
- Trust it with value-based planning
The Bigger Picture: What This Means
This experiment reveals something important about current AI limitations:
AI is great at pattern matching, terrible at pattern breaking.
Sprint planning requires:
- Understanding trade-offs
- Challenging assumptions
- Adjusting based on context
- Optimizing for human factors
These are all areas where AI struggles because they require:
- Domain expertise
- Emotional intelligence
- Strategic thinking
- Real-world experience
The models are getting better at generating correct syntax. They're not getting better at understanding semantics.
Final Verdict
Can AI do realistic sprint planning?
No.
Not yet. Maybe not ever.
Sprint planning isn't just about dividing work into timeboxes. It's about:
- Understanding team capacity
- Managing risk
- Delivering value iteratively
- Adapting to feedback
AI can help with the mechanical parts. But the strategic thinking that makes Agile actually work?
That still requires humans.
TL;DR:
- Tested GitHub Copilot for sprint planning a .NET rewrite
- ChatGPT 5.1 Codex mini produced a pure waterfall disguised as Agile
- ChatGPT 5.1 Codex (full) was better but still missed critical context
- AI generates impressive-sounding fluff without strategic thinking
- Use AI as a junior PM who needs supervision, not as a replacement
- Sprint planning requires human judgment AI doesn't have
Hot take: If your sprint planning can be automated by AI, your process is probably too rigid anyway.
Learn About Spec-Driven Development
https://www.youtube.com/watch?v=0atkW_janVg&list=PLphsQTGN5DbJnaiy-89QitCMkg-8toQac&embedable=true
