Testing whether AI can replace your Scrum Master (spoiler: it can't)
Here's a question that's been bouncing around tech Twitter: Can AI do sprint planning?
Not "can it generate user stories" or "can it estimate story points" — I mean actual, realistic sprint planning that a human Scrum Master would do. The kind where you account for developer velocity, context switching, technical debt, and all the messy reality of building software.
I decided to find out by throwing GitHub Copilot into the deep end.
The experiment: Give Copilot a legacy codebase, ask it to plan a complete rewrite using Scrum methodology, and see if the estimates match reality.
The result: It was like watching someone who read a book about Agile try to plan sprints without ever actually working in one.
Let me show you exactly where it went wrong.
Watch Video
https://youtu.be/ErwuATHHXw4?embedable=true
The Setup: A Fair(ish) Test
I gave Copilot in Visual Studio 2026 a real-world scenario:
The codebase:
- Legacy .NET application (older framework)
- Needs a complete rewrite to .NET 10 with Clean Architecture
- Domain entities, services, repositories, the whole stack
The constraints:
- 1 developer (me)
- 5 hours/day of actual coding time (realistic for senior devs with meetings)
- 2-week sprints where only 7 out of 10 days are development days
- No previous sprint velocity data (this is important later)
The task: Review the code using templates, then generate a complete sprint plan with effort estimates.
I tested two models:
- ChatGPT 5.1 Codex mini
- ChatGPT 5.1 Codex (full version)
Round 1: ChatGPT 5.1 Codex Mini — The Waterfall Disaster
The mini model gave me what it proudly called a "detailed Sprint and iterations plan."
What I got:
Sprint 1: Foundation & Domain Entities
- Set up .NET 10 project structure
- Migrate domain entities
- Configure dependency injection
Sprint 2: Repositories & Data Access
- Implement repository pattern
- Set up Entity Framework Core
- Database migrations
Sprint 3: Service Layer & Testing
- Build service layer
- Write unit tests
- Integration tests
Sprint 4: Documentation & Final Sign-Off
- API documentation
- Code documentation
- Final review and deployment
The problem?
This isn't Agile. This is textbook waterfall disguised as sprints.
Let me break down the anti-patterns:
Anti-Pattern #1: No Working Software Until Sprint 3
Sprint 1 builds entities. Sprint 2 builds repositories. Cool story — what does the user see?
Nothing. Zero. Nada.
In real Agile, every sprint should deliver something demonstrable. Even if it's just one feature end-to-end. This plan gives you infrastructure for two sprints before anything actually works.
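To make the vertical-slice idea concrete, here's a minimal sketch (in Python rather than .NET, purely illustrative, with made-up names like `Customer`): one small feature wired through all three layers the AI spread across separate sprints, so there's something demonstrable after Sprint 1.

```python
# One feature ("look up a customer") cut vertically through every layer,
# instead of building all entities, then all repositories, then all services.
from dataclasses import dataclass


@dataclass
class Customer:                  # domain entity (the AI's Sprint 1 layer)
    id: int
    name: str


class InMemoryCustomerRepo:      # repository (the AI's Sprint 2 layer)
    def __init__(self):
        self._rows = {1: Customer(1, "Ada")}

    def get(self, customer_id):
        return self._rows.get(customer_id)


class CustomerService:           # service (the AI's Sprint 3 layer)
    def __init__(self, repo):
        self._repo = repo

    def display_name(self, customer_id):
        customer = self._repo.get(customer_id)
        return customer.name if customer else "unknown"


# End-to-end and demoable after one sprint, not three:
service = CustomerService(InMemoryCustomerRepo())
print(service.display_name(1))  # prints "Ada"
```

It's a toy, but the point stands: every layer exists for one feature, so the sprint review has something to show.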
Anti-Pattern #2: Testing is a Separate Phase
"Sprint 3: Write unit tests."
Excuse me, what?
In Agile, tests are written during development, not after. This is literally the waterfall "testing phase" approach that Agile was invented to replace.
Anti-Pattern #3: Documentation Sprint
Sprint 4 is just documentation and sign-off. No new features, no bug fixes, just... paperwork.
This is what happens when you give AI a list of tasks and ask it to group them into timeboxes without understanding why we timebox in the first place.
Verdict: ChatGPT 5.1 Codex mini gets an F in Agile 101.
It doesn't understand the philosophy. It's just playing Mad Libs with sprint terminology.
Round 2: ChatGPT 5.1 Codex (Full) — Better, But Still Wrong
Okay, maybe the mini model was too... mini. Let's try the full version.
This time, I split it into two steps:
- Code review first
- Sprint planning second
The Code Review
Cost: 3 premium requests (approximately $0.45 in API credits)
The review was actually decent. It identified:
- Architectural patterns in use
- Technical debt hotspots
- Complexity metrics
- Areas needing refactoring
No complaints here. This part worked.
The Sprint Plan
This is where things got interesting. The full model generated:
Definition of Done:
"Code reviewed, tested, integrated, and deployed to staging."
Okay, that's... generic. Every sprint has the same DoD? What about feature-specific acceptance criteria?
Definition of Ready:
"Requirements clear, dependencies identified, estimates agreed upon."
Cool. Standard Scrum terminology. Sounds professional. But also completely useless without specifics.
Sprint 1 Breakdown: The Devil in the Details
Sprint Goal: "Establish .NET 10 Clean Architecture foundation."
Tasks:
- Set up project structure (4 hours)
- Configure dependency injection (3 hours)
- Implement base domain entities (8 hours)
- Set up logging framework (2 hours)
- Configure application settings (1 hour)
- Initial database schema (4 hours)
Total: 22 hours across 7 working days
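A quick sanity check on those numbers, using only the constraints stated earlier (5 focused hours/day, 7 development days per sprint):

```python
# Sprint capacity vs. the AI's Sprint 1 plan, using the constraints
# from this experiment: 1 dev, 5 hours/day, 7 dev days per 2-week sprint.
HOURS_PER_DAY = 5
DEV_DAYS_PER_SPRINT = 7

capacity = HOURS_PER_DAY * DEV_DAYS_PER_SPRINT  # 35 hours per sprint

# The AI's Sprint 1 task estimates, in hours
tasks = {
    "project structure": 4,
    "dependency injection": 3,
    "domain entities": 8,
    "logging framework": 2,
    "application settings": 1,
    "database schema": 4,
}
planned = sum(tasks.values())  # 22 hours

print(f"Capacity: {capacity} h, planned: {planned} h "
      f"({planned / capacity:.0%} utilization)")
```

So the plan fills about 63% of the available 35 hours. On paper that's a sensible buffer. In practice, as we'll see, the work took less than half the estimate anyway.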
At first glance, this looks reasonable. But let me tell you why it's not.
The Reality Check
I actually did this rewrite. Here's what the AI missed:
What the AI got right (about 80%):
- The basic tasks exist
- Hour estimates aren't completely insane
- Logical grouping of related work
What the AI completely missed:
- Automation wasn't factored in properly
  - I told Copilot some tasks would be automated
  - Actual time for project setup + DI + entities: ~10 hours max (by day 3)
  - AI estimated 22 hours (entire sprint)
- No actual domain logic
  - Sprint 1 and Sprint 2 were just "convert old entities to new format."
  - This is repetitive labor, not building features
  - Where's the business logic? Where's the value?
- The dreaded end-of-sprint testing phase
  - Milestone: "Service & Test Coverage" at the end of Sprint 3
  - There it is again — testing as a separate phase
  - This is Agile cosplay, not Agile
Sprint 2: More of the Same
Sprint 2 continued the entity conversion work. Still no working features. Still no demonstrable value.
The pattern: The AI treated this like a checklist migration, not a product rewrite.
The Fundamental Problem: AI Doesn't Understand Context
Here's what became painfully obvious:
AI is great at:
- Listing tasks
- Using correct Scrum terminology
- Making estimates that sound plausible
- Generating structured output
AI is terrible at:
- Understanding why you're rewriting the code
- Knowing the business domain
- Judging the complexity of logic vs. boilerplate
- Factoring in real-world developer constraints
- Planning for value delivery instead of task completion
The Missing Pieces
Even a human developer wouldn't know all the details without poking around the codebase for a few days. But a human would:
1. Ask clarifying questions
   - "What's the most critical feature to deliver first?"
   - "Are there any risky technical assumptions?"
   - "What's the user-facing priority?"
2. Plan for vertical slices
   - Pick one feature end-to-end
   - Build it through all layers
   - Get feedback
3. Adjust based on sprint velocity
   - Oh, wait, I didn't give the AI historical velocity data
   - (This was intentional — most teams don't have clean velocity metrics anyway)
4. Account for unknowns
   - Buffer time for unexpected complexity
   - Technical debt discovered mid-sprint
   - Dependency issues
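Points 3 and 4 boil down to a simple heuristic a human planner applies almost unconsciously: average what the team actually delivered recently, then hold some of it back for surprises. A toy sketch (the velocities and 20% buffer are hypothetical, not from this experiment):

```python
# Toy velocity-based commitment: average recent sprint throughput,
# then reserve a buffer for unknowns discovered mid-sprint.
def commit_capacity(recent_velocities, buffer_fraction=0.2):
    """Return how much work to commit to next sprint."""
    average = sum(recent_velocities) / len(recent_velocities)
    return average * (1 - buffer_fraction)


# Three prior sprints delivered 30, 26, and 34 units of work;
# average is 30, and 20% is held back for surprises.
print(commit_capacity([30, 26, 34]))
```

None of this is sophisticated. The point is that the AI, with no velocity history and no prompting to ask for one, never even attempted it.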
Was This Test Fair?
Short answer: No.
Long answer: That's the point.
Real-world sprint planning is done by:
- Developers who know the codebase
- Teams with historical velocity data
- Planning poker sessions where multiple people debate estimates
- Scrum Masters who understand team capacity and context
I gave Copilot:
- Code it had never seen before
- No velocity data
- No team context
- No business domain knowledge
But guess what? That's actually a realistic scenario for:
- New teams
- New projects
- Consultants brought in mid-project
- Startups without established processes
And in those scenarios, human developers still do better than AI because they:
- Ask questions
- Make assumptions explicit
- Plan iteratively
- Adjust based on feedback
The Fluff Factor: AI's Secret Weapon
Here's something I noticed across both models:
AI generates a LOT of impressive-sounding fluff.
- "Definition of Done"
- "Definition of Ready"
- "Sprint Goals"
- "Milestones"
- "Acceptance Criteria"
It all looks professional. It sounds like someone who knows Agile wrote it.
But when you actually analyze the content:
- Definitions are too vague to be useful
- Goals don't align with value delivery
- Milestones reveal waterfall thinking
- Acceptance criteria are missing or generic
It's the software development equivalent of a student padding a paper to hit the word count.
What AI Actually Did Well
To be fair, there were useful outputs:
1. Task Decomposition
Breaking "rewrite application" into concrete steps is helpful, even if the sequencing was wrong.
2. Hour Estimates (Sort Of)
Individual task estimates weren't terrible — they were just missing automation factors and developer experience.
3. Structured Output
Having everything in a standardized format is better than nothing.
4. Starting Point for Discussion
If a human reviewed this plan, they could quickly identify and fix the issues.
So AI sprint planning isn't useless — it's just not autonomous.
The Real Use Case: AI as a Junior PM
Here's where I landed:
Don't ask AI to do sprint planning.
Ask AI to draft a sprint plan that a human will review and fix.
The workflow should be:
- AI generates an initial plan
- Human identifies anti-patterns
- Human adjusts for reality
- Team discusses and commits
This is basically how you'd work with a junior PM who:
- Knows the theory
- Doesn't have domain experience
- Needs supervision
And that's fine! Junior PMs are valuable. They do the grunt work of structuring information. Then senior people refine it.
My Recommendations
If you're considering using AI for sprint planning:
DO:
- Use it to generate task lists
- Let it estimate individual tasks
- Have it structure a backlog of items
- Generate templates and frameworks
DON'T:
- Trust its understanding of Agile philosophy
- Accept sprint sequencing without review
- Assume estimates account for your context
- Skip human validation
DEFINITELY DON'T:
- Use it as a replacement for experienced PMs
- Let it make the final decisions on priorities
- Trust it with value-based planning
The Bigger Picture: What This Means
This experiment reveals something important about current AI limitations:
AI is great at pattern matching, terrible at pattern breaking.
Sprint planning requires:
- Understanding trade-offs
- Challenging assumptions
- Adjusting based on context
- Optimizing for human factors
These are all areas where AI struggles because they require:
- Domain expertise
- Emotional intelligence
- Strategic thinking
- Real-world experience
The models are getting better at generating correct syntax. They're not getting better at understanding semantics.
Final Verdict
Can AI do realistic sprint planning?
No.
Not yet. Maybe not ever.
Sprint planning isn't just about dividing work into timeboxes. It's about:
- Understanding team capacity
- Managing risk
- Delivering value iteratively
- Adapting to feedback
AI can help with the mechanical parts. But the strategic thinking that makes Agile actually work?
That still requires humans.
TL;DR:
- Tested GitHub Copilot for sprint planning a .NET rewrite
- ChatGPT 5.1 Codex mini produced a pure waterfall disguised as Agile
- ChatGPT 5.1 Codex (full) was better but still missed critical context
- AI generates impressive-sounding fluff without strategic thinking
- Use AI as a junior PM who needs supervision, not as a replacement
- Sprint planning requires human judgment AI doesn't have
Hot take: If your sprint planning can be automated by AI, your process is probably too rigid anyway.
Learn About Spec-Driven Development
https://www.youtube.com/watch?v=0atkW_janVg&list=PLphsQTGN5DbJnaiy-89QitCMkg-8toQac&embedable=true
