Executive Summary
I spent four weeks part-time (probably 80 hours total) building a complete reactive UI framework with 40+ components, a router, and a supporting interactive website using only LLM-generated code. The experience made one thing evident: LLMs can produce quality code—but, like human developers, they need the right guidance.
Key Findings
On Code Quality:
- Well-specified tasks yield clean first-pass code
- Poorly specified or unique requirements produce sloppy implementations
- Code degrades over time without deliberate refactoring
- LLMs defensively over-engineer when asked to improve reliability
On The Development Process:
- It is hard to be “well specified” when a task is large
- Extended reasoning ("thinking") produces better outcomes, though sometimes leads to circular or overly expansive logic
- Multiple LLM perspectives (switching models) provide valuable architectural review and debug assistance
- Structured framework use, e.g. Bau.js or Lightview, prevents slop better than unconstrained development
- Formal metrics objectively identify and guide removal of code complexity
The bottom line: In many ways LLMs behave like the average of the humans who trained them—they make similar mistakes, but much faster and at greater scale. Hence, you can get six months of maintenance and enhancement “slop” 6 minutes after you generate an initial clean code base and then ask for changes.
The Challenge
Four weeks ago, I set out to answer a question that's been hotly debated in the development community: Can LLMs generate substantive, production-quality code?
Not a toy application. Not a simple CRUD app. A complete, modern reactive UI framework with lots of pre-built components, a router and a supporting website with:
- Tens of thousands of lines of JavaScript, CSS, and HTML
- Memory, performance, and security considerations
- Professional UX and developer experience
I chose to build Lightview (lightview.dev)—a reactive UI framework combining the best features of Bau.js, HTMX, and Juris.js. The constraint: 100% LLM-generated code using Anthropic's Claude (Opus 4.5, Sonnet 4.5) and Google's Gemini 3 Pro (Flash was not released when I started).
Starting With Questions, Not Code
I began with Claude Opus:
"I want to create a reactive UI library that combines HTMX, Bau, and Juris. For hypermedia, I prefer no special attribute names—just enhanced behavior. I also want string literal processing in HTML, an SPA router, a UI component library with automatic custom element creation, SEO-enabled apps with no extra work, and a website to promote and educate users. Do you have any questions?"
Claude didn't initially dive into code. It asked dozens of clarifying questions:
- TypeScript or vanilla JavaScript?
- Which UI component library for styling? (It provided options with pros/cons)
- Which HTMX features specifically?
- Hosting preferences?
- Routing strategy?
However, at times it started to write code before I thought it was ready to, and I had to abort a response and redirect it.
Finding: LLMs have a strong bias toward premature code generation. Even when reminded not to code yet, they forget after a few interactions. This happens with all models—Claude, Gemini, GPT. They seem especially triggered to start the generation process when given example code, even when examples are provided alongside questions rather than as implementation requests.
Guidance: If an LLM starts generating code before you're ready, cancel the completion immediately and redirect: "Don't generate code yet. Do you have more questions?" You may need to repeat this multiple times as the LLM "forgets" and drifts back toward code generation. The Planning vs Fast mode toggle in Antigravity, or similar modes in other IDEs, should help with this, but it's inconvenient to use repeatedly. A better solution would be: if the user asks a question, the LLM should assume they want discussion, not code. Only generate/modify code when explicitly requested or after asking permission.
After an hour of back-and-forth, Claude finally said: "No more questions. Would you like me to generate an implementation plan since there will be many steps?"
The resulting plan was comprehensive - a detailed Markdown file with checkboxes, design decisions, and considerations for:
- Core reactive library
- 40+ UI components
- Routing system
- A website … though the website got less attention - a gap I'd later address
I did not make any substantive changes to this plan except for clarifications on website items and the addition, at the end of development, of one major feature: declarative event gating.
The Build Begins
With the plan in place, I hit my token limit on Opus. No problem—I switched to Gemini 3 (High), which had full context from the conversation plus the plan file.
Within minutes, Gemini generated lightview.js—the core reactivity engine—along with two example files: a "Hello, World!" demo showing both Bau-like syntax and vDOM-like syntax.
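For readers unfamiliar with the two styles, here is a rough, self-contained illustration of what "Bau-like" versus "vDOM-like" syntax generally means. The helper implementations are stand-ins for this sketch, not Lightview's actual exports:

```javascript
// Self-contained stand-ins so this sketch runs in a browser console; real libraries
// (Bau.js, a vDOM renderer, or Lightview itself) provide richer versions.
function h(tag, props = {}, ...children) {
  const el = document.createElement(tag);
  for (const [key, value] of Object.entries(props ?? {})) el.setAttribute(key, value);
  el.append(...children); // strings become text nodes automatically
  return el;
}
const tags = new Proxy({}, {
  get: (_, tag) => (first, ...rest) =>
    first && typeof first === 'object' && !(first instanceof Node)
      ? h(tag, first, ...rest)            // first argument is a props object
      : h(tag, {}, ...(first === undefined ? [] : [first]), ...rest),
});

// Bau-like syntax: destructured tag functions that return real DOM nodes.
const { div, h1 } = tags;
document.body.append(div({ class: 'greeting' }, h1('Hello, World!')));

// vDOM-like syntax: hyperscript-style h(tag, props, ...children) descriptions.
document.body.append(h('div', { class: 'greeting' }, h('h1', null, 'Hello, World!')));
```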
Then I made a mistake.
"Build the website as an SPA," I said, without specifying to use Lightview itself. I left for lunch.
When I returned, there was a beautiful website running in my browser. I looked at the code and my heart sank: React with Tailwind CSS.
Finding: LLMs will use the most common/popular solution if you don't specify otherwise. React + Tailwind is an extremely common pattern for SPAs. Without explicit guidance to use Lightview—the very framework I'd just built—the LLM defaulted to what it had seen most often in training data.
Worse, when I asked to rebuild it with Lightview, I forgot to say "delete the existing site first." So it processed and modified all 50+ files one by one, burning through tokens at an alarming rate.
Guidance: When asking an LLM to redo work, be explicit about the approach:
- Delete the existing site and rebuild from scratch using Lightview
- vs Modify the existing site to use Lightview
The first is often more token-efficient for large changes. The second is better for targeted fixes. The LLM won't automatically choose the efficient path—you need to direct it.
The Tailwind Surprise
One issue caught me off-guard. After Claude generated the website using Lightview components, I noticed it was still full of Tailwind CSS classes. I asked Claude about this.
"Well," Claude effectively explained, "you chose DaisyUI for the UI components, and DaisyUI requires Tailwind as a dependency. I assumed you were okay with Tailwind being used throughout the site."
Fair point—but I wasn't okay with it. I prefer semantic CSS classes and wanted the site to use classic CSS approaches.
Finding: LLMs make reasonable but sometimes unwanted inferences. When you specify one technology that has dependencies, LLMs will extend that choice to related parts of the project. They're being logical, but they can't read your mind about preferences.
Guidance: Be explicit about what you don't want, not just what you do want. e.g. "I want DaisyUI components, but only use Tailwind for them, not elsewhere." If you have strong preferences about architectural approaches, state them upfront.
I asked Claude to rewrite the site using classic CSS and semantic classes. I liked the design and did not want to delete the files, so once again I suffered through a refactor that consumed a lot of tokens since it touched so many files. I once again ran out of tokens, tried GPT-OSS but hit syntax errors, and had to switch to another IDE to keep working.
Guidance: When one LLM struggles with your codebase, switch back to one that was previously successful. Different models have different "understanding" of your project context. And, if you are using Antigravity when you run out of tokens, you can switch to Microsoft Visual Studio Code in the same directory and use a light GitHub Copilot account with Claude. Antigravity is based on Visual Studio Code, so it works in a very similar manner.
The Iterative Dance
Over the next few weeks, I built out the website and tested and iterated on components, working across multiple LLMs as token limits reset. Claude, Gemini, back to Claude. Each brought different strengths and weaknesses:
- Claude excelled at architectural questions and generated clean website code with Lightview components
- Gemini Pro consistently tried to use local tools and shell helper scripts to support its own work—valuable for speed and token efficiency. However, it sometimes failed with catastrophic results: many files zeroed out or corrupted, with no option but to roll back.
- Switching perspectives proved powerful: "You are a different LLM. What are your thoughts?" often yielded breakthrough insights or rapid fixes to bugs on which one LLM had been spinning.
- I found the real winner to be Gemini Flash. It did an amazing job of refactoring code without introducing syntax errors and needed minimal guidance on what code to put where. Sometimes I was skeptical of a change and would say so. Sometimes, Flash would agree and adjust and other times it would make a rational justification of its choice. And, talk about fast … wow!
The Router Evolution
The router also needed work. Claude initially implemented a hash-based router (#/about, #/docs, etc.). This is entirely appropriate for an SPA—it's simple, reliable, and doesn't require server configuration.
But I had additional requirements I hadn't clearly stated: I wanted conventional paths (/about, /docs) for deep linking and SEO. Search engines can handle hash routes now, but path-based routing is still cleaner for indexing and sharing.
Finding: LLMs will sometimes default to the simplest valid solution. Hash-based routing is easier to implement and works without server-side configuration. Since I did not say I wanted path-based routing, the LLM chose the simpler approach.
When I told Claude I needed conventional paths for SEO and deep linking, it very rapidly rewrote the router and came up with what I consider a clever solution—a hybrid approach that makes the SPA pages both deep-linkable and SEO-indexable without the complexity of server-side rendering. However, it left some of the original code in place, which obscured what was going on and was entirely unneeded. I had to tell it to remove this code, which supported the vestiges of hash-based routes. This kind of code retention is exactly what leads to slop. I suppose many people would blame the LLM, but if I had been clear to start with and also said "completely re-write", my guess is the vestiges would not have existed.
Guidance: For architectural patterns, be explicit about your requirements early. Don't assume the LLM knows you want the more complex but SEO-friendly approach. Specify: "I need path-based routing with History API for SEO" rather than just "I need routing."
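To make the distinction concrete, here is a minimal sketch of the two approaches using plain browser APIs. It is illustrative only, not Lightview's actual router code:

```javascript
// Hash-based routing: no server configuration needed, but URLs look like /#/about.
window.addEventListener('hashchange', () => {
  const path = location.hash.slice(1) || '/';   // "#/about" -> "/about"
  renderRoute(path);
});

// Path-based routing with the History API: clean URLs like /about, but the server
// must serve the SPA shell for every route (or you need a prerendering/redirect
// strategy so deep links and crawlers get real pages).
document.addEventListener('click', (event) => {
  const link = event.target.closest('a[href^="/"]');
  if (!link) return;
  event.preventDefault();
  history.pushState({}, '', link.getAttribute('href'));
  renderRoute(location.pathname);
});
window.addEventListener('popstate', () => renderRoute(location.pathname));

// Placeholder for whatever view-swapping logic your framework provides.
function renderRoute(path) { /* look up the route and render its view */ }
```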
Guidance: I also found that LLMs defensively try to ensure compatibility with previous versions, which can lead to overly complex code. If you are writing from scratch, you need to remind them that backward compatibility is not required.
Confronting The Numbers
The Final Tally
Project Size:
- 60 JavaScript files, 78 HTML files, 5 CSS files
- 41,405 total lines of code (including comments and blanks)
- Over 40 custom UI components
- 70+ website files
At this point, the files seemed reasonable - not overly complex. But intuition and my biased feelings about code after more than 40 years of software development aren't enough. I decided to run formal metrics on the core files.
Core Libraries:
| File | Lines | Minified Size |
|---|---|---|
| lightview.js | 603 | 7.75K |
| lightview-x.js | 1,251 | 20.2K |
| lightview-router.js | 182 | 3K |
The website component gallery scored well on Lighthouse for performance without any particularly focused optimization effort.
But then came the complexity metrics.
The Slop Revealed
I asked Gemini Flash to evaluate the code using three formal metrics:
1. Maintainability Index (MI): A combined metric where 0 is unmaintainable and 100 is perfectly documented/clean code. The calculation considers:
- Halstead Volume (measure of code size and complexity)
- Cyclomatic Complexity
- Lines of code
- Comment density
Scores above 65 are considered healthy for library code. This metric gives you a single number to track code health over time.
2. Cyclomatic Complexity: An older but still valuable metric that measures the number of linearly independent paths through code. High cyclomatic complexity means:
- More potential bugs
- Harder to test thoroughly (the metric can actually tell you how many tests you might need to write)
- More cognitive load to understand
3. Cognitive Complexity: A modern metric that measures the mental effort a human needs to understand code. Unlike cyclomatic complexity (which treats all control flow equally), cognitive complexity penalizes:
- Nested conditionals and loops (deeper nesting = higher penalty)
- Boolean operator chains
- Recursion
- Breaks in linear flow
The thresholds:
- 0-15: Clean Code - easy to understand and maintain
- 16-25: High Friction - refactoring suggested to reduce technical debt
- 26+: Critical - immediate attention needed, maintenance nightmare
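To see why nesting dominates the score, compare these two equivalent functions. The increment comments follow the usual SonarSource-style scoring and are approximate, not output from the analyzer used in this project:

```javascript
// Deeply nested version: each control structure costs +1 plus its nesting depth,
// so the score climbs quickly (roughly 1 + 2 + 3 + 4 = 10 here).
function findActiveAdmins(users) {
  const result = [];
  for (const user of users) {                 // +1
    if (user.active) {                        // +2 (nested one level)
      if (user.roles) {                       // +3 (nested two levels)
        if (user.roles.includes('admin')) {   // +4 (nested three levels)
          result.push(user);
        }
      }
    }
  }
  return result;
}

// Flattened version with an early continue: same behavior,
// roughly 1 + 2 + 1 (boolean sequence) = 4.
function findActiveAdminsFlat(users) {
  const result = [];
  for (const user of users) {                                       // +1
    if (!user.active || !user.roles?.includes('admin')) continue;   // +2, +1 for the || chain
    result.push(user);
  }
  return result;
}
```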
Finding: LLMs excel at creating analysis tools. Gemini Flash initially searched for an existing metrics library, couldn't find one, then wrote its own complete analyzer (metrics-analysis.js) using the Acorn JavaScript parser—without asking permission. This is both impressive and occasionally problematic. I cover the problem with this case later.
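For a sense of what such an analyzer involves, here is a minimal sketch of the approach (parse with Acorn, walk the AST, count decision points per function). It is illustrative only, not the metrics-analysis.js that Flash produced:

```javascript
// Minimal sketch of an Acorn-based complexity report.
// Requires: npm install acorn acorn-walk
import { readFileSync } from 'node:fs';
import * as acorn from 'acorn';
import * as walk from 'acorn-walk';

const source = readFileSync(process.argv[2], 'utf8');
const ast = acorn.parse(source, { ecmaVersion: 'latest', sourceType: 'module' });

const branchTypes = new Set([
  'IfStatement', 'ForStatement', 'ForInStatement', 'ForOfStatement',
  'WhileStatement', 'DoWhileStatement', 'ConditionalExpression',
  'SwitchCase', 'CatchClause',
]);

const results = [];
walk.full(ast, (node) => {
  if (!['FunctionDeclaration', 'FunctionExpression', 'ArrowFunctionExpression'].includes(node.type)) return;
  let complexity = 1; // one linear path by default
  walk.full(node, (inner) => {
    if (branchTypes.has(inner.type)) complexity += 1;
    if (inner.type === 'LogicalExpression') complexity += 1; // && / || / ?? add branches
  });
  // Note: for simplicity, nested functions are counted into their parent here.
  results.push({ name: node.id?.name ?? '(anonymous)', cyclomatic: complexity });
});

console.table(results.sort((a, b) => b.cyclomatic - a.cyclomatic));
```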
The Verdict
Overall health looked good:
| File | Functions | Avg Maintainability | Avg Cognitive | Status |
|---|---|---|---|---|
| lightview.js | 58 | 65.5 | 3.3 | ⚖️ Good |
| lightview-x.js | 93 | 66.5 | 3.6 | ⚖️ Good |
| lightview-router.js | 27 | 68.6 | 2.1 | ⚖️ Good |
But drilling into individual functions told a different story. Two functions hit "Critical" status:
handleSrcAttribute (lightview-x.js):
- Cognitive Complexity: 35 🛑
- Cyclomatic Complexity: 22 🛑
- Maintainability Index: 33.9
Anonymous Template Processing (lightview-x.js):
- Cognitive Complexity: 31 🛑
- Cyclomatic Complexity: 13
This was slop. Technical debt waiting to become maintenance nightmares.
Can AI Fix Its Own Slop?
Here's where it gets interesting. The code was generated by Claude Opus, Claude Sonnet, and Gemini 3 Pro several weeks earlier. Could the newly released Gemini 3 Flash clean it up?
I asked Flash to refactor handleSrcAttribute to address its complexity. This seemed to take a little longer than necessary. So I aborted and spent some time reviewing its thinking process. There were obvious places it got side-tracked or even went in circles, but I told it to continue. After it completed, I manually inspected the code and thoroughly tested all website areas that use this feature. No bugs found.
Critical Discovery #2: Gemini Flash "thinks" extensively. While reviewing all its thought processes would be tedious, important insights flash by in the IDE. When an LLM seems stuck in a loop, abort, review its historical thoughts for possible sidetracks, and tell it to continue or redirect as needed.
After the fixes to handleSrcAttribute, I asked for revised statistics to see the improvement.
Flash's Disappearing Act
Unfortunately, Gemini Flash had deleted its metrics-analysis.js file! It had to recreate the entire analyzer.
Finding: Gemini Flash aggressively deletes temporary files. After Flash uses a script or analysis tool it creates, it often deletes the file assuming it is temporary. This happens even for files that take significant effort to create and that you might want to keep or reuse.
Guidance: Tell Gemini to put utility scripts in a specific directory (like /home/claude/tools/ or /home/claude/scripts/) and explicitly ask it to keep them. You can say: "Create this in /home/claude/tools/ and keep it for future use." Otherwise, you'll find yourself regenerating the same utilities multiple times.
The Dev Dependencies Problem
When I told Gemini to keep the metrics scripts permanently, another issue surfaced: it failed to officially install dev dependencies like acorn (the JavaScript parser).
Flash simply assumed that because it found packages in node_modules, it could safely use them. The only reason acorn was available was because I'd already installed a Markdown parser that depended on it.
Finding: LLMs don't always manage dependencies properly. They'll use whatever's available in node_modules without verifying it's officially declared in package.json. This creates fragile builds that break on fresh installs.
Guidance: After an LLM creates utility scripts that use external packages, explicitly ask: "Did you add all required dependencies to package.json? Please verify and install any that are missing." Better yet, review the script's imports and cross-check against your declared dependencies yourself.
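Here is a rough sketch of that cross-check. It only catches top-level import/require specifiers, and the default file path is just an example:

```javascript
// Rough dependency cross-check: list bare import/require specifiers in a script
// and flag any that are not declared in package.json.
import { readFileSync } from 'node:fs';

const pkg = JSON.parse(readFileSync('./package.json', 'utf8'));
const declared = new Set([
  ...Object.keys(pkg.dependencies ?? {}),
  ...Object.keys(pkg.devDependencies ?? {}),
]);

// The default path here is illustrative; pass the script to check as an argument.
const source = readFileSync(process.argv[2] ?? './tools/metrics-analysis.js', 'utf8');
const specifiers = [...source.matchAll(/(?:require\(\s*|from\s+)['"]([^'"]+)['"]/g)]
  .map((match) => match[1]);

for (const spec of specifiers) {
  if (spec.startsWith('.') || spec.startsWith('node:')) continue; // local files and builtins
  // Scoped packages keep their first two path segments (@scope/name).
  const name = spec.startsWith('@') ? spec.split('/').slice(0, 2).join('/') : spec.split('/')[0];
  if (!declared.has(name)) console.warn(`Undeclared dependency: ${name}`);
}
// (Node builtins used without the node: prefix would also need filtering in real use.)
```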
The Refactoring Results
With the analyzer recreated, Flash showed how it had decomposed the monolithic function into focused helpers:
- fetchContent (cognitive: 5)
- parseElements (cognitive: 5)
- updateTargetContent (cognitive: 7)
- elementsFromSelector (cognitive: 2)
- handleSrcAttribute orchestrator (cognitive: 10)
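For illustration, the orchestrator shape described above might look roughly like this. The helper names come from the metrics output; the bodies and signatures are illustrative guesses, not Lightview's actual implementation:

```javascript
// Hypothetical sketch of the orchestrator pattern: one small function per concern,
// with handleSrcAttribute reduced to coordination.
async function handleSrcAttribute(el) {
  const html = await fetchContent(el.getAttribute('src'));           // network concern
  const elements = parseElements(html);                              // parsing concern
  const targets = elementsFromSelector(el.getAttribute('target'), el);
  updateTargetContent(targets, elements);                            // DOM update concern
}

async function fetchContent(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`Failed to load ${url}: ${response.status}`);
  return response.text();
}

function parseElements(html) {
  return [...new DOMParser().parseFromString(html, 'text/html').body.childNodes];
}

function elementsFromSelector(selector, fallback) {
  return selector ? [...document.querySelectorAll(selector)] : [fallback];
}

function updateTargetContent(targets, elements) {
  for (const target of targets) target.replaceChildren(...elements.map((n) => n.cloneNode(true)));
}
```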
The Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| Cognitive Complexity | 35 🛑 | 10 ✅ | -71% |
| Cyclomatic Complexity | 22 | 7 | -68% |
| Status | Critical Slop | Clean Code | — |
Manual inspection and thorough website testing revealed zero bugs. The cost? A 0.5K increase in file size - negligible.
Emboldened, I tackled the template processing logic. Since it spanned multiple functions, this required more extensive refactoring:
Extracted Functions:
- collectNodesFromMutations - iteration logic
- processAddedNode - scanning logic
- transformTextNode - template interpolation for text
- transformElementNode - attribute interpolation and recursion
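As a rough illustration of that decomposition (the function names are from the metrics report; the bodies are hypothetical placeholders, not Lightview's code):

```javascript
// Hypothetical sketch of the decomposed MutationObserver pipeline.
const observer = new MutationObserver((mutations) => {
  for (const node of collectNodesFromMutations(mutations)) processAddedNode(node);
});
observer.observe(document.body, { childList: true, subtree: true });

function collectNodesFromMutations(mutations) {
  // Flatten every mutation record's addedNodes into a single array.
  return mutations.flatMap((m) => [...m.addedNodes]);
}

function processAddedNode(node) {
  if (node.nodeType === Node.TEXT_NODE) transformTextNode(node);
  else if (node.nodeType === Node.ELEMENT_NODE) transformElementNode(node);
}

function transformTextNode(node) {
  // Placeholder for template interpolation of expressions in text content.
}

function transformElementNode(el) {
  // Placeholder: interpolate attributes, then recurse into child nodes.
  for (const child of el.childNodes) processAddedNode(child);
}
```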
Results:
| Function Group | Previous Max | New Max | Status |
|---|---|---|---|
| MutationObserver Logic | 31 🛑 | 6 ✅ | Clean |
| domToElements Logic | 12 ⚠️ | 6 ✅ | Clean |
Final Library Metrics
After refactoring, lightview-x.js improved significantly:
- Functions: 93 → 103 (better decomposition)
- Avg Maintainability: 66.5 → 66.8
- Avg Cognitive: 3.6 → 3.2
All critical slop eliminated. The increased function count reflects healthier modularity - complex logic delegated to specialized, low-complexity helpers. In fact, it is as good as or better than established frameworks from a metrics perspective:
| File | Functions | Maintainability (min/avg/max) | Cognitive (min/avg/max) | Status |
|---|---|---|---|---|
| lightview.js | 58 | 7.2 / 65.5 / 92.9 | 0 / 3.4 / 25 | ⚖️ Good |
| lightview-x.js | 103 | 0.0 / 66.8 / 93.5 | 0 / 3.2 / 23 | ⚖️ Good |
| lightview-router.js | 27 | 24.8 / 68.6 / 93.5 | 0 / 2.1 / 19 | ⚖️ Good |
|  | 109 | 0.0 / 65.2 / 91.5 | 0 / 2.2 / 33 | ⚖️ Good |
|  | 79 | 11.2 / 71.3 / 92.9 | 0 / 1.5 / 20 | ⚖️ Good |
|  | 335 | 0.0 / 65.3 / 92.9 | 0 / 3.4 / 116 | ⚖️ Good |
|  | 360 | 21.2 / 70.1 / 96.5 | 0 / 2.6 / 51 | ⚖️ Good |
1. LLMs Mirror Human Behavior—For Better and Worse
LLMs exhibit the same tendencies as average developers:
- Rush to code without full understanding
- Don't admit defeat or ask for help soon enough
- Generate defensive, over-engineered solutions when asked to improve reliability
- Produce cleaner code with structure and frameworks
The difference? They do it faster and at greater volume. They can generate mountains of slop in hours that would take humans weeks.
2. Thinking Helps
Extended reasoning (visible in "thinking" modes) shows alternatives, self-corrections, and occasional "oh but" moments. The thinking is usually fruitful, sometimes chaotic. Don't just leave or do something else while tasks you believe are complex or critical are being conducted. The LLMs rarely say "I give up" or "Please give me guidance" - I wish they would more often. Watch the thinking flow and abort the response request if necessary. Read the thinking and redirect or just say continue; you will learn a lot.
3. Multiple Perspectives Are Powerful
When I told a second LLM, "You are a different LLM reviewing this code. What are your thoughts?", magic happened.
Finding: LLMs are remarkably non-defensive. When faced with an implementation that's critiqued as too abstract, insufficiently abstract, or inefficient, leading LLMs (Claude, Gemini, GPT) won't argue. They'll do a rapid, thorough analysis and return with honest pros/cons of current versus alternative approaches.
This behavior is actually beyond what most humans provide:
- How many human developers give rapid, detailed feedback without any defensive behavior?
- How many companies have experienced architects available for questioning by any developer at any time?
- How many code review conversations happen without ego getting involved?
Guidance: Before OR after making changes, switch LLMs deliberately:
1. Make progress with one LLM (e.g., Claude builds a feature).
2. Switch to another (e.g., Gemini) and say: "You are a different LLM reviewing this implementation. What are your thoughts on the architecture, potential issues, and alternative approaches?"
3. Switch back to the first and ask what it thinks now!
This is especially valuable before committing to major architectural decisions or after implementing complex algorithms. The second opinion costs just a few tokens but can save hours of refactoring later.
4. Structure Prevents Slop
Finding: Telling an LLM to use "vanilla JavaScript" without constraints invites slop. Vanilla JavaScript is a wonderful but inherently loose language through which a sometimes sloppy or inconsistent browser API is exposed. Without constraints, it's easy to create unmaintainable code—for both humans and LLMs. Specifying a framework (Bau.js, React, Vue, Svelte, etc.) provides guardrails that lead to cleaner, more maintainable code.
Guidance: When starting a project, describe what you want to accomplish and ask for advice on:
- The framework/library to use (React, Vue, Svelte, etc.)
- The architectural pattern (MVC, MVVM, component-based, etc.)
- Code organization preferences (feature-based vs. layer-based folders)
- Naming conventions
- Whether to use TypeScript or JSDoc for type safety
- Other libraries to use … to prevent re-invention
Don't say: "Build me a web app in JavaScript." Do say: "Build me a React application using functional components, hooks, TypeScript, and feature-based folder organization. Follow the Airbnb style guide for naming."
The more structure you provide upfront, the less slop you'll get. This applies to all languages, not just JavaScript.
5. Metrics Provide Objective Truth
I love that formal software metrics can guide LLM development. They're often considered too dull, mechanical, difficult, or costly to obtain for human development, but in an LLM-enhanced IDE with an LLM that can write code to do formal source analysis (no need for an IDE plugin subscription), they should get far more attention than they do.
Finding: Formal software metrics can guide development objectively. They're perfect for:
- Identifying technical debt automatically
- Tracking code health over time
- Guiding refactoring priorities
- Validating that "improvements" actually improve things
Metrics don't lie. They identified the slop my intuition missed.
Guidance: Integrate metrics into your LLM workflow:
- After initial implementation: Run complexity metrics on all files. Identify functions with cognitive complexity > 15 or cyclomatic complexity > 10.
- Prioritize refactoring: Address Critical (cognitive 26+) functions first, then High Friction (16-25) functions.
- Request targeted refactoring: Don't just say "improve this". Say "Refactor handleSrcAttribute to reduce cognitive complexity to the target range".
- Verify improvements: After refactoring, re-run metrics. Ensure complexity actually decreased and maintainability increased. Sometimes "improvements" just shuffle complexity around.
- Set quality gates: Before marking code as done, try to have all functions with a cognitive complexity < 15 and maintainability index > 65, as sketched below.
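If the analyzer can emit per-function results as JSON, such a gate is easy to script. A minimal sketch, assuming a metrics.json file shaped as an array of { file, name, cognitive, maintainability } records (that shape is an assumption for this sketch, not the output format of the analyzer from this project):

```javascript
// Simple quality gate: fail the build if any function exceeds the thresholds.
import { readFileSync } from 'node:fs';

const COGNITIVE_MAX = 15;
const MAINTAINABILITY_MIN = 65;

const metrics = JSON.parse(readFileSync('./metrics.json', 'utf8'));
const failures = metrics.filter(
  (fn) => fn.cognitive > COGNITIVE_MAX || fn.maintainability < MAINTAINABILITY_MIN,
);

for (const fn of failures) {
  console.error(`${fn.file} ${fn.name}: cognitive ${fn.cognitive}, MI ${fn.maintainability.toFixed(1)}`);
}
process.exit(failures.length ? 1 : 0);
```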
The Verdict
After 40,000 lines of LLM-generated code, I'm cautiously optimistic.
Yes, LLMs can generate quality code. But like human developers, they need:
- Clear, detailed specifications
- Structural constraints (frameworks, patterns)
- Regular refactoring guidance
- Objective quality measurements
- Multiple perspectives on architectural decisions
The criticism that LLMs generate slop isn't wrong—but it's incomplete. They generate slop for the same reasons humans do: unclear requirements, insufficient structure, and lack of quality enforcement.
The difference is iteration speed. What might take a human team months to build and refactor, LLMs can accomplish in hours. The cleanup work remains, but the initial generation accelerates dramatically.
Looking Forward
I'm skeptical that most humans will tolerate the time required to be clear and specific with LLMs - just as they don't today when product managers or developers push for detailed requirements from business staff. The desire to "vibe code" and iterate will persist.
But here's what's changed: We can now iterate and clean up faster when requirements evolve or prove insufficient. The feedback loop has compressed from weeks to hours.
As coding environments evolve to wrap LLMs in better structure - automated metrics, enforced patterns, multi-model reviews - the quality will improve. We're not there yet, but the foundation is promising.
The real question isn't whether LLMs can generate quality code. It's whether we can provide them - and ourselves - with the discipline to do so consistently.
And, I have a final concern … if LLMs are based on history and have a tendency to stick with what they know, then how are we going to evolve the definition and use of things like UI libraries? Are we forever stuck with React unless we ask for something different? Or, are libraries an anachronism? Will LLMs and image or video models soon just generate the required image of a user interface with no underlying code?
Given its late entry into the game and the anchoring LLMs already have, I don’t hold high hopes for the adoption of Lightview, but it was an interesting experiment. You can visit the project at: https://lightview.dev
