Can LLMs Actually Clean Your Data? The Tradeoffs Nobody Wants to Admit

Written by aimodels44 | Published 2026/02/12

TL;DR: Learn where LLMs outperform rule-based pipelines in data prep (standardization, schema matching, annotation) and where they fail (hallucinations, confidence calibration, scalability, evaluation).

This is a Plain English Papers summary of a research paper called Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.

When raw data is messy, LLMs offer a new kind of solution

Data is messy. Always has been. The moment you try to combine information from different sources or time periods or human beings, inconsistency emerges. A customer's name appears in three different formats. A product price shows up in different currencies. An address is complete in one record and abbreviated in another. This isn't random corruption. It's the natural artifact of real-world data collection, where systems evolve, people make different choices, and standards clash.

For decades, data scientists have spent 80% of their time on this unglamorous work: cleaning, integrating, and enriching raw datasets into something actually usable. The old approach was rigid and rule-based. You'd identify a problem, write code to fix it, and hope the next dataset didn't have a problem your rules didn't anticipate. It always did.

The emerging shift is more subtle than "add an LLM to your pipeline." Rather, it's a fundamental rethinking of how we approach data preparation. Instead of writing rules that specify exactly how to transform messy data, we write prompts that describe what correct data should look like, then let the model reason from context. This is more flexible. It also trades predictability for capability, and that tradeoff creates new problems the field is only beginning to understand.

The paradigm shift: from rules to reasoning

Think about how you'd teach someone to fix a customer address field. The rule-based approach gives specific instructions: "If it lacks a state, look it up from the ZIP code. If the ZIP code is missing, flag it." This is predictable and auditable, but brittle. The moment you encounter an address format you didn't anticipate, the system fails.

The LLM-based approach is different. You describe the desired outcome: "Customer addresses should include street number, street name, city, and state where available. If you can infer missing components from context, do so. If you can't, mark the record as uncertain." The model reads this description and reasons about each specific case, using its understanding of how addresses work in the real world. A street address written as "123 Main St, Springfield, IL 62701" and another written as "Main St in Springfield" can both be understood and standardized meaningfully.
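
Here is a minimal sketch of what that looks like as a prompt-driven step, assuming a generic call_llm(prompt) function standing in for whatever model API you use; the prompt wording and the JSON output convention are illustrative, not taken from the paper:

```python
import json

ADDRESS_PROMPT = """You standardize customer addresses.
Return JSON with keys: street, city, state, zip, uncertain (true/false).
Include a component only if it is stated or can be inferred from context.
If a component cannot be inferred, leave it null and set uncertain to true.
Never invent house numbers, ZIP codes, or states.

Address: {raw}"""


def standardize_address(raw: str, call_llm) -> dict:
    """Normalize one address; call_llm is any text-in/text-out model function."""
    reply = call_llm(ADDRESS_PROMPT.format(raw=raw))
    record = json.loads(reply)
    record["raw"] = raw  # keep the original value so the step stays auditable
    return record


# Stub model so the sketch runs end to end; swap in a real API call in practice.
def fake_llm(prompt: str) -> str:
    return json.dumps({"street": "123 Main St", "city": "Springfield",
                       "state": "IL", "zip": "62701", "uncertain": False})


print(standardize_address("123 Main St, Springfield, IL 62701", fake_llm))
```

The point of the explicit "uncertain" flag is that the prompt describes what correct output looks like rather than how to compute it, while still giving downstream code something deterministic to branch on.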

Overview of Application-Ready Data Preparation through LLM-Enhanced Methods

This shift represents something genuine: moving from application-specific, deterministic pipelines to prompt-driven, context-aware, and agentic workflows. The infrastructure supporting this has matured too. Tools like Databricks Unity Catalog now make it practical to construct flexible data preparation agents, not just static scripts.

But flexibility comes with costs. LLMs can hallucinate, confidently inventing information that isn't there. Their behavior can vary across runs. They're expensive at scale. And their advantages aren't universal across all data problems. This is where the nuance matters. LLMs aren't universally better at data preparation. They're better at some things and worse at others, and understanding the boundary is critical.

Three distinct challenges, one integrated solution

Data preparation breaks down into three tasks that sound simple but each presents its own challenges:

Data cleaning addresses quality issues within individual records. A customer's email is formatted inconsistently. A price is in dollars in one row and euros in another. A date is ambiguous: does "12/03" mean December 3rd or March 12th? The goal is making each record internally consistent and conform to expected formats.

Data integration solves structural mismatches across different datasets. You acquire another company and inherit their customer database. Their customer ID doesn't map to yours. Your product catalog uses SKU identifiers, but your warehouse system uses part numbers. Two tables both have an "address" field, but they mean different things. The goal is identifying which records refer to the same entity and merging information correctly.

Data enrichment addresses missing context. You have a list of companies but no industry classification. You have transaction data with no customer demographics attached. You have product descriptions written as narratives but no structured metadata. The goal is inferring or acquiring information that enables better downstream analysis.

Application-Ready Data Preparation: Three core tasks address key sources of data inefficiency: quality issues, integration barriers, and semantic gaps

These tasks are distinct but interdependent. You can't reliably integrate data you haven't cleaned. You can't enrich data with inconsistent structure. But they're separate enough that different techniques excel at each one. Understanding where LLMs shine and where they struggle requires looking at each task individually.

Data cleaning: semantic understanding meets hallucination risk

Cleaning is foundational. A database where addresses are formatted inconsistently won't break downstream analysis immediately, but it will silently produce wrong answers. The cleaning task divides into three sub-problems: standardization, error processing, and imputation.

Standardization requires making inconsistent data consistent. Customer names might appear as "John Smith", "Smith, John", "J. Smith", "john smith". A human immediately recognizes these as variations on the same format family. A rule-based system requires you to enumerate the variations or write sophisticated pattern-matching code. An LLM understands the semantic equivalence naturally.

Example of LLM-Enhanced Data Standardization

But this advantage comes with a real liability. An LLM standardizing names might transform "Smith" into "Michael Smith" because it is confident about what the full name should be, even though it's actually inventing information. This is the hallucination problem in its most practical form: the model seems helpful but introduces false data.
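
One way to keep that liability in check is a cheap output guardrail. The sketch below, again assuming a hypothetical call_llm helper and an illustrative prompt, rejects any standardized name that contains words absent from the input:

```python
import re

NAME_PROMPT = ("Rewrite this customer name as 'First Last' using only words already "
               "present in the input. Do not expand initials or add missing names.\n"
               "Name: {raw}\nAnswer with the name only.")


def standardize_name(raw: str, call_llm) -> str:
    cleaned = call_llm(NAME_PROMPT.format(raw=raw)).strip()
    # Cheapest hallucination check: reject outputs containing words that never
    # appeared in the input (e.g. "Smith" coming back as "Michael Smith").
    source_tokens = {t.lower().strip(".") for t in re.split(r"[\s,]+", raw) if t}
    output_tokens = {t.lower().strip(".") for t in cleaned.split() if t}
    if not output_tokens <= source_tokens:
        return raw  # keep the original and route it to human review instead
    return cleaned
```

A token-subset check like this catches invented names but not every failure mode; it is a floor, not a verification strategy.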

Error processing is about deciding what to do with clearly bad data. Raw datasets always have errors: impossible values, values that violate constraints, missing data. A phone number column with an entry like "12345678901234567" is obviously wrong. A customer age of 247 is impossible. But deciding whether to fix, flag, or discard requires understanding context.

Example of LLM-Enhanced Data Error Processing

This is where LLMs' contextual reasoning shines. You can ask a model to identify impossible values, infer what the likely intended value was, and decide whether to attempt a correction or mark the record as suspicious. A customer age of "24 7" might be a typo for "24" or "27". An address with "Apt 256B" in a building that only goes to 10 stories might be "Apt 2B". But without perfect understanding of context, the model might confidently guess wrong.
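
A sketch of that triage step, with an illustrative prompt and the same hypothetical call_llm helper; the keep/fix/flag vocabulary is an assumption of this example, not terminology from the paper:

```python
import json

ERROR_PROMPT = """You review one field value for errors.
Column: {column} (expected: {expectation})
Value: {value}
Return JSON: {{"action": "keep" | "fix" | "flag", "suggestion": ..., "reason": ...}}
Choose "fix" only when the correction is unambiguous; otherwise choose "flag"."""


def triage_value(column: str, expectation: str, value: str, call_llm) -> dict:
    decision = json.loads(call_llm(ERROR_PROMPT.format(
        column=column, expectation=expectation, value=value)))
    # Never auto-apply an ambiguous repair: "24 7" could be 24 or 27, so the
    # pipeline downgrades a fix with no concrete suggestion to a flag for review.
    if decision.get("action") == "fix" and decision.get("suggestion") is None:
        decision["action"] = "flag"
    return decision


# e.g. triage_value("age", "integer between 0 and 120", "247", call_llm)
```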

Imputation fills missing values. Missing data is ubiquitous. A machine learning model needs all features to have values. An analyst needs to know what percentage of data is missing. But inventing data you don't know is risky.

Example of LLM-Enhanced Data Imputation

LLMs can use multi-column context in sophisticated ways. If you know a customer's address, employer, and phone number, the model might infer additional information. But this is exactly where hallucinations become most dangerous. The model can confidently invent information that seems plausible but is actually wrong. The difference between "inferred from limited context" and "hallucinated" is often invisible to downstream systems.
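
One defensive pattern is to make the model state the basis for each filled value and accept only what it claims to derive from the record itself. A sketch, with an illustrative "basis" convention and the hypothetical call_llm helper:

```python
import json

IMPUTE_PROMPT = """Fill the missing field of this record only if the known fields
logically determine it (for example, a US state from a ZIP code).
Known fields: {known}
Missing field: {target}
Return JSON: {{"value": ..., "basis": "derived" | "guessed" | "unknown"}}"""


def impute_field(record: dict, target: str, call_llm):
    known = {k: v for k, v in record.items() if v is not None and k != target}
    result = json.loads(call_llm(IMPUTE_PROMPT.format(
        known=json.dumps(known), target=target)))
    # Accept only values the model says are derived from the record itself;
    # "guessed" values are exactly where plausible-looking hallucinations live.
    return result.get("value") if result.get("basis") == "derived" else None
```

This still trusts the model's self-report, which is the core weakness the survey highlights, but it at least makes the inferred-versus-guessed distinction explicit instead of invisible.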

Research on LLM-enhanced data preprocessing has explored how models handle these tasks. The findings are mixed: LLMs excel at understanding semantic constraints and using cross-column context, but struggle with reliable confidence estimates and consistent behavior across runs.

Data integration: connecting mismatched sources

Cleaning fixes individual records. Integration solves the harder problem: you have data from multiple sources that don't speak the same language. Entity matching and schema matching are classic database problems. LLMs bring semantic understanding to both, at new costs.

Entity matching answers the question: are these two records about the same thing? You have 100,000 customer records in your CRM and 150,000 in your data warehouse. Some are definitely the same person. Some are definitely different. Many are ambiguous. Matching errors compound: every record you match incorrectly pollutes downstream analysis with wrong information.

Example of LLM-Enhanced Entity Matching

Rule-based entity matching looks for exact matches or simple fuzzy matches: do the names match? Do they match after removing extra spaces? Do they match phonetically? LLMs can reason about semantic similarity in a much richer way. "John J. Smith" and "Jonathan Smith" probably refer to the same person, especially if their addresses are in the same city and their phone numbers are similar. LLMs handle this kind of nuanced reasoning naturally.

But they also confidently match records that shouldn't match. When matches are ambiguous, the model can't see the context that would make the decision clear. A 60% match might be correct or wrong. Without proper confidence calibration, the model can assign high confidence to a wrong match, and downstream systems will act on false data.
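
A common defensive pattern is to let the model abstain and to route anything unclear to human review rather than trusting its self-reported confidence. A sketch, with an illustrative prompt and the hypothetical call_llm helper:

```python
import json

MATCH_PROMPT = """Do these two records refer to the same real-world customer?
Record A: {a}
Record B: {b}
Return JSON: {{"verdict": "match" | "no_match" | "unsure", "evidence": ...}}
Answer "unsure" whenever the evidence is ambiguous."""


def match_records(a: dict, b: dict, call_llm) -> str:
    result = json.loads(call_llm(MATCH_PROMPT.format(
        a=json.dumps(a), b=json.dumps(b))))
    verdict = result.get("verdict")
    # Self-reported confidence isn't calibrated enough to auto-merge on, so
    # anything that isn't a clear verdict goes to a human-review queue.
    return verdict if verdict in {"match", "no_match"} else "review"
```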

Schema matching is structurally simpler but practically important. Two product catalogs need to be merged. One uses "SKU", the other uses "sku_id". One has "list_price", the other has "msrp". One has "units_in_stock", the other has "inventory_qty". These fields mean similar things but aren't labeled the same.

Example of LLM-Enhanced Schema Matching

This is one of LLMs' cleaner wins. Semantic understanding of field names and their purpose is exactly what language models do well. An LLM understands that "units_in_stock" and "inventory_qty" likely refer to the same concept without being told. The failure modes tend to be false negatives (missed some mappings) rather than false positives (confidently wrong mappings), which is more recoverable.
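
A sketch of schema matching constrained to the actual target columns, so the worst case is a missed mapping rather than an invented one (prompt and call_llm helper are illustrative):

```python
import json

SCHEMA_PROMPT = """Map each source column to the target column with the same meaning.
Source columns: {source}
Target columns: {target}
Return a JSON object mapping every source column to a target column or null.
Map a column only when the correspondence is clear."""


def match_schemas(source_cols: list, target_cols: list, call_llm) -> dict:
    mapping = json.loads(call_llm(SCHEMA_PROMPT.format(
        source=source_cols, target=target_cols)))
    # Keep only mappings whose target actually exists, discarding any names the
    # model made up; unmapped columns surface as the (recoverable) false negatives.
    return {s: t for s, t in mapping.items() if t in target_cols}


# match_schemas(["sku_id", "msrp", "inventory_qty"],
#               ["SKU", "list_price", "units_in_stock"], call_llm)
```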

Work on LLM agents for cleaning tabular data has shown that agentic approaches, where models can iteratively refine their understanding by asking clarifying questions or requesting feedback, perform better than single-pass approaches. But this comes at higher computational cost.

Data enrichment: when inference becomes hallucination

You've cleaned your data and integrated it. Now comes the ambitious task: adding information that wasn't originally there. Data annotation and profiling transform a spreadsheet of raw facts into a spreadsheet of interpreted facts.

Data annotation assigns labels or classifications to records. You have a list of transaction descriptions written as narratives. You need to classify each transaction as "office supplies", "travel", "entertainment", or similar categories. You have product descriptions. You need to extract structured attributes: color, size, material, brand. You have customer support tickets. You need to assign priority and category.

Example of LLM-Enhanced Data Annotation

LLMs are powerful here because they understand language semantically. They can read a product description and extract relevant attributes without being shown examples of every possible format. They can classify a transaction description correctly even if it's written in unusual language. This is where LLMs' core strength—language understanding—directly applies to a data problem.

But hallucination is a real risk. An LLM reading a vague transaction description might confidently assign a category that seems reasonable but is wrong. An LLM extracting attributes from a sparse product description might invent details that aren't there. The model appears helpful and authoritative even when it's confabulating.
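
A standard mitigation is a closed label set with an explicit fallback bucket. A sketch, assuming the hypothetical call_llm helper and made-up category names:

```python
CATEGORIES = ["office supplies", "travel", "entertainment", "software", "other"]

ANNOTATE_PROMPT = """Classify the transaction into exactly one category from this list:
{labels}
If none clearly fits or the description is too vague, answer "other".
Transaction: {text}
Answer with the category only."""


def annotate_transaction(text: str, call_llm) -> str:
    label = call_llm(ANNOTATE_PROMPT.format(labels=CATEGORIES, text=text)).strip().lower()
    # A closed label set plus an explicit "other" bucket keeps a vague description
    # from being confidently filed under a plausible but wrong category.
    return label if label in CATEGORIES else "other"
```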

Data profiling extracts statistical and structural summaries of datasets. What are the distributions of values in each column? Which columns have missing data and how much? What are the relationships between columns? Which records are outliers? What constraints do the values satisfy?

Example of LLM-Enhanced Data Profiling

This is a hybrid task where LLMs can be useful but aren't universally better. Computing basic statistics (mean, median, distribution) is something traditional methods do perfectly well. But interpreting what those statistics mean, identifying interesting patterns, and generating human-readable summaries of data quality is where LLMs add value. An LLM can read a profile of a customer database and generate a natural language summary: "Most customers are concentrated in three metropolitan areas. New customer acquisition has grown 15% year-over-year, but retention has declined in the Northeast region."
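
A sketch of that division of labor, using pandas for the deterministic statistics and the hypothetical call_llm helper only for the narrative summary:

```python
import pandas as pd


def profile_then_summarize(df: pd.DataFrame, call_llm) -> str:
    # Deterministic part: traditional profiling, where the LLM adds nothing.
    profile = {
        "rows": len(df),
        "missing_pct": (df.isna().mean() * 100).round(1).to_dict(),
        "numeric_summary": df.select_dtypes("number").describe().round(2).to_dict(),
    }
    # Semantic part: the model turns the numbers into a readable quality report.
    prompt = ("In three sentences, summarize the data quality of this table profile, "
              f"flagging columns with heavy missingness or suspicious ranges:\n{profile}")
    return call_llm(prompt)
```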

Research on how LLMs serve as data preprocessors suggests they excel at narrative and semantic tasks but struggle with reliable statistical reasoning. They can make probabilistic inferences but sometimes with inflated confidence.

The real-world tradeoffs

The shift from rule-based to LLM-driven data preparation isn't a pure upgrade. It's an exchange. You gain flexibility, semantic understanding, and the ability to handle novel problems. You lose predictability, auditability, and cost efficiency.

What LLMs gain: Flexibility is the headline. A single model can handle diverse data problems without retraining. Semantic understanding means the model can reason about the meaning of data, not just its format. Context awareness allows decisions based on interactions between columns. When data violates the assumptions that rule-based systems rely on, LLMs often adapt gracefully.

What LLMs lose: Predictability. A rule deterministically produces the same output every time. An LLM can vary between runs due to sampling, or produce different results with slightly different prompts. Auditability is harder. Why did the model standardize this address this way? The answer involves statistical patterns learned from training data, not explicit rules you can inspect. Cost scaling is real. Calling an LLM API millions of times to clean a large dataset is expensive in ways that local rule execution isn't. Hallucinations are persistent even in advanced models. Confidence calibration remains unsolved. The model can be very confident when it's actually wrong.

There's also a subtler tradeoff: evaluation becomes harder. With rule-based systems, correctness is often objective. Did the script produce the expected output? With LLM-based systems, correctness is sometimes subjective. Is this standardization of an address "correct" even though it differs from the rule-based approach? Both interpretations might be defensible.

Why evaluation is broken

This is where the survey's careful analysis becomes critical. The field has moved faster than evaluation has. Most papers use different benchmarks, different metrics, different datasets. Comparing results across papers is nearly impossible. Many evaluations are small-scale: hundreds of records tested on one domain. Real data preparation scales to millions of records across diverse domains.

The core problem: data preparation evaluation should measure whether downstream tasks actually perform better after preparation. But most papers measure preparation quality in isolation. They count how many records are cleaned correctly, how many matches are accurate, how many annotations match human labels. But a 90% accurate cleaning might produce better downstream predictions than a 95% accurate one if the remaining errors are in less critical columns. The field measures the proximal task while ignoring whether it actually solves the distal problem.

There's also the annotation problem. What counts as a "correct" standardization of an address? Different annotators might disagree, especially for edge cases. The survey found that most datasets lack rigorous inter-annotator agreement studies. This means papers might be optimizing for one evaluator's preferences, not for actual correctness.

The issue compounds when you consider that real data preparation is heterogeneous. A company's data pipeline might need to clean customer records, integrate them with transaction data, then enrich them with external datasets. Evaluating any single task in isolation misses the interactions. An imputation that seems reasonable for one downstream task might be disastrous for another.

The path forward

The survey identifies several critical open challenges. Scalability is first. LLMs work well on small datasets and prototypes. Scaling to terabytes of data while keeping costs reasonable remains unsolved. Some approaches use LLMs to generate rules that then execute locally. Others use LLMs selectively on ambiguous records. But there's no principled framework yet.
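
A sketch of the selective-routing idea for one field (ZIP codes), where cheap local rules handle the easy cases and only ambiguous values reach the model; the regexes, thresholds, and call_llm helper are illustrative assumptions, not a framework from the survey:

```python
import re

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")


def clean_zip(value: str, call_llm):
    """Cheap local rules first; only ambiguous values reach the expensive model."""
    value = value.strip()
    if ZIP_RE.match(value):
        return value                      # already clean: no API call, no cost
    digits = re.sub(r"\D", "", value)
    if len(digits) == 5:
        return digits                     # trivially repairable by a local rule
    # Only the genuinely ambiguous residue is escalated to the model.
    reply = call_llm(f"Extract a valid 5-digit US ZIP code from '{value}', "
                     "or answer 'none' if there is no usable ZIP.").strip()
    return reply if ZIP_RE.match(reply) else None
```

On most real columns the first two branches absorb the bulk of the volume, which is what keeps the per-record cost closer to rule execution than to an API call.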

Reliable agentic workflows are another. The idea is appealing: an agent that can iteratively clean data, ask for human feedback when uncertain, and refine its approach. But current agent architectures are brittle. They get stuck in loops, make mistakes in planning, and don't gracefully handle feedback. Designing agentic systems that are robust and actually improve with human input remains an open problem.

Robust evaluation is foundational. The field needs standardized benchmarks that test real-world scenarios, not just isolated tasks. Evaluations should measure downstream impact, not just preparation accuracy. And they should account for the heterogeneity of real data problems.

There's also a deeper question about when LLMs should be used at all. The survey suggests they're particularly valuable when (1) data heterogeneity is high, (2) semantic understanding is required, (3) rules would be brittle, and (4) perfect accuracy isn't critical. They're less valuable when (1) data is highly structured, (2) rules can capture the problem exhaustively, (3) errors compound into catastrophic failures downstream, and (4) cost is critical.

The future probably isn't "all LLMs" or "no LLMs" for data preparation. It's hybrid systems that use LLMs where they excel (semantic reasoning, handling novel patterns) and use traditional methods where they're better (deterministic execution, reliable cost, guaranteed consistency).

The transition from rule-based to LLM-enhanced data preparation is real and accelerating. But it's not a simple replacement. It's a fundamental shift in how we think about preparing data, paired with new challenges that the field is only beginning to address. The most successful practitioners will be those who understand both the power and the limits, and who know when to apply each approach.

