Chat with your Database is the current Holy Grail of enterprise AI.
We have all seen the demos: A manager asks, "Show me the population growth in the downtown district for Q3," and the AI magically generates the perfect SQL query.
But when you deploy this in production, the magic often breaks. The LLM hallucinates table names, misunderstands obscure column headers like pop_val_09, or joins unrelated tables.
Why? Because RAG is only as good as its metadata.
If your database schema is poorly documented (and let’s be honest, most are), the LLM is guessing. This article breaks down an architectural pattern to solve this: An Automated Data Catalog Refinement System.
We will explore how to use query logs and schema similarity to detect bad metadata and automatically suggest improvements, creating a self-healing RAG pipeline for your data warehouse.
The Problem: Garbage In, Hallucination Out
In a standard Text-to-SQL RAG (Retrieval-Augmented Generation) setup, the workflow looks like this:
- User asks a question.
- System searches the Data Catalog (Vector Store) for relevant table schemas.
- System feeds those schemas + the user question to the LLM.
- LLM generates SQL.
If your Data Catalog describes a column simply as num_val instead of "Monthly aggregate of inbound traffic," the LLM fails. The bottleneck isn't the model (GPT-4 or Claude 3); the bottleneck is the empty description fields in your DDL.
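For reference, a minimal sketch of that retrieval-and-generation step could look like the following; vector_store and llm are hypothetical stand-ins for whatever retriever and model client you actually run:

def answer_question(question: str, vector_store, llm) -> str:
    # 1. Retrieve the table schemas most relevant to the question.
    schemas = vector_store.search(question, top_k=5)

    # 2. Assemble the prompt: retrieved schemas (with column descriptions) plus the question.
    schema_context = "\n\n".join(s["ddl_with_descriptions"] for s in schemas)
    prompt = (
        "You are a SQL assistant. Use only the tables below.\n\n"
        f"{schema_context}\n\n"
        f"Question: {question}\n"
        "Return a single SQL query."
    )

    # 3. Generate SQL. If the description fields are empty, the model is guessing.
    return llm.complete(prompt)

Everything that follows is about making the schemas retrieved in step 1 rich enough to be useful in step 3.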
The Solution Architecture: The "Self-Healing" Catalog
We need a mechanism that doesn't just read the catalog but improves it. The proposed solution uses a feedback loop based on User Behavior (SQL Logs) and Schema Similarity.
At a high level, the architecture consists of three phases:
Phase 1: Detecting "Bad" Catalogs via SQL Logs
How do we know a table description is bad without reading thousands of entries? We look at how users struggle.
If a Data Analyst (or an AI agent) is struggling to query a dataset, their behavior leaves a digital footprint in the SQL Execution Logs.
**The "Wandering" Pattern:
\ If the metadata is clear, a query is usually successful on the first or second try.
If the metadata is vague, we see a pattern of Wandering:
- Query Table A -> Fail.
- Query Table B -> Fail.
- Join A and C -> Fail.
- Finally, query Table D -> Success.
This "search, fail, search, succeed" pattern is a strong signal that the metadata for Tables A, B, and C is misleading or insufficient.
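A minimal sketch of this detection, assuming your warehouse exposes execution logs as per-session records with the tables touched and a success flag (the field names here are illustrative, not a real log format):

from collections import defaultdict

def find_misleading_tables(log_records, min_failures=2):
    """
    log_records: list of dicts like
        {"session_id": "s1", "tables": ["table_a"], "success": False},
    ordered by timestamp. Returns tables ranked by how often they appear
    among the failed attempts of a session that eventually succeeded.
    """
    sessions = defaultdict(list)
    for record in log_records:
        sessions[record["session_id"]].append(record)

    suspect_counts = defaultdict(int)
    for attempts in sessions.values():
        failed_tables = set()
        for i, attempt in enumerate(attempts):
            if attempt["success"]:
                # Success only after several failures ("Wandering"): flag the
                # tables the user tried and abandoned along the way.
                if i >= min_failures:
                    for table in failed_tables:
                        suspect_counts[table] += 1
                break
            failed_tables.update(attempt["tables"])

    return sorted(suspect_counts.items(), key=lambda item: -item[1])

Tables that top this ranking across many sessions are the first candidates for metadata review in Phase 2.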
Phase 2: The Similarity Engine
Once we identify a Bad catalog entry, we need to fix it. We can't expect users to write documentation from scratch. Instead, we use Schema Similarity to find a good example that already exists in your system.
If you have a poorly documented table sales_2024_raw, but a pristine, well-documented table sales_2023_final, the system should detect this relationship.
**The Algorithm: Jaccard & MinHash**
To find similar schemas efficiently (even across millions of tables), we can compare column profiles:
- Column Type Matching: Do both tables have {String, Int, Date, Int}?
- Semantic Name Matching: Is cust_id similar to customer_identifier?
Here is a Python prototype of how you might calculate Schema Similarity to find a Documentation Donor:
def calculate_jaccard_similarity(schema_a, schema_b):
    """
    Calculates similarity between two table schemas based on column definitions.
    schema = set of (column_name, data_type) tuples
    """
    intersection = len(schema_a.intersection(schema_b))
    union = len(schema_a.union(schema_b))
    if union == 0:
        return 0.0
    return intersection / union

# Example Schemas
bad_catalog = {('id', 'int'), ('val', 'float'), ('dt', 'date')}
good_catalog = {('user_id', 'int'), ('revenue', 'float'), ('transaction_date', 'date')}

# In a real implementation, you would use Semantic Embeddings
# instead of exact string matching for column names.
similarity_score = calculate_jaccard_similarity(bad_catalog, good_catalog)

if similarity_score > 0.4:
    print("Optimization Opportunity: Use metadata from 'Good Catalog' to annotate 'Bad Catalog'")
Note: For large-scale enterprise databases, comparing every pair of tables is too slow. You would use MinHash with Locality-Sensitive Hashing (LSH) to filter candidates quickly.
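Here is a sketch of that candidate filtering using the datasketch library (an assumption on my part; any MinHash/LSH implementation will do). Each schema is hashed into a compact signature, indexed once, and queried directly instead of being compared against every other table:

from datasketch import MinHash, MinHashLSH

def schema_minhash(schema, num_perm=128):
    """Build a MinHash signature from a set of (column_name, data_type) tuples."""
    signature = MinHash(num_perm=num_perm)
    for column_name, data_type in schema:
        signature.update(f"{column_name}:{data_type}".encode("utf-8"))
    return signature

# Index every well-documented ("gold") schema once.
lsh = MinHashLSH(threshold=0.4, num_perm=128)
gold_schemas = {
    "sales_2023_final": {("user_id", "int"), ("revenue", "float"), ("transaction_date", "date")},
}
for table_name, schema in gold_schemas.items():
    lsh.insert(table_name, schema_minhash(schema))

# A new, undocumented table that shares most of its columns with the 2023 table.
sales_2024_raw = {("user_id", "int"), ("revenue", "float"),
                  ("transaction_date", "date"), ("channel", "string")}
candidates = lsh.query(schema_minhash(sales_2024_raw))
print(candidates)  # ['sales_2023_final'] -- a candidate "Documentation Donor"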
Phase 3: The "Golden Record" Feedback Loop
The final step is the User Interface. When a data owner registers a new dataset (or when the system detects the Wandering pattern), the system intervenes.
The Workflow:
- Trigger: User uploads city_population_2024.csv with no descriptions.
- Scan: System scans the vector database for similar schemas.
- Match: System finds city_population_2023, which has rich descriptions: "Column A represents the annual census count."
- Prompt: "This dataset looks 95% similar to the 2023 data. Would you like to copy the descriptions and tags?"
By reducing the friction of documentation, we ensure the Vector Store used by the RAG system is populated with high-quality, semantically rich context.
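The Prompt step above can be assembled directly from the similarity scores computed in Phase 2. A minimal sketch, assuming a catalog shaped as table name -> schema plus descriptions (the structure and helper names here are illustrative); pass the Jaccard function, or its embedding-based variant, as similarity_fn:

def suggest_metadata(new_table, candidate_donors, catalog, similarity_fn, threshold=0.8):
    """
    new_table: name of the freshly registered, undocumented table.
    candidate_donors: candidate donor table names, e.g. from an LSH index.
    catalog: maps table name -> {"schema": set of (column, type), "descriptions": dict}.
    Returns a suggestion for the data owner, or None if nothing is similar enough.
    """
    best_donor, best_score = None, 0.0
    for donor in candidate_donors:
        score = similarity_fn(catalog[new_table]["schema"], catalog[donor]["schema"])
        if score > best_score:
            best_donor, best_score = donor, score

    if best_donor is None or best_score < threshold:
        return None

    # Surface this as a one-click suggestion; never overwrite existing metadata silently.
    return {
        "message": (f"This dataset looks {best_score:.0%} similar to {best_donor}. "
                    "Would you like to copy the descriptions and tags?"),
        "proposed_descriptions": catalog[best_donor]["descriptions"],
    }

Once the owner accepts, the enriched descriptions are re-embedded and upserted into the Vector Store so the next retrieval round sees them.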
Why This Matters for GenAI
LLMs are probabilistic engines. They rely on semantic clarity.
- Without this system: The LLM sees col_a and hallucinates that it means "Age."
- With this system: The LLM sees col_a (inherited description: "The total inbound traffic in GB") and generates the correct SQL SUM(col_a).
Impact on RAG Accuracy
Implementing a "Self-Healing" Data Catalog significantly improves the Hit Rate of the retriever. When users ask natural language questions, the embedding model can verify the semantic similarity between the question and the well-described table schema, rather than guessing based on cryptic column names.
Summary
Don't just build a RAG chatbot and hope for the best. Engineer your metadata pipeline.
- Monitor SQL Logs to identify "confusing" data.
- Use Schema Similarity to propagate documentation from "Gold" tables to new tables.
- Automate the updates to your Vector Store.
This turns your database from a black box into a transparent, AI-ready engine.
