In the age of Agentic AI - where autonomous systems launch campaigns, personalise content, and trigger actions without human approval - data quality is everything. data quality is everything These AI systems act on what's in your CRM, ad platform, or analytics table. If that data is outdated, duplicated, or inconsistent, they don't just make small mistakes - they make them at scale. And yes, the results from these autonomous systems aren’t always reliable - hallucinations happen, decisions occasionally miss the mark. But the pace of progress is fast. What feels experimental today could become a default workflow tomorrow. That’s why it’s time to get serious about your data. If your team hasn’t made it a priority yet, this is a good place to start. Clean, consistent, reliable data isn’t a nice-to-have anymore - it’s the baseline for modern, AI - powered marketing. baseline for modern, AI - powered marketing According to Gartner, poor data quality costs organisations an average of $12.9 million annually, cutting into efficiency and clouding decision - making. And with AI now driving more of the day-to-day execution - from content to targeting - those costs are likely to climb. Bad data doesn't just stay in the background any more; it gets amplified. Gartner $12.9 million annually amplified I wrote this article as a practical SQL query cookbook for marketers using SQL Server Management Studio (SSMS) from Microsoft. Whether you’re technical or just data - curious, these examples will help you solve common issues fast. And with ChatGPT - or any large language model (LLM) - writing and explaining the queries, you don’t need deep SQL knowledge to get started. With clean data and the ability to work with it directly, you’ll move faster - and your AI will, too. You Don’t Need to Be a Developer Anymore You Don’t Need to Be a Developer Anymore Historically, data clean-up in SQL required technical support. That’s no longer true. Thanks to ChatGPT, marketers can now: Describe a data issue in plain English Get a SQL query instantly Run or adapt it in SQL Server Management Studio Ask ChatGPT to explain or rewrite the query for a different structure Describe a data issue in plain English Get a SQL query instantly Run or adapt it in SQL Server Management Studio Ask ChatGPT to explain or rewrite the query for a different structure Even if you only understand basic SQL - how SELECT, WHERE, and JOIN work - you can start writing powerful queries and learning as you go. Basic SQL + ChatGPT = a fast track to becoming a professional, SQL - literate marketer. Basic SQL + ChatGPT = a fast track to becoming a professional, SQL - literate marketer. How to Use This Cookbook How to Use This Cookbook Open SQL Server Management Studio (SSMS) Copy a query below Adjust table or column names to match your schema Run a SELECT to preview changes before applying DELETE or UPDATE Use ChatGPT to adapt or explain any query Open SQL Server Management Studio (SSMS) Copy a query below Adjust table or column names to match your schema Run a SELECT to preview changes before applying DELETE or UPDATE Use ChatGPT to adapt or explain any query Note: All queries use SELECT so you can preview the changes first. Always confirm before converting to UPDATE or DELETE. 🍳 The Marketing Data Cleaning Query Cookbook 🍳 The Marketing Data Cleaning Query Cookbook Name Formatting 1. Fix Name Capitalization (Proper Case) SELECT first_name, last_name, UPPER(LEFT(first_name, 1)) + LOWER(SUBSTRING(first_name, 2, LEN(first_name))) AS fixed_first_name, UPPER(LEFT(last_name, 1)) + LOWER(SUBSTRING(last_name, 2, LEN(last_name))) AS fixed_last_name FROM contacts WHERE first_name IS NOT NULL AND last_name IS NOT NULL; SELECT first_name, last_name, UPPER(LEFT(first_name, 1)) + LOWER(SUBSTRING(first_name, 2, LEN(first_name))) AS fixed_first_name, UPPER(LEFT(last_name, 1)) + LOWER(SUBSTRING(last_name, 2, LEN(last_name))) AS fixed_last_name FROM contacts WHERE first_name IS NOT NULL AND last_name IS NOT NULL; This turns: JOHN → John sMiTh → Smith anna → Anna JOHN → John sMiTh → Smith anna → Anna 2. Flag Suspicious or Swapped Names Flags names that look unusual in length (as a clue for errors like swapped fields or invalid entries). SELECT * FROM contacts WHERE LEN(first_name) > 15 OR LEN(last_name) < 2; SELECT * FROM contacts WHERE LEN(first_name) > 15 OR LEN(last_name) < 2; This flags: first_name = JOHNSONANDJOHNSON, last_name = A first_name = ADMINACCOUNT, last_name = x first_name = JOHNSONANDJOHNSON, last_name = A first_name = ADMINACCOUNT, last_name = x 3. Build Full Name from First and Last Name SELECT first_name, last_name, UPPER(LEFT(first_name, 1)) + LOWER(SUBSTRING(first_name, 2, LEN(first_name))) + ' ' + UPPER(LEFT(last_name, 1)) + LOWER(SUBSTRING(last_name, 2, LEN(last_name))) AS full_name FROM contacts; SELECT first_name, last_name, UPPER(LEFT(first_name, 1)) + LOWER(SUBSTRING(first_name, 2, LEN(first_name))) + ' ' + UPPER(LEFT(last_name, 1)) + LOWER(SUBSTRING(last_name, 2, LEN(last_name))) AS full_name FROM contacts; This turns: first_name = JOHN, last_name = SMITH → full_name = John Smith first_name = aNNa, last_name = bROWN → full_name = Anna Brown first_name = JOHN, last_name = SMITH → full_name = John Smith first_name = aNNa, last_name = bROWN → full_name = Anna Brown 4. Trim Extra Spaces Remove leading/trailing spaces. SELECT LTRIM(RTRIM(first_name)) AS trimmed_first_name, LTRIM(RTRIM(last_name)) AS trimmed_last_name FROM contacts; SELECT LTRIM(RTRIM(first_name)) AS trimmed_first_name, LTRIM(RTRIM(last_name)) AS trimmed_last_name FROM contacts; This turns: ' John ' → 'John' ' smith ' → 'smith' ' John ' → 'John' ' smith ' → 'smith' Fuzzy matching and sound-based duplicates Find similar-sounding duplicates using SOUNDEX + DIFFERENCE Compares phonetically similar names. Example: Jon Smith and John Smyth Jon Smith John Smyth WITH PossibleDupes AS ( SELECT a.contact_id AS id1, b.contact_id AS id2, a.first_name AS name1, b.first_name AS name2, a.email AS email1, b.email AS email2, ROW_NUMBER() OVER (PARTITION BY a.first_name ORDER BY a.last_updated DESC) AS rn FROM contacts a JOIN contacts b ON a.contact_id < b.contact_id AND SOUNDEX(a.first_name) = SOUNDEX(b.first_name) AND DIFFERENCE(a.last_name, b.last_name) >= 3 ) SELECT * FROM contacts WHERE contact_id IN ( SELECT id2 FROM PossibleDupes WHERE rn > 1 ); WITH PossibleDupes AS ( SELECT a.contact_id AS id1, b.contact_id AS id2, a.first_name AS name1, b.first_name AS name2, a.email AS email1, b.email AS email2, ROW_NUMBER() OVER (PARTITION BY a.first_name ORDER BY a.last_updated DESC) AS rn FROM contacts a JOIN contacts b ON a.contact_id < b.contact_id AND SOUNDEX(a.first_name) = SOUNDEX(b.first_name) AND DIFFERENCE(a.last_name, b.last_name) >= 3 ) SELECT * FROM contacts WHERE contact_id IN ( SELECT id2 FROM PossibleDupes WHERE rn > 1 ); This flags potential duplicates like: Jon Smith and John Smyth → same SOUNDEX and high DIFFERENCE score Sara Adams and Sarah Adamz → close pronunciation, potential duplicate Keeps only the most recent entry by last_updated and flags others for review Jon Smith and John Smyth → same SOUNDEX and high DIFFERENCE score Sara Adams and Sarah Adamz → close pronunciation, potential duplicate Keeps only the most recent entry by last_updated and flags others for review Caution: Fuzzy logic can create false positives. Always inspect the results before deciding on deduplication. Caution: Fuzzy logic can create false positives. Always inspect the results before deciding on deduplication. Caution: String unification 1. Add leading zeros to numbers Ensures that numeric fields are always the same length (e.g., 5-digit ZIPs, 8-digit IDs) by padding with leading zeros. SELECT zip_code, REPLICATE('0', 5 - LEN(zip_code)) + zip_code AS padded_zip FROM leads WHERE LEN(zip_code) < 5; SELECT zip_code, REPLICATE('0', 5 - LEN(zip_code)) + zip_code AS padded_zip FROM leads WHERE LEN(zip_code) < 5; This turns: 123 → '00123' 7890 → '07890' 9 → '00009' 123 → '00123' 7890 → '07890' 9 → '00009' 2. Standardize text using REPLICATE + UPPER Standardizes values with consistent formatting — great for codes, campaign labels, or any inconsistent text input. SELECT campaign_code, UPPER(REPLICATE('0', 5 - LEN(campaign_code)) + campaign_code) AS formatted_code FROM campaigns WHERE LEN(campaign_code) < 5 OR campaign_code COLLATE Latin1_General_CS_AS != UPPER(campaign_code); SELECT campaign_code, UPPER(REPLICATE('0', 5 - LEN(campaign_code)) + campaign_code) AS formatted_code FROM campaigns WHERE LEN(campaign_code) < 5 OR campaign_code COLLATE Latin1_General_CS_AS != UPPER(campaign_code); This turns: abc → 00ABC x9 → 000X9 abc → 00ABC x9 → 000X9 3. Standardize country names Maps common country variants to a standard form. SELECT country, CASE WHEN country IN ('United States', 'US', 'U.S.') THEN 'USA' ELSE country END AS standardized_country FROM contacts; SELECT country, CASE WHEN country IN ('United States', 'US', 'U.S.') THEN 'USA' ELSE country END AS standardized_country FROM contacts; This turns: 'United States' → 'USA' 'US' → 'USA' 'U.S.' → 'USA' 'United States' → 'USA' 'US' → 'USA' 'U.S.' → 'USA' 4. Standardize Phone Numbers (e.g. US) Formats 10-digit numbers for readability. SELECT phone, '(' + SUBSTRING(phone, 1, 3) + ') ' + SUBSTRING(phone, 4, 3) + '-' + SUBSTRING(phone, 7, 4) AS formatted_phone FROM contacts WHERE LEN(phone) = 10 AND phone NOT LIKE '(%'; SELECT phone, '(' + SUBSTRING(phone, 1, 3) + ') ' + SUBSTRING(phone, 4, 3) + '-' + SUBSTRING(phone, 7, 4) AS formatted_phone FROM contacts WHERE LEN(phone) = 10 AND phone NOT LIKE '(%'; This turns: '2125551234' → '(212) 555-1234' '2125551234' → '(212) 555-1234' 5. Classify free vs business email providers SELECT email, CASE WHEN email LIKE '%gmail.com%' OR email LIKE '%yahoo.com%' OR email LIKE '%hotmail.com%' THEN 'free' ELSE 'business' END AS email_type FROM leads; SELECT email, CASE WHEN email LIKE '%gmail.com%' OR email LIKE '%yahoo.com%' OR email LIKE '%hotmail.com%' THEN 'free' ELSE 'business' END AS email_type FROM leads; This flags: gmail.com, yahoo.com, hotmail.com emails as 'free' gmail.com, yahoo.com, hotmail.com emails as 'free' Note: This is a preview-only query—it classifies but does not modify your data 6. Normalize job titles SELECT job_title, CASE WHEN LOWER(job_title) LIKE '%manager%' OR LOWER(job_title) LIKE '%lead%' OR LOWER(job_title) LIKE '%head%' THEN 'Manager' ELSE 'Other' END AS job_level FROM contacts; SELECT job_title, CASE WHEN LOWER(job_title) LIKE '%manager%' OR LOWER(job_title) LIKE '%lead%' OR LOWER(job_title) LIKE '%head%' THEN 'Manager' ELSE 'Other' END AS job_level FROM contacts; This classifies: 'Product Manager', 'Team Lead', 'Head of Growth' → 'Manager' level 'Product Manager', 'Team Lead', 'Head of Growth' → 'Manager' level Dealing with Missing Data 1. Find records with missing critical fields SELECT * FROM contacts WHERE email IS NULL OR first_name IS NULL OR last_name IS NULL; SELECT * FROM contacts WHERE email IS NULL OR first_name IS NULL OR last_name IS NULL; This finds: Contacts missing any of the key identity fields: email, first name, or last name Contacts missing any of the key identity fields: email, first name, or last name 2. Fill missing values with defaults using ISNULL Fill Missing Industry with 'Unknown' SELECT industry, ISNULL(industry, 'Unknown') AS cleaned_industry FROM leads; SELECT industry, ISNULL(industry, 'Unknown') AS cleaned_industry FROM leads; Fill Missing UTM Medium with 'email' SELECT utm_medium, ISNULL(utm_medium, 'email') AS cleaned_medium FROM sessions; SELECT utm_medium, ISNULL(utm_medium, 'email') AS cleaned_medium FROM sessions; Fill NULL Lead Scores with 0 SELECT lead_score, ISNULL(lead_score, 0) AS final_score FROM leads; SELECT lead_score, ISNULL(lead_score, 0) AS final_score FROM leads; This turns: NULL → 'Unknown' NULL → 'email' NULL → 0 (helps avoid failed scoring logic) NULL → 'Unknown' NULL → 'email' NULL → 0 (helps avoid failed scoring logic) 3. Use COALESCE to create fallback values Combine full name from available fields. SELECT full_name, first_name, last_name, COALESCE(full_name, first_name + ' ' + last_name, 'Anonymous') AS display_name FROM contacts; SELECT full_name, first_name, last_name, COALESCE(full_name, first_name + ' ' + last_name, 'Anonymous') AS display_name FROM contacts; This turns: full_name = NULL, first_name = 'John', last_name = 'Smith' → 'John Smith' All fields NULL → 'Anonymous' full_name = NULL, first_name = 'John', last_name = 'Smith' → 'John Smith' All fields NULL → 'Anonymous' Fill missing company from backup field SELECT company, employer, COALESCE(company, employer, 'Unknown') AS known_company FROM contacts; SELECT company, employer, COALESCE(company, employer, 'Unknown') AS known_company FROM contacts; Avoiding Duplicate Data 1. Select duplicate contacts by email WITH RankedContacts AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY Create_Date DESC) AS rn FROM contacts ) SELECT * FROM RankedContacts WHERE rn > 1; WITH RankedContacts AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY Create_Date DESC) AS rn FROM contacts ) SELECT * FROM RankedContacts WHERE rn > 1; This finds: Multiple records with the same email: john@example.com (3 entries). Multiple records with the same email: john@example.com (3 entries). john@example.com 2. Detect Duplicate Full Names WITH RankedNames AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY created_at DESC) AS rn FROM contacts ) SELECT * FROM RankedNames WHERE rn > 1; WITH RankedNames AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY created_at DESC) AS rn FROM contacts ) SELECT * FROM RankedNames WHERE rn > 1; This flags: People with the same name entered multiple times. People with the same name entered multiple times. 3. Detect duplicate email + name combo WITH ComboDupes AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY email, first_name, last_name ORDER BY contact_id ) AS rn FROM contacts ) SELECT * FROM ComboDupes WHERE rn > 1; WITH ComboDupes AS ( SELECT *, ROW_NUMBER() OVER ( PARTITION BY email, first_name, last_name ORDER BY contact_id ) AS rn FROM contacts ) SELECT * FROM ComboDupes WHERE rn > 1; This identifies: Repeat entries for the same person based on name and email. Repeat entries for the same person based on name and email. Out-of-range values and inaccurate data 1. Find out-of-range values Flags unrealistic values (age, lead score, conversion rate).Example:conversion_rate > 1 or age < 18 conversion_rate > 1 age < 18 SELECT * FROM leads WHERE age NOT BETWEEN 18 AND 100 OR lead_score NOT BETWEEN 0 AND 100 OR conversion_rate > 1; SELECT * FROM leads WHERE age NOT BETWEEN 18 AND 100 OR lead_score NOT BETWEEN 0 AND 100 OR conversion_rate > 1; This flags: age = 7 → flagged (too young) lead_score = 130 → flagged (should be between 0 and 100) conversion_rate = 1.2 → flagged (over 100% is not valid) age = 7 → flagged (too young) lead_score = 130 → flagged (should be between 0 and 100) conversion_rate = 1.2 → flagged (over 100% is not valid) 2. Detect contradictory values Example: Email opened after unsubscribe UK phone number in US record Email opened after unsubscribe UK phone number in US record Lead score = 0 but multiple conversions SELECT * FROM contacts WHERE (email_status = 'Unsubscribed' AND email_opened = 1) OR (country = 'USA' AND phone LIKE '+44%') OR (lead_score = 0 AND conversion_count > 0); SELECT * FROM contacts WHERE (email_status = 'Unsubscribed' AND email_opened = 1) OR (country = 'USA' AND phone LIKE '+44%') OR (lead_score = 0 AND conversion_count > 0); This flags: email_status = 'Unsubscribed' and email_opened = 1 → person opened email after opting out country = 'USA' and phone = '+447911123456' → UK number in US segment lead_score = 0 and conversion_count = 3 → score doesn't reflect activity email_status = 'Unsubscribed' and email_opened = 1 → person opened email after opting out country = 'USA' and phone = '+447911123456' → UK number in US segment lead_score = 0 and conversion_count = 3 → score doesn't reflect activity 3. Detect invalid emails Flags values without @ or domain. @ SELECT * FROM leads WHERE email NOT LIKE '%@%.%' OR email IS NULL OR email = ''; SELECT * FROM leads WHERE email NOT LIKE '%@%.%' OR email IS NULL OR email = ''; This finds: ‘johnexample.com’ NULL ““ (empty string) ‘johnexample.com’ johnexample.com NULL ““ (empty string) 4. Find internal or test contacts SELECT * FROM contacts WHERE email LIKE '%test%' OR email LIKE '%internal%' OR name LIKE '%demo%'; SELECT * FROM contacts WHERE email LIKE '%test%' OR email LIKE '%internal%' OR name LIKE '%demo%'; This flags: internal accounts used for QA/testing test@yourcompany.com internal accounts used for QA/testing test@yourcompany.com test@yourcompany.com 5. Detect invalid dates SELECT * FROM events WHERE event_date > GETDATE() OR event_date IS NULL; SELECT * FROM events WHERE event_date > GETDATE() OR event_date IS NULL; This finds: Future dates Empty or null event dates Future dates Empty or null event dates Splitting data of one column into more columns 1. Split Full Name into First and Last SELECT full_name, LEFT(full_name, CHARINDEX(' ', full_name) - 1) AS first_name, SUBSTRING(full_name, CHARINDEX(' ', full_name) + 1, LEN(full_name)) AS last_name FROM contacts WHERE full_name LIKE '% %'; SELECT full_name, LEFT(full_name, CHARINDEX(' ', full_name) - 1) AS first_name, SUBSTRING(full_name, CHARINDEX(' ', full_name) + 1, LEN(full_name)) AS last_name FROM contacts WHERE full_name LIKE '% %'; This turns: 'John Smith' → first_name = 'John', last_name = 'Smith' 'John Smith' → first_name = 'John', last_name = 'Smith' 2. Split Location Field into City and Country Assuming format: ”City, Country” SELECT location, LEFT(location, CHARINDEX(',', location) - 1) AS city, LTRIM(SUBSTRING(location, CHARINDEX(',', location) + 1, LEN(location))) AS country FROM leads WHERE location LIKE '%,%'; SELECT location, LEFT(location, CHARINDEX(',', location) - 1) AS city, LTRIM(SUBSTRING(location, CHARINDEX(',', location) + 1, LEN(location))) AS country FROM leads WHERE location LIKE '%,%'; This turns: Paris, France → city = ‘Paris’, country =’France’ Paris, France → city = ‘Paris’, country =’France’ Bonus: Train GPT on Your Schema Bonus: Train GPT on Your Schema For even more accurate and relevant SQL, create a custom GPT assistant trained on: A list of your CRM’s table and column names Common cleaning queries from your team Notes on formatting rules or naming patterns A list of your CRM’s table and column names Common cleaning queries from your team Notes on formatting rules or naming patterns Now ChatGPT doesn’t just write generic SQL. It writes your SQL. Final Thoughts Final Thoughts Agentic AI is changing how marketers work. It’s fast, autonomous, and powerful—but it depends entirely on the quality of your data. Thanks to SQL and ChatGPT, marketers can now clean that data themselves—without bottlenecks or technical debt. You don’t need to become an engineer. You just need: The right questions The right queries And the right assistant The right questions The right queries And the right assistant Use this cookbook. Prompt ChatGPT. Learn as you go. Clean data isn’t just an optimization anymore. It’s the key to making AI work the way it’s supposed to.