The Silent Flaws of Embeddings: Why Your AI Is Getting It Wrong
Unstructured text data has proliferated in recent years, ranging from technical documentation and scientific literature to social media posts and customer reviews. Businesses across sectors increasingly recognize that this text holds invaluable insights, but extracting, organizing, and acting on this data is difficult.
Conventional keyword-based text processing techniques have proven insufficient because they fail to capture the semantic richness of human language. For example, a search for "car issues" might not turn up relevant documents about "automobile problems", and a product review that mentions "screen froze" might be missed when examining customer complaints about "system crashes."
This basic limitation drove the development and adoption of text embeddings, a technique that converts words and sentences into numerical vectors that capture their meaning. Large tech firms have spent billions on creating ever-more-advanced embedding models; OpenAI's embeddings, Meta's RoBERTa, Google's BERT, and a host of open-source alternatives are now essential parts of contemporary NLP systems.
However, despite their extensive use, we still don't fully grasp how these embedding models behave in practical settings, and that lack of understanding leads to costly, hard-to-diagnose failures in production systems.
Many different industries can benefit directly from this kind of analysis. By offering a systematic way to understand how an embedding model decides what counts as similar, it helps organizations make better decisions about model selection, preprocessing needs, and possible enhancement strategies. That ultimately results in language processing systems that are more efficient, reliable, and equitable.
For the uninitiated, text embeddings convert words or sentences into numerical vectors – essentially long, dense lists of numbers with many dimensions. These vectors are positioned in a high-dimensional space where semantic similarity is represented by proximity. In simpler terms, things with similar meanings should be close together in this space. These embeddings power practically everything in modern NLP:
When you search Google and it understands you meant "New York hotels" even though you typed "places to stay in NYC"
When your email client suggests replies
When a content recommendation system figures out which articles you might like
This capability powers numerous applications.
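If you've never touched embeddings in code, here's roughly what that looks like. This is a minimal sketch using the open-source sentence-transformers library and the all-mpnet-base-v2 model (one example choice among many):

```python
from sentence_transformers import SentenceTransformer, util

# Load an open-source embedding model (an example choice, not a recommendation).
model = SentenceTransformer("all-mpnet-base-v2")

query = "places to stay in NYC"
docs = ["New York hotels", "How to bake an apple pie"]

# Encode text into dense vectors, then compare them with cosine similarity:
# scores near 1.0 mean "close in meaning", scores near 0.0 mean unrelated.
scores = util.cos_sim(model.encode(query), model.encode(docs))[0]
for doc, score in zip(docs, scores):
    print(f"{float(score):.2f}  {doc}")
# The paraphrase should score far higher than the unrelated sentence.
```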
How well these applications work depends largely on how the embedding model handles different text variants and linguistic phenomena. A model that treats "I love this product" and "I hate this product" as almost identical is useless for sentiment analysis, while a model that is overly sensitive to small formatting variations will struggle with real search queries.
I'll never forget the day I was evaluating embedding models for one of my customers using their sample documents. I searched for "laptops without touch screens," only to get back results about touch screen laptops from the vector index. Our embedding model had completely missed the negation.
That moment stuck with me. I mean, if the model couldn't understand a basic "without", what else was it missing? That episode led me down an unexpected path. I had to know how serious the problem was. Sure, the benchmarks for our embedding model looked impressive, but how did it behave with the messy, real-world language our users typed every day? What happened when words were misspelled, sentences were incomplete, or people used emojis?
After two years of experimenting with various embedding models across different applications using RAG Experiment Accelerator (https://github.com/microsoft/rag-experiment-accelerator), I've developed a systematic way to evaluate how these models handle different types of sentence variations. What I found scared me.
These weren't random glitches – they were deep, systematic blind spots baked into how these models see language. And nobody was talking about it. I'm sharing these findings because I've seen too many teams waste months implementing sophisticated embedding-based systems only to discover too late that they break in unexpected ways.
This issue isn't just academic. The consequences of misunderstanding embedding behavior are serious: A colleague at a major healthcare company told me their medical information retrieval system was missing relevant clinical documents because their embedding model couldn't properly handle abbreviations and medical terminology variations. In healthcare, this kind of miss can affect treatment decisions.
At a financial services firm, their compliance monitoring system missed policy violations because their embedding model couldn't recognize that passive voice sentences ("funds were transferred") meant the same thing as active voice ("someone transferred funds"). I've witnessed e-commerce companies lose millions of dollars due to their product search system's inability to accommodate the common typos and linguistic shortcuts used by shoppers.
I developed a testing framework that examines how embedding models respond to many different categories of text variations. Using cosine similarity as the metric (where 1.0 means identical meaning and 0.0 means completely unrelated), I ran hundreds of test cases. I'll walk through what I found, focusing on multiple models, including MSMARCO DistilBERT, OpenAI's text embeddings, and many more, and I've seen similar patterns across most transformer-based embeddings.
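To make the setup concrete, here's a minimal sketch of that kind of test harness, again assuming sentence-transformers and all-mpnet-base-v2. The pairs mirror the findings I walk through below; the exact scores you get will vary by model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def pair_similarity(a: str, b: str) -> float:
    """Cosine similarity between the embeddings of two sentences (1.0 = identical)."""
    emb = model.encode([a, b])
    return float(util.cos_sim(emb[0], emb[1]))

# Illustrative pairs for each category of variation discussed in this post.
test_cases = {
    "case sensitivity": ("Apple announced new products",
                         "apple announced new products"),
    "numbers":          ("The investment returned 2% annually",
                         "The investment returned 20% annually"),
    "negation":         ("The treatment improved patient outcomes",
                         "The treatment did not improve patient outcomes"),
    "whitespace":       ("The  quick   brown fox", "Thequickbrownfox"),
    "perspective":      ("The car is to the left of the tree",
                         "The car is to the right of the tree"),
    "conditional":      ("If demand increases, prices will rise",
                         "If demand increases, prices will fall"),
    "range vs exact":   ("The product costs between $50-$100",
                         "The product costs exactly $101"),
}

for name, (a, b) in test_cases.items():
    print(f"{name:>16}: {pair_similarity(a, b):.3f}")
```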
This one shocked me. Embedding models see "Apple announced new products" and "apple announced new products" as EXACTLY the same thing – perfect 1.0 similarity score. No difference whatsoever.
I ran into this with a product catalogue system. The search couldn't tell the difference between the brand "Apple" and the fruit "apple." Customers searching for Apple products got recipes for apple pie. Do you think your customers will be happy with those results? I certainly wouldn't be.
Why does this matter? Think about all the cases where capitalization changes meaning – "Polish" vs "polish", "March" vs "march", "Bill" vs "bill". For legal or medical texts, these distinctions can be crucial. We're basically working with models that are partially blind to an entire dimension of written language. There are ways to fix these kinds of issues, and I'll cover them later; for now, let's look at more problems with embeddings. Note that this behavior can even be an advantage if case differences don't matter for your use case.
This one also floored me. Embedding models see "The investment returned 2% annually" and "The investment returned 20% annually" as practically the same thing: a crazy-high 0.97 similarity score, with no discernible difference between two very different financial outcomes.
I ran into this with a financial document search system. The algorithm couldn't tell the difference between "management fee: 0.2%" and "management fee: 2.0%." Investors searching for low-fee funds got recommended expensive options instead. Do you think your retirement account would be pleased with that mistake? I am sure I wouldn't like seeing my savings eaten by fees.
Think about all the cases where numerical values are critical: dosage instructions, engineering tolerances, financial returns, and contract deadlines. For investment or medical text, these distinctions can be life-altering. We're basically working with models that are numerically illiterate despite handling text filled with important quantities.
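One lightweight guardrail for this class of error (a sketch, not a complete solution) is to pull the numbers out of two texts the embedding model considers near-identical and flag any disagreement:

```python
import re

def numbers_disagree(a: str, b: str) -> bool:
    """Flag two texts whose embeddings look similar but whose numbers differ."""
    # Grab integers and decimals, e.g. "2", "0.2", "20".
    nums_a = set(re.findall(r"\d+(?:\.\d+)?", a))
    nums_b = set(re.findall(r"\d+(?:\.\d+)?", b))
    return nums_a != nums_b

print(numbers_disagree("management fee: 0.2%", "management fee: 2.0%"))  # True
print(numbers_disagree("returns 5% annually", "returns 5% annually"))    # False
```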
This one's actually dangerous. Adding "not" to a sentence – literally flipping its meaning – barely affects similarity scores. We routinely saw scores above 0.95 for complete opposites: "The treatment improved patient outcomes" vs. "The treatment did not improve patient outcomes" scored 0.96 similarity. When I showed this to a doctor who was using our medical search system, he was horrified – so horrified that he physically backed away from the computer. And he should be.
We were building a system that doctors would use to find treatment protocols. Get this wrong, and people could die. Negation isn't some edge case – it's fundamental to human language. When your search, recommendation, or analysis system can't tell "effective" from "ineffective" or "safe" from "unsafe", you're building dangerous hallucination machines.
In healthcare, it could mean recommending harmful treatments. In legal documents, it could completely invert contractual obligations. In content moderation, you could miss the difference between "violence is acceptable" and "violence is never acceptable."
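A similarly crude check is to flag results where the query and the retrieved text disagree on obvious negation cues. This is illustrative only; a production system would want something more linguistically informed, such as a dependency parse:

```python
NEGATION_CUES = {"not", "no", "never", "n't", "without", "none"}

def negation_mismatch(query: str, doc: str) -> bool:
    """True if exactly one of the two texts contains an obvious negation cue."""
    def has_negation(text: str) -> bool:
        tokens = text.lower().replace("n't", " n't ").split()
        return any(tok.strip(".,!?") in NEGATION_CUES for tok in tokens)
    return has_negation(query) != has_negation(doc)

print(negation_mismatch("The treatment improved patient outcomes",
                        "The treatment did not improve patient outcomes"))  # True
print(negation_mismatch("laptops without touch screens",
                        "touch screen laptops"))                            # True
```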
Extra spaces, tabs, weird formatting – the models don't care. Similarity stays above 0.995. But remove all spaces? Similarity suddenly drops to 0.82. I encountered this issue while working with content that was scraped and had irregular spacing due to poor HTML. We built this beautiful search system for a digital library with thousands of scraped documents. Half the queries returned nothing useful because of inconsistent spacing. The librarians were ready to scrap the whole project.
This quirk becomes devastating when dealing with user-generated content, OCR'd documents, or languages that don't use spaces the same way English does (like Thai or Chinese). It also means these models struggle with hashtags, URLs, and product codes – things real people search for every day.
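If your content comes from scraping or OCR, a simple whitespace normalization pass before embedding is worth considering. Here's a sketch; whether it's safe depends on your content, since things like product codes or code snippets can be whitespace-sensitive:

```python
def normalize_whitespace(text: str) -> str:
    """Collapse tabs, newlines, and repeated spaces into single spaces."""
    return " ".join(text.split())

print(normalize_whitespace("Search   the \t digital\nlibrary"))
# -> "Search the digital library"
```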
Embedding models see "The car is to the left of the tree" and "The car is to the right of the tree" as nearly identical – an insanely high 0.98 similarity score – despite the two sentences describing opposite spatial relationships. Do you think warehouse managers will be pleased when robots deliver packages to the wrong stations? I can assure you I wouldn't be.
Think about all the cases where perspective and reference frames are essential – navigation directions, spatial relationships, relative positioning in medical procedures, and legal descriptions of accident scenes. These distinctions aren't trivial – they entirely change the meaning based on perspective. We're basically working with models that can't tell if you're describing something from the front or from behind.
This one made me laugh, then cry. Embedding models see "If demand increases, prices will rise" and "If demand increases, prices will fall" as practically identical – a shocking 0.95 similarity score – even though they describe entirely opposite economic scenarios!
I hit this problem building an analysis system for economic research papers. The algorithm couldn't distinguish between opposing causal relationships. Economists searching for papers about price increases during demand surges got results about price drops during recessions instead. Do you think financial analysts making million-dollar investment decisions would appreciate getting exactly backwards information? I am sure I wouldn't want my retirement funds managed that way.
Again, think about all the cases where counterfactual reasoning is critical: economic projections, medical cause-and-effect relationships, legal hypotheticals, and engineering failure analysis. When you can't tell "if X, then Y" from "if not-X, then not-Y", you fundamentally misunderstand the causal relationship. We're basically working with models that can't grasp basic conditional logic despite handling text full of hypotheticals and predictions.
This one left me speechless. Embedding models see "The product costs between $50-$100" and "The product costs exactly $101" as NEARLY the same thing - a mind-boggling 0.98 similarity score. One’s in a range, one's outside that range, but the model barely notices!
I discovered this while building a price comparison system for an e-commerce client. The search couldn't distinguish between price ranges and exact prices, even when the exact price was outside the specified range. Shoppers with a strict $100 budget searching for products "under $100" kept getting shown items costing $120 or $150. Do you think customers with fixed budgets appreciate seeing products they explicitly can't afford? I am sure I wouldn't want to waste time browsing things I can't buy.
Think about all the cases where ranges versus exact values are crucial - pricing decisions, medication dosing ranges, legal deadlines, safety tolerances, and performance benchmarks. When your model treats "at least 25 days" and "exactly 20 days" as basically identical, you've lost critical meaning. We're basically working with models that can't distinguish between flexibility and precision, despite handling text where exactly this distinction drives decisions.
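For range-sensitive domains like pricing, a small verification step outside the embedding space can catch this. Here's a sketch with deliberately naive price parsing; real listings need currency symbols, thousands separators, and locale handling:

```python
import re
from typing import Optional

def price_in_budget(text: str, budget: float) -> Optional[bool]:
    """Check whether the price(s) mentioned in a text fit within a budget.

    Returns None when no price is found.
    """
    prices = [float(p) for p in re.findall(r"\$(\d+(?:\.\d+)?)", text)]
    if not prices:
        return None
    # For a range like "$50-$100", the binding constraint is the upper bound.
    return max(prices) <= budget

print(price_in_budget("The product costs between $50-$100", 100))  # True
print(price_in_budget("The product costs exactly $101", 100))      # False
```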
Here is the comparison between msmarco-distilbert-base-tas-b, all-mpnet-base-v2, and OpenAI's text-embedding-3-large; you will notice that there is no significant difference between the outputs of these models across these test cases.
[Figure: msmarco-distilbert-base-tas-b embedding scores across the different test cases]
[Figure: all-mpnet-base-v2 embedding scores across the different test cases]
[Figure: openai-text-embedding-3-large embedding scores across the different test cases]
Look, embeddings are amazingly useful despite these problems. I'm not saying don't use them – I'm saying use them with your eyes open. Here's my battle-tested advice after dozens of projects and countless failures:
Test your model on real user language patterns before deployment. Not academic benchmarks, not sanitized test cases – actual examples of how your users communicate. We built a "linguistic stress test" toolkit that simulates common variations like negations, typos, and numerical differences. Every system we test fails in some areas – the question is whether those areas matter for your specific application.
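Here's a rough sketch of what such a stress test can look like (not the actual toolkit; it assumes sentence-transformers and all-mpnet-base-v2, and the perturbations are deliberately simple):

```python
import re
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

# Deliberately simple perturbations; negation flips and number swaps usually
# need query-specific handling and are omitted here.
PERTURBATIONS = {
    "lowercased":     lambda q: q.lower(),
    "extra spaces":   lambda q: q.replace(" ", "   "),
    "spaces removed": lambda q: q.replace(" ", ""),
    "typo":           lambda q: re.sub("e", "3", q, count=1),
}

def stress_test(query: str) -> None:
    """Print how far each perturbation moves the cosine similarity from 1.0."""
    base = model.encode(query)
    for name, perturb in PERTURBATIONS.items():
        variant = perturb(query)
        score = float(util.cos_sim(base, model.encode(variant)))
        print(f"{name:>14}: {score:.3f}  ({variant!r})")

stress_test("laptops without touch screens")
```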
Build guardrails around critical blind spots. Different applications have different can't-fail requirements. For healthcare, it's typically negation and entity precision. For finance, it's numbers and temporal relationships. For legal, it's conditions and obligations. Identify what absolutely can't go wrong in your domain, and implement specialized safeguards.
Layer different techniques instead of betting everything on embeddings. Our most successful systems combine embedding-based retrieval with keyword verification, explicit rule checks, and specialized classifiers for critical distinctions. This redundancy isn't inefficient; it's essential.
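Here's a bare-bones sketch of that layering idea: take the top candidates from your vector search and keep only those that agree with the query on domain-critical terms. The top_k_by_embedding call in the usage comment is a hypothetical stand-in for whatever retrieval you already use:

```python
from typing import List

# Domain-specific "can't-miss" terms; in practice these come from your own
# failure analysis, not this illustrative set.
MUST_MATCH_TERMS = {"without", "not", "except"}

def keyword_verified(query: str, candidates: List[str]) -> List[str]:
    """Keep only embedding hits that agree with the query on critical terms."""
    query_terms = {t for t in query.lower().split() if t in MUST_MATCH_TERMS}
    verified = []
    for doc in candidates:
        doc_terms = {t for t in doc.lower().split() if t in MUST_MATCH_TERMS}
        if query_terms <= doc_terms:  # every critical query term must also appear in the doc
            verified.append(doc)
    return verified

# Usage with a hypothetical vector search step:
# hits = top_k_by_embedding("laptops without touch screens", k=20)
# results = keyword_verified("laptops without touch screens", hits)
```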
Be transparent with users about what the system can and can't do reliably. We added confidence scores that explicitly flag when a result might involve negation, numerical comparison, or other potential weak points. Users appreciate the honesty, and it builds trust in the system overall.
Here's the most important thing I've learnt: these models don't understand language the way humans do – they understand statistical patterns. When I stopped expecting human-like understanding and started treating them as sophisticated pattern-matching tools with specific blind spots, my systems got better. Much better.
The blind spots I've described aren't going away anytime soon – they're baked into how these models work. But if you know they're there, you can design around them. And sometimes, acknowledging a limitation is the first step toward overcoming it.
Note: I have many more such cases found through experiments, and I will cover them in my subsequent post along with code examples.
The next article in this series is coming soon. Stay tuned!