In July 2025, a federal judge in Mississippi issued a civil rights restraining order that misnamed the parties in the case, misquoted state law, and made factually inaccurate statements not supported by the record. The judge had used AI to help prepare the opinion. In October 2025, Gordon Rees Scully Mansukhani, one of the top 100 law firms in the United States with nearly $760 million in gross revenue, submitted a bankruptcy filing riddled with citations to cases that do not exist. A second incident followed in February 2026. The firm's attorneys called themselves "profoundly embarrassed." These are not amateur errors from someone experimenting casually with a chatbot. They are the product of a problem that sits at the center of every AI language model built today - and one that has not been solved, despite years of effort and billions of dollars invested in trying.
The problem is called hallucination. It is the term researchers use when an AI model generates information that sounds authoritative and reads perfectly but is simply not true. Understanding why it happens, what real damage it causes, and what companies are actually doing about it is not just a technical question. In 2026, it is a practical question for anyone who uses these tools.
What Hallucination Actually Is - and Why the Word Matters
The term "hallucination" gets used loosely, so it helps to be precise. AI researchers define it in two distinct ways, and the difference between them matters for understanding both the risk and the fix.
The first type is intrinsic hallucination, sometimes called a faithfulness failure. This is when a model contradicts information it was explicitly given. If you hand an AI a contract and ask it to summarize the key terms, and the summary includes clauses that do not appear in the original document, that is an intrinsic hallucination. The model had the correct source material. It failed to stay faithful to it.
The second type is extrinsic hallucination, sometimes called a factuality failure. This is when the model generates information that cannot be verified against any known source. It invents facts, statistics, citations, or events from scratch. No source material was contradicted because no source material was consulted. The model is not summarizing or paraphrasing anything. It is confabulating - producing fluent, confident text from internal pattern-matching rather than from facts.
MIT researchers published findings in January 2025 that made this phenomenon more unsettling. When AI models hallucinate, they tend to use more confident language than when stating actual facts. The model is not hedging or flagging uncertainty. It is presenting invention with greater authority than truth. This is the quality that makes AI hallucinations so dangerous in professional contexts: the output gives no signal that anything is wrong.
The Numbers: How Often Does This Actually Happen?
The honest answer is that it depends enormously on what you ask the model to do, and the range is wider than most people expect.
On grounded summarization tasks - where a model is given a document and asked to stay faithful to it - the best models in 2025 achieved hallucination rates as low as 0.7%. Google's Gemini 2.0 Flash 001 holds the top position on the Vectara hallucination leaderboard at 0.7% as of April 2025. GPT-4o sits at 1.5%. Four models have now crossed below the 1% threshold on this specific type of task. This is genuine progress, and it is meaningful for use cases like document summarization where you control the source material.
Move outside that controlled context, and the numbers shift dramatically. A 2026 benchmark across 37 models found hallucination rates ranging from 15% to 52% on general factual recall tasks. On open-ended generation tasks - where the model has no grounding document and is answering from training data alone - rates of 40% to 80% have been recorded. The average hallucination rate across all models for general knowledge questions sits around 9.2% as of current benchmarks.
Certain domains produce particularly alarming results. Legal content is where the documented failures are most severe. A Stanford University study found that when language models answered questions about legal precedents, they hallucinated at least 75% of the time, producing fabricated cases with realistic names, detailed but fictional outcomes, and confident legal reasoning. In a subsequent study published in the Journal of Empirical Legal Studies in 2025, researchers tested domain-specific legal AI tools - the kind specifically marketed to law firms as hallucination-resistant - and found that even Lexis+ AI and Westlaw's AI-Assisted Research still hallucinated in 17% to 34% of responses, particularly in mis-citing sources and agreeing with incorrect user premises.
Medical AI research has produced equally concerning figures. A 2025 MedRxiv study using 300 physician-validated clinical cases found hallucination rates of 64.1% on long cases and 67.6% on short cases when AI models were given no mitigation instructions. Open-source models used in medical contexts showed hallucination rates exceeding 80% on some tasks.
Financial damage from these errors is now quantifiable. Global financial losses tied to AI hallucinations reached $67.4 billion in 2024. In enterprise settings, 47% of AI users reported making at least one major business decision based on hallucinated content in 2024. In Q1 2025 alone, 12,842 AI-generated articles were removed from online platforms specifically because of hallucinated content.
The Real Cases: What Hallucination Has Actually Cost People
Behind these statistics are specific, documented incidents that illustrate the real-world stakes.
The Legal System
Legal researcher Damien Charlotin maintains a public database of AI hallucination cases identified in legal proceedings. As of May 2026, the database contains more than 1,398 cases. The shift in who is responsible is striking: in 2023, seven out of ten hallucination cases in legal filings came from self-represented litigants - people without legal training using AI tools they did not fully understand. By May 2025, 13 of the 23 cases identified that month involved practicing lawyers and other legal professionals. The mistakes moved up the professional hierarchy.
In February 2026, two federal judges - one in Mississippi, one in New Jersey - admitted their offices had used AI when preparing opinions, and both orders were found to contain hallucinated material. The Mississippi order misnamed parties, misquoted state law, and made factually unsupported statements. In the New Jersey case, the problem was identified by the parties before the order could take effect, but the episode made national legal news. Over 35 state bar associations have now issued guidance requiring attorneys to verify AI-generated content, and multiple federal courts have mandated disclosure of AI use in all filings.
Healthcare and Mental Health
AI-generated mushroom foraging guides sold on Amazon recommended species that can be toxic. Multiple AI chatbots marketed as mental health support tools have been documented providing advice that medical professionals described as dangerous. In one case reported in June 2025, a therapy chatbot told a user struggling with addiction to take a "small hit of methamphetamine" to get through the week. Other AI-powered chatbots described as offering psychotherapy have been connected to patient suicides. These are not hypothetical risks in a future deployment scenario. They are documented incidents from platforms already in use.
Consumer and Financial Settings
Air Canada's AI chatbot hallucinated a refund policy that did not exist, promising a bereavement discount that the airline had not offered. A Canadian court ruled Air Canada was bound by its chatbot's misrepresentation. Microsoft's AI travel feature listed Ottawa's Food Bank as a tourist attraction. Google's Bard incorrectly claimed during its public launch demonstration that the James Webb Space Telescope had taken the first image of a planet outside our solar system - a factual error that contributed to a drop in Alphabet's share price that erased approximately $100 billion in market value in a single day. A 2026 UC San Diego study found that AI-generated product summaries hallucinated 60% of the time in ways that influenced purchasing decisions.
The Reasoning Model Paradox
One of the more counterintuitive findings in recent AI research is that newer, more powerful reasoning models have sometimes hallucinated more than their predecessors on certain tasks, not less. OpenAI's o3 model hallucinated on 33% of responses in the PersonQA benchmark - double the rate of its predecessor, o1. OpenAI's o4-mini hallucinated on 48% of questions in the same test. DeepSeek's R1 reasoning model was found to hallucinate significantly more than DeepSeek's traditional non-reasoning models on the Vectara leaderboard. These results do not mean the models are worse overall. They reflect a specific trade-off: deeper reasoning chains introduce more opportunities for the model to generate intermediate conclusions that are wrong, and those errors compound. Progress on one metric does not guarantee progress across all of them.
Why Does Hallucination Happen? The Technical Explanation Without the Jargon
Large language models are trained to predict the most statistically likely next word - and, token by token, the next phrase and sentence - based on patterns learned from enormous amounts of text. They are exceptionally good at producing text that reads like the kind of text a competent human would write on any given subject. But being good at predicting what text should look like is not the same thing as knowing what is true.
When a model encounters a question and does not have strong evidence in its training data for the correct answer, it does not stop and say it does not know. It generates the most plausible-sounding response based on patterns. This is not a bug in the conventional sense. It is a consequence of how these models are built. The architecture that makes them fluent and useful is the same architecture that makes them capable of producing confident, fluent fiction.
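A toy example makes that concrete. The candidate answers and probabilities below are invented for illustration, but the mechanism is the real one: standard decoding emits whichever continuation scores highest, and nothing in that step checks whether the winning string is true.

```python
# Toy illustration with invented numbers: a model scores candidate
# continuations and emits the highest-scoring one. Truth never enters
# the calculation - only statistical plausibility does.
candidates = {
    "Smith v. Jones, 582 F.3d 101 (2019)": 0.31,  # plausible-looking, possibly fictional
    "I am not sure":                        0.12,
    "No such case exists":                  0.09,
}

# Standard decoding simply picks the most likely continuation.
answer = max(candidates, key=candidates.get)
print(answer)  # -> the citation-shaped string, whether or not it is real
```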
Hallucination rates also increase measurably with the complexity of a query and the size of the input. Research consistently shows that longer, more complex inputs are associated with higher hallucination rates. The model has more context to manage, more opportunities to drift, and more room to introduce errors that are not immediately obvious.
A specific vulnerability involves citations and references. When asked to cite sources, models frequently fabricate them. Citation fabrication rates in adversarial testing have reached 94% in some benchmarks. The model generates something that looks like a real citation - author, journal, volume, year, page numbers - because it has seen thousands of real citations and knows what they look like. Whether the specific citation it produces points to something that actually exists is a separate question the model is not equipped to answer from training alone.
What Companies Are Actually Doing About It
The response to the hallucination problem has produced several technical approaches, some of which are showing genuine results and some of which have significant limitations that are not always honestly communicated.
Retrieval-Augmented Generation
Retrieval-Augmented Generation, or RAG, is the most widely deployed technique for reducing hallucination in enterprise settings. The principle is straightforward: instead of asking the model to answer from its training data alone, the system first searches a database of verified documents and retrieves relevant passages, then asks the model to generate a response grounded in those specific retrieved documents. The model is no longer working from memory. It has a factual anchor.
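In code, the pattern looks roughly like the sketch below. The document list and the keyword-overlap retriever are toy stand-ins - a production system would use a vector database and a real model API - but the shape is the same: retrieve from a verified source first, then constrain the model to what was retrieved.

```python
# A minimal sketch of the RAG pattern. The documents, the toy retriever, and
# the prompt wording are illustrative assumptions, not any vendor's API.
VERIFIED_DOCS = [
    "Refund policy: bereavement fares may be requested within 90 days of travel.",
    "Baggage policy: two checked bags are included on international flights.",
]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    # Toy retriever: rank documents by word overlap with the question.
    q_words = set(question.lower().split())
    return sorted(VERIFIED_DOCS,
                  key=lambda d: len(q_words & set(d.lower().split())),
                  reverse=True)[:top_k]

def build_grounded_prompt(question: str) -> str:
    passages = retrieve(question)
    return ("Answer using only the numbered sources below. If they do not "
            "contain the answer, say you do not know.\n\n"
            + "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
            + f"\n\nQuestion: {question}")

# The resulting prompt is what gets sent to the model instead of the bare question.
print(build_grounded_prompt("Do you offer bereavement refunds?"))
```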
The results are meaningful. RAG reduces hallucinations by 40% to 71% across many deployment scenarios, according to multiple studies. OpenAI's internal evaluations show hallucination rates drop below 2% in retrieval-grounded tasks, compared to much higher rates on open-ended factual questions. Enterprise deployments using RAG in combination with domain-specific document bases have achieved reliable results in customer service, internal knowledge management, and compliance functions.
RAG is not a complete solution, and the honest version of this technology's limitations acknowledges that. The retrieved documents can themselves be wrong, outdated, or incomplete. The model can still misrepresent what a retrieved document says. And RAG requires a well-maintained, verified knowledge base to pull from - which is a significant organizational investment that many deployments underestimate. Legal AI tools that were specifically marketed as using RAG to prevent hallucination were still found hallucinating in 17% to 34% of tested queries in the 2025 Stanford follow-up study.
Reasoning Models and Self-Verification
Google's 2025 research showed that models with built-in reasoning capabilities - where the model works through a problem step by step before generating an answer - reduce hallucinations by up to 65% in controlled testing. The theory is that forcing a model to show its work surfaces incorrect intermediate steps before they compound into a confident wrong answer.
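In practice this is often implemented as a two-pass prompt: draft an answer step by step, then ask the model to audit its own reasoning before anything is returned. The sketch below shows that pattern; call_llm is a placeholder for whichever chat API a deployment uses, and the prompt wording is illustrative rather than any vendor's actual implementation.

```python
# A sketch of the "draft, then self-check" pattern. `call_llm` is a stub for
# a real chat-completion API; the prompts are illustrative assumptions.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("send the prompt to your model provider here")

def answer_with_self_check(question: str) -> str:
    draft = call_llm(
        "Work through the following question step by step, "
        "then give a final answer.\n\n"
        f"Question: {question}")
    reviewed = call_llm(
        "Review the reasoning below. Flag any step that asserts a fact you "
        "cannot verify, then rewrite the final answer, marking unverifiable "
        "claims as uncertain.\n\n"
        f"{draft}")
    return reviewed
```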
As the o3 and o4-mini PersonQA results show, this does not always hold in practice. The relationship between reasoning depth and hallucination rate is not linear, and domain matters significantly. For mathematical reasoning and structured problem-solving, reasoning models are a genuine improvement. For questions about real people, specific events, or recent developments, the hallucination patterns are more complex.
Grounding Verification and Human-in-the-Loop
The most conservative and reliable approach - and the one that requires the most human involvement - is treating AI output as a first draft that must be verified by a person with access to primary sources before it is used. This is already the standard that courts are enforcing for legal filings and that most medical AI deployment guidelines recommend.
IBM's AI Adoption Index for 2025 found that 76% of enterprises now include human-in-the-loop processes specifically to catch hallucinations before AI output reaches deployment. This number reflects an industry-wide recognition that the technology is useful enough to deploy and unreliable enough to require oversight. Both things are true simultaneously. The companies that have had the worst hallucination incidents are those that deployed AI in high-stakes contexts without building verification processes into the workflow.
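What that oversight looks like in software can be very simple. The sketch below is an illustrative review gate, not any particular vendor's workflow: AI drafts go into a queue, and nothing is released until a named reviewer records the primary sources they checked it against.

```python
# An illustrative human-in-the-loop gate. The dataclass and in-memory queue
# are stand-ins for whatever ticketing or document system a team already uses.
from dataclasses import dataclass, field

@dataclass
class Draft:
    text: str
    sources_checked: list[str] = field(default_factory=list)
    approved_by: str | None = None

review_queue: list[Draft] = []

def submit_ai_draft(text: str) -> Draft:
    draft = Draft(text)
    review_queue.append(draft)
    return draft

def approve(draft: Draft, reviewer: str, sources: list[str]) -> str:
    # Release requires a named human and the primary sources they verified.
    if not sources:
        raise ValueError("cannot approve without citing verified sources")
    draft.approved_by = reviewer
    draft.sources_checked = sources
    return draft.text  # only approved text moves downstream
```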
Prompt Engineering and Structured Constraints
Research published in Nature in 2025 found that well-designed mitigation prompts - instructions that tell a model to express uncertainty, avoid speculation, and cite sources when available - reduce hallucination rates by approximately 22 percentage points compared to unprompted queries. For medical AI specifically, structured prompts reduced hallucinations by 33%. These are meaningful improvements available without any change to the underlying model. The implication is that a significant portion of current hallucination in deployed AI is the result of poor prompt design rather than fundamental model limitations.
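A minimal version of such a mitigation prompt is sketched below. The wording is illustrative rather than the exact prompt used in any of the cited studies; the point is that the instructions are fixed and prepended to every query.

```python
# An illustrative mitigation preamble prepended to every user query.
MITIGATION_PREAMBLE = (
    "Answer only what you can support. If you are not sure, say so rather "
    "than guessing. Do not invent citations; cite a source only if it was "
    "provided in this conversation. Clearly separate established facts from "
    "your own inference."
)

def wrap_query(user_question: str) -> str:
    return f"{MITIGATION_PREAMBLE}\n\nQuestion: {user_question}"
```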
Specialized Fine-Tuning
Domain-specific fine-tuning involves training a model further on a curated dataset from a specific field - medical literature, legal statutes, financial regulations - with the goal of reducing hallucination rates in that domain. The results are promising in controlled settings. The challenge is that fine-tuning on one domain can affect performance in others, and the curated datasets require ongoing expert review to remain accurate. Fine-tuning is expensive, domain-specific, and requires continuous maintenance as the knowledge base in any field evolves.
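For teams that take this route, the mechanics are standard supervised training on a curated corpus. The sketch below uses the Hugging Face transformers library with a placeholder base model, corpus path, and hyperparameters - all illustrative assumptions rather than a recommended recipe.

```python
# A minimal sketch of domain-specific fine-tuning with Hugging Face
# transformers. Model name, corpus path, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "gpt2"  # placeholder; any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Curated, expert-reviewed domain text (statutes, guidelines, filings, etc.).
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = corpus["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-tuned", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```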
The Limits of What "Fixing" Hallucination Actually Means
There is a version of the hallucination story that ends with "and soon the models will stop making things up." That version is misleading.
The architectural reason hallucination exists - that language models predict likely text rather than verify facts - is not something that is straightforwardly removed. RAG, reasoning layers, and human verification all reduce hallucination's impact without eliminating the underlying tendency. FaithBench, one of the most rigorous hallucination evaluation datasets, found in 2025 testing that even the best current hallucination detection models achieve near 50% accuracy when tasked with identifying hallucinated content - meaning the tools designed to catch hallucinations are themselves unreliable roughly half the time.
The court system's response to AI hallucinations is instructive here. The principle that has emerged from multiple judicial rulings is that AI cannot verify AI. Cross-checking AI output with another AI tool does not satisfy verification obligations. Human review against authoritative primary sources remains the only legally defensible standard. This is not a temporary workaround until the models improve. It is a structural requirement that the legal system has reached through documented experience with what happens when it is skipped.
The companies making the most responsible public statements about this are those that acknowledge both what has improved and what has not. The best models have gotten significantly more reliable on specific, controlled tasks. General factual recall, complex legal and medical queries, and anything involving recent events or specific individuals remain genuinely difficult. The most honest measure of progress is not whether hallucination rates are falling in benchmark conditions but whether deployed AI systems in real workflows are producing fewer consequential errors. On that measure, the data is mixed and the honest answer is that it depends heavily on how the AI is being used and who is checking the output.
What This Means for You Right Now
If you are using AI tools for anything where accuracy matters, several practical conclusions follow directly from the evidence.
AI output is most reliable when the model is given the source material and asked to stay faithful to it. Summarization with provided documents, extraction from uploaded files, and grounded question-answering with cited sources all carry significantly lower hallucination risk than open-ended factual queries from training data. When you ask an AI tool about a specific legal case, a specific medical study, a specific product specification, or any other verifiable fact without providing the source - and especially when it confidently gives you an answer - the correct response is to verify that answer independently before acting on it.
Citations generated by AI tools should be treated as leads, not sources. Paste any AI-generated citation into a search engine or academic database before citing it. The realistic base rate for citation fabrication is high enough that treating AI citations as verified is an error.
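One lightweight way to treat a citation as a lead is to run it as a query against a bibliographic index such as Crossref's public API, as in the sketch below. The query string is illustrative; a match is only a starting point for manual checking, and a missing match is not proof of fabrication, but the lookup surfaces invented references quickly.

```python
# A sketch of checking an AI-generated citation against Crossref's public
# works index before trusting it. The example query string is illustrative.
import requests

def crossref_lookup(citation_text: str, rows: int = 3) -> list[dict]:
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation_text, "rows": rows},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [{"title": (i.get("title") or ["(no title)"])[0],
             "doi": i.get("DOI")} for i in items]

# Print candidate matches, then confirm against the actual document.
for hit in crossref_lookup("hallucination rates in large language models survey"):
    print(hit)
```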
The domain matters. General summaries and writing assistance are lower risk than legal research, medical information, or financial data. The higher the stakes of a wrong answer, the more verification the output requires. This is not a criticism of the technology. It is an accurate description of where the technology stands in May 2026.
The companies that acknowledge these limitations publicly and build verification into their products are more trustworthy than those that claim to have solved hallucination. The problem is real, the progress is real, and the gap between them is also real. Using AI effectively means understanding all three.