Scott Graffius, in his January 7, 2025 article, “Are AI Hallucinations Getting Better or Worse?”, analyzed various data sources and concluded that, on comparable benchmarks, “hallucinations are declining year-over-year for non-complex cases” but “remain high in complex reasoning and open-domain factual recall, where rates can exceed 33%.” This well-researched piece helps us understand the current state of AI hallucinations (a critical need-to-know aspect of AI-generated research). It has inspired me to add a few key benchmark resources to the Artificial Intelligence category in the Supply Market Intelligence Index Collection I maintain. These benchmarks join the AI-related indexes already in the collection: the annual AI Index Report from Stanford University’s HAI (insights into AI’s technical progress, economic influence, and societal impact), the Anthropic Economic Index (AI adoption across the world), and IMD’s AI Maturity Index Ranking (evaluates an organization’s AI maturity).
Here are the benchmarks added:
AA-Omniscience – This knowledge and hallucination benchmark from Artificial Analysis (an AI benchmarking and analysis company) measures factual recall and hallucination across six main domains: 1) Business, 2) Health, 3) Humanities & Social Sciences, 4) Law, 5) Science, Engineering & Mathematics, and 6) Software Engineering (SWE). Overall coverage includes 42 relevant topics within these six domains. The benchmark rewards accurate answers and penalizes bad guesses (there is no penalty for refusing to answer) to provide a complete picture of which models produce outputs that are “factually reliable.” Its 6,000 questions are generated from authoritative academic and industry sources to ensure factual accuracy.
Vectara’s Hallucination Leaderboard – Released two years ago, this benchmark tracks and compares the factuality of large language models. A new “more granular” hallucination leaderboard, announced at the end of 2025, uses a dataset that is “larger, more robust, and significantly more challenging.” The new leaderboard expands the number of articles from 1,000 to over 7,700 and uses longer articles to test “factual consistency over extended contexts.” The dataset now includes a mix of “both low and high complexity text” and spans several domains, such as technology, stocks, sports, science, politics, medicine, law, finance, education, and business. Vectara builds AI Agents and Assistants for enterprise applications.
AIMultiple – AIMultiple has benchmarked 37 different LLMs with 60 questions derived from CNN News articles to measure their hallucination rates. The questions asked for precise numerical values (percentages, dates), covered diverse topics such as oil prices and art history, included difficult-to-guess facts and temporal relationships, and required exact retrieval from the provided text; answers were then checked against the actual figures. (Note: CNN has a “Lean Left” media bias rating from AllSides, which might give some readers pause. Keep in mind that this benchmark measures how accurately a model reports what is actually written in CNN news articles; it does not judge the veracity of information based on the authority of its provider.) AIMultiple is an independent AI research and analysis provider.
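To make the AA-Omniscience scoring idea described above more concrete, here is a minimal Python sketch. It assumes a simple rule: each correct answer earns +1, each incorrect answer costs -1, and declining to answer scores 0, with the total scaled to a range of -100 to 100. The function name, the outcome labels, and the scaling are my own illustrative assumptions; the exact formula Artificial Analysis uses may differ.

```python
from collections import Counter

# Hypothetical outcome labels, chosen for this illustration only.
VALID_OUTCOMES = {"correct", "incorrect", "abstain"}


def net_reliability_score(outcomes):
    """Score a list of per-question outcomes on a -100 to 100 scale.

    Assumed rule (not an official formula): +1 for a correct answer,
    -1 for an incorrect answer, 0 for declining to answer.
    """
    counts = Counter(outcomes)
    unknown = set(counts) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"Unknown outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("At least one outcome is required")
    net = counts["correct"] - counts["incorrect"]
    return 100 * net / total


# Model A answers every question and guesses when unsure.
model_a = ["correct"] * 6 + ["incorrect"] * 4
# Model B knows slightly less but declines when unsure.
model_b = ["correct"] * 5 + ["incorrect"] * 1 + ["abstain"] * 4

print(net_reliability_score(model_a))  # 20.0
print(net_reliability_score(model_b))  # 40.0
```

Under this assumed rule, the model that declines to answer when unsure outscores the model that guesses, even though it gets fewer questions right. That is the behavior a benchmark rewards when it penalizes bad guesses but not refusals.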
January 12, 2026

