Google has released new data that questions how reliable today’s AI chatbots really are, showing that even the most advanced models still struggle with factual accuracy.
Using its newly introduced FACTS Benchmark Suite, Google found that leading AI systems fail to surpass a 70% factual accuracy rate. Gemini 3 Pro ranked highest in the test with a 69% overall score. Other major models from OpenAI, Anthropic, and xAI performed worse.
The results suggest that even the top-scoring chatbot gets roughly one in three benchmark responses wrong, and lower-ranked models do worse, despite often delivering responses with high confidence.
Google said the benchmark addresses a gap in existing AI evaluations. Many current tests measure whether a model can complete a task, not whether the information it generates is factually correct.
This distinction is critical for sectors such as finance, healthcare, and law, where inaccurate information can have serious consequences. A response that sounds fluent but contains errors can mislead users who assume the chatbot’s output is reliable.
The FACTS Benchmark Suite was developed by Google’s FACTS team in collaboration with Kaggle. It evaluates factual accuracy across four real-world categories.
The first category tests parametric knowledge, which checks whether a model can answer factual questions using only what it learned during training. The second evaluates search performance, measuring how accurately models retrieve information using web tools. The third focuses on grounding, assessing whether a model stays faithful to a provided document without introducing false details. The fourth examines multimodal understanding, including the ability to correctly interpret charts, diagrams, and images.
The benchmark revealed clear performance gaps. Gemini 3 Pro led with a 69% FACTS score. Gemini 2.5 Pro and OpenAI’s ChatGPT-5 followed at close to 62%. xAI’s Grok 4 reached approximately 54%, while Anthropic’s Claude 4.5 Opus scored around 51%.
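To make the "one in three" reading concrete, the short Python sketch below turns the approximate overall scores quoted above into implied miss rates. It assumes, for illustration only, that a FACTS score can be read as the share of responses judged accurate; the dictionary values are the rounded figures from this article, and the function name `implied_miss_rate` is hypothetical rather than part of any Google or Kaggle tooling.

```python
# Approximate overall FACTS scores as quoted in this article (percent).
# These are rounded figures, not official leaderboard values.
reported_scores = {
    "Gemini 3 Pro": 69,
    "Gemini 2.5 Pro": 62,
    "ChatGPT-5": 62,
    "Grok 4": 54,
    "Claude 4.5 Opus": 51,
}


def implied_miss_rate(score_percent: float) -> float:
    """Return the share of responses not scored as accurate, as a fraction.

    Assumption: the score is treated as a simple accuracy percentage.
    """
    return (100 - score_percent) / 100


# Print models from highest to lowest score with their implied miss rate.
for model, score in sorted(
    reported_scores.items(), key=lambda item: item[1], reverse=True
):
    miss = implied_miss_rate(score)
    print(f"{model}: {score}% accurate, about {miss:.0%} not scored as accurate")
```

Under that reading, the best-scoring model misses on a little under one response in three, and the lowest-scoring model listed here misses on nearly half.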
Multimodal tasks proved to be the weakest area across all models, with accuracy often dropping below 50%. Google noted that errors in this area are particularly risky, as chatbots can confidently misread charts or extract incorrect figures from documents, leading to mistakes that are difficult to detect.
Google said the findings do not mean AI chatbots lack value, but they highlight the risks of relying on them without verification. While AI accuracy continues to improve, the company stressed that human oversight, safeguards, and validation remain necessary before these systems can be treated as dependable sources of factual information.