Google Confirms AI Chatbots Are Still Getting 1 Out of 3 Answers Wrong

Google has released new data that calls into question how reliable today’s AI chatbots really are, showing that even the most advanced models still struggle with factual accuracy.

Using its newly introduced FACTS Benchmark Suite, Google found that leading AI systems fail to surpass a 70% factual accuracy rate. Gemini 3 Pro ranked highest in the test with a 69% overall score. Other major models from OpenAI, Anthropic, and xAI performed worse.

The results suggest that even the top-scoring chatbot still provides incorrect information roughly one out of every three times, and lower-ranked models do so more often, despite often delivering responses with high confidence.

Why the FACTS Benchmark Matters

Google said the benchmark addresses a gap in existing AI evaluations. Many current tests measure whether a model can complete a task, not whether the information it generates is factually correct.

This distinction is critical for sectors such as finance, healthcare, and law, where inaccurate information can have serious consequences. A response that sounds fluent but contains errors can mislead users who assume the chatbot’s output is reliable.

How Google Tested Factual Accuracy

The FACTS Benchmark Suite was developed by Google’s FACTS team in collaboration with Kaggle. It evaluates factual accuracy across four real-world categories.

The first category tests parametric knowledge, which checks whether a model can answer factual questions using only what it learned during training. The second evaluates search performance, measuring how accurately models retrieve information using web tools. The third focuses on grounding, assessing whether a model stays faithful to a provided document without introducing false details. The fourth examines multimodal understanding, including the ability to correctly interpret charts, diagrams, and images.

Breakdown Per Chatbot

The benchmark revealed clear performance gaps. Gemini 3 Pro led with a 69% FACTS score. Gemini 2.5 Pro and OpenAI’s GPT-5 followed at close to 62%. xAI’s Grok 4 reached approximately 54%, while Anthropic’s Claude Opus 4.5 scored around 51%.
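
To make the "one out of three" framing concrete, the short Python sketch below converts the approximate scores quoted above into error rates. It assumes, as the headline does, that every point short of 100% counts as a wrong answer, which simplifies how the benchmark is actually graded. The function names `error_rate` and `one_in_n` are illustrative and not part of any Google or Kaggle tooling.

```python
# Approximate overall FACTS scores (percent) as quoted in this article.
reported_scores = {
    "Gemini 3 Pro": 69,
    "Gemini 2.5 Pro": 62,
    "GPT-5": 62,
    "Grok 4": 54,
    "Claude Opus 4.5": 51,
}


def error_rate(score_percent):
    """Percent of answers not judged accurate (assumption: 100 minus the score)."""
    return 100 - score_percent


def one_in_n(score_percent):
    """Express the error rate as 'one wrong answer in every N answers'."""
    return 100 / error_rate(score_percent)


# Print models from highest to lowest reported score.
for model, score in sorted(reported_scores.items(), key=lambda kv: -kv[1]):
    print(
        f"{model}: {score}% accurate, {error_rate(score)}% wrong, "
        f"about 1 in {one_in_n(score):.1f}"
    )
```

By this rough measure only the top score lands near one wrong answer in three. The lowest quoted score works out to closer to one in two.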