New Study Says Google AI Overviews Tells Millions of Lies Per Hour

A new analysis by The New York Times found that Google’s AI Overviews now answer questions correctly about 90% of the time. While that suggests improvement, the remaining error rate of roughly 10% means a significant number of incorrect answers are still reaching users.

Based on that error rate, the report said that if the results were projected across all Google searches, AI Overviews could be producing tens of millions of incorrect answers each day.
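
The projection is simple arithmetic: multiply the number of searches that show an AI Overview by the error rate. A minimal Python sketch follows. The 90% accuracy figure is the one reported above, while the daily search volume and the share of searches that show an AI Overview are illustrative placeholders, not figures from the report.

```python
def projected_wrong_answers(searches_per_day, overview_share_pct, accuracy_pct):
    """Estimate incorrect AI Overview answers per day.

    Percentages are whole numbers (0-100) so the integer arithmetic stays exact.
    """
    overviews_per_day = searches_per_day * overview_share_pct // 100
    error_pct = 100 - accuracy_pct
    return overviews_per_day * error_pct // 100


# Hypothetical inputs, NOT taken from the report.
ASSUMED_SEARCHES_PER_DAY = 5_000_000_000
ASSUMED_OVERVIEW_SHARE_PCT = 10

# Approximate accuracy reported in the analysis.
REPORTED_ACCURACY_PCT = 90

wrong_per_day = projected_wrong_answers(
    ASSUMED_SEARCHES_PER_DAY, ASSUMED_OVERVIEW_SHARE_PCT, REPORTED_ACCURACY_PCT
)
wrong_per_hour = wrong_per_day // 24

print(f"{wrong_per_day:,} per day, about {wrong_per_hour:,} per hour")
```

Under these placeholder inputs the estimate comes to 50 million incorrect answers per day. The result scales linearly with each input, so a different search volume or share shifts it proportionally.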

According to Ars Technica, the Times worked with startup Oumi to evaluate AI Overviews using SimpleQA, a benchmark designed to measure the factual accuracy of generative AI models.

Accuracy Improved After Gemini Update

SimpleQA, released by OpenAI in 2024, includes more than 4,000 questions with verifiable answers that can be used to test AI systems.

Oumi began running the benchmark last year when Gemini 2.5 was Google’s leading model. At that stage, AI Overviews achieved an accuracy rate of 85%.

After the release of Gemini 3, the benchmark was run again, and AI Overviews answered 91% of the questions correctly.
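
A six-point gain in accuracy is easier to judge as a change in error rate. The Python sketch below converts the two reported accuracy figures into error rates and the relative reduction between them. The question count of 4,000 is a round lower bound based on "more than 4,000" and is assumed here only for illustration.

```python
def error_summary(accuracy_before_pct, accuracy_after_pct, question_count):
    """Compare two benchmark runs in terms of error rate.

    Accuracy values are whole-number percentages (0-100).
    """
    error_before = 100 - accuracy_before_pct
    error_after = 100 - accuracy_after_pct
    return {
        "error_before_pct": error_before,
        "error_after_pct": error_after,
        # Approximate number of wrong answers in each run.
        "wrong_before": question_count * error_before // 100,
        "wrong_after": question_count * error_after // 100,
        # Share of the earlier errors that disappeared in the later run.
        "relative_reduction": (error_before - error_after) / error_before,
    }


# 85 and 91 are the reported accuracy figures; 4000 is an assumed round count.
summary = error_summary(85, 91, 4000)
print(summary)
```

With these inputs the error rate falls from 15% to 9%, a 40% relative reduction, even though roughly one answer in eleven is still wrong.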

Examples Highlighted in the Report

The report included several cases where AI Overviews produced incorrect answers.

In one example, the system was asked when Bob Marley’s former home became a museum. AI Overviews cited three pages, but two of them did not mention the date. The third source, Wikipedia, listed two conflicting years, and the system selected the wrong one.

In another case, the benchmark asked for the date when Yo-Yo Ma was inducted into the Classical Music Hall of Fame. Although AI Overviews cited the organization’s website showing his induction, it still stated that no such institution exists.

Google Disputes the Findings

Google challenged the study’s conclusions. Company spokesperson Ned Adriance said the analysis had serious flaws and did not reflect the kinds of searches people actually make on Google.

Google said it prefers to use a benchmark called SimpleQA Verified, which relies on a smaller set of questions that have been more thoroughly reviewed.