- The Tow Center for Digital Journalism assessed how accurate generative artificial intelligence search is when fed excerpts from news stories.
- The bad news is that these tools get it wrong 60 percent of the time, even with access to content from news publishers.
- Grok and Gemini were the most prone to giving incorrect responses.
The tech of the moment is still generative artificial intelligence. Market insights from Statista estimate that by the end of this year the AI market will be worth a total of $243.72 billion, rising to $826.73 billion within the next five years.
To achieve that, developers of AI platforms are going to have to convince a lot more people that AI should be a part of their daily lives. That is already in progress, with AI being crammed into every platform it can be crammed into. Google Search now prominently (and annoyingly) features AI-generated results front and centre, Meta’s AI is constantly waiting for you to tap its ombré icon, and Copilot is now a marquee feature of Windows 11.
The trouble is that the accuracy of these AI platforms is constantly in question, and it doesn’t look like that problem is getting any better. The Tow Center for Digital Journalism has conducted tests on eight generative search tools with live search features to assess their ability to accurately retrieve and cite news content, and the results aren’t great. They’re not even good.
The center selected ten articles from each of 20 publishers, pulled excerpts from those articles, and asked eight AI search tools to identify each excerpt’s corresponding headline, original publisher, publication date, and URL.
The eight platforms were Perplexity, Perplexity Pro, ChatGPT Search, DeepSeek Search, Copilot, Grok 2 Search, Grok 3 Search and Gemini.
“We deliberately chose excerpts that, if pasted into a traditional Google search, returned the original source within the first three results. We ran sixteen hundred queries (twenty publishers times ten articles times eight chatbots) in total. We manually evaluated the chatbot responses based on three attributes: the retrieval of (1) the correct article, (2) the correct publisher, and (3) the correct URL,” the researchers wrote.
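To make the scale of the test concrete, here is a trivial sketch of that cross product. The publisher and article identifiers are placeholders, not the study’s actual dataset:

```python
from itertools import product

# Placeholder identifiers; the study used 20 real publishers,
# 10 excerpted articles per publisher and the 8 chatbots listed above.
publishers = [f"publisher_{i}" for i in range(20)]
articles = range(10)
chatbots = ["Perplexity", "Perplexity Pro", "ChatGPT Search", "DeepSeek Search",
            "Copilot", "Grok 2 Search", "Grok 3 Search", "Gemini"]

queries = list(product(publishers, articles, chatbots))
print(len(queries))  # 20 * 10 * 8 = 1600 queries
```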
Each response was then assigned one of the following labels (a rough code sketch of the rubric follows the list):
- Correct: All three attributes were correct.
- Correct but Incomplete: Some attributes were correct, but the answer was missing information.
- Partially Incorrect: Some attributes were correct while others were incorrect.
- Completely Incorrect: All three attributes were incorrect and/or missing.
- Not Provided: No information was provided.
- Crawler Blocked: The publisher disallows the chatbot’s crawler in its robots.txt.
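
The study was scored manually, so the following is only an illustration of the rules above, not the Tow Center’s actual tooling; all names and signatures are assumptions:

```python
# Hypothetical encoding of the Tow Center's six labels; the real
# evaluation was done by hand, so this function is illustrative only.
CORRECT, INCORRECT, MISSING = "correct", "incorrect", "missing"

def label_response(article, publisher, url, crawler_blocked=False):
    """Label a chatbot response given the state of its three attributes."""
    if crawler_blocked:
        return "Crawler Blocked"            # bot disallowed in robots.txt
    attrs = [article, publisher, url]
    if all(a == MISSING for a in attrs):
        return "Not Provided"               # no information at all
    if all(a == CORRECT for a in attrs):
        return "Correct"
    if not any(a == CORRECT for a in attrs):
        return "Completely Incorrect"       # everything wrong and/or missing
    if INCORRECT in attrs:
        return "Partially Incorrect"        # some right, some wrong
    return "Correct but Incomplete"         # some right, the rest merely missing

print(label_response(CORRECT, CORRECT, MISSING))    # Correct but Incomplete
print(label_response(CORRECT, INCORRECT, MISSING))  # Partially Incorrect
```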
Collectively, the AI search tools failed to provide correct information for 60 percent of queries. Individually, things look a bit different, with Perplexity giving the most correct responses. Gemini is simply awful, generating more completely incorrect responses than accurate ones, and the same can be said for Grok.

“Premium models, such as Perplexity Pro ($20/month) or Grok 3 ($40/month), might be assumed to be more trustworthy than their free counterparts, given their higher cost and purported computational advantages. However, our tests showed that while both answered more prompts correctly than their corresponding free equivalents, they paradoxically also demonstrated higher error rates. This contradiction stems primarily from their tendency to provide definitive, but wrong, answers rather than declining to answer the question directly,” write the researchers.
“The fundamental concern extends beyond the chatbots’ factual errors to their authoritative conversational tone, which can make it difficult for users to distinguish between accurate and inaccurate information. This unearned confidence presents users with a potentially dangerous illusion of reliability and accuracy.”
One of the problems these AI tools keep running into is publishers’ reluctance to share content with their models. Publishers and creators are going to great lengths to prevent AI firms from scraping their websites for content used to train those models. While some developers have struck deals with news publishers, there is no guarantee that the information being scraped will be surfaced to users correctly. Blocking these crawlers also means the AI has less information to reference and, as such, may be more prone to hallucinations.
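For context, that blocking happens in a site’s robots.txt file: a publisher lists a crawler’s user-agent token and disallows it. A minimal illustrative example using real crawler tokens (though nothing technically obliges a bot to honour them):

```text
# Illustrative robots.txt entries blocking known AI crawlers.
# GPTBot (OpenAI), PerplexityBot and Google-Extended are real tokens;
# compliance is voluntary on the crawler's part.

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```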
Perhaps more concerning is how AI search is being pushed to replace non-AI search. Users are fed information that could very well be wrong, but it’s believed because trusted sources may be cited, even if they are cited incorrectly.
This has been a longstanding issue with AI: when it can’t answer a question, it makes something up so that the user keeps trusting it instead of getting frustrated with a useless response.
As Tow’s researchers have highlighted, even when AI has access to the correct information, it still gets things wrong a concerning amount of the time. It may be easy to brush this off as the technology still being in its early stages, but if that is the case, should it really be marketed as a solution to our problems?
To put this into perspective, Google’s AI Overviews were telling folks to put glue on pizza in May of last year, and according to Tow’s study, Gemini is still getting it wrong most of the time. The question then becomes: what exactly is Google doing with the billions it is ploughing into this market?
While the Tow Center for Digital Journalism recognises that its research has limitations, it does highlight that it may be better to do your own research without an AI chatbot feeding you answers.