Sustainability professionals increasingly turn to commercial Large Language Models (LLMs, or more colloquially, Artificial Intelligence) such as ChatGPT, Copilot and Grok to sift through data, surface insights and trends, and help develop solutions. However, the results of a new technical study of LLM performance are worrisome. Let’s start with the headline of TechSpot’s article about the study: “AI search engines fail accuracy test, study finds 60% error rate. Bump that up to 96 percent if it’s Grok-3.”
“The Tow Center for Digital Journalism recently studied eight AI search engines, including ChatGPT Search, Perplexity, Perplexity Pro, Gemini, DeepSeek Search, Grok-2 Search, Grok-3 Search, and Copilot. They tested each for accuracy and recorded how frequently the tools refused to answer…
Even when admitting it was wrong, ChatGPT would follow up that admission with more fabricated information. The LLM is seemingly programmed to answer every user input at all costs. The researcher’s data confirms this hypothesis, noting that ChatGPT Search was the only AI tool that answered all 200 article queries. However, it only achieved a 28-percent completely accurate rating and was completely inaccurate 57 percent of the time.
ChatGPT isn’t even the worst of the bunch. Both versions of X’s Grok AI performed poorly, with Grok-3 Search being 94 percent inaccurate. Microsoft’s Copilot was not that much better when you consider that it declined to answer 104 queries out of 200. Of the remaining 96, only 16 were ‘completely correct,’ 14 were ‘partially correct,’ and 66 were ‘completely incorrect,’ making it roughly 70 percent inaccurate.”
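The “roughly 70 percent” figure for Copilot follows directly from the counts quoted above. A quick sketch of that arithmetic (all figures are from the TechSpot article; nothing here is new data):

```python
# Copilot figures as quoted: 200 queries, 104 declined,
# and the 96 answered split into correct / partial / incorrect.
answered = 200 - 104          # 96 queries actually answered
completely_correct = 16
partially_correct = 14
completely_incorrect = 66

# The three buckets should account for every answered query.
assert completely_correct + partially_correct + completely_incorrect == answered

inaccuracy_rate = completely_incorrect / answered
print(f"{inaccuracy_rate:.1%}")  # 68.8% -- i.e., "roughly 70 percent inaccurate"
```

Note that this rate is computed only over the queries Copilot actually answered; counting the 104 refusals differently would change the denominator.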
Tow’s study found that:
- “Chatbots were generally bad at declining to answer questions they couldn’t answer accurately, offering incorrect or speculative answers instead.
- Premium chatbots provided more confidently incorrect answers than their free counterparts.
- Multiple chatbots seemed to bypass Robot Exclusion Protocol preferences.
- Generative search tools fabricated links and cited syndicated and copied versions of articles.
- Content licensing deals with news sources provided no guarantee of accurate citation in chatbot responses.”
This is consistent with what I heard from a panel on AI at last year’s Texas Environmental Superconference (see this blog).
Enthusiastic supporters of AI/LLMs point out that these systems are continually improving and that today’s accuracy and performance will soon be surpassed. Fair enough, but that isn’t much help to those using LLMs today in the expectation (or hope) that the results are credible and reliable. For now, you need to independently verify the results/output from commercial LLMs.
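One practical first step in that independent checking, given the study’s finding that these tools fabricate links, is to triage any citations an LLM returns before trusting the claims they support. The sketch below is purely illustrative: the `TRUSTED_DOMAINS` list and the helper name are hypothetical, not part of the study or any LLM product.

```python
from urllib.parse import urlparse

# Hypothetical allowlist for illustration -- substitute the sources
# your own organization actually relies on.
TRUSTED_DOMAINS = {"epa.gov", "sec.gov", "reuters.com"}

def needs_manual_check(cited_url: str) -> bool:
    """Return True if a cited URL's domain is not on the trusted list."""
    domain = urlparse(cited_url).netloc.lower()
    # Strip a leading "www." so "www.epa.gov" matches "epa.gov".
    if domain.startswith("www."):
        domain = domain[4:]
    return domain not in TRUSTED_DOMAINS

citations = [
    "https://www.epa.gov/ghgreporting",
    "https://example-madeup-news.com/story",  # a fabricated-looking link
]
for url in citations:
    print(url, "-> review" if needs_manual_check(url) else "-> trusted")
```

A domain check like this only flags unfamiliar sources; it cannot confirm that a link exists or that the page actually says what the LLM claims, so flagged items still need human review.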
Our members can learn more about AI and ESG here.
If you aren’t already subscribed to our complimentary ESG blog, sign up here for daily updates delivered right to you.