
  • ChatGPT-5 scores a low 1.4% on the Hallucination Leaderboard
  • This puts it ahead of GPT-4, which scores 1.8%, and GPT-4o, which scores 1.49%
  • Grok 4 is much higher at 4.8%, while Gemini-2.5 Pro sits at 2.6%

When OpenAI launched ChatGPT-5 on Thursday last week, one of the big selling points that CEO Sam Altman emphasized was that ChatGPT-5 was the most “powerful, smart, fastest, reliable and robust version of ChatGPT that we’ve ever shipped”, and in the presentation, OpenAI staff also emphasized that ChatGPT-5 would “mitigate hallucinations”.

When an AI makes something up it’s called a hallucination, and while hallucination rates are dropping across all LLMs, they’re still surprisingly common, and they remain one of the main reasons we can’t trust AI to perform a task without human supervision.

Vectara, the RAG-as-a-Service and AI agent platform that operates the industry’s top hallucination leaderboard for foundation and reasoning models, has put OpenAI’s claims to the test and found that ChatGPT-5 does indeed rank lower for hallucinations than GPT-4, but it is only slightly lower than GPT-4o (just 0.09% lower, in fact).

According to Vectara, ChatGPT-5 has a grounded hallucination rate of 1.4%, compared to 1.8% for GPT-4, 1.69% for GPT-4 Turbo and GPT-4o mini, and 1.49% for GPT-4o.

Spicy Grok

Interestingly, the ChatGPT-5 hallucination rate came out slightly higher than that of the ChatGPT-4.5 Preview model, which scored 1.2%, and a lot higher than OpenAI’s o3-mini High Reasoning model, the best-performing GPT model, with a grounded hallucination rate of 0.795%.

The results of the Vectara tests can be viewed on the Hughes Hallucination Evaluation Model (HHEM) Leaderboard hosted on Hugging Face, which states that, “For an LLM, its hallucination rate is defined as the ratio of summaries that hallucinate to the total number of summaries it generates”.
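That definition is simple enough to sketch in a few lines of Python. This is purely illustrative (it is not Vectara’s code, and the function name and inputs are invented for the example): given a list of yes/no judgments on whether each generated summary hallucinated, the rate is just hallucinated summaries divided by total summaries.

```python
def hallucination_rate(judgments):
    """judgments: list of booleans, True if that summary was judged to hallucinate.

    Returns the fraction of summaries that hallucinate, per the HHEM
    leaderboard's definition: hallucinated summaries / total summaries.
    """
    if not judgments:
        raise ValueError("need at least one judged summary")
    return sum(judgments) / len(judgments)

# For example, 7 hallucinated summaries out of 500 gives a 1.4% rate,
# the same figure ChatGPT-5 scored on the leaderboard.
rate = hallucination_rate([True] * 7 + [False] * 493)
print(f"{rate:.1%}")  # 1.4%
```

The small gaps between models (0.09% between ChatGPT-5 and GPT-4o) show why these scores are measured over large numbers of summaries, where a handful of extra hallucinations moves the rate only fractionally.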

ChatGPT-5 still hallucinates a lot less than its competition, though, with Gemini-2.5 Pro coming in at 2.6% and Grok 4 much higher at 4.8%.

xAI, the maker of Grok, recently received a lot of criticism for the new “Spicy” mode in Grok Imagine, an AI video generator that seems happy to create deepfake topless videos of celebrities like Taylor Swift, even when nudity hasn’t been requested, despite the system supposedly including filters and moderation to prevent nudity or anything sexual.

A close up shot of Taylor Swift on the 2024 Grammys red carpet

Grok Imagine is accused of deliberately creating sexually explicit deepfakes of Taylor Swift. (Image credit: Neilson Barnard/Getty Images)

‘I lost my best friend’

OpenAI faced an almost immediate backlash when it removed GPT-4, and all its variations like GPT-4o and GPT-4o mini, from its Plus accounts with the introduction of ChatGPT-5. Many users were incensed that OpenAI gave no warning that the older models were being removed, with some Reddit users saying they had “lost their only friend overnight”.

It now seems that ChatGPT-5 has also replaced one of the most reliable versions of ChatGPT from a hallucination perspective, GPT-4.5.

Sam Altman quickly posted on X, “We for sure underestimated how much some of the things that people like in GPT-4o matter to them, even if GPT-5 performs better in most ways”, and promised to bring back GPT-4o for Plus users for a limited time, saying, “we will watch usage as we think about how long to offer legacy models for”.
