
  • ChatGPT-5 scores a low 1.4% on the Hallucination Leaderboard
  • This puts it ahead of GPT-4, which scores 1.8%, and GPT-4o, which scores 1.49%
  • Grok 4 is much higher at 4.8%, while Gemini-2.5 Pro sits at 2.6%

When OpenAI launched ChatGPT-5 on Thursday last week, one of the big selling points that CEO Sam Altman emphasized was that ChatGPT-5 was the most “powerful, smart, fastest, reliable and robust version of ChatGPT that we’ve ever shipped”, and in the presentation, OpenAI staff also emphasized that ChatGPT-5 would “mitigate hallucinations”.

When an AI makes something up it’s called a hallucination, and while hallucination rates are dropping across all LLMs, they’re still surprisingly common, and they remain one of the main reasons we can’t trust AI to perform a task without human supervision.

Vectara, the RAG-as-a-Service and AI agent platform that operates the industry’s top hallucination leaderboard for foundation and reasoning models, has put OpenAI’s claims to the test and found that ChatGPT-5 does indeed rank lower for hallucinations than GPT-4, but it is only slightly lower than GPT-4o (just 0.09% lower, in fact).

According to Vectara, ChatGPT-5 has a grounded hallucination rate of 1.4%, compared to 1.8% for GPT-4, 1.69% for GPT-4 Turbo and GPT-4o mini, and 1.49% for GPT-4o.

Spicy Grok

Interestingly, the ChatGPT-5 hallucination rate came out slightly higher than that of the ChatGPT-4.5 Preview model, which scored 1.2%, and a lot higher than OpenAI’s o3-mini High Reasoning model, the best-performing GPT model, with a grounded hallucination rate of 0.795%.

The results of the Vectara tests can be viewed on the Hughes Hallucination Evaluation Model (HHEM) Leaderboard hosted on Hugging Face, which states that, “For an LLM, its hallucination rate is defined as the ratio of summaries that hallucinate to the total number of summaries it generates”.
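That definition is simple enough to sketch in a few lines of Python. This is purely illustrative (it is not Vectara’s code, and the function name and inputs are invented for the example): given a list of yes/no judgments on whether each generated summary hallucinated, the rate is just hallucinated summaries divided by total summaries.

```python
def hallucination_rate(judgments):
    """judgments: list of booleans, True if that summary was judged to hallucinate.

    Returns the fraction of summaries that hallucinate, per the HHEM
    leaderboard's definition: hallucinated summaries / total summaries.
    """
    if not judgments:
        raise ValueError("need at least one judged summary")
    return sum(judgments) / len(judgments)

# For example, 7 hallucinated summaries out of 500 gives a 1.4% rate,
# the same figure ChatGPT-5 scored on the leaderboard.
rate = hallucination_rate([True] * 7 + [False] * 493)
print(f"{rate:.1%}")  # 1.4%
```

The small gaps between models (0.09% between ChatGPT-5 and GPT-4o) show why these scores are measured over large numbers of summaries, where a handful of extra hallucinations moves the rate only fractionally.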

ChatGPT-5 still hallucinates a lot less than its competition, though, with Gemini-2.5 Pro coming in at 2.6% and Grok 4 much higher at 4.8%.

xAI, the maker of Grok, recently received a lot of criticism for the new “Spicy” mode in Grok Imagine, an AI video generator that seems happy to create deepfake topless videos of celebrities like Taylor Swift, even when nudity hasn’t been requested, despite the system supposedly including filters and moderation to prevent nudity or anything sexual.

A close up shot of Taylor Swift on the 2024 Grammys red carpet

Grok Imagine is accused of deliberately creating sexually explicit deepfakes of Taylor Swift. (Image credit: Neilson Barnard/Getty Images)

‘I lost my best friend’

OpenAI faced an almost immediate backlash when it removed GPT-4, and all its variations like GPT-4o and GPT-4o mini, from its Plus accounts with the introduction of ChatGPT-5. Many users were incensed that OpenAI gave no warning that the older models were being removed, with some Reddit users saying they had “lost their only friend overnight”.

It now seems that ChatGPT-5 has also replaced one of the most reliable versions of ChatGPT from a hallucination perspective, GPT-4.5.

Sam Altman quickly posted on X, “We for sure underestimated how much some of the things that people like in GPT-4o matter to them, even if GPT-5 performs better in most ways”, and promised to bring back GPT-4o for Plus users for a limited time, saying, “we will watch usage as we think about how long to offer legacy models for”.
