Google DeepMind tackles LLM hallucinations with new benchmark

Google DeepMind researchers have developed a new benchmark called FACTS Grounding to evaluate and improve the factual accuracy of large language models’ responses.

The core development: FACTS Grounding is designed to assess how well language models can generate accurate responses based on long-form documents, while ensuring the answers are sufficiently detailed and relevant.

  • The benchmark includes 1,719 examples split between public and private datasets
  • Each example contains a system prompt, a specific task or question, and a context document
  • Models must process documents up to 32,000 tokens in length and provide comprehensive responses that are fully supported by the source material
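As an illustration of the example structure described above, here is a minimal Python sketch. The class name `GroundingExample`, its field names, and the whitespace-based token estimate are all assumptions made for this example; they are not the benchmark's actual schema or tokenizer.

```python
from dataclasses import dataclass

# Document length ceiling reported in the article.
MAX_CONTEXT_TOKENS = 32_000


@dataclass(frozen=True)
class GroundingExample:
    """Hypothetical container for one benchmark example."""
    system_prompt: str      # instructions telling the model to rely on the document
    user_request: str       # the specific task or question
    context_document: str   # the long-form source the answer must be grounded in


def approx_token_count(text: str) -> int:
    # Crude estimate: whitespace-separated words. A real tokenizer
    # would usually produce a different (often higher) count.
    return len(text.split())


def fits_context_limit(example: GroundingExample,
                       limit: int = MAX_CONTEXT_TOKENS) -> bool:
    """Return True if the context document is within the token ceiling."""
    return approx_token_count(example.context_document) <= limit


if __name__ == "__main__":
    example = GroundingExample(
        system_prompt="Answer using only the provided document.",
        user_request="Summarize the refund policy.",
        context_document="Refunds are issued within 30 days of purchase.",
    )
    print(fits_context_limit(example))
```

Keeping the three parts as separate fields makes it straightforward to show a judge the same context document that the answering model saw.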

Current performance metrics: Gemini 2.0 Flash leads the FACTS leaderboard with an 83.6% factuality score, leaving clear room for improvement even for the top-ranked model.

Evaluation methodology: The benchmark employs a two-phase judgment system to ensure thorough assessment of model responses.

  • Responses must first pass an eligibility check by satisfying the original user request
  • Qualified responses are then evaluated for factual accuracy and proper grounding in source documents
  • Three different LLMs (Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet) serve as judges to reduce bias and ensure accuracy
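The two-phase judgment described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's published scoring code: the `JudgeVerdict` type, the rule that a response failing the eligibility check scores zero for that judge, and the simple averaging across judges are all choices made for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical record of one judge model's decision on one response."""
    judge: str       # e.g. "Gemini 1.5 Pro", "GPT-4o", "Claude 3.5 Sonnet"
    eligible: bool   # phase 1: does the response satisfy the user request?
    grounded: bool   # phase 2: is the response fully supported by the document?


def judge_score(verdict: JudgeVerdict) -> float:
    # Assumption: a response that fails phase 1 earns no credit,
    # regardless of how well grounded it is.
    return 1.0 if (verdict.eligible and verdict.grounded) else 0.0


def response_score(verdicts: list[JudgeVerdict]) -> float:
    """Average the per-judge scores for a single response (assumed rule)."""
    if not verdicts:
        raise ValueError("at least one judge verdict is required")
    return mean(judge_score(v) for v in verdicts)


def benchmark_score(all_verdicts: list[list[JudgeVerdict]]) -> float:
    """Average the response scores over every example in the set."""
    if not all_verdicts:
        raise ValueError("at least one response is required")
    return mean(response_score(v) for v in all_verdicts)


if __name__ == "__main__":
    verdicts = [
        JudgeVerdict("Gemini 1.5 Pro", eligible=True, grounded=True),
        JudgeVerdict("GPT-4o", eligible=True, grounded=False),
        JudgeVerdict("Claude 3.5 Sonnet", eligible=True, grounded=True),
    ]
    print(round(response_score(verdicts), 3))
```

Averaging across judges means one model's disagreement lowers a response's score without vetoing it outright, which is one way a multi-judge panel can dampen the bias of any single judge.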

Technical implementation: The benchmark addresses fundamental challenges in LLM development and evaluation.

  • Traditional pre-training methods focus on predicting next tokens rather than optimizing for factual accuracy
  • The dataset covers diverse topics including finance, technology, retail, medicine, and law
  • Researchers noted a self-preference bias of 3.23%, meaning models tend to favor responses from their own model family

Looking forward: While FACTS Grounding represents an important step in improving LLM accuracy, researchers acknowledge the rapid pace of AI advancement may quickly necessitate updates to the benchmark.

  • The team emphasizes that factuality and proper grounding are essential for LLM utility
  • The benchmark will need to evolve alongside continued progress in AI development
  • This initial release is positioned as a starting point rather than a definitive solution

Critical considerations: The use of LLMs to evaluate other LLMs raises questions about the reliability of the evaluation process, despite efforts to minimize bias through multiple judges. This approach, while practical, may need supplementation with human evaluation or other verification methods as the technology continues to mature.

Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations
