Artificial intelligence for the science of evidence synthesis: how good are AI-powered tools for automatic literature screening? – BMC Medical Research Methodology

Background Literature screening constitutes a critical component in evidence synthesis; however, it typically requires substantial time and human resources. Artificial intelligence (AI) has shown promise in this field, yet the accuracy and effectiveness of AI tools for literature screening remain uncertain. This study aims to evaluate the performance of several existing AI-powered automated tools for literature screening. Methods This diagnostic accuracy study employed a cohort to evaluate the performance of five AI tools—ChatGPT 4.0, Claude 3.5, Gemini 1.5, DeepSeek-V3, and RobotSearch—in literature screening. We selected a random sample of 1,000 publications from a well-established literature cohort, with 500 as randomized controlled trials (RCTs) group and 500 as others group. Diagnostic accuracy was measured using several metrics, including the false negative fraction (FNF), time used for screening, false positive fraction (FPF), and the redundancy number needed to screen. Results We reported the FNF for the RCTs group and the FPF for the others group. In the RCTs group, RobotSearch exhibited the lowest FNF at 6.4% (95% CI: 4.6% to 8.9%), whereas Gemini exhibited the highest at 13.0% (95% CI: 10.3% to 16.3%). In the others group, the FPF of the four large language models ranged from 2.8% (95% CI: 1.7% to 4.7%) to 3.8% (95% CI: 2.4% to 5.9%), both of which were significantly lower than RobotSearch’s rate of 22.2% (95% CI: 18.8% to 26.1%). In terms of screening efficiency, the mean time used for screening per article was 1.3 s for ChatGPT, 6.0 s for Claude, 1.2 s for Gemini, and 2.6 s for DeepSeek. Conclusions The AI tools assessed in this study demonstrated commendable performance in literature screening; however, they are not yet suitable as standalone solutions. These tools can serve as effective auxiliary aids, and a hybrid approach that integrates human expertise with AI may enhance both the efficiency and accuracy of the literature screening process. Graphical Abstract

Artificial intelligence for the science of evidence synthesis: how good are AI-powered tools for automatic literature screening? – BMC Medical Research Methodology

Recent Stories

Autonomous cars, drones cheerfully obey prompt injection by road sign

NVIDIA is still planning to make a ‘huge’ investment in OpenAI, CEO says

AI Agent Engineer at CollectWise