Background: Literature screening is a critical component of evidence synthesis, but it typically requires substantial time and human resources. Artificial intelligence (AI) has shown promise in this field, yet the accuracy and effectiveness of AI tools for literature screening remain uncertain. This study aims to evaluate the performance of several existing AI-powered automated tools for literature screening.

Methods: This diagnostic accuracy study used a literature cohort to evaluate the performance of five AI tools (ChatGPT 4.0, Claude 3.5, Gemini 1.5, DeepSeek-V3, and RobotSearch) in literature screening. We selected a random sample of 1,000 publications from a well-established literature cohort: 500 in the randomized controlled trials (RCTs) group and 500 in the others group. Diagnostic accuracy was measured using several metrics, including the false negative fraction (FNF), the false positive fraction (FPF), the time used for screening, and the redundancy number needed to screen.

Results: We reported the FNF for the RCTs group and the FPF for the others group. In the RCTs group, RobotSearch exhibited the lowest FNF at 6.4% (95% CI: 4.6% to 8.9%), whereas Gemini exhibited the highest at 13.0% (95% CI: 10.3% to 16.3%). In the others group, the FPF of the four large language models ranged from 2.8% (95% CI: 1.7% to 4.7%) to 3.8% (95% CI: 2.4% to 5.9%), and both ends of that range were significantly lower than RobotSearch's rate of 22.2% (95% CI: 18.8% to 26.1%). In terms of screening efficiency, the mean time used for screening per article was 1.3 s for ChatGPT, 6.0 s for Claude, 1.2 s for Gemini, and 2.6 s for DeepSeek.

Conclusions: The AI tools assessed in this study performed well in literature screening, but they are not yet suitable as standalone solutions. These tools can serve as effective auxiliary aids, and a hybrid approach that integrates human expertise with AI may enhance both the efficiency and accuracy of the literature screening process.
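To make the FNF and FPF figures concrete, the sketch below shows one way such a proportion and its 95% confidence interval could be computed. It is an illustration under stated assumptions, not the authors' method: the source does not say which interval method was used, so the Wilson score interval is assumed here, and the count of 32 missed RCTs is back-calculated from the reported 6.4% of 500. The function names `error_fraction` and `wilson_interval` are hypothetical.

```python
import math


def error_fraction(errors: int, total: int) -> float:
    """Fraction of a group that was misclassified.

    For the RCTs group this is the false negative fraction (FNF);
    for the others group it is the false positive fraction (FPF).
    """
    return errors / total


def wilson_interval(errors: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: the source does not name its interval method; Wilson is
    used here only as a common choice for proportions near 0 or 1.
    z = 1.96 corresponds to a two-sided 95% interval.
    """
    p = errors / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Assumed count: 6.4% of 500 RCTs is 32 missed records (back-calculated).
missed_rcts, total_rcts = 32, 500
fnf = error_fraction(missed_rcts, total_rcts)
low, high = wilson_interval(missed_rcts, total_rcts)
print(f"FNF = {fnf:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

For this input the Wilson interval comes out close to the bounds reported for RobotSearch. Other interval methods can differ in the last digit, so agreement here should not be read as confirmation of the method the study used.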