Nazmus Khan, BSc(Hons), MBChB1, Sama Anvari, MD1, Abdulrahman Albassam, MD1, Dorota Borovsky, MD, MSc, BSc2, Gurjit Mander, MD1, Yung Lee, MD1, Ciaran Galts, MD1
1McMaster University, Hamilton, ON, Canada; 2Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada

Introduction: Artificial intelligence is increasingly being integrated into clinical practice across medical specialties, including gastroenterology. Despite this growing adoption, questions remain about the accuracy and reliability of large language models (LLMs) when addressing complex clinical scenarios in subspecialty fields such as inflammatory bowel disease (IBD). This study aimed to evaluate the performance of four prominent LLMs — OpenAI's ChatGPT-4, Anthropic's Claude Sonnet 3.7, Google's Gemini 2.5 Flash, and OpenEvidence — in responding to IBD-related clinical questions. We compared their accuracy and reference utilization across both multiple-choice question (MCQ) and open-ended formats.

Methods: Forty-six IBD questions based on current guidelines and clinical knowledge were input into all four LLMs between May 15 and May 22, 2025. Each LLM was assessed in both MCQ and open-ended formats. Responses were graded for correctness, with "unable to answer" classified as incorrect. Reference usage was also recorded. Chi-square tests were used to compare accuracy and citation behavior across models.

Results: OpenEvidence achieved the highest accuracy in the MCQ format (87.0%), followed by ChatGPT-4 and Claude 3.7 (84.8% each), and Gemini (82.6%) (P < .001). In the open-ended format, Gemini had the highest accuracy (69.6%), followed by OpenEvidence (60.9%), Claude 3.7 (56.5%), and ChatGPT-4 (54.3%) (P < .001). Reference citation varied markedly: OpenEvidence provided references in 100% of cases, while the other LLMs cited sources inconsistently or not at all (P < .0001). All LLMs attempted to answer 100% of the questions but varied significantly in accuracy and citation transparency.

Discussion: LLMs demonstrate wide variability in their ability to answer IBD-related clinical questions. The MCQ format improved performance across all models. OpenEvidence consistently outperformed the other models in both accuracy and citation use, likely owing to its training on medical-specific content and its built-in ability to reference sources. These features may enhance the reliability and clinical usefulness of LLMs in gastroenterology. Further studies are needed to optimize their deployment in clinical and educational settings.
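For readers wishing to reproduce the style of analysis described in Methods, the following is a minimal sketch (not the authors' analysis code) of a chi-square comparison of correct/incorrect counts across four models using scipy.stats.chi2_contingency; the model names are taken from the abstract, but the counts are hypothetical placeholders for illustration only, not the study data.

```python
# Minimal sketch of a chi-square comparison of answer accuracy across four LLMs.
# NOTE: counts below are hypothetical placeholders, not the study's actual data.
from scipy.stats import chi2_contingency

models = ["OpenEvidence", "ChatGPT-4", "Claude 3.7", "Gemini 2.5 Flash"]
total_questions = 46
correct = [40, 39, 39, 38]  # hypothetical numbers of correct answers per model
incorrect = [total_questions - c for c in correct]

# 4x2 contingency table: rows = models, columns = (correct, incorrect)
table = [[c, i] for c, i in zip(correct, incorrect)]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, P = {p:.4f}")
```

The same contingency-table approach could be applied to citation behavior (reference provided vs. not provided) per model, as described in Methods.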
Disclosures: Nazmus Khan indicated no relevant financial relationships. Sama Anvari indicated no relevant financial relationships. Abdulrahman Albassam indicated no relevant financial relationships. Dorota Borovsky indicated no relevant financial relationships. Gurjit Mander indicated no relevant financial relationships. Yung Lee indicated no relevant financial relationships. Ciaran Galts indicated no relevant financial relationships.
Nazmus Khan, BSc(Hons), MBChB1, Sama Anvari, MD1, Abdulrahman Albassam, MD1, Dorota Borovsky, MD, MSc, BSc2, Gurjit Mander, MD1, Yung Lee, MD1, Ciaran Galts, MD1. P1028 - Artificial Intelligence in IBD: Comparing the Clinical Utility of Four Leading Large Language Models, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.