Kevin P. Shah, MD1, Sanjay Prasad, MD2, Shravya Pothula, MD3, Manuel Garza, MD, MS4, Sri Komanduri, MD, MS1

1Northwestern Medicine, Chicago, IL; 2Baylor Scott & White Medical Center, Round Rock, TX; 3University of Colorado Anschutz Medical Campus, Aurora, CO; 4Baylor Scott & White, Georgetown, TX

Introduction: Advancements in artificial intelligence (AI) have transformed the landscape of healthcare through innovative technologies that augment clinical decision making. Tools such as large language models (LLMs) have the potential to improve medical education, diagnosis, and treatment in gastroenterology (GI). This study compares the effectiveness of three LLMs, Chat Generative Pre-Trained Transformer (ChatGPT v4o), Google Gemini 2.5 Pro, and OpenEvidence, in answering board-style GI questions.

Methods: A total of 120 questions from the Digestive Diseases Self-Education Program (DDSEP) question bank were randomly selected across topic areas including esophagus, stomach and duodenum, pancreas, biliary tract, liver, small intestine, colon, and GI cancers. Questions containing tables or radiologic/endoscopic images were excluded. Each LLM was instructed to choose the best answer and to rate its confidence on a scale of 1 to 5 (1 = least confident, 5 = most confident). For each incorrect answer, the LLM was given a second opportunity to choose the next best answer. Quantitative and qualitative data were collected. Accuracy (a binary outcome) was analyzed using Cochran's Q test, and confidence scores were analyzed using the Friedman test (a code sketch of both tests appears below).

Results: Each LLM answered 120 questions spanning esophagus, stomach and duodenum, pancreas, biliary tract, liver, small intestine, colon, and GI cancers. Figure 1 shows the accuracy of ChatGPT, Gemini, and OpenEvidence across these topic areas. The three LLMs achieved similar overall first-attempt accuracy (ChatGPT 84.2%, Gemini 86.7%, OpenEvidence 85.0%), and Cochran's Q test showed no statistically significant difference among them (Q=0.5, p=0.78). All LLMs at times struggled with synthesis and clinical reasoning, particularly when choosing among multiple plausible answers. OpenEvidence in particular was reluctant to commit to an answer when it did not judge any single option to be clearly best.

Discussion: This study highlights the potential of LLMs as educational tools and clinical resources for physicians. In contrast to prior published abstracts, current LLMs have converged to relatively similar performance in clinical reasoning on board-style questions. As LLMs evolve, their ability to interpret tables, radiologic images, and endoscopic findings needs to be further explored.
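The sketch below illustrates the statistical comparison described in Methods: Cochran's Q test on a per-question binary accuracy matrix (questions x models) and the Friedman test on 1-to-5 confidence ratings. The data are simulated placeholders and the cochrans_q helper is written out from the standard formula; this is a minimal illustration of the analysis, not the study's actual code or data.

```python
# Minimal sketch of the abstract's statistical analysis.
# Assumptions: 120 questions, 3 models, simulated accuracy/confidence data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Binary first-attempt accuracy: rows = 120 questions, columns = the three LLMs
# (e.g., ChatGPT, Gemini, OpenEvidence); 1 = correct, 0 = incorrect.
correct = rng.binomial(1, [0.84, 0.87, 0.85], size=(120, 3))

def cochrans_q(x):
    """Cochran's Q test for k related binary samples (questions x models)."""
    x = np.asarray(x)
    k = x.shape[1]                  # number of models
    col = x.sum(axis=0)             # correct answers per model
    row = x.sum(axis=1)             # models correct per question
    n = x.sum()                     # total correct answers
    q = (k - 1) * (k * np.sum(col ** 2) - n ** 2) / (k * n - np.sum(row ** 2))
    p = stats.chi2.sf(q, df=k - 1)  # chi-square reference, k-1 degrees of freedom
    return q, p

q_stat, q_p = cochrans_q(correct)
print(f"Cochran's Q = {q_stat:.2f}, p = {q_p:.2f}")

# Confidence ratings (1-5) per question for each model; the Friedman test
# compares the three related samples without assuming normality.
confidence = rng.integers(3, 6, size=(120, 3))
f_stat, f_p = stats.friedmanchisquare(confidence[:, 0],
                                      confidence[:, 1],
                                      confidence[:, 2])
print(f"Friedman chi-square = {f_stat:.2f}, p = {f_p:.2f}")
```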
Figure: Figure 1. Accuracy of Large Language Models in First Attempt of GI Board Review Questions
Disclosures: Kevin Shah indicated no relevant financial relationships. Sanjay Prasad indicated no relevant financial relationships. Shravya Pothula indicated no relevant financial relationships. Manuel Garza indicated no relevant financial relationships. Sri Komanduri indicated no relevant financial relationships.
Kevin P. Shah, MD1, Sanjay Prasad, MD2, Shravya Pothula, MD3, Manuel Garza, MD, MS4, Sri Komanduri, MD, MS1. P6207 - Artificial Intelligence Showdown in GI: A Battle of Three Large Language Models (LLMs) in Board Style Review Questions, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.