Priyanka Singh, MD, Lauren Grinspan, MD
Icahn School of Medicine at Mount Sinai, New York, NY

Introduction: Large language models such as ChatGPT can augment medical education and clinical decision-making. Several ChatGPT models are available, some with stronger capabilities for advanced reasoning. To date, ChatGPT's ability to answer board-style questions in the Gastroenterology (GI) and Hepatology categories has not been evaluated. This study assesses the accuracy of two ChatGPT models, 4o and o3, in answering board review questions from the American College of Physicians' Medical Knowledge Self-Assessment Program (MKSAP), focusing specifically on GI and hepatology.

Methods: All 155 questions in the GI and Hepatology section of ACP MKSAP were selected for this study. Each question, with its associated imaging and laboratory data, was copied into ChatGPT 4o, the model designed for most tasks, and then into ChatGPT o3, OpenAI's reasoning model designed for complex tasks requiring deep reasoning. The option selected by each ChatGPT model was recorded and compared with the answer provided by MKSAP. Questions were divided into the following categories: general gastroenterology, inflammatory bowel disease (IBD), hepatobiliary, GI cancer, and esophageal. Chi-square analyses were performed.

Results: ChatGPT o3, the advanced deep reasoning model, significantly outperformed ChatGPT 4o, answering 96% of questions correctly compared with 85% (p=0.009). By category, ChatGPT o3 answered 100% of IBD, 96% of hepatobiliary, 91% of GI cancer, 86% of esophageal, and 97% of general GI questions correctly. ChatGPT 4o answered 100% of IBD, 89% of hepatobiliary, 82% of GI cancer, 79% of esophageal, and 81% of general GI questions correctly. The difference between the two models was significant for general GI questions (p=0.005).

Discussion: Overall, ChatGPT had a high but not perfect accuracy rate in answering GI and hepatology internal medicine board review questions, and ChatGPT o3 significantly outperformed ChatGPT 4o. This may be attributable to the o3 model's reinforcement learning approach, which enhances its capabilities in open-ended situations, particularly those involving visuals and multi-step workflows. More specifically, the o3 model had higher accuracy than the 4o model for general GI questions, but not for the other GI subcategories. Future directions for this study include comparing ChatGPT's accuracy with that of internal medicine residents, internal medicine attendings, and GI attendings.
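The Methods describe manually copying each question into the ChatGPT interface. Purely as an illustration of the two-model comparison, the same querying could be scripted against the OpenAI Python SDK as sketched below; the model identifiers, prompt wording, and helper function are assumptions and not the authors' workflow, and questions with images would additionally require multimodal input, which this sketch omits.

```python
# Illustrative sketch only: the study used the ChatGPT interface, not the API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(model: str, question: str) -> str:
    """Submit one board-style question and return the model's answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Answer this board review question with the single "
                           "best option letter.\n\n" + question,
            },
        ],
    )
    return response.choices[0].message.content


# Record both models' picks for later scoring against the MKSAP answer key.
# answer_4o = ask("gpt-4o", question_text)
# answer_o3 = ask("o3", question_text)
```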
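As a minimal sketch of the reported chi-square comparison of overall accuracy, the 2x2 counts below are reconstructed from the rounded percentages (96% vs 85% of 155 questions), so the exact tallies, and therefore the exact p-value, may differ slightly from those reported in the Results.

```python
# Hypothetical 2x2 contingency table reconstructed from rounded percentages.
from scipy.stats import chi2_contingency

table = [
    [149, 155 - 149],  # ChatGPT o3: correct, incorrect (~96%)
    [132, 155 - 132],  # ChatGPT 4o: correct, incorrect (~85%)
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```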
Disclosures: Priyanka Singh indicated no relevant financial relationships. Lauren Grinspan indicated no relevant financial relationships.
Priyanka Singh, MD, Lauren Grinspan, MD. P6184 - Evaluating the Accuracy of ChatGPT on MKSAP Gastroenterology and Hepatology Board Review Questions, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.