Chakib Battioui, PhD1, Pavel Brodskiy, PhD2, Klaus Gottlieb, MD, PhD, JD1, Mohammad Haft-Javaherian, PhD2, William J. Eastman, 1, Julian Lehrer, 2, Evan Yu, PhD2, Derek Onken, PhD1, Darren Thomason, MBA2, Walter Reinisch, MD, PhD3, Daniel Colucci, 4, Shrujal Baxi, MD4
1Eli Lilly and Company, Indianapolis, IN; 2Iterative Health Inc, Cambridge, MA; 3Medical University of Vienna, Department of Internal Medicine III, Division of Gastroenterology and Hepatology, Spitalgasse, Wien, Austria; 4Iterative Health Inc, New York, NY
Introduction: Regulatory guidance recommends the endoscopy subscore as the index to assess the endoscopic component of the primary endpoint in ulcerative colitis (UC) trials. Inter-reader variability in assessments may impact the reliability of trial results. Currently, there is no metric in place to assess the certainty with which a reader assigns an endoscopy subscore. Machine learning (ML) provides an opportunity to assess the endoscopy subscore and to measure the certainty of that assessment in a standardized manner. Artificial Intelligence Assessment of Endoscopic Severity (AI-ES) accurately assesses the endoscopy subscore. The objective of this study is to evaluate the calibration of AI-ES (how well its predicted probabilities reflect true likelihoods) in order to assess the reliability of its measurement of certainty in endoscopy subscore assessments in UC trials.
Methods: AI-ES is a deep learning algorithm that assesses the endoscopy subscore in UC endoscopic videos. AI-ES outputs a probability for each of the four ordinal endoscopy subscore classes. The endoscopy subscore with the highest probability is assigned as the final score by AI-ES. We assessed calibration on a holdout test set of 639 videos (~25%) from the Phase 3 induction trial for mirikizumab in UC (NCT03518086). Videos were randomly selected from weeks 0 and 12, each had a 2+1 centrally read endoscopy subscore, and their distribution of endoscopic severity was similar to that of the overall study population.
Calibration plots were generated for each endoscopy subscore class, with predicted probabilities grouped into septiles (~100 videos per group) for the primary analysis and into deciles for confirmation. Brier scores, ranging from 0 (perfect calibration) to 1 (worst calibration), were calculated, with values < 0.25 considered informative.
Results: AI-ES demonstrated strong calibration, with Brier scores below 0.25 for each endoscopy subscore (0: 0.037, 1: 0.082, 2: 0.162, 3: 0.112). The Brier score for evaluation of endoscopic improvement (0,1 vs 2,3) also showed excellent calibration (0.066). Findings were consistent when assessing probabilities by deciles.
Discussion: Whereas data on the certainty of human readers in endoscopy subscore assessments are elusive, AI-ES is calibrated across all endoscopy subscore classes, providing reliable data on score probabilities. Adding this measurement of certainty to the score assessment may enable novel AI-based multi-reader or consensus workflows in trials, potentially improving the reliability of UC endpoint assessments.
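For readers unfamiliar with the metric, the following is a minimal sketch of how per-class and dichotomized Brier scores of the kind reported above could be computed. It is not the study's code. It assumes a one-vs-rest formulation for each subscore class and assumes that the probability of the 0,1 group is the sum of the class 0 and class 1 probabilities; the abstract does not state either detail. All function names and the toy data are hypothetical.

```python
import numpy as np

def brier_score(prob, outcome):
    """Mean squared difference between a predicted probability and a 0/1 outcome."""
    prob = np.asarray(prob, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    return float(np.mean((prob - outcome) ** 2))

def per_class_brier(probs, labels):
    """One-vs-rest Brier score per class (assumed formulation).

    probs:  array of shape (n_videos, n_classes), rows summing to 1
    labels: array of shape (n_videos,), reference subscores 0..n_classes-1
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    return {k: brier_score(probs[:, k], labels == k)
            for k in range(probs.shape[1])}

def grouped_brier(probs, labels):
    """Brier score for the binary grouping 0,1 vs 2,3.

    Assumption: P(subscore 0 or 1) is taken as P(0) + P(1).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    p_low = probs[:, 0] + probs[:, 1]
    return brier_score(p_low, labels <= 1)

# Toy data, made up for illustration only (not study data).
toy_probs = np.array([
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.6, 0.2, 0.1],
    [0.0, 0.1, 0.5, 0.4],
    [0.0, 0.0, 0.2, 0.8],
])
toy_labels = np.array([0, 1, 3, 3])

toy_per_class = per_class_brier(toy_probs, toy_labels)
toy_grouped = grouped_brier(toy_probs, toy_labels)
```

Under this formulation a constant prediction of 0.5 scores exactly 0.25 on any binary outcome, which is one common reading of the 0.25 threshold.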
Figure 1. Calibration plots assessing the reliability of the model's predicted probabilities for endoscopy subscores 0 or 1 (A) and 2 or 3 (B). Data are based on septiles of predicted probabilities.
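The points in a calibration plot of this kind can be sketched as follows: sort videos by predicted probability, split them into equal-sized groups (septiles by default), and compare the mean predicted probability in each group with the observed frequency of the outcome. This is an illustrative sketch under assumed conventions, not the authors' implementation; the function name, the equal-count splitting rule, and the simulated data are all hypothetical.

```python
import numpy as np

def quantile_calibration(prob, outcome, n_bins=7):
    """Return per-bin mean predicted probability, observed frequency, and count.

    Bins are equal-count groups of videos ordered by predicted probability
    (n_bins=7 gives septiles, n_bins=10 gives deciles).
    Assumes len(prob) >= n_bins so that no bin is empty.
    """
    prob = np.asarray(prob, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    order = np.argsort(prob, kind="stable")   # indices from lowest to highest probability
    groups = np.array_split(order, n_bins)    # near-equal group sizes
    mean_pred = np.array([prob[g].mean() for g in groups])
    obs_freq = np.array([outcome[g].mean() for g in groups])
    counts = np.array([len(g) for g in groups])
    return mean_pred, obs_freq, counts

# Simulated, perfectly calibrated predictions (illustration only, not study data).
rng = np.random.default_rng(0)
sim_prob = rng.uniform(size=639)
sim_outcome = rng.uniform(size=639) < sim_prob

mean_pred, obs_freq, counts = quantile_calibration(sim_prob, sim_outcome, n_bins=7)
```

For a well-calibrated model the pairs (`mean_pred`, `obs_freq`) lie close to the diagonal; rerunning with `n_bins=10` corresponds to the decile-based confirmation described in the Methods.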
Disclosures: Chakib Battioui: Eli Lilly – Employee. Pavel Brodskiy: Iterative Health Inc – Employee. Klaus Gottlieb: Eli Lilly – Employee. Mohammad Haft-Javaherian: Iterative Health Inc – Employee. William Eastman: Eli Lilly and Company – Employee, Stock Options. Julian Lehrer: Iterative Health Inc – Employee. Evan Yu: Iterative Health Inc – Employee. Derek Onken: Eli Lilly – Employee. Darren Thomason: Iterative Health – Employee. Walter Reinisch: AbbVie – Advisory Committee/Board Member, Consultant, Grant/Research Support. Actelion – Advisory Committee/Board Member, Consultant. Alpha Wasserman – Advisory Committee/Board Member, Consultant. AstraZeneca – Advisory Committee/Board Member, Consultant. Cellerix – Advisory Committee/Board Member, Consultant. Cosmo Pharmaceuticals – Advisory Committee/Board Member, Consultant. Ferring Pharmaceuticals – Advisory Committee/Board Member, Consultant. Genentech – Advisory Committee/Board Member, Consultant. Grunenthal – Advisory Committee/Board Member, Consultant. Johnson & Johnson – Advisory Committee/Board Member, Consultant. Merck – Advisory Committee/Board Member, Consultant. Millennium – Advisory Committee/Board Member, Consultant. Novo Nordisk – Advisory Committee/Board Member, Consultant. Nycomed – Advisory Committee/Board Member, Consultant. Pfizer – Advisory Committee/Board Member, Consultant. Pharmacosmos – Advisory Committee/Board Member, Consultant. Salix Pharmaceuticals – Advisory Committee/Board Member, Consultant. Schering-Plough – Advisory Committee/Board Member, Consultant. Takeda – Advisory Committee/Board Member, Consultant. UCB Pharma – Advisory Committee/Board Member, Consultant. Vifor Pharma – Advisory Committee/Board Member, Consultant. Daniel Colucci: Iterative Health Inc – Employee. Shrujal Baxi: Iterative Health Inc – Employee.
Chakib Battioui, PhD1, Pavel Brodskiy, PhD2, Klaus Gottlieb, MD, PhD, JD1, Mohammad Haft-Javaherian, PhD2, William J. Eastman, 1, Julian Lehrer, 2, Evan Yu, PhD2, Derek Onken, PhD1, Darren Thomason, MBA2, Walter Reinisch, MD, PhD3, Daniel Colucci, 4, Shrujal Baxi, MD4. P3295 - Validating Calibration of an Artificial Intelligence Assessment of Endoscopic Severity in Ulcerative Colitis, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.