Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sarath Chandra Ponnada5, Sai Lakshmi Prasanna Komati, MBBS6, C. David Mintz, MD, PhD7

1Nassau University Medical Center, East Meadow, NY; 2Virginia Commonwealth University, Richmond, VA; 3Florida State University, Cape Coral, FL; 4Florida International University, FL; 5Great Eastern Medical School and Hospital, Srikakulam, Andhra Pradesh, India; 6Government Medical College, Ongole, Andhra Pradesh, India; 7Johns Hopkins University School of Medicine, Baltimore, MD

Introduction: Accurate, real-time classification of gastrointestinal (GI) findings during endoscopy remains a clinical challenge due to variability in lesion appearance, motion artifacts, and class imbalance. Artificial intelligence offers promising support, but most existing models rely on single-modality inputs. We aimed to develop a multimodal AI framework that integrates spatial, structural, and temporal data to improve diagnostic performance using the HyperKvasir dataset.

Methods: We developed a multimodal deep learning pipeline to classify 23 gastrointestinal findings using the HyperKvasir dataset, which includes 10,662 labeled images, 1,000 segmentation masks, and 373 annotated videos (>1 million frames). Vision Transformers were used to extract 768-dimensional features from each modality. Video frames were sampled at 1 frame per second, and segmentation masks were aligned with labeled images. Features from all available modalities were concatenated per sample without enforcing label intersection, preserving data diversity. A multilayer perceptron (MLP) classifier was trained on the fused embeddings and evaluated using standard metrics.

Results: The model achieved 91.86% accuracy and an F1-score of 0.92, with AUC values exceeding 0.99 for most classes. The confusion matrix demonstrated strong performance across both common and rare findings, and t-SNE visualization confirmed clear inter-class separability. The framework effectively integrates spatial, structural, and temporal features, supporting accurate, scalable classification in diverse clinical settings.

Discussion: This multimodal AI framework demonstrates high diagnostic accuracy by leveraging the diverse data types available in endoscopic imaging. Its scalable, label-efficient design supports real-world application in clinical decision support, with potential for integration into endoscopy suites for real-time classification and quality assurance.
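The sketch below illustrates the kind of fusion pipeline outlined in Methods; it is not the authors' implementation. The ViT checkpoint (google/vit-base-patch16-224-in21k), the mean-pooling of per-frame video embeddings, the zero-filling of missing modalities, and all hyperparameters are assumptions consistent with, but not specified by, the abstract.

```python
# Minimal sketch of ViT feature extraction, modality fusion, and an MLP head.
# Assumptions (not from the abstract): Hugging Face ViT-Base backbone (768-dim
# features), PyTorch, mean-pooled video embeddings, zero-filled missing modalities.

import torch
import torch.nn as nn
from transformers import ViTImageProcessor, ViTModel

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
EMB_DIM = 768  # ViT-Base hidden size, matching the 768-dim features in Methods

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").to(DEVICE).eval()

@torch.no_grad()
def embed_images(pil_images):
    """Return one 768-dim [CLS] embedding per image (masks can be rendered as images)."""
    inputs = processor(images=pil_images, return_tensors="pt").to(DEVICE)
    return backbone(**inputs).last_hidden_state[:, 0, :]  # (N, 768)

@torch.no_grad()
def embed_video(frames_at_1fps):
    """Mean-pool per-frame embeddings (frames sampled at 1 fps) into one 768-dim vector."""
    return embed_images(frames_at_1fps).mean(dim=0)  # (768,)

def fuse(image_emb, mask_emb=None, video_emb=None):
    """Concatenate the modalities available for one sample; zero-fill the rest
    so every fused vector has a fixed 3 x 768 length (no label intersection needed)."""
    zeros = torch.zeros(EMB_DIM, device=image_emb.device)
    return torch.cat([image_emb,
                      mask_emb if mask_emb is not None else zeros,
                      video_emb if video_emb is not None else zeros], dim=-1)

class FusionMLP(nn.Module):
    """MLP classifier over fused embeddings for the 23 HyperKvasir findings."""
    def __init__(self, in_dim=3 * EMB_DIM, n_classes=23, hidden=512, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = FusionMLP().to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()  # class weighting could be added to offset imbalance
```

Zero-filling absent modalities is one simple way to keep a fixed-length fused vector while still using every labeled sample, mirroring the abstract's choice to concatenate features without enforcing label intersection across images, masks, and videos.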
Disclosures: Sri Harsha Boppana indicated no relevant financial relationships. Manaswitha Thota indicated no relevant financial relationships. Gautam Maddineni indicated no relevant financial relationships. Sachin Sravan Kumar Komati indicated no relevant financial relationships. Sarath Chandra Ponnada indicated no relevant financial relationships. Sai Lakshmi Prasanna Komati indicated no relevant financial relationships. C. David Mintz indicated no relevant financial relationships.
Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sarath Chandra Ponnada5, Sai Lakshmi Prasanna Komati, MBBS6, C. David Mintz, MD, PhD7. P5129 - Multimodal Deep Learning for Automated Gastrointestinal Diagnosis Using the HyperKvasir Dataset, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.