Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sarath Chandra Ponnada5, Sai Lakshmi Prasanna Komati, MBBS6, C. David Mintz, MD, PhD7
1Nassau University Medical Center, East Meadow, NY; 2Virginia Commonwealth University, Richmond, VA; 3Florida State University, Cape Coral, FL; 4Florida International University, Florida, FL; 5Great Eastern Medical School and Hospital, Srikakulam, Srikakulam, Andhra Pradesh, India; 6Government Medical College, Ongole, Ongole, Andhra Pradesh, India; 7Johns Hopkins University School of Medicine, Baltimore, MD
Introduction: Accurate classification of gastrointestinal (GI) conditions via endoscopic imaging remains a diagnostic challenge, particularly in cases with subtle or overlapping features. Advances in artificial intelligence, particularly transfer learning with vision transformers (ViTs), offer promising solutions for automated, high-fidelity image interpretation across diverse GI pathologies.
Methods: We developed a multimodal pipeline to classify GI conditions using the HyperKvasir dataset, which includes 10,662 labeled images (23 classes), 1,000 segmented polyp images with pixel-level masks, and 373 video recordings (over one million frames). All images were preprocessed and normalized. Visual features were extracted using a ViT pretrained on ImageNet. For segmented images, features were derived from both the original images and their masks. For videos, frame-wise features were aggregated into a single vector per recording. The resulting feature vectors were passed to a custom multi-layer perceptron (MLP) classifier for multiclass prediction. The model was trained using the Adam optimizer with cross-entropy loss and dropout regularization. Performance was evaluated on a test set of 2,133 samples using accuracy, macro F1-score, Top-K accuracy, and class-wise area under the ROC curve (AUC).
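The frame aggregation and two of the evaluation metrics named in the Methods can be illustrated with a minimal, standard-library Python sketch. The abstract does not state how frame-wise features were aggregated or how the metrics were computed, so mean pooling and the function names `mean_pool`, `top_k_accuracy`, and `macro_f1` are assumptions made for illustration only, and the toy inputs are not study data.

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector.

    Assumption: the abstract only says frame-wise features were
    "aggregated"; mean pooling is one common choice, not confirmed.
    """
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[i] for frame in frame_features) / n for i in range(dim)]


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, true_label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += true_label in ranked[:k]
    return hits / len(labels)


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    f1_scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Toy example with 3 classes (hypothetical values, not study data).
toy_scores = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.20, 0.50, 0.30],
    [0.40, 0.35, 0.25],
]
toy_labels = [0, 1, 2, 0]
toy_preds = [max(range(3), key=lambda c: row[c]) for row in toy_scores]

video_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])
top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
f1 = macro_f1(toy_labels, toy_preds, num_classes=3)
```

Because macro F1 averages per-class scores without weighting by class frequency, a macro F1 close to overall accuracy, as reported here, is consistent with performance not being driven only by the most common classes.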
Results: The model achieved 88.61% accuracy and an 88.34% macro F1-score, showing strong performance across both common and rare classes. Top-3 and Top-5 accuracies reached 97.19% and 98.59%, respectively. Most classes had ROC-AUC >0.94, with AUC = 1.00 for conditions such as esophagitis, Barrett’s esophagus, and ulcerative colitis. A confusion matrix and multiclass ROC plots demonstrated consistent diagnostic performance and highlighted the model’s ability to distinguish subtle visual features across GI pathologies. The system’s reliability across varied modalities supports its potential integration into clinical endoscopic workflows.
Discussion: This ViT-based classification system demonstrates high accuracy and robustness across static, segmented, and video-based GI imaging. Its performance and generalizability suggest strong potential for clinical application as a decision-support tool in endoscopic diagnostics.
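As a rough arithmetic check, the reported percentages can be related to the stated test-set size of 2,133 samples. The abstract reports only percentages; the counts computed below are implied values obtained by rounding, not figures given by the authors, and `implied_count` is a hypothetical helper.

```python
TEST_SET_SIZE = 2133  # test-set size stated in the Methods

# Percentages as reported in the Results.
reported_pct = {"accuracy": 88.61, "top-3": 97.19, "top-5": 98.59}


def implied_count(pct, n):
    """Nearest whole number of correct samples for a given percentage."""
    return round(pct / 100 * n)


implied = {}
for name, pct in reported_pct.items():
    count = implied_count(pct, TEST_SET_SIZE)
    # Recompute the percentage from the whole-number count.
    implied[name] = (count, round(100 * count / TEST_SET_SIZE, 2))
    print(f"{name}: ~{count} of {TEST_SET_SIZE} -> {implied[name][1]}%")
```

Each recomputed percentage matches the reported value to two decimal places, so the three accuracy figures are mutually consistent with a single test set of 2,133 samples.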
Disclosures: Sri Harsha Boppana indicated no relevant financial relationships. Manaswitha Thota indicated no relevant financial relationships. Gautam Maddineni indicated no relevant financial relationships. Sachin Sravan Kumar Komati indicated no relevant financial relationships. Sarath Chandra Ponnada indicated no relevant financial relationships. Sai Lakshmi Prasanna Komati indicated no relevant financial relationships. C. David Mintz indicated no relevant financial relationships.
Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sarath Chandra Ponnada5, Sai Lakshmi Prasanna Komati, MBBS6, C. David Mintz, MD, PhD7. P5127 - Multimodal Classification of Gastrointestinal Conditions Using Vision Transformer-Based Deep Feature Extraction, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.