Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sarath Chandra Ponnada5, Sai Lakshmi Prasanna Komati, MBBS6, C. David Mintz, MD, PhD7

1Nassau University Medical Center, East Meadow, NY; 2Virginia Commonwealth University, Richmond, VA; 3Florida State University, Cape Coral, FL; 4Florida International University, Florida, FL; 5Great Eastern Medical School and Hospital, Srikakulam, Andhra Pradesh, India; 6Government Medical College, Ongole, Andhra Pradesh, India; 7Johns Hopkins University School of Medicine, Baltimore, MD

Introduction: Clinicians performing gastrointestinal endoscopy face considerable challenges when evaluating a wide spectrum of mucosal findings and subtle lesions in real time. Inter-observer variability and high procedure volumes can delay accurate diagnosis and affect patient outcomes. Automated image classification using advanced deep-learning architectures offers a promising route to standardize lesion detection and streamline endoscopic workflow.

Methods: We processed 374 anonymized clinical endoscopy recordings and extracted 15 representative frames per video by applying K-means clustering to feature embeddings generated by a pretrained Vision Transformer. Expert endoscopists assigned each of the resulting 5,296 frames to one of 29 gastrointestinal diagnostic categories, covering pathological findings (for example, polyps and ulcers), anatomical landmarks, and mucosal quality assessments. We fine-tuned the ViT backbone for 10 epochs using class-balanced cross-entropy loss and the Adam optimizer (learning rate 3 × 10⁻⁵, batch size 32). Data were split into 80% for training, 10% for validation, and 10% for testing, with early stopping triggered by lack of improvement in validation loss to prevent overfitting.

Results: Fine-tuning achieved 90.0% overall accuracy on the held-out test set. Top-3 and Top-5 accuracies reached 98.2% and 99.0%, respectively.
The weighted F1-score of 0.89 reflected balanced performance across all diagnostic categories. Receiver operating characteristic analysis yielded area under the curve values above 0.98 for 25 of the 29 classes, demonstrating strong discrimination. The confusion matrix indicated minimal misclassification, largely confined to visually similar lesions and adjacent anatomical landmarks.

Discussion: Fine-tuned Vision Transformer models achieved high overall accuracy (90.0%), exceptional top-5 performance (99.0%), and strong discrimination across 29 diagnostic categories. Embedding this approach into endoscopic systems could support rapid, reliable lesion identification, reduce diagnostic variability, and enhance clinical efficiency. Further validation in prospective clinical trials will clarify its impact on patient care.
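The two key preprocessing steps described in Methods — selecting representative frames by K-means over ViT embeddings, and class-balancing the cross-entropy loss — can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the plain-NumPy K-means, and the inverse-frequency weighting are assumptions; the study used embeddings from a pretrained Vision Transformer and fine-tuned with Adam (learning rate 3 × 10⁻⁵, batch size 32).

```python
# Illustrative sketch (assumed, not the authors' implementation).
# Frame embeddings would come from a pretrained ViT; here they are
# treated as a plain (n_frames, dim) array.
import numpy as np


def select_representative_frames(embeddings, k=15, n_iter=50, seed=0):
    """Cluster frame embeddings with Lloyd's K-means and return the
    index of the frame nearest each centroid (one per cluster).
    Clusters that merge or empty out yield fewer than k indices."""
    rng = np.random.default_rng(seed)
    emb = np.asarray(embeddings, dtype=float)
    n = emb.shape[0]
    k = min(k, n)
    # Initialize centroids from k distinct frames.
    centroids = emb[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Pairwise distances, shape (n, k); assign each frame to nearest centroid.
        d = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = emb[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    d = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=2)
    # Representative frame = the frame closest to each centroid.
    return sorted({int(d[:, j].argmin()) for j in range(k)})


def class_balanced_weights(labels):
    """Inverse-frequency per-class weights (one plausible reading of
    'class-balanced cross-entropy'): weight_c = N / (K * count_c)."""
    counts = np.bincount(np.asarray(labels))
    return counts.sum() / (len(counts) * np.maximum(counts, 1))
```

In a full pipeline, the returned frame indices would select the 15 stills kept per video, and the weight vector would be passed to the cross-entropy loss so rare diagnostic categories contribute proportionally during fine-tuning.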
Disclosures: Sri Harsha Boppana indicated no relevant financial relationships. Manaswitha Thota indicated no relevant financial relationships. Gautam Maddineni indicated no relevant financial relationships. Sachin Sravan Kumar Komati indicated no relevant financial relationships. Sarath Chandra Ponnada indicated no relevant financial relationships. Sai Lakshmi Prasanna Komati indicated no relevant financial relationships. C. David Mintz indicated no relevant financial relationships.
Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sarath Chandra Ponnada5, Sai Lakshmi Prasanna Komati, MBBS6, C. David Mintz, MD, PhD7. P5124 - Automated Classification of Gastrointestinal Endoscopic Findings Using a Fine-Tuned Vision Transformer, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.