Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sai Lakshmi Prasanna Komati, MBBS5, Aditya Chandrashekar, MBBS6, C. David Mintz, MD, PhD7

1Nassau University Medical Center, East Meadow, NY; 2Virginia Commonwealth University, Richmond, VA; 3Florida State University, Cape Coral, FL; 4Florida International University, FL; 5Government Medical College, Ongole, Andhra Pradesh, India; 6The Johns Hopkins Hospital, Baltimore, MD; 7Johns Hopkins University School of Medicine, Baltimore, MD

Introduction: Timely identification of gastrointestinal lesions improves treatment planning and patient outcomes, but variability in image quality, lesion morphology, and illumination challenges traditional detection methods. We propose a dual-model framework that pairs a Vision Transformer (ViT) for classification with a U-Net for precise lesion segmentation, targeting real-time deployment in augmented reality (AR)-guided endoscopy.

Methods: We curated 18,432 labeled images containing 23,156 lesion instances from the KYUCapsule dataset. For classification, we fine-tuned a ViT-Base model pretrained on ImageNet: input frames were resized to 224×224 pixels, with brightness shifts, contrast normalization, and small rotations applied as data augmentation. We replaced the original classification head with a dense layer for three lesion categories (inflammatory, vascular, neoplastic). Training employed AdamW optimization (initial learning rate 1×10⁻⁴, weight decay 1×10⁻⁵) with a cosine annealing schedule and early stopping to prevent overfitting. For segmentation, we built a U-Net with a ResNet-34 encoder and trained it on expert-annotated binary masks; random horizontal and vertical flips improved generalizability. Both models were trained on a stratified 70%/15%/15% train/validation/test split, ensuring equal representation of lesion types. Batch size (32–64), dropout (0.2–0.5), and learning-rate decay schedules were tuned on the validation set. Experiments ran on NVIDIA Tesla V100 GPUs with mixed-precision training (illustrative code sketches of this setup follow the abstract).

Results: On the held-out test set (2,764 classification images; 3,474 segmentation masks), the ViT achieved overall accuracy of 92.4%, precision of 91.8%, recall of 92.1%, and an F1-score of 92.0%. Per-class performance remained consistent: inflammatory 93.1%, vascular 91.5%, neoplastic 92.2%. Attention heatmaps demonstrated focus on relevant morphological features. The U-Net segmentation model reached a Dice coefficient of 91.3% and an IoU of 90.5%, accurately delineating lesion boundaries even in low-contrast regions. Combined inference time averaged 45 milliseconds per frame, meeting AR real-time requirements. Calibration analysis yielded a Brier score of 0.083, indicating minimal overconfidence.

Discussion: Our transformer-based approach classifies and segments gastrointestinal lesions accurately and in real time. High performance, rapid inference, and interpretability support integration into AR-guided endoscopy workflows, with the potential to enhance diagnostic precision and improve patient care.
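As a concrete illustration of the classification setup described in Methods, the sketch below reconstructs the ViT fine-tuning configuration in PyTorch. The `timm` library, the augmentation magnitudes, and the annealing horizon (`T_max`) are assumptions; only the optimizer settings, input size, and three-class head come from the abstract.

```python
# Minimal sketch of the ViT fine-tuning setup (PyTorch + timm assumed).
# Augmentation magnitudes and T_max are illustrative, not from the abstract.
import timm
import torch
from torch import nn
from torchvision import transforms

NUM_CLASSES = 3  # inflammatory, vascular, neoplastic

# Resize to 224x224 with brightness/contrast jitter and small rotations.
train_tfms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # values assumed
    transforms.RandomRotation(degrees=10),                 # "small rotations"
    transforms.ToTensor(),
])

# ViT-Base pretrained on ImageNet; timm swaps in a fresh linear head when
# num_classes is given, matching the replaced classification head.
model = timm.create_model("vit_base_patch16_224", pretrained=True,
                          num_classes=NUM_CLASSES)

# AdamW and cosine annealing with the hyperparameters reported in Methods.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()
```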
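The segmentation branch can be sketched similarly; here the `segmentation_models_pytorch` package stands in for the authors' implementation, and the ImageNet encoder weights and Dice training loss are assumptions. The paired flip helper shows how the reported horizontal/vertical flips must be applied to image and mask together to keep annotations aligned.

```python
# Sketch of the U-Net segmentation branch; segmentation_models_pytorch (smp)
# and ImageNet encoder weights are assumptions, not stated in the abstract.
import random
import segmentation_models_pytorch as smp
from torchvision.transforms import functional as F

model = smp.Unet(
    encoder_name="resnet34",     # ResNet-34 encoder, as in Methods
    encoder_weights="imagenet",  # assumed initialization
    in_channels=3,
    classes=1,                   # binary lesion mask
)

def paired_flip(image, mask):
    """Apply the same random horizontal/vertical flip to image and mask,
    keeping the expert-annotated mask aligned with the frame."""
    if random.random() < 0.5:
        image, mask = F.hflip(image), F.hflip(mask)
    if random.random() < 0.5:
        image, mask = F.vflip(image), F.vflip(mask)
    return image, mask

# Dice loss is a common choice for binary masks (the abstract reports Dice
# as an evaluation metric; the training loss itself is an assumption).
loss_fn = smp.losses.DiceLoss(mode="binary")
```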
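Both models reportedly trained with mixed precision on V100 GPUs; a standard PyTorch AMP loop such as the one below matches that description. `loader`, `model`, `optimizer`, and `criterion` are assumed to be defined as in the sketches above.

```python
# Generic mixed-precision training step (torch.cuda.amp); the authors'
# actual loop was not published, so this is a standard reconstruction.
import torch

scaler = torch.cuda.amp.GradScaler()
device = torch.device("cuda")
model.to(device).train()

for images, targets in loader:  # `loader` yields augmented batches
    images, targets = images.to(device), targets.to(device)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():      # fp16/fp32 mixed forward pass
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()        # scaled backward to avoid underflow
    scaler.step(optimizer)               # unscales grads before stepping
    scaler.update()
```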
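The Results metrics follow standard definitions. The helpers below compute Dice, IoU, and a multiclass Brier score; the exact conventions the authors used (e.g., per-image versus dataset-level averaging) are not stated, so these follow one common formulation.

```python
# Standard formulations of the reported metrics; averaging conventions
# are assumptions, since the abstract does not specify them.
import torch
import torch.nn.functional as F

def dice_iou(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """Dice coefficient and IoU for binary masks (0/1 tensors)."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum()
    dice = (2 * inter + eps) / (union + eps)
    iou = (inter + eps) / (union - inter + eps)
    return dice.item(), iou.item()

def brier_score(probs: torch.Tensor, labels: torch.Tensor) -> float:
    """Multiclass Brier score: mean squared error between predicted
    probabilities (N, C) and one-hot true labels (N,)."""
    onehot = F.one_hot(labels, num_classes=probs.shape[1]).float()
    return ((probs - onehot) ** 2).sum(dim=1).mean().item()
```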
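Finally, the 45 ms combined-inference figure corresponds to timing both models back to back on a single frame; a measurement harness like this sketch (function name and batch shape hypothetical) is the usual way to obtain such a number on GPU.

```python
# Hypothetical harness for the per-frame latency reported in Results.
import time
import torch

@torch.no_grad()
def avg_ms_per_frame(classifier, segmenter, frame, n_runs=100):
    """Average combined ViT + U-Net inference time in milliseconds.
    `frame` is assumed to be a (1, 3, 224, 224) tensor already on GPU."""
    # Warm-up so one-time CUDA initialization is not counted.
    classifier(frame); segmenter(frame)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        classifier(frame)
        segmenter(frame)
    torch.cuda.synchronize()  # wait for queued GPU work to finish
    return (time.perf_counter() - start) / n_runs * 1000.0
```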
Disclosures: Sri Harsha Boppana indicated no relevant financial relationships. Manaswitha Thota indicated no relevant financial relationships. Gautam Maddineni indicated no relevant financial relationships. Sachin Sravan Kumar Komati indicated no relevant financial relationships. Sai Lakshmi Prasanna Komati indicated no relevant financial relationships. Aditya Chandrashekar indicated no relevant financial relationships. C. David Mintz indicated no relevant financial relationships.
Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sai Lakshmi Prasanna Komati, MBBS5, Aditya Chandrashekar, MBBS6, C. David Mintz, MD, PhD7. P5125 - AI-Powered Lesion Detection and Localization for AR-Enabled Capsule Endoscopy: A Step Toward Real-Time Clinical Decision Support, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.