Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, C. David Mintz, MD, PhD5

1Nassau University Medical Center, East Meadow, NY; 2Virginia Commonwealth University, Richmond, VA; 3Florida State University, Cape Coral, FL; 4Florida International University, Florida, FL; 5Johns Hopkins University School of Medicine, Baltimore, MD

Introduction: Capsule endoscopy (CE) visualizes the small bowel to detect lesions that traditional methods often miss. However, the vast number of images it produces can overwhelm physicians and lead to missed findings. To address this, we developed an AI model based on the Vision Transformer (ViT) architecture to automatically detect and classify small-bowel lesions. Using the SEE-AI and Kvasir-Capsule datasets of annotated CE images, our system aims to help clinicians identify abnormalities more quickly and accurately, enhancing CE’s clinical utility.

Methods: We developed a Vision Transformer–based model to automate small-bowel lesion detection in capsule endoscopy images. It was trained, validated, and tested on the SEE-AI (40,587 training, 8,696 validation, and 8,696 test images) and Kvasir-Capsule (annotations from 117 videos across 21 lesion types) datasets. For comparison, we also evaluated DenseNet121 and ResNet50. Model performance was measured by accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC).

Results: The ViT model demonstrated strong performance, achieving 96.68% training accuracy, 92.03% validation accuracy, and 92.20% test accuracy. AUC values for most lesion categories were close to or equal to 1, indicating robust classification performance. The overall precision, recall, and F1-scores for ViT were all 0.92 (weighted averages), reflecting consistent, high-quality results across datasets.
In contrast, DenseNet121 showed moderate performance, with training accuracy of 93.56%, validation accuracy of 75.18%, and test accuracy of 74%, indicating limitations in recognizing certain lesion types. ResNet50 performed poorly, with training accuracy of 65.52%, validation accuracy of 37.03%, and test accuracy of 38%, reflecting significant difficulty with the complexity of CE images. These results underscore the superior diagnostic precision of the ViT model and its potential for clinical use in automated CE image analysis.

Discussion: The ViT model excelled at automating small-bowel lesion detection and classification in capsule endoscopy. These findings suggest AI can ease physicians’ workloads without sacrificing diagnostic accuracy. Future work will validate the system in real-world clinical settings and integrate it into video-based CE workflows.
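As an illustration of the evaluation metrics reported above (weighted precision/recall/F1 and per-category ROC AUC), the following is a minimal, dependency-free sketch on toy labels. The lesion class names and all data here are hypothetical examples, not the study's datasets or code:

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Per-class precision, recall, and F1, averaged with class-support weights."""
    n = len(y_true)
    support = Counter(y_true)
    wp = wr = wf = 0.0
    for c in support:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = support[c] / n  # weight each class by its share of the samples
        wp += w * prec
        wr += w * rec
        wf += w * f1
    return wp, wr, wf

def roc_auc(labels, scores):
    """One-vs-rest ROC AUC via the pairwise (Mann-Whitney) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions over three hypothetical lesion classes
y_true = ["angioectasia", "ulcer", "ulcer", "normal", "normal", "normal"]
y_pred = ["angioectasia", "ulcer", "normal", "normal", "normal", "normal"]
p, r, f = weighted_prf(y_true, y_pred)  # 0.875, 0.8333..., 0.8174...

# Toy one-vs-rest AUC for a single lesion category
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

The pairwise AUC here is O(n_pos × n_neg) and is meant only to make the definition concrete; a production pipeline would typically use a library implementation with rank-based computation.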
Disclosures: Sri Harsha Boppana indicated no relevant financial relationships. Manaswitha Thota indicated no relevant financial relationships. Gautam Maddineni indicated no relevant financial relationships. Sachin Sravan Kumar Komati indicated no relevant financial relationships. C. David Mintz indicated no relevant financial relationships.
Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, C. David Mintz, MD, PhD5. P4079 - Automated Detection and Classification of Small Bowel Lesions in Capsule Endoscopy Using Vision Transformer (ViT) Architecture, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.