Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sai Lakshmi Prasanna Komati, MBBS5, Aditya Chandrashekar, MBBS6, C. David Mintz, MD, PhD7

1Nassau University Medical Center, East Meadow, NY; 2Virginia Commonwealth University, Richmond, VA; 3Florida State University, Cape Coral, FL; 4Florida International University, Miami, FL; 5Government Medical College, Ongole, Ongole, Andhra Pradesh, India; 6The Johns Hopkins Hospital, Baltimore, MD; 7Johns Hopkins University School of Medicine, Baltimore, MD

Introduction: Capsule endoscopy generates thousands of frames per study, making manual review time-consuming and prone to missed lesions. Existing algorithms often rely solely on spatial features, limiting detection of lesions with subtle visual cues. Integrating temporal context can reveal patterns that evolve across contiguous frames.

Methods: We used the KYUCapsule dataset, which contains over 18,000 capsule endoscopy images with annotations for 23,000 lesions across four categories. We normalized all frames to 512×512 pixels and balanced classes via targeted augmentation (rotation, scaling, intensity variation). A Vision Transformer (ViT) backbone, initialized with ImageNet weights, extracted spatial embeddings from individual frames. In parallel, we formed sequences of five consecutive frames and processed them with a 3D Convolutional Neural Network (3D-CNN) to capture temporal dynamics. We fine-tuned both networks using categorical cross-entropy loss, optimized with AdamW under a cosine annealing schedule over 50 epochs. We concatenated the spatial and temporal embeddings and fed the fused vector into a fully connected head that produced probability scores for each lesion type. We calibrated outputs with isotonic regression and evaluated performance on a held-out test set.

Results: The hybrid model achieved 92.3% overall accuracy on unseen data. Its macro-averaged precision reached 91.7%, recall 90.8%, and F1-score 91.2% across the four lesion classes. For ulcers, which are often indistinct against the surrounding mucosa, the model attained 90.2% accuracy. Bleeding detection achieved 93.4% precision, demonstrating high reliability in critical scenarios. Incorporating temporal context reduced false positives by 15% relative to a spatial-only baseline, confirming the value of frame-to-frame information. Calibration analysis yielded a Brier score of 0.037, indicating that predicted probabilities closely matched observed outcomes.

Discussion: The proposed hybrid spatial-temporal architecture substantially improves detection and classification of diverse capsule endoscopy lesions. Key strengths include accurate identification of visually subtle lesions and reliable confidence estimates. By reducing false positives and enhancing lesion delineation, this approach can streamline clinical workflows and help gastroenterologists identify pathology more efficiently.
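To make the two-branch fusion concrete, the following is a minimal PyTorch sketch of an architecture consistent with the Methods description. It assumes torchvision's ImageNet-pretrained vit_b_16 for the spatial branch; the 3D-CNN layer sizes, the 256-dimensional temporal embedding, and feeding the clip's center frame to the ViT are illustrative assumptions, not details reported in the abstract.

```python
# Minimal sketch of a hybrid spatial-temporal classifier, assuming
# PyTorch + torchvision. Layer sizes and the center-frame choice are
# illustrative guesses; the abstract does not specify them.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vit_b_16, ViT_B_16_Weights

class HybridLesionClassifier(nn.Module):
    def __init__(self, num_classes: int = 4, temporal_dim: int = 256):
        super().__init__()
        # Spatial branch: ImageNet-pretrained ViT; the classification head
        # is replaced so the forward pass returns the 768-d CLS embedding.
        self.vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.vit.heads = nn.Identity()
        # Temporal branch: a small 3D-CNN over 5-frame clips (B, 3, 5, H, W).
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, temporal_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        # Fusion head over the concatenated spatial + temporal embeddings.
        self.head = nn.Linear(768 + temporal_dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, 3, 5, 512, 512); the center frame feeds the ViT.
        center = clip[:, :, clip.shape[2] // 2]
        # Pretrained vit_b_16 expects 224x224 input, so the frame is resized.
        spatial = self.vit(F.interpolate(center, size=224, mode="bilinear"))
        temporal = self.cnn3d(clip).flatten(1)
        return self.head(torch.cat([spatial, temporal], dim=1))
```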
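The stated training recipe (categorical cross-entropy, AdamW, cosine annealing over 50 epochs) maps onto a conventional fine-tuning loop like the one below. The learning rate, weight decay, and the train_loader object are hypothetical; the abstract reports only the optimizer, loss, schedule, and epoch count.

```python
# Illustrative training loop; lr and weight_decay are assumed values, and
# `train_loader` is a hypothetical DataLoader yielding (clip, label) batches.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = HybridLesionClassifier().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()  # categorical cross-entropy

for epoch in range(50):
    model.train()
    for clips, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(clips.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()
    scheduler.step()  # one cosine annealing step per epoch
```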
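For the calibration step, one common construction is per-class isotonic regression fit on a held-out validation split, followed by multi-class Brier scoring. The sketch below, using scikit-learn, reflects that construction; the array names and the per-class fit-then-renormalize scheme are assumptions, since the abstract does not specify how isotonic regression was applied to the four-class outputs, and multi-class Brier score definitions vary across papers.

```python
# Sketch of per-class isotonic calibration and Brier scoring with
# scikit-learn; `val_probs`/`val_labels` and `test_probs`/`test_labels`
# are hypothetical arrays of softmax outputs and integer class labels.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate(val_probs, val_labels, test_probs, num_classes=4):
    calibrated = np.zeros_like(test_probs)
    for c in range(num_classes):
        iso = IsotonicRegression(out_of_bounds="clip")
        # Fit class-c probability against the binary indicator for class c.
        iso.fit(val_probs[:, c], (val_labels == c).astype(float))
        calibrated[:, c] = iso.predict(test_probs[:, c])
    # Renormalize so each row is again a probability distribution.
    return calibrated / calibrated.sum(axis=1, keepdims=True)

def brier_score(probs, labels, num_classes=4):
    # Mean squared error between predicted probabilities and one-hot labels.
    onehot = np.eye(num_classes)[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))
```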
Disclosures: Sri Harsha Boppana indicated no relevant financial relationships. Manaswitha Thota indicated no relevant financial relationships. Gautam Maddineni indicated no relevant financial relationships. Sachin Sravan Kumar Komati indicated no relevant financial relationships. Sai Lakshmi Prasanna Komati indicated no relevant financial relationships. Aditya Chandrashekar indicated no relevant financial relationships. C. David Mintz indicated no relevant financial relationships.
Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sai Lakshmi Prasanna Komati, MBBS5, Aditya Chandrashekar, MBBS6, C. David Mintz, MD, PhD7. P5128 - Hybrid Spatial-Temporal Deep Learning Accurately Classifies Capsule Endoscopy Lesions, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.