Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sarath Chandra Ponnada5, Sai Lakshmi Prasanna Komati, MBBS6, C. David Mintz, MD, PhD7

1Nassau University Medical Center, East Meadow, NY; 2Virginia Commonwealth University, Richmond, VA; 3Florida State University, Cape Coral, FL; 4Florida International University, FL; 5Great Eastern Medical School and Hospital, Srikakulam, Andhra Pradesh, India; 6Government Medical College, Ongole, Andhra Pradesh, India; 7Johns Hopkins University School of Medicine, Baltimore, MD

Introduction: Accurate classification of gastrointestinal lesions during endoscopic examinations is essential for early diagnosis and appropriate treatment. Traditional computer-aided detection systems focus on single-frame analysis and overlook the temporal continuity inherent in video streams. We implemented a deep-learning framework that combines spatial embeddings from a pretrained Vision Transformer with three-dimensional convolutional processing to capture both morphological and temporal cues in endoscopic video sequences.

Methods: We obtained over one million annotated video frames from the HyperKvasir repository. We segmented each endoscopic examination into overlapping sequences of 16 frames (stride 4), yielding more than 7,500 temporal samples. We extracted 768-dimensional features per frame with the pretrained Vision Transformer and reshaped them into volumetric tensors of size 16 × 32 × 24 to preserve spatial and temporal context. We implemented a three-dimensional convolutional neural network with batch normalization, ReLU activations, and dropout, and optimized it using the Adam algorithm with a stepwise learning-rate scheduler over 500 epochs. Data were partitioned into training (80%), validation (10%), and independent test (10%) sets, with stratified sampling to mitigate class imbalance.

Results: On the held-out test set, the network achieved 97.2% classification accuracy and a weighted F1-score of 0.97. The weighted-average area under the receiver operating characteristic curve reached 1.00, reflecting excellent discrimination among lesion types. Confusion-matrix analysis confirmed high precision and recall across all major categories despite imbalanced class distributions. Dimensionality reduction via t-distributed stochastic neighbor embedding (t-SNE) revealed well-defined clusters corresponding to distinct gastrointestinal lesion classes.

Discussion: The network delivered 97.2% accuracy, a weighted F1-score of 0.97, and an AUC-ROC of 1.00 on an independent test set. Robust performance across lesion categories and distinct clustering in the t-SNE embeddings underscore its reliability. By integrating spatio-temporal features, this approach holds promise for real-time decision support and may reduce diagnostic variability in gastrointestinal endoscopy.
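The Methods describe a two-stage pipeline: per-frame Vision Transformer embeddings are folded into 16 × 32 × 24 feature volumes, which then feed a 3D convolutional classifier. The sketch below illustrates the first stage only and is not the authors' code; the PyTorch/timm stack, the ViT-B/16 variant (chosen because it yields 768-dimensional embeddings), and the helper names make_windows and embed_window are assumptions for illustration.

```python
# Hedged sketch: overlapping 16-frame windows (stride 4), per-frame ViT
# embeddings (768-d), reshaped into a 16 x 32 x 24 volume for a 3D CNN.
# Library and model choices are assumptions, not the authors' implementation.
import torch
import timm

WINDOW, STRIDE = 16, 4          # frames per sequence, step between window starts
FEAT_DIM, H, W = 768, 32, 24    # 768 = 32 x 24, so each embedding folds into a 32 x 24 grid

# Pretrained ViT used purely as a frozen per-frame feature extractor
# (assumption: the abstract does not name the exact ViT variant).
vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
vit.eval()

def make_windows(num_frames: int) -> list[int]:
    """Start indices of overlapping 16-frame windows with stride 4."""
    return list(range(0, num_frames - WINDOW + 1, STRIDE))

@torch.no_grad()
def embed_window(frames: torch.Tensor) -> torch.Tensor:
    """frames: (16, 3, 224, 224) preprocessed RGB frames -> (1, 16, 32, 24) volume."""
    feats = vit(frames)                # (16, 768) pooled embeddings
    volume = feats.view(WINDOW, H, W)  # fold each 768-d vector into a 32 x 24 grid
    return volume.unsqueeze(0)         # add a channel dim for the 3D CNN: (1, 16, 32, 24)
```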
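A minimal sketch of the second stage follows: a 3D CNN with batch normalization, ReLU, and dropout, trained with Adam and a stepwise learning-rate schedule for 500 epochs, as the Methods describe. Channel widths, dropout rate, scheduler parameters, and the class count are assumptions, since the abstract does not specify them.

```python
# Hedged sketch of the 3D CNN classifier and training configuration.
# Layer sizes, dropout rate, LR schedule, and NUM_CLASSES are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 8  # assumption: set to the number of lesion categories in the labeled data

class Lesion3DCNN(nn.Module):
    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            # input: (B, 1, 16, 32, 24) feature volumes from the ViT stage
            nn.Conv3d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
            nn.MaxPool3d(2),                 # -> (B, 32, 8, 16, 12)
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d(2),                 # -> (B, 64, 4, 8, 6)
            nn.AdaptiveAvgPool3d(1),         # -> (B, 64, 1, 1, 1)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),               # dropout rate assumed
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = Lesion3DCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Stepwise decay per the Methods; the interval and factor are assumptions.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

for epoch in range(500):  # 500 epochs per the abstract
    # for volumes, labels in train_loader: ... standard supervised training loop
    scheduler.step()
```

For the 80/10/10 partition with stratified sampling, applying sklearn.model_selection.train_test_split twice with stratify set to the lesion labels would mirror the split described in the Methods.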
Disclosures: Sri Harsha Boppana indicated no relevant financial relationships. Manaswitha Thota indicated no relevant financial relationships. Gautam Maddineni indicated no relevant financial relationships. Sachin Sravan Kumar Komati indicated no relevant financial relationships. Sarath Chandra Ponnada indicated no relevant financial relationships. Sai Lakshmi Prasanna Komati indicated no relevant financial relationships. C. David Mintz indicated no relevant financial relationships.
Sri Harsha Boppana, MBBS, MD1, Manaswitha Thota, MD2, Gautam Maddineni, MD3, Sachin Sravan Kumar Komati4, Sarath Chandra Ponnada5, Sai Lakshmi Prasanna Komati, MBBS6, C. David Mintz, MD, PhD7. P5123 - Video-Based Gastrointestinal Lesion Detection Using Deep Temporal Modeling with 3D CNN. ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.