Sri Harsha Boppana, MBBS, MD¹, Manaswitha Thota, MD², Gautam Maddineni, MD³, Sachin Sravan Kumar Komati⁴, C. David Mintz, MD, PhD⁵

¹Nassau University Medical Center, East Meadow, NY; ²Virginia Commonwealth University, Richmond, VA; ³Florida State University, Cape Coral, FL; ⁴Florida International University, Florida, FL; ⁵Johns Hopkins University School of Medicine, Baltimore, MD

Introduction: Accurate prediction of gastrointestinal lesion recurrence after treatment remains a critical unmet need. Integrating visual features from endoscopic imaging with patient clinical data may enhance risk stratification. We developed and evaluated a multimodal deep-learning pipeline that combines Vision Transformer embeddings of capsule endoscopy images with structured metadata to predict lesion recurrence.

Methods: We retrospectively assembled a cohort of 1,200 adult patients from the KYUCapsule dataset who underwent capsule endoscopy and were followed for at least 12 months. We annotated 3,600 high-resolution lesion frames, categorizing each by lesion type (inflammatory, vascular, or neoplastic) and extracting the corresponding metadata: age, sex, treatment regimen, lesion classification, history of prior recurrence, and follow-up duration. Image frames were passed through a pretrained Vision Transformer (ViT) whose final layers were fine-tuned on our dataset to produce 512-dimensional feature vectors capturing morphology and texture. In parallel, we encoded the structured metadata with a fully connected neural network (two layers of 128 and 64 neurons, ReLU activation). We concatenated the image and metadata embeddings and trained a multilayer perceptron classifier with two hidden layers (256 and 128 neurons) and dropout (rate = 0.3). We split the data into training (70%), validation (15%), and test (15%) sets, using oversampling to ensure equal representation of recurrence and non-recurrence cases.
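As a minimal sketch of the 70/15/15 split with oversampling described above, one could write something like the following. It assumes the split is made at the patient level and that oversampling is applied only to the training portion; the abstract states neither. The function name `split_and_oversample` and the toy label vector are hypothetical and are not taken from the authors' pipeline.

```python
import random


def split_and_oversample(labels, train_frac=0.70, val_frac=0.15, seed=0):
    """Return (train, val, test) lists of sample indices.

    Hypothetical helper: shuffles the indices, cuts them 70/15/15, then
    duplicates minority-class training indices until the classes are balanced.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)

    n_train = round(train_frac * len(indices))
    n_val = round(val_frac * len(indices))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]

    # Assumption: oversample inside the training split only, so that
    # duplicated samples cannot leak into validation or test data.
    by_class = {}
    for i in train:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())

    balanced = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    rng.shuffle(balanced)
    return balanced, val, test


# Toy labels (1 = recurrence, 0 = no recurrence); the class ratio is made up.
toy_labels = [1] * 300 + [0] * 900
train_idx, val_idx, test_idx = split_and_oversample(toy_labels)
```

With 1,200 samples, the 15% validation and test partitions each contain 180 samples, which matches the test-set size reported in the Results.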
Hyperparameters (learning rate, batch size, L2 regularization λ = 0.001) were optimized via grid search on the validation set. We used cross-entropy loss, Adam optimization, and early stopping.

Results: On the test set (n = 180), the multimodal model achieved accuracy = 89.2%, precision = 88.7%, recall = 88.5%, and F1-score = 88.9%. The area under the ROC curve reached 0.902. Subgroup analysis showed consistent accuracy across lesion types: inflammatory = 90.5%, vascular = 88.0%, and neoplastic = 89.8%. Calibration was strong (Brier score = 0.087). SHAP analysis identified lesion type, prior recurrence history, and image-derived severity metrics as the top three predictors.

Discussion: Our multimodal transformer-based approach reliably integrates endoscopic imaging and clinical metadata to predict gastrointestinal lesion recurrence. Given its high discriminative performance and interpretability, this model could aid clinicians in tailoring surveillance and management strategies, potentially improving patient outcomes.
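The F1-score and Brier score reported in the Results follow standard definitions, which the short sketch below illustrates. The function names and the three-sample toy predictions are hypothetical; only the precision and recall values are taken from the abstract.

```python
def f1_from_precision_recall(precision, recall):
    """Binary F1-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes.

    Lower is better; predicting 0.5 for every case scores 0.25.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Precision and recall as reported in the abstract.
f1 = f1_from_precision_recall(0.887, 0.885)

# Made-up predicted probabilities and outcomes, for illustration only.
toy_brier = brier_score([0.9, 0.2, 0.7], [1, 0, 1])
```

Under the binary definition, the reported precision and recall give an F1-score of about 88.6%. The reported 88.9% may reflect a different averaging scheme (for example, macro-averaging across classes), which the abstract does not specify.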
Disclosures: Sri Harsha Boppana indicated no relevant financial relationships. Manaswitha Thota indicated no relevant financial relationships. Gautam Maddineni indicated no relevant financial relationships. Sachin Sravan Kumar Komati indicated no relevant financial relationships. C. David Mintz indicated no relevant financial relationships.
Sri Harsha Boppana, MBBS, MD¹, Manaswitha Thota, MD², Gautam Maddineni, MD³, Sachin Sravan Kumar Komati⁴, C. David Mintz, MD, PhD⁵. P4078 - Multimodal Transformer Model Predicts Post-Treatment Lesion Recurrence in Gastrointestinal Patients, ACG 2025 Annual Scientific Meeting Abstracts. Phoenix, AZ: American College of Gastroenterology.