Hybrid Deep-ML for Urothelial Carcinoma Staging


Plain-English Explanations
Pages 1-3
Why Bladder Cancer Staging Is Difficult and What This Study Proposes

The clinical problem: Bladder cancer is a heterogeneous disease. This study works with seven stage categories: low-grade non-invasive papillary carcinoma (Ta), carcinoma in situ (Tis), superficially invasive T1, the muscle-invasive stages T2, T3, and T4, plus T0 for post-treatment tissue with no residual tumor. Accurate staging is essential because the treatment pathway diverges sharply: non-muscle-invasive bladder cancer (NMIBC, comprising Ta, Tis, and T1) can typically be managed with transurethral resection of bladder tumor (TURBT) and intravesical therapy, while muscle-invasive bladder cancer (MIBC, comprising T2 through T4) requires neoadjuvant chemotherapy followed by radical cystectomy. However, CT-based staging has been reported with accuracy as low as 49%, and visual inspection of CT scans struggles to detect subtle findings such as perivesical fat stranding, focal wall thickening, and small pelvic lymph nodes that may harbor metastasis.

Prior radiomics approaches: Previous studies used texture-based features such as gray-level co-occurrence matrix (GLCM), gray-level run length matrix (GLRLM), local binary patterns (LBP), and Gabor wavelets to classify bladder cancer from MRI and CT scans. For example, one study achieved 95.24% accuracy differentiating T1 from T2 on MRI using 25 GLCM and 16 GLRLM features from 42 patients. Another used LBP and GLCM features with an SVM classifier on 65 patients, reporting an AUC of 80.60%. While these statistical texture approaches are fast and interpretable, they have generally achieved modest performance, and no clear consensus has emerged on which imaging signatures are most reliable.

The hybrid proposal: This study by Sarkar et al., a collaboration between Friedrich-Alexander-Universität Erlangen-Nürnberg, Mayo Clinic, Dana-Farber Cancer Institute, and Arizona State University, proposes a hybrid deep-machine learning framework. Rather than training a deep neural network end-to-end (which risks overfitting on small datasets), the authors use five pre-trained deep models as feature extractors and then feed the resulting feature vectors into five classical machine learning classifiers. This two-stage approach is designed to combine the powerful feature extraction of deep learning with the robustness and interpretability of traditional ML methods on a small, imbalanced dataset of only 100 CT scans from 100 patients.

Three classification tasks: The model is evaluated on three clinically relevant binary classification problems: (1) normal tissue vs. bladder cancer tissue, (2) NMIBC vs. MIBC, and (3) post-treatment changes (PTC) vs. MIBC. The third task is particularly important because distinguishing residual or recurrent cancer from post-treatment scarring on imaging is notoriously difficult and currently requires invasive repeat biopsy.

TL;DR: This Mayo Clinic/ASU study proposes a hybrid deep-ML framework for bladder cancer staging from CT scans. Pre-trained deep networks extract features, which are then classified by traditional ML algorithms. The approach addresses overfitting on a small dataset of 100 patients across 7 cancer stages, tackling three tasks: cancer detection, NMIBC vs. MIBC staging, and PTC vs. MIBC differentiation.
Pages 3-5
How CNNs and Transfer Learning Have Been Applied to Bladder Cancer

CNNs outperform texture features: The paper reviews the growing use of convolutional neural networks (CNNs) in bladder cancer imaging and notes that neural network-based classifiers are generally more effective than texture analysis alone. A landmark study designed nine CNN-based models for MIBC vs. NMIBC classification from contrast-enhanced CT (CECT) images, using 1200 CT scans from 369 patients. The VGG16 architecture, pre-trained on ImageNet, achieved the best results with an AUC of 99.70%, accuracy of 93.90%, sensitivity of 88.90%, and specificity of 98.90%. In contrast, a Haralick feature-based SVM approach on 118 MRI volumes from 68 patients achieved only an AUC of 86.10%.

Other CNN applications: A CNN-based model for low vs. high-stage bladder cancer classification from 84 CT urography images achieved 91.00% accuracy, outperforming texture-based SVM classification on the same dataset (88.00%). Another study using radiomics features (GLCM and histogram) from diffusion-weighted MRI in 61 patients achieved 82.90% accuracy for grading. These results consistently demonstrate CNNs' advantage, but the authors note that the end-to-end models reviewed were typically trained and tested on the same dataset, which raises concerns about generalizability, especially with small sample sizes.

The case for transfer learning: The authors highlight a critical finding: one study on Ta vs. T1 staging in 1177 bladder scans found that CNN classifiers achieved only 84.00% accuracy, performing worse than supervised ML classifiers trained on manually extracted features. This suggests that domain-knowledge-guided feature extraction can sometimes outperform end-to-end deep learning, particularly when data is limited. Because their own dataset contained only 200 ROIs from 100 CT scans, the authors adopted transfer learning, where deep networks are first pre-trained on the large ImageNet dataset (millions of natural images) and then fine-tuned on the target bladder cancer data. This strategy leverages the rich feature representations learned from ImageNet while adapting them to the medical imaging domain.

Rationale for the hybrid approach: Rather than using the fine-tuned deep networks directly for classification (which still risks overfitting on small medical datasets), the authors extract the learned feature representations from the last pooling layer of each network and feed them into classical ML classifiers. This "hybrid" strategy decouples the feature learning stage from the classification stage, allowing each component to be optimized independently and reducing the number of trainable parameters at the classification stage.
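The two-stage idea can be sketched in a few lines. This is a minimal illustration with scikit-learn, not the authors' code: random vectors stand in for CNN-extracted features (with the "cancer" class given a shifted mean so the classes are separable), and LDA stands in for the paper's classical classifiers.

```python
# Sketch of the two-stage hybrid pipeline on synthetic data.
# Real deep features would come from a fine-tuned CNN's last pooling
# layer; here random vectors stand in for them.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_rois, n_features = 200, 64          # 200 ROIs, already-reduced feature set
X = rng.normal(size=(n_rois, n_features))
y = rng.integers(0, 2, size=n_rois)   # binary labels (e.g., normal vs. cancer)
X[y == 1] += 0.8                      # give one class a shifted mean

# Stage 2: a classical classifier on the extracted features,
# evaluated with 10-fold cross-validation as in the paper.
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y,
                         cv=10, scoring="f1")
print(round(scores.mean(), 3))
```

Because the classifier only sees a fixed-length feature vector, swapping in a different extractor (or a different classifier) requires no retraining of the other stage, which is exactly the decoupling the authors exploit.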

TL;DR: CNNs generally outperform texture features for bladder cancer classification (VGG16 achieved AUC 99.70% on 1200 CT scans), but end-to-end deep models risk overfitting on small datasets. Transfer learning from ImageNet, combined with classical ML classifiers, provides a practical middle ground for the study's limited dataset of 200 ROIs from 100 patients.
Pages 4-6
Five Pre-Trained Deep Models Used as Feature Extractors

The five architectures: The study extracts features using five widely adopted CNN architectures, all pre-trained on the ImageNet dataset. AlexNet (25 layers total) uses max pooling at layer 16 and produces a feature vector of length 9,216. GoogleNet (144 layers) uses global average pooling at layer 140 and yields a 1,024-dimensional feature vector. InceptionV3 (315 layers) applies global average pooling at layer 312, producing a 2,048-dimensional vector. ResNet-50 (177 layers) uses global average pooling at layer 174, also producing a 2,048-dimensional vector. Finally, XceptionNet (170 layers) uses global average pooling at layer 167 and outputs a 2,048-dimensional vector.

How features are extracted: Each of the five models was first fine-tuned on the bladder CT scan dataset using transfer learning. After training, the authors extracted the activation values from the last pooling layer of each model. These activations serve as high-level feature descriptors that encode complex visual patterns such as tissue texture, boundary characteristics, and intensity distributions that the deep model has learned to associate with different tissue types. The key insight is that these features capture richer, more abstract representations than hand-crafted texture features like GLCM or LBP, while the subsequent ML classification step avoids the overfitting risks of fully end-to-end deep learning.

Feature vector dimensionality: The resulting feature vectors are high-dimensional, particularly AlexNet's 9,216-dimensional output. High-dimensional feature spaces are prone to the "curse of dimensionality," where the number of features exceeds the number of samples (here, only 200 ROIs). This makes the subsequent feature selection step critical for removing redundant, uninformative, or correlated features before classification. The choice of five different architectures also allows the authors to compare which network learns the most discriminative representations for bladder cancer CT images.

TL;DR: Five ImageNet-pretrained CNNs (AlexNet, GoogleNet, InceptionV3, ResNet-50, XceptionNet) serve as feature extractors. Feature vectors range from 1,024 to 9,216 dimensions, extracted from each model's last pooling layer after fine-tuning on the bladder CT data. These deep features encode richer visual patterns than hand-crafted texture descriptors.
Pages 5-6
Ensemble Feature Selection: Filtering Thousands of Features Down to the Most Informative

Five-step pipeline: The authors designed a multi-stage feature selection algorithm that combines unsupervised statistical filtering with supervised importance scoring. Given the high dimensionality of the extracted features (up to 9,216 from AlexNet) and only 200 ROIs available, aggressive feature reduction was essential to prevent overfitting. The pipeline proceeds through five sequential steps, each progressively narrowing the feature set.

Steps 1 and 2 - Sparsity filter and data imputer: The first step applies a "sparsity filter" that removes features whose values were not meaningfully updated during the deep model's fine-tuning process. These features have near-zero variance and carry negligible discriminative information. The second step uses a "data imputer" to replace any unchanged (constant) values in remaining features with the column mean, ensuring that the subsequent statistical analyses operate on complete, well-conditioned data. These two steps are purely unsupervised, requiring no label information.

Steps 3 and 4 - Coefficient of variation filter and correlation matrix: The third step removes features with a coefficient of variation (CV) below 0.1. CV is the standard deviation normalized by the mean, so a low CV indicates features with minimal spread relative to their magnitude, meaning they contribute little distinguishing power. The fourth step constructs a correlation matrix and, for each pair of features with cross-correlation exceeding 95%, drops the one with the weaker correlation to the output label. This removes highly redundant features that would add noise without adding information.

Step 5 - Random forest importance: The final step trains a random forest classifier (100 trees, Gini index criterion) on the remaining features and their labels to generate feature importance scores. This is the only supervised step in the pipeline, and it ranks features by their contribution to classification accuracy. The combination of four unsupervised filters followed by one supervised ranking creates a robust ensemble approach that balances statistical rigor with task-specific relevance.

TL;DR: A five-step ensemble feature selection pipeline reduces thousands of deep features to a manageable set. Four unsupervised steps (sparsity filtering, imputation, low-CV removal, 95% correlation pruning) eliminate noise and redundancy, followed by supervised random forest importance scoring with 100 trees and Gini index. This prevents overfitting on the small 200-ROI dataset.
Pages 6-8
The Mayo Clinic Dataset: 100 Patients, Seven Stages, and Severe Class Imbalance

Dataset composition: The dataset was provided by Mayo Clinic, Arizona, and consisted of de-identified grayscale contrast-enhanced CT scans from 100 patients who were imaged before radical cystectomy and pelvic lymph node dissection. Each scan had two manually annotated masks (normal bladder wall and bladder cancer tissue) created by two radiologists, yielding 200 regions of interest (ROIs) total: 100 normal tissue and 100 abnormal tissue. The patient distribution across seven stages was highly imbalanced: Ta = 6, Tis = 9, T0 = 35, T1 = 9, T2 = 13, T3 = 24, and T4 = 4.

Class groupings for the three tasks: For normal vs. cancer classification, 100 normal ROIs were paired against 65 cancer ROIs (the 35 T0 patients were excluded because T0 represents post-treatment tissue with no evidence of malignancy). For NMIBC vs. MIBC classification, 24 NMIBC ROIs (Ta + Tis + T1) were compared against 41 MIBC ROIs (T2 + T3 + T4). For PTC vs. MIBC classification, 35 PTC ROIs (T0) were compared against 41 MIBC ROIs. Each task thus had a different sample size and degree of class imbalance.

Why SMOTE was not used: The authors explicitly note that synthetic minority oversampling technique (SMOTE) could not resolve the class imbalance because certain stages had extremely few samples (only 6 for Ta and 4 for T4). Synthetically generated samples from such tiny classes would be nearly identical to the originals, reducing training set variance and exacerbating overfitting rather than alleviating it. This is a critical practical consideration that many studies overlook when applying SMOTE to very small medical imaging datasets.

Evaluation methodology: All classification experiments used 10-fold cross-validation. The dataset was randomly shuffled and split into 10 equal partitions; in each iteration, 9 partitions served as training data and 1 as the test set. Performance metrics (accuracy, sensitivity, specificity, precision, and F1-score) were averaged across all 10 folds. The F1-score was chosen as the primary metric for ranking classifiers because, unlike accuracy, it accounts for class imbalance by computing the harmonic mean of precision and recall.

TL;DR: The Mayo Clinic dataset contained 100 CT scans from 100 patients distributed across 7 bladder cancer stages, with severe imbalance (as few as 4 patients in T4). Three binary classification tasks were evaluated using 10-fold cross-validation, with F1-score as the primary metric. SMOTE was rejected due to insufficient samples in minority classes.
Pages 9-10
Task 1: Normal vs. Cancer Classification Results

Best model - XceptionNet + LDA: For the normal vs. cancer classification task (165 ROIs: 100 normal, 65 cancer), the linear discriminant analysis (LDA) classifier applied to XceptionNet-extracted features achieved the best overall performance with an accuracy of 86.07%, sensitivity of 96.75%, specificity of 69.65%, precision of 83.07%, and F1-score of 89.39%. The high sensitivity means the model correctly identifies cancer tissue in nearly 97 out of every 100 cases, which is clinically valuable for screening applications where missing a cancer is more harmful than a false alarm.

Comparison across architectures: XceptionNet-based features consistently outperformed the other four deep models across most classifiers. ResNet-50 + naive Bayes (NB) was the second-best combination with an F1-score of 86.54%, followed by AlexNet + LDA at 84.86%. GoogleNet-based features generally produced the lowest F1-scores, with the decision tree (DT) classifier on GoogleNet yielding 80.01%. The DT classifier consistently underperformed the other classifiers across all feature extractors, suggesting that the complex, high-dimensional feature spaces from deep models are better suited to classifiers with smoother decision boundaries.

Sensitivity vs. specificity trade-off: A notable pattern across all models was high sensitivity paired with lower specificity. For instance, XceptionNet + LDA achieved 96.75% sensitivity but only 69.65% specificity, meaning the model is more likely to flag normal tissue as cancerous than to miss actual cancer. In a clinical context, this trade-off is generally acceptable for a screening tool: it is safer to flag a suspicious region for further investigation than to miss a cancer entirely. However, the relatively low specificity suggests room for improvement, possibly through larger training datasets or more targeted feature engineering.

Impact of feature selection: The ensemble feature selection pipeline proved critical to these results. Without feature reduction, the raw 9,216-dimensional AlexNet features and 2,048-dimensional features from the other models would have overwhelmed the classical ML classifiers on a dataset of only 165 samples. The sparsity filter, CV filter, correlation pruning, and random forest importance scoring collectively reduced the feature space to a manageable dimensionality while retaining the most discriminative features for the cancer vs. normal distinction.

TL;DR: For cancer vs. normal tissue classification, XceptionNet + LDA achieved the best F1-score of 89.39%, with 86.07% accuracy, 96.75% sensitivity, and 69.65% specificity on 165 ROIs. High sensitivity makes the model suitable as a screening aid, though specificity was limited by the small dataset. XceptionNet features consistently outperformed the other four architectures.
Pages 10-12
Tasks 2 and 3: NMIBC vs. MIBC and Post-Treatment Changes vs. MIBC

NMIBC vs. MIBC - Best model: For the critical staging task of differentiating non-muscle-invasive from muscle-invasive bladder cancer (65 ROIs: 24 NMIBC, 41 MIBC), the LDA classifier on XceptionNet features again achieved the top performance with an accuracy of 79.72%, sensitivity of 66.62%, specificity of 87.39%, precision of 75.58%, and F1-score of 70.81%. The high specificity indicates the model is reliable at confirming MIBC when it makes that prediction, which is important given that MIBC designation triggers more aggressive treatment with chemotherapy and cystectomy.

NMIBC vs. MIBC - Architecture comparison: Performance varied substantially across feature extractors for this task. AlexNet + LDA was the second-best with an F1-score of 70.67%, while InceptionV3 + NB yielded 65.11% and ResNet-50 + SVM produced one of the lowest F1-scores at 45.84% due to very low sensitivity (31.67%). GoogleNet + DT achieved 69.25%. The wider spread in performance across models for this task compared to the cancer detection task reflects the inherently greater difficulty of distinguishing NMIBC from MIBC on CT imaging, where the differences in tissue appearance between stages are subtle and highly variable.

PTC vs. MIBC - Best model: For the task of distinguishing post-treatment changes from muscle-invasive cancer (76 ROIs: 35 PTC, 41 MIBC), LDA on XceptionNet features again led with an accuracy of 74.96%, sensitivity of 80.51%, specificity of 70.22%, precision of 69.78%, and F1-score of 74.73%. This task is clinically significant because post-treatment CT scans frequently show tissue changes that mimic residual or recurrent cancer, currently requiring invasive biopsy to resolve.

PTC vs. MIBC - Broader results: ResNet-50 + NB achieved a competitive F1-score of 70.72% for this task, driven by high sensitivity of 86.89% but low specificity of 49.80%. GoogleNet + LDA and InceptionV3 + LDA performed similarly at F1-scores of 67.43% and 67.42%, respectively. The decision tree classifier again consistently produced the weakest results across all feature extractors, with F1-scores ranging from 58.51% to 64.50%. Overall, the PTC vs. MIBC results are encouraging as a proof-of-concept for non-invasive post-treatment assessment, though the moderate performance underscores the need for larger training datasets.

TL;DR: XceptionNet + LDA was the best-performing combination across all three tasks. For NMIBC vs. MIBC staging, it achieved F1-score 70.81% with 87.39% specificity on 65 ROIs. For PTC vs. MIBC, it reached F1-score 74.73% with 80.51% sensitivity on 76 ROIs. These results demonstrate the feasibility of non-invasive CT-based staging and post-treatment monitoring.
Pages 12-15
Clinical Significance, Limitations, and the Path Forward

Consistency of XceptionNet + LDA: A key finding is that the LDA classifier on XceptionNet-extracted features performed best in terms of F1-score across all three classification tasks. The authors attribute this to XceptionNet's depthwise separable convolutions, which efficiently capture spatial hierarchies in medical images, combined with LDA's ability to find optimal linear boundaries in high-dimensional feature spaces without overfitting to noise. This consistency suggests that the XceptionNet + LDA combination may be a reliable default configuration for CT-based bladder cancer analysis when data is limited.

Addressing overfitting on small data: The hybrid approach directly addressed two major challenges: class imbalance and limited sample size. With only 100 patients across 7 stages (as few as 4 in T4), end-to-end deep learning would almost certainly overfit. The authors found that SMOTE could not help because synthetically generated samples from categories with only 4-6 patients were virtually identical to the originals, decreasing training variance. Their two-pronged solution of (1) using pre-trained networks as feature extractors rather than end-to-end classifiers, and (2) applying ensemble feature selection to remove uninformative features, successfully mitigated overfitting while maintaining competitive classification performance.

Comparison with prior work: The cancer detection accuracy of 86.07% is comparable to a previous deep learning study that reported 86.36% accuracy for bladder cancer detection from cystoscopic images, where no significant difference was found between AI and human surgical experts (p > 0.05). However, the authors' dataset is notably smaller (100 vs. 1350 patients) and uses CT rather than cystoscopy, making direct comparison difficult. For the staging task, the 79.72% accuracy for NMIBC vs. MIBC exceeds CT's historically reported accuracy of as low as 49% for bladder cancer evaluation, though this comparison should be interpreted cautiously given differences in datasets and evaluation methods.

Limitations: The study is retrospective and based on a single-center, small dataset, which may overestimate diagnostic performance. The class imbalance issue was partially but not fully resolved, and the specificity values for cancer detection (69.65%) and PTC vs. MIBC classification (70.22%) leave meaningful room for improvement. No external validation was performed, and the model has not been tested in a prospective clinical setting. The authors acknowledge these constraints and state that their next step is to extend the diagnostic model to prospective, multi-center datasets with external validation.

Clinical implications: Despite these limitations, the study demonstrates that radiomics-assisted interpretation of CT by radiologists could help more accurately diagnose and stage bladder cancer, enabling timely consultation with oncologists and urologists and ultimately improving patient outcomes. The PTC vs. MIBC classification capability is especially noteworthy, as it addresses an unmet clinical need: currently, patients require costly and invasive repeat cystoscopies and biopsies to assess treatment response and recurrence, with associated complications including bladder perforation.

TL;DR: XceptionNet + LDA was the consistent top performer across all tasks. The hybrid deep-ML approach successfully mitigated overfitting on a 100-patient dataset where SMOTE failed. Cancer detection reached 86.07% accuracy and NMIBC vs. MIBC staging reached 79.72%, both exceeding CT's historically reported 49% baseline. The main limitation is the single-center, small, retrospective design, with prospective multi-center validation planned as the next step.
Citation: Sarkar S, Min K, Ikram W, et al. Cancers, 2023 (open access). Available at: PMC10046500. DOI: 10.3390/cancers15061673. License: CC BY.