ML Prediction of Bladder Cancer from Lab Data

Plain-English Explanations

Overview

Pages 1-3

Why This Study Was Needed and What It Set Out to Do

Bladder cancer is the 10th most common cancer worldwide, with GLOBOCAN reporting 573,278 new cases and 212,536 deaths globally. The disease is roughly four times more common in men than in women, with incidence rates of 9.5 per 100,000 in males versus about 2.4 per 100,000 in females. Smoking is the single largest modifiable risk factor. The gold standard for diagnosis remains cystoscopy, which achieves 88 to 100% sensitivity and 77 to 97% specificity but is invasive. The main non-invasive alternative, urine cytology, has high specificity but only about 38% sensitivity, meaning it misses more than half of all bladder cancers.

The authors from Taipei Medical University and National Tsing Hua University hypothesized that routine clinical laboratory data, including standard biochemistry panels and urinalysis results, contain hidden patterns that could discriminate bladder cancer from cystitis and other cancers. Unlike novel biomarkers such as NMP22 or BTA, routine lab tests are already collected during standard care, making them a fast and inexpensive data source for a screening model. No prior study had attempted to combine machine learning with routine clinical lab tests specifically for bladder cancer prediction.

To test this hypothesis, the team collected data from 1,336 patients at MacKay Memorial Hospital (January 2017 to February 2020): 591 bladder cancer patients, 144 cystitis patients, 200 kidney cancer patients, 201 prostate cancer patients, and 200 uterus cancer patients. All cancer diagnoses were confirmed by pathological report. The study used five machine learning models: decision tree (DT), random forest (RF), support vector machine (SVM), XGBoost, and LightGBM, each trained with 10-fold cross-validation using scikit-learn.
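As a rough illustration of what 10-fold cross-validation involves, the sketch below splits the bladder cancer and cystitis patients into ten folds and holds out one fold at a time. It is written in plain Python rather than with the authors' scikit-learn code. Keeping the class mix similar in every fold (stratification) is an assumption here, not something the summary confirms, and the function names are hypothetical.

```python
import random
from collections import defaultdict

def stratified_k_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per held-out fold."""
    folds = stratified_k_folds(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

# Group sizes taken from the text: 591 bladder cancer and 144 cystitis patients.
labels = ["bladder"] * 591 + ["cystitis"] * 144
splits = list(train_test_splits(labels))
```

Each patient lands in exactly one test fold, so every reported metric is based on predictions for patients the model did not see during that round of training.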

TL;DR: This 2022 study tested whether routine blood and urine lab tests could predict bladder cancer using five ML algorithms (decision tree, random forest, SVM, XGBoost, LightGBM) on 1,336 patients from a Taiwanese hospital, addressing the poor 38% sensitivity of urine cytology.

Data and Preprocessing

Pages 3-5

Clinical Laboratory Features and Handling Missing Data

The original dataset contained 56 laboratory test results per patient. Features with more than 50% missing data were removed, leaving 31 laboratory tests with missing rates ranging from 0% to 44.1%. The highest missing rates were for A/G ratio (44.1%), urine epithelium (43%), and calcium (42.8%). Missing continuous values were filled with the mean of each feature, while missing categorical values were filled with the median. Features that were entirely absent in certain comparison groups were excluded from those specific classification tasks.
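The filtering and imputation rules described above can be sketched in a few lines of plain Python. The column names and values below are invented for illustration, and using the median for categorical tests assumes they are coded as ordered numbers (for example 0 to 3 for dipstick grades), which the summary does not spell out.

```python
from statistics import mean, median

def missing_rate(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)

def drop_sparse_features(table, max_missing=0.5):
    """Keep only columns whose missing fraction does not exceed max_missing."""
    return {name: col for name, col in table.items() if missing_rate(col) <= max_missing}

def impute(values, kind):
    """Fill gaps with the mean (continuous) or the median (categorical, ordinal-coded)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "continuous" else median(observed)
    return [fill if v is None else v for v in values]

# Toy data, not values from the study.
table = {
    "albumin": [4.0, None, 3.0, 5.0],
    "urine_protein": [0, 2, None, 1],
    "rare_test": [None, None, None, 1.2],  # 75% missing, so it is dropped
}
kinds = {"albumin": "continuous", "urine_protein": "categorical", "rare_test": "continuous"}

kept = drop_sparse_features(table)
cleaned = {name: impute(col, kinds[name]) for name, col in kept.items()}
```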

The 31 retained features included continuous variables such as albumin, alkaline phosphatase (ALP), alanine aminotransferase (ALT), aspartate aminotransferase (AST), blood urea nitrogen (BUN), calcium, chloride, creatinine, estimated GFR, and uric acid. Categorical variables included urine occult blood, urine ketone, urine protein, urine bilirubin, nitrite, and strip WBC. Clinical characteristics such as hypertension, diabetes, smoking, and family history of cancer were also recorded.

A major challenge was class imbalance: the bladder cancer group (n=591) was roughly four times the size of the cystitis group (n=144). To address this, the team applied both oversampling and undersampling techniques from the Python imbalanced-learn package. Oversampling enlarges the minority class by duplicating existing examples or synthesizing new ones from them, while undersampling shrinks the majority class by discarding examples. Each approach has tradeoffs: oversampling risks overfitting because the added examples are derived from existing data, while undersampling discards potentially informative samples.
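To make the two balancing strategies concrete, here is a minimal sketch of their simplest random variants in plain Python. The summary does not say which imbalanced-learn samplers the authors used (the package offers RandomOverSampler and RandomUnderSampler among others), so this stands in for the idea rather than reproducing their pipeline. Patients are represented by hypothetical integer IDs.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of majority rows equal in size to the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Hypothetical patient IDs sized like the two groups in the text.
bladder = list(range(591))
cystitis = list(range(591, 735))

over_major, over_minor = random_oversample(bladder, cystitis)
under_major, under_minor = random_undersample(bladder, cystitis)
```

After oversampling both groups contain 591 rows, with cystitis rows repeated. After undersampling both contain 144 rows, and 447 bladder cancer patients are left out of training.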

TL;DR: From 56 original lab tests, 31 were retained after removing those with over 50% missing data. Missing values were imputed with mean/median. Class imbalance (591 bladder cancer vs. 144 cystitis) was handled with oversampling and undersampling from the imbalanced-learn library.

Feature Selection

Pages 4-5

Two-Step Feature Selection: InfoGain Ranking Plus Forward Selection

The authors employed a two-step feature selection strategy to reduce noise and identify the most discriminative lab tests. In the first step, they used the InfoGainAttributeEval method with a Ranker search in WEKA (version 3.8.3), which scores each feature based on its information gain, measuring how much each lab test reduces uncertainty about the disease classification. Features were ranked for every pairwise comparison group (e.g., cystitis vs. bladder cancer, kidney cancer vs. bladder cancer).
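Information gain is the drop in class entropy once a feature's value is known. The sketch below computes it in plain Python for already-discretized features and ranks them. It illustrates the quantity being scored and is not WEKA's implementation; the feature values are toy data, and continuous lab values would need to be binned before using a function like this.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def rank_features(table, labels):
    """Return (feature, gain) pairs sorted from most to least informative."""
    scores = {name: information_gain(col, labels) for name, col in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy example: one feature separates the classes perfectly, the other not at all.
labels = ["bladder"] * 4 + ["cystitis"] * 4
table = {
    "uninformative": ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"],
    "perfect": ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"],
}
ranking = rank_features(table, labels)
```

With two equally sized classes the starting entropy is 1 bit, so the perfectly separating feature scores 1.0 and the uninformative one scores 0.0.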

The top-ranked features from the comparison groups were assembled into a core set of six: calcium (top for cystitis vs. kidney cancer), ALP (top for cystitis vs. prostate cancer and several other comparisons), albumin (top for cystitis vs. bladder cancer), urine ketone (top for cystitis vs. uterus cancer), urine occult blood (top for kidney cancer vs. bladder cancer), and creatinine (top for bladder cancer vs. uterus cancer). This core set was then used as the starting point for the second step.

In the second step, forward selection was applied within each model. Starting with the six WEKA-selected features, additional features were added one at a time, and each model was retrained and validated to see whether the added feature improved performance. Different models selected different additional features. For example, the LightGBM model added only ALT and diabetes status, while the XGBoost model added ten more features including ALT, AST, BUN, chloride, direct bilirubin, pH, potassium, sodium, total bilirubin, and total cholesterol.
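A greedy version of forward selection can be sketched as follows. The summary does not say whether the authors tried candidates in a fixed order or picked the best candidate at each round; this sketch assumes the latter. The `score` argument stands in for "retrain and cross-validate the model on these features", and `toy_score` uses made-up numbers purely to exercise the loop.

```python
def forward_select(core, candidates, score):
    """Greedily add the candidate that most improves score; stop when none helps."""
    selected = list(core)
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        trial_scores = {f: score(selected + [f]) for f in remaining}
        winner = max(trial_scores, key=trial_scores.get)
        if trial_scores[winner] <= best:
            break  # no remaining feature improves validation performance
        selected.append(winner)
        best = trial_scores[winner]
        remaining.remove(winner)
    return selected, best

core = ["calcium", "ALP", "albumin", "urine ketone", "urine occult blood", "creatinine"]

def toy_score(features):
    # Made-up contributions for illustration only; not results from the study.
    gains = {"ALT": 0.03, "diabetes": 0.02, "sodium": -0.01}
    return 0.85 + sum(gains.get(f, 0.0) for f in features)

selected, best = forward_select(core, ["sodium", "diabetes", "ALT"], toy_score)
```

With these toy numbers the loop adds ALT, then diabetes, and stops before sodium because adding it would lower the score. In a real run each call to `score` is a full training and validation cycle, which is why the number of candidate features matters for runtime.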

TL;DR: A two-step approach used WEKA InfoGain ranking to identify six core features (calcium, ALP, albumin, urine ketone, urine occult blood, creatinine), then forward selection added model-specific features. LightGBM needed only two additional features (ALT, diabetes) while XGBoost added ten.

Clinical Characteristics

Pages 5-7

How Bladder Cancer Patients Differed from Other Groups

The study population showed several statistically significant differences between bladder cancer patients and cystitis controls. Bladder cancer patients were older on average (66.73 vs. 60.12 years, p < 0.0001) and predominantly male (65.3%). Smoking was significantly more common in bladder cancer patients (23.4%) compared to cystitis patients (12.5%). Hematuria (blood in urine) was far more prevalent among bladder cancer patients, with urine occult blood levels showing highly significant differences (p < 0.0001) across all cancer groups versus cystitis.

Among the biochemistry markers, ALP differed significantly (median 69 vs. 71 IU/L, p < 0.0001 compared to cystitis), BUN was higher (median 16 vs. 14 mg/dL, p < 0.0001), calcium was significantly different (p < 0.0001), and creatinine was elevated (median 1.1 vs. 0.9 mg/dL, p < 0.0001). Estimated GFR was lower in bladder cancer patients (64.13 vs. 75.38, p < 0.0001), suggesting poorer kidney function. The specific gravity of urine was also significantly different (p < 0.05).

When comparing bladder cancer against other cancers, the patterns shifted. ALP was particularly useful for separating prostate cancer (where it serves as a known prognostic biomarker) from other groups. Urine occult blood was discriminative between kidney cancer and bladder cancer. Creatinine was the top feature for separating bladder cancer from uterus cancer, reflecting the different impacts of urologic versus gynecologic cancers on kidney function. These patterns informed the feature selection strategy.

TL;DR: Bladder cancer patients were older, more often male, and more likely to smoke. Key lab differences versus cystitis included higher BUN and creatinine, more hematuria, lower eGFR, and a significant difference in ALP. ALP was most useful for separating prostate cancer, while urine occult blood distinguished kidney from bladder cancer.

Sampling Experiments

Pages 8-9

Effect of Oversampling and Undersampling on Model Performance

Before applying forward selection, the authors tested all five models with three data-balancing strategies: no sampling, oversampling, and undersampling. Without any sampling technique, the models achieved 77.2 to 78.8% accuracy for discriminating bladder cancer from cystitis, but specificity was extremely poor, ranging from just 5% to 55.4%. This meant the models were biased toward predicting bladder cancer (the larger class) and rarely identified cystitis patients correctly.
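A short calculation shows why accuracy alone is misleading here. The sketch below defines the three metrics from confusion-matrix counts and applies them to a hypothetical classifier that labels every patient as bladder cancer, using the group sizes from the text (591 bladder cancer, 144 cystitis). The function name is an illustrative choice.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Degenerate baseline: predict "bladder cancer" for all 735 patients.
baseline = confusion_metrics(tp=591, fn=0, tn=0, fp=144)
```

This baseline reaches about 80.4% accuracy with 100% sensitivity and 0% specificity. Accuracy in the high 70s on the unbalanced data is therefore no better than always guessing the larger class, which is why the specificity figures are the more telling numbers.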

After applying oversampling, accuracy shifted to 73.4 to 78.8%, but critically, specificity improved to 51.3 to 59.3%. Sensitivity ranged from 78 to 84.3%, and the AUC was not dramatically different. With undersampling, accuracy was 76.3 to 78.3%, specificity improved to 42.9 to 57.4%, and sensitivity was 79.0 to 83.9%. The AUC ranged from 0.69 to 0.74. Neither sampling technique alone was sufficient, but both helped reduce the class-imbalance bias.

The authors noted that rather than pursuing the single highest performance metric, their goal was to achieve authentic, unbiased performance. They recognized that imbalanced data requires a multi-step solution: combining sampling techniques with feature selection and model optimization. This led to the combined approach of undersampling plus two-step feature selection (InfoGain ranking followed by forward selection) that produced the final models reported in the paper.

TL;DR: Without sampling, specificity was as low as 5%, meaning models almost never correctly identified non-cancer patients. Oversampling improved specificity to 51-59% and undersampling to 43-57%. The final pipeline combined undersampling with two-step feature selection for balanced, authentic performance.

Model Comparison

Pages 9-10

Head-to-Head Comparison of Five ML Models for Bladder Cancer vs. Cystitis

After the full two-step feature selection and forward selection pipeline, five models were compared for discriminating bladder cancer from cystitis. The decision tree classifier achieved 76.2% accuracy, 73.2% sensitivity, 78.1% specificity, and AUC of 0.77. It used 16 features total (the six core plus ten from forward selection). The random forest classifier performed substantially better at 83.1% accuracy, 85.5% sensitivity, 79.4% specificity, and AUC of 0.88, using only one additional feature (ALT) beyond the six core features.

The SVM (with RBF kernel, C=1000, gamma=0.000001) was the weakest performer: 71.7% accuracy, only 55.7% sensitivity, but the highest specificity at 86.7%, with AUC of 0.73. This means SVM was conservative: it rarely predicted bladder cancer and therefore missed more actual cases. XGBoost (eta=0.2, depth=7) achieved 82.8% accuracy, 81.4% sensitivity, 83.3% specificity, and AUC of 0.87, performing close to random forest but with a more balanced sensitivity-specificity tradeoff.
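For readers unfamiliar with the gamma setting quoted for the SVM, the RBF kernel measures similarity between two patients as exp(-gamma * squared distance). The plain-Python sketch below evaluates it with the reported gamma. The two feature vectors are invented, and whether the authors scaled their features before training is not stated in this summary.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

GAMMA = 0.000001  # value reported for the SVM

# Two invented patients described by (ALP in IU/L, creatinine in mg/dL).
patient_a = [69.0, 1.1]
patient_b = [71.0, 0.9]
similarity = rbf_kernel(patient_a, patient_b, GAMMA)
```

With a gamma this small the kernel value stays very close to 1 unless two patients are far apart in feature space, so similarity changes slowly with distance.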

The clear winner was LightGBM (100 leaves, depth=1), which achieved the best scores across nearly every metric: 87.6% accuracy, 86.3% precision, 87.7% F1 score, 89.5% sensitivity, 85.5% specificity, and AUC of 0.93. Notably, LightGBM achieved this with the fewest additional features: only ALT and diabetes status added to the six core features, for a total of just eight features. This parsimony is a practical advantage because fewer required lab tests make clinical deployment simpler.

TL;DR: LightGBM was the top performer with 87.6% accuracy, 89.5% sensitivity, 85.5% specificity, and AUC of 0.93 using only eight features. Random forest (AUC 0.88) and XGBoost (AUC 0.87) followed closely. SVM had the lowest sensitivity at 55.7% but highest specificity at 86.7%.

LightGBM Across Groups

Pages 10-11

LightGBM Performance Across All Cancer Comparison Groups

Because LightGBM was the best-performing model for bladder cancer vs. cystitis, the authors evaluated it across all ten pairwise comparison groups. For bladder cancer versus each other condition, the results were consistently strong. Against cystitis, LightGBM achieved 87.6% accuracy and AUC of 0.93. Against kidney cancer, it achieved 84.5% accuracy and AUC of 0.93. Against prostate cancer, it reached 84.8% accuracy and AUC of 0.88. Against uterus cancer, it achieved 86.9% accuracy and AUC of 0.92.
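The count of ten comparisons follows from pairing up the five patient groups. A quick enumeration in Python (group names as used in this summary):

```python
from itertools import combinations

groups = ["bladder cancer", "cystitis", "kidney cancer", "prostate cancer", "uterus cancer"]

# Every unordered pair of groups is one binary classification task.
pairs = list(combinations(groups, 2))

# The four tasks that involve bladder cancer are the ones reported above.
bladder_pairs = [pair for pair in pairs if "bladder cancer" in pair]
```

Five groups give 5 * 4 / 2 = 10 pairs, of which four include bladder cancer and six do not.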

The model's sensitivity for detecting bladder cancer ranged from 84.4% to 89.5% across the four bladder-cancer comparisons, while specificity ranged from 82.9% to 86.7%. Specificity was lowest against kidney cancer (82.9%), which makes clinical sense because both are urologic cancers and share overlapping laboratory profiles, particularly in hematuria and kidney function markers. It was highest against uterus cancer (86.7%), reflecting greater biochemical differences between urologic and gynecologic cancers.

LightGBM also performed well for non-bladder-cancer comparisons. It separated prostate cancer from cystitis at 87.6% accuracy (AUC 0.94), kidney cancer from cystitis at 86.2% (AUC 0.90), and uterus cancer from cystitis at 83.8% (AUC 0.92). The ROC curves for all bladder cancer comparisons showed strong discrimination, with the cystitis and kidney cancer comparisons yielding the highest AUC (0.93) and the prostate cancer comparison the lowest (0.88).
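AUC has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The sketch below computes it directly from that definition in plain Python. The scores are invented for illustration and do not come from the study.

```python
def auc(positive_scores, negative_scores):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Invented model scores: higher means "more likely bladder cancer".
bladder_scores = [0.9, 0.8, 0.4]
other_scores = [0.7, 0.3, 0.4]
example_auc = auc(bladder_scores, other_scores)
```

Here 7.5 of the 9 pairs are ordered correctly, giving an AUC of about 0.83. A value of 0.5 would mean the scores carry no ranking information, and 1.0 would mean perfect separation.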

TL;DR: LightGBM discriminated bladder cancer from cystitis (AUC 0.93), kidney cancer (AUC 0.93), prostate cancer (AUC 0.88), and uterus cancer (AUC 0.92). Sensitivity ranged from 84.4% to 89.5% and specificity from 82.9% to 86.7% across all bladder cancer comparisons.

Biological Rationale

Pages 12-13

Why These Eight Lab Features Make Biological Sense

The eight features selected by LightGBM each have documented biological relevance to cancer. Calcium is typically normal in bladder cancer but elevated in other malignancies through paraneoplastic hypercalcemia; elevated calcium in a bladder cancer patient may signal bone metastasis, making it useful for discrimination. ALP is a well-established prognostic biomarker in prostate cancer and is associated with bone and liver diseases. A prior study found that increased serum ALP did not improve bone scan accuracy in bladder cancer specifically, which is consistent with ALP helping to distinguish prostate cancer from bladder cancer without confounding the bladder cancer classification.

Albumin plays a central role in immunity and inflammation, and the albumin-to-globulin ratio has been proposed as a biomarker in gastric and lung cancers. A review of 1,096 patients with non-muscle-invasive bladder cancer found that this ratio independently predicted disease progression. The ratio of albumin to ALP has also been proposed as a prognostic biomarker in upper tract urothelial carcinoma. Urine ketone, a routine urinalysis component, is strongly correlated with diabetes, and a comprehensive systematic review confirmed that diabetes mellitus is associated with bladder cancer risk.

Urine occult blood (i.e., hematuria) is a classic screening indicator for bladder cancer. A study of 46,842 patients found that microhematuria strongly correlated with bladder cancer detection. However, hematuria also occurs in kidney cancer and benign conditions like interstitial cystitis, which is why it works best in a multi-feature model rather than as a standalone test. ALT, an enzyme from the liver and kidneys, contributes through the De Ritis ratio (AST/ALT), which has been identified as a prognostic indicator in bladder cancer after radical cystectomy. Creatinine reflects kidney function via the glomerular filtration rate and differentiates urologic from gynecologic cancers.

TL;DR: Each of the eight selected features has biological relevance: calcium signals metastasis, ALP separates prostate cancer, albumin tracks inflammation and prognosis, urine ketone links to diabetes risk, hematuria is a classic bladder cancer sign, ALT contributes via the De Ritis ratio, and creatinine reflects kidney function differences between cancer types.

Limitations and Context

Pages 13-17

Limitations, Comparison with Other Studies, and Future Directions

The authors compared their LightGBM results against other published approaches to bladder cancer detection. Wang et al. used ML to improve tumor-marker-based screening across multiple cancers and achieved 81% sensitivity and 64% specificity. Shao et al. used metabolomics with a decision tree and reached 76.6% accuracy, 71.9% sensitivity, and 86.7% specificity. Wittmann et al. developed a random forest model with urinary metabolites and achieved AUC of 0.78 to 0.81. Belugina et al. used a potentiometric multisensor urine system and achieved 76% accuracy. Kouznetsova et al. used a multilayer perceptron with logistic regression on metabolite profiles for 82.5% accuracy. The current study's LightGBM (87.6% accuracy, 89.5% sensitivity, 85.5% specificity, AUC 0.93) exceeded each of these on the metrics reported, with one exception: Shao et al.'s specificity (86.7%) was slightly higher.

Several limitations should be noted. The study was single-center (MacKay Memorial Hospital only), which limits generalizability to other populations and healthcare settings. The missing data rates were high for some features (up to 44.1% for A/G ratio), and mean/median imputation is a relatively simple approach that may introduce bias. The authors acknowledged that multiple imputation did not produce usable results with their dataset, possibly due to insufficient sample size or weak feature correlations. All models were trained and validated on the same dataset using 10-fold cross-validation, with no external validation cohort from a separate hospital.

The study did not receive external funding, and the data has been made available on GitHub for reproducibility. The authors stated that their future work aims to collect data from different cohorts and build a model capable of differentiating bladder cancer from cystitis and all other cancers simultaneously (multi-class classification) rather than the pairwise comparisons used here. Integrating these routine lab-based predictions with established tools like cystoscopy and urine cytology could create a more effective and less invasive screening pipeline for bladder cancer.

TL;DR: The LightGBM model matched or exceeded the published comparators for non-invasive bladder cancer detection on nearly every reported metric. Key limitations include single-center design, no external validation, high missing data rates, and simple imputation methods. Future work aims for multi-cohort validation and multi-class classification integrating routine labs with cystoscopy and cytology.