Deep Convolutional GANs for Bladder Cancer Diagnosis


Plain-English Explanations
Pages 1-3
Why Bladder Cancer Diagnosis Needs Data Augmentation and What This Study Proposes

Bladder cancer prevalence: Urinary bladder cancer is one of the most common malignant diseases of the urinary tract, characterized by uncontrolled growth of mutated mucosa cells. It has notably high recurrence rates, reaching 61% within the first year and 78% within the first five years, making it one of the most recurrence-prone cancers. The standard diagnostic procedure uses cystoscopy, an endoscopic method where a cystoscope is inserted into the bladder via the urethra. Modern cystoscopes incorporate confocal laser endomicroscopes (CLE) that allow in-vivo evaluation of the mucosa without traditional biopsy, a technique known as optical biopsy.

The classification problem: CLE images are divided into four diagnostic classes: non-cancer tissue, high-grade carcinoma, low-grade carcinoma, and carcinoma in-situ (CIS). While optical biopsy is effective at detecting papillary lesions, it has struggled with CIS detection, where accuracy does not exceed 75%. CIS is particularly dangerous because it is a flat lesion that is difficult to distinguish from benign growths, yet carries high metastatic potential. Prior CNN-based research has achieved impressive results on cystoscopy images, including AUC values of 0.98 to 0.99 and F1 scores up to 99.52% with Xception architecture.

The data scarcity challenge: Despite these successes, a fundamental bottleneck exists: collecting sufficient training data in medicine is difficult. Patient populations for specific cancer types may be small, ethical restrictions limit data sharing, and class imbalance arises naturally because symptomatic patients (who test positive) outnumber healthy controls. Traditional data augmentation through geometric transformations such as mirroring, rotating, and scaling provides some improvement but is inherently limited because the augmented images remain tightly connected to the original data and do not introduce genuine new variance.

The proposed solution: This study by Lorencin et al. from the University of Rijeka proposes using Deep Convolutional Generative Adversarial Networks (DCGAN) to generate entirely new synthetic CLE images for data augmentation. The generated images are combined with the original training set to train two established CNN architectures, AlexNet and VGG-16, for four-class bladder cancer classification. The research addresses three key questions: whether GANs can generate usable bladder cancer images, whether classifier performance improves with augmented data, and how the proportion of generated images in the training set affects classifier performance.

TL;DR: Bladder cancer recurs in up to 78% of patients within five years, and CIS detection via optical biopsy is limited to 75% accuracy. This study proposes DCGAN-based data augmentation to generate synthetic CLE images for training AlexNet and VGG-16 classifiers on four bladder tissue classes, addressing the chronic data scarcity problem in medical AI.
Pages 4-8
How the DCGAN Generator and Discriminator Work Together

Adversarial training principle: The DCGAN consists of two competing deep convolutional neural networks: a generator and a discriminator. The generator takes a uniformly distributed random vector of 100 elements (values between 0.0 and 1.0) and uses two-dimensional transposed convolution to progressively upscale this noise into a realistic image. The discriminator is trained as a binary classifier to distinguish between real CLE images from the collected dataset and fake images produced by the generator. These two networks are "adversarial" because the generator continually adjusts its parameters to fool the discriminator, while the discriminator continually improves at detecting fakes.

Generator architecture: The random input vector of shape (1, 100) is fed into a densely connected layer that reshapes it to (w_out/4, h_out/4, 256). For a 28x28 pixel output image, this means the dense layer produces a (7, 7, 256) tensor. Each subsequent layer applies batch normalization (which normalizes outputs by subtracting the batch mean and dividing by the standard deviation) followed by LeakyReLU activation. LeakyReLU differs from standard ReLU by multiplying negative inputs by a factor of 0.01 rather than zeroing them out. Two transposed convolution layers then progressively expand the image through shapes (14, 14, 64) to the final output of (28, 28, 1).

Discriminator architecture: The discriminator uses standard two-dimensional convolution layers that compress the input. Starting from a (28, 28, 1) image, the first convolutional layer produces (14, 14, 64), and the second produces (7, 7, 128). Each convolution is followed by LeakyReLU activation and a dropout layer with a rate of 0.3 (30%), which randomly sets 30% of inputs to zero to prevent overfitting. The output is flattened into a vector of 6,272 elements (7 x 7 x 128) and passed through a dense layer to produce a single output neuron that classifies the image as real (1) or fake (0).
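The shape bookkeeping in the two paragraphs above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming stride-2, "same"-padded (transposed) convolutions as in the common DCGAN recipe; the strides are an assumption, since only the shapes are stated above:

```python
def tconv_out(size, stride=2):
    """Output size of a stride-2 transposed convolution with 'same'
    padding: the spatial dimension is multiplied by the stride."""
    return size * stride

# Generator: dense reshape to (7, 7, 256) for a 28x28 target image
h = 28 // 4                  # the (h/4, w/4, 256) reshape -> 7
h = tconv_out(h)             # -> 14, matching (14, 14, 64)
h = tconv_out(h)             # -> 28, matching the (28, 28, 1) output

# Discriminator: two stride-2 convolutions shrink 28 -> 14 -> 7, and
# flattening the (7, 7, 128) feature map gives the 6,272-element vector
flat = 7 * 7 * 128
```

The arithmetic confirms why the dense layer must produce exactly a quarter of the target resolution: each of the two stride-2 transposed convolutions doubles the spatial size.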

Loss function and training: The system uses cross-entropy loss to measure the difference between predicted and actual distributions. For the generator, its cross-entropy is calculated by comparing the discriminator's output on generated images against an array of ones (since a "perfect" generator would trick the discriminator into classifying all fakes as real). For the discriminator, the total loss is the sum of real cross-entropy (comparing its output on real images to ones) and fake cross-entropy (comparing its output on generated images to zeros). Training continues until the generator produces images that the discriminator cannot reliably distinguish from real data.

TL;DR: The DCGAN pairs a generator (which upscales a 100-element random vector into 28x28 images through transposed convolution, batch normalization, and LeakyReLU) with a discriminator (which compresses images through convolution with 30% dropout to classify real vs. fake). Cross-entropy loss drives adversarial training until generated images are indistinguishable from real CLE images.
Pages 9-10
Effect of GAN Training Epochs on Generated Image Quality

Epoch variations tested: The number of training epochs is a critical hyperparameter for GAN performance. The study tested four different epoch settings: 100, 250, 500, and 1000 epochs. Too few epochs result in an undertrained generator producing images with insufficient detail, while too many epochs can increase training time without proportional quality gains and may even cause overfitting. The visual differences between generated images at different epoch counts are apparent to the naked eye, with images from 100 epochs containing noticeably less detail compared to those from 500 or 1000 epochs.

Visual quality vs. classification utility: An important insight highlighted by the authors is that visual quality as perceived by humans does not necessarily correlate with classification utility. Images that appear visually weaker to a human observer may still contain sufficient discriminative features to improve classifier performance. Conversely, images that look more realistic could be the result of GAN overfitting and may fail to improve or even harm classification accuracy. This means the true quality of generated images can only be determined through actual training and evaluation with predefined CNN architectures.

Checkerboard artifacts: The generated images, particularly those from 500 and 1000 epochs, exhibit "checkerboard" artifacts. These are a well-known consequence of the transposed convolution process used in the generator. Despite their visual impact, these artifacts do not necessarily diminish the images' value for training classifiers, as the important discriminative features of the tissue patterns may still be preserved in the generated data.

TL;DR: The DCGAN was trained at 100, 250, 500, and 1000 epochs. Higher epoch counts produced more detailed images but with expected "checkerboard" artifacts from transposed convolution. Visual quality does not necessarily predict classification utility, so the actual impact must be measured through CNN training and evaluation rather than human inspection.
Pages 10-11
AlexNet and VGG-16: The Two Classifier Architectures Under Test

AlexNet: AlexNet is a CNN architecture that won the 2012 ImageNet Large Scale Visual Recognition Challenge and pioneered the "go deeper" trend in computer vision. It consists of a 9-layer architecture adapted for grayscale images: the input is 227x227x1 pixels, processed through five convolutional layers with kernel sizes ranging from 11x11 (first layer) down to 3x3, interspersed with three max-pooling layers. The convolutional layers use ReLU activation and produce feature maps of 96, 256, 384, 384, and 256 channels respectively. After flattening to 9,216 neurons, three fully connected layers reduce to 4,096, then 4,096, and finally 4 output neurons (one per class) with softmax activation for multi-class probability output.

VGG-16: VGG-16 is a deeper CNN architecture proposed at the 2014 ImageNet challenge, offering significant improvement over AlexNet. Its key design principle replaces larger convolutional kernels with multiple stacked 3x3 convolutional layers, resulting in a 16-layer configuration. The input is 224x224x1 pixels (grayscale-adapted), and the network uses blocks of 2 or 3 convolutional layers followed by max-pooling, progressively increasing the feature map depth from 64 to 512 channels. After flattening to 25,088 neurons, three fully connected layers reduce to 4,096, 4,096, and then 4 outputs with softmax.
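The flatten sizes quoted for the two networks (9,216 and 25,088 neurons) follow from standard convolution arithmetic. A small sketch, assuming the canonical AlexNet strides and paddings (the text above lists only kernel sizes and channel counts, so these are assumptions):

```python
def conv(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# AlexNet, grayscale 227x227 input (canonical strides/paddings assumed)
s = conv(227, 11, stride=4)     # conv1 (11x11, stride 4) -> 55
s = conv(s, 3, stride=2)        # maxpool -> 27
s = conv(s, 5, pad=2)           # conv2 (5x5) -> 27
s = conv(s, 3, stride=2)        # maxpool -> 13
s = conv(s, 3, pad=1)           # conv3-5 (3x3, padded) keep 13x13
s = conv(s, 3, stride=2)        # maxpool -> 6
alexnet_flat = s * s * 256      # 9,216 neurons before the dense layers

# VGG-16, 224x224 input: five 2x2 maxpools halve the map down to 7x7
v = 224
for _ in range(5):
    v = v // 2
vgg_flat = v * v * 512          # 25,088 neurons before the dense layers
```

Both flatten sizes match the figures given above, which is a useful sanity check when re-implementing either architecture for grayscale CLE input.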

Hyperparameter grid search: Both architectures were trained using a grid search over multiple hyperparameters. The varied parameters included the number of training epochs, batch size (4, 8, 16, 32), and optimizer (solver) including Adam, AdaMax, AdaDelta, AdaGrad, and RMSprop. This comprehensive search allowed the authors to not only find the best-performing configuration but also to measure average and median performance across all tested configurations, providing insight into how sensitive each architecture is to hyperparameter choices.
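In code, such a grid search is simply a product over the hyperparameter lists. The sketch below mirrors the search space described above; `train_and_eval` is a hypothetical placeholder (here returning a random score), and the epoch values are illustrative, not taken from the paper:

```python
import random
from itertools import product

random.seed(0)

batch_sizes = [4, 8, 16, 32]
solvers = ["Adam", "AdaMax", "AdaDelta", "AdaGrad", "RMSprop"]
epoch_counts = [5, 10, 20]   # illustrative; the paper's exact grid differs

def train_and_eval(batch_size, solver, n_epochs):
    """Stand-in for training a CNN and returning cross-validated AUC."""
    return random.uniform(0.90, 0.99)   # placeholder score

results = {cfg: train_and_eval(*cfg)
           for cfg in product(batch_sizes, solvers, epoch_counts)}

best_cfg = max(results, key=results.get)             # top configuration
mean_auc = sum(results.values()) / len(results)      # population average
```

Keeping the full `results` dictionary, rather than only the winner, is what enables the population-level (average/median/spread) analysis the authors report.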

Why these architectures: The authors deliberately selected well-established, widely-used CNN architectures rather than novel ones. The research goal was to isolate the effect of DCGAN-based data augmentation on classification performance, not to propose a new architecture. By using AlexNet and VGG-16, the results become directly comparable to prior bladder cancer classification research that used the same architectures, and the measured improvements can be attributed to the augmentation strategy rather than architectural novelty.

TL;DR: AlexNet (9 layers, 227x227 input) and VGG-16 (16 layers, 224x224 input) were trained with grid search over batch sizes (4-32), optimizers (Adam, AdaMax, AdaDelta, AdaGrad, RMSprop), and epoch counts. Using established architectures isolates the effect of DCGAN augmentation from architectural improvements.
Pages 11-13
Dataset Construction and Cross-Validation Strategy

Original dataset composition: The collected dataset consists of CLE images divided into four classes: 900 non-cancer tissue images, 600 high-grade carcinoma images, 680 low-grade carcinoma images, and 345 CIS images, for a total of 2,525 images. This class distribution is inherently imbalanced, with non-cancer tissue having nearly three times the number of images as the CIS class, reflecting the real-world difficulty of collecting equal samples across cancer subtypes.

Cross-validation protocol: The dataset was divided using five-fold cross-validation. In each fold, four equal parts (80%) were used for training and GAN-based image generation, while the remaining one part (20%) was reserved exclusively for testing. This was repeated five times, ensuring each data partition served as the test set exactly once. Critically, generated images were never included in the test set, meaning the evaluation always measured performance against real, unseen patient data. This design ensures that if the GAN produces low-quality images, the CNN trained on them will perform poorly on the real test data, effectively evaluating both GAN and CNN quality simultaneously.
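The leakage-free protocol above reduces to index bookkeeping: synthetic images are derived only from the training indices, so each held-out fold stays purely real. A minimal sketch (the shuffling and fold assignment are assumptions; the paper does not detail them):

```python
import random

random.seed(42)
indices = list(range(2525))                 # one index per real CLE image
random.shuffle(indices)
folds = [indices[i::5] for i in range(5)]   # five equal partitions of 505

for k in range(5):
    test_idx = set(folds[k])                # held-out fold: real images only
    train_idx = [i for j in range(5) if j != k for i in folds[j]]
    # The GAN is trained, and synthetic images are generated, from
    # train_idx only, so nothing synthetic can reach the test fold.
    assert test_idx.isdisjoint(train_idx)
    assert len(train_idx) == 2020
```

Because the test fold never sees generated data, a poor GAN shows up directly as poor test AUC, which is exactly the joint GAN-and-CNN evaluation the authors describe.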

Four dataset variations: The training set was constructed in four configurations to measure the effect of different augmentation levels. Case 1 used only the 2,020 original training images (no augmentation). Case 2 added 2,020 generated images to the 2,020 originals (total 4,040, a 1:1 ratio). Case 3 added 10,100 generated images (total 12,120, a 5:1 ratio). Case 4 added 18,180 generated images (total 20,200, a 9:1 ratio). This systematic variation allowed the authors to determine the optimal proportion of synthetic data in the training pipeline.
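The four training-set sizes follow directly from the 2,020 per-fold originals; a quick arithmetic check:

```python
n_total = 900 + 600 + 680 + 345      # 2,525 images across the four classes
n_train = n_total * 4 // 5           # four folds of five -> 2,020 originals

# synthetic:real ratio per case -> total training images
cases = {1: 0, 2: 1, 3: 5, 4: 9}
totals = {case: n_train + ratio * n_train for case, ratio in cases.items()}
# totals == {1: 2020, 2: 4040, 3: 12120, 4: 20200}
```

The ratios 0:1, 1:1, 5:1, and 9:1 thus reproduce the 2,020 / 4,040 / 12,120 / 20,200 training-set sizes quoted above.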

Evaluation metrics: Classification performance was evaluated using multi-class ROC analysis with two key metrics: micro-average AUC (AUCmicro) and macro-average AUC (AUCmacro). AUCmicro calculates true positive and false positive rates across all classes combined using the trace and sum of the confusion matrix. AUCmacro averages the individual AUC values computed for each class separately. Both metrics were reported with their standard deviations across the five cross-validation folds, providing a measure of generalization stability.
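One common way to compute the two averages is sketched below: macro averages per-class one-vs-rest AUCs, while micro pools every (sample, class) decision into a single binary problem. This is a simplified NumPy illustration (tied scores are not rank-averaged, and the paper's confusion-matrix formulation of AUCmicro may differ in detail):

```python
import numpy as np

def binary_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def auc_macro(y_true, y_score):
    # average of the per-class one-vs-rest AUC values
    return np.mean([binary_auc((y_true == c).astype(int), y_score[:, c])
                    for c in range(y_score.shape[1])])

def auc_micro(y_true, y_score):
    # pool every (sample, class) decision into one binary problem
    onehot = np.eye(y_score.shape[1])[y_true]
    return binary_auc(onehot.ravel(), y_score.ravel())
```

Macro weights every class equally, so it is more sensitive to the rare CIS class; micro is dominated by the larger classes, which is why reporting both is informative for an imbalanced dataset like this one.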

TL;DR: The dataset contained 2,525 CLE images across four classes (900 non-cancer, 600 high-grade, 680 low-grade, 345 CIS). Five-fold cross-validation was used with four augmentation levels: 0, 2,020, 10,100, or 18,180 generated images added to 2,020 originals. Generated images were never in the test set, and performance was measured via AUCmicro and AUCmacro with standard deviations.
Pages 14-19
DCGAN Augmentation Dramatically Improves AlexNet Classification Performance

Baseline without augmentation: Without data augmentation (Case 1), the best AlexNet configuration achieved AUCmicro = 0.96 and AUCmacro = 0.96, using the RMSprop optimizer with batch size 16 and 10 training epochs. However, the standard deviations were relatively high at 0.04 for AUCmicro and 0.05 for AUCmacro, indicating substantial variability across cross-validation folds and suggesting poor generalization stability.

Best augmented results: With DCGAN augmentation, AlexNet's top configurations reached AUCmicro = 0.99 and AUCmacro = 0.99 across multiple GAN epoch settings (250, 500, and 1000 epochs) and dataset sizes (Cases 2, 3, and 4). The standard deviations dropped dramatically, with the best configurations achieving sigma(AUCmicro) as low as 0.001 compared to the baseline's 0.04. This 40-fold reduction in variability represents a striking improvement in generalization stability. The most effective augmentation used 250 or 500 GAN epochs with the AdaMax optimizer at batch size 32.

Population-level improvements: Beyond the best individual configurations, the authors analyzed the entire grid-search population of AlexNet models. When trained with 10,100 generated images from 500-epoch GANs, the average AUCmicro increased to approximately 0.95, and the median AUCmicro rose significantly compared to the non-augmented case. The standard deviation of AUCmicro across all grid-search configurations also decreased substantially, meaning AlexNet became far less sensitive to changes in batch size, number of training epochs, and optimizer choice. This "stabilization effect" is arguably more important than the top-line AUC improvement, because it means that augmented AlexNet is more robust and easier to tune.

Optimal augmentation ratio: Across all four GAN epoch settings (100, 250, 500, 1000), the best population-level results for AlexNet were consistently achieved with Case 3 (10,100 generated images, a 5:1 ratio). Increasing to 18,180 generated images (Case 4, 9:1 ratio) did not provide further benefit and sometimes slightly degraded average performance. The overall best GAN configuration for AlexNet was 500 GAN training epochs with 10,100 generated images, yielding the highest average and median AUCmicro with the lowest standard deviation.

TL;DR: DCGAN augmentation raised AlexNet's best AUCmicro from 0.96 to 0.99 and reduced the standard deviation from 0.04 to as low as 0.001. The optimal configuration was 500 GAN epochs with 10,100 generated images (5:1 ratio). Average and median performance across all hyperparameter settings also improved substantially, making AlexNet far less sensitive to tuning choices.
Pages 19-23
VGG-16 Shows Limited Improvement from GAN Augmentation

Baseline already strong: Without augmentation, VGG-16 achieved AUCmicro = 0.97 and AUCmacro = 0.97 using the Adam optimizer with batch size 16 and 7 epochs, with much lower standard deviations of 0.01 for both metrics. This baseline was already more accurate and far more consistent than AlexNet's non-augmented baseline (AUC 0.96, sigma 0.04-0.05). VGG-16's deeper architecture with its stacked 3x3 convolutional layers evidently handles the dataset's diversity more effectively even without synthetic data.

Top-configuration improvements: With augmentation, VGG-16's best configurations consistently reached AUCmicro = 0.99 and AUCmacro = 0.99 across all GAN epoch settings and dataset sizes. The standard deviations were impressively low, with the best reaching sigma(AUCmicro) = 0.0004 when using 1000 GAN epochs in Case 2 (2,020 generated images). The preferred optimizer for VGG-16 was consistently AdaGrad, in contrast to AlexNet's preference for AdaMax. This shows that the optimal augmentation strategy is architecture-dependent.

Population-level analysis reveals a different story: Unlike AlexNet, VGG-16 did not show significant improvement in average or median AUCmicro across the grid-search population when augmented data was used. In some configurations (particularly with 100-epoch GAN images), the standard deviation of AUCmicro actually increased, indicating less stable behavior. For 250, 500, and 1000 GAN epochs, there were slight increases in median AUCmicro, but the average values remained largely unchanged, and the standard deviation often increased. This suggests that GAN augmentation introduces more noise into VGG-16's training than it resolves.

Generalization trade-off: When examining sigma(AUCmicro) as a measure of generalization, VGG-16 trained with augmented datasets generally showed higher average values of sigma(AUCmicro) compared to the original dataset. Only in specific cases (such as 1000 GAN epochs with Cases 3 and 4) did median sigma(AUCmicro) decrease. The authors conclude that VGG-16's deeper architecture already provides sufficient feature extraction capability, and the additional variance introduced by GAN-generated images does not improve and sometimes harms its overall stability.

TL;DR: VGG-16's baseline was already strong at AUC 0.97 with sigma of just 0.01. While top configurations reached AUC 0.99 with sigma as low as 0.0004, the population-level analysis showed no significant improvement in average or median performance. In some cases, augmentation actually increased variability, suggesting VGG-16's deeper architecture already captures sufficient features without synthetic data.
Pages 23-25
Key Findings: Architecture-Dependent Augmentation Benefits and Clinical Implications

Architecture-dependent effects: The central finding of this study is that the benefit of DCGAN augmentation depends critically on the CNN architecture being trained. For AlexNet, augmentation provided substantial improvements at both the top-configuration and population levels, raising the best AUCmicro from 0.96 to 0.99 and dramatically reducing sensitivity to hyperparameter choices. For VGG-16, the improvement was limited to the best individual configurations, with no significant benefit to the broader population of models tested in the grid search. The deeper VGG-16 architecture, with its 16 layers versus AlexNet's 9, already extracted sufficient features from the original dataset to achieve strong and consistent performance.

Three confirmed hypotheses: The study confirmed all three research questions posed at the outset. First, it is indeed possible to use GANs for augmentation of urinary bladder cancer CLE image datasets, as the generated images were realistic enough to improve classifier training. Second, classifier performance was generally higher when augmented datasets were used, particularly for AlexNet. Third, there is an optimal proportion of generated images: performance increased up to a certain ratio (approximately 5:1 for AlexNet) and then declined, indicating that flooding the training set with too much synthetic data can be counterproductive.

Practical significance for medical AI: The most important practical implication is the "stabilization effect" observed with AlexNet. In a clinical setting, robustness to hyperparameter choices is highly valuable because it means the system is more likely to perform consistently across different deployment conditions. The fact that GAN-augmented AlexNet maintained strong performance across a wide range of batch sizes, optimizers, and training epochs makes it a more reliable candidate for integration into CLE-based diagnostic systems. This stability reduces the need for extensive hyperparameter tuning when deploying the system in new clinical environments.

Limitations and future directions: The dataset of 2,525 total images, while adequate for this study, remains modest by deep learning standards. The data was collected from a single institution (Clinical Hospital Center Rijeka), and external validation at other sites was not performed. The generated images exhibit checkerboard artifacts inherent to transposed convolution, and more advanced GAN architectures (such as progressive GANs or StyleGAN) could potentially produce higher-quality synthetic images. Additionally, the study focused on a four-class problem; extending this to finer-grained histological subtypes would require larger datasets and possibly more sophisticated augmentation strategies. The data was not publicly available due to ethical restrictions, limiting reproducibility.

TL;DR: DCGAN augmentation significantly improved AlexNet (AUC 0.96 to 0.99, 40x reduction in variability) but provided limited benefit for the already-strong VGG-16. The optimal synthetic-to-real ratio was approximately 5:1. The key practical finding is that GAN augmentation stabilizes classifier behavior across hyperparameter settings, which is critically important for clinical deployment of CLE-based bladder cancer diagnostic systems.
Citation: Lorencin I, Baressi Šegota S, Anđelić N, et al. Biology, 2021. Available at PMC7996800. DOI: 10.3390/biology10030175. Open access under a CC BY license.