Renal cell carcinoma (RCC) accounts for roughly 3% of all cancers worldwide, with an estimated 431,288 new cases diagnosed globally in 2020. The primary diagnostic tool is contrast-enhanced computed tomography (CT), which requires precise and efficient interpretation by radiologists. However, diagnostic radiology is under strain: imaging workloads are climbing while the number of practicing radiologists is shrinking. This combination creates pressure on turnaround times and raises the risk of diagnostic errors, particularly for incidental findings, where the literature reports a sensitivity of only 84% for detecting malignant kidney lesions.
Deep learning evolution in renal imaging: Early segmentation models like U-Net focused on pixel-level identification of anatomical structures but lacked interactive features and clinical workflow integration. More recent architectures, including the Segment Anything Model (SAM) and its medical derivatives (MedSAM, MedSAM-2, ESP-MedSAM), introduced interactive segmentation via user-provided inputs such as points, bounding boxes, and text prompts. The nnInteractive model combined the robustness of nnU-Net with SAM-style interaction for accurate 3D segmentations. However, none of these tools offered end-to-end integration into the radiological workflow, including structured reporting and full clinical reader studies.
Introducing BMVision: To fill this gap, the authors developed BMVision, a specialized AI framework for kidney cancer detection and characterization. Unlike prior tools that stop at segmentation, BMVision integrates segmentation, post-processing, characterization, and automated structured reporting within a web-based viewer built on the Open Health Imaging Foundation (OHIF) V3 platform. The tool was validated through a two-stage retrospective reader study involving six radiologists and 200 CT scans from Tartu University Hospital (TUH), yielding 2,400 individual reads for comparative analysis.
BMVision is built on a three-module pipeline. The core is a 3D U-Net architecture adapted from the nnU-Net model, trained for four-class semantic segmentation, classifying each voxel as background, benign lesion, malignant lesion, or healthy kidney tissue. The model was trained on a combined dataset of 612 CT volumes drawn from the Tartu University Hospital (TUH) dataset, the C4KC-KiTS dataset (41 cases), and the TCGA-KIRC dataset (180 cases). This multi-source approach ensured diversity in scanner manufacturers, acquisition protocols, and patient populations.
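As a rough illustration of what four-class voxel labelling reduces to at inference time, the network's per-class scores can be collapsed with an argmax over the class axis. The class indices below are assumptions for the sketch; the paper does not specify the label encoding.

```python
import numpy as np

# Hypothetical class indices; the paper does not state the actual encoding.
CLASSES = {0: "background", 1: "healthy_kidney", 2: "benign_lesion", 3: "malignant_lesion"}

def logits_to_labels(logits: np.ndarray) -> np.ndarray:
    """Collapse per-voxel class logits of shape (C, D, H, W) into a (D, H, W) label volume."""
    return np.argmax(logits, axis=0)
```

The resulting integer volume is the semantic mask that the downstream post-processing and characterization modules consume.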
Post-processing module: After segmentation, the post-processing module refines the output by removing false positives from areas outside the kidney region and resolving prediction inconsistencies within anatomical kidney boundaries. This step is critical because raw segmentation outputs from deep learning models frequently include spurious predictions in adjacent organs or tissue.
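The paper does not publish this module's code; a minimal sketch of the idea, assuming a connected-component filter keyed to a (slightly dilated) kidney mask, might look like:

```python
import numpy as np
from scipy import ndimage

def suppress_extrarenal(lesion_mask: np.ndarray, kidney_mask: np.ndarray,
                        margin_voxels: int = 3) -> np.ndarray:
    """Drop lesion components that do not touch the dilated kidney region."""
    region = ndimage.binary_dilation(kidney_mask, iterations=margin_voxels)
    labeled, n_components = ndimage.label(lesion_mask)
    kept = np.zeros_like(lesion_mask, dtype=bool)
    for i in range(1, n_components + 1):
        component = labeled == i
        if (component & region).any():  # keep only components overlapping the kidney region
            kept |= component
    return kept
```

The dilation margin is a hypothetical parameter; it allows exophytic lesions that protrude slightly beyond the segmented kidney boundary to survive the filter.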
Characterization module: The characterization module converts the post-processed semantic segmentation mask into an instance segmentation map, distinguishing each individual object (left kidney, right kidney, malignant lesion, benign lesion). From this instance map, the system computes the lesion metrics radiologists need: volume in cubic millimeters and largest diameter. These measurements feed directly into an automated structured report, reducing the need for manual dictation or typing.
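The exact formulas are not given in the paper; a sketch of how such metrics can be derived from a binary lesion mask and the scan's voxel spacing (brute-force pairwise diameter, adequate for small lesions) could be:

```python
import numpy as np
from itertools import combinations

def lesion_metrics(mask: np.ndarray, spacing_mm: tuple) -> tuple:
    """Return (volume in mm^3, largest diameter in mm) for a binary lesion mask."""
    spacing = np.asarray(spacing_mm, dtype=float)
    volume = mask.sum() * float(np.prod(spacing))
    coords = np.argwhere(mask) * spacing  # voxel indices -> physical coordinates
    diameter = max((np.linalg.norm(a - b) for a, b in combinations(coords, 2)),
                   default=0.0)
    return volume, diameter
```

For large lesions a convex-hull reduction would keep the diameter search tractable; the O(n²) scan here is only for illustration.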
Clinical viewer integration: All of this is presented through a custom web-based viewer built on OHIF V3, providing radiologists with an intuitive interface for interacting with segmentations, reviewing measurements, and generating reports. The same viewer was used in both AI-assisted and unaided arms of the reader study, eliminating any software-interface bias from the comparison.
The primary dataset comprised 291 histology-proven renal cancer cases and 300 controls diagnosed with appendicitis, all from patients treated at TUH between 2010 and 2020. From this pool, 100 controls and 100 cancer cases were reserved for the test set, while the remaining 391 CTs formed the development set. To minimize confounding, a stratified randomized split ensured balanced distributions of age and gender between development and test sets, with equal representation of controls and cases. The appendix region was removed from all test CT scans prior to model evaluation to avoid confounding from appendicitis-related findings.
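The authors' split code is not released; a minimal sketch of a stratified randomized split of this kind, with a hypothetical stratum key combining the balancing variables, could look like:

```python
import random
from collections import defaultdict

def stratified_split(records, stratum_key, test_fraction, seed=42):
    """Split records so each stratum (e.g. case status x gender x age band) keeps its proportion."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_key(record)].append(record)
    development, test = [], []
    for group in strata.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        development.extend(group[n_test:])
    return development, test
```

Sampling within each stratum, rather than from the pooled cohort, is what guarantees the balanced age, gender, and case/control distributions described above.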
Annotation process: A five-member annotation team of board-certified radiologists and radiology residents created ground truth labels using a two-stage process. First, radiologists localized and classified all renal findings (benign and malignant) using a ruler tool to mark the longest and shortest axes in the axial plane. Benign lesions included Bosniak categories I, II, and IIF, while malignant lesions encompassed Bosniak III-IV and histologically confirmed solid renal tumors. All lesions were annotated regardless of size, including findings as small as 1-2 mm. In the second stage, model-generated pre-segmentations were verified and refined. Each volume required consensus from at least two radiologists, with the union of boundaries used for agreed malignant lesions and discussion-based resolution for disagreements.
Pre-clinical benchmarking: Before the reader study, BMVision was tested on the 200-scan test set, achieving a pixel-level DICE score comparable to top models from the 2021 Kidney and Kidney Tumor Segmentation Challenge (KiTS21), which ranged from 81% to 86% for malignant lesions. More clinically relevant, BMVision's object-level sensitivity for malignant lesions reached 93.4%, and its patient-level sensitivity was 96%, both comparable to radiologist performance reported in the literature.
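For reference, the pixel-level Dice score used in such benchmarks is twice the overlap between prediction and ground truth divided by the sum of the two mask sizes; a minimal implementation:

```python
import numpy as np

def dice_score(prediction: np.ndarray, ground_truth: np.ndarray) -> float:
    """Dice coefficient between two binary masks (defined as 1.0 when both are empty)."""
    prediction = prediction.astype(bool)
    ground_truth = ground_truth.astype(bool)
    intersection = np.logical_and(prediction, ground_truth).sum()
    total = prediction.sum() + ground_truth.sum()
    return 2.0 * intersection / total if total else 1.0
```

A score of 1.0 means perfect voxel-wise overlap; the 81-86% KiTS21 range cited above corresponds to values of 0.81-0.86 on this scale.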
Imaging inclusion criteria: Only contrast-enhanced CT volumes in the corticomedullary, nephrogenic, or portal-venous phases were included. Soft-kernel CT reconstructions were required, with slice thickness ranging from 0.625 to 5.0 mm and peak kilovoltage (kVp) from 90 to 150. Exclusions applied to pregnant individuals, subjects under 18, patients with anatomical kidney abnormalities, polycystic kidney disease, severe hydronephrosis, or kidney transplants.
The reader study employed a fully crossed, two-arm design with a 3-4 week washout period to mitigate carryover effects from memorization. Six practicing radiologists participated, with experience levels ranging from 4 to 26 years across subspecialties including oncology, abdominal imaging, musculoskeletal radiology, and interventional radiology. In the first arm, radiologists followed the standard clinical workflow: manually identifying and measuring suspicious renal lesions, then typing a report. In the second arm, they performed the same tasks with BMVision assistance, using the model's measurements to semi-automatically generate a structured report.
Crossover structure: In Phase 1, each radiologist analyzed 50 cases and 50 controls, with half receiving AI assistance and half not, in randomized order unique to each reader. After the washout period, Phase 2 swapped the conditions, so every radiologist reviewed every case under both AI-assisted and unaided conditions. This yielded 2,400 individual reads (6 radiologists × 200 scans × 2 conditions), maximizing statistical power. Each CT volume was further analyzed in two stages: a malignant stage (3D diameter plus three orthogonal measurements) and a benign stage (two axial measurements per lesion).
Time tracking: A custom time-tracking system built into the viewer logged all user interactions and measured active working time. If no interaction was detected for more than 15 seconds, the timer paused, ensuring only active work was recorded. This granular tracking enabled precise comparison of efficiency between workflows.
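The tracker itself is not published; a minimal sketch of the pause-on-idle logic, with event timestamps in seconds (the 15-second threshold comes from the paper, everything else here is an assumption):

```python
class ActiveTimer:
    """Accumulate active working time; gaps longer than the idle threshold are not counted."""

    def __init__(self, idle_threshold_s: float = 15.0):
        self.idle_threshold_s = idle_threshold_s
        self.active_seconds = 0.0
        self._last_event = None

    def log_interaction(self, timestamp_s: float) -> None:
        if self._last_event is not None:
            gap = timestamp_s - self._last_event
            if gap <= self.idle_threshold_s:
                self.active_seconds += gap  # count the gap only if the reader stayed active
        self._last_event = timestamp_s
```

Accumulating inter-event gaps rather than wall-clock time is what excludes interruptions (phone calls, other cases) from the efficiency comparison.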
Sample size justification: Assuming a conservative 20% reduction in reporting time with AI support, equal allocation between workflows, no dropouts, alpha = 0.025, and 80% power, the authors estimated at least 266 scans were needed. The study used 200 subjects, with six readers each reviewing all scans in both conditions (2,400 reads in total), comfortably exceeding the required sample size.
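The paper does not state its effect-size assumption in standardized form; as an illustrative normal-approximation calculation, a Cohen's d of roughly 0.34 (an assumption chosen here to land near the stated figure, not a value from the paper) reproduces a requirement in the mid-200s:

```python
import math
from scipy.stats import norm

def two_sample_n_per_arm(effect_size_d: float, alpha: float = 0.025,
                         power: float = 0.8) -> int:
    """Normal-approximation sample size per arm for a one-sided two-sample mean comparison."""
    z_alpha = norm.ppf(1.0 - alpha)  # ~1.96 for alpha = 0.025
    z_beta = norm.ppf(power)         # ~0.84 for 80% power
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 / effect_size_d ** 2)

n_per_arm = two_sample_n_per_arm(0.34)  # assumed standardized effect size
total_scans = 2 * n_per_arm
```

With d = 0.34 this gives roughly 270 scans in total, in the same ballpark as the paper's 266; a small discrepancy is expected, since the authors may have used a t-distribution rather than the normal approximation.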
The headline result is a mean 33% reduction in reporting time across all radiologists when using BMVision compared to the unaided workflow. This reduction was statistically significant (p < 0.00001). Individual radiologists experienced varying degrees of speed-up, ranging from 18% to 52%. For control scans (no kidney cancer), mean time dropped from 1.95 minutes to 1.32 minutes (32% reduction). For cancer cases, mean time fell from 4.99 minutes to 3.32 minutes (33% reduction). Across all scans combined, the average decreased from 3.47 minutes to 2.32 minutes.
Report generation specifically: The time to prepare a diagnostic report showed an even more dramatic improvement. Without AI, the average report preparation time was 48.9 seconds. With BMVision's auto-generated structured reports, this dropped to 9.4 seconds, an 81% reduction (p < 0.00001). Individual radiologists saw improvements ranging from 71% to 87% in report generation time alone. This is because BMVision generates structured reports based on its detected and measured lesions, effectively eliminating the need for manual dictation or typing.
These efficiency gains are clinically meaningful beyond just saving time. Shorter reporting times can ease radiologist workload, potentially shorten patient waiting times, help alleviate patient anxiety, and enable more timely treatment decisions. The improvements apply across both traditional keyboard-based and voice-enabled reporting setups, since the structured report reduces the dictation burden regardless of input method.
Benign lesion detection: Object-level sensitivity for benign renal lesions improved from 79.9% (unaided) to 86.3% (AI-assisted), a statistically significant gain (p < 0.00001). At the patient level, sensitivity for detecting patients with benign lesions rose from 89.7% to 95.0% (p < 0.00001). This improvement matters clinically because better detection of benign lesions may help reduce unnecessary biopsies and interventions, contributing to safer and more cost-effective care. For comparison, BMVision alone (without radiologist input) achieved 82.4% object-level and 94.4% patient-level sensitivity for benign lesions.
Malignant lesion detection: Unaided radiologists already achieved high sensitivity for malignant lesions: 95.6% at the object level and 98.0% at the patient level. AI-assisted sensitivity was 96.7% (object-level) and 99.2% (patient-level), with no statistically significant difference compared to the unaided group (p = 0.41 at object level, p = 0.13 at patient level). The authors attribute this ceiling effect to the study design: radiologists were specifically instructed to focus on kidney cancer, driving malignant sensitivity to 98%, well above the 84% reported in the literature for incidental detection on CT scans.
Specificity: For identifying control patients (those with only benign or no lesions), specificity was 89.1% unaided and 91.1% with AI, with no significant difference (p = 0.37). For malignant lesion detection, specificity was 99.0% unaided versus 98.2% AI-assisted (p = 0.16). Critically, this means the sensitivity gains for benign lesions did not come at the expense of increased false positives.
BMVision's standalone performance (no radiologist) showed 93.4% object-level sensitivity and 96.0% patient-level sensitivity for malignant lesions, and 95.0% specificity. These numbers confirm that the model alone performs comparably to radiologist performance reported in the literature, and the combination of AI plus radiologist yields the best overall results.
One of the most striking findings was the improvement in inter-radiologist agreement. Object-level agreement, the proportion of lesions consistently identified by all six radiologists, jumped from 59.7% in the unaided workflow to 82.3% with AI assistance, a 22.6 percentage-point absolute increase that was statistically significant (p < 0.00001 by chi-squared test). The Cohen's Kappa coefficient improved from 0.68 (substantial agreement) to 0.88 (near-perfect agreement), also statistically significant by the Wilcoxon rank-sum test (p < 0.00001).
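Cohen's kappa corrects raw percent agreement for the agreement expected by chance, which is why it is reported alongside the raw proportion above. A minimal two-rater implementation (assuming the raters are not already in perfect chance agreement):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' equal-length label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of each rater's marginal frequencies per category
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1.0 - expected)
```

On the conventional Landis-Koch scale, 0.61-0.80 counts as "substantial" and above 0.80 as "near-perfect," which is how the shift from 0.68 to 0.88 is labeled in the text.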
Reduced disagreement by lesion type: For benign lesions, AI assistance produced a 48.6% relative reduction in disagreement among radiologists (p < 0.00001). For malignant lesions, disagreement was reduced by 34.3% (p = 0.0008). The magnitude of disagreement reduction was larger for benign lesions, which aligns with the sensitivity findings and suggests that AI assistance is especially helpful for the more ambiguous category of renal findings.
Measurement standardization: The study also documented how AI influenced the physical measurements radiologists recorded. Without AI, radiologists consistently chose different axes to define the longest or shortest diameter of both malignant and benign lesions. With AI assistance, radiologists uniformly preferred the same measurement axes proposed by the model. This convergence has direct implications for clinical practice: consistent measurements support more reliable tumor staging using the TNM classification system and more reproducible nephrometry scoring, both of which influence surgical planning and treatment decisions in multidisciplinary tumor board settings.
Variability in lesion detection and measurement has long been recognized as a barrier to reliable diagnosis and consistent patient management. The fact that BMVision drove a shift from "substantial" to "near-perfect" Kappa agreement suggests the tool can serve as a standardizing anchor across radiologists with different experience levels and subspecialties.
Single-center limitation: The entire study was conducted at Tartu University Hospital in Estonia. This single-site design may limit generalizability to other clinical settings, scanner types, patient populations, and institutional workflows. The authors acknowledge this directly and identify a multi-center study as the logical next step. Such a study would ideally employ a more open-ended design in which radiologists receive less directed instructions, better reflecting routine clinical practice where kidney lesions are often encountered incidentally while scanning for other conditions.
Control cohort composition: The controls were patients with suspected appendicitis rather than a random sample of abdominal CT scans without renal pathology. While this selection provided convenient access to abdominal CTs without kidney cancer, it may not fully represent the diversity of cases in routine practice. Although the appendix region was cropped from all scans before evaluation, the underlying patient demographics and imaging characteristics of appendicitis patients may differ from a general abdominal CT population. Future studies should include a broader range of control cases to better match the clinical scenarios where BMVision would be deployed.
Focused study design and malignant sensitivity: Because radiologists were specifically instructed to focus on kidney cancer, malignant sensitivity was already near ceiling at 98% in the unaided group. In routine practice, where radiologists evaluate multiple organs simultaneously, AI assistance might show a more pronounced benefit for malignant lesion detection. This hypothesis remains untested and requires a study design that replicates the multi-organ evaluation of everyday clinical practice.
Proprietary code and data: BMVision is commercial software developed by Better Medicine OU, and its code cannot be publicly shared. The TUH dataset is also unavailable due to patient privacy regulations (GDPR compliance), though access may be provided to qualified researchers under a data-use agreement. The public training datasets (C4KC-KiTS and TCGA-KIRC) remain available through their respective repositories. This proprietary nature limits independent replication, which is an important consideration for the broader AI-in-medicine community.