When uncertainty guides learning: a highly effective approach to kidney disease classification in CT imaging
We employ a comprehensive suite of evaluation metrics (Equations 26–37) to provide a multifaceted assessment of model performance. The primary classification metrics include accuracy, precision, recall, and F1-score:
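In terms of per-class true positives, false positives, and false negatives, these metrics take their standard forms. The block below is a sketch in notation assumed for illustration ($N$ test samples, $K = 4$ classes, and $TP_k$, $FP_k$, $FN_k$ counted for class $k$ against the rest):

```latex
\begin{align*}
\text{Accuracy} &= \frac{1}{N}\sum_{k=1}^{K} TP_k \\
\text{Precision}_k &= \frac{TP_k}{TP_k + FP_k} \\
\text{Recall}_k &= \frac{TP_k}{TP_k + FN_k} \\
\text{F1}_k &= \frac{2\,\text{Precision}_k\,\text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}
\end{align*}
```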
Aggregate metrics include macro and weighted F1-scores, balanced accuracy, and statistical agreement measures:
Matthews Correlation Coefficient (MCC) is defined as:
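A common multiclass form, the generalization implemented for example by scikit-learn's `matthews_corrcoef`, is sketched here. The shorthand $c$, $s$, $t_k$, and $p_k$ is introduced for illustration, and rows of $C$ are assumed to index true classes and columns predicted classes:

```latex
\[
\mathrm{MCC} =
\frac{c\,s - \sum_{k} p_k\,t_k}
     {\sqrt{\left(s^2 - \sum_{k} p_k^2\right)\left(s^2 - \sum_{k} t_k^2\right)}},
\qquad
c = \sum_{k} C_{kk},\quad
s = \sum_{i,j} C_{ij},\quad
t_k = \sum_{j} C_{kj},\quad
p_k = \sum_{i} C_{ik}
\]
```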
where C is the confusion matrix. MCC takes values in [−1, 1], with 1 indicating perfect prediction.
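The aggregate and agreement measures listed above can all be derived from a single confusion matrix. The following is a minimal sketch in plain Python, assuming rows index true classes and columns index predicted classes. The function name `summarize_confusion` and the example matrix are hypothetical and are not taken from the experiments.

```python
import math

def summarize_confusion(C):
    """Aggregate metrics from a square confusion matrix C.

    Assumes C[i][j] counts samples of true class i predicted as class j.
    """
    K = len(C)
    s = sum(sum(row) for row in C)                           # total samples
    t = [sum(C[k]) for k in range(K)]                        # true count per class
    p = [sum(C[i][k] for i in range(K)) for k in range(K)]   # predicted count per class
    c = sum(C[k][k] for k in range(K))                       # correctly classified

    recalls, f1s = [], []
    for k in range(K):
        precision = C[k][k] / p[k] if p[k] else 0.0
        recall = C[k][k] / t[k] if t[k] else 0.0
        denom = precision + recall
        recalls.append(recall)
        f1s.append(2 * precision * recall / denom if denom else 0.0)

    accuracy = c / s
    # Agreement expected by chance, used by Cohen's kappa.
    chance = sum(tk * pk for tk, pk in zip(t, p)) / (s * s)
    mcc_den = math.sqrt((s * s - sum(pk * pk for pk in p)) *
                        (s * s - sum(tk * tk for tk in t)))
    mcc_num = c * s - sum(tk * pk for tk, pk in zip(t, p))
    return {
        "accuracy": accuracy,
        "macro_f1": sum(f1s) / K,                                  # unweighted class mean
        "weighted_f1": sum(f * tk for f, tk in zip(f1s, t)) / s,   # weighted by class support
        "balanced_accuracy": sum(recalls) / K,                     # mean per-class recall
        "cohen_kappa": (accuracy - chance) / (1 - chance) if chance < 1 else 0.0,
        "mcc": mcc_num / mcc_den if mcc_den else 0.0,
    }

# Hypothetical two-class matrix, for illustration only.
print(summarize_confusion([[3, 1], [1, 3]]))
```

Working from the confusion matrix keeps every aggregate metric tied to the same set of predictions, which makes inconsistencies between reported values easier to spot.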
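AUC-ROC was computed using the macro-averaged one-vs-rest strategy across all four classes, i.e. as the unweighted mean of the four per-class one-vs-rest values, $\mathrm{AUC}_{\text{macro}} = \frac{1}{4}\sum_{k=1}^{4}\mathrm{AUC}_k$. Average Precision was similarly computed as the macro average over the per-class precision-recall curves.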
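The macro one-vs-rest averaging can be sketched in plain Python. This is a minimal illustration, not the evaluation code used in the experiments: it assumes integer labels 0 to 3 and one probability vector per sample, and computes each binary AUC through the rank-sum (Mann-Whitney) identity. The function names are hypothetical.

```python
def binary_auc(scores, labels):
    """AUC for one class against the rest via the rank-sum identity.

    Tied scores receive the average of their ranks.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # 1-based average rank of the tie group
        for m in range(i, j + 1):
            ranks[order[m]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when a class is absent or universal")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def macro_ovr_auc(probs, y_true, n_classes=4):
    """Unweighted mean of the per-class one-vs-rest AUC values."""
    aucs = []
    for k in range(n_classes):
        scores = [p[k] for p in probs]                  # probability assigned to class k
        labels = [1 if y == k else 0 for y in y_true]   # class k against the rest
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / n_classes
```

Because each class contributes equally to the mean, a rare class with poor separation lowers the macro score as much as a common one would.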
Probabilistic metrics assess the quality of probability estimates:
Note on log loss and calibration. A low log loss is consistent with but does not formally establish probability calibration. Dedicated calibration analysis (e.g. reliability diagrams, Expected Calibration Error) is left for future work.
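The distinction between log loss and calibration can be made concrete with a short sketch. The code below is an illustration under stated assumptions, not part of the reported evaluation: it computes multiclass log loss and an equal-width-bin Expected Calibration Error over top-class confidence. The bin count of 10 and both function names are assumptions.

```python
import math

def log_loss(probs, y_true, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for p, y in zip(probs, y_true):
        total -= math.log(max(p[y], eps))     # clip to avoid log(0)
    return total / len(y_true)

def expected_calibration_error(probs, y_true, n_bins=10):
    """ECE with equal-width bins over the top-class confidence."""
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for p, y in zip(probs, y_true):
        conf = max(p)
        b = min(int(conf * n_bins), n_bins - 1)   # confidence 1.0 falls in the last bin
        counts[b] += 1
        conf_sums[b] += conf
        hits[b] += int(p.index(conf) == y)
    n = len(y_true)
    # Weighted gap between per-bin accuracy and per-bin mean confidence.
    return sum(
        counts[b] / n * abs(hits[b] / counts[b] - conf_sums[b] / counts[b])
        for b in range(n_bins) if counts[b]
    )
```

Log loss scores the probability given to the true class, whereas ECE compares stated confidence with observed accuracy, so the two can disagree. This is why a dedicated calibration analysis is treated as a separate step.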
All experiments were repeated across five independent runs with different random seeds (for initial seed set selection and training stochasticity). We report mean ± standard deviation and 95% confidence intervals throughout.
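A summary of this kind can be sketched as follows. The text does not state how the 95% interval was constructed, so the Student-t interval below is an assumption, as are the function name and the example accuracies.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5 runs).
T_CRIT_DF4 = 2.776

def summarize_runs(accuracies):
    """Return (mean, sample std, CI low, CI high) for five per-run accuracies."""
    if len(accuracies) != 5:
        raise ValueError("T_CRIT_DF4 is only valid for exactly five runs")
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)            # sample standard deviation (n - 1)
    half_width = T_CRIT_DF4 * std / math.sqrt(len(accuracies))
    return mean, std, mean - half_width, mean + half_width

# Hypothetical accuracies (%), for illustration only; not the values in Table 4.
mean, std, low, high = summarize_runs([99.0, 99.5, 100.0, 99.5, 99.5])
print(f"{mean:.2f} ± {std:.2f} (95% CI: [{low:.2f}, {high:.2f}])")
```

A t-interval is symmetric about the mean and can extend past 100% for near-ceiling accuracies. A percentile bootstrap is a common alternative that stays within the observed range.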
Our proposed entropy-based active learning framework performs strongly on the CT kidney classification task (Tables 2 and 3). The system achieves a mean test accuracy of 99.71% ± 0.25% (95% CI: [99.30, 99.94]) across five runs using only 2,000 labeled images, representing 22.9% of the 8,716-image training partition. This performance level matches or exceeds all previously reported results that used the entire training partition (100% of training labels). The primary practical advantage is the substantially improved sample efficiency in early active learning cycles.
Table 2. Performance progression across active learning cycles for a single representative run (Run 1, seed 1000).
See Table 4 for results across all five independent runs.
Table 3. Comprehensive performance metrics on the test set: mean ± std across five independent runs (2,000 labeled training samples).
Table 2 reports the active learning progression for a single representative run (Run 1, seed 1000; individual run results are listed in Table 4). The trajectory illustrates the typical learning dynamics observed across all five runs. Starting from 60.64% validation accuracy with 200 randomly selected samples, the system reaches 99.68% accuracy after just 1,400 labeled samples. The entropy-based uncertainty estimate identifies informative samples effectively, with average pool uncertainty decreasing from 0.916 bits to 0.0078 bits over the learning cycles.
Table 4. Test accuracy for each of the five independent runs.
Bold values indicate the highest accuracy.
Table 3 presents the comprehensive evaluation metrics on the test set. The model performs strongly across all metrics, with Cohen's Kappa of 0.9962 indicating almost perfect agreement beyond chance, and AUC-ROC of 0.9999 indicating excellent class separation. The Shapiro-Wilk test (p = 0.148 > 0.05) is consistent with normality for the five run accuracies; however, with only five runs this test has limited statistical power and should be interpreted descriptively rather than as a definitive normality test. The more informative summary is the transparent reporting of all five individual run results in Table 4, which directly shows the variability across runs.
The learning dynamics are well described by simple curves, with validation accuracy exhibiting saturating exponential growth (Equation 38, empirically fitted):
and pool uncertainty following an empirically observed exponential decay (Equation 39):
These equations are post-hoc descriptive fits to the observed data points, not theoretically derived laws; they serve to compactly summarize the convergence behavior. The fitted decay, with uncertainty falling by approximately 60% each cycle, is consistent with an efficient entropy-based selection strategy.
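For orientation, a saturating exponential and an exponential decay have the generic forms sketched below, with $t$ the cycle index. The symbols $A_\infty$, $A_0$, $\lambda$, $U_0$, and $\mu$ are placeholders introduced here, not the fitted coefficients. The second line shows the decay rate implied by a reduction of roughly 60% per cycle.

```latex
\[
A(t) = A_\infty - (A_\infty - A_0)\,e^{-\lambda t},
\qquad
U(t) = U_0\,e^{-\mu t}
\]
\[
\frac{U(t+1)}{U(t)} = e^{-\mu} \approx 0.4
\;\Longrightarrow\;
\mu \approx -\ln 0.4 \approx 0.92 \ \text{per cycle}
\]
```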
To address the concern that a single reported trajectory may be misleading, we repeated the full active learning experiment across five independent runs with different random seeds governing initial seed set selection and training stochasticity. Figure 11 shows the mean learning curve with ±1 standard deviation bands, the distribution of final test accuracies, and a box plot summarizing variability.
Figure 11. Statistical analysis across five independent runs: mean validation accuracy with ±1 standard deviation bands over active learning cycles, histogram of final test accuracies across five runs, and box plot of the test accuracy distribution.
Table 4 reports the test accuracy for each individual run.
To better understand the failure modes of the proposed framework, we performed a detailed analysis of the misclassified cases. We report results for both the best-performing run and the pooled errors across all five runs to give a representative picture of where the model fails.
The key findings from the misclassification analysis are as follows:
Total errors (best run): 2 of 1,865 test images (0.11%) were misclassified in the best run (Run 3, seed 1002).
Total errors (mean across runs): 29 out of 1,865 (1.56%) on average across five runs, representing the more typical error profile of this framework.
Error pattern (pooled): Across all five runs, the most frequent confusion was between Normal and Tumor classes, which share subtle textural features on CT. Stone-class errors were concentrated in early cycles and reduced substantially after entropy-guided querying. No class showed systematic high-confidence errors.
Confidence at errors: Misclassified samples pooled across all runs had a mean prediction confidence of 0.54 ± 0.11, indicating that the model was not highly confident on erroneous predictions. This is a desirable property, suggesting that the model's uncertainty estimates are informative.
High-confidence errors: Fewer than 2% of misclassified samples across all runs had confidence above 0.8, indicating the near-absence of overconfident wrong predictions.
Per-class accuracy (mean across five runs): Cyst: 99.9%; Normal: 99.7%; Stone: 99.8%; Tumor: 99.6%.
Note on best-run analysis: The confusion matrix in Figure 12a is from the best-performing run and presents an optimistic picture of error frequency. The pooled five-run statistics above are more representative of the typical failure profile. Figure 12b shows the confidence distribution. Figure 13 shows the performance comparison across model architectures.
Figure 12. Misclassification analysis: (a) confusion matrix from the best-performing run (Run 3, seed 1002; test accuracy = 99.89%, 2 misclassifications out of 1,865 test images); (b) confidence distribution for correct predictions (green) vs. incorrect predictions (red) pooled across all five runs, showing that misclassified samples had moderate confidence.
Figure 13. Performance comparison across model architectures.
Table 5 reports per-class validation accuracy across active learning cycles for the representative run (Run 1, seed 1000), providing finer-grained insight into how the model improves for each category as more labeled data are added.
Table 5. Per-class validation accuracy (%) across active learning cycles (representative single run, seed 1000).
Values are from a single trajectory; see Figure 11 for multi-run variability.
Notably, Stone is the most challenging class in early cycles (34.6% at cycle 0) because stones are small, hyperdense foci that are easily confused with image noise; however, after entropy-guided querying concentrates annotations on uncertain Stone cases, accuracy rises rapidly to ≥99.5% by cycle 4.
To explain why entropy-guided selection is effective, we examined the properties of samples selected in the first active learning query cycle (the cycle most revealing of the acquisition strategy's behavior, since the model has seen only 200 images).
At cycle 1, the class distribution of the 300 queried samples was: Stone 75.7% (227/300), Cyst 20.0% (60/300), Tumor 3.3% (10/300), Normal 1.0% (3/300). This strong overrepresentation of Stone samples is consistent with the per-class accuracy shown in Table 5: at cycle 0, Stone accuracy was only 34.6%, making Stone the most uncertain class at this stage. The model concentrates labeling effort on the class it understands least, a key advantage over random sampling.
The uncertainty statistics for the queried batch were: mean entropy 1.34 ± 0.02 bits (against a four-class maximum of log₂ 4 = 2 bits), mean margin 0.080 ± 0.057 (low margin indicates predictions spread across multiple classes), and mean confidence 0.353 ± 0.037. These statistics confirm that the selected samples are genuinely uncertain and near decision boundaries. The 75 queried samples in the bottom quartile of margin score (25% of the 300, by construction) were labeled boundary samples, i.e. those lying closest to the decision surface between classes. Zero samples had confidence above 0.90, further confirming that the acquisition function avoids spending budget on already-confident cases.
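The three uncertainty statistics can be computed directly from a softmax output. The sketch below assumes that margin means the gap between the two largest class probabilities and that confidence means the largest probability. These definitions and both function names are assumptions for illustration, not the original implementation.

```python
import math

def uncertainty_scores(p):
    """Return (entropy in bits, margin, confidence) for one probability vector."""
    entropy = -sum(q * math.log2(q) for q in p if q > 0)   # 0 * log(0) treated as 0
    top = sorted(p, reverse=True)
    margin = top[0] - top[1]          # assumed definition: top-1 minus top-2
    confidence = top[0]
    return entropy, margin, confidence

def select_most_uncertain(pool_probs, k):
    """Indices of the k pool samples with the highest predictive entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: uncertainty_scores(pool_probs[i])[0],
        reverse=True,
    )
    return ranked[:k]

# A uniform four-class prediction attains the 2-bit maximum.
print(uncertainty_scores([0.25, 0.25, 0.25, 0.25]))
```

With a query size of 300, as in the cycle described above, `select_most_uncertain(pool_probs, 300)` would return the batch sent for annotation under these assumptions.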
As training progresses, the distribution of queried samples shifts: later cycles select more uniform class distributions as the model's uncertainty becomes more evenly spread across classes. By cycle 4, queried samples have mean entropy below 0.1 bits, indicating that most of the informative labeling has already been performed.
We conducted comprehensive ablation studies across five critical dimensions to validate design choices and understand their impact on performance.
Table 6 presents a comparison of architectures under the identical training strategy (same data split, same hyperparameters, same number of training epochs). The comparison is based on our own re-implementation of each architecture; results are not cited from other papers, which may have used different protocols.
Table 6. Model architecture comparison under identical training strategy (our re-implementation).
As shown in Table 6, ResNet-50 provides the optimal balance between model capacity and computational efficiency. The 4.1 GFLOPs computational requirement is compatible with standard research hardware; formal latency benchmarking on clinical workstations was not performed and remains a subject for future work.
The selection criterion for the optimal learning rate was maximization of validation accuracy while maintaining stable convergence. A learning rate of $1 \times 10^{-4}$ achieves the highest validation accuracy (99.6%) with stable training dynamics. Higher learning rates ($5 \times 10^{-4}$, $1 \times 10^{-3}$) caused instability in validation metrics even though their training loss appears low once rounded; these are not considered viable configurations. Table 7 shows that $1 \times 10^{-4}$ provides the best convergence characteristics, balancing fast learning with stable training dynamics. Figure 14 shows the training time and loss for the different learning rates.
Table 7. Learning rate comparison using our proposed framework.
The “0.0000” final loss entries for $5 \times 10^{-4}$ and $1 \times 10^{-3}$ reflect numerical underflow in rounding and mask training instability; these configurations diverged on validation metrics and should not be interpreted as indicating optimal convergence.
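The rounding effect is easy to reproduce. The sketch below assumes losses are displayed to four decimal places, as the “0.0000” entries suggest. The loss value and the function name are hypothetical.

```python
def displayed_loss(loss, decimals=4):
    """Format a loss value the way a fixed-precision table would show it."""
    return f"{loss:.{decimals}f}"

tiny_loss = 3.2e-5                        # hypothetical small but nonzero training loss
print(displayed_loss(tiny_loss))          # four decimals hide the value
print(displayed_loss(tiny_loss, 6))       # six decimals reveal it
```

Reporting such entries in scientific notation, or alongside the validation metric, avoids reading a rounded zero as perfect convergence.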
Figure 14. Loss and training time across different learning rates.
We provide an explicit rationale for the batch size selection. Batch size is a critical hyperparameter that affects gradient estimation accuracy, training convergence speed, and GPU memory utilization. We evaluated batch sizes of 8, 16, 32, 64, and 128 using a fixed 500-image subset and three training epochs, measuring final loss, convergence speed, training time, and estimated memory usage. Figure 15 and Table 8 summarize the results.
Figure 15. Batch size ablation study: final training loss vs. batch size; convergence speed vs. batch size; training time per epoch; estimated memory usage. Batch size 32 offers the best trade-off between loss quality, speed, and memory.
Selection criterion: best trade-off between validation generalization, training speed, and memory efficiency. Note that smaller batches (8, 16) achieve lower training loss on this 500-image subset but are 2–4× slower per effective sample and may not generalize better; the composite criterion based on validation performance and practical constraints is reported in the Stability column.
Batch size 32 was selected as the best composite trade-off. Although smaller batches (8, 16) achieve marginally lower training loss on this 500-image ablation subset, batch size 32 achieves the highest validation accuracy (99.2%) among all configurations while requiring 2–4× less wall-clock time per effective sample and staying within the GPU memory constraints of standard research hardware (≤ 1,288 MB per batch). Larger batches (64, 128) show clear degradation in validation accuracy. The selection criterion was therefore a joint optimization of validation performance, training speed, and memory efficiency, favoring batch size 32.
Full augmentation (Table 9) was selected based on validation accuracy (99.2%), which was the highest among all configurations despite its slightly higher training loss. Figure 16 shows how the different augmentation settings affect the model. The higher training loss reflects the regularizing effect of augmentation, which reduces overfitting and improves generalization. Training loss alone is therefore not a reliable selection criterion for this comparison.
Table 9. Augmentation techniques comparison. Selection criterion: validation accuracy, not training loss.
Full augmentation was selected because it achieved the highest validation generalization despite slightly higher training loss (regularization effect).
Figure 16. Augmentation techniques performance comparison.
Figure 17 compares the active learning (entropy-based) trajectory with a random sampling baseline over five independent runs each. Random sampling reaches a mean accuracy of 99.37% ± 0.36% using the same 2,000-sample budget. The paired t-test comparing active learning and random sampling at the final 2,000-sample budget yields p = 0.147, which is not statistically significant at the 0.05 level.
Figure 17. Active learning (entropy-based, blue) vs. random sampling baseline (red) over active learning cycles. Shaded band shows ±1 standard deviation across five runs. The entropy-based approach converges faster in early cycles and achieves higher peak accuracy.
The lack of statistical significance at the final 2,000-sample point reflects the high-accuracy ceiling of this task, which both strategies approach. The primary and most practically relevant contribution of entropy-based active learning is its early-cycle sample efficiency: at 500 labeled samples (cycle 1), entropy sampling achieves a mean validation accuracy of 93.1%, compared to approximately 55% for random sampling, a gap of approximately 38 percentage points. This efficiency advantage is particularly meaningful in settings where annotation budgets are tightly constrained to fewer than 1,000 images. The final accuracy difference at 2,000 samples, while numerically visible, does not reach statistical significance in this near-ceiling regime.
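The paired comparison at the final budget can be sketched with the standard library alone. The code below computes the paired t statistic over five seed-matched runs and compares it with the two-sided 5% critical value for 4 degrees of freedom (2.776). The accuracy lists are hypothetical placeholders, not the per-run results, and the function name is an assumption.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 4 degrees of freedom (five paired runs).
T_CRIT_DF4 = 2.776

def paired_t_statistic(a, b):
    """Paired t statistic for two equal-length lists of per-run scores."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)    # sample std of the differences; must be nonzero
    return mean_d / (sd_d / math.sqrt(len(diffs)))

# Hypothetical per-run test accuracies (%), for illustration only.
entropy_acc = [99.9, 99.6, 99.8, 99.4, 99.8]
random_acc = [99.7, 99.0, 99.6, 99.5, 99.1]

t = paired_t_statistic(entropy_acc, random_acc)
print(f"t = {t:.3f}, significant at 0.05: {abs(t) > T_CRIT_DF4}")
```

With only five pairs the critical value is large, so a modest mean difference combined with run-to-run variation readily falls short of significance, as it does near the accuracy ceiling described above.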
The fully supervised baselines (VGG-16 through Ensemble) were re-implemented by the authors under conditions matched to our experimental setup (identical 70/15/15 stratified split, same preprocessing pipeline, same hardware; Table 10). These values are therefore directly comparable with our results. The FixMatch result was similarly re-implemented under matched conditions. Results from prior publications using different dataset splits or preprocessing protocols may not be directly comparable; readers should consult original sources for context.
Table 10. Comparison with state-of-the-art methods on CT kidney classification.
Label reduction percentages (column 4) are computed relative to the 8,716-image training partition throughout this paper. For example, our final method uses 2,000/8,716 = 22.9% of the training partition, yielding a 77.1% reduction in required annotations. This denominator convention is used consistently in all sections.
This efficiency gain can be quantified through the learning curve parameter α in the empirically fitted power-law relationship (Equation 40):
This equation is an empirical fit to the observed error rates at each active learning cycle. The fitted exponent α = 1.2 indicates that the observed error reduction is steeper than the α ≈ 0.8 typically seen with random sampling, consistent with the theoretically motivated expectation that uncertainty-guided selection improves sample efficiency. This interpretation is heuristic and not a formal proof.
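A fit of this kind can be sketched as a straight-line regression in log-log space, assuming the power law has the form error(n) = c · n^(−α) with n the number of labeled samples. The functional form, the function name, and the synthetic data below are assumptions for illustration; the actual fitting procedure is not described in the text.

```python
import math

def fit_power_law(n_labeled, error_rates):
    """Least-squares fit of log(error) = log(c) - alpha * log(n).

    Returns (c, alpha). All inputs must be strictly positive.
    """
    xs = [math.log(n) for n in n_labeled]
    ys = [math.log(e) for e in error_rates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic error rates generated from a known exponent, for illustration only.
budgets = [200, 500, 800, 1100, 1400]
errors = [50 * n ** -1.2 for n in budgets]
c, alpha = fit_power_law(budgets, errors)
print(f"c = {c:.3f}, alpha = {alpha:.3f}")
```

Under this form, doubling the labeled set multiplies the error by 2^(−α): about 0.44 for α = 1.2, against about 0.57 for α = 0.8.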