Quality versus quantity of training datasets for artificial intelligence-based whole liver segmentation

Castelo, A.; O'Connor, C.; Gupta, A. C.; Anderson, B. M.; Woodland, M.; Altaie, M.; Koay, E. J.; Odisio, B. C.; Tang, T. T.; Brock, K. K.

medRxiv preprint, 2026-02-18, radiology and imaging. DOI: 10.64898/2026.02.17.26346486
Artificial intelligence (AI)-based segmentation has many medical applications, but the limited availability of curated datasets makes model training challenging; this study compares the impact of training-dataset annotation quality and quantity on whole-liver AI segmentation performance. We obtained 3,089 abdominal computed tomography scans with whole-liver contours from MD Anderson Cancer Center (MDA) and a MICCAI challenge. A total of 249 scans were withheld for testing, of which 30, from the MICCAI challenge data, were reserved for external validation. The remaining scans were divided into mixed-curation and highly curated groups, randomly sampled into sub-datasets of various sizes, and used to train 3D nnU-Net segmentation models. Dice similarity coefficient (DSC), surface DSC with a 2 mm margin (SD 2mm), the 95th percentile of the Hausdorff distance (HD95), and 2D axial slice DSC (Slice DSC) were used to evaluate model performance. The highly curated 244-scan model (DSC=0.971, SD 2mm=0.958, HD95=2.98mm) did not differ significantly on the 3D evaluation metrics from the mixed-curation 2,840-scan model (DSC=0.971 [p>.999], SD 2mm=0.958 [p>.999], HD95=2.87mm [p>.999]). The 710-scan mixed-curation model (Slice DSC=0.929) significantly outperformed the highly curated 244-scan model (Slice DSC=0.923 [p=0.012]) on the 30 external scans. Highly curated datasets thus yielded performance equivalent to datasets a full order of magnitude larger, while the benefits of larger, mixed-curation datasets appeared in generalizability metrics and local improvements. In conclusion, the tradeoff between dataset quality and quantity for model training is nuanced and goal dependent.
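The DSC and HD95 metrics used above are standard for segmentation evaluation. As a rough illustration only (not the authors' code), both can be sketched for binary 3D masks with NumPy and SciPy; the `spacing` parameter stands in for the scan's voxel spacing in millimeters:

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(pred, gt):
    """Dice similarity coefficient between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def surface(mask):
    """Boolean mask of boundary voxels (mask minus its erosion)."""
    return mask & ~binary_erosion(mask)

def hd95(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """95th percentile of symmetric surface distances, in mm."""
    ps, gs = surface(pred), surface(gt)
    # distance_transform_edt gives, for every voxel, the distance to the
    # nearest surface voxel of the *other* mask (scaled by voxel spacing)
    d_to_g = distance_transform_edt(~gs, sampling=spacing)
    d_to_p = distance_transform_edt(~ps, sampling=spacing)
    dists = np.concatenate([d_to_g[ps], d_to_p[gs]])
    return np.percentile(dists, 95)

# Toy example: two 6x6x6 cubes offset by one voxel along the first axis.
pred = np.zeros((10, 10, 10), dtype=bool); pred[2:8, 2:8, 2:8] = True
gt   = np.zeros((10, 10, 10), dtype=bool); gt[3:9, 2:8, 2:8] = True
print(round(dice(pred, gt), 3))   # overlap of 180 voxels out of 216 each
print(hd95(pred, pred))           # identical masks -> 0.0
```

Surface DSC (SD 2mm) is a related metric, counting the fraction of surface voxels lying within the 2 mm tolerance of the other surface; it can be built from the same distance transforms.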

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Computers in Biology and Medicine: 14.4% (120 papers in training set, top 0.1%)
2. Scientific Reports: 7.2% (3102 papers in training set, top 12%)
3. PLOS ONE: 6.4% (4510 papers in training set, top 28%)
4. Medical Physics: 6.4% (14 papers in training set, top 0.1%)
5. PLOS Computational Biology: 6.3% (1633 papers in training set, top 6%)
6. Computer Methods and Programs in Biomedicine: 4.9% (27 papers in training set, top 0.1%)
7. PLOS Digital Health: 4.9% (91 papers in training set, top 0.5%)

50% of the probability mass is accounted for by the journals above.

8. Journal of Medical Imaging: 4.3% (11 papers in training set, top 0.1%)
9. European Radiology: 4.0% (14 papers in training set, top 0.2%)
10. Expert Systems with Applications: 4.0% (11 papers in training set, top 0.1%)
11. GigaScience: 3.3% (172 papers in training set, top 0.6%)
12. Diagnostics: 2.1% (48 papers in training set, top 0.8%)
13. Frontiers in Physiology: 1.7% (93 papers in training set, top 3%)
14. Frontiers in Computational Neuroscience: 1.7% (53 papers in training set, top 1%)
15. Biology Methods and Protocols: 1.7% (53 papers in training set, top 1%)
16. Frontiers in Artificial Intelligence: 1.7% (18 papers in training set, top 0.3%)
17. Informatics in Medicine Unlocked: 1.5% (21 papers in training set, top 0.5%)
18. npj Digital Medicine: 1.5% (97 papers in training set, top 2%)
19. Journal of Magnetic Resonance Imaging: 0.9% (14 papers in training set, top 0.5%)
20. Nature Communications: 0.9% (4913 papers in training set, top 60%)
21. Bioengineering: 0.8% (24 papers in training set, top 1%)
22. Human Brain Mapping: 0.8% (295 papers in training set, top 4%)
23. BMC Medical Informatics and Decision Making: 0.7% (39 papers in training set, top 3%)
24. Archives of Clinical and Biomedical Research: 0.7% (28 papers in training set, top 2%)
25. JMIRx Med: 0.7% (31 papers in training set, top 2%)
26. iScience: 0.6% (1063 papers in training set, top 37%)
27. eLife: 0.6% (5422 papers in training set, top 61%)
28. Patterns: 0.6% (70 papers in training set, top 3%)
29. Frontiers in Oncology: 0.6% (95 papers in training set, top 4%)