
AI-based radiomics for pancreatic cysts: high diagnostic performance amid a persistent translational gap

Lettner, J. D.; Evrenoglou, T.; Binder, H.; Fichtner-Feigl, S.; Neubauer, C.; Ruess, D. A.

2026-02-12 · Radiology and Imaging
medRxiv · doi: 10.64898/2026.02.10.26345995

Background
AI-based radiomics has demonstrated promising diagnostic performance for pancreatic cystic neoplasms, yet clinical translation remains limited. Whether this reflects insufficient model performance or structural limitations of the evidence base is unclear.

Methods
We performed a systematic review and diagnostic test accuracy meta-analysis of AI-based radiomics in pancreatic cysts (2015-2025), addressing two clinically relevant tasks (Q1: cyst-type differentiation; Q2: prediction of malignancy or high-grade dysplasia). Training and validation datasets were synthesized independently using hierarchical models. Study evaluation extended beyond diagnostic performance to a four-dimensional framework integrating RQS 2.0, METRICS, TRIPOD+AI, and PROBAST+AI, explicitly contrasting pooled diagnostic performance with reporting quality, methodological rigor, and risk of bias. The review was pre-registered (PROSPERO) and conducted according to PRISMA 2020.

Results
Twenty-nine studies were included (Q1: n = 15; Q2: n = 14), predominantly retrospective and single-center. For Q1, training-based analyses showed high apparent diagnostic performance (pooled sensitivity/specificity: 0.89 [95% CI, 0.85-0.92] / 0.90 [0.85-0.93]), but with substantial heterogeneity (τ² = 0.56/0.78; ρ = 0.38). Validation-based performance remained high (0.86 [0.82-0.89] / 0.88 [0.81-0.93]), while heterogeneity persisted and prediction regions exceeded confidence regions. For Q2, training-based analyses demonstrated similarly high apparent performance (0.88 [0.79-0.95] / 0.89 [0.81-0.94]), with pronounced heterogeneity (τ² = 1.98/1.61; ρ = 0.63). Validation-based performance was slightly lower yet still clinically comparable (0.82 [0.75-0.89] / 0.86 [0.80-0.91]), and heterogeneity persisted (τ² = 0.71/0.43; ρ = 0.15). Across both tasks, high diagnostic accuracy occurred alongside incomplete reporting, limited validation, and an elevated risk of bias.

Conclusion
AI-based radiomics for pancreatic cysts has reached a structural performance plateau. Further improvements in diagnostic accuracy alone are insufficient for clinical translation and must be accompanied by a paradigm shift from performance-driven model development toward decision-anchored study designs, robust validation strategies, transparent reporting standards, and clinically integrated evaluation frameworks.

Summary
Although pancreatic cystic lesions are increasingly being detected, imaging-based decision-making remains limited, particularly for differentiating between cyst types and stratifying malignancy risk. In this PRISMA-compliant, PROSPERO-registered systematic review and diagnostic test accuracy meta-analysis, we evaluated AI-based radiomics for these two tasks and contextualized its performance using a four-dimensional framework incorporating diagnostic accuracy, reporting quality, risk of bias, and radiomics maturity. Across studies published between 2015 and 2025, pooled diagnostic performance was consistently high, with only modest declines from the training to the validation stage. Nevertheless, considerable between-study heterogeneity and limited transportability remained evident. Multidimensional evaluation indicated a systematic dissociation between reported performance and methodological robustness, characterized by incomplete reporting, restricted validation, and an elevated risk of bias. These limitations were consistent across both clinical questions and were not resolved by increasing model complexity. The findings of this meta-analysis suggest that the performance of AI-based radiomics for pancreatic cysts has plateaued. Progress toward clinical translation requires study designs anchored in decision-making processes, robust multi-center validation, and transparent, reproducible evaluation frameworks, rather than further optimization of model architecture alone.
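The observation that prediction regions exceed confidence regions can be illustrated with a back-of-the-envelope calculation on the logit scale. This is a sketch, not the authors' hierarchical model: the within-study standard error is back-derived from the reported Q1 training CI, and a simple normal random-effects model for sensitivity alone is assumed.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1 / (1 + math.exp(-x))

# Reported Q1 training sensitivity: 0.89 (95% CI 0.85-0.92), tau^2 = 0.56.
mu = logit(0.89)
se = (logit(0.92) - logit(0.85)) / (2 * 1.96)  # SE implied by the CI width
tau2 = 0.56                                     # reported between-study variance

# 95% confidence interval: uncertainty about the pooled mean only.
ci = (expit(mu - 1.96 * se), expit(mu + 1.96 * se))

# Approximate 95% prediction interval: where a *new* study's sensitivity
# may land, so between-study variance tau^2 is added to the SE.
pi_sd = math.sqrt(se ** 2 + tau2)
pi = (expit(mu - 1.96 * pi_sd), expit(mu + 1.96 * pi_sd))

print(f"CI ~ [{ci[0]:.2f}, {ci[1]:.2f}]")  # recovers roughly [0.85, 0.92]
print(f"PI ~ [{pi[0]:.2f}, {pi[1]:.2f}]")  # far wider, roughly [0.64, 0.97]
```

Because τ² (0.56) dwarfs the squared standard error (~0.03), the prediction interval is dominated by between-study heterogeneity: a narrow pooled CI says little about how a model will perform at a new site.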

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|------|---------|------------------------|------------|-------------|
| 1 | European Radiology | 14 | Top 0.1% | 10.6% |
| 2 | The Lancet Digital Health | 25 | Top 0.1% | 10.2% |
| 3 | eBioMedicine | 130 | Top 0.1% | 8.5% |
| 4 | BMC Medicine | 163 | Top 0.3% | 8.5% |
| 5 | Scientific Reports | 3102 | Top 16% | 6.5% |
| 6 | Diagnostics | 48 | Top 0.2% | 6.4% |
| 7 | Journal for ImmunoTherapy of Cancer | 64 | Top 0.3% | 3.6% |
| 8 | PLOS ONE | 4510 | Top 38% | 3.6% |
| 9 | Nature Communications | 4913 | Top 39% | 3.6% |
| 10 | Frontiers in Oncology | 95 | Top 1% | 3.1% |
| 11 | npj Precision Oncology | 48 | Top 0.3% | 2.6% |
| 12 | JAMA Network Open | 127 | Top 2% | 1.5% |
| 13 | eLife | 5422 | Top 45% | 1.5% |
| 14 | Stroke: Vascular and Interventional Neurology | 13 | Top 0.3% | 1.2% |
| 15 | npj Digital Medicine | 97 | Top 3% | 1.2% |
| 16 | PLOS Medicine | 98 | Top 3% | 1.1% |
| 17 | Journal of Medical Imaging | 11 | Top 0.2% | 1.0% |
| 18 | Med | 38 | Top 0.6% | 0.9% |
| 19 | The American Journal of Pathology | 31 | Top 0.4% | 0.9% |
| 20 | BMC Medical Research Methodology | 43 | Top 1% | 0.9% |
| 21 | Annals of Translational Medicine | 17 | Top 1% | 0.9% |
| 22 | Heliyon | 146 | Top 5% | 0.9% |
| 23 | Cell Reports Methods | 141 | Top 4% | 0.8% |
| 24 | Computers in Biology and Medicine | 120 | Top 4% | 0.8% |
| 25 | Gut | 36 | Top 0.8% | 0.8% |
| 26 | BMJ Open | 554 | Top 13% | 0.7% |
| 27 | Frontiers in Medicine | 113 | Top 7% | 0.7% |
| 28 | PLOS Biology | 408 | Top 23% | 0.7% |
| 29 | Journal of Clinical Medicine | 91 | Top 7% | 0.7% |
| 30 | PLOS Digital Health | 91 | Top 3% | 0.7% |
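The "top 6 journals account for 50%" claim can be reproduced from the listed percentages. A minimal sketch (probabilities taken from the table above; the helper name is illustrative):

```python
# Probabilities (%) of the top-ranked journals, in descending order.
probs = [10.6, 10.2, 8.5, 8.5, 6.5, 6.4, 3.6, 3.6, 3.6, 3.1]

def smallest_k_covering(probs, threshold):
    """Smallest prefix of a descending-sorted list whose sum reaches threshold."""
    total = 0.0
    for k, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return k, total
    return len(probs), total

k, mass = smallest_k_covering(probs, 50.0)
print(k, round(mass, 1))  # -> 6 50.7: six journals cover 50.7% of the mass
```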