Comparison of foundation models and transfer learning strategies for diabetic retinopathy classification

Li, L. Y.; Lebiecka-Johansen, B.; Byberg, S.; Thambawita, V.; Hulman, A.

2026-04-20 | health informatics | medRxiv preprint | DOI: 10.64898/2026.04.17.26351092
Abstract

Diabetic retinopathy (DR) is a leading cause of vision impairment, requiring accurate and scalable diagnostic tools. Foundation models are increasingly applied to clinical imaging, but concerns remain about their calibration. We evaluated DINOv3, RETFound, and VisionFM for DR classification using different transfer learning strategies in BRSET (n = 16,266) and mBRSET (n = 5,164). Models achieved high discrimination in binary classification (normal vs retinopathy) in BRSET (AUROC 0.90-0.98), with DINOv3 performing best under full fine-tuning (AUROC 0.98 [95% CI: 0.97-0.99]). External validation on mBRSET showed decreased performance for all models regardless of fine-tuning strategy (AUROC 0.70-0.85), although fine-tuning improved performance. Foundation models achieved strong discrimination but poor calibration, generally overestimating DR risk. Although the generalist model, DINOv3, benefited from deeper fine-tuning, miscalibration remained evident. These findings underscore the need for better calibration and for comprehensive evaluation of foundation models, both of which are essential in clinical settings.

Author summary

Artificial intelligence is increasingly being used to detect eye diseases such as diabetic retinopathy from retinal images. Recent advances have introduced "foundation models," which are trained on large datasets and can be adapted to new tasks. We aimed to evaluate how well these models perform in a clinical prediction context, focusing not only on accuracy but also on how reliably they estimate disease risk. We compared different types of foundation models using two independent datasets from Brazil. While these models were generally good at distinguishing between healthy and diseased eyes, their predicted risks were often poorly calibrated: the estimated probabilities did not consistently reflect the true likelihood of disease. We also examined whether adapting the models to the target population could improve performance. Although this approach led to improvements, calibration issues remained. However, post-training correction improved the agreement between predicted risks and observed outcomes. Our findings highlight an important gap between model performance and clinical usefulness. We suggest that improving the reliability of risk estimates is essential before such systems can be safely used in healthcare.
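The summary above mentions that post-training correction improved calibration but does not name the method. As a hedged sketch only, the snippet below applies temperature scaling, one standard post-hoc recalibration technique, to synthetic overconfident logits; the data, the choice of temperature scaling, and every identifier are illustrative assumptions, not the authors' pipeline.

```python
# Hedged illustration only: the excerpt does not specify the correction
# method used in the paper. Temperature scaling is one standard post-hoc
# recalibration technique, sketched here on synthetic, overconfident
# logits -- not the authors' data or pipeline.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic binary labels and overconfident logits (the noise is wider than
# the class separation warrants, so raw sigmoid probabilities are too extreme).
y_true = rng.integers(0, 2, size=2000)
logits = 1.5 * (2 * y_true - 1) + rng.normal(0.0, 2.5, size=2000)

def nll(temperature: float) -> float:
    """Negative log-likelihood of labels under temperature-scaled logits."""
    p = 1.0 / (1.0 + np.exp(-logits / temperature))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1.0 - p))

# Fit a single temperature T > 0, normally on a held-out calibration split.
T = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

p_raw = 1.0 / (1.0 + np.exp(-logits))
p_cal = 1.0 / (1.0 + np.exp(-logits / T))

# Dividing logits by a positive constant is monotone, so AUROC is unchanged;
# only the probability estimates move toward the observed event rates.
print(f"fitted T = {T:.2f} (T > 1 means the raw model was overconfident)")
print(f"AUROC raw = {roc_auc_score(y_true, p_raw):.3f}, "
      f"calibrated = {roc_auc_score(y_true, p_cal):.3f}")
```

Because dividing logits by a positive temperature is a monotone transform, this kind of correction leaves discrimination untouched, consistent with the abstract's pattern of strong AUROC coexisting with poor calibration.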

Matching journals

The top 6 journals are the smallest set accounting for at least 50% of the predicted probability mass (53.6% cumulatively); a sketch after the table shows how this cutoff can be computed.

Rank | Journal | Papers in training set | Percentile | Probability
1 | PLOS Digital Health | 91 | Top 0.1% | 22.5%
2 | Scientific Reports | 3102 | Top 8% | 9.1%
3 | Biology Methods and Protocols | 53 | Top 0.1% | 6.8%
4 | Frontiers in Medicine | 113 | Top 0.6% | 6.4%
5 | npj Digital Medicine | 97 | Top 0.9% | 4.6%
6 | Frontiers in Artificial Intelligence | 18 | Top 0.1% | 4.2%
(50% of probability mass above this line)
7 | Ophthalmology Science | 20 | Top 0.1% | 3.7%
8 | BMC Medical Informatics and Decision Making | 39 | Top 0.8% | 3.6%
9 | PLOS ONE | 4510 | Top 40% | 3.6%
10 | PLOS Computational Biology | 1633 | Top 11% | 2.9%
11 | Computers in Biology and Medicine | 120 | Top 2% | 2.1%
12 | BMJ Health & Care Informatics | 13 | Top 0.4% | 1.7%
13 | British Journal of Ophthalmology | 14 | Top 0.2% | 1.5%
14 | Communications Medicine | 85 | Top 0.4% | 1.3%
15 | Journal of Medical Internet Research | 85 | Top 3% | 1.2%
16 | Bioinformatics | 1061 | Top 8% | 0.9%
17 | Journal of Vision | 92 | Top 0.4% | 0.9%
18 | Frontiers in Bioinformatics | 45 | Top 0.5% | 0.9%
19 | JAMIA Open | 37 | Top 1% | 0.9%
20 | eBioMedicine | 130 | Top 3% | 0.9%
21 | NeuroImage: Clinical | 132 | Top 3% | 0.8%
22 | JMIR Formative Research | 32 | Top 2% | 0.7%
23 | JMIR Medical Informatics | 17 | Top 1% | 0.7%
24 | Computer Methods and Programs in Biomedicine | 27 | Top 1% | 0.7%
25 | Frontiers in Digital Health | 20 | Top 1% | 0.7%
26 | Translational Vision Science & Technology | 35 | Top 0.6% | 0.7%
27 | Scientific Data | 174 | Top 3% | 0.6%
28 | Journal of the American Medical Informatics Association | 61 | Top 2% | 0.6%
29 | Neurophotonics | 37 | Top 0.7% | 0.6%
30 | Patterns | 70 | Top 3% | 0.6%
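The 50% cutoff can be reproduced from the listed percentages with a minimal sketch (assuming the matching service simply accumulates probabilities in rank order; its actual computation is not shown):

```python
# Minimal sketch: find the smallest top-k set of journals whose predicted
# probabilities sum to at least 50%, using the percentages listed in the
# table above (top 10 shown; values in percent).
probs = [22.5, 9.1, 6.8, 6.4, 4.6, 4.2, 3.7, 3.6, 3.6, 2.9]

cumulative = 0.0
for k, p in enumerate(probs, start=1):
    cumulative += p
    if cumulative >= 50.0:
        print(f"top {k} journals cover {cumulative:.1f}% of the mass")
        break
# prints: top 6 journals cover 53.6% of the mass
```

The top 5 journals reach only 49.4%, so rank 6 is the first point at which the cumulative mass crosses 50%, matching the divider in the table.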