Pixel-Based Skin Tone Estimation on Dermoscopy: A Dual-Rater MST Benchmark and Feasibility Study

Kumarasinghe, A.; Bui, V.; Ghanbarzadeh, R.

2026-05-17 health informatics
10.64898/2026.05.13.26353004 medRxiv

Skin-tone labels are absent from public dermoscopy benchmarks such as the International Skin Imaging Collaboration (ISIC), making it impossible to audit whether clinical AI performs equitably across skin tones. While several recent works estimate skin tone automatically from clinical photography and selfies, we ask whether this approach is feasible on dermoscopy, the primary imaging modality of these benchmarks. To answer this, we make three main contributions. First, we release MST-Derm, a dual-rater Monk Skin Tone (MST) annotation benchmark on 500 ISIC 2018 images. Raters were given an explicit unrateable option for crops where the skin surrounding the lesion was too occluded to label confidently. We find that 60% of images were marked unrateable, yielding a 193-image consensus subset (quadratic-weighted Cohen's Kappa = 0.82). Second, we conduct a systematic feasibility study of three pixel-based MST annotation pipelines spanning the principal families in prior work: palette matching in perceptual colour space, robust colour statistics, and projection to a 1D colorimetric scalar. All three pipelines produce ordinal signal above chance (95% confidence intervals on quadratic-weighted Kappa exclude zero). However, ISIC 2018's extreme light-skin bias leaves 82% of the evaluation set at MST 2, giving a constant "always predict MST 2" baseline an accuracy floor the methods cannot overcome. To separate algorithmic signal from dataset bias, we evaluate on a class-balanced subset. The best method reaches quadratic-weighted Kappa = 0.43 against the trivial baseline of Kappa = 0.00, confirming the signal is genuine. Third, we diagnose this performance ceiling. We trace the bottleneck to two causes: dermoscopy's specialised illumination physically compresses the colour range on which lighter skin tones differ, and ISIC's dataset skew makes standard absolute-accuracy metrics uninformative. 
We conclude that while pixel-based colour features carry real MST signal on dermoscopy, current performance is insufficient for autonomous annotation. We release the benchmark, annotation protocol, all prediction runs, and analysis code to facilitate the development of robust skin-tone estimators, a vital prerequisite for accurately auditing fairness and mitigating bias in dermatological machine learning.
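The role of the constant baseline can be illustrated with a small sketch of quadratic-weighted Cohen's Kappa. This is not the authors' released analysis code: the function name, the 10-level MST encoding as integers 1 to 10, and the toy label distribution (82% at MST 2, chosen only to mirror the skew described above) are assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(y_true, y_pred, num_classes=10):
    """Cohen's Kappa with quadratic weights for ordinal labels 1..num_classes.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(y_true)
    max_sq_dist = (num_classes - 1) ** 2

    # Observed disagreement: squared ordinal distance of each (true, pred) pair.
    pair_counts = Counter(zip(y_true, y_pred))
    observed = sum(c * (t - p) ** 2 for (t, p), c in pair_counts.items()) / max_sq_dist

    # Expected disagreement if the two label sets were independent,
    # computed from the product of their marginal counts.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(
        tc * pc / n * (t - p) ** 2
        for t, tc in true_counts.items()
        for p, pc in pred_counts.items()
    ) / max_sq_dist

    if expected == 0:
        # Both label sets are the same single class; Kappa is undefined,
        # so return 0.0 by convention (an assumption of this sketch).
        return 0.0
    return 1.0 - observed / expected


# Toy labels (assumed, not from MST-Derm): 82 of 100 images at MST 2.
toy_true = [2] * 82 + [1] * 6 + [3] * 8 + [4] * 4
always_mst2 = [2] * len(toy_true)

baseline_accuracy = sum(t == p for t, p in zip(toy_true, always_mst2)) / len(toy_true)
baseline_kappa = quadratic_weighted_kappa(toy_true, always_mst2)
perfect_kappa = quadratic_weighted_kappa(toy_true, toy_true)
```

On these toy labels the constant predictor scores 0.82 accuracy but a Kappa of 0.00, because its observed disagreement equals the disagreement expected by chance. This is why a chance-corrected ordinal metric, rather than raw accuracy, is the more informative yardstick on a skewed set.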

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

Rank. Journal (papers in training set, percentile): predicted probability

1. PLOS Digital Health (91 papers in training set, Top 0.1%): 19.2%
2. Scientific Reports (3102 papers in training set, Top 4%): 10.8%
3. PLOS ONE (4510 papers in training set, Top 26%): 6.5%
4. Nature Communications (4913 papers in training set, Top 32%): 5.0%
5. Scientific Data (174 papers in training set, Top 0.3%): 5.0%
6. PLOS Computational Biology (1633 papers in training set, Top 6%): 5.0%

50% of probability mass above

7. Journal of Vision (92 papers in training set, Top 0.2%): 2.4%
8. Patterns (70 papers in training set, Top 0.5%): 2.1%
9. Science Advances (1098 papers in training set, Top 14%): 2.0%
10. npj Digital Medicine (97 papers in training set, Top 2%): 1.8%
11. Bioinformatics (1061 papers in training set, Top 7%): 1.8%
12. Nature Machine Intelligence (61 papers in training set, Top 2%): 1.8%
13. Frontiers in Bioinformatics (45 papers in training set, Top 0.3%): 1.5%
14. Cureus (67 papers in training set, Top 3%): 1.4%
15. iScience (1063 papers in training set, Top 19%): 1.4%
16. Biology Methods and Protocols (53 papers in training set, Top 1%): 1.3%
17. Frontiers in Digital Health (20 papers in training set, Top 1%): 1.0%
18. Medical Image Analysis (33 papers in training set, Top 0.8%): 1.0%
19. Applications in Plant Sciences (21 papers in training set, Top 0.3%): 0.9%
20. PNAS Nexus (147 papers in training set, Top 0.9%): 0.9%
21. Journal of Pathology Informatics (13 papers in training set, Top 0.3%): 0.8%
22. BMC Medical Informatics and Decision Making (39 papers in training set, Top 2%): 0.8%
23. Nature Methods (336 papers in training set, Top 6%): 0.8%
24. JCO Clinical Cancer Informatics (18 papers in training set, Top 0.8%): 0.8%
25. Neural Networks (32 papers in training set, Top 0.7%): 0.8%
26. Methods in Ecology and Evolution (160 papers in training set, Top 2%): 0.8%
27. Frontiers in Plant Science (240 papers in training set, Top 5%): 0.8%
28. European Journal of Human Genetics (49 papers in training set, Top 1%): 0.8%
29. Royal Society Open Science (193 papers in training set, Top 5%): 0.7%
30. Communications Biology (886 papers in training set, Top 28%): 0.7%