
Grading of Erythema and Visual Attributes in Atopic Dermatitis across Diverse Skin Tones Using a Vision AI Pipeline

Abdolahnejad, M.; Kyremeh, M.; Smith, J.; Fang, G.; Chan, H. O.; Joshi, R.; Hong, C.

2026-03-31 | dermatology
doi: 10.64898/2026.03.30.26349755 | medRxiv

Background: Atopic dermatitis (AD) is a prevalent chronic inflammatory skin disease associated with clinical, psychosocial, and economic burden. Accurate severity assessment is essential for guiding treatment escalation and monitoring disease activity, yet clinician-based scoring systems such as the Eczema Area and Severity Index (EASI) are limited by subjectivity and considerable inter- and intra-rater variability. Erythema, a key driver of AD severity grading, is particularly prone to inconsistent evaluation due to differences in ambient lighting, device quality, skin tone, and rater experience, underscoring the need for objective, reproducible assessment tools. Objective: To develop and validate an artificial intelligence (AI) pipeline for grading erythema, excoriation, and lichenification severity in AD from clinical photographs. The study evaluated the agreement of AI severity ratings in each category with dermatologists, non-specialists, and a consensus reference standard, with erythema as the primary outcome of interest. Methods: A two-stage AI pipeline was developed using EfficientNet-B7 convolutional neural networks (CNNs). The first CNN was trained as a binary AD classifier on 451 AD and 601 non-AD images for lesion detection and segmentation. The second CNN was trained on 173 dermatologist-annotated AD images, each scored on a 0-3 ordinal scale for erythema, excoriation, and lichenification. This CNN was paired with downstream feature extraction algorithms: red-channel contrast for erythema, Laws' E5L5 texture maps for excoriation, and Laws' S5L5 texture maps for lichenification. In a cross-sectional validation study, 41 independent test images were scored by two blinded dermatologists and two blinded physicians. AI predictions were compared to individual rater groups and mode-derived consensus scores using weighted Cohen's kappa, classification accuracy, confusion matrices, and error direction analyses.
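The Laws E5L5 and S5L5 texture maps named in the Methods are built from outer products of 1D Laws kernels (L5 = level, E5 = edge, S5 = spot) convolved with a grayscale image. The paper's exact preprocessing and energy pooling are not described in the abstract, so the following is only a minimal plain-Python sketch of the kernel construction and filtering step:

```python
# Minimal sketch of Laws texture filtering (E5L5, S5L5).
# Preprocessing and energy pooling used by the paper are not specified
# in the abstract; only kernel construction and filtering are shown.

L5 = [1, 4, 6, 4, 1]     # Level: local average
E5 = [-1, -2, 0, 2, 1]   # Edge: gradient-like response
S5 = [-1, 0, 2, 0, -1]   # Spot: center-surround response


def outer(u, v):
    """2D Laws kernel as the outer product of two 1D kernels."""
    return [[a * b for b in v] for a in u]


E5L5 = outer(E5, L5)  # edge-sensitive kernel (used for excoriation)
S5L5 = outer(S5, L5)  # spot-sensitive kernel (used for lichenification)


def filter_valid(img, ker):
    """Apply a 2D kernel to a grayscale image (valid region, no padding)."""
    kh, kw = len(ker), len(ker[0])
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[i + u][j + v] * ker[u][v]
                for u in range(kh) for v in range(kw))
            for j in range(w - kw + 1)
        ]
        for i in range(h - kh + 1)
    ]


# Both kernels sum to zero, so a flat (textureless) patch gives zero response.
flat_patch = [[100] * 7 for _ in range(7)]
response = filter_valid(flat_patch, E5L5)  # 3x3 map of zeros
```

Because E5 and S5 each sum to zero, the combined kernels ignore uniform brightness and respond only to local texture, which is why they suit scratch-like (excoriation) and thickened-skin (lichenification) patterns.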
Results: On internal validation, the severity CNN achieved 84% overall accuracy (averaged across all three attributes), 86% sensitivity, 87% specificity, and a macro-averaged area under the receiver operating characteristic curve (AUC) of 0.90. In the external comparison with blinded human raters, erythema agreement between the AI and dermatologist consensus was substantial (accuracy 80.7%; kappa = 0.68), with no large (>2-point) misclassifications. Physician consensus agreement was lower (accuracy 54.8%; kappa = 0.34), reflecting greater variability among primary care physicians (non-specialists). For excoriation, AI-dermatologist agreement was moderate (accuracy 72.4%; kappa = 0.62); for lichenification, agreement was similar (accuracy 71.4%; kappa = 0.59). Across all features, disagreements were predominantly between adjacent severity categories. The AI also generated erythema severity grades for images of darker skin tones that dermatologists declined to rate, marking them "unable to assess". Limitations: The validation set was small (41 images), severe cases (score 3) were underrepresented, one rater participated in both training annotation and validation scoring, and sample size was insufficient for robust stratification by skin tone or body site. Conclusion: The AI pipeline demonstrated dermatologist-level accuracy for erythema scoring, consistent moderate agreement for excoriation and lichenification, and a potential advantage in assessing erythema on darker skin tones. These findings support its potential as a standardized, objective tool for AD severity assessment. Prospective validation in larger, more diverse cohorts is warranted.
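The agreement statistic used throughout the Results is weighted Cohen's kappa on 0-3 ordinal scores. The abstract does not say whether linear or quadratic weights were used, so the quadratic variant below is an assumption; it is a common choice for ordinal severity scales because it penalizes 2- and 3-point disagreements more than adjacent-category ones:

```python
# Quadratic-weighted Cohen's kappa for ordinal scores 0..n_classes-1.
# The weighting scheme (quadratic) is an assumption; the paper only
# states "weighted Cohen's kappa".

def quadratic_weighted_kappa(rater_a, rater_b, n_classes=4):
    n = len(rater_a)
    # Observed confusion matrix between the two raters
    obs = [[0.0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        obs[a][b] += 1
    # Marginal score histograms (used for chance-expected agreement)
    hist_a = [sum(obs[i]) for i in range(n_classes)]
    hist_b = [sum(obs[i][j] for i in range(n_classes))
              for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Quadratic disagreement weight: 0 on the diagonal,
            # growing with the squared score difference
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * obs[i][j]
            den += w * hist_a[i] * hist_b[j] / n
    return 1.0 - num / den


# Perfect agreement yields kappa = 1.0
print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3]))  # -> 1.0
```

With this weighting, the adjacent-category disagreements reported in the Results cost only 1/9 of the maximum penalty each, which is consistent with the relatively high kappa values despite ~20-30% raw disagreement.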

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1. JAMA Network Open: 14.7% (127 papers in training set; Top 0.1%)
2. Frontiers in Medicine: 10.3% (113 papers in training set; Top 0.2%)
3. PLOS ONE: 8.4% (4510 papers in training set; Top 22%)
4. Scientific Reports: 8.4% (3102 papers in training set; Top 10%)
5. Experimental Dermatology: 7.0% (10 papers in training set; Top 0.1%)
6. Journal of Investigative Dermatology: 7.0% (42 papers in training set; Top 0.1%)
--- 50% of probability mass above this point ---
7. PLOS Medicine: 3.7% (98 papers in training set; Top 1%)
8. npj Digital Medicine: 3.3% (97 papers in training set; Top 1%)
9. Nature Communications: 2.4% (4913 papers in training set; Top 45%)
10. Frontiers in Public Health: 2.1% (140 papers in training set; Top 3%)
11. PLOS Neglected Tropical Diseases: 2.1% (378 papers in training set; Top 3%)
12. Allergy: 1.9% (23 papers in training set; Top 0.2%)
13. eClinicalMedicine: 1.9% (55 papers in training set; Top 0.4%)
14. Human Genomics: 1.7% (21 papers in training set; Top 0.1%)
15. Eye: 1.4% (11 papers in training set; Top 0.3%)
16. European Journal of Cancer: 1.4% (10 papers in training set; Top 0.3%)
17. Ophthalmology Science: 1.4% (20 papers in training set; Top 0.2%)
18. JCI Insight: 1.0% (241 papers in training set; Top 5%)
19. Scientific Data: 0.9% (174 papers in training set; Top 2%)
20. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring: 0.8% (38 papers in training set; Top 1.0%)
21. Cureus: 0.8% (67 papers in training set; Top 5%)
22. Informatics in Medicine Unlocked: 0.8% (21 papers in training set; Top 1%)
23. Journal of Advanced Research: 0.8% (15 papers in training set; Top 0.7%)
24. Journal of Translational Medicine: 0.7% (46 papers in training set; Top 3%)
25. Frontiers in Immunology: 0.7% (586 papers in training set; Top 8%)
26. Journal of Allergy and Clinical Immunology: 0.7% (25 papers in training set; Top 0.8%)
27. Frontiers in Nutrition: 0.7% (23 papers in training set; Top 2%)
28. Frontiers in Digital Health: 0.7% (20 papers in training set; Top 2%)
29. Diagnostics: 0.5% (48 papers in training set; Top 3%)
30. BMJ Open: 0.5% (554 papers in training set; Top 14%)