
Unsupervised Tissue Concepts for Explainable Sarcoma Subtype Prediction from H&E

Bisson, T.; Ingram, D.; Singh, S.; Li, A.; Flynn, S.; Wang, W.-L.; Kim, A. E.; Bridge, C. P.; Demicco, E. G.; Sorrentino, A.; Jiang, S.; Hung, Y. P.; Lazar, A. J.; Iafrate, A. J.

2026-05-20 pathology
10.64898/2026.05.15.26353333 medRxiv

Soft tissue sarcomas are a rare, heterogeneous group of tumors whose diagnosis remains challenging because of overlapping morphology and limited access to sarcoma-specialized pathologists. Although pathology foundation models have shown promise in computational pathology, their clinical translation remains limited by insufficient interpretability, particularly in diagnostically complex settings such as sarcoma diagnosis. Here, we developed and evaluated an H&E-based AI framework for sarcoma subtype classification that focused on explainability. Using the CONCH v1.5 foundation model, we computed embeddings from a tissue microarray cohort of 2,545 cases spanning 19 sarcoma subtypes and trained an attention-based multiple-instance learning (ABMIL) model that achieved a balanced accuracy of 77.38% (SD 1.88). To move explainability beyond attention-based localization, we trained a sparse autoencoder on patch-level embeddings to learn 768 recurring visual concepts. Ninety high-activation concepts were reviewed by three senior pathologists and curated into morphologically meaningful and non-meaningful categories, yielding a semantic dictionary of 41 diagnostically relevant tissue concepts. We then trained a linear attention-based model on the 768-concept vectors, which retained much of the performance of the raw embedding-based ABMIL model, achieving a balanced accuracy of 73.74% (SD 1.30). When the linear model was restricted to pathologist-curated morphologic concepts only, balanced accuracy decreased further to 67.04% (SD 1.27), suggesting that the residual performance gain in the full concept model was driven by inconsistent, technical, or diagnostically irrelevant concepts. Concept-level explanations of the curated linear attention-based model aligned with known sarcoma morphology, including lipogenic, myxoid, spindle-cell, pleomorphic, vascular, small round blue cell, and matrix-forming patterns, and reproduced patterns of diagnostic overlap observed in human sarcoma pathology.
Together, these results show that H&E-based foundation-model representations capture meaningful diagnostic structure within the known limitations of H&E in sarcoma diagnostics, but that their clinical value depends on whether this structure can be made interpretable to pathologists. Sparse autoencoder-derived concepts can address this critical gap by converting embedding-level signal into recurring morphologic patterns that pathologists can review and name, providing the foundation to link these patterns to subtype predictions. In doing so, this approach turns concept discovery into a practical form of diagnostic explanation, while also revealing where model performance is supported by recognizable histopathology and where it relies on diagnostically irrelevant or inconsistent visual patterns.
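
As an illustration of how a linear attention-based model over concept vectors can yield concept-level explanations, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`concept_contributions`, `balanced_accuracy`), the softmax attention form, and the toy weights are all assumptions made for illustration. The idea is that when pooling and classification are both linear in the concept activations, each subtype logit splits into one additive term per concept, which a pathologist can inspect. Balanced accuracy, the metric reported above, is computed as the mean of per-class recall.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def concept_contributions(patch_concepts, attn_weights, class_weights):
    """Hypothetical linear attention pooling over concept activations.

    patch_concepts: one list of K concept activations per patch
                    (for example, sparse autoencoder outputs).
    attn_weights:   K weights giving a linear attention score per patch.
    class_weights:  K weights for a single subtype logit.
    Returns (attention, pooled, contributions); the subtype logit is
    sum(contributions), so each entry is one concept's share of it.
    """
    scores = [sum(w * a for w, a in zip(attn_weights, patch))
              for patch in patch_concepts]
    attention = softmax(scores)
    k = len(class_weights)
    # Attention-weighted average of each concept across patches.
    pooled = [sum(att * patch[j] for att, patch in zip(attention, patch_concepts))
              for j in range(k)]
    contributions = [class_weights[j] * pooled[j] for j in range(k)]
    return attention, pooled, contributions


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(1 for i in idx if y_pred[i] == c) / len(idx))
    return sum(recalls) / len(recalls)


# Toy example: two patches, two concepts (all values are made up).
attention, pooled, contributions = concept_contributions(
    patch_concepts=[[1.0, 0.0], [0.0, 1.0]],
    attn_weights=[0.0, 0.0],
    class_weights=[2.0, -1.0],
)
```

Restricting such a model to curated concepts would amount to dropping the columns of `patch_concepts` (and the matching weights) for concepts judged non-meaningful, which is one way to read the comparison between the 768-concept and curated-concept models.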

Matching journals

The top 1 journal accounts for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Top % | Predicted probability
1 | Modern Pathology | 21 | Top 0.1% | 58.8%
(50% of probability mass above)
2 | Journal of Pathology Informatics | 13 | Top 0.1% | 6.3%
3 | Nature Communications | 4913 | Top 41% | 3.6%
4 | npj Digital Medicine | 97 | Top 1% | 2.9%
5 | Cancer Research | 116 | Top 2% | 2.1%
6 | PLOS Computational Biology | 1633 | Top 15% | 1.9%
7 | Nature Machine Intelligence | 61 | Top 2% | 1.7%
8 | Scientific Reports | 3102 | Top 58% | 1.7%
9 | npj Precision Oncology | 48 | Top 0.7% | 1.5%
10 | Clinical Cancer Research | 58 | Top 1% | 1.3%
11 | Proceedings of the National Academy of Sciences | 2130 | Top 38% | 1.2%
12 | Cell Reports Medicine | 140 | Top 6% | 1.1%
13 | eBioMedicine | 130 | Top 3% | 0.9%
14 | Journal of Medical Imaging | 11 | Top 0.2% | 0.9%
15 | Laboratory Investigation | 13 | Top 0.2% | 0.9%
16 | Medical Image Analysis | 33 | Top 0.9% | 0.9%
17 | Science Translational Medicine | 111 | Top 5% | 0.9%
18 | eLife | 5422 | Top 54% | 0.9%
19 | Breast Cancer Research | 32 | Top 0.5% | 0.7%
20 | Science Advances | 1098 | Top 30% | 0.7%
21 | JAMIA Open | 37 | Top 2% | 0.6%
22 | Briefings in Bioinformatics | 326 | Top 8% | 0.6%
23 | Frontiers in Bioinformatics | 45 | Top 1% | 0.6%