
Large Language Models for data extraction from written kidney biopsy reports

Niggemeier, L.; Hoelscher, D. L.; Herkens, T. C.; Gilles, P.; Boor, P.; Buelow, R.

2026-02-25 · pathology
medRxiv · doi:10.64898/2026.02.23.26346945
Abstract

Introduction: Kidney biopsy reports contain rich, clinically actionable information that is useful for research, but their narrative format hinders scalable reuse. We investigated whether open-source large language models (LLMs) can extract relevant, standardized readouts from native kidney biopsy pathology reports.

Methods: German free-text native kidney biopsy reports were parsed with three open-source LLMs (Llama3 70B, Llama3 8B, MedGemma) to generate structured JSON outputs covering relevant report elements (e.g., diagnosis, glomerular counts, histopathological patterns). Two independent observers manually curated the same report elements; disagreements between the two were resolved by an experienced nephropathologist to create the final ground truth. Performance was assessed using strict and soft matching and summarized as accuracy. Inter-rater agreement was quantified using Cohen's and Light's kappa with 95% confidence intervals obtained from 1000 bootstrap resamples.

Results: Llama3 70B achieved the highest overall accuracy (93.3% strict, 97.1% soft), followed by MedGemma. These larger models showed near-perfect performance for explicit and discrete variables and for positivity of immunohistochemistry markers, while accuracy decreased for report elements requiring interpretation (e.g., primary diagnosis, interstitial inflammation in fibrotic vs. non-fibrotic cortex). Human raters showed strong agreement for the primary diagnosis (κ = 0.74, 95% CI 0.64-0.84). Adding Llama3 70B or MedGemma as a third rater increased overall agreement (0.82, 95% CI 0.74-0.89 and 0.78, 95% CI 0.69-0.85, respectively), whereas Llama3 8B reduced it.

Conclusions: Open-source LLMs can accurately transform narrative nephropathology reports into a structured, machine-readable format, potentially supporting scalable retrospective cohort building. While some report elements can be extracted without supervision, interpretation-dependent elements should be supervised by a human observer.
Lay Summary: Retrospective data collection from nephropathology reports is essential for building informative cohorts in computational nephropathology research, yet manual processing of narrative reports is time-consuming and limits scalability. In this study, we demonstrate that open-source large language models can reliably extract key diagnostic, quantitative, and descriptive data elements from kidney biopsy reports with high accuracy. While factual and clearly stated report elements can be extracted automatically, findings that require contextual or interpretative judgment still benefit from expert supervision. Overall, this approach substantially reduces manual effort and enables efficient generation of structured datasets from routine diagnostics, facilitating the development of kidney registries and future computational nephropathology research. In addition, such systems could be implemented in the routine diagnostic workflow to directly transform narrative reports into structured data.
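The methods report inter-rater agreement as Cohen's kappa with 95% confidence intervals from 1000 bootstrap resamples. As an illustration of that statistic only (a minimal plain-Python sketch, not the authors' pipeline; function names are hypothetical), kappa and a percentile bootstrap CI can be computed like this:

```python
import random
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater1)
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n       # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    labels = set(rater1) | set(rater2)
    p_exp = sum(c1[lab] * c2[lab] for lab in labels) / (n * n)    # agreement expected by chance
    if p_exp == 1.0:                                              # degenerate: a single label everywhere
        return 1.0 if p_obs == 1.0 else 0.0
    return (p_obs - p_exp) / (1.0 - p_exp)

def bootstrap_kappa_ci(rater1, rater2, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample items with replacement, recompute kappa."""
    rng = random.Random(seed)
    n = len(rater1)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        samples.append(cohens_kappa([rater1[i] for i in idx],
                                    [rater2[i] for i in idx]))
    samples.sort()
    return samples[int(n_boot * alpha / 2)], samples[int(n_boot * (1 - alpha / 2)) - 1]
```

Perfect agreement yields kappa = 1, chance-level agreement yields kappa ≈ 0; the bootstrap CI reflects how stable the estimate is for the given number of rated reports.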

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

| Rank | Journal | Papers in training set | Percentile | Probability |
|------|---------|------------------------|------------|-------------|
| 1 | Modern Pathology | 21 | Top 0.1% | 10.7% |
| 2 | Journal of the American Society of Nephrology | 52 | Top 0.1% | 7.0% |
| 3 | The Lancet Digital Health | 25 | Top 0.1% | 6.5% |
| 4 | Kidney360 | 22 | Top 0.2% | 5.0% |
| 5 | Kidney International | 25 | Top 0.1% | 5.0% |
| 6 | Nature Communications | 4913 | Top 35% | 4.4% |
| 7 | Scientific Reports | 3102 | Top 26% | 4.4% |
| 8 | PLOS ONE | 4510 | Top 34% | 4.3% |
| 9 | BMC Medicine | 163 | Top 1% | 4.1% |
| 10 | Kidney International Reports | 14 | Top 0.1% | 3.7% |
| 11 | JAMA Network Open | 127 | Top 0.9% | 3.7% |
| 12 | Diabetologia | 36 | Top 0.4% | 2.7% |
| 13 | Journal of Pathology Informatics | 13 | Top 0.1% | 2.5% |
| 14 | American Journal of Transplantation | 15 | Top 0.1% | 2.4% |
| 15 | Frontiers in Pharmacology | 100 | Top 2% | 1.7% |
| 16 | The American Journal of Pathology | 31 | Top 0.2% | 1.7% |
| 17 | eBioMedicine | 130 | Top 1% | 1.7% |
| 18 | Journal of Clinical Pathology | 12 | Top 0.2% | 1.7% |
| 19 | Computers in Biology and Medicine | 120 | Top 2% | 1.5% |
| 20 | Biology Methods and Protocols | 53 | Top 1% | 1.4% |
| 21 | npj Digital Medicine | 97 | Top 2% | 1.4% |
| 22 | Computational and Structural Biotechnology Journal | 216 | Top 6% | 1.3% |
| 23 | PLOS Biology | 408 | Top 15% | 1.1% |
| 24 | Laboratory Investigation | 13 | Top 0.1% | 1.0% |
| 25 | Journal of Medical Imaging | 11 | Top 0.2% | 0.9% |
| 26 | Journal of the American Medical Informatics Association | 61 | Top 2% | 0.8% |
| 27 | Med | 38 | Top 0.8% | 0.8% |
| 28 | Cytometry Part A | 30 | Top 0.3% | 0.8% |
| 29 | The Lancet | 16 | Top 0.8% | 0.7% |
| 30 | The Journal of Pathology | 22 | Top 0.5% | 0.7% |