Large-Language Models for data extraction from written kidney biopsy reports

Niggemeier, L.; Hoelscher, D. L.; Herkens, T. C.; Gilles, P.; Boor, P.; Buelow, R.

2026-02-25 pathology

10.64898/2026.02.23.26346945 medRxiv


Introduction: Kidney biopsy reports contain rich information that is clinically actionable and useful for research. However, their narrative format hinders scalable reuse. Here, we investigated whether open-source large language models (LLMs) can extract relevant, standardized readouts from native kidney biopsy pathology reports.

Methods: German free-text native kidney biopsy reports were parsed with three open-source LLMs (Llama3 70B, Llama3 8B, MedGemma) to generate structured JSON outputs covering relevant report elements (e.g., diagnosis, glomerular counts, histopathological patterns). Two independent observers manually curated the same report elements; disagreements between the two were resolved by an experienced nephropathologist to create the final ground truth. Performance was assessed using strict and soft matching and summarized as accuracy. Inter-rater agreement was quantified using Cohen's and Light's kappa, with 95% confidence intervals obtained from 1000 bootstrap resamples.

Results: Llama3 70B achieved the highest overall accuracy (93.3% strict, 97.1% soft), followed by MedGemma. These larger models showed near-perfect performance for explicit and discrete variables and for positivity of immunohistochemistry markers, while accuracy decreased for report elements requiring interpretation (e.g., primary diagnosis, interstitial inflammation in fibrotic vs. non-fibrotic cortex). Human raters showed strong agreement for the primary diagnosis (κ = 0.74, 95% CI 0.64-0.84). Adding Llama3 70B or MedGemma as a third rater increased overall agreement (0.82, 95% CI 0.74-0.89 and 0.78, 95% CI 0.69-0.85, respectively), whereas adding Llama3 8B reduced it.

Conclusions: Open-source LLMs can accurately transform narrative nephropathology reports into a structured, machine-readable format, potentially supporting scalable retrospective cohort building. While some report elements can be extracted without supervision, interpretation-dependent elements should be supervised by a human observer.
Lay Summary: Retrospective data collection from nephropathology reports is essential for building informative cohorts in computational nephropathology research, yet manual processing of narrative reports is time-consuming and limits scalability. In this study, we demonstrate that open-source large language models can reliably extract key diagnostic, quantitative, and descriptive data elements from kidney biopsy reports with high accuracy. While factual and clearly stated report elements can be extracted automatically, findings that require contextual or interpretative judgment still benefit from expert supervision. Overall, this approach substantially reduces manual effort and enables efficient generation of structured datasets from routine diagnostics, facilitating the development of kidney registries and future computational nephropathology research. In addition, such systems could be integrated into the routine diagnostic workflow to transform narrative reports directly into structured data.
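
The agreement statistics named in the Methods (Cohen's kappa for two raters, Light's kappa as the mean of pairwise Cohen's kappas, and a bootstrap confidence interval) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the resampling unit or the interval type, so resampling whole reports and using a percentile interval are assumptions here, and the diagnosis labels and function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels on the same cases."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the product of each rater's marginal frequencies.
    p_exp = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_exp == 1.0:
        return 1.0  # convention for the degenerate single-label case
    return (p_obs - p_exp) / (1 - p_exp)


def light_kappa(ratings):
    """Light's kappa: mean of Cohen's kappa over all pairs of raters."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def bootstrap_ci(ratings, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval; resamples cases with replacement (assumed)."""
    rng = random.Random(seed)
    n = len(ratings[0])
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(stat([[r[i] for i in idx] for r in ratings]))
    values.sort()
    lower = values[int((alpha / 2) * n_boot)]
    upper = values[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical primary-diagnosis labels for eight reports (illustration only).
rater_1 = ["IgAN", "MCD", "IgAN", "FSGS", "MCD", "IgAN", "FSGS", "MCD"]
rater_2 = ["IgAN", "MCD", "FSGS", "FSGS", "MCD", "IgAN", "FSGS", "IgAN"]

kappa = cohen_kappa(rater_1, rater_2)
ci_low, ci_high = bootstrap_ci([rater_1, rater_2], light_kappa)
```

Adding a model as a third rater, as described in the Results, would correspond to passing three label lists to `light_kappa` and to `bootstrap_ci`.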

