LLM-based reconstruction of longitudinal clinical trajectories in chronic liver disease.

Paverd, H.; Gao, Z.; Mahani, G.; Fabre, M.; Burge, S.; Hoare, M.; Crispin-Ortuzar, M.

2026-02-10 transplantation
10.64898/2026.02.10.26345124 medRxiv

Background & Aims: Liver cancer primarily develops in patients with chronic liver disease (CLD), yet most cases are diagnosed at an advanced stage with poor prognosis. While clinical surveillance of patients with CLD generates extensive longitudinal data, its unstructured free-text nature hinders large-scale research. To unlock this real-world evidence, we developed a scalable framework using open-source Large Language Models (LLMs) to transform unstructured clinical text into structured data.
Methods: We conducted a multi-stage evaluation of LLM-based extraction from multi-source clinical documentation of liver transplant recipients. A calibration set comprising 507 reports (414 radiology, 65 pathology, and 28 liver transplant assessment reports) from 30 patients was manually annotated to benchmark four open-source LLMs (Llama 3.1 8B, Llama 3.3 70B, OpenBioLLM 70B, DeepSeek R1 8B) against a regular expression baseline across 73 tasks. To ensure structured outputs, we compared constrained decoding (Guidance and Ollama packages) against unconstrained prompting across 5,590 prompt-output pairs. The finalised pipeline was then applied to the full cohort of 835 patients transplanted in our centre over the past decade.
Results: Among the models tested, Llama 3.3 70B performed best, exceeding 90% accuracy on 59/73 tasks, outperforming both a medically fine-tuned model (OpenBioLLM 70B) and a smaller variant (Llama 3.1 8B). Constrained decoding achieved >99.9% format adherence, far surpassing unconstrained prompting (87.4%). Applied to the full cohort, the pipeline successfully analysed 22,493 reports to generate 37,125 datapoints (45 variables, 835 patients) without manual annotation. Further analysis confirmed known liver cancer risk factors (male sex, viral hepatitis, smoking, diabetes), and allowed for reconstruction of longitudinal disease timelines.
Conclusions: This work provides a scalable blueprint for transforming real-world clinical free-text into structured formats, paving the way for accelerated, data-driven research into complex pre-cancerous diseases like CLD.
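
The abstract reports format adherence rates but does not say how adherence was scored. As a rough illustration only, the sketch below counts an output as adherent when it parses as JSON and carries a single `answer` field drawn from a fixed label set. The schema, the label set `ALLOWED`, and the function names are all hypothetical and are not taken from the paper.

```python
import json

# Hypothetical label set for one extraction task (an assumption, not from the paper).
ALLOWED = {"yes", "no", "not mentioned"}


def is_adherent(raw: str) -> bool:
    """Return True if raw is a JSON object of the form {"answer": <allowed label>}."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False  # free text or truncated JSON does not adhere
    if not isinstance(obj, dict) or set(obj) != {"answer"}:
        return False  # wrong top-level type, or missing/extra keys
    answer = obj["answer"]
    return isinstance(answer, str) and answer in ALLOWED


def adherence_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that satisfy the expected format."""
    if not outputs:
        return 0.0
    return sum(is_adherent(o) for o in outputs) / len(outputs)


# Made-up example outputs, for illustration only.
example_outputs = [
    '{"answer": "yes"}',
    "The answer is yes.",
    '{"answer": "maybe"}',
    '{"answer": "no"}',
]
example_rate = adherence_rate(example_outputs)
```

With constrained decoding, outputs of the second and third kind should not occur, because the decoder can only emit tokens that keep the output inside the schema. That is one plausible reading of the gap between the two reported adherence figures.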

Matching journals

The top 13 journals account for 50% of the predicted probability mass.

1. Nature Communications: 4913 papers in training set; Top 16%; 10.6%
2. Cell Reports Medicine: 140 papers in training set; Top 0.3%; 7.1%
3. Annals of Internal Medicine: 27 papers in training set; Top 0.1%; 5.1%
4. PLOS ONE: 4510 papers in training set; Top 35%; 4.1%
5. JAMA Network Open: 127 papers in training set; Top 0.7%; 4.1%
6. Journal of Hepatology: 18 papers in training set; Top 0.1%; 3.8%
7. Transplantation: 13 papers in training set; Top 0.1%; 3.2%
8. Scientific Reports: 3102 papers in training set; Top 44%; 2.7%
9. eBioMedicine: 130 papers in training set; Top 0.5%; 2.7%
10. Nature Medicine: 117 papers in training set; Top 1%; 2.2%
11. Bioinformatics: 1061 papers in training set; Top 6%; 2.2%
12. npj Digital Medicine: 97 papers in training set; Top 2%; 2.0%
13. Hepatology: 18 papers in training set; Top 0.2%; 1.8%
50% of probability mass above
14. Genome Medicine: 154 papers in training set; Top 4%; 1.8%
15. Frontiers in Public Health: 140 papers in training set; Top 4%; 1.8%
16. Scientific Data: 174 papers in training set; Top 1.0%; 1.8%
17. The Lancet Digital Health: 25 papers in training set; Top 0.3%; 1.8%
18. Journal of Translational Medicine: 46 papers in training set; Top 0.7%; 1.8%
19. Science Translational Medicine: 111 papers in training set; Top 3%; 1.6%
20. BMC Medicine: 163 papers in training set; Top 4%; 1.6%
21. iScience: 1063 papers in training set; Top 17%; 1.6%
22. Gut: 36 papers in training set; Top 0.5%; 1.4%
23. Journal of the American Medical Informatics Association: 61 papers in training set; Top 2%; 1.0%
24. Med: 38 papers in training set; Top 0.5%; 1.0%
25. Biology Methods and Protocols: 53 papers in training set; Top 2%; 0.9%
26. Nucleic Acids Research: 1128 papers in training set; Top 15%; 0.9%
27. Clinical Pharmacology & Therapeutics: 25 papers in training set; Top 0.6%; 0.8%
28. JAMIA Open: 37 papers in training set; Top 1%; 0.8%
29. Communications Biology: 886 papers in training set; Top 22%; 0.8%
30. Cancers: 200 papers in training set; Top 4%; 0.8%
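
The statement that the top 13 journals account for 50% of the predicted probability mass can be checked against the listed percentages with a running sum. The sketch below is a minimal illustration: the values are copied from the list in rank order, and the function name `ranks_needed` is hypothetical. Because the displayed percentages are rounded, the totals are approximate.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1 to 30, copied from the list above.
probs = [
    10.6, 7.1, 5.1, 4.1, 4.1, 3.8, 3.2, 2.7, 2.7, 2.2,
    2.2, 2.0, 1.8, 1.8, 1.8, 1.8, 1.8, 1.8, 1.6, 1.6,
    1.6, 1.4, 1.0, 1.0, 0.9, 0.9, 0.8, 0.8, 0.8, 0.8,
]


def ranks_needed(probabilities, threshold):
    """Return (rank, cumulative total) at the first rank whose running sum reaches threshold."""
    for rank, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return rank, total
    return None  # threshold is never reached by the listed entries


rank, total = ranks_needed(probs, 50.0)
print(f"{rank} journals cover about {total:.1f}% of the probability mass")
```

The running sum is about 49.8% after rank 12 and about 51.6% after rank 13, which agrees with the cutoff marked in the list.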