
Learning the natural history of human disease with generative transformers

Shmatko, A.; Jung, A. W.; Gaurav, K.; Brunak, S.; Mortensen, L.; Birney, E.; Fitzgerald, T.; Gerstung, M.

2024-06-07 epidemiology
10.1101/2024.06.07.24308553 medRxiv

Decision-making in healthcare relies on the ability to understand patients' past and current health state to predict, and ultimately change, their future course. Artificial intelligence (AI) methods promise to aid this task by learning patterns of disease progression from large corpora of health records to predict detailed outcomes for an individual. However, the potential of AI has not yet been fully investigated at scale. Here, we modify the GPT (generative pretrained transformer) architecture to model the temporal progression and competing nature of human diseases in a population-scale cohort. We train this model, termed Delphi-2M, on data from 0.4 million participants of the UK Biobank and validate it using external data from 1.9 million Danish individuals with no change in parameters. Delphi-2M predicts the rates of more than 1,000 different ICD-10 coded diseases and death, conditional on each individual's past disease history, age, sex and baseline lifestyle information, and with accuracy comparable to existing single-disease models. Delphi-2M's generative nature also enables sampling future health trajectories at any point within an individual's life course with outcomes across the entire disease spectrum. Sampled health trajectories provide meaningful estimates of future disease burden for up to 20 years and enable training AI models which have never seen actual data. Explainable AI methods provide insights into Delphi-2M's predictions, revealing temporal clusters of co-morbidities within and across different disease chapters and their time-dependent consequences on the future health course. These analyses, however, also reveal that biases underlying the available training data, which in the case of the UK Biobank stem from distinct healthcare sources, are learned and highlighted.
In summary, GPT-based models appear well suited for predictive and generative health-related tasks, are applicable to population-scale health data sets and provide insights into the temporal dependencies of past events that shape future health, impacting our ability to obtain an instantaneous view of personalised health state.

Matching journals

The top 4 journals account for 50% of the predicted probability mass.

1. Nature Medicine: 117 papers in training set; Top 0.1%; 33.7%
2. Nature Machine Intelligence: 61 papers in training set; Top 0.2%; 8.6%
3. npj Digital Medicine: 97 papers in training set; Top 0.7%; 7.0%
4. Nature Communications: 4913 papers in training set; Top 28%; 6.5%

50% of probability mass above

5. Nature: 575 papers in training set; Top 5%; 5.0%
6. Nature Human Behaviour: 85 papers in training set; Top 0.6%; 4.4%
7. eLife: 5422 papers in training set; Top 24%; 3.7%
8. Proceedings of the National Academy of Sciences: 2130 papers in training set; Top 19%; 3.7%
9. Science Advances: 1098 papers in training set; Top 8%; 3.1%
10. PLOS Computational Biology: 1633 papers in training set; Top 17%; 1.5%
11. Science: 429 papers in training set; Top 15%; 1.5%
12. Scientific Reports: 3102 papers in training set; Top 61%; 1.5%
13. Science Translational Medicine: 111 papers in training set; Top 3%; 1.5%
14. Communications Medicine: 85 papers in training set; Top 0.4%; 1.4%
15. Nature Genetics: 240 papers in training set; Top 5%; 1.3%
16. Genome Medicine: 154 papers in training set; Top 6%; 1.0%
17. Cell Genomics: 162 papers in training set; Top 6%; 0.8%
18. Nature Computational Science: 50 papers in training set; Top 2%; 0.7%
19. EMBO Molecular Medicine: 85 papers in training set; Top 5%; 0.7%
20. Nature Biomedical Engineering: 42 papers in training set; Top 2%; 0.7%
21. Journal of Biomedical Informatics: 45 papers in training set; Top 2%; 0.7%
22. Nature Biotechnology: 147 papers in training set; Top 8%; 0.7%
23. IEEE Journal of Biomedical and Health Informatics: 34 papers in training set; Top 2%; 0.7%
24. Briefings in Bioinformatics: 326 papers in training set; Top 8%; 0.5%
25. Genome Research: 409 papers in training set; Top 5%; 0.5%
26. International Journal of Epidemiology: 74 papers in training set; Top 3%; 0.5%
27. Nature Methods: 336 papers in training set; Top 7%; 0.5%
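
The "50% of probability mass" marker can be checked from the listed percentages with a running total. The sketch below is only an illustration of that arithmetic, not the page's own method: the function name `journals_needed` and the 50% default threshold are assumptions, and the values are copied from the first nine entries of the list above.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for the nine top-ranked journals,
# copied from the list above.
probabilities = [33.7, 8.6, 7.0, 6.5, 5.0, 4.4, 3.7, 3.7, 3.1]


def journals_needed(probs, threshold=50.0):
    """Return how many top-ranked entries are needed for the running
    total to reach the threshold (hypothetical helper for illustration)."""
    for count, total in enumerate(accumulate(probs), start=1):
        if total >= threshold:
            return count
    return None  # threshold not reached by the supplied entries


cumulative = list(accumulate(probabilities))
needed = journals_needed(probabilities)
print(f"{needed} journals reach {cumulative[needed - 1]:.1f}% of the probability mass")
```

With these values the running total is 49.3% after three journals and 55.8% after four, so four is the smallest number of journals that reaches 50%.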