Comparing natural language processing representations of disease sequences for prediction in the electronic healthcare record

Beaney, T.; Jha, S.; Alaa, A.; Smith, A.; Clarke, J.; Woodcock, T.; Majeed, A.; Aylin, P.; Barahona, M.

2023-11-16 health informatics
10.1101/2023.11.16.23298640 medRxiv
Natural language processing (NLP) is increasingly being applied to obtain unsupervised representations of electronic healthcare record (EHR) data, but their performance for the prediction of clinical endpoints remains unclear. Here we use primary care EHRs from 6,286,233 people with Multiple Long-Term Conditions in England to generate vector representations of sequences of disease development using two input strategies (212 disease categories versus 9,462 diagnostic codes) and different NLP algorithms (Latent Dirichlet Allocation, doc2vec and two transformer models designed for EHRs). We also develop a new transformer architecture, named EHR-BERT, which incorporates socio-demographic information. We then compare the use of each of these representations to predict mortality, healthcare use and new disease diagnosis. We find that representations generated using disease categories perform similarly to those using diagnostic codes, suggesting models can equally manage smaller or larger vocabularies. Sequence-based algorithms perform consistently better than bag-of-words methods, with the highest performance for EHR-BERT.
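The abstract contrasts two input strategies (coarse disease categories versus fine-grained diagnostic codes) and two model families (bag-of-words methods such as LDA versus sequence-based methods such as doc2vec and transformers). A minimal sketch of how these inputs differ, using entirely hypothetical codes and a hypothetical code-to-category mapping (the paper's actual vocabularies and preprocessing are not specified here):

```python
from collections import Counter

# Hypothetical patient record: an ordered sequence of diagnostic codes
# (stand-ins for the paper's larger ~9,462-code vocabulary).
codes = ["E11.9", "I10", "E11.9", "N18.3"]

# Hypothetical mapping from codes to coarser disease categories
# (stand-ins for the paper's smaller 212-category vocabulary).
code_to_category = {"E11.9": "diabetes", "I10": "hypertension", "N18.3": "ckd"}
categories = [code_to_category[c] for c in codes]

# Bag-of-words input: order is discarded, only counts remain.
# This is the form consumed by methods like Latent Dirichlet Allocation.
bow = Counter(categories)

# Sequence input: order of disease development is preserved.
# This is the form consumed by doc2vec and transformer models.
sequence = categories

print(dict(bow))
print(sequence)
```

The sequence input retains the temporal ordering of diagnoses that the bag-of-words input throws away, which is consistent with the abstract's finding that sequence-based algorithms outperform bag-of-words methods.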
