Denoising Longitudinal Social Media for Pandemic Monitoring

Lin, S.; Garay, L.; Hua, Y.; Guo, Z.; Xu, X.; Yang, J.

2024-06-30 | Public and global health
medRxiv | DOI: 10.1101/2024.06.29.24309690
Objective
Current studies leveraging social media data for disease monitoring face challenges such as noisy colloquial language and insufficient tracking of users' disease progression in longitudinal settings. This study aims to develop a pipeline for collecting, cleaning, and analyzing large-scale longitudinal social media data for disease monitoring, with a focus on the COVID-19 pandemic.

Materials and Methods
The pipeline begins by screening COVID-19 cases from tweets spanning February 1, 2020, to April 30, 2022. Longitudinal data are collected for each patient, covering two months before and three months after self-reporting. Symptoms are extracted using Named Entity Recognition (NER), followed by denoising with a combined Graph Convolutional Network (GCN) and Bidirectional Encoder Representations from Transformers (BERT) model to retain only User Symptom Mentions (USM). Symptoms are then mapped to standardized medical concepts using the Unified Medical Language System (UMLS). Finally, the study conducts symptom pattern analysis and visualization to illustrate temporal changes in symptom prevalence and co-occurrence.

Results
The study identified 191,096 self-reported COVID-19-positive cases from COVID-19-related tweets and retrospectively collected 811,398,280 historical tweets, of which 2,120,964 contained symptom information. After denoising, 39% (832,287) of symptom-sharing tweets reflected user-related mentions. The trained USM model achieved an F1 score of 0.926. Further analysis revealed a higher prevalence of upper respiratory tract symptoms during the Omicron period than during the Delta and wild-type periods, and a pronounced co-occurrence of lower respiratory tract and nervous system symptoms in the wild-type strain and Delta variant.

Conclusion
This study established a robust framework for pandemic monitoring via social media, integrating denoising of user-related symptom mentions with longitudinal data. The findings underscore the importance of denoising procedures in revealing accurate prevalence trends, thereby minimizing biases in symptom analysis.
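The extract-denoise-map pipeline described in the abstract can be sketched in miniature. This is a toy illustration, not the authors' code: the keyword lexicon stands in for the trained NER model, the first-person heuristic stands in for the GCN+BERT USM classifier, and the small dictionary stands in for the full UMLS Metathesaurus (the CUIs shown are the standard UMLS concepts for these symptoms).

```python
from collections import Counter

# Hypothetical symptom lexicon mapping surface forms to UMLS CUIs.
SYMPTOM_TO_UMLS = {
    "cough": "C0010200",
    "fever": "C0015967",
    "headache": "C0018681",
}

def extract_symptoms(tweet: str) -> list[str]:
    """Toy stand-in for NER: match lexicon terms in lowercased text."""
    text = tweet.lower()
    return [s for s in SYMPTOM_TO_UMLS if s in text]

def is_user_mention(tweet: str) -> bool:
    """Toy stand-in for the GCN+BERT denoiser: keep only tweets that
    read as first-person symptom reports (User Symptom Mentions)."""
    text = tweet.lower()
    return text.startswith(("i ", "i'm", "my ")) or " i " in text

def pipeline(tweets: list[str]) -> Counter:
    """Count UMLS concepts across denoised, symptom-bearing tweets."""
    counts: Counter = Counter()
    for tweet in tweets:
        if not is_user_mention(tweet):
            continue  # denoising: drop news and third-party mentions
        for symptom in extract_symptoms(tweet):
            counts[SYMPTOM_TO_UMLS[symptom]] += 1
    return counts

tweets = [
    "I have a fever and a bad cough today",
    "New study links fever to the latest variant",  # dropped by denoiser
    "My headache won't go away",
]
print(pipeline(tweets))  # fever, cough, and headache each counted once
```

Aggregating counts per calendar week over the longitudinal window (two months before to three months after self-reporting) would then yield the prevalence and co-occurrence trends the study analyzes.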

Matching journals

The top 2 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Journal of Medical Internet Research | 85 | Top 0.1% | 43.6%
2 | Journal of Biomedical Informatics | 45 | Top 0.2% | 6.7%
-- 50% of probability mass above this line --
3 | Scientific Reports | 3102 | Top 15% | 6.6%
4 | PLOS ONE | 4510 | Top 30% | 5.1%
5 | International Journal of Medical Informatics | 25 | Top 0.3% | 4.4%
6 | npj Digital Medicine | 97 | Top 1% | 3.2%
7 | BMC Medical Informatics and Decision Making | 39 | Top 1% | 2.7%
8 | Database | 51 | Top 0.3% | 2.0%
9 | JMIR Public Health and Surveillance | 45 | Top 1% | 2.0%
10 | IEEE Access | 31 | Top 0.3% | 1.9%
11 | JMIR Medical Informatics | 17 | Top 0.8% | 1.6%
12 | Frontiers in Psychiatry | 83 | Top 2% | 1.4%
13 | Journal of the American Medical Informatics Association | 61 | Top 1% | 1.4%
14 | Frontiers in Digital Health | 20 | Top 1% | 0.9%
15 | Frontiers in Public Health | 140 | Top 7% | 0.8%
16 | Data in Brief | 13 | Top 0.3% | 0.8%
17 | Patterns | 70 | Top 2% | 0.8%
18 | PLOS Digital Health | 91 | Top 3% | 0.8%
19 | EClinicalMedicine | 21 | Top 1% | 0.7%
20 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 2% | 0.7%
21 | JMIRx Med | 31 | Top 2% | 0.5%
22 | Eurosurveillance | 80 | Top 2% | 0.5%
23 | PLOS Computational Biology | 1633 | Top 28% | 0.5%
24 | BMC Infectious Diseases | 118 | Top 6% | 0.5%
25 | Wellcome Open Research | 57 | Top 3% | 0.5%
26 | Nature Communications | 4913 | Top 66% | 0.5%
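The "50% of probability mass" cutoff after rank 2 can be reproduced from the probabilities in the table. A minimal sketch, using the predicted probabilities copied from the rows above:

```python
import itertools

# Predicted probabilities (%) for the top-ranked journals, from the table.
probs = [43.6, 6.7, 6.6, 5.1, 4.4, 3.2, 2.7, 2.0, 2.0, 1.9]

# Find the smallest prefix of ranked journals whose cumulative
# probability mass reaches 50%.
cumulative = list(itertools.accumulate(probs))
cutoff = next(i + 1 for i, total in enumerate(cumulative) if total >= 50.0)
print(cutoff)  # → 2 (43.6 + 6.7 = 50.3, which crosses the 50% threshold)
```

This confirms the note in the list: the top two journals alone carry just over half of the predicted probability mass, with a long tail of low-probability matches below them.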