Scaling Sensor Metadata Extraction for Exposure Health Using LLMs

Shah-mohammadi, F.; Im, S.; Facelli, J.; Cummins, M.; Gouripeddi, R.

2025-08-26 occupational and environmental health
10.1101/2025.08.21.25334173 medRxiv

Background: The rapid evolution and diversity of sensor technologies, coupled with inconsistencies in how sensor metadata is reported across formats and sources, present significant challenges for generating exposomes and for exposure health research.

Objective: Despite the development of standardized metadata schemas, the process of extracting sensor metadata from unstructured sources remains largely manual and unscalable. To address this bottleneck, we developed and evaluated a large language model (LLM)-based pipeline for automating sensor metadata extraction and harmonization from publicly available exposure health literature.

Methods: Using GPT-4 in a zero-shot setting, we constructed a pipeline that parses full-text PDFs to extract metadata and harmonizes the output into structured formats.

Results: Our automated pipeline completed extractions much faster than manual review and demonstrated strong performance, with average accuracy and precision of 94.74%, recall of 100%, and F1-score of 97.28%.

Conclusions: This study demonstrates the feasibility and scalability of leveraging LLMs to automate sensor metadata extraction for exposure health, reducing manual burden while enhancing metadata completeness and consistency. Our findings support the integration of LLM-driven pipelines into exposure health informatics platforms.
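For readers unfamiliar with the metrics in the abstract, the following is a minimal sketch of the standard definitions of precision, recall, and F1-score computed from counts of extracted metadata fields. The function name and the counts are hypothetical and are not taken from the study, which reports averages whose exact aggregation is not described in the abstract.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one document: 9 correctly extracted fields,
# 1 spurious field, and no missed fields. Not data from the paper.
precision, recall, f1 = precision_recall_f1(tp=9, fp=1, fn=0)
print(f"precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
```

When no fields are missed (no false negatives), recall is 100% and the F1-score is determined by precision alone, which is the pattern the reported results show.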

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. Environmental Health Perspectives | 17 papers in training set | Top 0.1% | 20.1%
2. The Innovation | 12 papers in training set | Top 0.1% | 7.1%
3. Scientific Reports | 3102 papers in training set | Top 15% | 6.6%
4. Environment International | 42 papers in training set | Top 0.2% | 6.5%
5. PLOS ONE | 4510 papers in training set | Top 30% | 5.0%
6. International Journal of Epidemiology | 74 papers in training set | Top 0.5% | 3.7%
7. Environmental Research | 46 papers in training set | Top 0.4% | 3.7%

50% of probability mass above

8. Science of The Total Environment | 179 papers in training set | Top 2% | 2.8%
9. International Journal of Environmental Research and Public Health | 124 papers in training set | Top 3% | 2.7%
10. ACS Nano | 99 papers in training set | Top 1% | 2.7%
11. Environmental Science & Technology | 64 papers in training set | Top 1% | 2.5%
12. Scientific Data | 174 papers in training set | Top 0.8% | 2.2%
13. Indoor Air | 10 papers in training set | Top 0.1% | 2.2%
14. Nature | 575 papers in training set | Top 10% | 1.8%
15. Open Research Europe | 14 papers in training set | Top 0.1% | 1.7%
16. Environmental Research Letters | 15 papers in training set | Top 0.4% | 1.4%
17. Environmental Pollution | 35 papers in training set | Top 2% | 1.3%
18. Toxicological Sciences | 38 papers in training set | Top 0.4% | 1.1%
19. eBioMedicine | 130 papers in training set | Top 3% | 1.0%
20. GeoHealth | 10 papers in training set | Top 0.5% | 1.0%
21. Journal of Agricultural and Food Chemistry | 14 papers in training set | Top 0.9% | 1.0%
22. PLOS Global Public Health | 293 papers in training set | Top 5% | 0.9%
23. Journal of the American Society for Mass Spectrometry | 33 papers in training set | Top 0.4% | 0.9%
24. Frontiers in Bioengineering and Biotechnology | 88 papers in training set | Top 2% | 0.8%
25. Sensors | 39 papers in training set | Top 2% | 0.8%
26. American Journal of Infection Control | 12 papers in training set | Top 0.3% | 0.8%
27. ACS ES&T Water | 18 papers in training set | Top 0.4% | 0.8%
28. Communications Biology | 886 papers in training set | Top 22% | 0.8%
29. Bioengineering | 24 papers in training set | Top 1% | 0.7%
30. Frontiers in Public Health | 140 papers in training set | Top 9% | 0.5%
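
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked with a running sum over the listed probabilities. The sketch below assumes the cutoff is the first rank at which the cumulative probability reaches 50%; the page does not state its exact rule, and the function name is hypothetical.

```python
from itertools import accumulate

# Predicted probabilities (percent) for the ten highest-ranked journals,
# copied in rank order from the list above.
probs = [20.1, 7.1, 6.6, 6.5, 5.0, 3.7, 3.7, 2.8, 2.7, 2.7]


def journals_to_reach(probabilities, threshold):
    """Return how many top-ranked entries are needed for the running sum
    to reach the threshold, or None if it is never reached."""
    for count, total in enumerate(accumulate(probabilities), start=1):
        if total >= threshold:
            return count
    return None


n = journals_to_reach(probs, 50.0)
print(n)
```

With these values the running sum is about 49.0% after six journals and about 52.7% after seven, so the seventh journal is the first to push the total past the 50% mark.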