
A standardized atlas of human bronchoalveolar lavage cells built using scalable ensemble annotation and cross-study robust markers

Hu, Y.; Liu, Z.; Bai, K.; Moa, B.; Leung, J. M.; V.Gerayeli, F.; Shao, X.; Sin, D.; Zhang, X.

2026-01-09 genomics
10.64898/2026.01.08.698293 bioRxiv

Background: Bronchoalveolar lavage (BAL) single-cell RNA sequencing (scRNA-seq) offers rich insights into pulmonary immune dynamics, yet consistent cell-type annotation remains elusive. Existing methods often rely on a single reference, risking inconsistency and domain shift across datasets. A BAL-specific, high-resolution annotation framework is critically needed.
Methods: We developed BAL-EA (BAL Ensemble Annotation), a BAL-centric automated annotation framework that integrates robust, cross-study marker discovery with ensemble machine learning. BAL-EA harmonizes BAL cell identities into a three-tier taxonomy (11 major lineages, 13 refined classes, 21 fine-grained subtypes) that is compatible with the Human Lung Cell Atlas (HLCA) while capturing lavage-enriched biology. Marker catalogues were derived via reproducibility-guided differential expression across at least 10 independent sub-studies, ensuring resilience to dataset-specific bias. Comparative benchmarking was performed against six leading annotation tools using independent BAL datasets.
Results: We assembled the largest BAL scRNA-seq atlas to date, integrating more than 347,333 lung cells from HLCA, multiple public BAL datasets, and the largest in-house BAL cohort ever reported (241,924 cells from 30 individuals). BAL-EA outperformed existing annotation tools, achieving balanced macro-F1 scores over 0.95 for key lineages such as alveolar macrophages (AM), non-alveolar macrophages, and epithelial cells. Application to chronic obstructive pulmonary disease (COPD) BAL samples revealed reproducible disease-associated shifts, including increased neutrophils and CCL2-positive macrophages alongside reduced AM in COPD patients; these findings were validated in independent COVID-19 BAL datasets. The released atlas includes harmonized multi-resolution annotations, robust marker panels, and pretrained models.
Conclusions: This work contributes the most comprehensive BAL scRNA-seq atlas, introduces a novel BAL-specific annotation framework (BAL-EA), standardizes BAL taxonomy at three resolutions, and provides rigorously validated marker gene resources. Together, these advances deliver a powerful reference for reproducible BAL scRNA-seq analysis and lay the foundation for clinical and translational applications in respiratory disease research.
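
The abstract does not spell out how "reproducibility-guided" marker selection works. One plausible reading, offered here only as a minimal sketch and not as the authors' method, is to keep a gene as a marker for a cell type only if it is called differentially expressed in at least some minimum number of independent sub-studies. The function name `reproducible_markers`, the study labels, and the gene placeholders below are all hypothetical.

```python
from collections import Counter

def reproducible_markers(de_hits_by_study, min_studies):
    """Return genes called as markers in at least `min_studies` sub-studies.

    de_hits_by_study maps a study label to the genes that study called
    differentially expressed for one cell type (hypothetical input format).
    """
    counts = Counter()
    for genes in de_hits_by_study.values():
        counts.update(set(genes))  # count each gene at most once per study
    return sorted(gene for gene, n in counts.items() if n >= min_studies)

# Toy input: placeholder study labels and gene names, not real results.
toy_hits = {
    "study_a": {"GENE_A", "GENE_B", "GENE_C"},
    "study_b": {"GENE_A", "GENE_B"},
    "study_c": {"GENE_A", "GENE_D"},
}
markers = reproducible_markers(toy_hits, min_studies=2)
print(markers)
```

With a threshold of 10 sub-studies, as mentioned in the abstract, the same filter would discard any gene supported by fewer than 10 studies; the actual statistical criteria used by BAL-EA may differ.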

Matching journals

The top 6 journals account for 50% of the predicted probability mass.

1 | American Journal of Respiratory Cell and Molecular Biology | 38 papers in training set | Top 0.1% | 13.0%
2 | Nature Communications | 4913 papers in training set | Top 13% | 12.6%
3 | Genome Medicine | 154 papers in training set | Top 0.5% | 9.3%
4 | Thorax | 32 papers in training set | Top 0.1% | 7.3%
5 | EBioMedicine | 39 papers in training set | Top 0.1% | 7.3%
6 | Scientific Reports | 3102 papers in training set | Top 30% | 4.0%
50% of probability mass above
7 | Bioinformatics | 1061 papers in training set | Top 5% | 3.7%
8 | European Respiratory Journal | 54 papers in training set | Top 0.7% | 2.1%
9 | Respiratory Research | 19 papers in training set | Top 0.2% | 1.9%
10 | Cell Reports | 1338 papers in training set | Top 23% | 1.7%
11 | NAR Genomics and Bioinformatics | 214 papers in training set | Top 2% | 1.7%
12 | Cell Genomics | 162 papers in training set | Top 3% | 1.7%
13 | American Journal of Respiratory and Critical Care Medicine | 39 papers in training set | Top 0.5% | 1.7%
14 | Nature Genetics | 240 papers in training set | Top 5% | 1.5%
15 | Journal of Translational Medicine | 46 papers in training set | Top 1% | 1.5%
16 | JCI Insight | 241 papers in training set | Top 4% | 1.4%
17 | Life Science Alliance | 263 papers in training set | Top 0.6% | 1.3%
18 | PLOS ONE | 4510 papers in training set | Top 60% | 1.3%
19 | Frontiers in Immunology | 586 papers in training set | Top 6% | 1.0%
20 | Nucleic Acids Research | 1128 papers in training set | Top 15% | 1.0%
21 | PLOS Computational Biology | 1633 papers in training set | Top 22% | 0.9%
22 | International Journal of Epidemiology | 74 papers in training set | Top 2% | 0.9%
23 | Computational and Structural Biotechnology Journal | 216 papers in training set | Top 7% | 0.9%
24 | The Journal of Infectious Diseases | 182 papers in training set | Top 4% | 0.8%
25 | GigaScience | 172 papers in training set | Top 3% | 0.8%
26 | iScience | 1063 papers in training set | Top 28% | 0.8%
27 | Nature Medicine | 117 papers in training set | Top 4% | 0.8%
28 | BMC Genomics | 328 papers in training set | Top 5% | 0.8%
29 | Database | 51 papers in training set | Top 0.9% | 0.8%
30 | Genome Biology | 555 papers in training set | Top 7% | 0.8%
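
The "top 6 journals account for 50%" statement can be checked from the listed percentages. The sketch below assumes the cutoff is the smallest number of top-ranked journals whose cumulative predicted probability reaches 50%; that rule is an assumption, since the page does not define it. The helper name `journals_for_mass` is hypothetical.

```python
def journals_for_mass(probs, target=50.0):
    """Return (rank, cumulative mass) at the first rank whose running total
    of probabilities (in percent, sorted descending) reaches `target`."""
    total = 0.0
    for rank, p in enumerate(probs, start=1):
        total += p
        if total >= target:
            return rank, total
    return None  # target never reached with the given probabilities

# Predicted probabilities (%) of the eight top-ranked journals listed above.
top_probs = [13.0, 12.6, 9.3, 7.3, 7.3, 4.0, 3.7, 2.1]
rank, mass = journals_for_mass(top_probs)
print(rank, round(mass, 1))
```

Under this rule the first five journals sum to 49.5%, just short of the threshold, and the sixth brings the total to 53.5%.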