DIANA: Deep Learning Identification and Assessment of Ancient DNA

Duitama Gonzalez, C.; Lopopolo, M.; Nishimura, L.; Faure, R.; Duchene, S.

2026-04-10 bioinformatics
10.64898/2026.04.09.717429 bioRxiv

The field of ancient metagenomics provides insights into past microbiomes, but as datasets grow, methods that rely on reference databases have limited scope. Here, we introduce DIANA, a multi-task neural network that predicts key metadata categories from unitig abundances. Trained on 2,597 run accessions (1.72 Tbp of assembled unitig sequences), DIANA accurately identifies sample host (94.6%), community type (90.0%), and material (88.9%) on held-out test data and demonstrates robust generalisation on an independent validation set. A key innovation is DIANA's ability to perform semantic generalisation: it correctly assigns samples with labels unseen during training, such as novel subspecies, to their appropriate parent categories. By leveraging both known and uncharacterised genomic sequences, DIANA provides a rapid, data-driven system for metadata validation and quality control, accelerating discovery in ancient metagenomics research.

Matching journals

The top 5 journals together account for just over 50% (53.8%) of the predicted probability mass.

1 | Nature Biotechnology | 147 papers in training set | Top 0.1% | 25.6%
2 | Nature Communications | 4913 papers in training set | Top 19% | 10.0%
3 | Nature Methods | 336 papers in training set | Top 1% | 7.1%
4 | Genome Biology | 555 papers in training set | Top 1% | 6.3%
5 | Nature Microbiology | 133 papers in training set | Top 0.6% | 4.8%
50% of probability mass above
6 | Microbiome | 139 papers in training set | Top 0.7% | 4.8%
7 | Nucleic Acids Research | 1128 papers in training set | Top 6% | 3.6%
8 | Nature | 575 papers in training set | Top 7% | 3.6%
9 | Cell | 370 papers in training set | Top 6% | 3.6%
10 | Cell Systems | 167 papers in training set | Top 5% | 3.0%
11 | Advanced Science | 249 papers in training set | Top 9% | 2.1%
12 | Genome Medicine | 154 papers in training set | Top 4% | 1.9%
13 | Science | 429 papers in training set | Top 14% | 1.7%
14 | Cell Genomics | 162 papers in training set | Top 4% | 1.6%
15 | Cell Host & Microbe | 113 papers in training set | Top 3% | 1.5%
16 | Bioinformatics | 1061 papers in training set | Top 8% | 1.2%
17 | Cell Reports Methods | 141 papers in training set | Top 3% | 1.2%
18 | Scientific Reports | 3102 papers in training set | Top 69% | 0.9%
19 | PLOS Computational Biology | 1633 papers in training set | Top 22% | 0.9%
20 | Briefings in Bioinformatics | 326 papers in training set | Top 6% | 0.9%
21 | Nature Genetics | 240 papers in training set | Top 7% | 0.9%
22 | Nature Machine Intelligence | 61 papers in training set | Top 3% | 0.9%
23 | PLOS ONE | 4510 papers in training set | Top 66% | 0.8%
24 | Nature Computational Science | 50 papers in training set | Top 2% | 0.7%
25 | Cell Reports | 1338 papers in training set | Top 35% | 0.7%
26 | Bioinformatics Advances | 184 papers in training set | Top 5% | 0.6%
27 | mSystems | 361 papers in training set | Top 8% | 0.6%
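
The 50% cutoff marked in the list above can be checked with a short cumulative sum. The sketch below is only an illustration: the probabilities are copied from the listing in rank order, while the function name `journals_to_reach` and the variable names are hypothetical and not part of the listing site or of DIANA.

```python
# Predicted probabilities (percent) for the 27 listed journals, in rank order.
probs = [
    25.6, 10.0, 7.1, 6.3, 4.8, 4.8, 3.6, 3.6, 3.6, 3.0,
    2.1, 1.9, 1.7, 1.6, 1.5, 1.2, 1.2, 0.9, 0.9, 0.9,
    0.9, 0.9, 0.8, 0.7, 0.7, 0.6, 0.6,
]


def journals_to_reach(probs, threshold):
    """Return (k, cumulative) for the smallest k whose top-k sum reaches threshold.

    Returns (None, total) if the threshold is never reached.
    """
    total = 0.0
    for k, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return k, total
    return None, total


k, mass = journals_to_reach(probs, 50.0)
print(k, round(mass, 1))
```

With these values the top four journals sum to 49.0%, so the fifth journal is the first to push the cumulative mass past 50%, which matches the position of the cutoff line.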