
An improved generic schema for high fidelity data linkage and sample tracing across complex multi-assay medical entomology studies

Kavishe, D. R.; Msoffe, R. V.; Mmbaga, S.; Tarimo, L. J.; Butler, F.; Kaindoa, E. W.; Govella, N. J.; Kiware, S. S.; Killeen, G.

2026-05-13 bioinformatics
10.64898/2026.05.11.724183 bioRxiv

Evidence-based decision making on malaria vector control strategies increasingly relies on triangulation of data, which requires informatics systems that can integrate data from complex, multi-stage studies involving mosquitoes. This manuscript describes a performance evaluation of an extended version of the generic schema underpinning the VBDs360 platform, specifically improved to accommodate multiple distinct entomological assays spanning the field, insectary and laboratory. The utility of this extension, with respect to high-fidelity data linkage and robust sample traceability across complex entomological workflows, was evaluated through a case study conducted in southern Tanzania. Wild female mosquitoes were collected from 40 locations across a >4,000 km² area and then reared through multiple generations in an insectary before derived iso-female lineages were tested for phenotypic susceptibility to a pyrethroid insecticide. Such multi-generational lineages (F to Fn, where n ≥ 2) were propagated to prevent non-heritable maternal effects on phenotype and to produce enough progeny for standard WHO susceptibility assays. All samples were subsequently archived in a molecular laboratory, where all F specimens were tested for sibling species identity. A paper-based implementation of the extended schema enabled successful integration of 77,017 lines of data distributed across 6 different tables that spanned 3 distinct field, insectary, and laboratory workflows, implemented by three different teams working in different locations. At each step, fully independent and redundant primary and secondary keys enabled high-fidelity error correction and sample tracing. Consistently perfect linkage between assay design and sample sorting data was achieved for F0 wild-caught adults, with 100% of 66,108 records successfully linked between field capture and morphological categorization.
This complete traceability extended to the propagation of derived Fn lineages, with all 100 and 243 records from 9 adult-derived and 13 larval-derived lineages, respectively, correctly linked. Insecticide susceptibility phenotype data further confirmed 100% linkage for 5,654 records between exposure history and recorded mortality outcomes in the insectary. Although such cross-cleaned linkages to sample analysis and storage data recorded by the laboratory team were not perfect and could be improved, they were nevertheless of very high fidelity: 97.3% (1,967/2,022) for F0 samples and 99.3% (437/440) for Fn samples. Overall, this pilot application of the extended generic schema ensured robust data provenance and minimized transcription errors in this complex study distributed across multiple teams and locations. These findings demonstrate how this generic informatics framework may be scaled and adapted to support data integrity across diverse, large-scale, multi-team entomological research workflows.
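The idea of linking records through redundant primary and secondary keys can be illustrated with a minimal Python sketch. It is not the VBDs360 schema or the authors' implementation: the record fields, key formats, and function names below are hypothetical, and the matching rule (accept a link only when both keys point to the same upstream record, flag it for correction when they disagree) is an assumption made for illustration. The `linkage_rate` helper simply reproduces the percentage arithmetic reported in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    # Both fields are hypothetical stand-ins for the schema's keys.
    primary_key: str    # e.g. a serial number written on the sample label
    secondary_key: str  # e.g. an independently recorded form and row code


def link_records(upstream, downstream):
    """Link downstream records to upstream ones using two redundant keys.

    Returns three lists:
      linked   - (upstream, downstream) pairs where both keys agree
      flagged  - (downstream, primary_match, secondary_match) tuples where
                 only one key matched, or the keys point to different records
      unlinked - downstream records where neither key matched
    """
    by_primary = {r.primary_key: r for r in upstream}
    by_secondary = {r.secondary_key: r for r in upstream}
    linked, flagged, unlinked = [], [], []
    for rec in downstream:
        p = by_primary.get(rec.primary_key)
        s = by_secondary.get(rec.secondary_key)
        if p is not None and p is s:
            linked.append((p, rec))          # both keys agree
        elif p is None and s is None:
            unlinked.append(rec)             # nothing to recover from
        else:
            flagged.append((rec, p, s))      # one key can correct the other
    return linked, flagged, unlinked


def linkage_rate(n_linked, n_total):
    """Percentage of records linked, rounded to one decimal place."""
    return round(100.0 * n_linked / n_total, 1)


# Hypothetical example data: the second downstream record has a
# transcription error in its primary key (letter O instead of zero).
upstream = [SampleRecord("A001", "F1-R1"), SampleRecord("A002", "F1-R2")]
downstream = [
    SampleRecord("A001", "F1-R1"),
    SampleRecord("A0O2", "F1-R2"),
    SampleRecord("Z999", "F9-R9"),
]
linked, flagged, unlinked = link_records(upstream, downstream)
```

In this sketch the mistranscribed record is not lost: its secondary key still identifies the intended upstream record, so it is flagged for correction rather than dropped. That is one plausible reading of how independent keys support error correction.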

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

1. PLOS ONE: 4510 papers in training set, Top 9%, 18.7%
2. Scientific Data: 174 papers in training set, Top 0.2%, 8.5%
3. Parasites & Vectors: 57 papers in training set, Top 0.2%, 7.2%
4. Nature Communications: 4913 papers in training set, Top 29%, 6.4%
5. Scientific Reports: 3102 papers in training set, Top 26%, 4.4%
6. GigaScience: 172 papers in training set, Top 0.3%, 4.3%
7. PLOS Computational Biology: 1633 papers in training set, Top 9%, 3.6%
50% of probability mass above
8. Insects: 36 papers in training set, Top 0.3%, 3.6%
9. PLOS Neglected Tropical Diseases: 378 papers in training set, Top 2%, 2.5%
10. BMC Biology: 248 papers in training set, Top 0.8%, 1.9%
11. BMC Bioinformatics: 383 papers in training set, Top 4%, 1.9%
12. PLOS Global Public Health: 293 papers in training set, Top 3%, 1.9%
13. Malaria Journal: 48 papers in training set, Top 0.8%, 1.9%
14. eLife: 5422 papers in training set, Top 38%, 1.9%
15. Computational and Structural Biotechnology Journal: 216 papers in training set, Top 4%, 1.7%
16. Gigabyte: 60 papers in training set, Top 0.7%, 1.7%
17. Peer Community Journal: 254 papers in training set, Top 2%, 1.7%
18. Bioinformatics Advances: 184 papers in training set, Top 3%, 1.7%
19. Methods in Ecology and Evolution: 160 papers in training set, Top 2%, 1.2%
20. PLOS Biology: 408 papers in training set, Top 13%, 1.2%
21. Communications Biology: 886 papers in training set, Top 17%, 1.0%
22. Epidemics: 104 papers in training set, Top 1%, 0.9%
23. Genome Medicine: 154 papers in training set, Top 8%, 0.8%
24. The Lancet Microbe: 43 papers in training set, Top 1%, 0.7%
25. BMC Genomics: 328 papers in training set, Top 7%, 0.6%
26. Molecular Ecology Resources: 161 papers in training set, Top 1%, 0.6%
27. Advanced Science: 249 papers in training set, Top 23%, 0.5%
28. mSphere: 281 papers in training set, Top 7%, 0.5%