An Expert-Informed Synthetic Animal Data Generator: A Physiology-Consistent Generative Framework for High-Fidelity Animal Digital Twins

Youssef, A.; Sun, C.; Norton, T.

2026-04-27 bioengineering
10.64898/2026.04.23.720335 bioRxiv

Digital twins are increasingly recognized as a transformative technology for precision livestock farming; however, a major bottleneck in their development remains the scarcity of high-quality, high-granularity physiological data. This study introduces the expert-informed conditional diffusion (EICD) framework, a novel approach to synthesizing high-fidelity metabolic time-series trajectories by embedding mechanistic biological principles directly into the generative process. While traditional generative models often prioritize statistical pattern-matching over biological reality, frequently resulting in physiological hallucinations, the EICD framework utilizes a physiology loss function (PhLF) to act as a form of mechanistic regularization. This guardrail penalizes samples that contradict expert-defined constraints, such as the laws of porcine bioenergetics, effectively steering the model toward a realistic physiological manifold. The framework was validated using an empirical dataset of growing pigs under varying thermal conditions. Quantitative results demonstrate near-perfect statistical distributional fidelity, with the model achieving an average Jensen-Shannon divergence (JSD) of 0.062 and a Kullback-Leibler divergence (KLD) of 0.19. The full EICD model produced a mean energy expenditure (EE) of 284.94 ± 38.70 kJ/kg/day, mirroring the empirical average of 281.33 ± 41.58 kJ/kg/day. In contrast, the standard generative diffusion model (i.e., with no physiology guardrail) exhibited significant distributional drift, yielding a mean EE of 334.41 kJ/kg/day. The biological integrity of the model was further assessed using the biological violation rate (BVR), a novel metric defined as the percentage of generated samples that fall outside the physically possible metabolic boundaries established by species-specific laws. While the standard diffusion model produced frequent biological artifacts, the EICD framework successfully suppressed these hallucinations, ensuring that synthetic trajectories remain strictly grounded in mechanistic laws. Despite these advancements, limitations remain at physiological extremes where individual stochasticity is high. By providing a reliable method for generating physiology-consistent synthetic data, this framework provides a robust foundation for the next generation of animal digital twins.

Highlights

- A novel expert-informed conditional diffusion (EICD) framework is proposed for physiology-consistent synthetic data generation in precision livestock farming.
- A physiology loss function (PhLF) embeds species-specific bioenergetic laws directly into the generative process as a mechanistic guardrail.
- The framework achieves near-perfect distributional fidelity (JSD = 0.062) while suppressing physiological hallucinations (BVR = 0.93%).
- An ablation study confirms that biological consistency is not an emergent property of standard diffusion models but requires explicit mechanistic constraints.
- The framework provides a scalable solution for synthetic data augmentation in precision livestock farming, supporting the 3Rs and enabling high-throughput in silico experimentation.
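The two kinds of metric named in the abstract, distributional fidelity (JSD, KLD) and the biological violation rate (BVR), can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the histogram binning, the base-2 logarithm, and the simple lower/upper bound check are all assumptions made for illustration, and the abstract does not state the actual metabolic boundaries or how the divergences were estimated.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Normalized histogram over shared bin edges (assumed binning scheme).

    Values outside [edges[0], edges[-1]] are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits (base 2 is an assumption)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p has mass where q has none
            total += pi * math.log2(pi / qi)
    return total


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] with base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def biological_violation_rate(
    samples: Sequence[float], lower: float, upper: float
) -> float:
    """Percentage of samples outside [lower, upper].

    The bounds stand in for the expert-defined metabolic limits; the values
    passed in are hypothetical, not taken from the paper.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    violations = sum(1 for s in samples if s < lower or s > upper)
    return 100.0 * violations / len(samples)
```

In use, empirical and synthetic energy-expenditure values would be binned on the same edges with `histogram`, compared with `js_divergence`, and the synthetic values passed to `biological_violation_rate` together with whatever bounds the domain experts specify.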

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Computational and Structural Biotechnology Journal | 216 | Top 0.1% | 12.8%
2 | PLOS Computational Biology | 1633 | Top 3% | 10.2%
3 | Advanced Science | 249 | Top 2% | 7.2%
4 | Journal of The Royal Society Interface | 189 | Top 0.4% | 6.9%
5 | Frontiers in Bioengineering and Biotechnology | 88 | Top 0.3% | 4.9%
6 | Computers in Biology and Medicine | 120 | Top 0.4% | 4.9%
7 | Scientific Reports | 3102 | Top 33% | 3.7%
(50% of probability mass above)
8 | Imaging Neuroscience | 242 | Top 1% | 3.7%
9 | Annals of Biomedical Engineering | 34 | Top 0.3% | 3.7%
10 | Nature Communications | 4913 | Top 40% | 3.6%
11 | npj Digital Medicine | 97 | Top 1% | 3.6%
12 | PLOS ONE | 4510 | Top 44% | 2.6%
13 | Plant Phenomics | 17 | Top 0.1% | 2.1%
14 | npj Systems Biology and Applications | 99 | Top 1.0% | 1.8%
15 | IEEE Access | 31 | Top 0.4% | 1.7%
16 | GigaScience | 172 | Top 2% | 1.2%
17 | Journal of Neural Engineering | 197 | Top 1% | 1.2%
18 | Epidemics | 104 | Top 1% | 1.1%
19 | Bioinformatics | 1061 | Top 8% | 1.0%
20 | iScience | 1063 | Top 24% | 1.0%
21 | Methods in Ecology and Evolution | 160 | Top 2% | 0.9%
22 | Network Neuroscience | 116 | Top 0.9% | 0.9%
23 | IEEE Transactions on Biomedical Engineering | 38 | Top 0.9% | 0.8%
24 | eLife | 5422 | Top 55% | 0.8%
25 | Proceedings of the National Academy of Sciences | 2130 | Top 46% | 0.7%
26 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 2% | 0.6%
27 | Physical Review X | 23 | Top 0.7% | 0.6%
28 | Physical Review Research | 46 | Top 1% | 0.6%
29 | Frontiers in Computational Neuroscience | 53 | Top 2% | 0.6%
30 | Communications Biology | 886 | Top 29% | 0.6%