An Expert-Informed Synthetic Animal Data Generator: A Physiology-Consistent Generative Framework for High-Fidelity Animal Digital Twins
Youssef, A.; Sun, C.; Norton, T.
Digital twins are increasingly recognized as a transformative technology for precision livestock farming; however, a major bottleneck in their development remains the scarcity of high-quality, high-granularity physiological data. This study introduces the expert-informed conditional diffusion (EICD) framework, a novel approach to synthesizing high-fidelity metabolic time-series trajectories by embedding mechanistic biological principles directly into the generative process. While traditional generative models often prioritize statistical pattern-matching over biological reality, frequently resulting in physiological hallucinations, the EICD framework utilizes a physiology loss function (PhLF) to act as a form of mechanistic regularization. This guardrail penalizes samples that contradict expert-defined constraints, such as the laws of porcine bioenergetics, effectively steering the model toward a realistic physiological manifold. The framework was validated using an empirical dataset of growing pigs under varying thermal conditions. Quantitative results demonstrate near-perfect statistical distributional fidelity, with the model achieving an average Jensen-Shannon divergence (JSD) of 0.062 and a Kullback-Leibler divergence (KLD) of 0.19. The full EICD model produced a mean energy expenditure (EE) of 284.94 ± 38.70 kJ/kg/day, mirroring the empirical average of 281.33 ± 41.58 kJ/kg/day. In contrast, the standard generative diffusion model (i.e., with no physiology guardrail) exhibited significant distributional drift, yielding a mean EE of 334.41 kJ/kg/day. The biological integrity of the model was further assessed using the biological violation rate (BVR), a novel metric defined as the percentage of generated samples that fall outside the physically possible metabolic boundaries established by species-specific laws.
While the standard diffusion model produced frequent biological artifacts, the EICD framework successfully suppressed these hallucinations, ensuring that synthetic trajectories remain strictly grounded in mechanistic laws. Despite these advancements, limitations remain at physiological extremes where individual stochasticity is high. By providing a reliable method for generating physiology-consistent synthetic data, this framework provides a robust foundation for the next generation of animal digital twins.
Highlights
- A novel expert-informed conditional diffusion (EICD) framework is proposed for physiology-consistent synthetic data generation in precision livestock farming.
- A physiology loss function (PhLF) embeds species-specific bioenergetic laws directly into the generative process as a mechanistic guardrail.
- The framework achieves near-perfect distributional fidelity (JSD = 0.062) while suppressing physiological hallucinations (BVR = 0.93%).
- An ablation study confirms that biological consistency is not an emergent property of standard diffusion models but requires explicit mechanistic constraints.
- The framework provides a scalable solution for synthetic data augmentation in precision livestock farming, supporting the 3Rs and enabling high-throughput in silico experimentation.
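The evaluation quantities named in the abstract (JSD, BVR, and a boundary penalty in the spirit of the PhLF) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the histogram-based JSD estimate, the natural-log base, the squared-hinge form of the penalty, and the 150 to 450 kJ/kg/day bounds are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def histogram(values, lo, hi, bins):
    """Count values into equal-width bins over [lo, hi]; values outside are ignored."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return counts


def jensen_shannon_divergence(p, q):
    """JSD between two discrete distributions (natural log, so the maximum is ln 2)."""
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute 0 by convention.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def biological_violation_rate(samples, lower, upper):
    """Percentage of samples falling outside the allowed [lower, upper] range."""
    violations = sum(1 for s in samples if s < lower or s > upper)
    return 100.0 * violations / len(samples)


def physiology_penalty(samples, lower, upper):
    """Mean squared distance outside [lower, upper]; zero for in-range samples.

    Assumed squared-hinge form, standing in for the paper's PhLF.
    """
    total = 0.0
    for s in samples:
        total += max(lower - s, 0.0) ** 2 + max(s - upper, 0.0) ** 2
    return total / len(samples)


if __name__ == "__main__":
    # Hypothetical energy-expenditure samples in kJ/kg/day (not data from the study).
    empirical = [250.0, 270.0, 281.0, 290.0, 310.0]
    synthetic = [255.0, 275.0, 285.0, 295.0, 500.0]
    LOWER, UPPER = 150.0, 450.0  # hypothetical physiological bounds

    p = histogram(empirical, 100.0, 600.0, 10)
    q = histogram(synthetic, 100.0, 600.0, 10)
    print("JSD:", jensen_shannon_divergence(p, q))
    print("BVR (%):", biological_violation_rate(synthetic, LOWER, UPPER))
    print("Penalty:", physiology_penalty(synthetic, LOWER, UPPER))
```

In a training loop, a penalty of this kind would typically be added to the diffusion objective with a weighting coefficient, so that out-of-range samples raise the loss while in-range samples are unaffected; how the paper weights or formulates its constraint is not stated in the abstract.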
Matching journals
The top 7 journals account for 50% of the predicted probability mass.