A Reproducible Health Informatics Pipeline for Simulating and Integrating Early-Phase Oncology Clinical, Biomarker, and Pharmacokinetic Data for Exploratory Decision-Support Analytics

Petalcorin, M. I. R.

2026-04-02 health informatics
10.64898/2026.03.27.26349538 medRxiv

Background: Early-phase oncology development increasingly depends on integrated interpretation of clinical outcomes, translational biomarkers, and pharmacokinetic exposure rather than toxicity alone. This shift has created a need for reproducible analytical workflows that can combine heterogeneous trial data into traceable, analysis-ready outputs suitable for exploratory review and early decision support.

Objective: To develop a reproducible Python-based workflow that simulates a plausible early-phase oncology study, integrates clinical, biomarker, and pharmacokinetic data, and generates analysis-ready datasets, visual summaries, and exploratory predictive models relevant to early development analytics.

Methods: A workflow was constructed to simulate an early-phase oncology cohort of 120 patients distributed across multiple dose levels. Three synthetic raw data sources were generated: patient-level clinical data, baseline biomarker data, and longitudinal pharmacokinetic profiles. These sources were merged into a single analysis-ready dataset containing derived variables such as tumor percent change from baseline, clinical-benefit status, exposure summaries, adverse-event indicators, and survival outcomes. The workflow produced structured tables, patient listings, waterfall plots, Kaplan-Meier-style survival curves, biomarker-response visualizations, pharmacokinetic profile plots, and exploratory machine-learning outputs.

Results: The final integrated dataset contained 120 patients and 30 variables. Median survival across the simulated cohort was 243.8 days, and higher dose groups showed improved median survival and greater clinical benefit relative to the low-dose group. Clinical benefit increased from 8.6% in the low-dose group to 29.0% in the medium-dose group and 45.2% in the high-dose group. Higher baseline LDH, CRP, and ctDNA fraction tracked with less favorable tumor-response trajectories, whereas higher exposure, reflected by AUC and Cmax, was associated with improved disease control. Pharmacokinetic profiles showed clear dose-dependent separation. Grade 3 or higher adverse-event rates remained within a plausible exploratory range across dose groups. A random-forest model for clinical benefit achieved an exploratory ROC AUC of 0.845, while a logistic-regression model for strict responder status could not be fit because no simulated patient met the prespecified objective response threshold.

Conclusions: This proof-of-concept demonstrates that a transparent Python workflow can generate a coherent early-phase oncology analytical ecosystem from synthetic inputs. The workflow supports integration of heterogeneous data streams, derivation of analysis-ready variables, production of interpretable outputs, and exploratory modeling in a reproducible framework. Although the simulated responder prevalence was too low to support objective response modeling, this limitation itself highlights the importance of simulation calibration for downstream analytical validity. The framework provides a practical health informatics demonstration of how early oncology trial data can be structured and analyzed for exploratory translational decision support.
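
The merge-and-derive step described in the Methods can be illustrated with a minimal sketch. This is not the author's code: the dose levels, sampling times, concentration curve, noise model, field names, and the 20% growth cutoff used for clinical benefit are all assumptions chosen for illustration, and only the Python standard library is used. It simulates three sources keyed by a hypothetical `patient_id`, joins them, and derives tumor percent change from baseline, a clinical-benefit flag, and two exposure summaries (AUC by the linear trapezoid rule, and Cmax).

```python
import math
import random


def simulate_sources(n_patients=120, seed=0):
    """Generate three hypothetical raw sources keyed by patient_id."""
    rng = random.Random(seed)
    doses = {"low": 50.0, "medium": 100.0, "high": 200.0}  # assumed doses (mg)
    shift = {"low": 0.10, "medium": 0.0, "high": -0.10}    # assumed dose effect
    times = [0.0, 1.0, 2.0, 4.0, 8.0, 24.0]                # assumed sampling (h)
    clinical, biomarker, pk = [], [], []
    for pid in range(1, n_patients + 1):
        group = rng.choice(list(doses))
        baseline = rng.uniform(20.0, 100.0)  # baseline tumor size (mm)
        followup = baseline * (1.0 + shift[group] + rng.gauss(0.0, 0.15))
        clinical.append({
            "patient_id": pid,
            "dose_group": group,
            "baseline_mm": baseline,
            "followup_mm": max(followup, 0.0),
        })
        biomarker.append({"patient_id": pid, "ldh": rng.gauss(250.0, 60.0)})
        for t in times:
            # Assumed absorption/elimination shape, scaled by dose.
            conc = doses[group] * 0.05 * math.exp(-0.2 * t) * (1 - math.exp(-1.5 * t))
            pk.append({"patient_id": pid, "time_h": t, "conc": conc})
    return clinical, biomarker, pk


def trapezoid_auc(points):
    """Linear trapezoid AUC over a time-sorted list of (time, conc) pairs."""
    return sum((t1 - t0) * (c0 + c1) / 2.0
               for (t0, c0), (t1, c1) in zip(points, points[1:]))


def integrate(clinical, biomarker, pk):
    """Join the three sources on patient_id and add derived variables."""
    bio_by_id = {row["patient_id"]: row for row in biomarker}
    pk_by_id = {}
    for row in pk:
        pk_by_id.setdefault(row["patient_id"], []).append((row["time_h"], row["conc"]))
    merged = []
    for row in clinical:
        pid = row["patient_id"]
        profile = sorted(pk_by_id[pid])
        pct = 100.0 * (row["followup_mm"] - row["baseline_mm"]) / row["baseline_mm"]
        merged.append({
            **row,
            **bio_by_id[pid],
            "tumor_pct_change": pct,
            "clinical_benefit": pct < 20.0,  # assumed cutoff, not from the source
            "auc": trapezoid_auc(profile),
            "cmax": max(conc for _, conc in profile),
        })
    return merged


dataset = integrate(*simulate_sources())
```

Keeping each source as a separate list until `integrate` mirrors the traceability goal stated in the abstract: every derived variable can be recomputed from the raw rows for a given `patient_id`.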

Matching journals

The top 3 journals account for 50% of the predicted probability mass.

1. JCO Clinical Cancer Informatics | 18 papers in training set | Top 0.1% | 44.8%
2. npj Digital Medicine | 97 papers in training set | Top 1% | 3.9%
3. Scientific Reports | 3102 papers in training set | Top 32% | 3.9%
50% of probability mass above
4. BMC Medical Informatics and Decision Making | 39 papers in training set | Top 0.7% | 3.9%
5. Cancer Medicine | 24 papers in training set | Top 0.4% | 3.1%
6. BMC Medical Research Methodology | 43 papers in training set | Top 0.4% | 2.5%
7. Clinical and Translational Science | 21 papers in training set | Top 0.3% | 2.2%
8. PLOS ONE | 4510 papers in training set | Top 50% | 1.9%
9. Journal of Medical Internet Research | 85 papers in training set | Top 2% | 1.8%
10. PLOS Computational Biology | 1633 papers in training set | Top 15% | 1.8%
11. BMJ Health & Care Informatics | 13 papers in training set | Top 0.5% | 1.6%
12. JAMIA Open | 37 papers in training set | Top 0.9% | 1.6%
13. Computer Methods and Programs in Biomedicine | 27 papers in training set | Top 0.4% | 1.6%
14. Frontiers in Digital Health | 20 papers in training set | Top 0.8% | 1.4%
15. Annals of Internal Medicine | 27 papers in training set | Top 0.7% | 1.0%
16. Informatics in Medicine Unlocked | 21 papers in training set | Top 0.8% | 1.0%
17. BMC Bioinformatics | 383 papers in training set | Top 6% | 1.0%
18. Journal of the American Medical Informatics Association | 61 papers in training set | Top 2% | 0.8%
19. JMIR Public Health and Surveillance | 45 papers in training set | Top 3% | 0.8%
20. JMIR Medical Informatics | 17 papers in training set | Top 1% | 0.8%
21. Nature Communications | 4913 papers in training set | Top 61% | 0.8%
22. Biomedicines | 66 papers in training set | Top 2% | 0.8%
23. BMC Infectious Diseases | 118 papers in training set | Top 6% | 0.7%
24. The Lancet Digital Health | 25 papers in training set | Top 1% | 0.7%
25. Frontiers in Artificial Intelligence | 18 papers in training set | Top 1.0% | 0.5%
26. Cell Reports Medicine | 140 papers in training set | Top 10% | 0.5%
27. European Journal of Cancer | 10 papers in training set | Top 0.7% | 0.5%
28. Database | 51 papers in training set | Top 1% | 0.5%
29. Computers in Biology and Medicine | 120 papers in training set | Top 6% | 0.5%
30. Radiotherapy and Oncology | 18 papers in training set | Top 0.3% | 0.5%