Compact longitudinal representations derived from mixed-format lifestyle questionnaires outperform static text-derived features for ALS-versus-control classification

Radlowski Nova, J.; Lopez-Carbonero, J. I.; Corrochano, S.; Ayala, J. L.

2026-03-25 bioinformatics
10.64898/2026.03.23.713709 bioRxiv

Background: Mixed-format lifestyle questionnaires contain both structured variables and free-text responses, but it remains unclear whether language-derived variables provide incremental predictive value beyond structured data, and under which representational conditions. This study investigated whether variables derived from patient-reported free text improve ALS-versus-control classification beyond structured questionnaire data, and whether their value depends on how temporal information is represented.

Methods: A leakage-free machine-learning pipeline was developed to classify ALS versus controls from questionnaire-derived data, including schema-guided, LLM-based text-to-table extraction and a compact longitudinal encoding strategy. Three feature configurations were compared: Pool1, containing structured baseline variables only; Pool2, adding compact summaries derived from first-time-point (T1) free-text responses; and Pool3, further incorporating compact descriptors of change between T1 and the second time point (T2). Logistic Regression, linear Support Vector Classification, and Random Forest were evaluated using repeated stratified holdout (10 seeds) and repeated stratified 5-fold cross-validation. Final ablation analyses were performed to isolate the contributions of the compact text block and the compact temporal block.

Results: After leakage correction, performance estimates became more conservative, indicating that previous results had been optimistic. In the final configuration, Pool3 achieved the best performance, with Random Forest reaching a holdout accuracy of 0.673, a weighted F1 score of 0.666, and a Matthews correlation coefficient of 0.323; the cross-validated weighted F1 score and Matthews correlation coefficient were 0.654 and 0.312, respectively. Pool2 did not show a robust improvement over Pool1. Ablation analysis showed that removing the compact temporal block markedly reduced Pool3 performance, whereas removing the compact text block had little overall effect. These findings indicate that the primary value of language-based processing in small clinical cohorts lies not in static feature enrichment, but in enabling compact representations of longitudinal change.

Conclusions: In this setting, the main predictive gain did not arise from static text-derived variables alone, but from representing questionnaire information as compact longitudinal change descriptors. These findings suggest that, in small clinical cohorts, the value of language-based processing may lie more in summarizing trajectories than in expanding static feature spaces.
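
The evaluation protocol named in the abstract (repeated stratified holdout over 10 seeds, scored with the Matthews correlation coefficient) can be illustrated with a minimal standard-library sketch. Everything below is an assumption made for illustration: the data are synthetic, the function names are hypothetical, and a nearest-centroid rule stands in for the study's Logistic Regression, linear SVC, and Random Forest models. The abstract does not say which leakage source was corrected; the sketch shows one common safeguard, estimating preprocessing statistics on the training split only.

```python
import math
import random
from statistics import mean, pstdev


def stratified_holdout(labels, test_frac, seed):
    """Split indices into train/test while keeping class proportions."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        n_test = max(1, round(test_frac * len(idx)))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def fit_scaler(rows):
    """Per-feature mean and standard deviation (constant features get 1.0)."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]


def scale(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (NOT one of the study's models)."""
    centroids = {}
    for cls in set(y_train):
        rows = [x for x, lab in zip(X_train, y_train) if lab == cls]
        centroids[cls] = [mean(c) for c in zip(*rows)]
    return [min(centroids, key=lambda c: math.dist(x, centroids[c]))
            for x in X_test]


def repeated_holdout_mcc(X, y, seeds=range(10), test_frac=0.2):
    """One MCC per seed; the scaler is fitted on training rows only."""
    scores = []
    for seed in seeds:
        tr, te = stratified_holdout(y, test_frac, seed)
        means, stds = fit_scaler([X[i] for i in tr])  # no test rows used here
        X_tr = scale([X[i] for i in tr], means, stds)
        X_te = scale([X[i] for i in te], means, stds)
        preds = nearest_centroid_predict(X_tr, [y[i] for i in tr], X_te)
        scores.append(mcc([y[i] for i in te], preds))
    return scores


# Synthetic two-class data, for illustration only (not the study's cohort).
_rng = random.Random(0)
X = [[_rng.gauss(float(lab), 1.0), _rng.gauss(0.0, 1.0)]
     for lab in (0, 1) for _ in range(40)]
y = [lab for lab in (0, 1) for _ in range(40)]

scores = repeated_holdout_mcc(X, y)
print(len(scores), round(mean(scores), 3))
```

Reporting the mean over seeds, rather than a single split, is what makes the holdout estimate comparable to the cross-validated one in a small cohort.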

Matching journals

The top 9 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | Scientific Reports | 3102 | Top 0.5% | 20.3%
2 | PLOS ONE | 4510 | Top 23% | 7.5%
3 | Journal of the American Medical Informatics Association | 61 | Top 0.6% | 4.3%
4 | The Lancet Digital Health | 25 | Top 0.1% | 3.7%
5 | Biology Methods and Protocols | 53 | Top 0.2% | 3.7%
6 | npj Digital Medicine | 97 | Top 1% | 3.4%
7 | BMC Bioinformatics | 383 | Top 3% | 2.9%
8 | Acta Psychiatrica Scandinavica | 10 | Top 0.1% | 2.9%
9 | BMC Medical Research Methodology | 43 | Top 0.4% | 2.7%
(50% of probability mass above)
10 | JMIR mHealth and uHealth | 10 | Top 0.1% | 2.2%
11 | JMIR Medical Informatics | 17 | Top 0.6% | 1.9%
12 | Frontiers in Artificial Intelligence | 18 | Top 0.2% | 1.9%
13 | BMC Medical Informatics and Decision Making | 39 | Top 1% | 1.8%
14 | BioData Mining | 15 | Top 0.3% | 1.8%
15 | Journal of Translational Medicine | 46 | Top 0.7% | 1.8%
16 | Nature Communications | 4913 | Top 54% | 1.4%
17 | Computational and Structural Biotechnology Journal | 216 | Top 5% | 1.4%
18 | Clinical Infectious Diseases | 231 | Top 3% | 1.3%
19 | Genome Medicine | 154 | Top 6% | 1.3%
20 | Journal of Biomedical Informatics | 45 | Top 1.0% | 1.3%
21 | Artificial Intelligence in Medicine | 15 | Top 0.5% | 0.9%
22 | Bioinformatics | 1061 | Top 8% | 0.9%
23 | iScience | 1063 | Top 28% | 0.8%
24 | Journal of Medical Internet Research | 85 | Top 4% | 0.8%
25 | Bioinformatics Advances | 184 | Top 4% | 0.8%
26 | European Journal of Human Genetics | 49 | Top 1% | 0.8%
27 | PLOS Computational Biology | 1633 | Top 27% | 0.7%
28 | Brain Communications | 147 | Top 3% | 0.7%
29 | JCO Clinical Cancer Informatics | 18 | Top 0.9% | 0.7%
30 | Life | 27 | Top 0.7% | 0.5%
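
The statement that the top 9 journals account for 50% of the predicted probability mass can be checked directly from the percentages listed above. The short sketch below accumulates them in rank order; the function name `journals_to_reach` is hypothetical, and the values are copied from the list as displayed (rounded to one decimal place), so the total is approximate.

```python
# Predicted probabilities (%) for ranks 1-30, as listed on the page.
PROBS = [
    20.3, 7.5, 4.3, 3.7, 3.7, 3.4, 2.9, 2.9, 2.7, 2.2,
    1.9, 1.9, 1.8, 1.8, 1.8, 1.4, 1.4, 1.3, 1.3, 1.3,
    0.9, 0.9, 0.8, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.5,
]


def journals_to_reach(probs, threshold=50.0):
    """Return (k, cumulative %) for the smallest k whose top-k sum >= threshold."""
    total = 0.0
    for k, p in enumerate(probs, start=1):
        total += p
        if total >= threshold:
            return k, total
    return None  # threshold not reached by the listed journals


k, total = journals_to_reach(PROBS)
print(k, round(total, 1))  # -> 9 51.4
```

The first eight entries sum to 48.7%, so the ninth is the one that crosses the 50% line, which matches the marker placed after rank 9.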