Study Design Indexing in Transition: A Focused Comparison of manual NLM Indexing vs. Transformer-based Automated Models

Das, P.; Schneider, J.; Mayo-Wilson, E.; Kilicoglu, H.; Menke, J. D.; Nam, D.; Ninan, K.; Oberste, J.-P.; Troy, A. M.; Ying, X.; Holt, A. W.; Smalheiser, N. R.

2026-06-04 health informatics
10.64898/2026.06.03.26354854 medRxiv

Objectives: Study design indexing of biomedical publications is crucial for evidence retrieval and synthesis. We sought to evaluate the accuracy and suitability of a transformer-based model (TM) for indexing clinical study designs, in comparison to National Library of Medicine (NLM) indexing. However, this is challenging for at least three reasons. First, to date, all automated systems have been trained and evaluated on manual NLM indexing assignments, which are themselves subject to error. Second, TM's probabilistic predictive scores take uncertainty into account and can be converted to TRUE/FALSE assignments in different ways depending on the needs of users, whereas NLM labels are categorical. Third, our goal (to tag only articles that exhibit a given design) differs from that of NLM, which tags articles that discuss a design as well as articles that exhibit it.

Materials and Methods: Therefore, we carried out a limited evaluation of TM that focuses only on the articles that received the most confident predictions, that is, the highest scores (almost certainly TRUE) and the lowest scores (almost certainly FALSE), and that disagreed with NLM assignments. This was performed both for articles published in 2016 (when NLM decisions were manual) and in 2025 (when NLM decisions were automated). To establish ground truth, dual annotators indexed the articles independently, following written definitions, for four prominent study designs: cohort, case-control, cross-sectional, and case report.

Results: For three designs (case-control, case report, cross-sectional), the articles having the top 100 predictive TM scores (when NLM failed to assign that design) were judged to exhibit that design in the great majority of cases (86-100%). Conversely, the articles having the lowest 100 predictive TM scores (when NLM did assign the study design) exhibited the design in relatively few cases (0-21%). The most confident predictions of TM were highly accurate and not redundant with automated NLM indexing; the exception was cohort study articles, for which both TM and NLM labels showed high rates of errors of both omission and commission.

Discussion and Conclusion: TM may have value for identifying articles exhibiting study designs, which is especially important for clinical decision-making as well as systematic reviews and other evidence syntheses. NLM indexing of cohort studies cannot be regarded as a reliable gold standard for training or evaluation of automated systems, warranting efforts to create a new manually annotated corpus.
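
The sampling step described in the abstract (take the highest-scoring articles that NLM did not tag and the lowest-scoring articles that NLM did tag) can be illustrated with a minimal Python sketch. This is not the authors' code: the `Article` record, its field names, the score range, and the toy data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Article:
    pmid: str        # hypothetical identifier, for illustration only
    tm_score: float  # TM predictive score (assumed to lie in [0, 1])
    nlm_label: bool  # True if NLM assigned the study design


def confident_disagreements(articles, k=100):
    """Return the k highest-scoring articles NLM did NOT tag and the
    k lowest-scoring articles NLM DID tag."""
    nlm_negative = [a for a in articles if not a.nlm_label]
    nlm_positive = [a for a in articles if a.nlm_label]
    top = sorted(nlm_negative, key=lambda a: a.tm_score, reverse=True)[:k]
    bottom = sorted(nlm_positive, key=lambda a: a.tm_score)[:k]
    return top, bottom


# Toy data (invented), with k=2 instead of the 100 used in the study.
articles = [
    Article("a", 0.99, False),
    Article("b", 0.97, True),
    Article("c", 0.02, True),
    Article("d", 0.01, False),
    Article("e", 0.60, False),
    Article("f", 0.10, True),
]
top, bottom = confident_disagreements(articles, k=2)
print([a.pmid for a in top])     # candidates for NLM errors of omission
print([a.pmid for a in bottom])  # candidates for NLM errors of commission
```

In the study, the two samples were then judged by human annotators; the sketch covers only the selection of which articles to annotate.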

Matching journals

The top 5 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Probability
1 | Journal of the American Medical Informatics Association | 61 | Top 0.2% | 14.9%
2 | Journal of Clinical Epidemiology | 28 | Top 0.1% | 14.5%
3 | Research Synthesis Methods | 20 | Top 0.1% | 14.5%
4 | PLOS ONE | 4510 | Top 30% | 4.9%
5 | JAMIA Open | 37 | Top 0.2% | 4.9%
50% of probability mass above
6 | Journal of Biomedical Informatics | 45 | Top 0.4% | 4.0%
7 | BMJ Open | 554 | Top 5% | 3.7%
8 | BMC Medical Research Methodology | 43 | Top 0.3% | 3.6%
9 | BMC Medicine | 163 | Top 2% | 2.1%
10 | BMJ Health & Care Informatics | 13 | Top 0.4% | 1.8%
11 | JMIR Medical Informatics | 17 | Top 0.7% | 1.8%
12 | BMJ | 49 | Top 0.6% | 1.5%
13 | BMC Medical Informatics and Decision Making | 39 | Top 2% | 1.2%
14 | Healthcare | 16 | Top 1% | 1.2%
15 | Scientific Reports | 3102 | Top 68% | 1.0%
16 | Journal of Medical Internet Research | 85 | Top 4% | 1.0%
17 | Artificial Intelligence in Medicine | 15 | Top 0.5% | 0.9%
18 | European Journal of Epidemiology | 40 | Top 0.6% | 0.9%
19 | npj Digital Medicine | 97 | Top 3% | 0.9%
20 | JAMA | 17 | Top 0.3% | 0.8%
21 | International Journal of Medical Informatics | 25 | Top 1% | 0.8%
22 | Computers in Biology and Medicine | 120 | Top 4% | 0.8%
23 | Neuroscience & Biobehavioral Reviews | 43 | Top 1.0% | 0.8%
24 | Scientific Data | 174 | Top 3% | 0.7%
25 | PLOS Biology | 408 | Top 23% | 0.7%
26 | The Lancet Digital Health | 25 | Top 1% | 0.7%
27 | PeerJ | 261 | Top 19% | 0.5%
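
The "50% of probability mass" cutoff can be checked against the listed probabilities with a running sum. The following Python sketch uses the first eight probabilities from the ranking; the function name `journals_to_reach` is hypothetical, and how the page itself computes the cutoff is not stated, so this is only one plausible reading.

```python
from itertools import accumulate

# Predicted probabilities (in percent) for ranks 1-8, copied from the list.
probs = [14.9, 14.5, 14.5, 4.9, 4.9, 4.0, 3.7, 3.6]


def journals_to_reach(probabilities, threshold=50.0):
    """Return (rank, cumulative mass) at the first rank whose running
    total meets or exceeds the threshold."""
    for rank, running in enumerate(accumulate(probabilities), start=1):
        if running >= threshold:
            return rank, running
    raise ValueError("threshold not reached by the listed probabilities")


rank, mass = journals_to_reach(probs)
print(rank, round(mass, 1))
```

Under this reading, the running total after four journals is 48.8% and the fifth journal brings it to 53.7%, which is consistent with the cutoff being drawn after rank 5.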