
A Novel Open Access Multimodal Dataset Of Nodule Imaging And Circulating Proteome From A Lung Cancer Screening Cohort

Cobo, M.; Serrano, D.; Barranco, J.; Pasquier, A.; de-Torres, J. P.; Zulueta, J. J.; Echeveste, J. I.; Ezponda, A.; Argueta, A.; Sanz-Ortega, J.; Berto, J.; Alcaide, A. B.; di Frisco, M.; Felgueroso, C.; Campo, A.; de la Fuente, A. A.; Escobar, A.; Valencia, K.; Orive, D.; Ocon, M. d. M.; Globacka, H. B.; Fortuno, M. A.; Perna, V.; Rodriguez, M.; Lozano, M. D.; Calvo, A.; Pio, R.; Hung, R. J.; Seijo, L. M.; Silva, W.; Bastarrika, G.; Lloret Iglesias, L.; Montuenga, L. M.

2025-12-27 oncology
10.64898/2025.12.23.25342921 medRxiv

Introduction: Low-dose computed tomography (LDCT) lung cancer screening has significantly enhanced early detection and patient survival rates in the population at risk. Current screening methods, which rely primarily on LDCT imaging, will very likely benefit from molecular biomarkers to achieve a more comprehensive, accurate, personalized and non-invasive risk assessment that leverages multimodal tools. We present a novel open access multimodal (imaging, proteomics and demographic) dataset designed to provide a research resource on LDCT-based early lung cancer detection. The dataset includes annotated screening LDCT scans and plasma proteomics generated with the proximity extension assay (Olink) platform.

Methods: The dataset integrates data from screened control individuals without nodules or with benign nodules and from individuals with LDCT-diagnosed lung cancer, matched by sex, age and time between image and sample collection. Radiological and molecular signatures were both collected within a six-month window, providing detailed insights into disease progression. Nodules were considered lung cancer cases if biopsy-confirmed lung cancer was diagnosed within 5 years after imaging, enabling the study of longitudinal biomarker evolution and its correlation with imaging findings. To complement the dataset, clinical and demographic data are also available in open access, providing a detailed overview of patient characteristics. The informed consent signed by the participants allows unrestricted open access for requests directly or indirectly related to lung cancer research.

Results: The dataset consists of annotated screening LDCT scans and plasma proteomics data measured with most of the Olink Target 96 panels (1078 individual proteins across 12 panels, each focused on a specific area of disease or biology) for a total of 211 screening participants. There are 67 lung cancer patients, 68 matched controls with benign pulmonary nodules, 71 matched controls without nodules and 5 surgically excised false positive lesions. Experiments were performed to assess the technical quality of the dataset and to provide a proof of concept of its usability, showing alignment with findings from previously published studies.

Conclusion: This comprehensive dataset aims to facilitate research towards the development of personalized multimodal artificial intelligence models. We also aim to support the investigation of the relationship between imaging and molecular data, paving the way for a more accurate understanding of early lung cancer biology. Finally, our open access dataset may help to develop or validate individualized risk prediction models that could significantly advance early lung cancer detection and intervention strategies.

Matching journals

The top 10 journals account for 50% of the predicted probability mass.

1. Scientific Reports | 3102 papers in training set | Top 4% | 10.9%
2. Journal of Translational Medicine | 46 papers in training set | Top 0.1% | 8.8%
3. PLOS ONE | 4510 papers in training set | Top 26% | 6.7%
4. Computers in Biology and Medicine | 120 papers in training set | Top 0.4% | 5.1%
5. Diagnostics | 48 papers in training set | Top 0.2% | 5.1%
6. Database | 51 papers in training set | Top 0.1% | 4.5%
7. PLOS Computational Biology | 1633 papers in training set | Top 9% | 3.8%
8. BMJ Open | 554 papers in training set | Top 7% | 2.2%
9. iScience | 1063 papers in training set | Top 9% | 2.2%
10. Journal of Clinical Medicine | 91 papers in training set | Top 3% | 2.0%
50% of probability mass above
11. Frontiers in Oncology | 95 papers in training set | Top 2% | 2.0%
12. Frontiers in Bioinformatics | 45 papers in training set | Top 0.2% | 1.8%
13. PeerJ | 261 papers in training set | Top 6% | 1.8%
14. European Respiratory Journal | 54 papers in training set | Top 0.9% | 1.7%
15. npj Digital Medicine | 97 papers in training set | Top 2% | 1.4%
16. Metabolites | 50 papers in training set | Top 0.6% | 1.4%
17. Nature Communications | 4913 papers in training set | Top 56% | 1.3%
18. Frontiers in Pharmacology | 100 papers in training set | Top 3% | 1.2%
19. Molecular Cancer | 14 papers in training set | Top 0.6% | 1.2%
20. International Journal of Molecular Sciences | 453 papers in training set | Top 11% | 1.2%
21. Cancers | 200 papers in training set | Top 4% | 1.2%
22. eLife | 5422 papers in training set | Top 50% | 1.2%
23. JNCI: Journal of the National Cancer Institute | 16 papers in training set | Top 0.5% | 1.0%
24. Heliyon | 146 papers in training set | Top 4% | 0.9%
25. Biomedical Optics Express | 84 papers in training set | Top 0.9% | 0.9%
26. Clinical Chemistry | 22 papers in training set | Top 0.6% | 0.9%
27. BMC Bioinformatics | 383 papers in training set | Top 6% | 0.9%
28. Briefings in Bioinformatics | 326 papers in training set | Top 6% | 0.8%
29. European Radiology | 14 papers in training set | Top 0.7% | 0.8%
30. Annals of Translational Medicine | 17 papers in training set | Top 1% | 0.8%
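
The "50% of probability mass" cutoff in the ranked list can be checked with a running sum over the listed probabilities. The sketch below is only an illustration: it assumes the last percentage shown for each journal is its predicted probability, and it uses the rounded values as displayed, so totals are approximate. The function name `ranks_to_reach` is hypothetical and not part of any tool behind this page.

```python
from itertools import accumulate

# Predicted probabilities (percent) copied from the ranked list, ranks 1 to 30.
# These are the rounded values shown on the page, so sums are approximate.
probs = [10.9, 8.8, 6.7, 5.1, 5.1, 4.5, 3.8, 2.2, 2.2, 2.0,
         2.0, 1.8, 1.8, 1.7, 1.4, 1.4, 1.3, 1.2, 1.2, 1.2,
         1.2, 1.2, 1.0, 0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.8]


def ranks_to_reach(probabilities, threshold):
    """Return how many top-ranked entries are needed for the cumulative
    sum to reach `threshold`, or None if the threshold is never reached."""
    for rank, total in enumerate(accumulate(probabilities), start=1):
        # Small tolerance guards against floating-point rounding in the sum.
        if total >= threshold - 1e-9:
            return rank
    return None


cutoff = ranks_to_reach(probs, 50.0)
print(cutoff)  # 10
```

With the displayed values, nine journals fall just short of 50% and the tenth crosses it, which matches the position of the cutoff line in the list.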