Custom Biomedical FAIR Data Analysis in the Cloud Using CAVATICA

Berke, S. R.; Kanchan, K.; Marazita, M. L.; Tobin, E.; Ruczinski, I.

2024-06-28 health informatics
10.1101/2024.06.27.24309340 medRxiv
The historically fragmented biomedical data ecosystem has moved towards harmonization under the findable, accessible, interoperable, and reusable (FAIR) data principles, creating more opportunities for cloud-based research. This shift is especially opportune for scientists across diverse domains interested in implementing creative, nonstandard computational analytic pipelines on large and varied datasets. However, executing custom cloud analyses may present difficulties, particularly for investigators lacking advanced computational expertise. Here, we present an accessible, streamlined approach for the cloud compute platform CAVATICA that addresses these difficulties. We outline how we developed a custom cloud workflow for analyzing whole genome sequences of case-parent trios to detect sex-specific genetic effects on orofacial cleft risk, which required several programming languages and custom software packages. The approach involves just three components: Docker to containerize software environments, tool creation for each analysis step, and a visual workflow editor to weave the tools into a Common Workflow Language (CWL) pipeline. Our approach should be accessible to any investigator with basic computational skills, is readily extended to implement any scalable high-throughput biomedical data analysis in the cloud, and is applicable to other commonly used compute platforms such as BioData Catalyst. We believe our approach empowers versatile data reuse and promotes accelerated biomedical discovery at a time when FAIR data are abundant.

Matching journals

The top 7 journals account for 50% of the predicted probability mass.

Rank | Journal | Papers in training set | Percentile | Predicted probability
1 | SoftwareX | 15 | Top 0.1% | 10.2%
2 | Bioinformatics | 1061 | Top 3% | 9.2%
3 | Journal of the American Medical Informatics Association | 61 | Top 0.4% | 7.2%
4 | Patterns | 70 | Top 0.1% | 6.9%
5 | PLOS ONE | 4510 | Top 27% | 6.4%
6 | GigaScience | 172 | Top 0.2% | 6.4%
7 | PLOS Computational Biology | 1633 | Top 8% | 4.0%
(50% of probability mass above)
8 | BMC Bioinformatics | 383 | Top 2% | 4.0%
9 | Scientific Reports | 3102 | Top 36% | 3.6%
10 | JCO Clinical Cancer Informatics | 18 | Top 0.3% | 3.1%
11 | BMC Medical Genomics | 36 | Top 0.3% | 2.1%
12 | JAMIA Open | 37 | Top 0.7% | 1.9%
13 | Computer Methods and Programs in Biomedicine | 27 | Top 0.3% | 1.9%
14 | Frontiers in Bioinformatics | 45 | Top 0.1% | 1.9%
15 | BMC Medical Informatics and Decision Making | 39 | Top 1% | 1.7%
16 | PeerJ | 261 | Top 7% | 1.7%
17 | Journal of Biomedical Informatics | 45 | Top 0.9% | 1.5%
18 | PLOS Digital Health | 91 | Top 2% | 1.0%
19 | iScience | 1063 | Top 26% | 0.9%
20 | Frontiers in Digital Health | 20 | Top 1% | 0.9%
21 | NAR Genomics and Bioinformatics | 214 | Top 3% | 0.9%
22 | International Journal of Medical Informatics | 25 | Top 1% | 0.8%
23 | Nature Communications | 4913 | Top 62% | 0.8%
24 | Data in Brief | 13 | Top 0.5% | 0.7%
25 | IEEE Journal of Biomedical and Health Informatics | 34 | Top 2% | 0.7%
26 | eLife | 5422 | Top 61% | 0.6%
27 | GENETICS | 189 | Top 2% | 0.6%
28 | Nature Computational Science | 50 | Top 2% | 0.6%
29 | European Journal of Epidemiology | 40 | Top 1% | 0.5%
30 | Journal of Medical Internet Research | 85 | Top 6% | 0.5%
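
The statement that the top 7 journals account for 50% of the predicted probability mass can be checked from the listed percentages. The following is a minimal sketch, not the site's own method: the function name `rank_reaching` is hypothetical, and the probabilities are copied from ranks 1 to 10 of the list above.

```python
from itertools import accumulate

# Predicted probabilities (%) for ranks 1-10, copied from the list above.
probabilities = [10.2, 9.2, 7.2, 6.9, 6.4, 6.4, 4.0, 4.0, 3.6, 3.1]


def rank_reaching(mass_percent, probs):
    """Return the 1-based rank at which the running total first reaches
    mass_percent, or None if the listed values never reach it."""
    for rank, total in enumerate(accumulate(probs), start=1):
        if total >= mass_percent:
            return rank
    return None


cutoff_rank = rank_reaching(50.0, probabilities)
print(cutoff_rank)  # prints 7
```

The running total after rank 6 is about 46.3% and after rank 7 about 50.3%, which matches the position of the 50% marker in the list.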