Proteomic Fingerprinting: A novel privacy concern

Hill, A. C.; Litkowski, E. M.; Manichaikul, A.; Lange, L.; Pratte, K. A.; Kechris, K. J.; DeCamp, M.; Coors, M.; Ortega, V. E.; Rich, S. S.; Rotter, J. I.; Gerszten, R. E.; Clish, C. B.; Curtis, J.; Hu, X.; Ngo, D.; O'Neal, W. K.; Meyers, D.; Bleecker, E.; Hobbs, B. D.; Cho, M. H.; Banaei-Kashani, F.; Bowler, R. P.

2022-04-12 genetic and genomic medicine
10.1101/2022.04.06.22269907 medRxiv
Introduction: Privacy protection is a core principle of genomic research but needs further refinement for high-throughput proteomic platforms.

Methods: We identified independent single nucleotide polymorphism (SNP) protein quantitative trait loci (pQTL) from COPDGene and the Jackson Heart Study (JHS) and then calculated genotype probabilities by protein level for each protein-genotype combination (training). Using the 100 most significant proteins, we applied a naive Bayesian approach to match proteomes to genomes for 2,812 independent subjects from COPDGene, JHS, the SubPopulations and InteRmediate Outcome Measures In COPD Study (SPIROMICS), and the Multi-Ethnic Study of Atherosclerosis (MESA) with SomaScan 1.3K proteomes, and also for 2,646 COPDGene subjects with SomaScan 5K proteomes (testing). We tested whether subtracting the mean genotype effect for each pQTL SNP would obscure genetic identity.

Results: In the four testing cohorts, we matched 90%-95% of proteomes to their correct genome, and for 95%-99% of proteomes the correct genome was among the 1% most likely genomes. With larger profiling (SomaScan 5K), correct identification was > 99%. The accuracy of matching in subjects with African ancestry was lower (~60%) unless training included diverse subjects. Mean genotype effect adjustment reduced identification accuracy to nearly that of random guessing.

Conclusion: Large proteomic datasets (> 1,000 proteins) can be accurately linked to a specific genome through pQTL knowledge and should not be considered deidentified. These findings suggest that large-scale proteomic data should be given the privacy protections of genomic data, or that bioinformatic transformations (such as adjustment for genotype effect) should be applied to obfuscate identity.
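The matching idea in the abstract can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it assumes one biallelic pQTL SNP per protein, an additive effect of one standard deviation per allele, Gaussian protein levels within each genotype class, and a uniform prior over candidate genomes. All function names and parameter values (`fit_genotype_gaussians`, `match_scores`, `demo`, `effect=1.0`, and so on) are hypothetical and chosen only for illustration. It scores each proteome against each genome by summing per-protein log-likelihoods (the naive Bayes independence assumption), then repeats the match after subtracting each subject's mean genotype effect.

```python
import numpy as np


def fit_genotype_gaussians(genotypes, levels):
    """Per protein and genotype class (0/1/2), estimate mean and SD of the level."""
    n_proteins = levels.shape[1]
    means = np.zeros((n_proteins, 3))
    sds = np.ones((n_proteins, 3))
    for j in range(n_proteins):
        for g in range(3):
            vals = levels[genotypes[:, j] == g, j]
            if vals.size >= 2:  # keep defaults for classes too rare to estimate
                means[j, g] = vals.mean()
                sds[j, g] = vals.std(ddof=1)
    return means, sds


def match_scores(levels, genotypes, means, sds):
    """scores[i, k]: log-likelihood of proteome i under genome k, summed over proteins."""
    cols = np.arange(levels.shape[1])
    mu = means[cols, genotypes]  # shape (n_genomes, n_proteins)
    sd = sds[cols, genotypes]
    z = (levels[:, None, :] - mu[None, :, :]) / sd[None, :, :]
    return (-0.5 * z**2 - np.log(sd)[None, :, :]).sum(axis=2)


def top1_accuracy(scores):
    """Fraction of proteomes whose best-scoring genome is their own (same row index)."""
    return float(np.mean(scores.argmax(axis=1) == np.arange(scores.shape[0])))


def demo(n_train=500, n_test=200, n_proteins=100, effect=1.0, seed=0):
    # Simulated data only; allele frequencies and effect size are assumptions.
    rng = np.random.default_rng(seed)
    n = n_train + n_test
    maf = rng.uniform(0.2, 0.5, size=n_proteins)
    genotypes = rng.binomial(2, maf, size=(n, n_proteins))
    levels = effect * genotypes + rng.normal(size=(n, n_proteins))

    g_train, x_train = genotypes[:n_train], levels[:n_train]
    g_test, x_test = genotypes[n_train:], levels[n_train:]
    means, sds = fit_genotype_gaussians(g_train, x_train)

    # Match unadjusted proteomes to genomes.
    raw_acc = top1_accuracy(match_scores(x_test, g_test, means, sds))

    # Subtract each subject's own genotype-class mean, then try to match again.
    cols = np.arange(n_proteins)
    x_adjusted = x_test - means[cols, g_test]
    adj_acc = top1_accuracy(match_scores(x_adjusted, g_test, means, sds))
    return raw_acc, adj_acc


raw_acc, adj_acc = demo()
print(f"top-1 accuracy, raw levels:      {raw_acc:.3f}")
print(f"top-1 accuracy, adjusted levels: {adj_acc:.3f}")
```

In this toy setting the adjusted levels carry essentially no genotype signal, so matching falls toward chance (about 1 in `n_test`). Real data would differ in effect sizes, allele frequencies, and ancestry structure, all of which the abstract indicates matter for accuracy.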

Matching journals

The top 11 journals account for 50% of the predicted probability mass.

1. BMC Medical Genomics | 36 papers in training set | Top 0.1% | 14.6%
2. Bioinformatics | 1061 papers in training set | Top 4% | 6.5%
3. PLOS ONE | 4510 papers in training set | Top 26% | 6.5%
4. Genetic Epidemiology | 46 papers in training set | Top 0.1% | 4.4%
5. Scientific Reports | 3102 papers in training set | Top 30% | 4.0%
6. BMC Genomics | 328 papers in training set | Top 0.8% | 3.7%
7. International Journal of Epidemiology | 74 papers in training set | Top 0.8% | 2.7%
8. Genome Medicine | 154 papers in training set | Top 3% | 2.5%
9. Human Molecular Genetics | 130 papers in training set | Top 1% | 2.1%
10. Nature Communications | 4913 papers in training set | Top 48% | 1.9%
11. European Journal of Epidemiology | 40 papers in training set | Top 0.2% | 1.9%
50% of probability mass above
12. Journal of the American Medical Informatics Association | 61 papers in training set | Top 1% | 1.8%
13. Journal of Proteome Research | 215 papers in training set | Top 1% | 1.7%
14. Wellcome Open Research | 57 papers in training set | Top 0.8% | 1.7%
15. GigaScience | 172 papers in training set | Top 2% | 1.5%
16. JAMA Network Open | 127 papers in training set | Top 2% | 1.5%
17. PLOS Computational Biology | 1633 papers in training set | Top 18% | 1.4%
18. Alzheimer's & Dementia | 143 papers in training set | Top 2% | 1.2%
19. Circulation: Genomic and Precision Medicine | 42 papers in training set | Top 0.9% | 1.2%
20. F1000Research | 79 papers in training set | Top 3% | 1.1%
21. Frontiers in Pharmacology | 100 papers in training set | Top 3% | 1.0%
22. Frontiers in Bioinformatics | 45 papers in training set | Top 0.5% | 1.0%
23. European Respiratory Journal | 54 papers in training set | Top 1% | 1.0%
24. Data in Brief | 13 papers in training set | Top 0.3% | 0.9%
25. Human Genetics and Genomics Advances | 70 papers in training set | Top 0.7% | 0.8%
26. EBioMedicine | 39 papers in training set | Top 0.9% | 0.8%
27. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring | 38 papers in training set | Top 1% | 0.8%
28. JAMA | 17 papers in training set | Top 0.3% | 0.8%
29. Genetics in Medicine | 69 papers in training set | Top 1.0% | 0.8%
30. BMC Bioinformatics | 383 papers in training set | Top 7% | 0.8%
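
The 50% cutoff stated above can be checked from the listed probabilities. The sketch below assumes the final percentage on each entry is that journal's predicted probability, as the "predicted probability mass" wording suggests, and uses the rounded values as displayed. The helper name `journals_to_reach` is hypothetical.

```python
# Predicted probabilities (%) for ranks 1-30, copied from the list above.
probs = [
    14.6, 6.5, 6.5, 4.4, 4.0, 3.7, 2.7, 2.5, 2.1, 1.9,
    1.9, 1.8, 1.7, 1.7, 1.5, 1.5, 1.4, 1.2, 1.2, 1.1,
    1.0, 1.0, 1.0, 0.9, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8,
]


def journals_to_reach(probabilities, threshold):
    """Return how many top-ranked entries are needed for the running sum to
    reach the threshold, or None if the listed entries never reach it."""
    total = 0.0
    for count, p in enumerate(probabilities, start=1):
        total += p
        if total >= threshold:
            return count
    return None


n_needed = journals_to_reach(probs, 50.0)
print(n_needed)
```

Because the displayed values are rounded to one decimal place, the running sum is only approximate, but it is enough to confirm where the cutoff falls.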