Proteomic Fingerprinting: A novel privacy concern
Hill, A. C.; Litkowski, E. M.; Manichaikul, A.; Lange, L.; Pratte, K. A.; Kechris, K. J.; DeCamp, M.; Coors, M.; Ortega, V. E.; Rich, S. S.; Rotter, J. I.; Gerszten, R. E.; Clish, C. B.; Curtis, J.; Hu, X.; Ngo, D.; O'Neal, W. K.; Meyers, D.; Bleecker, E.; Hobbs, B. D.; Cho, M. H.; Banaei-Kashani, F.; Bowler, R. P.
Introduction: Privacy protection is a core principle of genomic research, but it needs further refinement for high-throughput proteomic platforms.
Methods: We identified independent single nucleotide polymorphism (SNP) protein quantitative trait loci (pQTLs) from COPDGene and the Jackson Heart Study (JHS) and then calculated genotype probabilities by protein level for each protein-genotype combination (training). Using the 100 most significant proteins, we applied a naive Bayesian approach to match proteomes to genomes for 2,812 independent subjects from COPDGene, JHS, the SubPopulations and InteRmediate Outcome Measures In COPD Study (SPIROMICS), and the Multi-Ethnic Study of Atherosclerosis (MESA) with SomaScan 1.3K proteomes, and also for 2,646 COPDGene subjects with SomaScan 5K proteomes (testing). We tested whether subtracting the mean genotype effect for each pQTL SNP would obscure genetic identity.
Results: In the four testing cohorts, we correctly matched 90%-95% of proteomes to their correct genome, and for 95%-99% the correct genome ranked among the 1% most likely genomes. With larger profiling (SomaScan 5K), correct identification was > 99%. The accuracy of matching in subjects with African ancestry was lower (~60%) unless training included diverse subjects. Mean genotype effect adjustment reduced identification accuracy nearly to random guessing.
Conclusion: Large proteomic datasets (> 1,000 proteins) can be accurately linked to a specific genome through pQTL knowledge and should not be considered deidentified. These findings suggest that large-scale proteomic data be given the privacy protections of genomic data, or that bioinformatic transformations (such as adjustment for genotype effect) should be applied to obfuscate identity.
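The matching and obfuscation steps described in the abstract can be illustrated with a minimal sketch on simulated data. This is not the authors' pipeline. It assumes one pQTL SNP per protein, Gaussian protein levels within each genotype class, a uniform prior over candidate genomes, and an arbitrary per-allele effect size. The function names (`fit_genotype_gaussians`, `match_proteomes_to_genomes`, `run_demo`) and all simulation parameters are hypothetical.

```python
import numpy as np


def fit_genotype_gaussians(protein, genotype):
    """Training: per-protein mean and SD of protein level within each genotype (0, 1, 2)."""
    n_prot = protein.shape[1]
    # Fall back to the overall mean/SD when a genotype class is too rare to estimate.
    mean = np.tile(protein.mean(axis=0)[:, None], (1, 3))
    sd = np.tile(protein.std(axis=0)[:, None], (1, 3))
    for p in range(n_prot):
        for g in range(3):
            vals = protein[genotype[:, p] == g, p]
            if vals.size >= 2:
                mean[p, g] = vals.mean()
                sd[p, g] = max(vals.std(), 1e-3)
    return mean, sd


def match_proteomes_to_genomes(protein, genotype, mean, sd):
    """Testing: for each proteome, pick the genome with the highest naive Bayes log-likelihood."""
    idx = np.arange(protein.shape[1])
    mu = mean[idx, genotype]    # mu[j, p]: expected level of protein p under genome j
    sigma = sd[idx, genotype]
    # Naive Bayes: proteins are treated as independent, so log-likelihoods add up.
    z = (protein[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    loglik = (-0.5 * z**2 - np.log(sigma)[None, :, :]).sum(axis=2)
    return loglik.argmax(axis=1)


def simulate(n, maf, beta, rng):
    """Simulated cohort: additive genotype effect plus unit-variance noise (assumed model)."""
    genotype = rng.binomial(2, maf, size=(n, maf.size))
    protein = genotype * beta + rng.normal(size=(n, maf.size))
    return protein, genotype


def run_demo(seed=0, n_train=1000, n_test=200, n_prot=100, beta=1.0):
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.2, 0.5, size=n_prot)  # hypothetical minor allele frequencies
    train_protein, train_genotype = simulate(n_train, maf, beta, rng)
    test_protein, test_genotype = simulate(n_test, maf, beta, rng)
    mean, sd = fit_genotype_gaussians(train_protein, train_genotype)

    truth = np.arange(n_test)
    pred_raw = match_proteomes_to_genomes(test_protein, test_genotype, mean, sd)
    acc_raw = float((pred_raw == truth).mean())

    # Obfuscation: subtract the mean genotype effect for each pQTL SNP
    # (this requires the data holder to know each subject's genotype).
    adjusted = test_protein - mean[np.arange(n_prot), test_genotype]
    pred_adj = match_proteomes_to_genomes(adjusted, test_genotype, mean, sd)
    acc_adj = float((pred_adj == truth).mean())
    return acc_raw, acc_adj


if __name__ == "__main__":
    acc_raw, acc_adj = run_demo()
    print(f"matching accuracy, raw proteomes:      {acc_raw:.3f}")
    print(f"matching accuracy, adjusted proteomes: {acc_adj:.3f}")
```

In this toy setting the adjusted proteomes carry almost no genotype signal, so the match is close to a random guess among the candidate genomes. The accuracies it prints reflect the simulated effect sizes only and say nothing about the cohorts in the study.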