Sharing genomes without personal identifiers is common practice. However, recent studies have revealed the risk of re-identifying people from their genomes or from attached quasi-identifiers such as sex, birthdate and zip code. The additional availability of an individual's RNA-seq data has privacy implications, because it may be linked to the genome and thereby allow the person's privacy to be breached. For example, sex and ethnicity may be inferred directly from a genome, and the genome study may provide a zip code. This genome could be linked to RNA-seq data from a diabetes study with attached birthdates and income. The combined quasi-identifiers may uniquely identify the person, and the diabetes study then reveals the person's disease status. RNA-seq reads contain genetic variants and can therefore be linked directly to the genome. To avoid this risk, some researchers now release gene expression, isoform expression and exon read count data instead of raw reads.
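To illustrate why combined quasi-identifiers matter, the following is a minimal sketch that measures what fraction of records becomes unique as more attributes are joined. The records, field names and the helper functions `anonymity_set_sizes` and `unique_fraction` are hypothetical and made up for illustration; they are not data or methods from the study.

```python
from collections import Counter

# Hypothetical records; every value here is invented for illustration.
population = [
    {"sex": "F", "zip": "02139", "birthdate": "1980-04-12"},
    {"sex": "F", "zip": "02139", "birthdate": "1975-09-30"},
    {"sex": "M", "zip": "02139", "birthdate": "1980-04-12"},
    {"sex": "M", "zip": "02140", "birthdate": "1991-01-05"},
]


def anonymity_set_sizes(records, keys):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def unique_fraction(records, keys):
    """Fraction of records whose quasi-identifier combination occurs exactly once."""
    sizes = anonymity_set_sizes(records, keys)
    unique = sum(1 for r in records if sizes[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


for keys in (["sex"], ["sex", "zip"], ["sex", "zip", "birthdate"]):
    print(keys, unique_fraction(population, keys))
```

In this toy table no record is unique by sex alone, while all records are unique once sex, zip code and birthdate are combined, which is the effect that linking two datasets can produce.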
However, gene expression can also be linked to the genome through expression QTLs. Using a Bayesian framework, we found that it is feasible to predict genomic variants from relative isoform expression. Based on GTEx splicing QTL data, using relative isoform expression from 30 genes, we could identify the target genome within a pool of hundreds of individuals with >96% accuracy. With more splicing QTLs, it is possible to identify the target genome of an RNA-seq dataset among millions of individuals. Researchers have proposed eliminating the risk of gene-expression-based linking attacks by adding noise to gene expression values, based on the observation that only a few genes enable linkage. However, we found that there are now many more such genes than previously reported, and that expression data enables re-identification of the target genome within a pool of billions of genomes. Our results imply that mitigating the linking risk by adding noise would severely compromise the biological integrity of the data, since the data would no longer be biologically meaningful once more than half of the gene expression values are modified. Our study also implies that other kinds of "omic" data, including DNA modification and protein and metabolite levels, may leak genome privacy.
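The linking step described above can be sketched as a toy Bayesian match. This is a minimal illustration under stated assumptions, not the framework used in the study: it assumes one biallelic splicing QTL per gene, a Gaussian distribution of the relative isoform ratio around a genotype-specific mean, and a uniform prior over the candidate genomes. The constants `GENOTYPE_MEANS` and `NOISE_SD` and the functions `log_likelihood`, `link_to_pool` and `simulate` are hypothetical names with invented values.

```python
import math
import random

# Hypothetical toy parameters, not taken from the study.
GENOTYPE_MEANS = (0.2, 0.5, 0.8)  # assumed mean isoform ratio for 0, 1, 2 alternate alleles
NOISE_SD = 0.05                   # assumed standard deviation of the measured ratio


def log_likelihood(isoform_ratios, genotypes):
    """Log P(observed isoform ratios | candidate genotypes) under the Gaussian model."""
    total = 0.0
    for ratio, genotype in zip(isoform_ratios, genotypes):
        z = (ratio - GENOTYPE_MEANS[genotype]) / NOISE_SD
        total += -0.5 * z * z - math.log(NOISE_SD * math.sqrt(2.0 * math.pi))
    return total


def link_to_pool(isoform_ratios, pool):
    """Return (best_index, posterior) assuming a uniform prior over candidates."""
    scores = [log_likelihood(isoform_ratios, genotypes) for genotypes in pool]
    top = max(scores)
    # Subtract the maximum before exponentiating to keep the weights numerically stable.
    weights = [math.exp(score - top) for score in scores]
    norm = sum(weights)
    posterior = [w / norm for w in weights]
    best = max(range(len(pool)), key=posterior.__getitem__)
    return best, posterior


def simulate(n_individuals=300, n_genes=30, seed=0):
    """Simulate a genotype pool and one target's isoform ratios (toy data only)."""
    rng = random.Random(seed)
    # Genotypes drawn with probabilities 0.25, 0.5, 0.25 (allele frequency 0.5).
    pool = [[rng.choice((0, 1, 1, 2)) for _ in range(n_genes)]
            for _ in range(n_individuals)]
    target = rng.randrange(n_individuals)
    ratios = [rng.gauss(GENOTYPE_MEANS[g], NOISE_SD) for g in pool[target]]
    return pool, target, ratios


pool, target, ratios = simulate()
best, posterior = link_to_pool(ratios, pool)
print("true index:", target, "linked index:", best)
```

Because each gene contributes an independent term to the log-likelihood, the evidence separating the true genome from the other candidates grows with the number of genes, which is consistent with the claim that more splicing QTLs allow matching within much larger pools.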