Genome privacy leakage from omics data
Published: 2018-06-11  Views: 5527

Speaker: Zhiqiang Hu (胡智强)

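About the speaker: Dr. Zhiqiang Hu is currently a postdoctoral researcher at UC Berkeley, USA. He received his PhD in 2006 from the Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University. In 2018 he published the 3,000 rice pan-genome paper in Nature as a co-first author.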

Time: 2018-06-13, 12:55-15:30

Venue: Room 1-304, Dong Zhong Yuan (东中院)

Contact: Chaochun Wei (韦朝春), ccwei@sjtu.edu.cn

 

Abstract:

Sharing genomes without personal identifiers is common practice. However, recent studies have revealed the risk of re-identifying people from their genomes or from attached quasi-identifiers such as sex, birthdate and zip code. The additional availability of an individual’s RNA-seq data has implications for privacy, because it may be linked to the genome, potentially allowing the person’s privacy to be breached. For example, sex and ethnicity may be inferred directly from a genome, and the genome study may provide a zip code. The genome could then be linked to RNA-seq data from a diabetes study with attached birthdates and income. Combined, these quasi-identifiers may uniquely identify the person, and the diabetes study then reveals the person’s disease status. RNA-seq reads contain genetic variants and can therefore be linked directly to the genome. To avoid this risk, some researchers now release gene expression, isoform expression and exon read count data instead of raw reads.
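The narrowing effect of combined quasi-identifiers can be illustrated with a minimal sketch. Everything in it is hypothetical: the toy population, the field names and the `candidates` helper are made up for illustration and are not taken from the talk.

```python
# Toy illustration only: all records and field names are invented.

def candidates(population, known):
    """Return the records that agree with every known quasi-identifier."""
    return [p for p in population
            if all(p.get(key) == value for key, value in known.items())]

# Hypothetical public records an attacker might hold.
population = [
    {"name": "Person A", "sex": "F", "zip": "00001", "birthdate": "1970-01-01"},
    {"name": "Person B", "sex": "F", "zip": "00001", "birthdate": "1982-05-17"},
    {"name": "Person C", "sex": "M", "zip": "00001", "birthdate": "1970-01-01"},
    {"name": "Person D", "sex": "F", "zip": "00002", "birthdate": "1970-01-01"},
]

# Quasi-identifiers attached to (or inferred from) the genome.
genome_side = {"sex": "F", "zip": "00001"}
# Quasi-identifiers attached to the RNA-seq data from the second study.
rnaseq_side = {"birthdate": "1982-05-17"}

# The genome alone leaves several candidates ...
genome_only = candidates(population, genome_side)
# ... but once the two datasets are linked, the merged attributes may single out one person.
combined = candidates(population, {**genome_side, **rnaseq_side})

print(len(genome_only), [p["name"] for p in combined])
```

In this toy case the genome-side attributes match two people, while the merged attributes match exactly one. This is why linking the two datasets matters more than either dataset alone.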

However, gene expression can also be linked to the genome through expression QTLs. Using a Bayesian framework, we found that it is feasible to predict genomic variants from relative isoform expression. Based on GTEx splicing QTL data, using relative isoform expression from 30 genes, we could identify the target genome within a pool of hundreds of individuals with >96% accuracy. With more splicing QTLs, it is possible to identify the target genome of an RNA-seq dataset among millions of individuals. Researchers have proposed eliminating the risk of gene-expression-based linking attacks by adding noise to gene expression values, based on the observation that only a few genes enable linkage. However, we found that there are many more such genes than previously reported, and that expression data enables the re-identification of a target genome from a pool containing billions of genomes. Our results imply that mitigating the linking risk by adding noise would severely compromise the biological value of the data, since the data will no longer be biologically meaningful when over half of the gene expression values are modified. Our study also implies that other kinds of “omic” data, including DNA modification and protein and metabolite levels, may also leak genome privacy.
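One way such a Bayesian linking attack could work is sketched below. This is not the speaker's actual model. It assumes, purely for illustration, that each splicing QTL has three genotypes (0, 1 or 2 alternative alleles), that relative isoform expression is Gaussian around a genotype-specific mean, and that the genotype prior follows Hardy-Weinberg proportions. All function names and numbers are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def genotype_posterior(x, means, sigma, alt_freq):
    """P(genotype | expression) for genotypes 0, 1, 2 (assumed model)."""
    q = alt_freq
    prior = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]  # Hardy-Weinberg prior (assumption)
    joint = [prior[g] * normal_pdf(x, means[g], sigma) for g in range(3)]
    total = sum(joint)
    return [j / total for j in joint]

def link_score(expression, genotypes, qtls):
    """Sum of log posterior probabilities of one candidate genome's genotypes."""
    score = 0.0
    for x, g, (means, sigma, alt_freq) in zip(expression, genotypes, qtls):
        posterior = genotype_posterior(x, means, sigma, alt_freq)
        score += math.log(max(posterior[g], 1e-300))  # guard against log(0)
    return score

def best_match(expression, pool, qtls):
    """Return the name of the genome in the pool that best explains the expression."""
    return max(pool, key=lambda name: link_score(expression, pool[name], qtls))

# Hypothetical QTL parameters: (mean expression per genotype, sigma, alt allele frequency).
qtls = [((0.2, 0.5, 0.8), 0.1, 0.3)] * 3
# Hypothetical relative isoform expression observed at the three genes.
expression = [0.21, 0.79, 0.52]
# Hypothetical genotypes of three candidate genomes at the three QTLs.
pool = {
    "genome_1": [0, 2, 1],
    "genome_2": [1, 1, 1],
    "genome_3": [2, 0, 0],
}

print(best_match(expression, pool, qtls))
```

With these toy values, `genome_1` scores highest because its genotypes match the expression levels at all three genes. Under this assumed model each additional informative QTL adds to the score gap between the true genome and the rest, consistent with the abstract's point that more QTLs allow identification within larger pools.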