The National Cancer Institute Genomic Data Commons (GDC) is an information system for storing, analyzing, and sharing genomic and clinical data from patients with cancer. The recent high-throughput sequencing of cancer genomes and transcriptomes has produced a big data problem that prevents many cancer biologists and oncologists from gleaning knowledge from these data about the nature of malignant processes and the relationship between tumor genomic profiles and treatment response. The GDC aims to democratize access to cancer genomic data and to foster the sharing of these data to promote precision medicine approaches to the diagnosis and treatment of cancer.
Background: Contemporary bioscience sometimes demands vast sample sizes, and there is then often no choice but to synthesise data across several studies and to undertake an appropriate pooled analysis. The same need is faced in health-services and socio-economic research. When a pooled analysis is required, analytic efficiency and flexibility are often best served by combining the individual-level data from all sources and analysing them as a single large data set. But ethico-legal constraints, including the wording of consent forms and privacy legislation, often prohibit or discourage the sharing of individual-level data, particularly across national or other jurisdictional boundaries. This leads to a fundamental conflict between competing public goods: individual-level analysis is desirable from a scientific perspective, but is prevented by ethico-legal considerations that are entirely valid. Methods: DataSHIELD (Data Aggregation Through Anonymous Summary-statistics from Harmonized Individual-levEL Databases) provides a simple approach to analysing pooled data that circumvents this conflict. This is achieved via parallelized analysis and modern distributed computing and, in one key setting, takes advantage of the properties of the updating algorithm for generalized linear models (GLMs). Results: The conceptual use of DataSHIELD is illustrated in two different settings. Conclusions: As the study of the aetiological architecture of chronic diseases advances to encompass more complex causal pathways (e.g. to include the joint effects of genes, life-style and environment), sample size requirements will increase further and the analysis of pooled individual-level data will become ever more important.
An aim of this conceptual paper is to encourage others to address the challenges and opportunities that DataSHIELD presents, and to explore potential extensions, for example to its use when different data sources hold different data on the same individuals.
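The GLM setting mentioned in the Methods can be illustrated with a small sketch. It assumes a logistic regression fitted by Newton-type updates: at each iteration every study returns only its score vector and information matrix for the current coefficients, and an analysis centre sums these and updates the coefficients. The function names (`local_summaries`, `fit_distributed`, `simulate_study`) and the simulated data are hypothetical and are not part of the DataSHIELD software; this is an illustration of the idea, not the authors' implementation.

```python
import math
import random


def local_summaries(X, y, beta):
    """Run inside one study (hypothetical helper).

    Returns only aggregate quantities for a logistic model: the score
    vector and the information matrix. No individual-level row is returned.
    """
    p = len(beta)
    score = [0.0] * p
    info = [[0.0] * p for _ in range(p)]
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))
        mu = 1.0 / (1.0 + math.exp(-eta))   # fitted probability
        w = mu * (1.0 - mu)                 # variance weight
        for j in range(p):
            score[j] += xi[j] * (yi - mu)
            for k in range(p):
                info[j][k] += w * xi[j] * xi[k]
    return score, info


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_distributed(studies, p, tol=1e-10, max_iter=50):
    """Analysis centre: sum per-study summaries, then update the coefficients."""
    beta = [0.0] * p
    for _ in range(max_iter):
        total_score = [0.0] * p
        total_info = [[0.0] * p for _ in range(p)]
        for X, y in studies:
            s, info = local_summaries(X, y, beta)
            for j in range(p):
                total_score[j] += s[j]
                for k in range(p):
                    total_info[j][k] += info[j][k]
        step = solve(total_info, total_score)
        beta = [b + d for b, d in zip(beta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return beta


def simulate_study(n, seed):
    """Simulated data for illustration only: intercept plus one covariate."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        prob = 1.0 / (1.0 + math.exp(-(-0.5 + 1.0 * x)))
        X.append([1.0, x])
        y.append(1 if rng.random() < prob else 0)
    return X, y


if __name__ == "__main__":
    study_a = simulate_study(150, seed=1)
    study_b = simulate_study(250, seed=2)
    print(fit_distributed([study_a, study_b], p=2))
```

Because the score and information contributions are sums over individuals, adding the per-study summaries gives the same update as pooling all rows into one data set. In this sketch the distributed fit therefore agrees with a pooled fit up to floating-point rounding.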
Background: DataSHIELD (Data Aggregation Through Anonymous Summary-statistics from Harmonised Individual levEL Databases) has been proposed to facilitate the co-analysis of individual-level data from multiple studies without physically sharing the data. In a previous paper, we investigated whether DataSHIELD could protect participant confidentiality in accordance with UK law. In this follow-up paper, we investigate whether DataSHIELD addresses a broader range of ethics-related data-sharing concerns. Methods: Ethics-related data-sharing concerns of Institutional Review Boards, ethics experts, international research consortia and research participants were identified through a literature search and systematically examined at a multidisciplinary workshop to determine whether DataSHIELD proposes mechanisms which can address these concerns. Results: DataSHIELD addresses several ethics-related data-sharing concerns related to privacy, confidentiality, and the protection of the research participant's rights while sharing data and after the data have been shared. The data remain entirely under the direct management of the study that collected them. Data processing commands are strictly supervised, and the data are queried in a protected environment. Issues related to the return of individual research results when data are shared are eliminated; the responsibility for return remains at the study of origin. Conclusion: DataSHIELD can provide an innovative and robust solution for addressing commonly encountered ethics-related data-sharing concerns.
The Pan-Cancer Analysis of Whole Genomes (PCAWG) project generated a vast amount of whole-genome cancer sequencing resource data. Here, as part of the ICGC/TCGA PCAWG Consortium, which aggregated whole-genome sequencing data from 2658 cancers across 38 tumor types, we provide a user's guide to the five publicly available online data exploration and visualization tools introduced in the PCAWG marker paper. These tools are ICGC Data Portal, UCSC Xena, Chromothripsis Explorer, Expression Atlas, and PCAWG-Scout. We detail use cases and analyses for each tool, show how they incorporate outside resources from the larger genomics ecosystem, and demonstrate how the tools can be used together to understand the biology of cancers more deeply. Together, the tools enable researchers to query the complex genomic PCAWG data dynamically and integrate external information, enabling and enhancing interpretation. ; The ICGC Data Portal development is supported by the Ontario Institute for Cancer Research (OICR) through funding provided by the government of Ontario. UCSC Xena development is supported by the National Cancer Institute of the National Institutes of Health under award numbers 5U24CA180951-04 and 5U24CA210974-02. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Chromothripsis Explorer development is supported by the European Union's Framework Programme for Research and Innovation Horizon 2020 (2014-2020) under the Marie Skłodowska-Curie Grant Agreement No. 703543 (I.C.-C.). Expression Atlas development is supported by the European Molecular Biology Laboratory (EMBL) member states, the Single Cell Gene Expression Atlas from the Wellcome Trust (grant number 108437/Z/15/Z), a US National Science Foundation grant to the Gramene database [NSF IOS #1127112], Open Targets, and the Chan Zuckerberg Initiative.
PCAWG-Scout development is supported by the joint BSC-IRB-CRG Program in Computational Biology and Severo Ochoa Award SEV 2015-0493. In addition, this work has been supported by the Spanish Government (SEV 2015-0493) and by the BSC-Lenovo Master Collaboration Agreement (2015). We acknowledge the contributions of the many clinical networks across ICGC and TCGA who provided samples and data to the PCAWG Consortium and the contributions of the Technical Working Group and the Germline Working Group of the PCAWG Consortium for collation, realignment, and harmonized variant calling of the cancer genomes used in this study. We thank the patients and their families for their participation in the individual ICGC and TCGA projects. ; Peer Reviewed ; Postprint (published version)