Unifying the global coding sequence space enables the study of genes with unknown function across biomes
Abstract
One of the biggest challenges in molecular biology is bridging the gap between the known and the unknown coding sequence space. This challenge is especially extreme in microbial systems, where between 40% and 60% of the predicted genes are of unknown function. Discarding this uncharacterized fraction should no longer be an option. Here, we present a conceptual framework and a computational workflow that bridge this gap and provide a powerful strategy to contextualize the investigation of genes of unknown function. Our approach partitions the coding sequence space, removing the known-unknown dichotomy, unifies genomic and metagenomic data, and provides a framework to expand those investigations across environments and organisms. By analyzing 415,971,742 genes predicted from 1,749 metagenomes and 28,941 bacterial and archaeal genomes, we showcase our approach and its application in ecological, evolutionary and biotechnological investigations. As a result, we put into perspective the extent of the unknown fraction, its diversity, and its relevance in genomic and environmental contexts. By identifying a target gene of unknown function for antibiotic resistance, we demonstrate how a contextualized unknown coding sequence space enables the generation of hypotheses that can be used to augment experimental data.
The authors gratefully acknowledge the computer resources at MareNostrum and the technical support provided by Barcelona Supercomputing Center (RES-AECT-2014-2-0085), the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A537B, 031A533A, 031A538A, 031A533B, 031A535A, 031A537C, 031A534A, 031A532B), the University of Oxford Advanced Research Computing (http://dx.doi.org/10.5281/zenodo.22558) and the MARBITS bioinformatics core at ICM-CSIC. CV was supported by the Max Planck Society.
AFG received funding from the European Union's Horizon 2020 research and innovation program Blue Growth: Unlocking the potential of Seas and Oceans under grant agreement no. 634486 (project acronym INMARE). AM was supported by the Biotechnology and Biological Sciences Research Council [BB/M011755/1, BB/R015228/1] and RDF by the European Molecular Biology Laboratory core funds. EOC was supported by project INTERACTOMA RTI2018-101205-B-I00 from the Spanish Agency of Science MICIU/AEI. SGA and PS received additional funding from the project MAGGY (CTM2017-87736-R) of the Spanish Ministry of Economy and Competitiveness. The Malaspina 2010 Expedition was supported by the Spanish Ministry of Economy and Competitiveness (MINECO) through the Consolider-Ingenio program (ref. CSD2008-00077). The authors thank Johannes Söding and Alex Bateman for helpful discussions.