Unraveling the hidden universe of small proteins in bacterial genomes
Abstract
Identification of small open reading frames (smORFs) encoding small proteins (≤ 100 amino acids; SEPs) is a challenge in the fields of genome annotation and protein discovery. Here, by combining a novel bioinformatics tool (RanSEPs) with "-omics" approaches, we were able to describe 109 bacterial small ORFomes. Predictions were first validated by performing an exhaustive search for SEPs in the Mycoplasma pneumoniae proteome via mass spectrometry, which illustrated the limitations of shotgun approaches. Then, RanSEPs predictions were validated and compared with other tools using proteomic datasets from different bacterial species and SEPs from the literature. We found that up to 16 ± 9% of proteins in an organism could be classified as SEPs. Integration of RanSEPs predictions with transcriptomics data showed that some annotated non-coding RNAs could in fact encode SEPs. A functional study of SEPs highlighted an enrichment in the membrane, translation, metabolism, and nucleotide-binding categories. Additionally, 9.7% of the SEPs included a predicted N-terminal signal peptide. We envision RanSEPs as a tool to unmask the hidden universe of small bacterial proteins.
We thank Dr. Luca Cozzuto from the Bioinformatics Unit at CRG for providing valuable guidance for the conservation studies. We would also like to thank Dr. Carolina Gallo, Dr. Eva Yus, and Dr. Raul Burgos for providing MS data. Finally, we thank Dr. Marc Weber for his advice on how to analyze Ribo-Seq data.
We acknowledge the support of the Spanish Ministry of Economy, Industry and Competitiveness (MEIC) to the EMBL partnership, the Spanish Ministry of Economy and Competitiveness, "Centro de Excelencia Severo Ochoa", the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program under agreement No 670216 (MYCOCHASSIS), the CERCA Programme/Generalitat de Catalunya, the European Regional Development Fund (ERDF) project from Instituto de Salud Carlos III (ISCIII, Acción Estratégica en Salud 2016; reference CP16/00094), and "Secretaria d'Universitats i Recerca del Departament d'Economia i Coneixement de la Generalitat de Catalunya" (2014SGR678). The CRG/UPF Proteomics Unit is part of the "Plataforma de Recursos Biomoleculares y Bioinformáticos (ProteoRed)" supported by grant PT13/0001 of Instituto de Salud Carlos III from the Spanish Government.