Marine Genomics 18 (2014) 97–99
Metagenomes from two microbial consortia associated with Santa Barbara seep oil

Erik R. Hawley a, Stephanie A. Malfatti b, Ioanna Pagani c, Marcel Huntemann c, Amy Chen c, Brian Foster c, Alexander Copeland c, Tijana Glavina del Rio c, Amrita Pati c, Janet R. Jansson c,d, Jack A. Gilbert e,f, Susannah Green Tringe c,d, Thomas D. Lorenson g, Matthias Hess a,c,h,i,⁎

a Washington State University, Richland, WA, USA
b Lawrence Livermore National Laboratory, Biosciences and Biotechnology Division, Livermore, CA, USA
c DOE Joint Genome Institute, Walnut Creek, CA, USA
d Lawrence Berkeley National Laboratory, Berkeley, CA, USA
e Argonne National Laboratory, Lemont, IL, USA
f University of Chicago, Chicago, IL, USA
g U.S. Geological Survey, Menlo Park, CA, USA
h Pacific Northwest National Laboratory, Chemical & Biological Process Development Group, Richland, WA, USA
i Environmental Molecular Sciences Laboratory, Richland, WA, USA
Article info

Article history:
Received 7 June 2014
Received in revised form 10 June 2014
Accepted 10 June 2014
Available online 20 June 2014

Keywords: Bioremediation; Hydrocarbon degradation; Marine ecosystem; Metagenomics; Natural oil seeps
Abstract

The metagenomes from two microbial consortia associated with natural oils seeping into the Pacific Ocean off the coast of Santa Barbara (California, USA) were determined to complement existing metagenomes generated from microbial communities associated with hydrocarbons that pollute the marine ecosystem. This genomics resource article is the first of two publications reporting a total of four new metagenomes from oils that seep into the Santa Barbara Channel. © 2014 Elsevier B.V. All rights reserved.
Abbreviations: eDNA, environmental DNA; GoM, Gulf of Mexico; SBC, Santa Barbara Channel.
⁎ Corresponding author at: Washington State University, Richland, WA, USA.
http://dx.doi.org/10.1016/j.margen.2014.06.003
1874-7787/© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Hydrocarbons can be major contaminants of marine and coastal ecosystems and can have significant socio-ecological impacts. Although microbial consortia indigenous to areas with constitutively increased concentrations of hydrocarbons are well known for their ability to degrade these contaminants (Vila et al., 2010), very little is known about the microbial response and processes that occur after an oil spill and during the remediation of hydrocarbons in uncontrolled and complex ecological systems (Head et al., 2006). Natural hydrocarbon seepage areas in the marine system can be found around the globe, and one region that has received significant attention in recent years is the Gulf of Mexico (GoM). Other regions, such as the Santa Barbara Channel (SBC), which contains some of the most active hydrocarbon seeps in the world (Hornafius et al., 1999), have received significantly less attention. To build a comprehensive knowledge database, which will eventually facilitate the development of sustainable strategies for oil remediation in the case of future oil spills, it will be crucial to collect and analyze biological data from seep areas other than the GoM.

Here we report two metagenomes (Oil-MG-1 and Oil-MG-3) from SBC seep oils, which will complement the rapidly increasing number of large-scale sequence-based studies of samples acquired from the GoM after the Deepwater Horizon blowout and the few small- to medium-scale metagenomic studies from other hydrocarbon seep-rich regions that have been conducted to date.

Metagenomic data were generated from two hydrocarbon-adapted consortia collected using a remotely operated vehicle from submarine oil seeps located within a 30 m radius from 34.3751°N, 119.8532°W at 65 m (Oil-MG-1) and 47 m (Oil-MG-3). The collected oil samples were transported immediately to the laboratory and stored at −20 °C until DNA extraction was performed. Environmental DNA (eDNA) was extracted from 500 mg of the seep oils using a FastDNA Spin Kit for Soil (MP Biomedicals) according to the manufacturer's protocol. Bead-beating was conducted three times
(20 s) using a Mini-Beadbeater-16 (Biospec Products). Samples were kept on ice for 1 min between each round of bead-beating. From each sample, 200 ng of eDNA was sheared to 270 bp using the Covaris E210 and subjected to size selection using SPRI beads (Beckman Coulter). Sequencing libraries were generated from the obtained fragments using the KAPA-Illumina library creation kit (KAPA Biosystems). Libraries were quantified by qPCR using KAPA Biosystems' next-generation sequencing library qPCR kit and run on a Roche LightCycler 480 real-time PCR instrument. Quantified libraries were then prepared for sequencing on the Illumina HiSeq2000 sequencing platform, utilizing a TruSeq paired-end cluster kit, v3, and Illumina's cBot instrument to generate clustered flowcells. Sequencing of flowcells was performed on the Illumina HiSeq2000 platform using a TruSeq SBS sequencing kit 200 cycles, v3, following a 2 × 150 indexed run recipe. A total of 51.8 Gbp and 54.1 Gbp were generated for Oil-MG-1 and Oil-MG-3, respectively.

Raw metagenomic reads were trimmed using a minimum quality score cutoff of 10. Trimmed, paired-end reads were assembled using SOAPdenovo v1.05 (Luo et al., 2012) with a range of k-mers (81, 85, 89, 93, 97, 101). Default settings were used for all SOAPdenovo assemblies. Contigs generated by each assembly were sorted into two pools based on length. Contigs smaller than 1800 bp were assembled using Newbler (Life Technologies) to generate larger contigs (flags: -tr, -rip, -mi 98, -ml 80). Contigs larger than 1800 bp, as well as contigs generated from the final Newbler run, were combined using minimus2 (flags: -D MINID=98 -D OVERLAP=80) [AMOS (http://sourceforge.net/projects/amos)]. Read depth estimates are based on mapping the trimmed, screened, paired-end Illumina reads to assembled contigs using BWA (http://bio-bwa.sourceforge.net/). Un-assembled, paired reads were merged with FLASH (http://sourceforge.net/projects/flashpage).
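The length-based pooling step described above can be illustrated with a short Python sketch. It is not the authors' script; the function name `pool_contigs` and the toy sequences are hypothetical. The text does not say which pool received a contig of exactly 1800 bp, so assigning it to the long pool is an assumption made here.

```python
from typing import Dict, Tuple

# Cutoff taken from the text. How a contig of exactly 1800 bp was handled
# is not stated; sending it to the long pool is an assumption of this sketch.
LENGTH_CUTOFF_BP = 1800


def pool_contigs(
    contigs: Dict[str, str], cutoff: int = LENGTH_CUTOFF_BP
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split contigs (name -> sequence) into a short pool and a long pool."""
    short_pool: Dict[str, str] = {}
    long_pool: Dict[str, str] = {}
    for name, seq in contigs.items():
        if len(seq) < cutoff:
            short_pool[name] = seq   # would go on to re-assembly (Newbler in the text)
        else:
            long_pool[name] = seq    # would be merged later (minimus2 in the text)
    return short_pool, long_pool


# Toy input: hypothetical sequences, not data from the study.
example = {
    "contig_a": "A" * 500,
    "contig_b": "C" * 1799,
    "contig_c": "G" * 1800,
    "contig_d": "T" * 5000,
}
short_pool, long_pool = pool_contigs(example)
print("short pool:", sorted(short_pool))
print("long pool:", sorted(long_pool))
```

In this sketch the short pool corresponds to the contigs passed to the second assembler, and the long pool to the contigs held back for the final merge.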
Assembled contigs along with the merged, un-assembled reads were submitted to the Integrated Metagenome Analysis System (https://img.jgi.doe.gov/) for functional annotation. Submitted sequences were trimmed to remove low-quality regions, and stretches of undetermined sequence at the ends of contigs were removed. Each sequence was checked with the DUST algorithm (Morgulis et al., 2006) for low-complexity regions. Sequences with fewer than 80 unmasked nt were removed. Additionally, very similar sequences (similarity > 95%) with identical 5′ pentanucleotides were replaced by one representative using UCLUST (www.drive5.com).

The feature prediction pipeline included the detection of non-coding RNA genes followed by prediction of protein-coding genes. Identification of tRNAs was performed using tRNAScan-SE-1.23 (Lowe and Eddy, 1997). In case of conflicting predictions, the best-scoring predictions were selected. The last 150 nt of the sequences were also checked by comparing them to a database containing tRNA sequences identified in isolate genomes using blastn (Altschul et al., 1997). Hits with high similarity were kept. Ribosomal RNA genes were predicted using hmmsearch (Eddy, 2011) with internally developed models for the three types of RNAs for the domains of life.

Identification of protein-coding genes was performed using four different gene calling tools, GeneMark (v.2.6r) (Besemer and Borodovsky, 2005), Metagene (v. Aug08) (Noguchi et al., 2006), Prodigal (v2.50) (Hyatt et al., 2010) and FragGeneScan (Rho et al., 2010), all of which are ab initio gene prediction programs. We typically followed a majority-rule-based decision scheme to select the gene calls. When there was a tie, we selected genes based on an order of gene callers determined by runs on simulated metagenomic datasets (GeneMark > Prodigal > Metagene > FragGeneScan). Finally, CDS and other feature predictions were consolidated.
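The majority-rule selection with an ordered tie-break, as described above, might be sketched as follows. This is a minimal illustration and not the IMG pipeline code. Representing a gene call as a `(start, end, strand)` tuple, treating calls as equal only when all three fields match, and the function name `select_gene_call` are all assumptions of this sketch.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Tie-break order stated in the text, highest priority first.
CALLER_PRIORITY = ["GeneMark", "Prodigal", "Metagene", "FragGeneScan"]

# Hypothetical representation of one gene call: (start, end, strand).
GeneCall = Tuple[int, int, str]


def select_gene_call(calls: Dict[str, Optional[GeneCall]]) -> Optional[GeneCall]:
    """Pick one call for a locus from up to four callers (None = no call)."""
    made = {caller: call for caller, call in calls.items() if call is not None}
    if not made:
        return None
    votes = Counter(made.values())
    top = max(votes.values())
    winners = {call for call, n in votes.items() if n == top}
    if len(winners) == 1:
        return next(iter(winners))          # clear majority
    # Tie: use the call from the highest-priority caller among the tied calls.
    for caller in CALLER_PRIORITY:
        if made.get(caller) in winners:
            return made[caller]
    return None                             # tied calls from unknown callers only


# Toy loci: hypothetical coordinates, not data from the study.
majority = select_gene_call({
    "GeneMark": (10, 400, "+"), "Prodigal": (10, 400, "+"),
    "Metagene": (40, 400, "+"), "FragGeneScan": (10, 400, "+"),
})
tie = select_gene_call({
    "GeneMark": None, "Prodigal": (40, 400, "+"),
    "Metagene": (10, 400, "+"), "FragGeneScan": None,
})
print(majority, tie)
```

In the second toy locus only Prodigal and Metagene make a call and they disagree, so the stated caller order decides in favor of Prodigal.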
Regions identified previously as RNA genes were preferred over protein-coding genes. Subsequent functional prediction involved comparison of predicted protein sequences to the public IMG database using the USEARCH algorithm (www.drive5.com), the COG database using the NCBI-developed PSSMs (Tatusov et al., 2003), and the Pfam database (Punta et al., 2012) using hmmsearch. Assignment to KEGG Orthology (KO) protein families was performed as described previously (Mao et al., 2005).
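The preference for RNA genes over protein-coding calls mentioned above can be illustrated with a small interval filter. The text does not specify how overlaps were resolved, so this sketch assumes the simplest rule: a CDS call is dropped if it overlaps an RNA gene by at least one base. The coordinates are hypothetical 1-based inclusive positions on a single contig.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based inclusive (start, end) on one contig


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def prefer_rna(rna_genes: List[Interval], cds_calls: List[Interval]) -> List[Interval]:
    """Keep only CDS calls that do not overlap any RNA gene.

    Assumption of this sketch: any overlap removes the CDS call. The actual
    pipeline may tolerate or trim partial overlaps.
    """
    return [
        cds for cds in cds_calls
        if not any(overlaps(cds, rna) for rna in rna_genes)
    ]


# Toy contig: hypothetical coordinates, not data from the study.
rna = [(100, 1600)]
cds = [(1, 90), (50, 300), (1500, 2000), (1601, 2500)]
kept = prefer_rna(rna, cds)
print(kept)
```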
Table 1
Assembly statistics.

                                  Oil-MG-1       Oil-MG-3
High-quality reads generated      334,697,839    347,283,646
Total scaffolds assembled         736,537        491,247
Total base pairs assembled        544,336,013    437,751,882
Scaffolds ≥ 1 kbp                 92,314         83,092
Scaffolds ≥ 10 kbp                4114           3900
Scaffolds ≥ 25 kbp                781            758
Scaffolds ≥ 100 kbp               15             29
Scaffolds ≥ 250 kbp               0              1
Size of longest scaffold (kbp)    163.2          267.6
Genes on longest scaffold         186            257
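Two summary values that follow directly from Table 1, but are not reported in the article, are the mean scaffold length and the share of scaffolds of at least 1 kbp. The sketch below derives them from the table entries. The dictionary layout and the function name `summarize` are illustrative choices, not part of the original analysis.

```python
# Values copied from Table 1; the derived quantities are not reported in the article.
table1 = {
    "Oil-MG-1": {"scaffolds": 736_537, "bases": 544_336_013, "ge_1kbp": 92_314},
    "Oil-MG-3": {"scaffolds": 491_247, "bases": 437_751_882, "ge_1kbp": 83_092},
}


def summarize(row):
    """Return (mean scaffold length in bp, percent of scaffolds >= 1 kbp)."""
    mean_len = row["bases"] / row["scaffolds"]
    pct_ge_1kbp = 100.0 * row["ge_1kbp"] / row["scaffolds"]
    return mean_len, pct_ge_1kbp


summary = {name: summarize(row) for name, row in table1.items()}
for name, (mean_len, pct) in summary.items():
    print(f"{name}: mean scaffold length {mean_len:.1f} bp, {pct:.1f}% of scaffolds >= 1 kbp")
```

With these inputs the mean scaffold length comes out at roughly 739 bp for Oil-MG-1 and roughly 891 bp for Oil-MG-3, so most scaffolds in both assemblies are shorter than 1 kbp.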
Analysis of the assembled sequences revealed 1,136,186 genes, with 99.3% annotated as protein coding, from Oil-MG-1 and 843,676 genes, with 99% annotated as protein coding, from Oil-MG-3. A total of 788,331 protein-coding genes from Oil-MG-1 and 583,785 protein-coding genes from Oil-MG-3, corresponding in each case to 69.9% of the total predicted protein-coding genes, were assigned to a putative family or function based on the presence of conserved Pfam domains; the remaining genes were annotated as hypothetical proteins. A summary of the assembly statistics and of the features of the assembled metagenomes is provided in Tables 1 & 2.

Table 2
Metagenome features.

                          Oil-MG-1     Oil-MG-3
Size [Mbp]                544.3        437.8
Scaffolds                 736,537      491,247
GC [%]                    42.65        44.92
Genes
  Genes identified        1,136,186    843,676
  RNA genes               8353         8855
  rRNA genes              1660         1447
  5S rRNA                 272          242
  16S rRNA                530          425
  18S rRNA                7            8
  23S rRNA                842          764
  28S rRNA                9            8
  tRNA genes              6693         7408
  Protein coding genes    1,127,833    834,821
  With product name       638,526      482,267
  With COG                629,380      473,753
  With Pfam               788,331      583,785
  With KO                 469,582      338,527
  With enzyme             270,787      196,365

2. Sequence and annotation accession

Sequences and annotation results, as well as tools for further analysis of these metagenomes, are publicly available in NCBI's SRA under the accession numbers SRX560108 and SRX559946 and at IMG/M under the Taxon IDs 3300001750 and 3300001749 for Oil-MG-1 and Oil-MG-3, respectively.

Acknowledgments

MHess and ERH and the work performed in the laboratory of MHess were funded by Washington State University. The work conducted by the U.S. Department of Energy Joint Genome Institute was supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Work conducted by JAG was supported by the U.S. Dept. of Energy under Contract No. DE-AC02-06CH11357. We are extremely thankful to our colleagues who provided letters of support for our Community Sequencing Program proposal. Additional thanks go to Matt Ashby and Ulrika Lidstrom at Taxon and staff
members of the Chemical and Biological Process Development Group – in particular David Culley, Jon Magnuson, Kenneth Bruno, Jim Collett and Scott Baker – and members of the Microbial Community Initiative – in particular Allan Konopka, Jim Fredrickson and Steve Lindeman – at PNNL for scientific discussions throughout the project.

References

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J.H., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389–3402.
Besemer, J., Borodovsky, M., 2005. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 33, W451–W454.
Eddy, S.R., 2011. Accelerated profile HMM searches. PLoS Comput. Biol. 7 (10).
Head, I.M., Jones, D.M., Roling, W.F., 2006. Marine microorganisms make a meal of oil. Nat. Rev. Microbiol. 4 (3), 173–182.
Hornafius, J.S., Quigley, D., Luyendyk, B.P., 1999. The world's most spectacular marine hydrocarbon seeps (Coal Oil Point, Santa Barbara Channel, California): quantification of emissions. J. Geophys. Res. Oceans 104 (C9), 20703–20711.
Hyatt, D., Chen, G.L., LoCascio, P.F., Land, M.L., Larimer, F.W., Hauser, L.J., 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11.
Lowe, T.M., Eddy, S.R., 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25 (5), 955–964.
Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., et al., 2012. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1 (1), 18.
Mao, X.Z., Cai, T., Olyarchuk, J.G., Wei, L.P., 2005. Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics 21 (19), 3787–3793.
Morgulis, A., Gertz, E.M., Schaffer, A.A., Agarwala, R., 2006. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol. 13 (5), 1028–1040.
Noguchi, H., Park, J., Takagi, T., 2006. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 34 (19), 5623–5630.
Punta, M., Coggill, P.C., Eberhardt, R.Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., et al., 2012. The Pfam protein families database. Nucleic Acids Res. 40 (D1), D290–D301.
Rho, M.N., Tang, H.X., Ye, Y.Z., 2010. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 38 (20).
Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., et al., 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4.
Vila, J., Maria Nieto, J., Mertens, J., Springael, D., Grifoll, M., 2010. Microbial community structure of a heavy fuel oil-degrading marine consortium: linking microbial dynamics with polycyclic aromatic hydrocarbon utilization. FEMS Microbiol. Ecol. 73 (2), 349–362.