Biochemical and Biophysical Research Communications 401 (2010) 447–450
webFOG: A web tool to map genomic features onto genes
Sonika Tyagi a,b,⇑, Mitchell S. Stark a, Nicholas K. Hayward a, David C. Whiteman b, Derek J. Nancarrow a
a Oncogenomics Laboratory, Queensland Institute of Medical Research, Herston, Brisbane, QLD 4029, Australia
b Cancer Control Group, Queensland Institute of Medical Research, Herston, Brisbane, QLD 4029, Australia
Article info
Article history: Received 11 September 2010; Available online 24 September 2010
Keywords: miRNA mapping; SNP mapping; Primer mapping; Genomic feature annotation
Abstract
A large number of new genomic features are being discovered using high throughput techniques. The next challenge is to automatically map them to the reference genome for further analysis and functional annotation. We have developed a tool that can be used to map important genomic features to the latest version of the human genome and also to annotate new features. These genomic features can come from many different sources, including miRNAs, microarray primers or probes, ChIP-on-chip data, CpG islands and SNPs, to name a few. A standalone version and a web interface for the tool can be accessed through: http://populationhealth.qimr.edu.au/cgi-bin/webFOG/index.cgi. The project details and source code are also available at http://www.bioinformatics.org/webfog.
© 2010 Elsevier Inc. All rights reserved.
⇑ Corresponding author at: Queensland Institute of Medical Research, 300 Herston Rd., Herston, QLD 4029, Australia. Fax: +61 7 38453508.
E-mail addresses: [email protected] (S. Tyagi), Mitchell.Stark@qimr.edu.au (M.S. Stark), Nick.Hayward@qimr.edu.au (N.K. Hayward), David.Whiteman@qimr.edu.au (D.C. Whiteman), [email protected] (D.J. Nancarrow).
doi:10.1016/j.bbrc.2010.09.077

1. Introduction
High throughput technologies such as high density single nucleotide polymorphism (SNP), expression or methylation microarrays provide great insight into changes present at a large number of genomic locations within a particular sample. They are useful for comparing differences between tissue types or cell states. These technologies produce tens of thousands of data points, which spatially relate to the positions of exons, promoters or CpG islands. Similarly, the capacity of highly parallel sequencing technologies to detect small RNAs at unprecedented depth suggests their value in systematically identifying novel miRNAs. The current challenge is to extract meaningful biological information from the large volumes of output obtained from these technologies.

A genome-wide location analysis is required in order to determine where a particular microarray primer or probe, a SNP, or a newly discovered miRNA maps within the reference genome. Knowing whether a novel miRNA comes from a coding or non-coding region would not only give insight into its expression and transcriptional regulation but would also permit it to be functionally annotated. For example, miRNAs that span the boundary of an exon and intron are classified as 'mirtrons' [1], and miRNAs that are part of an intron of a gene are usually transcribed as part of the pre-processed gene transcript. Similarly, one would like to know whether a methylation primer binds upstream or downstream of the transcription start site of the target gene.

Mapping sequences onto genes can be defined as the assignment of important features to specific chromosomal locations with respect to the known genes. This is an essential part of the informatics of various cancer and disease related studies; thus, a generic tool that does this job automatically will be of great help to researchers. In this report we present a bioinformatics tool designed to map genomic features onto the latest genome annotations and to locate and report the genomic feature in question in relation to the position of a gene. The tool currently works for mapping microarray primers, miRNAs, promoters or CpG islands and SNPs. To our knowledge webFOG (web version of Features Onto Genes) is the only tool designed specifically to map new features and report the relationship of gene-feature locations.

2. Material and methods
2.1. Algorithm
A Perl-based FOG (Features Onto Genes) tool was developed in the Linux environment for mapping genomic features to the known genes on the reference genome. The user can select a genome build, chromosome and search space for gene annotation from the web page (Fig. 1A). There are six options available for the mapping search space, namely:
1. from up-k-bases to GeneStart
2. from up-k-bases to GeneEnd
3. from up-k-bases to down-k-bases
4. from GeneStart to GeneEnd
5. from GeneStart to down-k-bases
6. from GeneEnd to down-k-bases

Fig. 1A. Screen shot of the main page of webFOG.

The genomic coordinates should be provided in a text file, preferably in the .psl format (http://genome.ucsc.edu/FAQ/FAQformat.html#format2) of BLAT [2], for one chromosome at a time. The user can also run a limited version of BLAT through the tool web page to generate a .psl output (Fig. 1B). This version of BLAT can be run only for one chromosome at a time. We have also provided a Perl script (to-psl.pl) that generates a dummy .psl file when the feature name, the chromosome name and the feature start and end on that chromosome are known. The instructions to run this Perl script can be found in the 'readme.txt' file provided with the standalone version of this tool. The standalone version is available to download from the web interface itself. Unlike the web version, which must be run for a single chromosome at a time, the standalone version can be used for batch screening of genomic data. The output of webFOG is in a comma-separated format that can be imported directly into a spreadsheet program. The features can also be sorted based on their genomic location (Table 1).

This tool can automatically map thousands of novel features to a genome. For the current version we have used the hg18 and hg19 builds of the human genome to extract the latest annotation [3]. However, the tool is flexible enough to accommodate newer versions of genome annotations as they become available. The method uses genomic coordinates to match feature locations with respect to those of annotated genes. Specific gene search spaces, such as ORF start, 5′ UTR, 3′ UTR, upstream of a gene and downstream of a gene, are calculated on the fly depending on the search space options chosen by the user. The orientation of a gene (+ or −) is taken into consideration while assigning a feature to it.

3. Results and discussion
To illustrate the versatility, ease of use and capabilities of the tool, we demonstrate its use in different modes with a variety of input data types:
Fig. 1B. BLAT run page: BLAT [2] can be run for one chromosome at a time to produce .psl output.
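Because both the web tool and the standalone version take feature coordinates in .psl format, a short sketch may help show which fields matter for mapping. The Python below is illustrative only and is not the authors' Perl code. `make_dummy_psl` is a hypothetical stand-in for the kind of record a script such as to-psl.pl might produce (the actual output of that script is not described here), and `parse_psl_line` reads the standard 21 tab-separated PSL columns. Zeros are used as placeholders for alignment statistics and target size that are unknown for a dummy record.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """The PSL fields needed to place a feature on a chromosome."""
    name: str
    chrom: str
    start: int   # PSL tStart (0-based)
    end: int     # PSL tEnd (exclusive)
    strand: str


def make_dummy_psl(name, chrom, start, end, strand="+"):
    """Build one 21-column PSL line from known coordinates.

    Hypothetical helper: alignment statistics and tSize are placeholders,
    and the feature is written as a single ungapped block.
    """
    size = end - start
    fields = [
        size, 0, 0, 0,          # matches, misMatches, repMatches, nCount
        0, 0, 0, 0,             # qNumInsert, qBaseInsert, tNumInsert, tBaseInsert
        strand, name,           # strand, qName
        size, 0, size,          # qSize, qStart, qEnd
        chrom, 0, start, end,   # tName, tSize (placeholder), tStart, tEnd
        1,                      # blockCount
        f"{size},", "0,", f"{start},",  # blockSizes, qStarts, tStarts
    ]
    return "\t".join(str(f) for f in fields)


def parse_psl_line(line):
    """Extract name, chromosome, start, end and strand from a PSL data line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 21:
        raise ValueError(f"expected 21 PSL columns, got {len(cols)}")
    return Feature(cols[9], cols[13], int(cols[15]), int(cols[16]), cols[8])
```

For example, `parse_psl_line(make_dummy_psl("hsa-mir-625", "chr14", 65937820, 65937904))` round-trips the coordinates; the strand passed here is an arbitrary default, not a value taken from the paper.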
Table 1
Gene maps of miRNAs from miRBASE database version 15.

Feature name  Chromosome  Feature start  Feature end  Chr strand  Gene start  Gene end  Gene strand  Gene name  Accession No.  Map on gene (from – to)
hsa-mir-4301  Chr11       113320745      113320810    n/r         113280317   1.13E+08  n/r          DRD2       NM_000795      3UTR–3UTR
hsa-mir-4317  Chr18       6374360        6374424      n/r         5954705     6414910   n/r          L3MBTL4    NM_173464      3UTR–3UTR
hsa-let-7a-2  Chr11       122017230      122017301    n/r         121959810   1.22E+08  n/r          LOC399959  NR_024430      5UTR–5UTR
hsa-mir-625   Chr14       65937820       65937904     n/r         65877839    1.22E+08  n/r          FUT8       NM_178155      5UTR–5UTR
hsa-mir-1306  Chr22       20073581       20073665     n/r         20067833    20099398  n/r          DGCR8      NM_022720      Exon_2–Exon_2
hsa-mir-636   Chr17       74732532       74732630     n/r         74730198    74733412  n/r          SFRS2      NM_003016      Exon_2–Intron_1
hsa-mir-1254  Chr10       70519075       70519171     n/r         70480970    70551307  n/r          CCAR1      NM_018237      Intron_15–Intron_15
hsa-mir-3183  Chr17       925716         925799       n/r         906759      1012324   n/r          ABR        NM_001092      Intron_15–Intron_15

n/r: strand sign not recoverable for individual rows; three '+' entries appear in each of the two strand columns.
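The six search-space options of Section 2.1 and the 'Map on gene (from – to)' column of Table 1 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the FOG implementation. It assumes that GeneStart and GeneEnd are read along the direction of transcription (so 'up' lies at higher coordinates for a gene on the − strand), that the same k applies upstream and downstream, and that gene sub-regions are supplied as a hypothetical list of (label, low, high) tuples. All function names are invented for this example.

```python
def search_space(gene_lo, gene_hi, strand, option, k=0):
    """Return (low, high) genomic bounds for search-space options 1 to 6.

    gene_lo < gene_hi are genomic coordinates; strand is '+' or '-'.
    Assumption: 'up' and 'down' follow the direction of transcription.
    """
    if strand == "+":
        tx_start, tx_end, step = gene_lo, gene_hi, 1
    elif strand == "-":
        tx_start, tx_end, step = gene_hi, gene_lo, -1
    else:
        raise ValueError("strand must be '+' or '-'")
    up = tx_start - step * k     # k bases upstream of the gene start
    down = tx_end + step * k     # k bases downstream of the gene end
    bounds = {
        1: (up, tx_start),       # up-k-bases to GeneStart
        2: (up, tx_end),         # up-k-bases to GeneEnd
        3: (up, down),           # up-k-bases to down-k-bases
        4: (tx_start, tx_end),   # GeneStart to GeneEnd
        5: (tx_start, down),     # GeneStart to down-k-bases
        6: (tx_end, down),       # GeneEnd to down-k-bases
    }
    if option not in bounds:
        raise ValueError("option must be an integer from 1 to 6")
    a, b = bounds[option]
    return max(0, min(a, b)), max(a, b)   # clamp at the chromosome start


def overlaps(feat_start, feat_end, lo, hi):
    """True if the feature interval shares at least one base with [lo, hi]."""
    return feat_start <= hi and feat_end >= lo


def map_on_gene(feat_start, feat_end, regions):
    """Label the regions containing the feature's two ends, as 'from-to'.

    regions: hypothetical list of (label, low, high) tuples, inclusive bounds.
    """
    def label(pos):
        for name, lo, hi in regions:
            if lo <= pos <= hi:
                return name
        return "Outside"
    return f"{label(feat_start)}-{label(feat_end)}"
```

As a usage example, for a gene spanning 20067833 to 20099398 (the DGCR8 coordinates in Table 1) with the strand assumed to be + purely for illustration, `search_space(20067833, 20099398, "+", 1, k=1000)` gives the 1 kb upstream window `(20066833, 20067833)`. Computing the window in transcript orientation first and sorting the bounds afterwards keeps a single table of options valid for both strands.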
3.1. As a miRNA mapper
miRNA tags (sequence reads) were mapped to the known gene structures in the human genome. We used the miRNA mapper function of the tool to annotate novel miRNAs from a cancer study [4]. A miRNA detection and analysis tool called miRanalyzer [5] was used to align the miRNA tags to the genome, and the resulting position coordinates were used for their functional annotation. The output of the webFOG tool clearly demonstrates the presence of mirtrons [1], as well as matches of different miRNA tags on alternatively spliced products of a gene and on overlapping genes [4]. We also used the miRNA mapping application on the known human miRNAs from miRBase v.15 [6] to obtain a gene map of known human miRNAs (See additional file 1). We successfully mapped 554 of the 940 known human miRNAs to known protein coding transcripts (Fig. 2). Of these, 65% lie in introns, 17% are part of the 5′ UTRs, 9% are in the 3′ UTRs and only 4% are contained within exons. The remainder are found at the boundaries of the different regions of the transcript: for example, 5% are mirtrons (Intron to Exon or Exon to Intron boundaries), and we found two miRNAs mapping to an Exon-UTR region and one mapping to an Intron-UTR region.

Fig. 2. Gene map of known human miRNAs with miRBASE (version 15).

3.2. As a microarray primer mapper
This method uses genomic position coordinates, such as those generated by BLAT, to match primer locations with respect to those of annotated genes. The search space can be selected by the user; for example, one can search upstream or downstream of a gene and/or within the gene itself.

3.3. As a promoter and CpG island mapper
We used published ChIP (chromatin immunoprecipitation) data for promoters [7] and CpG islands [8] to annotate the sequences according to the hg19 gene annotation (See additional files 2 and 3).

3.4. As a SNP mapper
We are currently using this tool to map SNPs from particular cancer types to obtain a genome-wide map of somatic mutations. The tool has proven to be a crucial part of the pipeline: based on genomic location, it has helped reduce the number of variants detected from 100,000 per sample to the several hundred that alter protein coding regions (data not shown here).

3.5. Availability and requirements
Project name: webFOG.
Project home page: http://populationhealth.qimr.edu.au/cgi-bin/webFOG/index.cgi and http://www.bioinformatics.org/webfog.
Operating system(s): Linux, Windows, Mac.
Programming language: Perl, CGI.
License: GNU GPL.
Any restrictions to use by non-academics: licence needed.

Acknowledgment
S.T. developed the tool. S.T. and M.S. tested the tool. N.H., D.W. and D.N. provided input in drafting the genetic approach of the tool. The paper was written by S.T., M.S., D.W., N.H. and D.N. This work was supported by PROBE-net Australia, funded by the Cancer Council NSW through the Strategic Research Partnership Program (SRP08-09).

References
[1] E. Berezikov, W.J. Chung, J. Willis, E. Cuppen, E.C. Lai, Mammalian mirtron genes, Mol. Cell 28 (2007) 328–336.
[2] W.J. Kent, BLAT–the BLAST-like alignment tool, Genome Res. 12 (2002) 656–664.
[3] W.J. Kent, C.W. Sugnet, T.S. Furey, K. Roskin, T.H. Pringle, A.M. Zahler, D. Haussler, The human genome browser at UCSC, Genome Res. 12 (2002) 996–1006.
[4] M.S. Stark, S. Tyagi, D.J. Nancarrow, G.M. Boyle, A.L. Cook, D.C. Whiteman, P.G. Parsons, C. Schmidt, R.A. Sturm, N.K. Hayward, Characterization of the melanoma miRNAome by deep sequencing, PLoS ONE 5 (2010) e9685.
[5] M. Hackenberg, M. Sturm, D. Langenberger, J.M. Falcon-Perez, A.M. Aransay, miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments, Nucleic Acids Res. 37 (2009) W68–W76.
[6] S. Griffiths-Jones, H.K. Saini, S. van Dongen, A.J. Enright, miRBase: tools for microRNA genomics, Nucleic Acids Res. 36 (2008) D154–D158.
[7] M. Weber, J.J. Davies, D. Wittig, E.J. Oakeley, M. Haase, W.L. Lam, D. Schübeler, Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells, Nat. Genet. 37 (2005) 853–862.
[8] J. Su, Y.Z. Lv, H. Liu, X. Tang, F. Wang, Y. Qi, Y. Feng, X. Li, CpG_MI: a novel approach for identifying functional CpG islands in mammalian genomes, Nucleic Acids Res. 38 (2010) 1–2.