Comparative View of In Silico DNA Sequencing Analysis Tools


Chapter 13

Comparative View of In Silico DNA Sequencing Analysis Tools

Sissades Tongsima, Anunchai Assawamakin, Jittima Piriyapongsa, and Philip J. Shaw

Abstract

DNA sequencing is an important tool for discovery of genetic variants. The task of detecting single-nucleotide variants is complicated by noise and sequencing artifacts in sequencing data. Several in silico tools have been developed to assist this process. These tools interpret the raw chromatogram data and perform a specialized base-calling and quality-control assessment procedure to identify variants. The approach used to identify variants differs between the tools, with some specific to SNPs and others to Indels. The choice of a tool is guided by the design of the sequencing project and the nature of the variant to be discovered. In this chapter, these tools are compared to facilitate the choice of a tool used for variant discovery.

Key words: DNA sequencing, resequencing, variation, single-nucleotide polymorphism (SNP), Indel, base calling.

B. Yu, M. Hinchcliffe (eds.), In Silico Tools for Gene Discovery, Methods in Molecular Biology 760, DOI 10.1007/978-1-61779-176-5_13, © Springer Science+Business Media, LLC 2011

1. Introduction

Before the advances in molecular biology, genes were merely abstract units of heredity known only from the phenotypic expressions of genetic variants (alleles). We now define alleles from variations in DNA sequences. The smallest unit of variation is a change of a single base, either as a substitution (single-nucleotide polymorphism, SNP) or as an insertion/deletion of a base (Indel). A number of in silico tools have been developed to assist in SNP and Indel analysis. Whatever method is used for detecting DNA variants, all putative novel variants must be unequivocally verified by DNA sequencing. Much effort is thus


focused toward resequencing genomic regions among cohorts of individuals. The availability of genome sequences has greatly facilitated the process of DNA variant discovery by the resequencing approach. Novel DNA variants within candidate regions may be rare, in which case the same region may have to be analyzed among several individuals. The “shotgun” approach using next-generation sequencing methods is not appropriate for this task, as most variants discovered would not be within the target region and the cost is still too high to be practical for this application.

The conventional/Sanger resequencing approach for variant discovery begins with design of overlapping PCR amplicons for the candidate genomic region from the reference genome sequence. The amplicons are limited to a few hundred base pairs each, since the maximum sequence read length is approximately 800 bp. PCR primers must be designed to specifically amplify the target genomic region and avoid repetitive sequence (including pseudogenes), known SNPs in primers, high GC content, and known copy number variation regions. PCR primer design is facilitated by “in silico PCR” tools, which are described in Chapters 6 and 18. Optimal PCR conditions for each primer pair also need to be determined empirically. Once the conditions optimal for each amplicon are known, the amplicons are sequenced using the same primers.

Sequencing is carried out by the Sanger method (1) using BigDye terminator reaction chemistry, and bases are detected with capillary-based sequencing machines (2). Fluorescence-based sequencers generate two data files for each sample read: a chromatogram trace file (e.g., .abi, .scf, .alf, .ctf, and .ztr) and a FASTA base-called sequence file. The automatic base-calling procedure used to generate the FASTA sequence translates the different fluorescent intensities from the chromatogram trace file.
When more than one base signal is detected at a calling position, the International Union of Pure and Applied Chemistry (IUPAC) ambiguous nucleotide codes are assigned to that position. Since heterozygous individuals are more common than homozygotes, variants typically manifest in chromatogram traces as mixed signals. These signals are misinterpreted by the automatic base-calling procedure as “N”, or calling error. Therefore, the automatic base-called sequence is not suitable for variant detection because variants cannot be distinguished from common sequencing artifacts, which include the following: (1) polymerase slippage, resulting in peak overlap, (2) loss of resolution at the beginning and the end of the read, (3) mixed amplicon/contamination, and (4) dye blob in which unused BigDye masks the nucleotide peak signal. The efficiency of nucleotide variation detection thus relies on the accuracy of in silico tools used to interpret the chromatogram traces. The variant detection tools discussed in this chapter perform base calling and then assess the quality of each base call to identify true variants.


The information captured in the chromatogram trace file is merely the intensity of four different wavelengths generated by the laser-excited fluorophores that pass through sequencing capillaries. The signal intensities of each of the four base signals are captured within a base-call array, which stores the sampling interval in the trace corresponding to each base position. DNA variant discovery tools must extract different types of signals and decide whether the signal information is of high enough quality to distinguish variants from noise in the input chromatogram files.
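As a rough illustration of how a mixed signal at one calling position can be turned into an IUPAC ambiguity code, the following Python sketch ranks the four channel intensities and reports a two-base code when the second-highest peak is large relative to the highest. The function name `call_position` and the `secondary_ratio` threshold of 0.3 are assumptions made for this example only; they are not taken from any of the tools discussed in this chapter, which apply more elaborate heuristics.

```python
# Standard IUPAC ambiguity codes for mixtures of two nucleotides.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def call_position(intensities, secondary_ratio=0.3):
    """Call one base position from the four channel intensities.

    intensities: dict mapping 'A', 'C', 'G', 'T' to the peak intensity
        observed at this position (all four keys are expected).
    secondary_ratio: hypothetical threshold; if the second-highest peak
        reaches this fraction of the highest peak, the position is
        reported as a two-base mixture.
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (top_base, top_val), (second_base, second_val) = ranked[0], ranked[1]
    if top_val <= 0:
        return "N"  # no usable signal in any channel
    if second_val / top_val >= secondary_ratio:
        return IUPAC_TWO_BASE[frozenset((top_base, second_base))]
    return top_base


if __name__ == "__main__":
    clean = {"A": 1000, "C": 20, "G": 30, "T": 10}
    mixed = {"A": 600, "C": 10, "G": 500, "T": 5}
    print(call_position(clean), call_position(mixed))
```

A single fixed ratio cannot separate a true heterozygous position from artifacts such as dye blobs or polymerase slippage, which is why the tools compared here add quality assessment on top of the raw intensities.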

2. Materials

We compare publicly available in silico tools used for DNA variant discovery from sequencing data. Table 13.1 presents more information about where they can be obtained and their primary references. Commercial tools (i.e., Mutation Surveyor (see Chapter 14) and Sequencher) are widely used and their properties are compared, where possible, with the freely available tools below.

Table 13.1 Selected list of DNA variant discovery tools, including primary references and the links for software download

Tool                 Download link                                                    Reference
PolyBayes            http://bioinformatics.bc.edu/marthlab/Software_Release           (3)
Genalys              http://software.cng.fr/                                          (4)
SNPDetector          http://lpg.nci.nih.gov/                                          (5)
novoSNP              http://www.molgen.ua.ac.be/bioinfo/novosnp/                      (6)
InSNPs               http://www.mucosa.de/insnp/                                      (7)
SeqDoC               http://research.imb.uq.edu.au/seqdoc/                            (8)
PolyPhred            http://droog.gs.washington.edu/polyphred/                        (9)
AutoCSA              http://www.sanger.ac.uk/genetics/CGP/Software/AutoCSA/           (10)
PolyScan             http://genome.wustl.edu/tools/genome_center_software/polyscan    (11)
VarDetect            http://www4a.biotec.or.th/GI/tools/vardetect                     (12)
PineSAP              http://dendrome.ucdavis.edu/adept2/pinesap.html                  (13)
Mutation Surveyor®   http://www.softgenetics.com/mutationSurveyor.html
Sequencher®          http://www.genecodes.com/


3. Methods

3.1. Standard Procedure for Variant Detection

The discovery of DNA sequence variants comprises a number of common steps which can be broadly separated into two parts: production of raw sequencing data and identification of DNA variants. To obtain the sequence data, DNA samples are collected from the designated cohort of individuals, target regions are amplified by PCR, and the amplicons are usually sequenced at a DNA sequencing facility. Sequencing the same region on both strands is also standard, but not always performed. For large projects, there is a trade-off between the greater accuracy of bidirectional sequencing and the lower cost of single-pass sequencing.

Once the raw sequencing data are obtained, DNA variants are identified through a standard procedure. First, the base-calling process generates sample nucleotide sequences. The chromatogram signal can be affected by several factors, such as the sensitivity of the allele detection method and the quality of DNA samples, and is frequently found to be ambiguous. Therefore, quality validation is regularly integrated into the base-calling process, in which a quality score is calculated for each base called. Most tools, including the commercial ones, use the well-known Phred quality score in the base-calling process (9). To assure the accuracy of sequences included for further analysis, low-quality base calls are identified and excluded. Low-quality calls predominate at both ends of the sequence read, which are trimmed to leave a defined length of high-quality base calls (see Note 1). Most tools trim the sequences automatically, with the trimming controlled by user-defined thresholds based on Phred scores. Commercial tools offer more options for trimming (see Note 1): the trimming boundaries can be visualized, and common artifacts can be automatically removed, e.g., primer/dye blob removal in Sequencher.

The next step is sequence alignment and comparison of sample reads against the reference sequence.
The commercial tools have a chief advantage over the academic tools for this part of the process, since they can automatically perform the contiguous sequence (contig) assembly by aligning reads from both forward and reverse orientations simultaneously. Most academic tools do not perform contig assembly and rely on other tools to perform this task. Hence, they are less convenient to use, especially for large projects. The base calls from sample-generated sequences that do not match with the reference sequence are highlighted as putative DNA variants. The commercial tools have built-in patented variant detection algorithms (e.g., Mutation Surveyor’s anti-correlation technology) which automatically flag the variants, without the need for user intervention to assess the confidence of prediction.
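The quality-based trimming step described above can be sketched in a few lines of Python. The Phred quality score Q is related to the estimated base-call error probability P by Q = -10 log10(P), so a threshold of Q = 20 corresponds to an error probability of 1 in 100. The trimming rule below (clip leading and trailing bases whose score falls below the threshold) is a deliberately simple assumption made for illustration; the function names are hypothetical, and actual tools may use windowed or probabilistic criteria instead.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


def trim_ends(qualities, threshold=20):
    """Return (start, end) slice indices of the retained read region.

    Leading and trailing bases with quality below `threshold` are
    clipped; low-quality bases inside the retained span are kept.
    An all-low-quality read yields an empty span (start == end).
    """
    start = 0
    while start < len(qualities) and qualities[start] < threshold:
        start += 1
    end = len(qualities)
    while end > start and qualities[end - 1] < threshold:
        end -= 1
    return start, end


if __name__ == "__main__":
    quals = [5, 8, 25, 40, 38, 12, 30, 9, 4]
    start, end = trim_ends(quals, threshold=20)
    print(start, end, quals[start:end])
    print(phred_to_error_prob(20), phred_to_error_prob(30))
```

Note that the internal base with score 12 is retained by this rule; whether such positions are masked or examined as candidate variants depends on the tool.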


In the past few years, several computational algorithms have been developed to accelerate SNP discovery by increasing the efficiency and accuracy of raw sequence data analysis. Although these programs share the aforementioned standard analysis procedure, they come with various features and parameters which users can choose to match their specific needs and experimental design. The factors that users should take into consideration when selecting an in silico tool include pooling of samples, accuracy, detection of Indels, the reference sequence, database cross-checking, and reporting.

3.2. Pooling of Samples

Depending on the cohort sample size, researchers should decide whether to pool DNA samples or not. DNA pooling is a way to reduce sequencing costs when a large number of individuals and SNPs are analyzed. In this approach, equal amounts of genomic DNA from individual samples are pooled together prior to the generation of PCR amplicons. The greater potential for ambiguous signals in pooled DNA samples means that the power to detect variants is lower, particularly for rare variants. Furthermore, allele frequency cannot be calculated except when samples are pooled into pairs. The allele frequency of pooled pairs can be estimated by quantification of peak heights in the sequence trace. Reading signals from pooled pairs generates five distinct combinations of chromatogram signals. For a given SNP locus, the five outcomes are as follows: (1) both samples are homozygous wild type; (2) both samples are homozygous variant type; (3) one sample is homozygous wild type and the other is homozygous variant type; (4) one sample is homozygous wild type and the other is heterozygous; and (5) one sample is homozygous variant type and the other is heterozygous. Most SNP detection tools are designed for analysis of non-pooled samples; however, a few programs such as Genalys (4) and VarDetect (12) can analyze the input trace sequences obtained from pooled DNA. The commercial tool Mutation Surveyor offers additional features for analyzing pooled samples that are not provided by other tools, e.g., somatic mutation detection, mutation quantification, and methylation analysis. In these examples, the pooled samples are typically pooled tissues or cell types from one individual in which accurate quantification of variant frequencies is the focus.
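The expected signal for each pooled-pair outcome can be worked out with a short Python sketch. It assumes an idealized model in which both samples contribute equal amounts of DNA and peak height is proportional to allele dosage; the genotype labels (`WW`, `WV`, `VV`) and the function name are choices made for this example. Under these assumptions, two diploid samples carry four allele copies, so the expected variant fraction can only take the values 0, 1/4, 1/2, 3/4, and 1.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

# Number of variant alleles carried by each diploid genotype
# (W = wild-type allele, V = variant allele).
GENOTYPES = {"WW": 0, "WV": 1, "VV": 2}


def pooled_pair_signals():
    """Group genotype pairs by their expected variant-allele fraction.

    Idealized assumption: equal DNA input from both samples and peak
    height proportional to the number of allele copies (4 in total).
    """
    signals = {}
    for g1, g2 in combinations_with_replacement(GENOTYPES, 2):
        fraction = Fraction(GENOTYPES[g1] + GENOTYPES[g2], 4)
        signals.setdefault(fraction, []).append((g1, g2))
    return signals


if __name__ == "__main__":
    for fraction, pairs in sorted(pooled_pair_signals().items()):
        print(f"variant fraction {fraction}: {pairs}")
```

In this model a pair of two heterozygotes, which is not among the five listed outcomes, gives the same expected fraction (1/2) as outcome (3), so the two cannot be told apart from peak heights alone.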

3.3. Base-Calling Accuracy

To guarantee the success of SNP discovery, the critical part is to obtain accurate sequence information. The base-called sequence data are scrutinized to remove data of unacceptable quality. The accuracy of base calling is usually calculated from several parameters including peak spacing and uncalled/called peak resolution. Ideally, the chromatogram trace should have well-defined, evenly spaced peaks and minimum noise. To correctly detect variants, one must be able to distinguish sequencing artifacts from the true variant signals.
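One way to quantify how evenly spaced the peaks are is to compare the largest and smallest peak-to-peak distances in a small window around each base. The Python sketch below is illustrative only: the function name, the window size of seven peaks, and the use of a simple max/min ratio are assumptions for this example and should not be read as the implementation used by any particular base caller.

```python
def spacing_ratio(peak_positions, center, window=7):
    """Ratio of largest to smallest peak spacing around one peak.

    peak_positions: increasing trace sample indices of the called peaks.
    center: index (into peak_positions) of the peak being assessed.
    window: number of peaks considered (hypothetical default of 7).

    A value of 1.0 means perfectly even spacing; larger values
    indicate compressed or stretched regions of the trace.
    """
    half = window // 2
    lo = max(0, center - half)
    hi = min(len(peak_positions), center + half + 1)
    local = peak_positions[lo:hi]
    spacings = [b - a for a, b in zip(local, local[1:])]
    if not spacings or min(spacings) <= 0:
        raise ValueError("need at least two strictly increasing peaks")
    return max(spacings) / min(spacings)


if __name__ == "__main__":
    even = list(range(0, 120, 12))
    uneven = [0, 12, 24, 30, 48, 60, 72, 84]
    print(spacing_ratio(even, 5), spacing_ratio(uneven, 3))
```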


Most SNP discovery tools incorporate a base-calling algorithm into their framework for identifying nucleotide sequence from raw chromatogram data. Phred (14, 15) is the most widely used base-calling program, which is also frequently incorporated into SNP discovery tools such as PolyBayes (3), SNPDetector (5), novoSNP (6), PolyPhred (9, 14), and PineSAP (13). Phred reads trace sequence files and assigns a base-specific quality score, which is a log-transformed error probability. The error probability is calculated by examining the peaks around each base call. Average peak spacing (base-calling bin adjustment) and peak height are common features that are used by Genalys, PolyScan (11), and VarDetect for variant detection. Heuristics are employed by these tools to differentiate variants from sequence artifact mixed peak signals. Furthermore, instead of using Phred quality, these tools (except PolyScan) introduce a quality estimation scheme to be used along with their variant detection heuristics. Genalys strives to generalize SNP variant bases by taking into account the average nucleotide peak height from all samples and the influence of the preceding base on peak height. To detect a variant, the observed peak height is compared with the average peak height of the previous three nucleotides (of the same type). SNPs are called when the peak height drops significantly. VarDetect does not make use of the peak height information but rather focuses on what the peak shape of a true variant should look like. It also focuses on adjusting the interval in which a base call is to be made. Slippage can increase nucleotide signal at the base-call position, thus leading to potential ambiguity in the peak signal. This artifact is automatically detected and disregarded as a variant by VarDetect. 
This strategy enables VarDetect to properly adjust base-call spacing (bin adjustment) at positions that standard base-calling algorithms may have marked as unreadable, i.e., reported as “N”. While the majority of tools employ a modified base-calling procedure accompanied by a quality score assessment, a few tools such as InSNPs (7) and SeqDoC (8) do not include this module. InSNPs uses the automatic sequencer base-call results and prompts the user to identify SNPs from a candidate list. SeqDoC does not make base calls but rather highlights putative variants by direct comparison of the chromatogram traces between sample and designated reference data.

3.4. Detecting Indel Variations

Insertions or deletions (Indels) are common variations, although it is not yet clear how common, since methods to detect them are not as accurate as for SNPs. A heterozygous sample with an Indel variant generates a mixed-trace chromatogram pattern immediately 3′ to the Indel (see Note 2). It is difficult to distinguish this pattern from sequence artifact, in particular low-quality read regions at the end of the trace. Indels can be detected reliably by allele-specific amplification, but this solution is expensive and not practical. Computational approaches have been introduced for discovery of Indels from mixed-trace patterns. To identify Indel variants from the mixed trace, the trace corresponding to the reference sequence is subtracted from the continuous mixed trace. The commercial tool CodonCode Aligner uses this approach, whereas the commercial Mutation Surveyor detects Indels using the patented anti-correlation technology. Similar reference subtraction approaches have been adopted by academic variant detection tools including PolyPhred, STADEN (16), novoSNP, InSNPs (7), PolyScan, and AutoCSA (10). The accuracy of the sequence subtraction heuristic relies heavily on the reference sequence used. The reference sequence is a consensus of several sequences, which may not be representative of the cohort under investigation. If the reference sequence differs from both alleles, reconstruction (extraction) of the mixed sequence is not possible. Newer in silico tools try to reconstruct continuous mixed sequences without using a reference sequence but rather perform the extraction directly from the mixed traces. These tools include ShiftDetector (17), the newer version of CodonCode Aligner, and Indelligent (18).

3.5. The Reference Sequence (RefSeq)

The detection of DNA sequence variation relies on sequence alignment and identification of base differences from a reference. Poor initial alignments can greatly increase the error rate of DNA variant prediction. Therefore, local alignment methods, e.g. Smith–Waterman algorithm (19) or BLAST (20), are used for this task because they avoid misalignment due to low quality of some sequence regions. Sample sequences are aligned with the genomic reference sequence, which is obtained from a public database for well-annotated genomes. Most variant detection tools require the existing genome reference sequence for identifying putative DNA variants, for example, PolyBayes, SNPDetector, novoSNP, InSNPs, PolyPhred, AutoCSA, PolyScan, VarDetect, and PineSAP. The advantage of using the genomic reference is that the homozygous variant form can be detected. However, the drawback of using the genomic reference is that it may not be representative of the population under investigation. Some nucleotide positions containing the same base in all cohort individuals may be misinterpreted as DNA variants in comparison with the genomic reference. In this case, the nucleotide is not a variant for the sample population. Instead of using a reference, a few tools, e.g. SeqDoC, avoid this problem by automatically selecting a chromatogram trace from the cohort to be the reference. The commercial Mutation Surveyor tool has a special reference sequence feature, in which a synthetic trace is generated from the nucleotide sequence using a proprietary algorithm. This synthetic reference trace is used for quantification of variant frequency, a feature not offered by other tools.
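To make the role of local alignment concrete, the following Python sketch implements a basic Smith–Waterman alignment (19) with a linear gap penalty. The scoring values (match +2, mismatch -1, gap -2) and the function name are assumptions chosen for this example; production tools typically use affine gap penalties, quality-aware scoring, and far more efficient implementations.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of strings a and b with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b), where the aligned
    strings cover only the best-scoring local region.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cell values never drop below zero,
    # which is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    read = "NNNNACGTTGCA"       # read with an unreadable 5' end
    reference = "TTACGTTGCATT"  # toy reference fragment
    print(smith_waterman(read, reference))
```

In the usage example the unreadable start of the read is simply left out of the alignment, which illustrates why local methods are less sensitive to low-quality read ends than global ones.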

Table 13.2 The feature comparison of DNA variant detection tools (x, not available; /, available)

Tools compared (columns): PolyBayes (1999), Genalys (2002), SNPDetector, novoSNP, and InSNPs (2005), SeqDoC, PolyPhred (2006), AutoCSA (2007), PolyScan, VarDetect (2008), and PineSAP (2009).

Features compared (rows):
Sample DNA: overlap fragment (batch); two-pooled DNA; make use of bidirectional trace
Base calling: algorithm; peak correction
SNP identification: require RefSeq; algorithm; quality score
Indel identification: Indel algorithm; require RefSeq
Data reporting: allele calculation; automated sequence annotation; data editing
Usage and platform: operation system; easy installation; graphic interface; command line


3.6. Database Cross-checking

A number of free public archives have been established to deposit genetic variation data, e.g., dbSNP and Database of Genomic Variants (DGV). If several variants are discovered, it can be laborious to manually cross-check the databases to determine if variants are novel. This cross-checking procedure is facilitated by tools such as SNP BLAST (see Note 3). The VarDetect tool is linked to the ThaiSNP database, allowing download of the SNP-annotated genomic reference sequence. Users can then visualize the positions of putative SNPs from their data and compare them with the known SNPs of the reference sequence.
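Preparing the input for such a flanking-sequence search is a simple string operation. The Python sketch below assembles a query from the XRCC5 example given in Note 3 (a C/G SNP with 40-base flanking sequences on each side) and writes it in FASTA format. The function names and the FASTA record name are hypothetical, and the sketch only builds the query text; it does not contact any NCBI service.

```python
VALID_BASES = set("ACGT")

# Flanking sequences of the XRCC5 SNP (C/G) used as the example in Note 3.
FLANK5 = "GAAGGGCACTCAGGCAAGTACTTTAAGTCATCACATAGTT"
FLANK3 = "AGTGTCCACAATTTCCAGCACGGTGGACTTCATTGGAAAG"


def build_snp_query(flank5, allele, flank3):
    """Join 5' flank, one SNP allele, and 3' flank into a query sequence."""
    if len(allele) != 1 or allele not in VALID_BASES:
        raise ValueError("allele must be a single A, C, G, or T")
    for flank in (flank5, flank3):
        if not set(flank) <= VALID_BASES:
            raise ValueError("flanking sequences must contain only A, C, G, T")
    return flank5 + allele + flank3


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed line width."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines)


if __name__ == "__main__":
    query = build_snp_query(FLANK5, "C", FLANK3)
    print(to_fasta("XRCC5_SNP_alleleC", query))
```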

3.7. Reporting

All genetic variation detection tools provide reports of putative SNPs and Indels, although the information shown varies. The commercial tools have a great advantage over the academic tools since they have graphical cross-linked displays, allowing intuitive navigation through the whole analyzed project dataset. A number of programs allow data editing, since automatic procedures may fail to detect some ambiguous signals as variants or conversely may report false positives. In addition to the commercial tools, some academic tools offer data editing, namely SNPDetector, novoSNP, InSNPs, PolyPhred, AutoCSA, PolyScan, and VarDetect. The features of the different academic variant detection programs are compared in Table 13.2.

4. Notes

Detecting sequence variants using in silico tools is quite straightforward. On the other hand, the accuracy of detection is dependent on several factors, such as the quality of input data, the nature of the variant being detected, and the algorithm used to detect the variant. Finally, putative variants must be cross-checked to determine if they are novel.

1. Assessing chromatogram patterns and trimming reads

High-quality sequencing data are obviously essential for variant detection. It is recommended to divide the region of interest into fragments of 500–800 bp with at least 30 bases of overlap. Overlap is needed to overcome the problem of low-quality signals at the beginning and the end of the trace. Sequencing both strands (bidirectional traces) is also preferred for most in silico tools to minimize the number of false positives. Once the raw data have been collected, the first step is to assess the overall quality of each trace. Shown below are some common chromatogram patterns that the user should be able to recognize from their data. The pattern also guides the choice of an in silico tool


to be used for variant detection. External chromatogram trace viewer programs, such as Phred, 4Peaks (21), BioEdit (22), and FinchTV (23), are excellent tools for assessing raw trace data. Commercial tools have built-in raw data visualization interfaces, which link directly to the variant detection modules. Furthermore, although these commercial tools cannot be used for the entire variant discovery process without payment, they do offer free trial evaluation. With this option, their visualization tools can be used to assist the raw data processing for variant discovery using another academic tool, e.g., for validation of variant prediction. Raw data viewers can also generate reverse complement patterns, which are very useful for assessing bidirectional data. The first trace example shown below is typical, in which well-resolved peaks are observed throughout the majority of the trace and most automatic base calls have high scores (Fig. 13.1a). From this type of data, SNP variants can be detected. In the second trace example, the read length of automatic base calls with high scores is truncated prematurely at the 3′-end (Fig. 13.1b, c). In this type of data, Indel variants may exist. However, if the peaks are uniformly low in height and the automatic base calls have low scores

Fig. 13.1. Examples of DNA sequence trace chromatograms with different patterns. (a) High-quality peaks throughout the trace; (b) low-quality base calls at the 3′-end; (c) opposite strand read of the template in (b), also showing low-quality base calls at the 3′-end. The Phred quality of each base is represented by the blue line. The trimming areas are represented by the red shaded boxes.


throughout, the data are probably unacceptable for variant detection and should be discarded. In silico tools for SNP detection mask or trim the ends of the data before performing SNP detection. A threshold quality score is chosen for trimming. A default score is incorporated into each tool, removing the need for the user to select a score. However, the default score may not be suitable for every experiment; hence, a better way is to derive an appropriate cutoff from the trace data. By viewing input sequences using sequence viewer programs, users can estimate the threshold to be used in variant detection. Figure 13.1 shows three sample sequences of the ESR (estrogen receptor) gene from 400-bp amplicons, in which the 4Peaks program was used to visualize the traces. The first trace has an average Phred quality of 54.4% (Fig. 13.1a). 4Peaks allows us to visualize the trimming boundaries, which vary according to the trimming threshold (set to 20% in all sample traces and shown by the red horizontal line). Close attention should be paid to the traces in Fig. 13.1b, c. The overall quality drops to 13.3%. The trimmed region at the 3′-end appears to be much larger than that at the 5′-end. On closer inspection, the trace immediately after the 5′ trimming box has a short stretch with Phred scores well above the 20% threshold (bases 21–58). Immediately 3′ of this region, the Phred scores are below the threshold and a mixed-trace pattern is apparent. An Indel variant may exist, accounting for this pattern. If the bidirectional

Fig. 13.2. The output of the SeqDoC variant detection tool for three sample input sequences: two sequences showed mixed-trace patterns (putative Indel), and a third with a normal trace pattern was designated as the reference sequence. The two pairwise alignments of the putative Indel-variant sequences with the reference are shown. Each alignment pane is structured in three windows, where the top and the bottom windows present the input traces. The middle window reveals the trace subtraction result.


Fig. 13.3. Similarity search for known SNPs deposited in the SNP database using the SNP BLAST tool. (a) Snapshot of the SNP BLAST main page. SNP flanking sequences are requested as input. (b) The output of SNP BLAST showing the list of SNP rs IDs with high-scoring matches to the input.


sequence of this trace is available (as is the case shown in Fig. 13.1c), the same mixed-trace pattern is observed on the other strand.

2. Indel detection from mixed-trace patterns

If an Indel variant is suspected from the characteristic mixed-trace patterns as described in Fig. 13.1b, c, Indel detection tools can be used to test the Indel-variant hypothesis. In this example, the Web-based tool SeqDoC was employed and the result is shown in Fig. 13.2. From position 57 onward, the subtraction extracts the mixed trace of the two overlapping traces, whose intensities mirror each other. This result is highly suggestive of a single base deletion at position 56. Furthermore, an SNP at position 38 was also detected.

3. Cross-checking against known variants

SNP detection tools report putative SNPs in their genomic sequence context by showing the SNP and flanking sequence. Currently, no tool can automatically cross-check against SNP databases to determine if discovered variants are novel. This cross-checking process is laborious, since users must search multiple database Web sites. To minimize this task, NCBI has provided a Web application, called SNP BLAST, which allows users to input SNP flanking sequences and visualize the locations of these SNPs on the NCBI Web site. This tool can be accessed at http://www.ncbi.nlm.nih.gov/projects/SNP/snp_blastByOrg.cgi, which allows users to BLAST their SNP flanking sequences against different organisms. If the target genome is human, one can use the direct link to BLAST human chromosomes (http://www.ncbi.nlm.nih.gov/SNP/snpblastByChr.html). Figure 13.3a shows the Web interface of the SNP BLAST tool. In this example, we want to locate the SNP (C/G) with the flanking sequences GAAGGGCACTCAGGCAAGTACTTTAAGTCATCACATAGTT and AGTGTCCACAATTTCCAGCACGGTGGACTTCATTGGAAAG on gene XRCC5. A sequence comprising the SNP allele C with its 5′ and 3′ flanking sequences is used as input. The BLAST result is displayed in Fig. 13.3b.
rs3815855 is identified as a known SNP in the query sequence, as can be seen from the alignment with the SNP position marked.

References

1. Sanger, F., Nicklen, S., Coulson, A. R. (1992) DNA sequencing with chain-terminating inhibitors. 1977, Biotechnology 24, 104–108.
2. MacBeath, J. R., Harvey, S. S., Oldroyd, N. J. (2001) Automated fluorescent DNA sequencing on the ABI PRISM 377, Methods Mol Biol 167, 119–152.
3. Marth, G. T., Korf, I., Yandell, M. D., Yeh, R. T., Gu, Z., Zakeri, H., Stitziel, N. O., Hillier, L., Kwok, P. Y., Gish, W. R. (1999) A general approach to single-nucleotide polymorphism discovery, Nat Genet 23, 452–456.
4. Takahashi, M., Matsuda, F., Margetic, N., Lathrop, M. (2003) Automated identification of single nucleotide polymorphisms from sequencing data, J Bioinform Comput Biol 1, 253–265.
5. Zhang, J., Wheeler, D. A., Yakub, I., Wei, S., Sood, R., Rowe, W., Liu, P. P., Gibbs, R. A., Buetow, K. H. (2005) SNPdetector: a software tool for sensitive and accurate SNP detection, PLoS Comput Biol 1, e53.
6. Weckx, S., Del-Favero, J., Rademakers, R., Claes, L., Cruts, M., De Jonghe, P., Van Broeckhoven, C., De Rijk, P. (2005) novoSNP, a novel computational tool for sequence variation discovery, Genome Res 15, 436–442.
7. Manaster, C., Zheng, W., Teuber, M., Wachter, S., Doring, F., Schreiber, S., Hampe, J. (2005) InSNP: a tool for automated detection and visualization of SNPs and InDels, Hum Mutat 26, 11–19.
8. Crowe, M. L. (2005) SeqDoC: rapid SNP and mutation detection by direct comparison of DNA sequence chromatograms, BMC Bioinformatics 6, 133.
9. Ewing, B., Green, P. (1998) Base-calling of automated sequencer traces using Phred. II. Error probabilities, Genome Res 8, 186–194.
10. Dicks, E., Teague, J. W., Stephens, P., Raine, K., Yates, A., Mattocks, C., Tarpey, P., Butler, A., Menzies, A., Richardson, D., Jenkinson, A., Davies, H., Edkins, S., Forbes, S., Gray, K., Greenman, C., Shepherd, R., Stratton, M. R., Futreal, P. A., Wooster, R. (2007) AutoCSA, an algorithm for high throughput DNA sequence variant detection in cancer genomes, Bioinformatics 23, 1689–1691.
11. Chen, K., McLellan, M. D., Ding, L., Wendl, M. C., Kasai, Y., Wilson, R. K., Mardis, E. R. (2007) PolyScan: an automatic indel and SNP detection approach to the analysis of human resequencing data, Genome Res 17, 659–666.
12. Ngamphiw, C., Kulawonganunchai, S., Assawamakin, A., Jenwitheesuk, E., Tongsima, S. (2008) VarDetect: a nucleotide sequence variation exploratory tool, BMC Bioinformatics 9 Suppl 12, S9.
13. Wegrzyn, J. L., Lee, J. M., Liechty, J., Neale, D. B. (2009) PineSAP – sequence alignment and SNP identification pipeline, Bioinformatics 25, 2609–2610.
14. Bhangale, T. R., Stephens, M., Nickerson, D. A. (2006) Automating resequencing-based detection of insertion-deletion polymorphisms, Nat Genet 38, 1457–1462.
15. Nickerson, D. A., Tobe, V. O., Taylor, S. L. (1997) PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing, Nucleic Acids Res 25, 2745–2751.
16. Staden, R. (1996) The Staden sequence analysis package, Mol Biotechnol 5, 233–241.
17. Seroussi, E., Ron, M., Kedra, D. (2002) ShiftDetector: detection of shift mutations, Bioinformatics 18, 1137–1138.
18. Dmitriev, D. A., Rakitov, R. A. (2008) Decoding of superimposed traces produced by direct sequencing of heterozygous indels, PLoS Comput Biol 4, e1000113.
19. Smith, T. F., Waterman, M. S. (1981) Identification of common molecular subsequences, J Mol Biol 147, 195–197.
20. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J. (1990) Basic local alignment search tool, J Mol Biol 215, 403–410.
21. http://mekentosj.com/science/4peaks/
22. http://www.mbio.ncsu.edu/bioedit/bioedit.html
23. http://www.geospiza.com/Products/finchtv.shtml
