ZTR: a new format for DNA sequence trace data


BIOINFORMATICS, Vol. 18 no. 1 2002, Pages 3–10

James K. Bonfield* and Rodger Staden
MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK
Received on July 2, 2001; revised on September 11, 2001; accepted on September 12, 2001
*To whom correspondence should be addressed.
© Oxford University Press 2002

ABSTRACT
Motivation: To produce an open and extensible file format for DNA trace data which produces compact files suitable for large-scale storage and efficient use of internet bandwidth.
Results: We have created an extensible format named ZTR. For a set of data taken from an ABI-3700, the ZTR format produces trace files which require 61.6% of the disk space used by gzipped SCFv3, and which can be written and read at greater speed. The compression algorithms used for the trace amplitudes are used within the National Center for Biotechnology Information (NCBI) trace archive.
Availability: Source code is available from ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/io_lib/io_lib.tar.gz. A complete format description can be found at http://www.mrc-lmb.cam.ac.uk/pubseq/ztr.html. Test data is available from ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/io_lib/test_data.
Contact: [email protected]

INTRODUCTION
The genome projects performed to date are just a beginning, and as DNA sequencing is increasingly being used for new scientific, medical and forensic purposes, the trace data accumulated so far represent only a tiny fraction of the future storage requirement. Major centres such as the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) are collecting the trace data files from genome projects and making them available via the internet (http://www.ncbi.nlm.nih.gov/Traces/, http://trace.ensembl.org/). It is important that the storage and transfer of trace data are efficient and that the format used is easily adaptable. To illustrate the size of the storage problem, consider a single human genome project. Suppose we aim at 5-fold coverage, that we use sequencing instruments which generate Applied Biosystems' ABI format trace files (typical file size 190 kb), and that we get on average 500 reliable bases per reading. Then we would require around 30 million traces, equating to 5700 Gb of storage. This is for one individual of one species. At the time of writing (June 2001) the NCBI trace archive also contains 23 million traces for other species.

In 1991 our group introduced the SCF format (Dear and Staden, 1992). The major motivations then were: (1) sequencing machine independence; (2) operating system independence; (3) an open and public format with sources available to all; (4) small file size; (5) to introduce the idea of base call confidence values and encourage their use. This format is now the most widely used and the sources are available via ftp. During the intervening years we have produced two major revisions. The current one (SCFv3) includes the use of a finite differences function plus a reorganization of the order of the file contents to facilitate the efficient use of standard compression programs.

The content of trace files
The minimum information needed in a DNA sequence trace file is listed below, with each item's percentage of the overall uncompressed SCF file size: (1) the base calls (1%); (2) the base confidence values (4%); (3) the trace amplitudes for each of the four base types (88%); (4) the offsets of the base calls relative to the trace coordinates (represented by the element numbers of the trace values) (4%); (5) various textual comments (sample identifiers, run date, etc.) (0.2%). The remaining ∼2.8% is in unused (marked as 'spare') fields. Typically ABI files will also contain additional textual data plus arrays of current, voltage and temperature readings and the unprocessed trace amplitudes.

A list of data items that today's users may want to store in the files is specified at the NCBI trace repository (Gorrell et al., 2000), but we do not know what may be required in the future. For example, we originally suggested that it would be useful to store a confidence value for each of the four base types at each position in the sequence, and provided four such slots in the SCF format. At present many use a single value, that for the called base only. However, at least one group has a trace analysis program (ATQA, 1998) which, in addition to an overall confidence for each base call, can also calculate the probability of insertions and deletions at each base position. Fortunately these useful values can be stored in the spare confidence value slots in the SCF file, but the format we propose in this paper can readily incorporate new data types such as these.

Table 1. Gzip compression ratios on a selection of file types

  Format   Original size   Gzipped size   Fraction
  ABI        18 158 424      8 427 773     0.464
  SCFv2       7 887 845      3 881 662     0.492
  SCFv3       7 887 845      2 396 562     0.304

Space saving methods
Lossless file compression tools save disk space, not by deleting information, but by analyzing the data to repack the information using fewer bytes. One of the most commonly used tools for this is gzip. We determined that storing the data within an SCF file in a different order can significantly improve the performance of gzip. Also the trace amplitudes are not ideally suited to compression by gzip, but storing the differences between one value and the next reduces the signal variability. This finite differences technique can be applied up to three times before compression ratios start to suffer. These ideas were used to form a new revision of SCF, version 3. Table 1 demonstrates the compression ratios of gzip on a set of ABI-3700 files in ABI, SCF version 2 and SCF version 3 formats.

More recently, Jean Thierry-Mieg at the NCBI produced a new trace format named CTF (unpublished) which compresses better than SCFv3. The CTF format specifies its own compression algorithms. A raw CTF file has a similar size to gzipped SCFv3, but is substantially faster to read back. Furthermore CTF files can still be compressed further by using external programs, such as gzip, hence giving additional space savings.

Despite the size reduction of SCFv3 and CTF we recently felt that another completely new file structure would both enable us to make the files even smaller and produce an extensible format which would facilitate its use for novel future applications of DNA sequencing. We have named this new binary format ZTR.

SYSTEM AND METHODS
The design of ZTR builds on this previous work and borrows ideas from the PNG format (Boutell et al., 1997), the public successor to the GIF image format. We wanted to reduce data size further, and also to incorporate additional textual information. The key design principles are: (1) Extensibility: we cannot easily foresee what future data may need to be stored within a trace file, so we need a mechanism for incorporating new information in a way which does not invalidate the file format. (2) Small: a small size not only saves disk space, but also reduces network usage and download times. Rather than have a format which requires the use of external compression tools, we would like the format to specify its own compression methods. (3) Fast: ZTR file access should not be substantially slower than existing SCF implementations. Given that gzipped SCF files are the norm, we considered this to be our target speed for both reading and writing. (4) Public: both the specification and the source code for an example implementation should be freely available to both academic and commercial users.

Extensibility
The basic structure of a ZTR file is a header indicating the file format, followed by zero or more data blocks. In ZTR we call these blocks 'chunks'. The use of separate chunks for each data type means that new data types can be added without changing the basic file structure and, conversely, that chunks may be omitted. ZTR readers should ignore chunks of unknown type, so new files remain backwards compatible. These features contribute to the extensibility of the ZTR format. The header structure is, in hex bytes:

  8-byte magic number:    AE 5A 54 52 0D 0A 1A 0A
  Format version, major:  01
  Format version, minor:  02

The magic number includes a byte with the top bit set (AE), a control-Z character (1A, used to indicate end-of-file under DOS), both bare newline (0A) and carriage-return newline (0D 0A) combinations, and the text 'ZTR' (5A 54 52). Its purpose is to act as an immediate check for the more troublesome aspects of file reading and data transfer and so aid the detection of corrupt files. For example, using ftp to transfer a ZTR file in ASCII mode may convert between the newline and carriage-return newline conventions. Such files will not have a valid ZTR header, so rather than return a corrupted file the reading code will return an error.
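A minimal sketch of how a reader might validate this header (our own illustrative code, not the io_lib implementation):

    #include <stdio.h>
    #include <string.h>

    /* The fixed ZTR header: 8-byte magic number plus major/minor version. */
    static const unsigned char ZTR_MAGIC[8] =
        { 0xAE, 0x5A, 0x54, 0x52, 0x0D, 0x0A, 0x1A, 0x0A };

    /* Returns 0 on success, -1 if the file is not a valid ZTR file
     * (e.g. it was corrupted by an ASCII-mode ftp transfer). */
    int ztr_read_header(FILE *fp, int *major, int *minor) {
        unsigned char hdr[10];
        if (fread(hdr, 1, 10, fp) != 10)    return -1;
        if (memcmp(hdr, ZTR_MAGIC, 8) != 0) return -1;
        *major = hdr[8];
        *minor = hdr[9];
        return 0;
    }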


Each chunk consists of a type, meta-data and data. The chunk data is the main information we wish to store. The meta-data, which is not needed for many chunk types, is typically a small amount of information about the data. For example, the chunk storing the digitized trace samples has the samples themselves in the data block and the name of the channel (A, C, G or T) in the meta-data block. All integer values are stored as 4-byte values in big-endian format (i.e. most significant byte first). The chunk structure is:

  4-byte chunk type:              XX XX XX XX
  Meta-data length (big endian):  XX XX XX XX
  Meta-data:                      (any number of bytes, up to 2^32)
  Data length (big endian):       XX XX XX XX
  Data:                           (any number of bytes, up to 2^32)

The format of the meta-data and data elements is chunk type dependent. The complete information may be found in the on-line ZTR format specification (http://www.mrc-lmb.cam.ac.uk/pubseq/ztr.html).
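The chunk layout can be read with a handful of byte-level operations; the following sketch is illustrative only and the structure and function names are our own, not io_lib's:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        char     type[5];   /* 4-character chunk type, NUL terminated */
        uint32_t mlength;   /* meta-data length in bytes              */
        uint8_t *mdata;     /* meta-data                              */
        uint32_t dlength;   /* data length in bytes                   */
        uint8_t *data;      /* (possibly compressed) chunk data       */
    } ztr_chunk;

    /* Read a 4-byte big-endian unsigned integer; return 0 on success. */
    static int read_be32(FILE *fp, uint32_t *val) {
        unsigned char b[4];
        if (fread(b, 1, 4, fp) != 4) return -1;
        *val = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
               ((uint32_t)b[2] <<  8) |  (uint32_t)b[3];
        return 0;
    }

    /* Read the next chunk; returns NULL at end of file or on error. */
    ztr_chunk *ztr_read_chunk(FILE *fp) {
        ztr_chunk *c = calloc(1, sizeof(*c));
        if (!c || fread(c->type, 1, 4, fp) != 4 ||
            read_be32(fp, &c->mlength) != 0) goto fail;
        if (!(c->mdata = malloc(c->mlength + 1)) ||
            fread(c->mdata, 1, c->mlength, fp) != c->mlength) goto fail;
        if (read_be32(fp, &c->dlength) != 0 ||
            !(c->data = malloc(c->dlength + 1)) ||
            fread(c->data, 1, c->dlength, fp) != c->dlength) goto fail;
        return c;
    fail:
        if (c) { free(c->mdata); free(c->data); free(c); }
        return NULL;
    }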

Chunk types
The chunk type may be considered to be a 4-character string. Bit 5 of the first character indicates whether this chunk type is part of the public ZTR specification (in which case bit 5 is clear) or whether it is a private extension (bit 5 is set). Bit 5 of the remaining three characters is reserved for future use and so currently should always be clear. Practically speaking this means that public chunk types consist entirely of uppercase letters and private chunk types start with a lowercase letter. This means that TEXT and tEXT are two completely independent chunk types and the similarity of their names does not imply a relationship between the format of their data. Also it is clear that private extensions will not clash with future public extensions. At present the publicly defined chunk types are:

SAMP. A single channel of trace samples, stored in 16-bit format.

SMP4. Four concatenated arrays of trace samples, storing the same information as four SAMP chunks for the A, C, G and T channels. Note that both SMP4 and SAMP chunks can be combined within the same file if desired. SMP4 typically gives compression ratios 4% smaller than four separate SAMP chunks, at a reduced CPU usage.

BASE. Base calls, encoded using the NC-IUB character set (NC-IUB, 1985).

BPOS. A mapping of base numbers to trace sample numbers, stored as an array of 32-bit integer values.

CNF1. The confidence values for the called base type. The scale must be -10 log10(P_error) expressed as an 8-bit integer; the same as used by the Phred (Ewing and Green, 1998), TraceTuner (http://www.paracel.com/tracetuner/), ATQA (1998) and Li-Cor base callers.

CNF4. The four confidence values stored in the same scale as CNF1, but with one value per base type. To aid compression, the confidence for all the called bases (which defaults to T if not A, C or G) is stored first, followed by the remaining confidence values for A, C, G and T.

CSID. The confidence values for substitution, insertion and deletion, stored in the -10 log10(P_error) scale. ATQA is one such program to produce these values.

CLIP. Poor quality clip points. Specified in base coordinates, this indicates where data (at both ends) should be considered as poor quality. This is included primarily for backwards compatibility with SCF; the CNF* chunk types provide more detailed information.

COMM. User defined text comments, in 8-bit ASCII.

TEXT. A series of identifier-value pairs stored as one or more sets of identifier, nul, value, nul, terminating in an additional (i.e. double) nul character. The identifiers are defined as part of the ZTR specification, but have been taken from the NCBI trace repository RFC version 1.17.

CR32. A 32-bit cyclic redundancy check (ANSI X3.66) value of all the data since the last CR32 chunk, including the ZTR header if appropriate.
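As a small illustration of the bit 5 rule (our own code, with a hypothetical function name):

    /* A chunk type is a private/local extension if bit 5 of its first
     * character is set, i.e. if that character is lowercase; public types
     * defined by the ZTR specification are entirely uppercase. */
    int ztr_type_is_private(const char type[4]) {
        return (type[0] & 0x20) != 0;   /* bit 5 set => private extension */
    }

    /* Example: "TEXT" is a public type; "tEXT" is an unrelated private one. */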

Compression
Each chunk data block is compressed using zero or more filtering and compression algorithms. The available algorithm choices are:

DELTA1, DELTA2, DELTA4. These apply the forward finite differences technique to 1-, 2- or 4-byte words. This replaces each word with the difference between itself and the previous word. It does not directly decrease the size of the data. Table 2 contains an example of DELTA1 filtering.
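As an illustration only (our own sketch, not the io_lib source), one level of the DELTA1 filter can be written as follows; DELTA2 and DELTA4 are the same idea applied to 2- and 4-byte big-endian words:

    #include <stddef.h>
    #include <stdint.h>

    /* One level of DELTA1: replace each byte with the difference from the
     * previous byte (modulo 256).  Applying it 'levels' times gives the
     * higher levels shown in Table 2. */
    void delta1(uint8_t *buf, size_t len, int levels) {
        while (levels-- > 0) {
            uint8_t prev = 0;
            for (size_t i = 0; i < len; i++) {
                uint8_t cur = buf[i];
                buf[i] = (uint8_t)(cur - prev);   /* forward difference */
                prev = cur;
            }
        }
    }

    /* The inverse (used when decoding) is a running sum:
     *     buf[i] += buf[i-1]  for i = 1..len-1, repeated 'levels' times. */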

16TO8, 32TO8. These attempt to store numerically small 16-bit and 32-bit integer values in a single 8-bit integer. Values in the range -127 to +127 are stored directly in 8 bits. For values outside this range we emit -128 followed by the actual 16-bit or 32-bit value. Table 3 contains an example of the 16TO8 filter type.
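An illustrative sketch of the rule just described (our code; the output layout of marker byte followed by the big-endian original value is an assumption based on the description above):

    #include <stddef.h>
    #include <stdint.h>

    /* Pack an array of signed 16-bit values into a 16TO8-style byte stream.
     * 'out' must have room for the worst case of 3 bytes per value.
     * Returns the number of bytes written.
     * e.g. 0x004B packs to 4B, while 0xFC22 packs to 80 FC 22. */
    size_t pack_16to8(const int16_t *in, size_t n, uint8_t *out) {
        size_t o = 0;
        for (size_t i = 0; i < n; i++) {
            if (in[i] >= -127 && in[i] <= 127) {
                out[o++] = (uint8_t)(int8_t)in[i];   /* fits in one byte   */
            } else {
                out[o++] = 0x80;                     /* escape marker -128 */
                out[o++] = (uint8_t)((uint16_t)in[i] >> 8);   /* big endian */
                out[o++] = (uint8_t)((uint16_t)in[i] & 0xFF);
            }
        }
        return o;
    }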

FOLLOW1. This analyzes the complete data block to determine, for each 8-bit value x, which other value most frequently follows it (denoted follow(x)). Then for each byte of data we store follow(previous byte) minus the current byte. To enable reversal of this function we also prepend the data block with the 256-byte follow table.
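A sketch of the FOLLOW1 filter as described above; this is our own illustrative code, and taking the 'previous byte' of the first byte to be zero is an assumption rather than something stated here:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* FOLLOW1 filter sketch.  'out' must have room for len + 256 bytes:
     * the 256-byte follow table plus the filtered data. */
    size_t follow1(const uint8_t *in, size_t len, uint8_t *out) {
        static uint32_t count[256][256];
        uint8_t follow[256];
        memset(count, 0, sizeof(count));

        /* Count, for every byte value, which value most often follows it. */
        for (size_t i = 0; i + 1 < len; i++)
            count[in[i]][in[i + 1]]++;
        for (int x = 0; x < 256; x++) {
            int best = 0;
            for (int y = 1; y < 256; y++)
                if (count[x][y] > count[x][best]) best = y;
            follow[x] = (uint8_t)best;
        }

        /* Prepend the table, then store follow(previous byte) - current byte. */
        memcpy(out, follow, 256);
        uint8_t prev = 0;
        for (size_t i = 0; i < len; i++) {
            out[256 + i] = (uint8_t)(follow[prev] - in[i]);
            prev = in[i];
        }
        return len + 256;
    }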


Table 2. Example of levels 1-3 of the DELTA1 filter

  Level   Data stream before and after the DELTA1 filters                        Entropy
  0       +4  +7 +12 +17 +24 +30 +36 +40 +43 +43 +40 +35 +28 +21 +14  +9          7.50
  1       +4  +3  +5  +5  +7  +6  +6  +4  +3  +0  -3  -5  -7  -7  -7  -5          6.16
  2       +4  -1  +2  +0  +2  -1  +0  -2  -1  -3  -3  -2  -2  +0  +0  +2          4.97
  3       +4  -5  +3  -2  +2  -3  +1  -2  +1  -2  +0  +1  +0  +2  +0  +2          5.62

Table 3. An example of 16TO8 of 5 big-endian 16-bit numbers

  Before (16-bit)   00 4B   00 55   FF EB   FC 22      00 80
  After  (8-bit)    4B      55      EB      80 FC 22   80 00 80

Table 4. An example of the RLE compression method, using 8 as the token

  Before   5  6  7  7  7  7  8
  After    5  6  8  4  7  8  0

Table 5. Summary of chunk types and the default filters and compression algorithms

  Chunk type        Filters/compressors (plus arguments)
  SAMP/SMP4         DELTA2 (×3 or ×2, depending on data range), 16TO8, FOLLOW1, RLE, ZLIB (Z_HUFFMAN_ONLY)
  BASE/TEXT/COMM    ZLIB (Z_HUFFMAN_ONLY)
  CNF1/CNF4/CSID    DELTA1 (×1), RLE, ZLIB (Z_HUFFMAN_ONLY)
  BPOS              DELTA4 (×1), 32TO8, ZLIB (Z_HUFFMAN_ONLY)

RLE. Run length encoding. If 4 or more identical 8-bit values are detected in a row then RLE replaces this data with a special token followed by the number of repeated bytes and the value. If the token itself occurs within the raw data then it is output followed by a zero. The token may be chosen to be a symbol with a low natural frequency. Table 4 contains an example of the RLE compression method.
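An illustrative encoder for this scheme (our own sketch; the single-byte representation of the run length is an assumption):

    #include <stddef.h>
    #include <stdint.h>

    /* RLE sketch.  Runs of 4 or more identical bytes become token,
     * run-length, value; a literal occurrence of the token is output as
     * token, 0.  'out' needs room for the worst case of 2*len bytes.
     * e.g. with token 8, the stream 5 6 7 7 7 7 8 encodes as 5 6 8 4 7 8 0. */
    size_t rle_encode(const uint8_t *in, size_t len, uint8_t *out, uint8_t token) {
        size_t o = 0;
        for (size_t i = 0; i < len; ) {
            size_t run = 1;
            while (i + run < len && in[i + run] == in[i] && run < 255)
                run++;
            if (run >= 4) {                 /* worth encoding as a run      */
                out[o++] = token;
                out[o++] = (uint8_t)run;
                out[o++] = in[i];
                i += run;
            } else if (in[i] == token) {    /* escape a literal token byte  */
                out[o++] = token;
                out[o++] = 0;
                i++;
            } else {
                out[o++] = in[i];           /* ordinary literal byte        */
                i++;
            }
        }
        return o;
    }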

ZLIB. Uses the zlib library (Deutsch and Gailly, 1996) to apply the LZ77 compression algorithm followed by Huffman encoding (Huffman, 1952). Zlib allows for Huffman encoding only (denoted below as Z_HUFFMAN_ONLY), which for trace data typically reduces file size more than LZ77 and is faster. However all valid zlib streams are allowed within a ZTR file.

The first byte of the encoded chunk data indicates the algorithm used, followed by any algorithm-specific parameters required for decoding, followed by the encoded data itself. A value of zero for the first byte indicates raw data. Hence ZTR decoders simply need to keep recursively applying the decompression algorithms until the raw data is obtained. Experimentation has determined which sets of algorithms are best applied to each type of chunk. Table 5 lists the default filter and compression types used. Note that other combinations of filters and compression methods may be used, as they still produce a valid ZTR file. The trace amplitudes (in the SAMP chunks) are all treated as 16-bit quantities regardless of their actual scale; 8-bit data (0-255) is compressed best using DELTA2 with two rounds, whereas full 16-bit data is compressed best using three rounds.
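The recursive decoding rule described above (keep stripping layers until the format byte is zero) can be sketched as below; ztr_undo_one_layer() is a hypothetical helper standing in for the individual decoders (ZLIB, RLE, FOLLOW1, 16TO8, DELTA*), not an io_lib function:

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical helper: dispatches on the leading format byte and returns
     * a newly allocated buffer with that one encoding layer undone. */
    uint8_t *ztr_undo_one_layer(uint8_t *buf, size_t len, size_t *new_len);

    /* Keep undoing the outermost layer until the format byte is 0 (raw). */
    uint8_t *ztr_decode_data(uint8_t *buf, size_t len, size_t *out_len) {
        while (len > 0 && buf[0] != 0) {
            size_t new_len;
            uint8_t *next = ztr_undo_one_layer(buf, len, &new_len);
            if (!next) { free(buf); return NULL; }  /* unknown or corrupt */
            free(buf);
            buf = next;
            len = new_len;
        }
        *out_len = len;
        return buf;   /* format byte 0: the raw data follows it */
    }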

IMPLEMENTATION
The source code implementing the ZTR format is contained within a library named 'io_lib'. This library also implements read-only support for Applied Biosystems' ABI and Pharmacia's ALF trace file formats and read-write support for the SCF and CTF formats. io_lib is coded in ANSI C and is known to work on both UNIX and Microsoft Windows based systems. Internally it uses a common C structure for storing a trace along with a common programming interface for reading and writing this structure. This means that an application does not need to know the file format of the trace data, and so as new formats are added existing applications will not need to be modified or even recompiled. io_lib supports the notion of a trace search path, which is independent of the trace format. Traces may be loaded directly from a file on disk in the current working directory, from an alternative directory, or extracted from within a tar file.
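As an illustration of this format-independent interface, the sketch below reads a trace in any supported format and writes it back out as ZTR; the read_reading()/write_reading()/read_deallocate() calls, the TT_* constants and the header path follow our recollection of io_lib and should be checked against the installed headers:

    #include <stdio.h>
    #include <io_lib/Read.h>    /* io_lib's generic trace structure; path may differ */

    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "Usage: %s input_trace output.ztr\n", argv[0]);
            return 1;
        }

        /* Format is auto-detected (ABI, ALF, SCF, CTF or ZTR). */
        Read *r = read_reading(argv[1], TT_ANY);
        if (!r) {
            fprintf(stderr, "Failed to read %s\n", argv[1]);
            return 1;
        }

        /* Write the same in-memory structure back out in ZTR format. */
        if (write_reading(argv[2], r, TT_ZTR) != 0) {
            fprintf(stderr, "Failed to write %s\n", argv[2]);
            read_deallocate(r);
            return 1;
        }

        read_deallocate(r);
        return 0;
    }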



The tar file support allows many trace files to be archived into a single file, which has several benefits. It makes distribution of data much easier, it may reduce disk space and it reduces the number of files on the disk. This is important as most filesystems support a limited number of files, usually specified at the time of formatting. Although this number is typically set very high, a large number of very small files can still cause problems. Many filesystems also have a block size. The space required to store a file of length N will be N rounded up to the next multiple of the block size, averaging N + block_size/2. On Microsoft Windows the block size can often be as much as 64 kb, meaning an average wastage of 32 kb per file. Tar archives typically use an internal block size of 512 bytes, which greatly reduces wasted space. This point is not to be underestimated; a 64 kb block size means that there is usually no saving in switching from gzipped SCF to ZTR unless tar archives are also used. Fortunately UNIX file systems usually have much smaller block sizes; for example in Linux the ext2 filesystem has a block size of 1024, 2048 or 4096 bytes.

Trace files within the tar file may be compressed if desired, although with the ZTR format this is not advisable because ZTR uses its own compression functions. However the complete tar file itself must not be compressed, as this would prevent random access within it. In order to reduce the time spent searching for files within a tar archive, io_lib can use an index file. The index consists of a series of lines containing the trace name and file offset, allowing for complete random access within the trace archive. The current implementation performs a linear search through the index, so access time is still proportional to the number of files in the archive. However the time taken to find a file within a directory is also dependent on the number of files contained within it. At present we only support read access to tar files.
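A sketch of such an index lookup; the line format assumed here (a trace name and a decimal byte offset per line) is for illustration only and may differ from the actual io_lib index layout:

    #include <stdio.h>
    #include <string.h>

    /* Linear search of a trace-archive index for 'name'.  Returns the byte
     * offset into the tar file at which to start reading the trace, or -1
     * if the trace is not present in the index. */
    long find_trace_offset(FILE *index_fp, const char *name) {
        char line_name[256];
        long offset;
        rewind(index_fp);
        while (fscanf(index_fp, "%255s %ld", line_name, &offset) == 2) {
            if (strcmp(line_name, name) == 0)
                return offset;    /* seek here in the tar file, then read */
        }
        return -1;
    }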

RESULTS
We analyzed the performance of ZTR on multiple sets of data covering several machine manufacturers and multiple sequencing chemistries, with each set consisting of 100 traces. The ABI-3700 and MegaBACE data sets (from the Sanger Centre) were re-base-called using Phred 0.990722.g. The Li-Cor data set (from Genoscope) was converted from SCFv2 to SCFv3 format, but was not re-base-called as the Li-Cor base-caller produces confidence values in the same log scale as Phred. All of this data is publicly available on our ftp site. Table 6 presents the gzipped SCF size for each of these three sets along with the size for the ABI-3700 data set in the original ABI file format. The timings are the sum of the user and system CPU times, taken from a 433 MHz Compaq Alpha running Digital UNIX V4.0E.

Table 6. Total size in bytes and timings in seconds for 100 trace files

  Instrument   Format        Size in bytes   Read time (s)   Write time (s)
  ABI-3700     ABI              18 915 025        3.55             –
  ABI-3700     Gzipped ABI       8 780 830        6.18             –
  ABI-3700     Gzipped SCF       2 494 217        1.54            9.15
  MegaBACE     Gzipped SCF       3 953 805        1.73           13.14
  Li-Cor       Gzipped SCF       1 815 428        1.17            5.01

Table 7. File size and I/O times as percentages relative to gzipped SCF

                   Size relative to SCF.gzip                   Average timings
  Format           ABI-3700  MegaBACE  Li-Cor  Average         Read & decode  Encode & write
  SCF.raw            317.1     202.0    264.9    261.3              30.9            7.9
  SCF.gzip           100.0     100.0    100.0    100.0             100.0          100.0
  SCF.bzip2           72.8      75.9     85.7     78.1             370.3          143.4
  SCF.szip            71.1      74.8     80.2     75.4             937.9          164.6
  CTF.raw             96.2     112.2    114.5    107.6              34.2          117.6
  CTF.gzip            70.2      80.3     83.0     77.8              79.2          144.4
  CTF.bzip2           65.6      72.0     79.2     72.2             324.6          217.3
  CTF.szip            63.2      70.5     75.5     69.8             740.1          227.6
  ZTR(1).raw         150.0      99.5    220.1    156.5              34.9            8.2
  ZTR(1).gzip         69.5      73.1     84.2     75.6              85.4           50.6
  ZTR(1).bzip2        62.9      68.7     72.3     68.0             370.9          125.9
  ZTR(1).szip         60.8      67.2     68.4     65.4             779.2          129.8
  ZTR(2).raw          61.6      69.7     79.1     70.1              67.9           34.3

Real times averaged approximately 20% slower for reading (when not cached) and 50% slower for writing.

A comparison between SCF, CTF and ZTR is presented in Table 7. Here we have normalized the sizes and times against the gzipped SCF results from Table 6. Several compression tools are also compared, including gzip (implemented using zlib 1.1.3), bzip2 (version 1.0.1) and szip (version 1.11). Gzip was implemented as a library call and so avoids the need to run an external process. This does not affect the size, but reduces the real and CPU time, and so there is a small bias against the timings for bzip2 and szip. Both gzip and bzip2 are widely used open source programs. Szip is freely available for many operating systems, but is not open source. It is included as an illustration of one of the best general purpose compression tools. Table 7 contains a lot of information, so we have made bold the rows of formats which are faster at reading or writing than all others for a given file size. The remaining rows contain results which are bettered on both speed and size by at least one other format. For example SCF.gzip is always beaten on speed and size by ZTR(2).raw.


Table 8. Relative proportions of data within a ZTR file

  Chunk type   File (%)   Bits/item
  SMP4          92.07     3.24 bits/sample
  CNF4           2.72     4.45 bits/value
  BPOS           2.38     3.90 bits/value
  BASE           1.59     2.60 bits/base
  TEXT           1.23     7.89 bits/character

The ZTR(1) and ZTR(2) formats are both valid ZTR files, but ZTR(1) does not include the final FOLLOW1 and ZLIB compression methods. This means that a raw ZTR(1) file is substantially larger than ZTR(2), but the more complex external compression tools (bzip2 and szip) reduce the ZTR(1) files to less than ZTR(2). ZTR(2)'s internal compression prevents external tools from further reducing the file size, so these values are not shown in Table 7. Both ZTR sets have been encoded using a single SMP4 chunk instead of 4 separate SAMP chunks. In summary, whilst ZTR(2) is not the smallest file format (although it is close), to produce smaller files takes substantially longer. The ZTR(2) implementation fulfils the goals of being faster than gzipped SCF with a much smaller output and so is our default implementation of the ZTR format.

The ZTR(1).gzip files are larger than the ZTR(2) files, despite the fact that the Huffman compression used within ZTR(2) is the same code as used in gzip. This can be explained by noting that different chunks contain byte values with substantially different frequency distributions, but gzip averages all these together (assuming that the entire file fits within one gzip block). An additional benefit to this approach is that random access to any chunk is still possible, which in turn provides faster extraction of specific data. For example, extracting just the base calls from ZTR(2) files is faster than extracting them from ZTR(1).gzip files.

We can see that the Li-Cor files compress much less than the ABI and MegaBACE files. The main reason is that the Li-Cor data only stores 8-bit samples, compared to the 11-bit data from ABI and MegaBACE machines. Scaling down the other data sets to 8-bit samples gives results comparable to the Li-Cor data (ZTR(2) is 72.9% for ABI and 80.4% for MegaBACE). The other factor resulting in differences in compression ratios between data sets is the noise in the trace data (which depends in part on the preprocessing of the original data signals). As the noise increases the entropy of the data also increases, resulting in poorer compression.

Table 8 details the breakdown of a ZTR(2) file expressed as a percentage of the overall size and in bits per item. This table was computed by averaging only the ABI-3700 files.

From this we can see that the Huffman encoding used for TEXT and BASE chunks is not optimal, mostly due to the small size of the information being compressed. The TEXT chunks in this data set do not include the NCBI text attributes and so their average size is just 196 bytes. With longer TEXT chunks the compression rates will improve, but it is unlikely the size will be a significant portion of the total file, so optimizing this will not provide an overall improvement in compression.

DISCUSSION
We have presented ZTR as an extensible and compact replacement for gzipped SCF, but have concentrated on the issues of file compression. The Huffman encoding used in ZTR represents a very basic compression algorithm. Better entropy encoders are known, with arithmetic coding (Rissanen and Langdon, 1979) being the most widely used. They may produce smaller files without too large an impact on speed, but these algorithms often require larger amounts of data to work efficiently. Higher order statistical encoders (such as the PPM family; Cleary and Teahan, 1997) may also reduce space, but these are currently slow algorithms. It can be seen that the higher order block sorting methods (http://www.compressconsult.com/st/), as used in szip, and the Burrows–Wheeler transform (Burrows and Wheeler, 1994), as used in bzip2, give substantial improvements, but again these methods are relatively slow.

However the most profitable strategy may lie in trying to curve-fit the data. We have experimented with using Chebyshev polynomials (Press et al., 1992) fitted to the previous four samples in order to predict a value for the fifth sample. The difference between the predicted and real sample value can then be stored. This is still work in progress, but our current algorithm can compress the ABI-3700 data set to 56.7% of the gzipped SCF size (2.98 bits/sample). CPU performance is still a big issue with this method, with read times being approximately 2.7 times slower than for gzipped SCF files.

We have also examined the use of lossy compression for the trace amplitudes. The simplest way to lose information uniformly is down-scaling. The original SCFv1 implementation stored information in 8 bits, but down-scaling to any range also improves compression. Table 9 shows the results of this form of lossy compression on the 100 ABI-3700 files using the ZTR(2) format. We would not recommend using lossy compression for permanent archiving of trace data, but for visual inspection over the network 7-bit data is generally adequate. The nature of the ZTR format is such that, if useful, any of these alternative or additional methods can be implemented in the future without affecting the reading of older files.


Table 9. The effect of down-scaling on file size

  Range                Average ZTR(2) file size (bytes)
  0–1600 (lossless)    15 949
  0–1024               14 656
  0–512                12 506
  0–256                10 846
  0–128                 9 175
  0–64                  7 924
  0–32                  6 765
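A sketch of this form of down-scaling (our own illustrative code, not what io_lib does internally):

    #include <stddef.h>
    #include <stdint.h>

    /* Lossy down-scaling: rescale 16-bit trace samples so the maximum value
     * maps to 'new_max' (e.g. 127 for 7-bit data).  Smaller ranges mean
     * lower entropy and better compression, at the cost of amplitude
     * resolution. */
    void downscale(uint16_t *samples, size_t n, unsigned new_max) {
        uint16_t max = 1;
        for (size_t i = 0; i < n; i++)
            if (samples[i] > max) max = samples[i];
        if (max <= new_max) return;               /* already within range */
        for (size_t i = 0; i < n; i++)
            samples[i] = (uint16_t)(((uint32_t)samples[i] * new_max) / max);
    }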

The original size of the ABI-3700 data set is more than double the size of the uncompressed SCF files. This is due to the additional information stored in an ABI file. By defining further ZTR chunk types it would be possible to store all the data in an ABI file within ZTR, utilizing appropriate compression methods for each chunk. The main proportion (96%) of an ABI file consists of the 12 DATA channels corresponding to raw and processed copies of the trace data and various instrument settings (voltage, current, power, temperature). Of the remaining ABI information approximately two thirds is base calls and base offsets. All of this already compresses well using ZTR. We estimate that a complete ZTR-encoded ABI file will be 27% of the original size, compared to 49% for gzip and 34% for bzip2. Hence ZTR is a suitable open format for use by manufacturers of sequencing instruments.

In the SCF format the number and type of data items is rigidly defined in the header, with just one single 'private' block for additional data. ZTR overcomes this limitation by having an arbitrary number of chunks, with either public or private data types. CTF also overcomes many of the SCF limitations; however, it does not distinguish between public and private chunk types and does not separate the data from the compression and filter algorithms. These last two differences directly impact on the extensibility, and hence the long term future, of the format. ZTR file readers can be assured that chunk types listed in the public specification will be in a known format, but this does not preclude the addition of new chunk types or the development of new compression algorithms.

ZTR could also be used for related data such as that generated in Single Stranded Conformational Polymorphism (SSCP; Hayashi, 1991) experiments. Preliminary investigations of SSCP data have shown that ZTR produces size reductions similar to those achieved for sequencing traces. Although the public specification does not explicitly discuss the storage of SSCP data, it is envisaged that this will be achieved by using the existing SAMP chunk types with appropriate meta-data fields.

If, as we hope, others wish to contribute new ZTR chunk types and compression methods, we suggest that they contact us beforehand so that we can help to avoid duplication of work and reduce any fragmentation of the format. Initially such additions should be implemented as private types, but once stabilized these could be migrated to public types in future revisions of the format.

In the Staden Package, individual traces stored in formats readable by io_lib can be viewed using a program called Trev (Bonfield et al., 2002), available from ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/trev/. Also via io_lib, multiple traces can be viewed using the package's main sequence assembly and editing program, Gap4 (Staden et al., 1998), which uses a single binary but machine independent database for each sequencing project. This database stores sequence readings, confidence values, contigs, templates, read-pair data, annotations, edit information and links to the trace data. The package also contains the Gap4 viewer (ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/gap4_viewer/). This viewer is a complete but read-only version of Gap4 and hence enables all of the above information to be displayed using its graphical user interface. Gap4 viewer and Trev executables for UNIX and Microsoft Windows are available free to commercial and academic users from our ftp site.

The io_lib implementation could readily be extended to provide additional search paths to allow direct file loading over the internet, possibly using CORBA (Parsons et al., 1999). This would enable any program using io_lib, including Trev and the Gap4 viewer, to access and display traces directly from remote trace archives.

Calculations based on typical files obtained from the Sanger Centre show that a gzipped Gap4 database from a finished assembly project occupies only 3% of the storage required for the project's gzipped SCF files. CAF (Dear et al., 1998) files are of comparable size. Although consensus confidence values are useful when it comes to checking the evidence for individual bases in a consensus sequence from a genome project, we believe that where doubts arise most people would prefer to see all the relevant sequences and traces aligned. They are also likely to be interested only in specific regions and hence not need to download all the traces from the relevant project. Bringing these last arguments together, in our view, if the sequence assembly databases were made publicly available, the extra 3% of storage needed would greatly increase the value of trace and sequence data archives and, in addition to the contribution made by ZTR, further reduce the bandwidth required to service the expected growth in this information.

ACKNOWLEDGEMENTS
The authors would like to thank Jean Thierry-Mieg for adding CTF to io_lib, which catalyzed us into finishing our own work on ZTR; Mark Jordan for the meta-data and general comments; Andrew McLachlan for the Chebyshev prediction idea; Steven Leonard for extending io_lib to use zlib instead of gzip and for his ideas with tar support; and both the Sanger Centre and Genoscope for providing test data. This work was supported by the UK Medical Research Council.

REFERENCES

ATQA (1998) http://www.wagner.com/technologies/biotech/atqaadcopy.html, Wagner Associates.
Bonfield,J.K. and Staden,R. (1995) The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res., 23, 1406–1410.
Bonfield,J.K., Beal,K.F., Betts,M.J. and Staden,R. (2002) Trev: a DNA trace viewer. Bioinformatics, 18, 194–195.
Boutell,T. et al. (1997) Portable Network Graphics (PNG) specification version 1.0. RFC 2083, http://www.libpng.org/pub/png.
Burrows,M. and Wheeler,D.J. (1994) A block-sorting lossless data compression algorithm. Technical Report, Digital Equipment Corporation, Palo Alto, CA.
Cleary,J.G. and Teahan,W.J. (1997) Unbounded length contexts for PPM. The Comput. J., 40, 67–75.
Dear,S., Durbin,R., Hillier,L., Marth,G., Thierry-Mieg,J. and Mott,R. (1998) Sequence assembly with CAFTOOLS. Genome Res., 9, 260–267.
Dear,S. and Staden,R. (1992) A standard file format for data from DNA sequencing instruments. DNA Sequence, 3, 107–110.
Deutsch,P. and Gailly,J.-L. (1996) ZLIB compressed data format specification version 3.3. RFC 1950, http://www.gzip.org/zlib/.
Ensembl Trace Server (2000) http://trace.ensembl.org/.
Ewing,B. and Green,P. (1998) Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res., 8, 186–194.
Gorrell,H.G. et al. (2000) NCBI trace archive RFC. http://www.ncbi.nlm.nih.gov/Traces/TraceArchiveRFC.html.
Hayashi,K. (1991) PCR-SSCP: a simple and sensitive method for detection of mutations in the genomic DNA. PCR Meth. Appl., 1, 34–38.
Huffman,D.A. (1952) A method for the construction of minimum-redundancy codes. Proc. IRE, 40, 1098–1101.
NCBI Trace Archive. http://www.ncbi.nlm.nih.gov/Traces/.
NC-IUB (1985) Nomenclature Committee of the International Union of Biochemistry. Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984. Eur. J. Biochem., 150, 1–5. http://www.chem.qmw.ac.uk/iubmb/misc/naseq.html.
Parsons,J.D., Buehler,E. and Hillier,L. (1999) DNA sequence chromatogram browsing using JAVA and CORBA. Genome Res., 9, 277–281.
Press,W.H., Teukolsky,S.A., Vetterling,W.T. and Flannery,B.P. (1992) Numerical Recipes in C: The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge.
Rissanen,J.J. and Langdon,G.G. (1979) Arithmetic coding. IBM J. Res. Develop., 23, 149–162.
Staden,R., Beal,K.F. and Bonfield,J.K. (1998) The Staden Package 1998. Comput. Meth. Mol. Biol., 132, 115–130.
