Vol. 12 no. 1 1996 Pages 1-8
CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases

Giorgio Grillo, Marcella Attimonelli^1, Sabino Liuni and Graziano Pesole^1,2
Centro di Studio sui Mitocondri e Metabolismo Energetico, CNR and ^1 Dipartimento di Biochimica e Biologia Molecolare, Universita di Bari, Italy
^2 To whom correspondence should be addressed. E-mail: [email protected]

© Oxford University Press

Abstract

A key concept in comparing sequence collections is the issue of redundancy. The production of sequence collections free from redundancy is undoubtedly very useful, both in performing statistical analyses and in accelerating extensive database searching on nucleotide sequences. Indeed, publicly available databases contain multiple entries of identical or almost identical sequences. Performing statistical analysis on such biased data carries a high risk of assigning significance to non-significant patterns. In order to carry out unbiased statistical analysis, as well as more efficient database searching, it is thus necessary to analyse sequence data that have been purged of redundancy. Given that an unambiguous definition of redundancy is impracticable for biological sequence data, the present program uses a quantitative description of redundancy based on the measure of sequence similarity. A sequence is considered redundant if its degree of similarity and overlap with a longer sequence in the database exceeds a threshold fixed by the user. In this paper we present a new algorithm, based on an 'approximate string matching' procedure, which is able to determine the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database and to generate automatically nucleotide sequence collections free from redundancies.

Introduction

The way data are stored in biosequence databases strongly affects the scientific approach to be used and the quality of science that can be done with the database. In this context, a key concept is redundancy, even if biological data are too complex to fit a simple definition of redundancy. The sequence databases now available contain a considerable amount of redundancy, as they may contain multiple copies of the same sequence. Redundant entries can originate from multiple submission of the same sequence or from multiple sequencing of the same gene in the same organism. Even if in the latter case redundancy may have biological significance in studying polymorphic patterns, it can still prevent a given sequence collection from being used for some statistical analyses, e.g. methods to identify sequence patterns peculiar to the collection under investigation, or can greatly slow down extensive database searching. As a qualitative definition of sequence redundancy is impracticable, we shall apply a quantitative description based on the measure of sequence similarity, depending on the particular application we wish to carry out. According to the above considerations, the identification of redundant sequences in a given sequence collection can be achieved simply by performing pairwise alignments between all the sequences in the database. However, this approach is impracticable for large sequence collections, as it would require a considerable amount of computational time and is not user-friendly. Here we present a new algorithm, based on an 'approximate string matching' procedure, which is able to determine the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database without performing the time-consuming task of pairwise global best-alignments. From the redundant nucleotide sequence collection the algorithm generates a new dataset purged of redundancy according to cutoff parameters set by the user. This method is very fast and sensitive, and allows the user to generate automatically, and in a user-friendly way, sequence databases that are free of redundancies.

Systems and methods

The algorithm presented here has been implemented in CLEANUP, a computer program written in C and running on VMS and UNIX operating systems. CLEANUP accepts Pearson/FastA and GCG input formats for nucleotide sequences and automatically produces a new dataset purged of redundant sequences in both of these formats. It is available free of charge by anonymous FTP (IP address: area.ba.cnr.it, directory: pub/software/cleanup).
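Once similarity and overlap values are available for a pair of sequences, the redundancy criterion described above reduces to a simple threshold test. The following C fragment is a minimal sketch of that test; the function name, the percentage scale and the two cutoff parameters are our assumptions for illustration, not CLEANUP's actual interface:

```c
/* Sketch of the user-threshold redundancy criterion: a (shorter) sequence
 * is flagged redundant when both its similarity to a longer database
 * sequence and the fraction of its length covered by the match exceed the
 * user-fixed cutoffs. Names and percentage scale are hypothetical. */
int is_redundant(double similarity_pct, double overlap_pct,
                 double sim_cutoff, double ovl_cutoff)
{
    return similarity_pct >= sim_cutoff && overlap_pct >= ovl_cutoff;
}
```

With, say, cutoffs of 95% similarity and 90% overlap, a pair scoring 98%/95% would be flagged redundant, while 80%/95% would not.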
The algorithm has been tested on two sequence datasets: (i) 362 nucleotide sequences extracted from the EMBL database (release 41) through the ACNUC retrieval program (Gouy et al., 1985) using the 'SP = Homo sapiens and O = mitochondrion' selection criteria, which extract all human entries reporting 'Mitochondrion' in the OG field of the EMBL entry; (ii) 2400 nucleotide sequences (5 523 925 bases) defined by the 'in:dro*' GCG logical name, which selects all Drosophila entries of the invertebrate division of the GenBank collection (release 90).

Algorithm
The computational model adopted to recognize redundancies (totally or partially overlapping sequences) within collections of nucleotide sequences is based on the observation that the probability of the same oligo of length > 8 nucleotides occurring in two or more sequences is rather low for evolutionarily unrelated sequences. The occurrence of a common oligo will thus be considered a clue of possible redundancy, to be further tested by performing a progressive search of matching oligos through the full length of both sequences under examination. The searching process needs to perform pairwise comparisons of each sequence with all the others. We define the first sequence in the comparison as the 'primary sequence' and all the others as 'secondary sequences'. The algorithm is structured in the following steps:

1. Sorting of the sequences according to decreasing length.
2. Numeric coding of the sequences and generation of the index structure.
3. Generation of the primary-sequence data structure containing the pattern positions and compositions.
4. Similarity searching between primary and secondary sequences.
5. Creation of a sequence collection free from redundancy according to user-selected parameters.

As the sequences are sorted according to decreasing length, the primary sequence will certainly have a length equal to or greater than that of any secondary sequence. This sorting greatly improves the searching phase, both in the case of similar sequences partially or totally overlapping and in the case of sequences sharing similar segments. In the first case only three possibilities can occur: (i) total coincidence, (ii) partial overlapping on the right or (iii) partial overlapping on the left (Figure 1a-c). Finding a match between the rightmost or leftmost element of the secondary sequence and any element of the primary one provides a strong clue of high sequence similarity and will thus prime the overall searching procedure. In the second case (Figure 1d-g), the sorting minimizes the number of comparisons to be carried out between the primary and the secondary sequences. Figure 2 schematically shows how the numeric coding and the index structure are generated. Given a sequence S, L nucleotides long, extract all possible L - w + 1 w-mers. From each w-mer P = s1 s2 ... sw (si = A, C, G, T) all possible k-tuples (k < w), ph, are generated, formed by
Fig. 1. Similarity patterns (shaded areas) between primary (upper) and secondary (lower) sequences taken into account by the CLEANUP algorithm. Cases (f)-(g) refer, respectively, to similar domains found in dissimilar surrounding regions and to possible domain shuffling. Arrows indicate 'guide patterns' used for hooking similar segments in the compared sequences.
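The 'guide patterns' of Figure 1 can be illustrated with a small C sketch: once the primary sequence has been indexed by k-tuple, a k-tuple met while scanning the secondary sequence 'hooks' the two sequences at the primary positions where it also occurs. The structure layout and function name below are hypothetical, chosen only to illustrate the lookup; CLEANUP's internals may differ:

```c
/* Hypothetical per-k-tuple element of the primary-sequence index: each
 * element keeps, via a pointer, the list of positions at which that
 * k-tuple occurs in the primary sequence (possibly none, possibly many). */
struct ktuple_entry {
    int  count;      /* number of occurrences in the primary sequence */
    int *positions;  /* start positions, length `count`; NULL if absent */
};

/* A k-tuple shared by primary and secondary sequences becomes a guide
 * pattern: return the first primary-sequence position at which the
 * secondary sequence's k-tuple (given by its numeric code) occurs,
 * or -1 when the k-tuple is absent from the primary sequence. */
int find_guide_pattern(const struct ktuple_entry *table, long code)
{
    if (table[code].count == 0)
        return -1;
    return table[code].positions[0];
}
```

From the hooked position, the progressive search of matching oligos can then be extended in both directions along the two sequences.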
[Figure 2 appears here: diagram showing, for a primary sequence, the extraction of w-mers and k-tuples, a guide pattern shared with a secondary sequence, and the index structure linking each k-tuple to its positions and compositions.]
Fig. 2. Schematic description of how the data structure is generated for each primary sequence. To each data-structure element, corresponding to a specific k-tuple, the information on the positions and compositions of all k-tuples extracted from the primary sequence is linked through pointers. It may happen that the same k-tuple occurs more than once, or that it is not found at all. Common k-tuples found between the primary and secondary sequences (shaded box) define the guide pattern used for hooking similar segments in the compared sequences.
concatenating nucleotides, even non-contiguous, extending over the string of interest:

ph = si1 si2 ... sik

where i1, ..., ih, ..., ik is one of the possible combinations of the w nucleotides of P, with i1 < i2 < ... < ik.
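The k-tuple extraction just described amounts to enumerating all increasing index combinations i1 < ... < ik within a w-mer and concatenating the corresponding, possibly non-contiguous, nucleotides. A self-contained C sketch follows; the function names and the base-4 numeric coding are our assumptions, since the paper does not specify the actual encoding scheme:

```c
#include <string.h>

/* Generate every k-tuple ph from the w-mer P = s1...sw by taking all
 * increasing index combinations i1 < ... < ik (positions may be
 * non-contiguous). Writes up to max_out NUL-terminated k-tuples into
 * out[] and returns the total number of combinations, C(w, k). */
int generate_ktuples(const char *wmer, int w, int k,
                     char out[][16], int max_out)
{
    int idx[16];
    int n = 0;

    if (k <= 0 || k > w || k >= 16)
        return 0;
    for (int i = 0; i < k; i++)
        idx[i] = i;                  /* first combination: 0, 1, ..., k-1 */
    for (;;) {
        if (n < max_out) {
            for (int j = 0; j < k; j++)
                out[n][j] = wmer[idx[j]];
            out[n][k] = '\0';
        }
        n++;
        /* advance idx[] to the next increasing combination */
        int j = k - 1;
        while (j >= 0 && idx[j] == w - k + j)
            j--;
        if (j < 0)
            break;                   /* all combinations generated */
        idx[j]++;
        for (int m = j + 1; m < k; m++)
            idx[m] = idx[m - 1] + 1;
    }
    return n;
}

/* Numerically code a k-tuple as a base-4 integer in [0, 4^k), a plausible
 * index into the k-tuple data structure (our assumption, not CLEANUP's
 * documented scheme). Returns -1 on any non-ACGT character. */
long ktuple_code(const char *s, int k)
{
    long code = 0;
    for (int i = 0; i < k; i++) {
        int b;
        switch (s[i]) {
        case 'A': b = 0; break;
        case 'C': b = 1; break;
        case 'G': b = 2; break;
        case 'T': b = 3; break;
        default:  return -1;         /* ambiguity codes rejected */
        }
        code = code * 4 + b;
    }
    return code;
}
```

For w = 4 and k = 2, for instance, the w-mer ACGT yields the six k-tuples AC, AG, AT, CG, CT and GT, matching the C(w, k) combinations implied by the formula above.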