Two Dimensional Parameterized Matching

June 16, 2017 | Autor: Carmit Hazay | Categoria: Color Image, Time Complexity

Descrição do Produto

Two Dimensional Parameterized Matching Richard Cole∗

Carmit Hazay†

Moshe Lewenstein†,‡

Dekel Tsur§

Abstract Two equal length strings, or two equal sized two-dimensional texts, parameterize match (p-match) if there is a one-one mapping (relative to the alphabet) of their characters. Two-dimensional parameterized matching is the task of finding all m × m substrings of an n×n text that p-match to an m×m pattern. This models, for example, searching for color images with changing of color maps. We present two algorithm that solve the two-dimensional parameterized matching problem. The time complexities of our algorithms are O(n2 log2 m) and O(n2 + m2.5 · polylog(m)).

1

Introduction

Let S and S ′ be two equal length strings. We say that S and S ′ parameterize match, or p-match for short, if there is a bijection π from the alphabet of S to the alphabet of S ′ such that S ′ [i] = π(S[i]) for every index i. In the parameterized matching problem, introduced by Baker [8,9], one is given a text T and a pattern P and the goal is to find all the substrings of T of length |P | that p-match to P . Baker introduced parameterized matching for applications that arise in software tools for analyzing source code. Other applications for parameterized matching arise in image processing and computational biology (see [2]). In [8, 9], an optimal linear time algorithm was given for the parameterized matching problem when the alphabet has constant size. An optimal algorithm for parameterized matching in the presence of an unbounded size alphabet was given in [5]. In [8, 9], the parameterized matching problem is solved by constructing parameterized suffix trees, which also allows for online p-matching. The parameterized suffix tree was further explored by Kosaraju [16] and faster constructions were given by Cole and Hariharan [12]. In [6], approximate parameterized matching was introduced and a solution for binary alphabets was given. In [15], an O(nk 1.5 + mk log m) time algorithm was given for approximate parameterized matching with k mismatches, and a strong relation was shown between this problem and maximum matchings in bipartite graphs. ∗

New York University, [email protected]. Bar-Ilan University, {harelc,moshe}@cs.biu.ac.il. ‡ ML was supported by an IBM faculty award grant. § Ben-Gurion University of the Negev, [email protected].

†

1

One of the interesting problems in web searching is searching for color images, see [4,7,17]. If the colors are fixed, this is exact two-dimensional pattern matching [3]. However, images can appear under different color maps: The image is the same image, however each color has been recolored with a unique color. Two-dimensional parameterized search is precisely what is needed. An algorithm for two-dimensional parameterized matching was given in [2] and its time complexity is O(n2m log2 m log log m) for a text of size n × n and a pattern of size m × m. It is an open question whether a linear time algorithm for two-dimensional parameterized matching exists. In this paper we show two new algorithms for the problem. The first algorithm is almost linear in the input size, and the second algorithm is linear in the text size, but with a penalty in the preprocessing stage. We propose two algorithms. The first algorithm is a convolutions-based method that uses a novel reduction of the two-dimensional space to one dimension, with careful handling of out-of-boundary entries for two-dimensional queries. The second algorithm is a dueling based solution that utilizes properties of the two-dimensional form of the problem. Timewise, the first algorithm has time complexity O(n2 log2 m). The time complexity of the second algorithm is O(n2 + m2.5 · polylog(m)). For alphabets drawn from a large universe the time complexity increases to O(n2 log σ + m2.5 · polylog(m)), where σ = min(m, |Σ|). However, this is unavoidable as it was shown in [5] that one-dimensional parameterized matching requires Ω(n log σ) time in the comparison model. One step of our algorithms is to count the number of distinct characters in every m × m substring of an n × n string. Amir et al. [4] gave an O(n2 log m)-time algorithm for this problem. We show in this paper an O(n2 )-time algorithm for the problem. This result may be of independent interest. The rest of the paper is organized as follows: Section 2 contains definitions. The O(n2 log2 m)-time algorithm is given in Section 3. The O(n2 + m2.5 · polylog(m)) algorithm is described in Section 4. In Section 5 we present the algorithm for substrings character counting.

2

Preliminaries

Let S and S ′ be two-dimensional strings of equal size. We say that there is a function matching between S and S ′ if there is a mapping f from the alphabet of S to the alphabet of S ′ such that S ′ [x, y] = f (S[x, y]) for all x and y. If the mapping f is one-to-one, we say that S and S ′ parameterize match, or p-match for short. Note that the definition of function matching is asymmetric whereas the definition of parameterized matching is symmetric. The two-dimensional parameterized matching problem is defined as follows: Input: An n × n text T and an m × m pattern P . Output: All substrings of T of size m × m that p-match to P . Observation 1. There is a parameterized matching between S and S ′ if and only if there is a function matching between S and S ′ , and the number of distinct characters in S is equal to the number of distinct characters in S ′ . 2

Exploiting Observation 1, our algorithms have the following structure: First, our algorithms create a list L of m × m substrings of T such that 1. Every string in L function matches to P . 2. Every substring of T that p-matches to P appears in L. The second step is computing the number of distinct characters in every m × m substring of T . The strings in L that have the same number of distinct characters as P are precisely the substrings of T that p-match to P . This stage takes O(n2 ) time using the algorithm in Section 5. The left-to-right/top-to-bottom traversal order of a two-dimensional string is an order of the locations inside the string that starts with the locations of the first (topmost) row from left to right, then the locations of the second row from left to right, and so on. Other traversal orders are defined similarly. We use [a, b] × [c, d] to denote the set of all pairs (x, y) of integers with a ≤ x ≤ b and c ≤ y ≤ d. We will write [a]×[c, d] instead of [a, a]×[c, d] and [a, b]×[c] instead of [a, b]×[c, c].

3

An O(n2 log2 m) algorithm

It is helpful to recall the one-dimensional parameterized matching algorithm, due to Amir, Farach, and Muthukrishnan [7]. It is similar to the Knuth-Morris-Pratt string matching algorithm. The key idea is to recode the occurrences of each character in terms of their separation; namely, if a character c occurs in the pattern (or text) at locations i1 < i2 < · · · < ik , these occurrences of c are replaced by the numbers 0, i2 − i1 , i3 − i2 , . . . , ik − ik−1 , respectively. In other words, each character of the pattern (or text) is replaced by the distance to the nearest occurrence of the same character to the left, or by 0 if there is no such occurrence. Except for the first occurrence of each character, a parameterized match in the original text and pattern corresponds to a standard match in the recoded text and pattern. One perspective on this is that all occurrences of the same character in the pattern (and the text) have been connected into a structure. Identifying a match becomes a question of finding the alignments of the pattern against the text for which the structures in the pattern and text match. We will use a similar construction for the two-dimensional problem. However, this will now require creating connections from each character to Θ(log m) occurrences of the character. Our solution has the following form. Let t = ⌈log2 m⌉. For each location (x, y) in the pattern (or the text) we define 4t + 4 disjoint rectangles. From each rectangle we may select a location (x′ , y ′) that contain the same character as (x, y) (if there is such a location in the rectangle). The location (x′ , y ′) is called a neighbor of (x, y). We now describe the rectangles for location (x, y) in the pattern. The first four rectangles comprise row x and column y partitioned at location (x, y), i.e., the rectangles are [x] × [y + 1, m], [x] × [1, y − 1], [x + 1, m] × [y], and [1, x − 1] × [y]. Next, we define t disjoint 3

rectangles R(x,y),0 , . . . , R(x,y),t−1 that cover the quadrant below and to the right of (x, y). For i = 0, . . . , t − 2, R(x,y),i = [x + m − 2i+1 + 1, x + m − 2i ] × [y + 1, m] and R(x,y),t−1 = [x + 1, x + m − 2t−1 ] × [y + 1, m].

Some of the rectangles may not fit inside the pattern, namely, they contain rows greater than m. These rectangles will be inactive and will not generate neighbors. For the quadrant above and to the right of (x, y) we define the rectangles S(x,y),i = [x − m + 2i , x − m + 2i+1 − 1] × [y + 1, m] for i = 0, . . . , t − 2, and S(x,y),t−1 = [x − m + 2t−1 , x − 1] × [y + 1, m]. Analogous rectangles are defined for the remaining two quadrants. See Figure 1 for an example. Suppose that P [x, y] = c. Each active rectangle is traversed in a direction away from (x, y) until the first occurrence of c is found. More precisely, the traversal order is top-tobottom/left-to-right in the rectangles to the right of column y, top-to-bottom/right-to-left in the rectangles to the left of column y, top-to-bottom in the rectangle [x + 1, m] × [y], and bottom-to-top in the rectangle [1, x − 1] × [y]. The first occurrence of c in some rectangle is the neighbor generated by the rectangle, and if there is no such occurrence then the rectangle does not generate a neighbor. The selection of neighbors for a location (x, y) in the text is performed again by defining rectangles around (x, y). The only difference is that the rectangles that are to the right of column y now extend until column n, and the rectangle [x + 1, m] × [y] now extends until row n. Note that the number of rectangles and their rows are still defined according to m. We say that two locations (x, y) and (x′ , y ′) are linked if there is a series of locations (z0 , w0 ) = (x, y), (z1 , w1 ), . . . , (zl , wl ), (zl+1 , wl+1) = (x′ , y ′) such that for all i, either (zi , wi ) is a neighbor of (zi+1 , wi+1 ), or (zi+1 , wi+1 ) is a neighbor of (zi , wi ) (or possibly both). The following lemma shows that all the occurrences of some character in the pattern are linked. Lemma 1. Let (x, y) and (x′ , y ′ ) be two locations in the pattern both holding the same character. Then, these two locations are linked. Proof. Clearly, if x = x′ then there is a series of locations along row x linking the locations (x, y) and (x′ , y ′). Similarly, these two locations are linked if y = y ′. We now consider the case that x 6= x′ and y 6= y ′ . W.l.o.g. suppose that x < x′ and y < y ′. We claim that either (x, y) lies in one of the active rectangles of (x′ , y ′) or (x′ , y ′) lies in one of the active rectangles of (x, y) (or possibly both). W.l.o.g. suppose that x ≤ m/2. If (x′ , y ′) is inside one the active rectangles of (x, y) then we are done. Otherwise, there must be at least one inactive rectangle among R(x,y),0 , . . . , R(x,y),t−1 . 4

a a

a

a

a a a

a a

a a a a

a a a

a a

a a

a a a a

Figure 1: An example of neighbors selection for a pattern of size 8 × 8 (left) and a text of size 11 × 11 (right). The neighbors of (3, 4) in P and T are marked in bold. The active rectangles for location (3, 4) in P and T are shown in the figure. The rectangles for location (3, 4) in P in the bottom-right quadrant are R(3,4),0 = [10] × [5, 8], R(3,4),1 = [8, 9] × [5, 8], and R(3,4),2 = [4, 7] × [5, 8]. Only the last rectangle is active. The corresponding rectangles for location (3, 4) in T are [10] × [5, 12], [8, 9] × [5, 12], and [4, 7] × [5, 12]. All these rectangles are active. In this example P p-matches to T [1 . . 8, 1 . . 8], so by Observation 2 the neighbors of (3, 4) in P are aligned with the neighbors of (3, 4) in T . Let R(x,y),j be the topmost inactive rectangle. From the fact that x ≤ m/2 it follows that R(x,y),t−1 is active, so j ≤ t − 2. The location (x′ , y ′) must be inside R(x,y),j (since the last row of R(x,y),j is larger than m). By the symmetry of the construction we have that (x, y) is inside S(x′ ,y′ ),j . We will now show that S(x′ ,y′ ),j is active. Since (x′ , y ′) ∈ R(x,y),j we have that x′ ≥ x + m − 2j+1 + 1. Moreover, from the fact that R(x,y),j does not fit inside the pattern we have that x + m − 2j > m, so x > 2j . Thus, x′ > 2j + m − 2j+1 + 1 = m − 2j + 1, so x′ − m + 2j > 1. The last inequality implies that the rectangle S(x′ ,y′ ),j is active. We have shown that (x, y) is inside an active rectangle of (x′ , y ′). We have shown that either (x, y) lies in one of the active rectangles of (x′ , y ′), or (x′ , y ′) lies in one of the active rectangles of (x, y). W.l.o.g. suppose that the latter occurs. It need not be that (x′ , y ′) is a neighbor of (x, y), however. Nonetheless, by induction on y ′ − y, we show that they are linked. The base case y = y ′ has already been demonstrated. Let (x′′ , y ′′) denote the neighbor of (x, y) in the rectangle containing (x′ , y ′). Then y < y ′′ ≤ y ′ . By induction, (x′′ , y ′′) and (x′ , y ′) are linked and the inductive claim follows. We say that the neighbors of location (x, y) in P are aligned with the neighbors of location (x+a, y+b) in T if for all i, if the i-th rectangle of (x, y) in P produces a neighbor (x′ , y ′) then the i-th rectangle of (x+ a, y + b) in T produces a neighbor and the neighbor is (x′ + a, y ′ + b). Observation 2. Let T ′ = T [a + 1 . . a + m, b + 1 . . b + m] be some substring of T .

5

1. If P p-matches to T ′ then for every location (x, y) in P , the neighbors of (x, y) in P are aligned with the neighbors of (x + a, y + b) in T . 2. If for every (x, y), the neighbors of (x, y) in P are aligned with the neighbors of (x + a, y + b) in T , then there is a function matching between P and T ′ . Proof. Part 1 of the observation follows from the construction. To prove part 2, consider two locations (x1 , y1 ) and (x2 , y2) in P with P [x1 , y1 ] = P [x2 , y2 ]. By Lemma 1, (x1 , y1 ) is linked with (x2 , y2 ) in P . As the neighbors of a location (x, y) in P are aligned with the neighbors of (x + a, y + b) for every (x, y), it follows that (x1 + a, y1 + b) is linked with (x2 + a, y2 + b) in T . Therefore T [x1 + a, y1 + b] = T [x2 + a, y2 + b]. Since this holds for every two locations in P that contain the same character, we conclude that there is a function matching between P and T ′ . We note that the converse of part 1 of the observation does not hold, namely, if P pmatches to T ′ then the neighbors of a location (x + a, y + b) in T ′ are not necessarily aligned with neighbors of (x, y) in P . For example, in Figure 1, location (3, 4) in T has a neighbor (3, 9) which has no corresponding neighbor in P . We now describe how to transform the problem of aligning the neighbor structures into an exact matching problem. The text and pattern are recoded as follows: Each location (x, y) in the pattern is encoded by a sequence of 4t + 4 = Θ(log m) characters c1 · · · c4t+4 . The character ci corresponds to the i-th rectangle of (x, y). If the i-th rectangle of (x, y) produces a neighbor (x′ , y ′) then ci = (x − x′ , y − y ′ ). If the i-th rule does not produce a neighbor then ci = φ. The text is recoded similarly, except that when a rectangle yields no neighbor we set ci = 0. Let P2 and T2 denote the recoded pattern and text. Treating φ as a wildcard, Observation 2 implies that for every m × m substring T ′ of T and the corresponding substring T2′ of T2 , 1. If there is a parameterized matching between P and T ′ then there is an exact wildcard matching between P2 and T2′ . 2. If there is an exact wildcard matching between P2 and T2′ then there is a function matching between P and T ′ . Using standard techniques, the wildcard matching problem can be made one-dimensional. The recoded pattern and text have sizes O(m2 log m) and O(n2 log m), respectively. Thus, the wildcard matching problem can be solved in time O(n2 log2 m) [10, 11]. It remains to show how to identify the neighbors. This is readily done in O(m2 log2 m) time in the pattern and O(n2 log2 m) time in the text. We describe the approach for the pattern. Finding the neighbors in the rectangles [x]×[y +1, m], [x]×[1, y −1], [x+1, m]×[y], and [1, x − 1] × [y] for all locations (x, y) is done by simple scans of the rows and columns of the pattern. We next describe how to find the neighbors in the remaining rectangles. The idea is to maintain for each character c, windows w1 , . . . , wt−1 over the pattern. A window wi has a width of m columns and a height of 2i rows (except the window wt−1 6

which has height m − 2t−1 ). For each i, the window wi is slid down the pattern. During the movement of wi , the occurrences of c that are inside wi are kept in a balanced search tree in top-to-bottom/left-to-right order. For a location (x, y) in P that contains the character c, we can find the neighbor of (x, y) in some rectangle R(x,y),i by doing a search on the binary search tree of wi at the time that the window wi covers the rows of R(x,y),i . Each search takes O(log m) time. Thus, over all characters and neighbors the searches take O(m2 log2 m) time. To slide a window down one row requires deleting some character instances and adding others. This takes time O(log m) per change, and as each character instance is added once and deleted once from a window of each size, this takes time O(m2 log2 m) over all characters and windows.

4

An O(n2 + m2.5 · polylog(m)) algorithm

In this section we present another algorithm for parameterized matching that follows the “duel-and-sweep” paradigm. This paradigm appeared in [3] (there it was named “consistency and verification” and was used for two-dimensional exact matching) and is based upon the duelling technique [18, 19]. We begin by giving an overview of the “duel-and-sweep” algorithm for two-dimensional exact matching. The algorithm maintains a list of candidates that initially contains all m×m substrings of T . The list is pruned in two stages, called the duelling stage and the sweeping stage. After these stages the list contains exactly the substrings of T that match P . Consider two candidates T1 = T [x . . x + m − 1, y . . y + m − 1] and T2 = T [x + a . . x + a + m − 1, y + b . . y + b + m − 1], where a ≥ 0 and b ≥ 0. Suppose that these two substrings share common letters of T (in other words, the rectangles [x, x + m − 1] × [y, y + m − 1] and [x + a, x + a + m − 1] × [y + b, y + b + m − 1] have non-empty intersection). If both these candidates match P then we conclude that when we align P against itself with offset (a, b), the overlapping areas in the two copies of P match. In other words, we have that P1 = P [1 . . m−a, 1 . . m−b] matches P2 = P [a+1 . . m, b+1 . . m]. If on the other hand P1 does not match P2 , then T1 and T2 cannot both match P . In this case at least one of the candidates T1 and T2 can be ruled out in constant time in a process called duelling: Let (x′ , y ′) be a location in P1 such that P1 [x′ , y ′ ] 6= P2 [x′ , y ′]. The location (x′ , y ′) is called a witness for (a, b). Let c = T [x+ a+ x′ −1, y + b+ y −1]. Clearly, we have that either c = P1 [x′ , y ′], c = P2 [x′ , y ′], or neither of these equalities hold. In the first case c 6= P2 [x′ , y ′] = P [x′ + a, y ′ + b], so T2 does not match P . In the second case c 6= P1 [x′ , y ′] = P [x′ , y ′], so T1 does not match P . In the last case, neither T1 nor T2 match P . In order to perform duels between candidates, the algorithm precomputes a witness table that contains a witness for every mismatch offset (a, b). After performing all possible duels between the candidates, we have that the remaining candidate are pairwise consistent. Amir et al. [3] showed how to perform the duels in O(n2 ) time. The duelling stage of Amir et al. relies on a transitivity property of the consistency relation, which follows from the fact that exact matching is a transitive relation. Note that parameterized matching is also transitive, and this fact allows us to use the duelling stage of Amir et al. in our algorithm. 7

In the sweeping stage the algorithm checks for each candidate whether it forms a match. Using the fact that the candidates are consistent, this stage is also implemented in O(n2 ) time. In the next sections we describe how to adapt the “duel-and-sweep” algorithm to twodimensional parameterized matching.

4.1

Duelling stage

In exact matching a witness is simply one location with a mismatch between the two aligned copies of the pattern. However, in parameterized matching one location is not sufficient to rule out a match. The following definition captures the concept of a witness in parameterized matching. Definition 1 (mismatch offset). Let P be a pattern of size m × m. An offset (a, b) with b ≥ 0 is called a mismatch offset of P if when P is aligned with itself with offset (a, b), the overlapping areas in the two copies of P do not p-match (in other words, if a ≥ 0 then (a, b) is a mismatch offset if P [1 . . m − a, 1 . . m − b] does not p-match to P [a + 1 . . m, b + 1 . . m], and if a < 0 then (a, b) is a mismatch offset if P [1 − a . . m, 1 . . m − b] does not p-match to P [1 . . m + a, b + 1 . . m].) Definition 2 (witness). Let (a, b) be a mismatch offset of a pattern P . A witness w.r.t. (a, b) is a pair of locations (x, y), (x′ , y ′) such that one of the following holds: 1. P [x, y] = P [x′ , y ′] and P [x + a, y + b] 6= P [x′ + a, y ′ + b]. 2. P [x, y] 6= P [x′ , y ′] and P [x + a, y + b] = P [x′ + a, y ′ + b]. If the first condition above is satisfied then the witness is called a type 1 witness, and otherwise it is called a type 2 witness. Given a witness table for P , the duelling stage for parameterized matching is performed using the duelling algorithm of Amir et al. [3]. In Section 4.4 we show how to construct the witness table for P .

4.2

Sweeping stage

As in many pattern matching algorithms, we assume w.l.o.g. that n = 2m. Larger texts can be cut into overlapping pieces of size 2m × 2m which are handled independently, where successive pieces overlap in m − 1 rows or columns. Definition 3 (strip). Let T be an n × n text. The i-th strip of T is the rectangle [1, n] × [i, i + m − 1]. Definition 4 (predecessor). Let T be an n × n text, S a strip of T , and (x, y) a location inside S. The predecessor of (x, y) w.r.t. S is the first location whose character is equal to P [x, y] encountered when traversing S in right-to-left/bottom-to-top order starting from (x, y). 8

a a

a a

a a a Figure 2: An example for the definition of predecessor. Here m = 4 and n = 8, so the first strip of T is the rectangle [1, 8] × [1, 4]. The predecessors (w.r.t. the first strip) of the locations in the first strip that contain the character ‘a’ are shown in the figure ((7, 3) is the predecessor of (8, 2), (6, 1) is the predecessor of (7, 3) etc.). See Figure 2 for an example of the above definitions. A location and its predecessor in T will be called a location-predecessor pair. Recall that a candidate is an m × m substring of T that survived the duelling stage. The start location of a candidate T [x . . x + m − 1, y . . y + m − 1] is the location (x, y). We say that a candidate is contained in the y-th strip if the start location of the candidate is (x, y) for some x. We say that a candidate T [x . . x + m − 1, y . . y + m − 1] contains a pair of locations (x1 , y1 ), (x2 , y2 ) if both (x1 , y1 ) and (x2 , y2 ) are inside the rectangle [x, x + m − 1] × [y, y + m − 1]. Definition 5 (mismatch pair). Let T ′ = T [x . . x + m − 1, y . . y + m − 1] be a candidate. A mismatch pair of T ′ is a pair of locations (x1 , y1 ), (x2 , y2 ) that is contained in T ′ such that T [x1 , y1] = T [x2 , y2 ] and P [x1 − x + 1, y1 − y + 1] 6= P [x2 − x + 1, y2 − y + 1]. We now observe that: Observation 3. For every candidate T ′ = T [x . . x + m − 1, y . . y + m − 1], 1. If P p-matches to T ′ then there are no mismatch pairs of T ′ . 2. If there is no location-predecessor pair (w.r.t. the y-th strip) that is a mismatch pair of T ′ then there is a function matching between P and T ′ . By Observation 3, it is suffices to check for each candidate whether the location-predecessor pairs (w.r.t. to the strip that contains the candidate) that are contained in the candidate are mismatch pairs for the candidate. Clearly, this approach is not efficient (the time complexity for checking one candidate is Θ(m2 ), so the total time to check all candidate is Θ(m4 ) in the worst case). However, we can use the fact that after the duelling stage, the remaining candidates are pairwise consistent. Therefore, for each location-predecessor pair w.r.t. some strip y, the pair is either a mismatch pair for every candidate (contained in the y-strip or in some other strip) that contains the pair or the pair is not a mismatch pair for any candidate. Thus, for each location-predecessor pair w.r.t. some strip y we need to check only 9

two characters in P , and if there is a mismatch we can rule out all the remaining candidates that contain the pair. Our algorithm does not rule out all the candidates that contain the pair. Instead, it rules out the candidates with start location in column y ′ ≥ y. We first need to show how to compute all location-predecessor pairs. This is performed by going over all strips of T from left to right while maintaining the predecessor of every location inside the current strip. Computing the predecessor of every location in the first strip takes O(m2) time. When going from the (y − 1)-th strip to the y-th strip, there are at most 3n = 6m new location-predecessor pairs: (1) Every location in column y + m − 1 and its predecessor form a new pair, (2) Every location in column y + m − 1 may become the predecessor of some location in columns y, . . . , y + m − 2, and (3) Every location in columns y, . . . , y + m − 2 whose predecessor was in column y − 1 will form with its new predecessor a new pair. After a preprocessing step on T that will be described in Section 4.3, we can find all the new location-predecessor pairs in O(m) time. When checking if there are location-predecessor pairs w.r.t. the current strip that are mismatch pairs, we only need to check the location-predecessor pairs which are new (i.e., did not appear in the previous strip), as we already checked the old pairs in previous iterations. Let (x1 , y1), (x2 , y2 ) be some new location-predecessor pair for the y-th strip. As noted above, (x1 , y1 ), (x2 , y2 ) is either a mismatch pair for all the candidates that contain the pair, or for none. Therefore, we only need to find one candidate (x′ , y ′) that contains the pair, with y ′ ≥ y, and check whether P [x1 − x′ + 1, y1 − y ′ + 1] = P [x2 − x′ + 1, y2 − y ′ + 1]. If these two character are not equal, then (x′ , y ′) cannot be a match, and moreover, no candidate that contains the pair (x1 , y1 ), (x2 , y2 ) is a match. In other words, we can rule out every candidate whose start location is inside the rectangle [x1 − m + 1, x2 ] × [y, min(y1 , y2 )]. We need to handle two issues: How to find a candidate that contains the pair (x1 , y1), (x2 , y2), and in the event of a mismatch, how to rule out the candidates in the corresponding rectangle. We first deal with the former issue. Given a location-predecessor pair (x1 , y1 ), (x2 , y2 ), find the candidate which starts at the smallest row that may contain the pair (x1 , y1 ), (x2 , y2 ). This candidate will be denoted (x′ , y ′). More precisely, among all the candidates with start locations inside the rectangle [x1 −m+1, 2m]×[y, min(y1 , y2 )], (x′ , y ′) is the candidate whose start location is in the smallest row (ties are broken arbitrarily). Clearly, if (x′ , y ′) does not contain the pair (x1 , y1 ), (x2 , y2 ) (that is, if x′ > x2 ), then there are no candidates with start location in a column greater than y that contain the pair (x1 , y1 ), (x2 , y2). We now describe how to find the candidate (x′ , y ′) in constant time. To do that, we build a 2m × 2m array A, where A[i, j] is the smallest integer r ≥ i such that (r, j) is a start location of a candidate. If there are no such integers then A[i, j] = 2m. The array A can be easily computed by scanning the text column by column, from bottom to top. Now, given a location-predecessor pair (x1 , y1), (x2 , y2 ), in order to find the candidate (x′ , y ′) we find the minimum value in the subrow A[x1 − m + 1, y.. min(y1 , y2 )]. This is the range minima problem [14], so after preprocessing each row of A in O(m) time per row, we can find the minimum element in some subrow of A in constant time. We now turn to the problem of eliminating candidates. As discussed above, during the algorithm we find rectangles that do not contain a match, and eliminate the candidates

10

with start location inside these rectangles. Instead of eliminating the candidates at the time each rectangle is discovered, we store the rectangle in a list L, and after obtaining all the rectangles we perform a candidates elimination stage. In the candidates elimination stage, we need to remove each candidate whose top-left corner is in some rectangle of L. This is done by moving a vertical sweep line from left to right. We maintain two vectors B and C, where B[i] (resp., C[i]) is the number of rectangles in L that intersect the current sweep line, and whose top (resp., bottom) row is i. Using these vectors, we can compute the number of rectangles in L that contain each location on the sweep line, and eliminate the candidates whose start location is contained in at least one rectangle. Since the number of rectangles in L is at most 6m2 , it follows that the time complexity of this stage is O(m2 ).

4.3

Text Preprocessing

In this section, we show how to compute an array of pointers that will be used to maintain the predecessors w.r.t. the current strip. Definition 6 (left predecessor). For a location (x, y) in T with y > m, the left predecessor of (x, y) is the predecessor of location (x, y) w.r.t. the (y − m + 1)-th strip of T . Given the left predecessors of all the locations in T , we maintain the predecessor of every location w.r.t. to the current strip as follows. The predecessor and successor pointers for the current strip form a collection of doubly linked lists, one per character, which are stored in place in T , i.e., each location holds its predecessor and successor pointers. The left predecessors are used in updating these lists when moving from the (y − 1)-th strip to the y-th strip as follows. First, each location (x, y − 1) in column y − 1, in top-to bottom order say, is removed from its list by connecting its neighbors. Second, in top-to-bottom order, each location (x, y + m − 1) is inserted right after its left predecessor. Now, we show how to compute the left predecessor of each location (x, y) in T with y > m. First, for each character σ we form a list LC σ of candidate left predecessors. This consists of the locations of σ in left-to-right/bottom-to-top order except that in each row only the rightmost location holding σ in columns 1 to m and the locations holding σ in columns m + 1 to 2m are kept. Next LC σ is traversed. We will maintain a second list Lσ which holds, in traversal order, traversed locations (x, y) with y > m for which the left predecessor has not yet been traversed. Our implementation of this process uses the following two properties of the Lσ lists. Lemma 2. If (x, y) and (x′ , y ′) are two locations in Lσ , where (x, y) precedes (x′ , y ′) in Lσ , then x > x′ and y < y ′. Proof. Note that the traversal order is also the scan order of the locations. Now, suppose for a contradiction that either x ≤ x′ or y ≥ y ′. From the fact that (x, y) appears before (x′ , y ′) in the left-to-right/bottom-to-top scan we have that either (1) x = x′ and y < y ′ , or (2) x > x′ and y ≥ y ′ . Since |y − y ′| < m (this follows from the inequalities m < y ≤ 2m and 11

m < y ′ ≤ 2m), in the first case we have that (x, y) is the left predecessor of (x′ , y ′), and in the second case we have that (x′ , y ′ ) is the left predecessor of (x, y). In both cases we have that the list Lσ cannot contain both (x, y) and (x′ , y ′), so we have reached a contradiction. Lemma 3. Suppose that the traversal of LC σ has reached location (x, y). Let (x1 , y1 ), . . . , (xs , ys ) be the locations in Lσ according to their order. The location (x, y) is either the left predecessor of (x1 , y1), . . . , (xf , yf ) for some 1 ≤ f ≤ s, the left predecessor of (xf , yf ), . . . , (xs , ys ) for some 1 < f ≤ s, or the left predecessor of none of these locations. Proof. Consider the following cases: 1. If y > ys or if y ≤ y1 − m then (x, y) is the left predecessor of none of the locations in Lσ . 2. If y1 ≤ y ≤ ys then let f be the minimum integer such that y ≤ yf . Then, (x, y) is the left predecessor of (xf , yf ), . . . , (xs , ys ). 3. If neither case 1 nor 2 occurs, then y1 − m + 1 ≤ y < y1 . Let f be the maximum integer such that yf − m + 1 ≤ y. In this case, (x, y) is the left predecessor of (x1 , y1), . . . , (xf , yf ). Lemma 3 suggests the following algorithm for computing the left predecessors. Suppose that currently Lσ = (x1 , y1 ), (x2 , y2), . . . , (xs , ys ), and that (x, y) is the location being traversed. Then Case 1 y > y1 . Make (x, y) the left predecessor of all of (x1 , y1), . . . , (xf , yf ), where f is the largest index such that yf − y < m. Case 2 y > ym . Add (x, y) to the end of Lσ . Case 3 y1 ≤ y ≤ ym . Make (x, y) the left predecessor of all of (xg , yg ), . . . , (xm , ym), where g is the least index such that y ≤ yg . Clearly, the time complexity for computing the left predecessors is O(m2 ).

4.4

Pattern Preprocessing

In this section we show how to compute the witness table for the pattern. We need to compute a witness for every mismatch offset (a, b). There are two cases to consider: when a ≥ 0 and when a < 0. In the following, √ we will handle the former case. The latter case is symmetrical, and thus omitted. Let l = ⌊ m⌋. We will assume throughout the section that a ≤ m − 3l and b ≤ m − 3l. At the end of this section we will show how to handle mismatch offsets (a, b) with either a > m − 3l or b > m − 3l. The main idea is similar to the algorithm in Section 3: We select neighbors for each location in P and then we align the neighbors in regions of P by solving exact wildcard 12

Figure 3: The subsets of D a,b . matching problems. Since we need to align regions of various sizes, the neighbor selection scheme of Section 3 do not work here. Let (a, b) be a mismatch offset, and let D a,b = [1, m − a] × [1, m − b] be the set of locations in the overlap area when P is aligned against itself with offset (a, b). We partition D a,b into subregions (see Figure 3): D1a,b , D3a,b , D7a,b , and D9a,b are the l × l squares in the corners of D a,b . D2a,b , D4a,b , D6a,b , and D8a,b are the rectangles of width or height l forming the borders of D a,b excluding the corners, and D5a,b is the remaining part of D a,b . Formally, D1a,b = [1, l] × [1, l],

D2a,b = [1, l] × [l + 1, m − b − l],

D3a,b = [1, l] × [m − b − l + 1, m − b],

and so on. We define D a,b [i1 , i2 , . . . , ik ] = Dia,b ∪ Dia,b ∪ · · · ∪ Dia,b . 1 2 k We say that a witness w for (a, b) is non-simple if it satisfies any of the following conditions: 1. w is of type 1, one of the locations of w is in D a,b [1], and the other location is in D a,b [5, 6, 8, 9]. 2. w is of type 2, one of the locations of w is in D a,b [9], and the other location is in D a,b [1, 2, 4, 5]. 3. w is of type 1, one of the locations of w is in D a,b [3, 6] and the other location is in D a,b [7, 8]. 4. w is of type 2, one of the locations of w is in D a,b [2, 3] and the other location is in D a,b [4, 7]. The algorithm comprises three stages: 1. Find simple witnesses. 13

Figure 4: The active rectangles of location (3, 4) in an 8 × 8 pattern, where l = 2. The H(x,y),i rectangles are shown on the left and the V(x,y),i rectangles are shown on the right. 2. Find non-simple witnesses that satisfy Conditions 1 or 2 above. 3. Find non-simple witnesses that satisfy Conditions 3 or 4 above. The three stages of the algorithm are described in the rest of this section. In each stage, we handle only those offsets for which no witness has been found so far. In the following stages, we will describe only how to find witnesses of type 1, as handling the witnesses of type 2 is symmetrical.

Stage 1 For each location (x, y) in P we choose 4⌊ ml ⌋ + 4l − 4 neighbors. Similarly to Section 3, we define rectangles for each location in P , where each rectangle can provide one neighbor (note that here the rectangles are not disjoint). For a location (x, y) we define the following rectangles: For i = −⌊ ml ⌋, . . . , ⌊ ml ⌋ − 1, let and

H(x,y),i = [x + il, x + (i + 1)l − 1] × [1, y − 1] V(x,y),i = [1, x − 1] × [y + il, y + (i + 1)l − 1].

2 See Figure 4 for an example. Furthermore, for j = −l + 1, . . . , l − 1, we define H(x,y),j = 2 [x + j] × [1, y − 1] and V(x,y),j = [1, x − 1] × [y + j]. We say that a rectangle is active if it is contained in [1, m]×[1, m]. For example, H(x,y),i is active if 1 ≤ x+il and x+(i−1)l−1 ≤ m. For every active rectangle of (x, y) we scan the locations inside the rectangle in a direction away from (x, y) namely, the scan order of H(x,y),i is top-to-bottom/right-to-left, and the scan order of V(x,y),i is left-to-right/bottom-to-top. The first occurrence of the character T [x, y] in the rectangle is the neighbor generated by the rectangle. Consider some offset (a, b). We say that (x′ , y ′) is a D a,b -neighbor of (x, y) if (x′ , y ′ ) is a neighbor that is generated by an active rectangle of (x, y) and the corresponding rectangle of (x + a, y + b) is also active (in other words, (x′ , y ′) is a neighbor of (x, y) that is generated

14

by a rectangle that is contained in D a,b ). We say that two locations (x, y) and (x′ , y ′) in D a,b are D a,b -linked if there is a series of locations (z0 , w0) = (x, y), (z1 , w1 ), . . . , (zl , wl ), (zl+1 , wl+1 ) = (x′ , y ′) such that for all i, at least one of the locations (zi , wi) and (zi+1 , wi+1 ) is a D a,b -neighbor of the other location. Lemma 4. Let (a, b) be an offset. Let (x, y) and (x′ , y ′ ) be two locations inside D a,b with P [x, y] = P [x′ , y ′]. Suppose that neither of the following conditions hold • One of the locations is in D a,b [1] and the other location is in D a,b [5, 6, 8, 9]. • One of the locations is in D a,b [3, 6] and the other location is in D a,b [7, 8].

Then, (x, y) and (x′ , y ′) are D a,b -linked.

Proof. We first claim that at least one case among the following cases must occur. Case 1 |y − y ′| < l. Case 2 |x − x′ | < l. Case 3 The topmost location among (x, y) and (x′ , y ′) is in D a,b [2, 5, 8]. Case 4 The leftmost location among (x, y) and (x′ , y ′) is in D a,b [4, 5, 6]. For a contradiction, suppose that Cases 1–4 do not occur. W.l.o.g. we assume that x > x′ , so (x′ , y ′) is the topmost location among (x, y) and (x′ , y ′). Since Case 3 does not occur, we have that either y ′ ≤ l or y ′ ≥ m − b − l + 1. Suppose first that y ′ ≤ l. If y ≤ y ′ then |y − y ′| < l which contradicts the assumption that Case 1 does not occur. Therefore y > y ′ , so (x′ , y ′) is the leftmost location among (x, y) and (x′ , y ′). Since Case 4 does not occur, we have that either x′ ≤ l or x′ ≥ m − a − l + 1. If x′ ≥ m − a − l + 1 then m − a − l + 1 ≤ x′ < x ≤ m − a. Therefore |x − x′ | < l, but this contradicts the assumption that Case 2 does not occur. On the other hand, if x′ ≤ l then (x′ , y ′) ∈ D a,b [1]. As we assumed that Cases 1 and 2 do not occur, (x, y) ∈ / D a,b [1, 2, 3, 4, 7]. Therefore, (x, y) ∈ D a,b [5, 6, 8, 9] which contradicts an assumption of the lemma. From the above we conclude that y ′ ≥ m − b − l + 1. Moreover, y < y ′ otherwise Case 1 occurs. Now, (x, y) is the leftmost location among (x, y) and (x′ , y ′), so it follows that either x ≤ l or x ≥ m − a − l + 1. The former case cannot occur otherwise Case 2 occurs. If the latter case occurs then from the assumption that Cases 1 and 2 do not occur we obtain that (x′ , y ′) ∈ D a,b [3, 6] and (x, y) ∈ D a,b [7, 8] which contradicts an assumption of the lemma. Therefore, at least one case among Cases 1-4 must occur. We now show that the first part of the lemma holds when Case 1 occurs and when Case 3 occurs. The proof for Cases 2 and 4 is symmetric. W.l.o.g. we assume that x ≥ x′ . Case 1 First, if y = y ′ then (x, y) and (x′ , y ′) are D a,b -linked by a series of locations on 2 column y (due to the V·,0 rectangles). Moreover, if 0 < |y − y ′| < l then we have that ′ ′ 2 2 a,b (x , y ) ∈ V(x,y),y′ −y , so the rectangle V(x,y),y -neighbor (x′′ , y ′) for (x, y). ′ −y generates a D From the above, (x′′ , y ′) is D a,b -linked to (x′ , y ′) and therefore (x, y) and (x′ , y ′) are D a,b linked. Similarly, (x, y) and (x′ , y ′) are D a,b -linked if x − x′ < l. 15

Case 3 We prove the lemma for Case 3 using induction on x−x′ . The base of the induction (when x = x′ ) was already shown. Suppose now that x > x′ . Let i be the integer such that (x′ , y ′) ∈ V(x,y),i . Since l+1 ≤ y ′ ≤ m−b−l it follows that the rectangle V(x,y),i is contained in D a,b . Therefore, (x, y) has a D a,b -neighbor (x′′ , y ′′) with x < x′′ ≤ x′ . By induction (x′′ , y ′′) and (x′ , y ′) are D a,b -linked, and therefore (x, y) and (x′ , y ′) are also D a,b -linked. We now define strings P1 and P2 . The string P1 is obtained from P by replacing each character P [x, y] in P with L = 4⌊m/l⌋ + 4l − 4 characters that encode the locations of the neighbors of (x, y) relative to (x, y). If the rectangle does not yield a neighbor then the character for this rectangle is the wildcard character φ. The string P2 is built in two steps. The first step is the same as for building P1 , except that we use 0 for rectangles that yield no neighbors. After the first step, P2 is an m × mL string. In the second step we expand P2 into a 2m×2mL string by padding with φ characters. After building P1 and P2 we solve the exact wildcard matching problem on P1 and P2 , where P1 is the pattern and P2 is the text. Moreover, using the algorithm of Alon and Naor [1] we find a witness for every mismatch between P1 and P2 , namely, for every offset (a, b) such that P1 does not match P2 [a + 1 . . a + m, b + 1 . . b + m] we find a location (x, y) such that P1 [x, y] 6= P2 [x + a, y + b]. By Lemma 4 we have that • If P1 does not match P2′ = P2 [a + 1 . . a + m, b + 1 . . b + m] then (a, b) is a mismatch offset. Moreover, from a witness to the mismatch of P1 and P2′ we can obtain a type 1 witness for (a, b) in constant time. • If (a, b) has a simple type 1 witness then P1 does not match P2′ . The bottleneck in the time complexity of Stage 1 is the time for computing the mismatch witnesses using the algorithm of Alon and Naor. The time complexity of this step is O(|P2| · 3 polylog(m)) = O(( ml + l) · polylog(m)) = O(m5/2 · polylog(m)).

Stage 2 Recall that the goal of Stage 2 is to find type 1 witnesses w for mismatch offsets (a, b) such that one location of w is in D a,b [1] and the other location is in D a,b [5, 6, 8, 9] = [l + 1, m − a] × [l + 1, m − b]. Stage 2 is composed of three substages. Stage 2a In this stage we find witnesses w for mismatch offsets (a, b) such that one location of w is in D a,b [1] and the other location is in [l + 1, m − a] × [l + 1, m − b − 3l]. Recall that in Stage 1 we created one instance of exact wildcard matching and used this instance to generate witnesses for all offsets. In this stage we create O(m2 /l) instances. The main idea is to group offsets into sets and build an instance for each set. In each set, all the offsets are close to each other, and we use this fact in the construction. 16

Let (a, b′ ) be an offset where b′ is a multiple of l. For a location (x, y) we define the rectangle a,b′ = [l + 1, m − a] × [y + 1, y + m − b′ − 3l]. H(x,y) From each rectangle we generate a neighbor by scanning the rectangle in top-to-bottom/rightto-left order and selecting the first occurrence of P [x, y] (if there are any). Lemma 5. Let (a, b) be an offset and let b′ = l⌊b/l⌋. Let (x, y) ∈ D a,b [1] and (x′ , y ′) ∈ [l + 1, m − a] × [l + 1, m − b − 3l] be two locations with P [x, y] = P [x′ , y ′]. Then, (x, y) has a a,b′ neighbor (x′′ , y ′′) in the rectangle H(x,y) and (x′′ , y ′′) is D a,b -linked to (x′ , y ′). ′

′

a,b a,b generates a . Thus, the rectangle H(x,y) Proof. It is easy to verify that (x′ , y ′ ) ∈ H(x,y) ′′ ′′ ′′ ′ neighbor; let (x , y ) denote that neighbor. Due to the scan order, y ≥ y ≥ l +1. Moreover, y ′′ ≤ y + m − b′ − 3l ≤ m − b − l − 1 and l + 1 ≤ x′′ ≤ m − a. Therefore, (x′′ , y ′′ ) ∈ D a,b [5, 8]. We also have (x′ , y ′) ∈ D a,b [5, 8]. By Lemma 4, (x′ , y ′) and (x′′ , y ′′ ) are D a,b -linked.

We now describe how to find witnesses by building appropriate exact matching instances.′ ′ ′ For every a and b′ where b′ is a multiple of l we build strings P1a,b and P2a,b . The string P1a,b ′ a,b′ relative is an l × l string, where P1a,b [x, y] is the location of the neighbor of (x, y) in H(x,y) ′

′

to (x, y) (if the rectangle does not yield a neighbor then P1a,b [x, y] = φ). To build P2a,b we take the string P ′ = P [a + 1 . . m, b′ + 1 . . m]. For each location (x, y) ∈ [1, l] × [1, 2l − 1] in ′ a,b′ w.r.t. P ′. The string P2a,b is an P ′ , we compute the neighbor of (x, y) in the rectangle H(x,y) l × (2l − 1) string consisting of the encodings of these neighbors (when a rectangle does not ′ ′ ′ yield a neighbor the corresponding character in P2a,b is 0). For each P1a,b and P2a,b we solve the exact wildcard matching problem on these strings and find witnesses to the mismatches. From these witnesses we obtain witnesses for the mismatch offsets. Stage 2b This stage finds witnesses w for mismatch offsets (a, b) such that one location of w is in D a,b [1] and the other location is in [l + 1, m − a − 3l] × [l + 1, m − b]. This stage is analogous to Stage 2b and we omit the details. Stage 2c In this stage we find witnesses w for offsets (a, b) such that one location of w is in D a,b [1], and the other location is in [m − a − 3l + 1, m − a] × [m − b − 3l + 1, m − b]. Let a′ be a multiple of l. For a location (x, y) and i = 1, . . . , 5l − 2 we define the rectangle ′

a H(x,y),i = [x − m + a′ + i] × [1, y − 1].

As before, each rectangle may generate a neighbor. The rectangles are scanned in right-to-left order.

17

Lemma 6. Let (a, b) be an offset and let a′ = l⌊a/l⌋. Let (x, y) ∈ [m − a − 3l + 1, m − a] × [m − b − 3l + 1, m − b] and (x′ , y ′) ∈ D a,b [1] be two locations with P [x, y] = P [x′ , y ′]. Then, a′ there is an index 1 ≤ i ≤ 5l − 2 such that (x, y) has a neighbor (x′ , y ′′) in H(x,y),i and (x′ , y ′′) is D a,b -linked to (x′ , y ′). Proof. Let i = x′ −x+ m−a′ . Since m−a−3l + 1 ≤ x ≤ m−a and 1 ≤ x′ ≤ l we have that i ≥ 1 − (m − a) + m − a′ ≥ 1 and i ≤ l − (m − a − 3l + 1) + m − a′ = 4l − 1 + a − a′ ≤ 5l − 2. a′ Clearly, (x′ , y ′) ∈ H(x,y),i and the lemma follows. ′

′

′

For every a′ which is a multiple of l we build strings P1a and P2a . The string P1a is a string of size (4l − 1) × (m · (5l − 2)) that contains encodings of neighbors for locations ′ (x, y) ∈ [m − a′ − 4l + 2, m − a′ ] × [1, m]. The string P2a contains encodings of neighbors for ′ locations (x, y) ∈ [m − 3l + 1, m] × [m − 3l + 1, m]. P2a is also padded with wildcards so that ′ ′ its size is (5l − 2) × (2m · (5l − 2)). For each pair of strings P1a and P2a we solve the exact matching with wildcard problem and find witnesses to the mismatches. We now compute the time complexity of Stage 2. In Stage 2a and Stage 2b we construct O(m2 /l) pairs of strings each of size O(l2 ). In Stage 2c we construct O(m/l) pairs of strings each of size O(ml). Therefore, the time complexity of Stage 2 is O(m2 l · polylog(m)) = O(m5/2 · polylog(m)).

Stage 3 Again we consider several sub-stages. Stage 3a In this stage we find witnesses w for mismatch offsets (a, b) such that one location of w is in D a,b [3], and the other location is in D a,b [7, 8]. This stage is similar to Stage 2c. Let a′ be a multiple of l. For a location (x, y) and i = 1, . . . , 3l − 2 we define the rectangle ′

a ,2 H(x,y),i = [x + m − a′ − i] × [1, y − 1]. ′

′

The scan order of the rectangles is right-to-left order. We build strings P1a ,2 and P2a ,2 as ′ follows: The string P1a is a string of size l ×(m·(3l −2)) that contains encodings of neighbors ′ for locations (x, y) ∈ [1, l] × [1, m]. The string P2a is a string of size (2l − 1) × (2m) that contains encodings of neighbors for locations (x, y) ∈ [a′ + 1, a′ + 2l − 1] × [m − l + 1, m], and it is padded with wildcards. Stage 3b In this stage we find witnesses w for mismatch offsets (a, b) such that one location of w is in D a,b [7], and the other location is in D a,b [3, 6]. The stage is similar to Stage 3a and we omit the details. 18

Stage 3c Let (a, b) be some mismatch offset. For a rectangle D ⊆ D a,b define D+(a, b) = {(x+a, x+b) : (x, y) ∈ D}. We say that a rectangle D ⊆ D a,b is a mismatch rectangle if the number of distinct characters inside the region D of P is strictly less than the number of distinct characters inside the region D + (a, b) of P . A mismatch rectangle D is minimal if there is no mismatch rectangle that is contained in D. If no witness for (a, b) was found during Stage 1 or Stage 2, then we have that for every type 1 witness w.r.t. (a, b), one location of the witness is in D a,b [3, 6] and the other location is in D a,b [7, 8]. If additionally no witness for (a, b) was found during Stage 3a or Stage 3b, then for every type 1 witness w.r.t. (a, b), one location of the witness is in D a,b [6] and the other location is in D a,b [8]. Symmetrically, we have that for every type 2 witness w.r.t. (a, b), one location of the witness is in D a,b [2] and the other location is in D a,b [4]. We now show how to a find a witness for such an offset. Note that there are no type 2 witnesses w.r.t. (a, b) with both locations inside D a,b [5, 6, 8, 9]. This yields the following observations: Observation 4. For every rectangle D ⊆ D a,b [5, 6, 8, 9], the number of distinct characters inside the region D of P is less than or equal to the number of distinct characters inside the region D + (a, b) of P with equality if and only if there is no witness w for (a, b) for which both locations are in D. Proof. Since there are no type 2 witnesses w.r.t. (a, b) with both locations inside D a,b [5, 6, 8, 9], there is a function matching between the substrings of P defined by the region D + (a, b) and the region D. Therefore, the first part of the observation holds. The second part of the observations follows from Observation 1. Observation 5. Let D ⊆ D a,b [5, 6, 8, 9] be a minimal mismatch rectangle. Then, the locations at the bottom left corner and the top right corner of D form a witness w.r.t. (a, b). Proof. By Observation 4 there is a witness w for (a, b) for which both locations are in D. From the minimality of D the two locations of w must be at opposite corners of D. Since one location of w is in D a,b [6] and the other location is in D a,b [8], it follows that the locations of w are at the bottom left corner and top right corner of D. We do not know how to efficiently find a minimal mismatch rectangle. Instead, we will find a rectangle D such that the coordinates of the corners of D are close to those of some minimal mismatch rectangle. The corners of D will not necessarily give a witness w.r.t. (a, b), so an additional search for the witness will be required. Consider the following intersection counting problem: Given a set S of points in the plane where each point has a color, build a data-structure for S that can answer queries of the form “what is the number of distinct colors for the points inside the rectangle {(x, y) ∈ R2 : x ≤ b, c ≤ y ≤ d}?”. Denote by n the number of points in S. Gupta et al. [13] showed a data-structure for this problem with preprocessing time O(n log2 n) that answers queries in O(log2 n) time. Using this result, we obtain the following lemma:

19

Lemma 7. Let P be an m × m string, and let l be an integer. After preprocessing of P 3 in O( ml log2 m) time, the following queries can be answered in O(log2 m) time: “what is the number of distinct characters in the substring P [a . . b, c . . d]?”, where a, b, c, and d are integers with either b = m or a is a multiple of l. Proof. The preprocessing is as follows: Create a set of points S containing a point (x, y) for every x = 1, . . . , m and y = 1, . . . , m. The color of (x, y) is P [x, y]. For every integer i ≤ m/l, build the data structure of Gupta et al. on the set of points Si = {(x, y) ∈ S : x ≥ il}. Also build a data structure on S ′ = {(−x, y) : (x, y) ∈ S}. Given a query (a, b, c, d), if a is a multiple of l, return the number of distinct colors in the points of Sa/l that are inside the rectangle {(x, y) ∈ R2 : x ≤ b, c ≤ y ≤ d}. If b is equal to m, return the number of distinct colors in the points of S ′ that are inside the rectangle {(x, y) ∈ R2 : x ≤ −a, c ≤ y ≤ d}. Assume that we have the data-structure of Lemma 7 on P without its first row. We now define a mismatch rectangle D ⊆ D a,b [5, 6, 8, 9] by starting with D = D a,b [5, 6, 8, 9] and then one by one moving the left, right, and top edges of D to the right, left, and down, respectively, as long as the rectangle remains a mismatch rectangle. More precisely, we construct D as follows. 1. We find the maximum y such that the rectangle [l + 1, m − a] × [y, m − b] is a mismatch rectangle. Note that from Observation 4, [l + 1, m − a] × [y ′, m − b] is a mismatch rectangle for every y ′ < y. Therefore, we can find the value of y using binary search and queries to the data-structure of Lemma 7. 2. We find the minimum z such that the rectangle [l + 1, m − a] × [y, z] is a mismatch rectangle. 3. We find the maximum x such that x − 1 is a multiple of l and [x, m − a] × [y, z] is a mismatch rectangle. Let D = [x, m − a] × [y, z]. From Observation 4 and the definition of D we obtain that there is a type 1 witness w.r.t. (a, b) with one endpoint in [m−a−l+1, m−a]×[y] and the other endpoint in [x, x+l−1]×[z]. We find such a witness as follows. 1. Let V [1 . . |Σ|] and L[1 . . |Σ|] be arrays containing zeros. 2. For x′ = m − a − l + 1, . . . , m − a, V [P [x′ , y]] ← P [x′ + a, y + b] and L[P [x′ , y]] ← x′ . 3. For x′ = x, . . . , x + l − 1, if V [P [x′ , z]] ∈ / {0, P [x′ + a, z + b]} then output the witness ′ ′ (L[P [x , z]], y), (x , z) and stop. 3

By Lemma 7, the time complexity of this stage is O( ml log2 m + m2 log3 m + m2 l) = O(m5/2 log2 m).

20

4.5

Handling special offsets

We now describe how to handle a mismatch offset (a, b) with a > m − 3l. For each location (x, y) in P we define the following rectangles: For i = −3l+2, . . . , 3l−2, let H(x,y),i = [x+i]× ˆ (x,y),i = [x + i] × [y + 1, m]. For each location we choose neighbors using these [1, y − 1] and H rectangles and then construct an exact wildcard matching problem as described in Stage 1. The size of the strings in this problem is O(m2 l), so this stage takes O(m5/2 · polylog(m)) time. Handling mismatch offsets (a, b) with b > m − 3l is similar. As each stage of the pattern preprocessing takes O(m5/2 · polylog(m)) time, we conclude that the total time complexity for preprocessing the pattern is O(m5/2 · polylog(m)).

5

Substrings character counting

In this section we show an O(n2 ) time algorithm that counts the number of distinct characters in every m × m substring of an n × n string T . W.l.o.g. we assume that n = 2m. For each location (x, y) in T we associate a rectangle R(x,y) = [max(1, x−m+1), min(m+ 1, x)] × [max(1, y − m + 1), min(m + 1, y)]. The rectangle R(x,y) is the rectangle of all start locations of the m × m substrings of T that contain location (x, y). For a character c ∈ Σ, Sc is the set of all rectangles R(x,y) such that T [x, y] = c. The counting algorithm is similar to the algorithm of [4]. The algorithm moves a strip (see Definition 3) along T and computes the number of distinct characters in every m × m substring of T contained in the current strip. We use the fact that a character c ∈ Σ appears in a substring T [x . . x + m − 1, y . . y + m − 1] if and only if (x, y) ∈ ∪R∈Sc R. Therefore, when processing the y-th strip, the algorithm computes the intersection of ∪R∈Sc R with column y for all c ∈ Σ. These intersections are obtained from the intersections of ∪R∈Sc R with column y − 1 for all c ∈ Σ. Let S be a set of rectangles. The lower boundary (resp., upper boundary) of S is the set of all locations (x, y) such that (x, y) ∈ ∪R∈S R and (x + 1, y) ∈ / ∪R∈S R (resp., (x − 1, y) ∈ / ∪R∈S R). The boundary of S is the union of the lower and upper boundaries of S. A left upper boundary location (resp., right upper boundary location) of S is a location (x, y) that is inside the upper boundary of S and (x, y − 1) (resp., (x, y + 1)) is not inside the upper boundary. We define left and right lower boundary locations similarly. The algorithm consists of two stages. In the first stage we compute the boundary of Sc for every c ∈ Σ (more precisely, we compute the left and right boundary locations of Sc ). The second stage uses this information to compute the number of distinct characters in each m × m substring of T .

5.1

Computing the boundary

In this section we show how to compute the boundary for some set Sc . For a rectangle R = [x1 , x2 ] × [y1 , y2] let top(R) = x1 , bottom(R) = x2 , left(R) = y1 , and right(R) = y2 . From the assumption that n = 2m we have that for every rectangle R ∈ Sc 21

either left(R) = 1 or right(R) = m + 1. Furthermore, for every R ∈ Sc either top(R) = 1 or bottom(R) = m + 1. The algorithm consists of two steps: First, we compute the boundaries of Sc1 and Sc2 , where Sc1 is the set of all rectangles R ∈ Sc such that top(R) = 1, and Sc2 = Sc \ Sc1 . In the second stage we merge the two boundaries and obtain the boundary of Sc . We now describe how to compute the boundary of Sc1 . Computing the boundary of Sc2 is symmetrical. We assume that there is at least one rectangle with left(R) = 1 and one rectangle with right(R) = m + 1 (handling the case when there are rectangles of only one type is simpler). From each group of rectangles R ∈ Sc1 with left(R) = 1 and the same value of right(R), we pick the rectangle that has the largest value of bottom(R). Let S1 , . . . , Sl be the chosen rectangles, ordered so that right(S1 ) < right(S2 ) < · · · < right(Sl ). We build a set S1 by choosing each rectangle Si such that bottom(Si ) > bottom(Sj ) for all j > i. Similarly, from each group of rectangles R ∈ Sc1 with right(R) = m+1 and the same value of left(R), we pick the rectangle that has the largest value of bottom(R). Let S1′ , . . . , Sl′′ be the chosen rectangles, ordered such that left(S1′ ) > left(S2′ ) > · · · > left(Sl′′ ). Define S2 to be the set containing every rectangle Si′ for which bottom(Si′ ) > bottom(Sj′ ) for all j > i. Now, let S1 = {R1 , . . . , Rk } and S2 = {R1′ , . . . , Rk′ ′ } where right(R1 ) < right(R2 ) < · · · < right(Rk ) and left(R1′ ) > left(R2′ ) > · · · > left(Rk′ ′ ). If right(Rk ) < left(Rk′ ′ ) then the left upper boundary locations of Sc1 are (1, 1), (1, left(Rk′ ′ )),

the right upper boundary locations are (1, right(Rk )), (1, m + 1), the left lower boundary locations are (bottom(R1 ), 1), (bottom(R2 ), right(R1 ) + 1), . . . , (bottom(Rk ), right(Rk−1 ) + 1), (bottom(Rk′ ′ ), left(Rk′ ′ )), . . . , (bottom(R1′ ), left(R1′ )) and the right lower boundary locations are (bottom(R1 ), right(R1 )), . . . , (bottom(Rk ), right(Rk )), (bottom(Rk′ ′ ), left(Rk′ ′ −1 ) − 1), . . . , (bottom(R2′ ), left(R1′ ) − 1), (bottom(R1′ ), m + 1). See Figure 5(a) for an example. Now suppose that right(Rk ) ≥ left(Rk′ ′ ). Start with i = 1 and j = 1. While right(Ri ) < ′ left(Rj′ ), if j = k ′ or bottom(Ri+1 ) ≥ bottom(Rj+1 ) then increase i by 1, and otherwise increase j by 1. When this process finishes, we have right(Ri ) ≥ left(Rj′ ). There are several case that we need to consider. In the first case assume that right(Ri ) > left(Rj′ ) and bottom(Ri ) < bottom(Rj ). Then the left upper boundary locations of Sc1 are (1, 1), the right upper boundary locations are (1, m + 1), the left lower boundary locations are (bottom(R1 ), 1), (bottom(R2 ), right(R1 ) + 1), . . . , (bottom(Ri ), right(Ri−1 ) + 1), ′ ′ (bottom(Rj′ ), right(Ri ) + 1), (bottom(Rj−1 ), left(Rj−1 )), . . . , (bottom(R1′ ), left(R1′ )) 22

(a)

(b)

Figure 5: An example of the boundary computation for Sc1 . Figure (a) shows the case right(Rk ) < left(Rk′ ′ ) and figure (b) shows the case right(Rk ) > left(Rk′ ′ ). In each figure, the left boundary locations are shown as black circles, and the right boundary locations are shown as white circles. and the right lower boundary locations are (bottom(R1 ), right(R1 )), . . . , (bottom(Ri ), right(Ri )), ′ (bottom(Rj′ ), left(Rj−1 ) − 1), . . . , (bottom(R2′ ), left(R1′ ) − 1), (bottom(R1′ ), m + 1). See Figure 5(b) for an example. The other cases are similar. We now describe how to merge the boundaries of Sc1 and Sc2 . We first define x1 (y) (resp., x2 (y)) to be the integer x such that (x, y) is a location on the lower boundary of Sc1 (resp., on the upper boundary of Sc2 ). If there is no such location then x1 (y) is undefined. To merge the boundaries of Sc1 and Sc2 , we process the left and right boundary locations of Sc1 and Sc2 from left to right. Suppose that we process some location (x, y). There are two possible cases. The first case is when either x1 (y) < x2 (y) − 1, or at least one of x1 (y) and x2 (y) is undefined. The second case is when both x1 (y) and x2 (y) are defined and x1 (y) ≥ x2 (y) − 1. In the former case (x, y) is also a left or right boundary location of Sc , while in the latter case it is not. The columns in which there is a change between the two cases above generate additional boundary locations of Sc . More precisely, suppose for example that x1 (y−1) < x2 (y−1)−1 and x1 (y) ≥ x2 (y)−1. This means that either (x1 (y), y) is a right lower boundary location of Sc1 , or (x2 (y), y) is a right upper boundary location of Sc2 , or both. Suppose that the first case occurs. We then have that (x2 (y − 1), y − 1) is a left upper boundary location of Sc that is not a boundary location of Sc1 and Sc2 . The time complexity for computing the boundary of Sc is O(|Sc |). Therefore, computing all the boundaries takes O(n2 ) time.

5.2

Counting the characters

For every c ∈ Σ, let LUc (resp., RUc ) be the set of left (resp., right) upper boundary locations of Sc . Let LLc (resp., RLc ) be the set of left (resp., right) lower boundary locations of Sc . 23

P The algorithm moves a strip along T and maintains a vector D such that xi=1 D[i] gives the number of distinct characters in the m × m substring that starts at row x in the current strip. For the 1st strip, the vector D is constructed as follows: For every c ∈ Σ and every pair (x, 1) ∈ LUc , add 1 to D[x] (note that the pair (x, 1) can appear in several sets LUc when x = 1). For every c ∈ Σ and every pair (x, 1) ∈ LLc , subtract 1 from D[x + 1] When moving from the (y − 1)-th strip to the y-th strip we update D as follows. For every c ∈ Σ: • For every pair (x, y) ∈ LUc , add 1 to D[x]. • For every pair (x, y) ∈ LLc , subtract 1 from D[x + 1]. • For every pair (x, y − 1) ∈ RUc , subtract 1 from D[x]. • For every pair (x, y − 1) ∈ RLc , add 1 to D[x + 1].

References [1] N. Alon and M. Naor. Derandomization, witnesses for boolean matrix multiplication and construction of perfect hash functions. Algorithmica, 16:434–449, 1996. [2] A. Amir, Y. Aumann, M. Lewenstein, and E. Porat. Function matching. SIAM J. on Computing, 35(5):1007–1022, 2006. [3] A. Amir, G. Benson, and M. Farach. An alphabet independent approach to two dimensional pattern matching. SIAM J. on Computing, 23(2):313–323, 1994. [4] A. Amir, K. W. Church, and E. Dar. The submatrices character count problem: an efficient solution using separable values. Information and Computation, 190(1):100–116, 2004. [5] A. Amir, M. Farach, and S. Muthukrishnan. Alphabet dependence in parameterized matching. Information Processing Letters, 49(3):111–115, 1994. [6] A. Apostolico, P. L. Erd¨os, and M. Lewenstein. Parameterized matching with mismatches. J. of Discrete Algorithms, 5(1):135–140, 2007. [7] G. P. Babu, B. M. Mehtre, and M. S. Kankanhalli. Color indexing for efficient image retrieval. Multimedia Tools and Applications, 1(4):327–348, 1995. [8] B. S. Baker. Parameterized string pattern matching: Algorithms and applications. J. of Computer and Systems Sciences, 52(1):28–42, 1996. [9] B. S. Baker. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM J. on Computing, 26(5):1343–1362, 1997. 24

[10] P. Clifford and R. Clifford. Simple deterministic wildcard matching. Information Processing Letters, 101(2):53–54, 2007. [11] R. Cole and R. Hariharan. Verifying candidate matches in sparse and wildcard matching. In Proc. 34th ACM Symposium on Theory Of Computing (STOC), pages 592–601, 2002. [12] R. Cole and R. Hariharan. Faster suffix tree construction with missing suffix links. SIAM J. on Computing, 33(1):26–42, 2003. [13] P. Gupta, R. Janardan, and M. Smid. Further results on generalized intersection searching problems: Counting, reporting, and dynamization. J. of Algorithms, 19(2):282–317, 1995. [14] D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. on Computing, 13(2):338–355, 1984. [15] C. Hazay, M. Lewenstein, and D. Sokol. Approximate parameterized matching. ACM Transactions on Algorithms, 3(3), 2007. [16] S. R. Kosaraju. Faster algorithms for the construction of parameterized suffix trees. In Proc. 36th Symposium on Foundation of Computer Science (FOCS), pages 631–637, 1995. [17] M. Swain and D. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11–32, 1991. [18] U. Vishkin. Optimal parallel pattern matching in strings. In Proc. 12th International Colloquium on Automata, Languages and Programming (ICALP), pages 91–113, 1985. [19] U. Vishkin. Deterministic sampling — a new technique for fast pattern matching. SIAM J. on Computing, 20:303–314, 1991.

25

Lihat lebih banyak...

Two Dimensional Parameterized Matching

Descrição do Produto

Comentários