A practical approach to type-sensitive parsing

August 9, 2017 | Author: Ken Sailor | Categories: Cognitive Science, Computer Software, Computer Languages, Type Inference

A PRACTICAL APPROACH TO TYPE-SENSITIVE PARSING

Comput. Lang. Vol. 20, No. 2, pp. 101-116, 1994. Copyright © 1994 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0096-0551/94 $7.00 + 0.00

KEN SAILOR and CARL McCROSKY†
Department of Computational Science, University of Saskatchewan, Saskatoon, Canada S7N 0W0

(Received 4 November 1993; revision received 25 April 1994)

Abstract--Type-sensitive parsing of expressions is context-sensitive parsing based on type. Previous research reported a general class of algorithms for type-sensitive parsing. Unfortunately, these algorithms are impractical for languages that infer types. This paper describes a related algorithm which is much more efficient--its incremental cost is linear (with a small constant) in the length of the expression, even when types must be deduced. Our method can be applied to any statically typed language, solving a variety of problems associated with conventional parsing techniques, including problems with operator precedence and the interaction between infix operators and higher-order functions.

Keywords: context-sensitive parsing, type-sensitive parsing, type inference, programming language syntax

1. INTRODUCTION

Human interpretation and generation of language are guided by context. Although "meaning" in natural languages is not well-defined, meaning is the context that guides our parsing of natural languages. If one were to say "time flies like a racehorse" in a conversation about how quickly the term has ended, the phrase is most likely interpreted as an appropriate metaphor of how quickly time seems to pass. In this case, "time" is a noun and "flies" is a verb. It is also possible that the speaker might be engaged in a discussion of how to determine the speed insects travel. In this case, the sentence is a suggestion of how this speed may be determined, and the word "time" is the verb while "flies" is the noun. A not inconceivable possibility is that the phrase concerns the preferences of a particular kind of fly. These flies prefer racehorses. In this case, "time" is an adjective while "flies" is a noun. Each of the interpretations is valid. The only way to determine the meaning of the sentence is to consider it in context, or to guess a likely interpretation in the absence of context.

Although natural language processing remains an unsolved computational problem, humans perform natural language processing with ease. It is not generally considered a challenge to converse, even though the rules of conversation are not explicit, nor is it understood how we generate or interpret utterances. Considering that programming languages are a type of formal language with simple syntax and well-defined semantics, it is curious that many students have difficulty writing syntactically correct programs. Given the best programming languages and the best programmers, programs are notoriously hard to understand. Programs are annotated with natural language comments to facilitate understanding, even though the language of those comments is not syntactically simple nor is its semantics well-defined.
In light of these observations, we question the current belief that a simpler syntax is a better syntax. It is certainly true that there are bad complicated syntaxes, but a complicated syntax is not in itself a detriment to the usability of a programming language. When we describe a syntax as simple, we mean that it has a context-free grammar, and a short one at that. A programming language's syntax is generally considered good if it is simple in this sense and bad otherwise. Further, it is usually a routine task to construct a context-free grammar that links program text and program meaning with a parse tree. In this way, context-free grammars join the syntax and semantics of a language. Although there are benefits to the context-free approach, the fact remains that humans have a talent for parsing far in excess of simple grammars. We believe there is value in exploring context-sensitive parsing techniques for programming languages. Although this may result in more computationally complicated methods, it is possible

†Author for correspondence.


that these methods will allow syntaxes that are more easily generated and understood by humans. If there is a correspondence between a context-sensitive method and the way we naturally generate language, then such a method might make it easier to read and write programs. Anyone who has ever taught introductory programming knows that the rules of context-free parsing are a source of frustration for many beginning programmers. Consider the Pascal expression b or x < y and c, where b and c are boolean variables and x and y are integer variables. The choice of precedence in Pascal makes the boolean operators bind tighter than the relational operators, forcing a type error when or is applied to b and x. How many times must instructors explain that although the phrase makes sense (to us) and has only one valid interpretation, the parser ignores meaning when determining structure!

What kind of context-sensitive parsing is possible, and what benefits would such methods have? In a previous paper [1], we show that context-sensitive parsing of expressions is possible in programming languages where types are supplied by the programmer. The benefits of this method are a simpler and type-safe form of precedence (precedence will never force a type error) and a reduced need for disambiguating tokens like parentheses. For this technique, expression type is used as an approximation of meaning to guide the parsing of expressions. The cost of the method is O(n²), where n is the number of elements in an expression, when types are supplied by the programmer, and exponential in n when types must be inferred. In this paper, we report a new approach that yields a practical context-sensitive algorithm within type inference. This algorithm infers types while parsing expressions at no increase in the cost of type inference. In a language including Hindley-Milner type inference, this algorithm makes context-sensitive parsing available at no additional cost.
In a language with restricted polymorphism and programmer-supplied types, the new method reduces the complexity of context-sensitive expression parsing to O(n). Like the method reported in Ref. [1], our new method provides type-safe and simpler precedence and a reduced need for disambiguating syntactic tokens.

The new algorithm, like the old, uses type as an approximation of expression meaning. Given an expression consisting of a list of elements, various groupings of elements are tested to determine if they form validly typed function applications. When a grouping satisfies the typing restriction, one element of the grouping is assumed to be a function and the other elements arguments to that function. This grouping becomes a single element in the original expression, and new groupings are tested until one element remains in the expression being processed. When one element remains, the expression has been completely parsed. The new algorithm is based on a pushdown automaton. It never considers more than the first two elements of an expression and the top of the stack to determine the next grouping. This restricted lookahead (and lookbehind) is the source of the efficiency gain over the previous algorithm.

Section 2 gives motivation for and an overview of type-sensitive parsing. Section 3 provides preliminary definitions. Section 4 specifies the algorithm. Section 5 gives example parses of the algorithm. Section 6 discusses the time complexity of the algorithm and relates practical experience with our implementation. Possible extensions are discussed in Section 7. Section 8 presents our conclusions.

2. LET TYPING BE THE GUIDE

Our interest in context-sensitive parsing grew out of the inability of context-free grammars to parse expressions obvious to us. We use four simple examples as an introduction to parsing using type context.

2.1. Interpreting "sin cos x"

Prefix function application is typically a tightly binding form of function application in programming languages. Languages like Pascal make prefix function application explicit through


the use of parentheses and commas: sin(cos(x)). Languages like Haskell [2] and Miranda† [3] do not require arguments to be gathered in parenthesized lists, but treat function application as left-associative. In these languages sin cos x is a typing error, since left-associative function application would apply sin to cos and apply the result to x: an error because sin is a function of a number, not another function. The correct form of the sample expression for these languages would be sin (cos x). Regardless of the presence or absence of parentheses, the example expression has only one reasonable interpretation: that cos is applied to x before sin is applied to the result. This meaning can be determined from the types of the expression elements involved.

2.2. Interpreting "true or 4 < 5 and false"

In this second example, the weakness of standard precedence is exposed. Precedence determines a binding hierarchy of operator symbols used to determine the parse of infix operator applications. If the relative binding of < is less than that of or (as it is in Pascal), then the expression is an error because these precedences force the application of or to true and 4. This problem is somewhat mollified by more clever choices of precedence to ease the construction of standard phrases, but this cleverness results in complicated precedence hierarchies that many programmers choose not to learn. A common response to the precedence hierarchy of the C programming language is to ignore precedence and parenthesize expressions completely (just to be on the safe side), resulting in the unreadable expressions precedence is meant to eliminate. Allowing the context of type to guide parsing, we can introduce two rules: that infix operators may only be applied to correctly typed arguments; and that precedence is only considered among homogeneously typed function applications. These two rules collapse complicated hierarchies into simple ones.
For example, it is only necessary to know the relative precedences of the boolean operators instead of understanding how the booleans compare to all other operators. The simple Pascal scheme of assigning and a multiplicative precedence and or an additive precedence is enough to guide parsing without stumbling over simple examples like the one above. In the example expression, the relational operator must be applied to the two numbers first. Once this application has been formed, the relative precedence of the boolean operators can be considered to complete the parse.

2.3. Interpreting "3 + 4" and "reduce + 0"

In higher-order programming languages, operators can be arguments to functions as well as functions that are applied infix to arguments. If + is addition, the single sensible meaning of the first expression above is the application of + to 3 and 4. If reduce (foldr in Miranda or Haskell) is the standard higher-order function that applies a binary function between every element in a list and an identity value, then the single sensible meaning of the second expression is the application of reduce to + and 0 (the standard definition of the function that sums a list of numbers). It is impossible for a context-free parser to distinguish the two expressions (other than for a fixed number of functions and operators hard-coded in a grammar). Functional languages that allow higher-order function application use additional symbols to signal the two cases for their context-free parsers. In Miranda or Haskell, the second phrase must be written reduce (+) 0, using parentheses as magic characters to override the usual meaning of + to the parser. Similar magic characters are used to indicate that a function not requiring the magic parentheses is to be interpreted as infix: Miranda requires a prepended "$" character, while Haskell uses surrounding backquotes. If the context of type is introduced to parsing, then it is possible to distinguish the two cases without magic characters. The first example must be infix operator application, while the second must be prefix function application.

2.4. Parsing guided by type

Type-sensitive parsing, as introduced in Ref. [1], uses type context for a subtler interpretation of syntax. In this scheme, a traditional context-free parser gathers an expression into a list of tokens; reduce + 0 would be parsed to a list further annotated by token type (from a symbol table), such as [reduce : (Int -> Int -> Int) -> Int -> [Int] -> Int, + : Int -> Int -> Int, 0 : Int]. Then a set of conditional, prioritized rewrite rules is applied to the list, creating application combinations until the list is reduced to one element.

†Miranda is a trademark of Research Software Ltd.


In this example, the highest-priority rule with a satisfied conditional would be partial prefix application, resolving the list into the single application (reduce + 0). The worst-case quadratic efficiency cost for this scheme results from the fact that the entire expression may be scanned n times to determine the next rewrite rule to apply. As expressions tend to be short, this cost is practical and allows for a sophisticated interpretation of syntax. Rules are easily developed to handle the examples above as well as multiplication by juxtaposition and other notational practices. Unfortunately, efficiency costs for this method are unacceptably large when types are not given but must be inferred.

Our interest in higher-order functional programming languages where type inference is a standard feature led to the development of a type-sensitive parsing method practical in this domain. Like Ref. [1], we use a context-free parser to parse a program into a form where expressions are gathered into lists of sub-expressions. A modified version of Milner's type inference algorithm is then applied to the program, returning a program annotated by type in which function application is fully resolved if the program is determined to be well-typed. Although type inference has a worst-case exponential cost [4], it is widely recognized as a practical compiler analysis. Our extensions providing type-sensitive parsing do not increase the practical costs. Rather than scanning the entire string of expressions, this version of type-sensitive parsing moves left-to-right through a list of expressions with no backup. Once an application is formed, it is never broken. As there are a constant number of possibilities for each position considered, the degradation of standard type inference is at most a constant factor. This algorithm has been implemented in our experimental functional language, Falafel, which has a syntax closely related to the syntax of Miranda.
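To make the type test concrete, the check that separates the infix reading of 3 + 4 from the prefix reading of reduce + 0 can be sketched in a few lines. This sketch is our own illustration, not the Falafel implementation: the names (Fun, applies) and the monomorphic instance chosen for reduce are assumptions, and plain type equality stands in for the unification a real inference engine would perform.

```python
# A minimal "does f apply to a?" test over monomorphic types.

class Fun:
    """A curried function type: arg -> res."""
    def __init__(self, arg, res):
        self.arg, self.res = arg, res
    def __eq__(self, other):
        return (isinstance(other, Fun)
                and self.arg == other.arg and self.res == other.res)

INT = "Int"

def applies(f_ty, a_ty):
    """True when a value of type f_ty can take an argument of type a_ty."""
    return isinstance(f_ty, Fun) and f_ty.arg == a_ty

plus = Fun(INT, Fun(INT, INT))                      # +      : Int -> Int -> Int
reduce_ty = Fun(plus, Fun(INT, Fun("[Int]", INT)))  # reduce : (Int -> Int -> Int) -> Int -> [Int] -> Int

# "3 + 4": 3 does not apply to +, so infix application is the only reading.
print(applies(INT, plus))        # False
# "reduce + 0": reduce does apply to +, so prefix application is detected.
print(applies(reduce_ty, plus))  # True
```

The same test, asked in both directions between adjacent elements, is all the context the parser needs to choose between the two readings.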
After defining the algorithm, we include experimental results demonstrating the low cost of the method in comparison with strictly prefix, left-associative parsing.

3. PRELIMINARY DEFINITIONS

In this section, definitions are given for standard terms like precedence and associativity, as well as new terms useful in the context of type-sensitive parsing and our algorithm. Given an expression that is a pair of infix function applications, such as x * y + z or in general arg1 op1 arg2 op2 arg3, the expression is parsed into prefix form (or order of application is determined) respecting the binding properties of the operators involved. For example, if op1 has a higher binding priority than op2, then the prefix form is op2 (op1 arg1 arg2) arg3. If op2 has the higher binding priority, then the prefix form is op1 arg1 (op2 arg2 arg3). This higher binding priority is the definition of precedence. If op1 and op2 have the same precedence and are left-associative, the left-most operator binds tightest, resolving to op2 (op1 arg1 arg2) arg3. If the operators have the same precedence but are both right-associative, then the expression yields op1 arg1 (op2 arg2 arg3). Operators may not have the same precedence but different associativities, since this results in ambiguity: either operator could be chosen as most tightly binding. Precedence is a useful strategy for eliminating unnecessary parentheses and improving the readability of expressions, although complicated precedence hierarchies tend not to be used (as noted above).

We modify the standard definition of precedence so that the precedence of operators is only considered in a type-safe context. A classification of functions is introduced to explain this modification to precedence. Functions of two or more arguments can be divided into two categories: homogeneous binary and heterogeneous. Homogeneous binary functions take two arguments and return a result, all of the same type.
If f is a homogeneous binary function, then f : t -> t -> t (read: f is a function from t to t to t, for some type t). A heterogeneous function's argument and result types are not required to be the same, or it may be a function of more than two arguments (in a curried language). Given a string of infix homogeneous binary function applications, arg1 op1 arg2 op2 ... opn-1 argn, if every operator is of type t -> t -> t and each arg is of type t, then regardless of the order of application of the operators, the expression is well-typed. The standard concept of precedence can be used to reduce the need for parentheses with type safety. For example, 3 * 4 + 5 div 7 - 5 is well-typed regardless of the order in which the operators are applied. Conversely, given two homogeneous binary functions of incompatible types, no interpretation based solely on precedence can safely resolve order of application.
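Resolution of such a run of homogeneous binary applications is ordinary precedence parsing. The sketch below is our own illustration (a shunting-style reduction with Pascal-like precedence levels and all operators left-associative), not the paper's method; it renders the example 3 * 4 + 5 div 7 - 5 in prefix form.

```python
# Precedence levels are an assumption modeled on Pascal's arithmetic tier.
PREC = {"+": 1, "-": 1, "*": 2, "div": 2}  # all left-associative here

def to_prefix(args, ops):
    """Resolve args[0] ops[0] args[1] ops[1] ... into nested prefix form."""
    out, opstack = [args[0]], []
    for op, arg in zip(ops, args[1:]):
        # Reduce while the stacked operator binds at least as tightly;
        # ">=" breaks precedence ties leftward (left-associativity).
        while opstack and PREC[opstack[-1]] >= PREC[op]:
            r, l = out.pop(), out.pop()
            out.append((opstack.pop(), l, r))
        opstack.append(op)
        out.append(arg)
    while opstack:
        r, l = out.pop(), out.pop()
        out.append((opstack.pop(), l, r))
    return out[0]

# 3 * 4 + 5 div 7 - 5  resolves to  (- (+ (* 3 4) (div 5 7)) 5)
print(to_prefix([3, 4, 5, 7, 5], ["*", "+", "div", "-"]))
# ('-', ('+', ('*', 3, 4), ('div', 5, 7)), 5)
```

Because every operator and operand in the run shares one type t, any order this routine chooses is well-typed; precedence here only selects among equally type-safe parses.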


Precedence is not type-safe in resolving the order of application of strings of infix heterogeneous function applications. Given the expression arg1 het1 arg2 het2 arg3, and letting het1 : t1 -> t2 -> t3 and het2 : t4 -> t5 -> t6, if t3 unifies with t4 then het2 (het1 arg1 arg2) arg3 is a possible interpretation. If t2 unifies with t6, then het1 arg1 (het2 arg2 arg3) is a possible interpretation. It is unusual for both interpretations to be valid. This would require the typings het1 : t1 -> t2 -> t2 and het2 : t2 -> t3 -> t2. For example, given bn : Bool -> Int -> Int and nb : Int -> Bool -> Int, either interpretation of true bn 5 nb false is valid in terms of type. In this case precedence could choose between valid interpretations while preserving type safety. This is rare and complicated, however. Commonly, there is a single valid interpretation of a string of infix, heterogeneous function applications, and precedence serves no purpose in these cases other than to introduce typing errors. For this reason, we restrict precedence to infix application of compatibly typed, homogeneous binary functions.

We distinguish between heterogeneous operators that gather arguments and those that do not. The relational operators, like <, are typical of heterogeneous operators that gather their arguments. In expressions like 3 + 4 < 5 * 6, we choose to allow the numeric operators to be applied before the comparison is made. There are other heterogeneous functions for which the gathering behavior is less attractive. Where pick is the array indexing function, we choose to interpret 3 pick A + 5 as a pick of A, not of A + 5 (assuming A + 5 has meaning). Different preferences on this level have little effect on the algorithm that follows, although there is an aesthetic effect on the language accepted. In our work on syntax, the relational operators are the only operators we are aware of that are generally expected to gather arguments.
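The two candidate interpretations and their typing conditions can be checked mechanically. The sketch below is our own illustration, with plain type equality standing in for unification and all names (interpretations, bn, nb) hypothetical.

```python
def interpretations(het1, het2):
    """Which parses of arg1 het1 arg2 het2 arg3 are well-typed?

    Each operator is given as a triple (t_arg1, t_arg2, t_result).
    """
    t1, t2, t3 = het1
    t4, t5, t6 = het2
    valid = set()
    if t3 == t4:                 # het2 (het1 arg1 arg2) arg3
        valid.add("left-first")
    if t2 == t6:                 # het1 arg1 (het2 arg2 arg3)
        valid.add("right-first")
    return valid

bn = ("Bool", "Int", "Int")      # bn : Bool -> Int -> Int
nb = ("Int", "Bool", "Int")      # nb : Int -> Bool -> Int
print(interpretations(bn, nb))   # the rare case: both parses type-check

lt = ("Int", "Int", "Bool")      # <   : Int -> Int -> Bool
bb = ("Bool", "Bool", "Bool")    # and : Bool -> Bool -> Bool
print(interpretations(lt, bb))   # only the comparison-first parse is valid
```

The second call reflects the common situation described above: for arg1 < arg2 and arg3, only one order of application is well-typed, so precedence has nothing useful to decide.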
A question of language design is whether gathering should be restricted to a specific set of operators or whether it should be under programmer control. As long as the algorithm can determine the behavior of a particular symbol, it makes no difference whether the set is open or closed. We choose prefix application to bind more tightly than infix. An expression like sin 3 + 4 is interpreted as (sin 3) + 4 rather than sin (3 + 4). This choice forces prefix function application to be non-gathering.

4. AN INFORMAL PRESENTATION

The goal of type-sensitive parsing is to provide a parsing method for languages more easily parsed by humans than the languages accepted by context-free parsers. This goal is not well defined, because how we (humans) parse, or even whether there is just one way all people parse, remains unknown. Indeed, it is ironic that mathematically more complicated languages--natural languages--appear easier for us to parse than context-free programming languages. If a parsing algorithm is a success, its workings should go unnoticed. That is, it should do what we expect; when it surprises us, it has failed. Writing software to such demanding and vague specifications is difficult indeed. Evidence that we have succeeded is given in the form of expressions our algorithm parses to our satisfaction.

Context-free grammars can be concise and, when one is versed in the formalism, easily understood. If context-sensitive techniques such as ours are to be used, there must be some method of explaining the technique concisely. In the section that follows, the complete method is given in all its detail. In this section, we give a less formal description to introduce the reader to the technique as well as to demonstrate that the method has a concise description. A description such as this could be used in presenting the technique within a particular language definition. The algorithm works left-to-right through a list of expressions.
Using a place marker, the algorithm identifies an expression in the list as the current expression. Given a list of expressions, we can informally denote the state of the algorithm by wxyz where underlining indicates the current expression is x. In any step, the algorithm considers at most two expressions: the current expression and the adjacent expression to its right or left. The expressions to the right of the current expression are expressions yet to be considered. The expressions to the left of the current expression have been passed over for one of two reasons: either the expression represents an application of an operator that gathers its arguments; or the expression


could not be applied to the expression on its right or left. As the algorithm progresses, function applications are introduced, denoted by parenthesized groupings. For example, (f x) (g y z) s t denotes that the current expression is the application of g to y and z. The application (f x) has been formed and passed over. The expressions s and t are yet to be considered.

The algorithm is a list of conditional rules. If the condition of a rule is satisfied, then the expression list and place marker are changed as specified by the rule. The rules are attempted in order of presentation. If the condition of a rule is not satisfied, then the next rule in the list is attempted. The ordering of the termination rules is arbitrary: no other rule would apply when either termination rule is satisfied. The ordering of the remaining rules reflects our preference that prefix application be more tightly binding than infix application. An example is given with each rule. The first two rules govern termination. (Rule names match the formal algorithm.)

1. (Termination) If there is only one expression in the expression list, then the algorithm has successfully determined function application.

   (reduce + 0) => (reduce + 0)

2. (Forced Prefix) If the current expression marker has been moved beyond the last expression in the expression list, check whether the expressions to the left form an application in prefix form. If not, a type error is reported. Figure 3 presents the full processing of the example and includes a motivating discussion.

   map (< 3) nums => (map (< 3) nums)

The next five rules are the inner workings of the algorithm. Rule 3 is prefix function application. Rules 4 and 6 test when the expression on the left of the current expression applies to the current expression. Rule 5 is infix function application. Rule 7 applies when no other rule does.

3.
(Prefix) If the current expression applies to the expression on its right, form the application (merge the two expressions into a single expression represented by the application of the first to the second).

   double 3 => (double 3)

4. (Non-gathering) If the expression to the left is not a gathering expression and that expression applies to the current expression, form the application.

   (+ 3) 4 => (+ 3 4)

5. (Infix) If the current expression could be an argument of the expression to its right, form the application. If the expression on the right is a gathering operator, move the current marker ahead to the next expression. In the examples, + is non-gathering and < is gathering.

   3 + 4 => (+ 3) 4
   3 < 4 => (< 3) 4

6. (Gathering) If the expression to the left is a gathering expression that applies to the current expression, form the application.

   (< 3) 4 => (< 3 4)

7. (Block) If none of the above rules apply, move the place marker to the next expression.

   map 3 < nums => map 3 < nums

Figure 1 demonstrates how these rules are used to interpret the example expressions from Section 2. The rule name and number producing each line are given on the right. In processing sin cos x, the value of the Block Rule is shown. Because sin cos cannot be interpreted as a valid prefix or infix application, the Block Rule inserts an implicit open parenthesis


sin cos x
sin cos x                                 Block (7)
sin (cos x)                               Prefix (3)
(sin (cos x))                             Non-gathering (4)

3 + 4
(+ 3) 4                                   Infix (5)
(+ 3 4)                                   Prefix (3)

reduce + 0
(reduce +) 0                              Prefix (3)
(reduce + 0)                              Prefix (3)

true or 3 < 4 and false
(or true) 3 < 4 and false                 Infix (5)
(or true) 3 < 4 and false                 Block (7)
(or true) (< 3) 4 and false               Infix (5)
(or true) (< 3 4) and false               Gathering (6)
(or true (< 3 4)) and false               Non-gathering (4)
(and (or true (< 3 4))) false             Infix (5)
(and (or true (< 3 4)) false)             Prefix (3)

Fig. 1. The informal algorithm applied to the examples.

before cos by moving the current expression marker to the right. cos applies to x, satisfying the condition of the Prefix Rule, and sin applies to the result by the Non-gathering Rule. The second and third examples show how infix operator application is distinguished from higher-order function application. In the first case, 3 does not apply to +, so infix application is detected. In the second case, reduce does apply to +, signalling prefix application. The example of true or 3 < 4 and false shows the recognition of infix operators. In this case, the Infix Rule forms the application of or to true. The next operand for or is not found, so the current expression is bumped to the right. < applies to 3, and because < is a gathering operator, 4 becomes the current expression. Although < is gathering, the application to 4 is completed because 4 and and are incompatibly typed. Now the partial application of or to the left is completed with the second argument (< 3 4). Finally, the infix application of and is recognized.

This case demonstrates the recognition of gathering and non-gathering infix operators. It also demonstrates that this naive version does not consider precedence. Precedence can be added by delaying the forming of function applications in the case of infix homogeneous binary functions. The Infix Rule is used to recognize such applications, but function applications are not formed until the longest list of homogeneous binary applications has been parsed. Using angle brackets to represent this delayed application form, this example expression is processed as shown in Fig. 2. At the end of
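The seven informal rules can be rendered as a short executable sketch. Everything below is our own illustration of the naive (precedence-free) version, not the authors' Falafel implementation: expressions are (term, type) pairs, types are monomorphic tuples with plain equality standing in for unification, and all names are hypothetical.

```python
GATHERING = {"<"}                     # operators assumed to gather arguments

def fn(a, r):                         # the function type a -> r
    return ("->", a, r)

def applies(f, a):
    """Can expression f = (term, type) take expression a as its argument?"""
    ft = f[1]
    return isinstance(ft, tuple) and ft[0] == "->" and ft[1] == a[1]

def apply(f, a):
    """Merge f and a into one expression, flattening ((f x) y) to (f x y)."""
    term = f[0] + (a[0],) if isinstance(f[0], tuple) else (f[0], a[0])
    return (term, f[1][2])            # the result type of f

def is_gathering(e):
    head = e[0][0] if isinstance(e[0], tuple) else e[0]
    return head in GATHERING

def parse(exprs):
    exprs, i = list(exprs), 0         # i is the current-expression marker
    while True:
        if len(exprs) == 1:                          # 1. Termination
            return exprs[0]
        if i >= len(exprs):                          # 2. Forced Prefix
            e = exprs[0]
            for a in exprs[1:]:
                if not applies(e, a):
                    raise TypeError("not a prefix application")
                e = apply(e, a)
            return e
        cur = exprs[i]
        left = exprs[i - 1] if i > 0 else None
        right = exprs[i + 1] if i + 1 < len(exprs) else None
        if right and applies(cur, right):            # 3. Prefix
            exprs[i:i + 2] = [apply(cur, right)]
        elif left and not is_gathering(left) and applies(left, cur):
            exprs[i - 1:i + 1] = [apply(left, cur)]  # 4. Non-gathering
            i -= 1
        elif right and applies(right, cur):          # 5. Infix
            gather = is_gathering(right)
            exprs[i:i + 2] = [apply(right, cur)]
            if gather:
                i += 1
        elif left and is_gathering(left) and applies(left, cur):
            exprs[i - 1:i + 1] = [apply(left, cur)]  # 6. Gathering
            i -= 1
        else:                                        # 7. Block
            i += 1

BOOL, INT = "Bool", "Int"
env = {"3": INT, "4": INT, "0": INT, "x": INT, "true": BOOL, "false": BOOL,
       "+": fn(INT, fn(INT, INT)), "<": fn(INT, fn(INT, BOOL)),
       "or": fn(BOOL, fn(BOOL, BOOL)), "and": fn(BOOL, fn(BOOL, BOOL)),
       "sin": fn(INT, INT), "cos": fn(INT, INT),
       "reduce": fn(fn(INT, fn(INT, INT)), fn(INT, fn("[Int]", INT)))}

def ex(*names):                       # a list of (term, type) expressions
    return [(n, env[n]) for n in names]

print(parse(ex("sin", "cos", "x"))[0])    # ('sin', ('cos', 'x'))
print(parse(ex("3", "+", "4"))[0])        # ('+', '3', '4')
print(parse(ex("reduce", "+", "0"))[0])   # ('reduce', '+', '0')
print(parse(ex("true", "or", "3", "<", "4", "and", "false"))[0])
# ('and', ('or', 'true', ('<', '3', '4')), 'false')
```

Run on the four examples, this sketch reproduces the traces of Fig. 1, including the rule ordering that lets the Block Rule act as an implicit open parenthesis in sin cos x.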