Runtime data analysis for Java programs




Runtime Data Analysis for Java Programs

Andrew Cain, Jean-Guy Schneider, Doug Grant and Tsong Yueh Chen
Swinburne University of Technology, School of Information Technology
P.O. Box 218, Hawthorn, Victoria 3122, Australia
{acain,jschneider,dgrant,tychen}@it.swin.edu.au

Abstract

Program analysis plays a key role in many areas of software development, such as performance tuning, testing, debugging, and maintenance. Program analysis can be carried out statically or dynamically, and the two approaches are generally seen as complementary. In this paper, we focus on runtime data analysis for object oriented systems in general and the Java 2 platform in particular. More specifically, we investigate various runtime approaches to monitoring the access and modification of variables in a Java program in order to keep track of their usage history, which is of particular interest for dynamic data flow analysis. Our results indicate that the Java 2 platform does not provide sufficient tools to enable comprehensive runtime data analysis in this context, especially when the complete source code of the application under investigation is unavailable.

1 Introduction

Data flow analysis is a technique traditionally used by compilers for code optimization [1], but it has also been shown to be useful in other areas, such as performance tuning, testing, debugging, and maintenance. Data flow analysis can be performed either statically or dynamically, and the two approaches are generally seen as complementary. Static analysis can ensure that all paths within the code are examined, though it is difficult to analyze the use of individual array elements, pointers, and, in the case of object oriented systems, objects. A dynamic approach can only examine the program paths covered by the respective input data, but it can provide more exact information on the use of objects, individual array elements, and pointers.

In order to perform a comprehensive runtime data flow analysis it is necessary, amongst other things, to obtain accurate runtime information about the access and modification of each program variable under investigation. In general, this cannot be done in a fully generic way, as languages may support different data interaction models. In this paper, we focus on technical aspects of retrieving the runtime information required to perform dynamic data flow analysis on the Java 2 platform [10]. As discussed in Section 6, our results are applicable not only to dynamic data flow analysis, but also to other areas where runtime data analysis is required.

The remainder of this paper is organized as follows: in Section 2, we briefly introduce the concepts of dynamic data flow analysis, followed by a discussion of previous approaches to performing data flow analysis for object oriented programs (Section 3). In Section 4, we present a set of requirements for performing comprehensive dynamic data flow analysis for Java. Section 5 presents different approaches that can be taken to retrieve runtime data flow information for Java programs. We conclude in Section 6 with our main observations and an outline of future directions.


2 Dynamic Data Flow Analysis

In the area of software testing, dynamic data flow analysis (DDFA) is used to identify anomalies in the sequence of actions performed upon a program's variables [17]. There are three basic actions that can be performed on a variable, namely define, reference, and undefine [11]. A variable is defined when its value is set; it is referenced when its value is referred to; and it is undefined when its value is destroyed or the variable goes out of scope.

Data flow anomalies are improper sequences of actions performed on a data element. Three such anomalies exist: define-define, define-undefine, and undefine-reference [11]. The define-define anomaly indicates that the data element was assigned a value that was never used. The define-undefine anomaly indicates that the data element's value was defined but not used before the element was destroyed. If a variable that was undefined receives a reference action, this indicates an undefine-reference anomaly; this is an erroneous sequence, as an undefined data element should not be accessed. The define-define and define-undefine anomalies do not necessarily indicate the presence of an error. Other studies have extended the available actions and/or anomalies [6, 7].

One of the key issues is that a proper dynamic data flow analysis must not change the semantics of the program, i.e. the tested program must be observably equivalent [5] to the original program. If the semantics of the program is changed, or is not captured correctly, any analysis performed may be flawed. A commonly used technique to safeguard the semantics is to collect the information about the uses of variables in a set of variables different from those appearing in the program. Short circuit evaluation of boolean conditions, and abrupt termination due to exceptions, are two areas which require particular attention in languages that support these features.

Since different types of data are designed to be used in different ways (e.g., value and reference types), different analyses should be built for different types of data. The design of the analysis should take into consideration not only the intrinsic characteristics of the data, but also how they can be used or misused.
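The define/reference/undefine rules above amount to a small state machine per variable. The following sketch is our own illustration (class and method names are not from the paper): it keeps one state per variable and reports each of the three anomalous action pairs as it occurs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal DDFA state tracker (illustrative only). Each variable is in one of
// three states; an improper consecutive pair of actions is reported as one of
// the three anomalies from Huang [11].
class DdfaTracker {
    enum State { UNDEFINED, DEFINED, REFERENCED }

    private final Map<String, State> states = new HashMap<>();
    final List<String> anomalies = new ArrayList<>();

    void define(String var) {
        if (states.get(var) == State.DEFINED) {
            anomalies.add("define-define on " + var);      // value set twice, never used
        }
        states.put(var, State.DEFINED);
    }

    void reference(String var) {
        if (states.getOrDefault(var, State.UNDEFINED) == State.UNDEFINED) {
            anomalies.add("undefine-reference on " + var); // used while undefined
        }
        states.put(var, State.REFERENCED);
    }

    void undefine(String var) {
        if (states.get(var) == State.DEFINED) {
            anomalies.add("define-undefine on " + var);    // set but destroyed unused
        }
        states.put(var, State.UNDEFINED);
    }
}

class DdfaDemo {
    public static void main(String[] args) {
        DdfaTracker t = new DdfaTracker();
        t.define("x");      // x = 1;
        t.define("x");      // x = 2;           -> define-define
        t.reference("y");   // use of y before any assignment -> undefine-reference
        t.reference("x");   // proper use of x
        t.define("x");      // x = 3;
        t.undefine("x");    // x leaves scope unused          -> define-undefine
        System.out.println(t.anomalies);
    }
}
```

Note that the tracker keeps its state in variables of its own, separate from the program's variables, in line with the semantics-preservation requirement discussed above.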

3 Related Work

Dynamic data flow analysis has been performed for various procedural programming languages using the technique of program instrumentation [4, 6, 11, 18]. In [4, 6, 11], code to report the actions on variables, called probes, is inserted into the source of the program to be analyzed. The probes act on state variables which are declared explicitly to collect data information for analysis. The approach for C presented in [18] uses a different form of instrumentation: the memory address of a variable is used to perform the analysis. In this instance, the probe replaces the variable's access, using functions for reference and definition. These functions take the address of a variable, update the state for that variable, and return the variable's address, which is then dereferenced and used in the original expression [18].

The approaches presented in [4, 6, 11, 18] individually address different issues related to performing dynamic data flow analysis. Huang proposes the technique of program instrumentation to facilitate dynamic data flow analysis [11], whereas the approach proposed by Chan and Chen focuses on how to carry out data flow analysis across subprogram boundaries and provides an alternative for carrying out the analysis even if part of the program source is not available for instrumentation [4]. The approach presented by Chen et al. takes into consideration the data types and their uses for COBOL and provides a more comprehensive framework for analyzing COBOL programs [6]. Finally, Price proposes an innovative approach that uses the address of assigned storage, instead of identifiers in the program, to keep track of the data flow analysis [18].

Issues related to global variables and uninstrumented source are introduced in [11]. Huang proposes a technique for analyzing variables which cross subprogram boundaries and for calls to uninstrumented subroutines. This technique sets the state of a variable to referenced both when it enters and when it returns from uninstrumented code. This does not take into consideration the actual use of the variable and may result in anomalies not being reported correctly. A different approach is proposed in [4], which allows the programmer to specify how method parameters are used within each method. This approach results in a much closer match to the actual use of the variables than the one presented in [11]. However, neither of these two approaches addresses complete variable analysis. Finally, [11] presents an approach for performing analysis on global variables, but does not address issues related to global variable accesses from uninstrumented code.

Two approaches have been developed for performing dynamic data flow analysis of object oriented programs: one for C++ [7] and one for Java [2]. These approaches are built on the techniques developed for procedural languages and thereby inherit their deficiencies in relation to comprehensive analysis. Chen and Low [7] extend the approach of Price [18] to extract the information from C++ programs. Boujarwah et al. [2] found that this approach is not applicable to Java. Their approach attempts to use a combination of identifiers from the program's source to uniquely identify variables. The main problem with this approach is that all object members are identified via an object name, representing the name of the variable which refers to the object. As a result, all actions on a reference to an object must be tracked, making it impossible to perform the analysis without having the entire source. As the approach does not consider the presence of uninstrumented code, no strategy is presented for analyzing accesses of public instance variables from uninstrumented code.

Based on this work, we have defined a purely object oriented approach for tracking the state of each variable under investigation. This approach keeps track of the state of variables using meta-level techniques [12]; it replicates the state of each variable in an associated meta object. However, a detailed discussion of this approach is outside the scope of this report (refer to [3] for details) and is the topic of a forthcoming publication.
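Huang's boundary rule can be made concrete with a small sketch (our own illustration, not code from [11]): marking a variable as referenced around a call into uninstrumented code can hide a genuine define-define anomaly when the callee never actually reads the variable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Why the boundary rule of [11] can under-report: the artificial 'reference'
// inserted at the call into uninstrumented code masks the fact that the first
// value assigned to x was never used. The tracker is a minimal illustration.
class BoundaryDemo {
    enum State { UNDEFINED, DEFINED, REFERENCED }
    static final Map<String, State> state = new HashMap<>();
    static final List<String> anomalies = new ArrayList<>();

    static void define(String v) {
        if (state.get(v) == State.DEFINED) {
            anomalies.add("define-define on " + v);
        }
        state.put(v, State.DEFINED);
    }

    public static void main(String[] args) {
        define("x");                        // x = 1;
        state.put("x", State.REFERENCED);   // boundary rule: the call into
                                            // uninstrumented code marks x as
                                            // referenced, whether or not the
                                            // callee actually reads it
        define("x");                        // x = 2; the first value was never
                                            // used, yet no anomaly is reported
        System.out.println(anomalies);      // prints []
    }
}
```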

4 DDFA for Java

Complementing the fundamentals of dynamic data flow analysis presented in Section 2, we have defined a set of additional requirements our approach must meet in order to perform a complete analysis of Java programs:

1. The approach must allow at least the tracking of the definition, reference, and destruction actions for all variables under investigation.

2. The approach must be able to handle any variable, independent of scope, type, or visibility.

3. The approach must support targeted analysis of source, thereby allowing analysis of individual parts of a system, and also allowing analysis of systems that use third party components.

4. The output generated by the approach must enable programmers to identify the location and type of any anomalies reported.

5. The approach must enable automated analysis.

As in any other (object oriented) programming language, variables in Java are assigned to a particular scope and have an explicit visibility. Hence, our approach must be able to handle global, class, instance, and local variables, as well as method parameters and constants.

The types of variables supported by the language affect the way the analysis is performed. Languages like Java and C# [8] have two different kinds of types. These are referred to as primitive and reference types in the Java language specification [10], and as value and reference types in the CLI standard [9]. Variables of a reference type contain a reference to an object [10] or bit sequence [9]. It is therefore possible for one object to be referenced from multiple variables, and thus for modifications made via one variable to affect the object referred to by another variable.


x is passed          By Value                           By Reference
---------------------------------------------------------------------------------
Reference type       y is a copy of the reference       y is a reference to x
                     contained in x
Value type           y is a copy of the value           y is a reference to the
                     contained in x                     storage of x
Result of y = z      has no effect on x                 x is assigned the value of z
                                                        (value types), or x is changed
                                                        to refer to the same location
                                                        as z (reference types)

Table 1: Method void foo(Type y) { ... y = z; ... } is called using foo(x), where Type can be replaced with any value or reference type.

Variables of a value type directly contain their data [10, 9]. As each value type variable contains its own storage, it is not possible for changes to a value type variable to affect the value of another variable. Requirement 2 ensures that correct analysis of both kinds of variables is supported by the approach.

The way arguments are passed to parameters also affects a parameter's analysis; this issue is likewise covered by requirement 2. Actions on a parameter passed by value have no effect on the argument that was passed to the parameter, while actions on a parameter passed by reference directly affect that argument. Table 1 shows the possible combinations that result if a language supports both passing by value and by reference, and both value and reference types.

Targeted analysis, requirement 3, refers to the ability to analyze only part of a program. This requirement is of key importance for programs that make use of libraries or third party components; the approach cannot require all code, as the source of used libraries and components may not be available.

Individually, each of these requirements can be met; it is only when they are combined that the complexity of the analysis becomes apparent. Combining targeted analysis (requirement 3) with the required analysis of publicly scoped variables (requirement 2) represents one such case, as the entire source may not be available for analysis. Unavailable code may use and modify the values of publicly scoped variables exposed from within the analyzed code. Hence, a lower level of analysis may prove difficult (or even impossible) in relation to the identification of the anomalies within the program's source.
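The aliasing behaviour of reference types described above is easy to demonstrate in Java itself (the class names below are our own illustration): a modification made through one reference variable is visible through every other variable referring to the same object, whereas value (primitive) type variables each own their storage.

```java
// Reference vs. value semantics in Java (cf. Table 1).
class AliasDemo {
    static class Box { int value; }   // illustrative reference type

    public static void main(String[] args) {
        Box x = new Box();
        Box y = x;                    // y is a copy of the reference in x
        y.value = 42;                 // modification via y ...
        System.out.println(x.value);  // ... is observed via x: prints 42

        int a = 7;
        int b = a;                    // value type: b gets its own copy
        b = 99;                       // changing b ...
        System.out.println(a);        // ... leaves a untouched: prints 7
    }
}
```

This is precisely why a DDFA for Java must track actions against the object a variable refers to, not merely against the variable's name.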
Initially, our approach targets the Java programming language, with the aim that further refinement will allow it to address other object oriented languages, in particular C#. Our approach therefore does not need to cover constructs such as references to value types or parameters passed by reference (both would be necessary for C#).

5 Retrieving Runtime Information

For Java, there are two ways to retrieve the required information at runtime: program transformation or debugging services. Program transformation can be performed at a number of levels: by inserting probes into the original source, by providing a compiler which produces instrumented byte code, or by instrumenting the byte code produced by an existing compiler. Alternatively, the debugging services provided by the Java Platform Debugger Architecture [16] can be used to watch for access and modification of Java fields [10].

5.1 Source Code Instrumentation

Previous approaches to source code instrumentation have provided two main variants. The first variant, used in [4, 6, 11], places instrumented probes beside each statement. Where the language supports short circuit condition evaluation, these conditions must be decomposed to ensure that the actions reported actually occurred during execution. The Java language specifies that operands are evaluated left to right [10], so compiler dependent semantics need not be considered. Java does, however, allow abrupt termination due to exceptions, which requires particular attention during decomposition.

The second variant was developed for C [18] and tracks variables by their memory address. As described in Section 3, this approach replaces variable accesses with reference and definition functions. As Java does not provide mechanisms to operate with the addresses of variables, this approach cannot be applied.

Unfortunately, source code instrumentation fails in combination with targeted analysis: with these approaches it is possible that a given variable cannot be fully analyzed without instrumenting all used code. This is of particular importance for Java, where every program relies on at least the standard library classes, whose source is generally not available.
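The first variant, and the required decomposition of short circuit conditions, can be sketched as follows. This is our own illustration of the general technique of [4, 6, 11], not their actual output; the probe names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Source-level probes placed beside each statement. The short circuit
// condition is decomposed so that the second operand is only reported as
// referenced when it was actually evaluated.
class ProbeDemo {
    static final List<String> log = new ArrayList<>();

    static void ref(String var) { log.add("ref " + var); }
    static void def(String var) { log.add("def " + var); }

    // Original:  int n = 0; if (a != null && a.length > 0) { n = a.length; }
    static int instrumented(int[] a) {
        int n = 0;
        def("n");
        ref("a");
        boolean cond = (a != null);
        if (cond) {            // second operand evaluated only if the first holds
            ref("a");
            cond = (a.length > 0);
        }
        if (cond) {
            ref("a");
            n = a.length;
            def("n");
        }
        return n;
    }

    public static void main(String[] args) {
        instrumented(null);        // only the first operand was evaluated
        System.out.println(log);   // prints [def n, ref a]
        log.clear();
        instrumented(new int[]{1});
        System.out.println(log);   // prints [def n, ref a, ref a, ref a, def n]
    }
}
```

Without the decomposition, a naive probe before the whole condition would report a reference to `a.length` even on the `null` path, i.e. an action that never occurred.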

5.2 Byte Code Instrumentation, or Compiler

Another approach is to instrument the Java class files. A Java compiler will usually convert Java source code into the class file format specified in [13], although the Java specifications do not require a compiler to produce Java class files [10, 13]. Rather than instrumenting the source code, the resulting Java byte code can be modified. Java byte code is effectively a stack based programming language: if we introduce instructions into the byte code that have no net effect on the operand stack, we can modify the byte code without changing the semantics of the program. Instructions which load values onto the operand stack, such as iload and getfield, are followed by operations which report the reference of the variable. Instructions which store values into variables, such as istore and putfield, are followed by operations which report the definition action. In both cases, the inserted code must not affect the operand stack.

The modified byte code can be generated either by a customized compiler, or by using an existing compiler and modifying the byte code it produces. Like source code instrumentation, byte code transformation can only guarantee to fully track the access and/or modification of variables by violating targeted analysis. It is possible to instrument all Java class files with probes, so libraries and third party components can be instrumented; this, however, results in anomalies being reported for variables over which the developer has no control, which negatively impacts the ability to analyze the results of this approach. The compiler approach regains targeted analysis, but as it relies on the program's source code, it provides no better coverage than the source code instrumentation approaches.
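A probe suitable for splicing into byte code can be sketched as follows. The probe class and field names are our own illustration, and the opcode listing in the comment is schematic; a real implementation would rewrite class files with a byte code library.

```java
import java.util.ArrayList;
import java.util.List;

// A probe whose calls can be inserted into byte code without disturbing the
// operand stack. The inserted pair
//     ldc "Counter.count"          // push the constant probe argument
//     invokestatic DfaProbe.ref    // pop it again; returns void
// placed after a getfield (and DfaProbe.def after a putfield) leaves the
// stack exactly as the original instruction stream left it.
class DfaProbe {
    static final List<String> log = new ArrayList<>();

    public static void ref(String field) { log.add("ref " + field); }
    public static void def(String field) { log.add("def " + field); }

    public static void main(String[] args) {
        // What the rewritten byte code of  this.count = this.count + 1;
        // would report at run time:
        DfaProbe.ref("Counter.count");   // inserted after getfield
        DfaProbe.def("Counter.count");   // inserted after putfield
        System.out.println(log);         // prints [ref Counter.count, def Counter.count]
    }
}
```

Because both probes are static, take only a constant argument, and return void, their net effect on the operand stack is zero, preserving the verifier's stack map and the program's semantics.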

5.3 Java Debugger

The Java Platform Debugger Architecture [16] provides three interfaces for performing debugging tasks using the Java Virtual Machine. Using the Java Debug Interface API [15], it is possible to add watchpoints for access and modification of instance and class variables. Unfortunately, no watchpoint can be added to variables local to methods, or to the access or modification of elements in an array.

By performing some source code transformation, it is possible for the Java Debug Interface [15] to analyze local variables. Wrapper classes can be introduced to provide an object wrapper for local variables: by converting each local variable into an object, appropriate watchpoints can be added to the resulting instance variables, and all uses of the variable within the method are converted into uses of the instance variable. Figure 1 shows the original source on the left and the resulting instrumented source on the right.

    int aMethod() {              int aMethod() {
      int x;                       DfaIntWrapper x = new DfaIntWrapper();
      x = 100;                     x.value = 100;
      return x;                    return x.value;
    }                            }

                                 class DfaIntWrapper {
                                   public int value;
                                 }

    Figure 1: Transforming local variables into objects

Unfortunately, the Java Debug Interface [15] does not allow watchpoints to be added to individual array elements. Although source code transformation can be used to track array element usage within the instrumented code, this again does not extend to uninstrumented code. For example, the java.util.Arrays class [14] provides methods which take arrays as parameters, e.g. methods for searching, sorting, and filling arrays; any use of array elements within these methods cannot be tracked by the debugger.

Nevertheless, this approach comes closest to meeting all of our requirements. It is capable of both complete analysis of public instance variables and targeted analysis of a program in the absence of arrays, but it is unable to perform complete analysis of arrays that are passed to uninstrumented code.
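The transformation of Figure 1 must preserve observable behaviour; both versions of aMethod() return the same value. The runnable check below places the two versions side by side (DfaIntWrapper is taken from the figure; the demo class name is ours).

```java
// Behavioural equivalence of the Figure 1 transformation: the instrumented
// method returns the same result as the original, while every access to x
// now goes through the watchable instance variable 'value'.
class WrapperDemo {
    static class DfaIntWrapper { public int value; }

    static int aMethodOriginal() {
        int x;
        x = 100;
        return x;
    }

    static int aMethodInstrumented() {
        DfaIntWrapper x = new DfaIntWrapper();
        x.value = 100;          // a modification watchpoint on 'value' fires here
        return x.value;         // an access watchpoint on 'value' fires here
    }

    public static void main(String[] args) {
        System.out.println(aMethodOriginal());      // prints 100
        System.out.println(aMethodInstrumented());  // prints 100
    }
}
```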

6 Conclusions and Future Work

In this paper, we have presented several approaches for obtaining, at runtime, the data flow information from Java programs needed to perform dynamic data flow analysis. More precisely, the required information can be extracted from a program either by program instrumentation or via the debugging services offered by the Java 2 platform. Program instrumentation of either the source or the binary representation of the program is used to insert probes that report actions on data items; alternatively, debugging services can be used to watch the access and modification of variable values. We have examined these approaches for Java and evaluated how well they meet our requirements for comprehensive runtime data analysis.

Neither debugging services nor program transformation is capable of meeting all of our requirements for performing complete dynamic data flow analysis of Java programs. Supporting the individual analysis of any kind of variable, independent of its scope, type, or visibility, is possible, as is targeted analysis; however, when these two aspects are combined, none of the existing tools offers a solution. All forms of program transformation are either unable to analyze all variables and/or cannot perform targeted analysis. While the debugger approach meets these requirements, it is unable to perform complete analysis of individual array elements, one of the key advantages of dynamic analysis over static analysis.

To allow complete dynamic data flow analysis for Java programs, either an extension to the Java Platform Debugger Architecture needs to be created, or some of our requirements on the analysis must be relaxed. Relaxing any of the requirements will result in an approach that can perform only a partial analysis. These issues will form the basis for future work. Further work is also required to examine whether data flow analysis can be performed on the .NET framework [9]: .NET does not provide debugging services as rich as those of the Java 2 platform, and it protects binary representations of library code from being manipulated.


References

[1] Frances E. Allen and John Cocke. A program data flow analysis procedure. Communications of the ACM, 19(3):137–147, 1976.

[2] Abdulaziz S. Boujarwah, Kassem Saleh, and Jehad Al-Dallal. Dynamic data flow analysis for Java programs. Information and Software Technology, 42(11):765–775, August 2000.

[3] Andrew Cain, Tsong Yueh Chen, Doug Grant, and Jean-Guy Schneider. Dynamic data flow analysis for object oriented programs. Submitted for publication, 2003.

[4] Fun Ting Chan and Tsong Yueh Chen. AIDA – A Dynamic Data Flow Anomaly Detection System for Pascal Programs. Software – Practice and Experience, 17(3):227–239, March 1987.

[5] Huo Yan Chen, T. H. Tse, and Tsong Yueh Chen. TACCLE: a methodology for object-oriented software Testing At the Class and Cluster LEvels. ACM Transactions on Software Engineering and Methodology, 10(1):56–109, 2001.

[6] Tsong Yueh Chen, H. Kao, M. S. Luk, and W. C. Ying. COD – A Dynamic Data Flow Analysis System for COBOL. Information and Management, 12(2):65–72, 1987.

[7] Tsong Yueh Chen and C. K. Low. Error Detection in C++ through Dynamic Data Flow Analysis. Software – Concepts and Tools, 18:1–13, 1997.

[8] European Computer Manufacturers Association. Standard ECMA-334: C# Language Specification, second edition, December 2002.

[9] European Computer Manufacturers Association. Standard ECMA-335: Common Language Infrastructure, second edition, December 2002.

[10] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language Specification. Addison-Wesley, second edition, 2000.

[11] J. C. Huang. Detection of Data Flow Anomaly Through Program Instrumentation. IEEE Transactions on Software Engineering, SE-5(3):226–236, May 1979.

[12] Gregor Kiczales, Jim des Rivières, and Daniel G. Bobrow. The Art of the Metaobject Protocol. MIT Press, 1991.

[13] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification. Addison-Wesley, second edition, 1999.

[14] Sun Microsystems. Java 2 Platform SE API Specification. Available at http://java.sun.com/j2se/1.4.1/docs/api/.

[15] Sun Microsystems. Java Debug Interface API. Available at http://java.sun.com/j2se/1.4.1/docs/guide/jpda/jdi/.

[16] Sun Microsystems. Java Platform Debugger Architecture. Available at http://java.sun.com/j2se/1.4.1/docs/guide/jpda/.

[17] Leon J. Osterweil and Lloyd D. Fosdick. DAVE – A Validation Error Detection and Documentation System for Fortran Programs. Software – Practice and Experience, 6:473–486, 1976.

[18] David Andrew Price. Program Instrumentation for the Detection of Software Anomalies. Master's thesis, Department of Computer Science, University of Melbourne, Australia, 1985.

