How Do Mobile Phones Fail? A Failure Data Analysis of Symbian OS Smart Phones

June 16, 2017 | Author: Zbigniew Kalbarczyk | Category: Data Analysis, User eXperience, Random access memory, Mobile phone, Data Logger, Smart Phone


Marcello Cinque, Domenico Cotroneo
Dipartimento di Informatica e Sistemistica
Università degli Studi di Napoli Federico II
Via Claudio 21, 80125 - Naples, Italy
{macinque, cotroneo}@unina.it

Zbigniew Kalbarczyk, Ravishankar K. Iyer
Center for Reliable and High Performance Computing
University of Illinois at Urbana-Champaign
1308 W. Main St., Urbana, IL 61801
{kalbar, iyer}@crhc.uiuc.edu

Abstract

While the new generation of hand-held devices, e.g., smart phones, supports a rich set of applications, the growing complexity of the hardware and runtime environment makes the devices susceptible to accidental errors and malicious attacks. Despite these concerns, very few studies have looked into the dependability of mobile phones. This paper presents a measurement-based failure characterization of mobile phones. The analysis starts with a high-level failure characterization of mobile phones based on data from publicly available web forums, where users post information on their experiences in using hand-held devices. This initial analysis is then used to guide the development of a failure data logger for collecting failure-related information on SymbianOS-based smart phones. Failure data is collected from 25 phones (in Italy and the USA) over a period of 14 months. Key findings indicate that: (i) the majority of kernel exceptions are due to memory access violation errors (56%) and heap management problems (18%), and (ii) on average users experience a failure (freeze or self-shutdown) every 11 days. While the study provides valuable insight into the failure sensitivity of smart phones, more data and further analysis are needed before generalizing the results.

1 Introduction

The new generation of mobile and embedded devices, such as smart phones and PDAs (personal digital assistants), supports a rich set of applications, e.g., web browsing and entertainment software. What’s more, time-to-market pressure forces manufacturers to deliver products with new features within very short time windows (e.g., six months), often at the expense of testing. As a result, we witness an increasing susceptibility of hand-held devices to accidental errors and malicious attacks. An example is the recently reported first mobile phone virus, Cabir, affecting SymbianOS-based smart phones. Reliability becomes even more important as new critical applications emerge for mobile phones, e.g., robot control [15, 10], traffic control [2], and telemedicine [4]. In such scenarios, a phone failure affecting the application could result in a significant loss or hazard, e.g., a robot performing uncontrolled actions. Despite these concerns, very few studies have looked into the dependability of mobile phones. As a result, there is little understanding of how and why mobile phones fail.

This paper presents a measurement-based failure analysis of mobile phones. The analysis starts with a high-level failure characterization of mobile phones based on everyday users’ experiences. Data for this study spans a four-year period (between 2003 and 2006) and is obtained from publicly available web forums, where users post information on their experiences in using hand-held devices. The information collected in these forums is not well structured, and only a relatively small number of entries can be considered failure reports. However, the collected data enables: (i) identification of the high-level failure manifestations, (ii) categorization of the user-initiated recovery from the device failure, and (iii) characterization of the failure severity.

This initial analysis is then used to guide the development of a failure data logger for smart phones, initially introduced in [1]. The logger employs a heartbeat mechanism to detect system/application failures. Upon failure detection, the logger records information about the phone activities, the list of running applications, and error conditions signaled by the system/application modules. The logger has been deployed on 25 Symbian-based smart phones in Italy and in the US since September 2005. The Symbian OS was chosen because of: (i) its open programmability features supporting the C++ and Java programming languages and (ii) its relatively widespread use at the time of this analysis.

The analysis of the collected failure data shows:
(i) The majority of kernel exceptions are due to memory access violation errors (56%) and heap management problems (18%). This is despite the microkernel design model and the advanced memory management facilities provided by the Symbian OS.
(ii) System panics often occur in bursts, i.e., two or more panic events in short succession, which indicates error propagation between applications (especially between real-time tasks and interactive applications).
(iii) Users experience a failure (a phone freeze or self-shutdown) every 11 days on average.

2 Background

Evolution of Mobile Phones. Mobile phone evolution can be described according to three waves, each one characterized by a specific class of mobile terminal [8]:
• Voice-centric mobile phone (first wave): a hand-held mobile radiotelephone for use in an area divided into small sections (cells) and supporting SMS (Short Message Service).
• Rich-experience mobile phone (second wave): a mobile phone with numerous advanced features, typically including the ability to handle data (web browsing, email, personal information management, images, music) through high-resolution color screens.
• Smart phone (third wave): a general-purpose, programmable mobile phone with enhanced processing and storage capabilities. It can be viewed as a combination of a mobile phone and a PDA, and it may have a PDA-like screen and input devices.

Recent mobile phone models on the market feature more computing and storage capabilities, new operating systems, new embedded devices (e.g., cameras, radio), and communication technologies (Bluetooth, IrDA, WAP, GPRS, UMTS). The number of units sold during the third quarter of 2005 (205 million) doubled with respect to the third quarter of 2001 (97 million units) (footnote 1). In the same period, the percentage of smart phones sold has sextupled. According to industry, the time from conception to market deployment of a new phone model is between 4 and 6 months. Clearly, the pressure to deliver a product on time frequently results in compromising device reliability. The hope is that any potential reliability problems can be fixed quickly by deploying new releases of phone firmware, which can be installed on the phone by phone service centers.

Footnote 1: sources: http://www.itfacts.biz, http://www.theregister.co.uk

Symbian OS. Symbian [8] is a light-weight operating system developed for mobile phones and backed by several leading mobile phone manufacturers. The design of Symbian is based on a hard real-time, multithreaded microkernel. All system services are provided by server applications. Clients access servers using kernel-supported message passing mechanisms. Since mobile phone resources are highly constrained, special care is taken with memory management. Specific programming rules are defined to ensure that unused memory is freed and memory leaks are avoided even in the case of failures. In particular, the following mechanisms are provided:
(i) the clean-up stack, which is an OS resource for storing references to objects allocated on the heap;
(ii) the trap-leave technique, which is similar to the try-catch paradigm defined for the C++ and Java languages: when an exception is raised during the execution of a trap block, the current method “leaves” and control returns to the caller, which handles the problem. In the meantime, the operating system frees the memory allocated for all objects stored on the clean-up stack during the execution of the trap block, thus avoiding potential memory leaks;
(iii) the two-phase construction paradigm, which is defined to construct objects with dynamic extensions. The mechanism ensures that, when errors occur during the construction of an object, the dynamic extension is freed using the clean-up stack mechanism.

The Symbian OS defines two levels of multitasking: (i) threads, which execute at the lower level and are scheduled by a time-sharing, preemptive, priority-based OS thread scheduler, and (ii) Active Objects (AOs), which execute at the upper level and are scheduled by a non-preemptive, event-driven active scheduler. Multiple AOs can run within a thread. The use of active objects enables the light-weight OS design, since AOs eliminate the need for synchronization primitives and hence incur a lower context-switch overhead than threads.

A crucial aspect of Symbian that is of interest to us is panic events. A panic event represents a non-recoverable error condition signaled to the kernel by either user or system applications. Information associated with a panic (panic category and panic type) is delivered to the kernel, which decides on the recovery action, e.g., application termination or system reboot.
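The interplay between the clean-up stack and the trap-leave technique described above can be illustrated with a small analogy. The sketch below is written in Python, not Symbian C++, and does not use any Symbian API: the names `Leave`, `CleanupStack`, and `trap`, as well as the error code, are hypothetical and only mimic the described behavior (resources pushed during a trap block are released when the block leaves).

```python
class Leave(Exception):
    """Stands in for a Symbian-style 'leave' carrying an error code (hypothetical)."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


class CleanupStack:
    """Toy clean-up stack: remembers a release callback per heap-like resource."""

    def __init__(self):
        self._items = []

    def push(self, release):
        self._items.append(release)

    def pop(self):
        # The caller took ownership of the resource; forget it without releasing.
        return self._items.pop()

    def mark(self):
        return len(self._items)

    def unwind_to(self, mark):
        # Release, in reverse order, everything pushed since `mark`.
        while len(self._items) > mark:
            self._items.pop()()


def trap(stack, block):
    """Run `block`; on a leave, release what it pushed and return the error code."""
    mark = stack.mark()
    try:
        block()
        return 0
    except Leave as leave:
        stack.unwind_to(mark)
        return leave.code


released = []
stack = CleanupStack()


def failing_block():
    stack.push(lambda: released.append("buffer A"))
    stack.push(lambda: released.append("buffer B"))
    raise Leave(-4)  # hypothetical error code


error = trap(stack, failing_block)
```

After the call, `error` holds the code carried by the leave and both buffers have been released in reverse order of allocation, which is the leak-avoidance property the text attributes to the mechanism.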

3 Related Research

The goal of measurement-based analysis of failure data of computer systems is to classify errors/failures, to characterize their temporal behavior, and to guide the development of detection and recovery mechanisms. [17] identifies trends (shifting error sources, explosive complexity, and global volume) in the computer industry that impact computer system dependability and security. The evolution of three research threads in experimental dependable systems (error monitoring and failure data analysis, fault injection, and design methodology) is traced to illustrate how research responds to or anticipates the direction of the computer industry. The authors indicate a need for more research, especially on issues of complexity, security, and reliability of current and new generation computing systems and applications. Towards this, our study proposes a method for experimental measurement-based analysis of failure mechanisms of emerging smart handheld devices.

A number of studies focus on measurement-based dependability analysis of operating systems, e.g., Windows NT [9, 20], Windows 2000 [19], and Linux [7, 18]. Other studies characterize failures of networked systems and, more recently, large-scale heterogeneous server environments [11, 14]. In the field of mobile distributed systems, an architecture for gathering and analyzing failure data for Bluetooth distributed systems is proposed in [6], whereas [12] reports on an experimental study of the impact of drops on hardware failures in mobile devices. [13] discusses failure data collected from the base stations of a cellular system.

All these studies exploit failure information collected in system event logs, or failure reports provided by specialized maintenance staff. In the case of smart phone devices (analyzed in this paper), logging facilities are limited and not fully exploited. In particular, the Symbian OS provides a server application (flogger), which allows logging application-specific information. However, in order to access the data logged by a generic system/application module, it is necessary to create (on the device) a directory with a well-defined, system-specific name (e.g., Xdir). The problem is that the names of such directories are not made publicly available to developers. These directories are used by manufacturers during development/testing.

Recently, a tool called D_EXC (footnote 2) has been introduced to enable collecting panic events generated on a phone. However, the tool does not relate panic events to failure manifestations, running applications, and phone activities as we do in our study.

4 Smart Phones’ High-level Failures Characterization

In order to conduct a high-level failure characterization of mobile phones, we use publicly available data found on several web forums (footnote 3), where mobile phone users post information on their experience in using hand-held devices. The posted data has a free format, and a relatively small number of entries report on device errors/failures. Here are two examples of user reports: “the phone freezes whenever I try to write a text message, and stays frozen until I take the battery out” and “the phone exhibits random wallpaper disappearing and power cycling, due to UI memory leaks”. Note that the latter report gives details on a potential failure cause. The posted information is filtered (to extract entries related to device failures), classified, and analyzed along several dimensions, as discussed further in this section.

Failure Types. The following failure categories are identified based on the extracted data (footnote 4).
• Freeze (lock-up or a halting failure [3]): The device’s output becomes constant, and the device does not respond to the user’s input.
• Self-shutdown (silent failure [3]): The device shuts itself down, and no service is delivered at the user interface.
• Unstable behavior (erratic failure [3]): The device exhibits erratic behavior without any input from the user, e.g., backlight flashing and self-activation of applications.
• Output failure (value failure [5]): The device, in response to an input sequence, delivers an output sequence that deviates from the expected one. Examples include inaccuracy in the charge indicator, ring or music volume different from the configured one, and event reminders going off at wrong times.
• Input failure (omission value failure [5]): User inputs have no effect on device behavior, e.g., soft keys do not work.

User-Initiated Recovery. User-initiated actions to recover from a device failure can be classified according to the following categories:
• Repeat the action: Repeating the action is sometimes sufficient to get the phone working properly, i.e., the problem was transient.
• Wait an amount of time: Often it is enough to wait for a certain amount of time (the exact amount is not reported by users) to let the device deliver the expected service.
• Reboot (power cycle or reset): The user turns off the device and then turns it on to restore correct operation (a temporary corrupted state is cleaned up by the reboot).
• Remove battery: Battery removal is mainly performed when the phone freezes. In this case, the phone often does not respond to the power on/off button. Battery removal can clean up a permanent corrupted state (e.g., due to a user’s customized settings).
• Service the phone: The user has to bring the phone to a service center for assistance. Often, when the failure is firmware-related, the recovery consists of either a master reset (all the settings are reset to the factory settings and the user’s content is removed from the memory) or a firmware update, i.e., uploading a new version of the firmware. Hardware problems are fixed by substituting malfunctioning components (e.g., the screen or the keypad) or replacing the entire device with a new one.

Footnote 2: D_EXC is a Symbian project available at www.symbian.com/developer/downloads/tools.html
Footnote 3: www.howardforums.com, cellphoneforums.net, www.phonescoop.com, and www.mobiledia.com
Footnote 4: It is possible that other failure categories, not present in the analyzed logs, exist.

Table 1. Failure frequency distribution with respect to failure types and recovery actions; the numbers are percentages of the total number of failures

Failure type      | Recovery action
                  | service phone | reboot | battery removal | wait | repeat | unrep.
freeze            | 3.65          | 2.36   | 9.01            | 4.29 | 0      | 6.01
input failure     | 0.64          | 0.64   | 0.21            | 0    | 0.64   | 0.86
output failure    | 6.87          | 8.80   | 0.43            | 0.64 | 5.79   | 13.73
self-shutdown     | 6.65          | 0      | 2.15            | 0.43 | 0      | 7.73
unstable behavior | 6.87          | 1.72   | 0.21            | 0.21 | 0.64   | 8.80

If a failure report does not contain any information about the recovery, we classify the recovery as unreported.

Failure Severity. In introducing failure severity, this study takes the user perspective and defines severity levels corresponding to the difficulty of the recovery action(s).
• High: A failure is considered to be highly severe when recovery requires the assistance of service personnel.
• Medium: A failure is considered to be of medium severity when the recovery requires reboot or battery removal.
• Low: A failure is considered to be of low severity if the device operation can be reestablished by repeating the action or waiting for an amount of time.
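As an illustration of how the severity levels relate to the recovery actions of Table 1, the following Python sketch encodes the table and aggregates it per failure type and per severity level. The column order used here (service phone, reboot, battery removal, wait, repeat, unreported) is an assumption inferred from the percentages quoted in the report analysis; the function and variable names are hypothetical.

```python
# Table 1 values: percentages of the total number of failure reports.
# Assumed column order, inferred from the figures quoted in the text.
RECOVERY_ACTIONS = ["service phone", "reboot", "battery removal",
                    "wait", "repeat", "unreported"]

TABLE_1 = {
    "freeze":            [3.65, 2.36, 9.01, 4.29, 0.00, 6.01],
    "input failure":     [0.64, 0.64, 0.21, 0.00, 0.64, 0.86],
    "output failure":    [6.87, 8.80, 0.43, 0.64, 5.79, 13.73],
    "self-shutdown":     [6.65, 0.00, 2.15, 0.43, 0.00, 7.73],
    "unstable behavior": [6.87, 1.72, 0.21, 0.21, 0.64, 8.80],
}

# Severity level of each recovery action, following the definitions above.
SEVERITY = {
    "service phone": "high",
    "reboot": "medium",
    "battery removal": "medium",
    "wait": "low",
    "repeat": "low",
}


def totals_by_failure_type(table):
    """Sum each row: share of all reports attributed to each failure type."""
    return {ftype: round(sum(row), 2) for ftype, row in table.items()}


def severity_by_failure_type(table):
    """Split each failure type's share into high/medium/low/unreported."""
    result = {}
    for ftype, row in table.items():
        levels = {"high": 0.0, "medium": 0.0, "low": 0.0, "unreported": 0.0}
        for action, pct in zip(RECOVERY_ACTIONS, row):
            levels[SEVERITY.get(action, "unreported")] += pct
        result[ftype] = {level: round(v, 2) for level, v in levels.items()}
    return result


totals = totals_by_failure_type(TABLE_1)
severity = severity_by_failure_type(TABLE_1)
```

The row totals can be compared against the per-type percentages reported in the text, which is a quick consistency check on the reconstructed table.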

4.1 Reports analysis

The results discussed in this section are obtained from the analysis of failure reports posted between January 2003 and March 2006. A total of 533 reports are used in this study. Phone models from all major vendors are present: Motorola, Nokia, Samsung, Sony Ericsson, and LG, as well as Kyocera, Audiovox, HP, BlackBerry, Handspring, and Danger. 22.3% of the failure reports are from smart phones, although smart phones represented only 6.3% of the market share in 2005. We attribute this to the fact that smart phones: (i) have a more complex architecture than voice-centric or rich-experience mobile phones and (ii) are open for users to download and install third-party applications and/or develop their own applications. Note that not all considered phones are Symbian-based smart phones. Consequently, while the discussion in this section provides a high-level characterization of phone failures, the reported figures may differ from the results given in section 6, which discusses failure data collected by the logger software run on Symbian-based smart phones. Nevertheless, these considerations do not change our conclusions, since the purpose of this preliminary study is to gain an initial understanding of the observed phenomena, rather than to conduct a detailed failure analysis.

The most frequent failure type is output failure (36.3%), followed by freeze (25.3%), unstable behavior (18.5%), self-shutdown (16.9%), and input failure (3%). Despite their high occurrence, output failures are often of low severity, since repeating the action is often sufficient to restore correct device operation (5.8%, see Table 1). On the other hand, self-shutdown and unstable behavior can be considered high-severity failures, because they are effectively recovered by servicing the phone or removing the battery. Phone freezes are usually of medium severity, since a reboot (2.4% of the total number of failures; see Table 1) or battery removal (9.0%; see Table 1) usually reestablishes proper operation. Only in about 3.7% (see Table 1) of cases must the user seek assistance.

To gain an understanding of the relationship between failure types and recovery actions, Table 1 reports the failure distribution with respect to failure types and corresponding recovery actions. From the recovery action perspective, it should be noted that reboots are an effective way to recover from output failures (8.8% of the total number of failures). This indicates that output failures are often due to a temporary corrupted software state, which is cleaned up by the reboot. This is also confirmed by the fact that repeating the action is often sufficient to restore correct device operation. Freezes are usually recovered by pulling out the battery (9.01%), even if a significant number of them (4.29%) are recovered by simply waiting an amount of time for the phone to respond. This may indicate that a certain fraction of battery removals and reboots in response to freezes are due to impatient users. In general, this leads us to observe that freezes are more annoying than output failures, for which the user does not often need to pull out the battery.

The analyzed data also allows correlating failure occurrences with the user activity at the time of the failure. In particular, 13% of failures occur during voice calls, 5.4% while creating/sending/receiving text messages, 3.6% while using Bluetooth, and 2.4% when manipulating images. Finally, several reports (presumably from more sophisticated users) provide insight into the failure causes, e.g., there are indications of memory leaks, incorrect use of the device resources, bad handling (by the software) of indexes/pointers to objects, and incorrect management of buffer sizes.

5 Data Collection

In order to gain an in-depth understanding of the failure behavior of handheld devices, we developed a failure data logger for Symbian-based smart phones. The logger enables: (i) recording the occurrences of user-perceived application/system failures and (ii) associating high-level failure events with the low-level error conditions signaled by applications and system modules in the form of panics. The collected data provide a basis for analyzing the low-level causes of failures observed by users. Towards this, it is important to record the phone status at the time of failure. For example, when a phone freezes while a text message is being received, the stored failure data should enable answering the following questions:
1. Was the text message received despite the failure?
2. Did any user/system module fail?
3. What other applications were running on the device at the time of the failure?

In order to address these questions, it is necessary to relate the failure (the freeze event in our example) to the phone activity/status at the time of the failure and to a panic event, which can be signaled by application or system modules.

In this study, we focus on freeze and self-shutdown failures, since they can be relatively easily detected without human intervention. The automated detection of value and erratic failures (output failures, input failures, and unstable behavior identified in the previous section) requires the implementation of a perfect observer, which has complete knowledge of the system specification [5]. An alternative could be to involve the user in the detection process, by asking him/her to report the occurrence of value or erratic failures. However, as our experience with the analysis of Bluetooth failures shows [6], users are quite unreliable and often neglect or forget to post the required information, thus biasing the results. While this approach can be considered acceptable for an initial evaluation, as discussed in section 4, it becomes too unreliable for a more detailed analysis. Regardless of its limited scope, the study of freezes and self-shutdowns enables us to gain valuable insight into the failure behavior of Symbian-based smart phones.

Figure 1. Overall architecture (active objects: Heartbeat, Running Applications Detector, Log Engine, Power Manager, Panic Detector; files: beats, runapp, activity, power, Log File)

5.1 Failure Logger Architecture

The high-level architecture of the failure data logger is shown in figure 1. The logger is implemented as a daemon application that starts at phone start-up time and executes in the background. It consists of a set of Active Objects (AOs) responsible for the following tasks:
• Heartbeat: in charge of detecting both freezes and self-shutdowns (the next subsection provides more details on the heartbeat active object).
• Running Applications Detector: periodically stores (in the runapp file) the list of IDs of the applications running on the phone. The list is obtained from the Application Architecture Server.
• Log Engine: collects the smart phone activity (e.g., calls, messages, and web browsing). The information is gathered from the Database Log Server and stored in the activity file.
• Power Manager: provides information about the battery status and enables differentiating self-shutdowns due to failures from those due to low battery. The battery status is gathered from the System Agent Server and stored in the power file.
• Panic Detector: collects panic events as soon as they are notified. In order to gather panic-related information (panic category and type), the Panic Detector exploits services provided by the RDebug object in the Symbian OS Kernel Server. The Panic Detector is also responsible for consolidating the data produced by the other active objects into a single Log File. This operation is performed either when a panic is detected or when the logger application starts (i.e., when the phone starts).

5.2 Detection mechanisms

Freeze and self-shutdown detection is accomplished by means of the heartbeat technique, a well-known approach for crash detection. The Heartbeat AO periodically writes a heartbeat event to the beats file. During normal execution, the Heartbeat writes an ALIVE event. When a shutdown is performed, the Heartbeat writes a REBOOT event. Note that before the phone reboots, the Symbian OS allows applications to complete their tasks, which is sufficient for the Heartbeat to record the REBOOT event. When the user deliberately turns off the logger application, a MAOFF (Manual OFF) event is written to the log file. Finally, if a shutdown is due to low battery (the battery status is requested from the Power Manager), a LOWBT (LOW BaTtery) event is written.

When the phone is turned on and the logger starts, the Panic Detector checks the last event logged by the Heartbeat. An ALIVE event indicates that the phone was shut down by pulling out the battery, since in all other cases (i.e., a shutdown due to low battery, the user, or the kernel) the Heartbeat would have logged a REBOOT or LOWBT event. This in turn means that the phone was frozen, which is consistent with the fact that pulling out the battery is the only reasonable user-initiated recovery action for a freeze. Therefore, a freeze is recorded by the Panic Detector, along with the information gathered by the Log Engine and the Running Applications Detector. On the other hand, a REBOOT event can be logged either because the phone rebooted itself or because it was rebooted by the user. Hence, it becomes important to distinguish the two cases.

More details on the logger, including the tuning of the heartbeat frequency and the description of the software infrastructure for the automated transfer of Log Files from the phones used in this study, can be found in [1].
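The start-up check on the last heartbeat event can be sketched as follows. This is a minimal Python illustration, not the logger's actual Symbian implementation: the `timestamp,EVENT` line format of the beats file and the function names are assumptions made for the example, while the four event names are those given in the text.

```python
VALID_EVENTS = {"ALIVE", "REBOOT", "MAOFF", "LOWBT"}


def last_heartbeat_event(beats_lines):
    """Return the event on the last non-empty line (assumed 'timestamp,EVENT')."""
    for line in reversed(beats_lines):
        line = line.strip()
        if line:
            event = line.split(",")[-1].strip()
            if event not in VALID_EVENTS:
                raise ValueError(f"unknown heartbeat event: {event!r}")
            return event
    return None  # empty file: nothing was logged before this start-up


def classify_previous_run(beats_lines):
    """Interpret how the previous run ended, as seen at logger start-up."""
    outcome = {
        None: "no data",
        # No orderly shutdown was recorded, so the battery was pulled: freeze.
        "ALIVE": "freeze",
        # Self-shutdown or user shutdown; told apart later by the off-time.
        "REBOOT": "shutdown (self or user)",
        "LOWBT": "low-battery shutdown",
        "MAOFF": "logger turned off by user",
    }
    return outcome[last_heartbeat_event(beats_lines)]
```

For example, `classify_previous_run(["100,ALIVE", "110,ALIVE"])` yields a freeze, whereas a trailing `REBOOT` line yields a shutdown whose cause still has to be determined.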

6 Experimental Results

This section reports results from the analysis of failure data collected over a period of 14 months from 25 phones, which run Symbian OS versions 6.1 to 8.0 or version 9.0. The majority of phones use Symbian version 8.0, the most popular on the market at the time the analysis started. The targeted phones belong to students, researchers, and professors from both Italy and the USA. The phones have the logger installed and have been under normal use during the period of the experiment.

Self-shutdowns Identification. As a first step in the failure data analysis, we isolate the self-shutdowns from the user-triggered shutdowns. Unfortunately, it is not possible to automatically distinguish the two types of shutdowns because the generated event (i.e., the one captured by the Heartbeat AO) is the same in both cases. We discriminate between these two events by examining the phone off-time (or the reboot duration) recorded by the Panic Detector. Figure 2 shows the distribution of reboot durations. The histogram includes all recorded shutdown events (1778 events). Two local maxima can be noticed in the figure: a first one for reboot durations shorter than 500 seconds, which corresponds to self-shutdowns, and a second one around 30000 seconds (about eight hours and 20 minutes), corresponding to the phone off-time during the night, when users usually turn off their phones. The inner histogram zooms in on the data around the first local maximum (for reboot durations of less than 500 seconds) and shows a peak around 80 seconds, which corresponds to the median self-shutdown duration. Note that the number of events approaches zero for durations longer than 360 seconds. We filtered out all shutdown events with durations longer than 360 seconds. The remaining events are assumed to be self-shutdown events (471 events or 24.2% of the overall data set).

Figure 2. Distribution of reboot durations (x-axis: reboot duration in seconds; y-axis: % of shutdown events); the inner histogram zooms in on the external one for durations < 500 s
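The off-time filter used in the self-shutdown identification step can be sketched in a few lines of Python. The 360-second cut-off is the one reported in the text; the function name and the sample durations are made up for illustration and are not measured data.

```python
SELF_SHUTDOWN_MAX_OFF_TIME_S = 360  # cut-off reported in the text


def split_shutdowns(off_times_s, threshold_s=SELF_SHUTDOWN_MAX_OFF_TIME_S):
    """Split reboot durations into assumed self-shutdowns and user shutdowns.

    Events lasting longer than the threshold are filtered out as user-triggered;
    the remaining ones are assumed to be self-shutdowns.
    """
    self_shutdowns = [t for t in off_times_s if t <= threshold_s]
    user_shutdowns = [t for t in off_times_s if t > threshold_s]
    return self_shutdowns, user_shutdowns


# Made-up reboot durations in seconds, for illustration only.
sample_off_times = [80, 120, 360, 361, 30000]
assumed_self, assumed_user = split_shutdowns(sample_off_times)
```

A fixed threshold is a simple design choice: it will misclassify a user who switches the phone off and on again within a few minutes, which is one reason the remaining events are only assumed to be self-shutdowns.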

Freezes and Self-shutdowns. A total of 360 freezes and 471 self-shutdowns are reported by the logger. Based on this data we estimate the Mean Time Between Freezes (MTBFr) and the Mean Time Between Self-shutdowns (MTBS), in terms of wall-clock hours, averaged per single phone. The results show: MTBFr of 313 hours and MTBS of 250 hours. Hence, on average, a user experiences his/her phone freeze about every 13 days and the phone self-shutdown about every 10 days. These figures give an overall idea of today’s mobile phones user-perceived dependability. While these values are acceptable for everyday dependability requirements [16], they indicate potential limitations in using smart phones for critical applications. Captured Panic Events. Table 2 reports on the panic events recorded during the experiment. The panics are

Table 2. Collected panic events Panic KERN-EXEC

E32USERCBase

USER

Type

%

0

6.31

3

56.31 This panic is raised when an unhandled exception occurs. Exceptions have many causes, but the most common are access violations caused, for example, by dreferencing NULL. Among other possible causes are: general protection faults, executing an invalid instruction, alignment checks, etc.

15

0.51

This panic is raised when a timer event is requested from an asynchronous timer service, an RTimer, and a timer event is already outstanding. It is caused by calling either the At(), After() or Lock() member functions after a previous call to any of these functions but before the timer event requested by those functions has completed.

33

5.56

Raised by the destructor of a CObject. It is caused, if an attempt is made to delete the CObject when the reference count is not zero.

46

0.76

This panic is raised by an active scheduler, a CActiveScheduler. It is caused by a stray signal.

47

0.25

Category | Type | Frequency (%) | Meaning
KERN-EXEC | 0 | | This panic is raised when the Kernel Executive cannot find an object in the object index for the current process or current thread using the specified object index number (the raw handle number).
E32USER-CBase | 47 | | This panic is raised by the Error() virtual member function of an active scheduler, a CActiveScheduler. This function is called when an active object’s RunL() function leaves. Applications always replace the Error() function in a class derived from CActiveScheduler; the default behaviour provided by CActiveScheduler raises this panic.
E32USER-CBase | 69 | 10.10 | This panic is raised if no trap handler has been installed. In practice, this occurs if CTrapCleanup::New() has not been called before using the cleanup stack.
E32USER-CBase | 91 | 0.51 | Not documented
E32USER-CBase | 92 | 0.76 | Not documented
USER | 10 | 1.52 | This panic is raised when the position value passed to a 16-bit variant descriptor member function is out of bounds. It may be raised by the Left(), Right(), Mid(), Insert(), Delete() and Replace() member functions of TDes16.
USER | 11 | 5.81 | This panic is raised when any operation that moves or copies data to a 16-bit variant descriptor causes the length of that descriptor to exceed its maximum length. It may be caused by any of the copying, appending or formatting member functions and, specifically, by the Insert(), Replace(), Fill(), FillZ() and ZeroTerminate() descriptor member functions. It can also be caused by the SetLength() function.
USER | 70 | 0.76 | This panic is raised when attempting to complete a client/server request and the RMessagePtr is null.
KERN-SVR | 0 | 0.25 | This panic is raised by the Kernel Server when it attempts to close a Kernel object in response to an RHandleBase::Close() request. The panic occurs when the object represented by the handle cannot be found. The panic is also raised by the Kernel Server when it cannot find an object in the object index for the current process or current thread using the specified object index number (the raw handle number). The most likely cause is a corrupt handle.
ViewSrv | 11 | 2.53 | Occurs when one active object’s event handler monopolizes the thread’s active scheduler loop and the application’s ViewSrv active object cannot respond in time (the View Server monitors applications for activity/inactivity; if it thinks the application is in some kind of infinite loop state, it will close it. Clever use of Active Objects should help overcome this).
EIKON-LISTBOX | 3 | 0.25 | Occurs when using a listbox object from the eikon framework and no view is defined to display the object.
EIKON-LISTBOX | 5 | 0.76 | Occurs when using a listbox object from the eikon framework and an invalid Current Item Index is specified.
Phone.app | 2 | 0.25 | Not documented
EIKCOCTL | 70 | 0.25 | Corrupt edwin state for inline editing
MSGS Client | 3 | 6.31 | Failed to write data into asynchronous call descriptor to be passed back to client
MMFAudioClient | 4 | 0.25 | Appears when the TInt value passed to SetVolume(TInt) is 10 or more

classified according to their categories and types. The table also gives the relative frequency of each panic type (with respect to the total number of panics). In addition, a brief description (extracted from the Symbian OS documentation) of each panic category is given. The data on panic events provide an overall insight into the software defects that lead to application/system failures. The most frequent panics are due to access violations caused by dereferencing null pointers. In this case, the Symbian kernel executive terminates the offending application and signals a KERN-EXEC type 3 panic. Other frequent panic causes include invalid object indexes (KERN-EXEC type 0 panic), runtime errors related to heap management (causing E32USER-CBase panics), and copy operations causing a descriptor to exceed its maximum length (USER type 11 panic). These findings are consistent with our observations from the analysis of failure data reported in the public web forums and discussed earlier in this paper.

Further analysis of panic events reveals that in many cases (25%) a cascade of more than one panic event is recorded in the logs (see Figure 3). Since raising a panic is the last operation performed by an application or a system module (the kernel terminates it immediately afterwards), multiple panic events in short succession indicate error propagation within the operating system. The observable consequence of this phenomenon is the termination of multiple applications.
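To illustrate how cascades of panics could be extracted from a log, the following Python sketch groups panic timestamps into cascades and reports the share of cascades of each size. It is only an illustration under stated assumptions: the function names and the 60-second gap threshold (max_gap_s) are hypothetical, since the text only speaks of panics occurring "in short succession" and does not give the grouping rule used by the logger.

```python
from collections import Counter


def cascade_sizes(panic_times, max_gap_s=60.0):
    """Group panic timestamps (in seconds) into cascades.

    Two consecutive panics belong to the same cascade when they are at
    most max_gap_s apart. The threshold is a hypothetical choice.
    Returns the size of each cascade, in chronological order.
    """
    sizes = []
    current = 0
    last = None
    for t in sorted(panic_times):
        if last is not None and t - last <= max_gap_s:
            current += 1          # same cascade as the previous panic
        else:
            if current:
                sizes.append(current)
            current = 1           # start a new cascade
        last = t
    if current:
        sizes.append(current)
    return sizes


def cascade_distribution(panic_times, max_gap_s=60.0):
    """Return {cascade size: percentage of cascades with that size}."""
    sizes = cascade_sizes(panic_times, max_gap_s)
    counts = Counter(sizes)
    return {size: 100.0 * n / len(sizes) for size, n in sorted(counts.items())}


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    times = [0, 10, 20, 1000, 5000, 5030]
    print(cascade_sizes(times))
    print(cascade_distribution(times))
```

A cascade of size one is an isolated panic; larger sizes correspond to the multi-panic cases discussed above.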

Figure 3. Distribution of subsequent panics (axes: % of panics vs. no. of subsequent panics)

Figure 4. Panics and HL events coalescence scheme (timeline with temporal windows: a panic coalesced with a freeze, an isolated panic, and an isolated self-shutdown)
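The coalescence scheme of Figure 4 pairs each panic with the freeze and self-shutdown events recorded within a temporal window around it (five minutes in this study). A minimal Python sketch of such a procedure is given here. The function name, the tuple layout of the events, and the choice of a symmetric window (events before or after the panic) are assumptions made for illustration; the paper does not specify these details.

```python
WINDOW_S = 5 * 60  # five-minute window, the size selected in the study


def coalesce(panics, hl_events, window_s=WINDOW_S):
    """Relate panics to high-level (HL) events.

    panics:    list of (timestamp_s, category) tuples
    hl_events: list of (timestamp_s, kind) tuples, kind being
               "freeze" or "self-shutdown"
    A panic and an HL event are coalesced when their timestamps differ
    by at most window_s seconds (symmetric window: an assumption).
    Returns (related, isolated_panics, isolated_hl).
    """
    related = []
    isolated_panics = []
    matched_hl = set()
    for p_time, category in panics:
        hits = [i for i, (h_time, _) in enumerate(hl_events)
                if abs(h_time - p_time) <= window_s]
        if hits:
            matched_hl.update(hits)
            related.append((category, [hl_events[i][1] for i in hits]))
        else:
            isolated_panics.append(category)
    isolated_hl = [kind for i, (_, kind) in enumerate(hl_events)
                   if i not in matched_hl]
    return related, isolated_panics, isolated_hl


if __name__ == "__main__":
    # Made-up events, for illustration only.
    panics = [(100, "KERN-EXEC"), (10_000, "USER")]
    hl_events = [(250, "freeze"), (50_000, "self-shutdown")]
    print(coalesce(panics, hl_events))
```

Re-running the procedure with different values of window_s is one way to study how the number of coalesced events depends on the window size.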

Panics and High Level Events. From the collected data we can infer the relationship between panics and high-level (HL) events, e.g., freezes and self-shutdowns. To this end, we correlate panic events with freeze and self-shutdown events as depicted in Figure 4. When a panic is found in the Log File, we search for freeze and self-shutdown events within a predefined temporal window. As indicated in Figure 4, there can be panic events that do not relate to HL events, as well as isolated HL events. The temporal window for grouping the events must be selected carefully to avoid misinterpreting the results. Analysis of the collected data shows that the number of coalesced events increases for window sizes up to five minutes. A further increase in the number of coalesced events is observed only for much larger temporal windows (of the order of hours), which indicates that those additional coalesced events are most likely uncorrelated. For these reasons, we fix the temporal window size at five minutes.

Figure 5 shows the results of this coalescence procedure (including the distribution of isolated panics, i.e., those panics which cannot be related to any HL event; see footnote 5). The results show that more than half of the recorded panics (51%) are related to HL events. Given the relatively small number of HL events (one every 11 days), these relationships cannot be just a coincidence. Furthermore, if we include all shutdown events recorded in the logs (about a 300% increase in the number of events, from 471 to 1778 shutdown events), the percentage of panics related to HL events increases to 55%, i.e., by only 4 percentage points. This also confirms our previous observation that the shutdown events which we filtered out from the data analysis are user-triggered shutdowns.

5 These panic events most likely relate to output failures, which our failure logger (in its current implementation) is not able to collect.

Figure 5. Panics and HL events: a) across all events, b) details with respect to freeze and self-shutdown events

Figure 5a also shows panic categories (EIKON-LISTBOX, EIKCOCTL, MMFAudioClient, and KERN-SVR) which do not manifest as HL events. The first three are typical application panics, concerning the view or audio streaming. This indicates good OS resilience with respect to application panics. More frequent system panics, such as KERN-EXEC, E32USER-CBase, USER, and ViewSrv, usually lead to an HL event. Depending on the component that caused the panic, (i) the phone can crash if the panic is raised by a critical system server, or (ii) the phone keeps working properly once the offending application is terminated by the kernel. As a further observation, there are panics, e.g., Phone.app and MSGS Client, which always cause a self-shutdown. These two panics correspond to core applications provided by the phone; hence, the OS kernel always reboots the phone if either of these applications fails.

Figure 5b details the relationship between specific panic events and HL events (freezes and self-shutdowns). The data enable identifying panic categories which are symptomatic of freezes, e.g., heap management (E32USER-CBase), USER, ViewSrv, and KERN-EXEC (type 0) panics. On the other hand, access violation-related panics

Table 3. Panic-activity relationship categ. act.

type

E32USERCBase

KERNEXEC

MSGS Phone. View USER Client app Srv

33

47

0

3

3

2

message

1.10

.

.

4.41

.

1.10

.

Voice call

6.62 1.10

.

17.3

.

.

9.56

9.19

.

.

.

0.37 40.4

All

11 .

6.62

4.04 38.6 .

54.8

Percentage

unspecified 4.78

11

(KERN-EXEC type 3) can trigger both phone freezes and self-shutdowns.

Phone Activity at Panic Time. Table 3 reports the user activity at the time of the panic, in terms of voice calls and text messages (the only activities registered in Symbian's Database Log Server). Only panics which lead to an HL event are considered in this analysis. Interestingly, about 45% of panics are recorded when the user performs real-time activities, e.g., a voice call or sending/receiving a short message. This confirms our earlier observation (based on failure data from the web forums), which indicates the presence of interference between various applications/system modules. In other words, this is also a symptom of insufficient isolation (to prevent error propagation) between real-time and time-sharing modules. Thus, more effort should be directed at enhancing the isolation between the two types of system modules. Also, there are panics, such as USER and ViewSrv, which are triggered only while a voice call is performed. Similarly, there are panics, e.g., Phone.app, which manifest only when a short message is sent/received.

The Running Application Detector allowed us to collect the set of running applications at the time of the panic. Interestingly, often only one user application is found to be running at panic time, as can be observed in Figure 6. This indicates, somewhat counterintuitively, that concurrent execution of multiple applications does not necessarily lead to more frequent panics.

Figure 6. Distribution of the number of running applications at panic time (x-axis: no. of apps at panic time)

Table 4 summarizes the relationship between panics and running applications. Only cases with a significant percentage are taken into account, covering 53% of the total number of panics. The rows correspond to HL events and panic categories. The columns indicate applications which execute at the time of a panic. The number reported in each cell of the table is a percentage of the total number of panics, e.g., the Clock application is present in 3.2% of all recorded KERN-EXEC panics which lead to a freeze. Consistent with our findings from the web forums, the Message application is one of the main panic causes. Other potential dependability bottlenecks are the camera, the Bluetooth browsing tool, and the log of incoming/outgoing calls. The table also gives an insight into the applications which, even when panicking, do not cause HL events.

7 Conclusions and Lessons Learned

This work presented a measurement-based failure analysis of mobile phones. A dedicated logger has been implemented to gather failure-related information on SymbianOS-based smart phones. Failure data has been collected from 25 phones over a period of 14 months. Key findings indicate that: (i) The majority of kernel exceptions are due to memory access violation errors and heap management problems (despite the adoption of the micro-kernel model in the Symbian design and the provision of advanced memory management facilities). This is consistent with our initial analysis of failure data on hand-held devices obtained from publicly available web forums, which pinpoints memory leaks as one of the main causes of failures. (ii) Similarly, analysis of the data collected by the logger and of the data from the web forums shows that the majority of failures occur when the user performs real-time tasks, e.g., a voice call or sending/receiving a text message. This indicates the need to strengthen the isolation between interactive and real-time tasks. (iii) Users experience a failure (freeze or self-shutdown) every 11 days, on average. Since these figures are obtained from a single study, more data and further analysis are needed before generalizing the results. Future effort will focus on: (i) conducting experiments on a larger set of phones, including other platforms, e.g., MS Windows, and (ii) enhancing the logging mechanism to enable capturing output failures (this may require the involvement of users).
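As an illustration of how an average failure interval such as "one failure every 11 days" can be derived from logged data, the sketch below divides the total observation time by the total number of failures (freezes and self-shutdowns) across all phones. This pooled estimator, the function name, and the input values are assumptions for illustration only; the exact estimator used in the study is not restated here.

```python
def mean_days_between_failures(phones):
    """Pooled average number of days between failures.

    phones: iterable of (observed_days, failure_count) pairs, one per
    phone. Returns None when no failure was observed.
    """
    phones = list(phones)
    total_days = sum(days for days, _ in phones)
    total_failures = sum(count for _, count in phones)
    if total_failures == 0:
        return None
    return total_days / total_failures


if __name__ == "__main__":
    # Hypothetical observation periods and failure counts for two phones.
    print(mean_days_between_failures([(110, 10), (220, 20)]))
```

A pooled estimate weights phones by how long they were observed; averaging per-phone intervals instead would weight every phone equally and may give a different figure.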

8 Acknowledgments

This work has been supported in part by the University of Naples Federico II - Ufficio Programmi Internazionali, by the Italian Ministry for Education, University, and Research (MIUR) in the framework of the PRIN Project “COMMUTA: Mutant hardware/software components for dynamically reconfigurable distributed systems”, and by the Motorola Corporation as part of the Motorola Center at the University of Illinois at Urbana-Champaign, USA. We also thank

Table 4. Panic-running applications relationship

No HL event

.

0.28 0.90 1.02

.

.

.

.

.

.

0.18

.

.

.

.

6.39

.

.

3.20

.

.

.

.

.

.

.

.

.

E32USER-CBase 0.38 6.39

.

.

.

.

.

0.26

.

.

.

0.26

.

.

.

EIKCOCTL

.

.

.

.

.

.

.

0.13

.

.

.

.

.

.

.

EIKON-LISTBOX

.

.

0.26

.

.

.

.

.

.

.

.

.

.

.

.

.

.

KERN-EXEC

6.78 0.26

.

1.66 1.28

TomTom

Telephone

.

.

Clock Log

Messages Contacts

.

.

1.02 1.28

FExplorer

battery

.

.

3.20 3.20

Contacts

.

.

Log Contacts

.

.

BT_Browser Log Teleph.

.

.

KERN-EXEC

KERN-EXEC SelfShutdown MSGS Client

Log Telephone

Camera Log Telephone

Freeze

Panic category

Clock

Messages Log

HL event

Log

Messages 0.51

Application

1.02 1.15 2.56 1.53 1.28 0.89 0.38 0.26

USER

.

.

.

.

.

.

3.07

.

0.38

.

.

.

.

.

ViewSrv

.

.

0.13

.

.

0.13

.

.

.

.

.

.

.

.

Total 8.18 6.91 6.78 5.50 4.48 3.32 3.07 3.07 2.94 2.56 1.53 1.53 1.35 1.28 1.28

Paolo Ascione for excellent work on the implementation of the logger and Daniel Chen for help in the collection of the failure data.

