A Sequential Data Preprocessing Tool for Data Mining


Zailani Abdullah 1, Tutut Herawan 2,3, Haruna Chiroma 2, and Mustafa Mat Deris 4

1 School of Informatics & Applied Mathematics, Universiti Malaysia Terengganu, Gong Badak, Kuala Terengganu, Malaysia
2 Faculty of Computer Science & Information Technology, University of Malaya, 50603 Pantai Valley, Kuala Lumpur, Malaysia
3 AMCS Research Center, Yogyakarta, Indonesia
4 Faculty of Science Computer and Information Technology, Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, Johor, Malaysia

[email protected], [email protected], [email protected], [email protected]

Abstract. A sequential dataset is a collection of records written and read in sequential order. Information from a sequential dataset is very useful for understanding sequential patterns and, ultimately, for making appropriate decisions. However, generating a sequential dataset from a log file is quite complicated and difficult. Therefore, in this study we propose a Sequential Preprocessing Model (SPM) and a Sequential Preprocessing Tool (SPT) for generating sequential datasets. We evaluated the performance of the developed model against the log activities captured from myLearn, the e-Learning System of Universiti Malaysia Terengganu (UMT). The results show that SPT can be used to generate a sequential dataset, and that with minimal modification the dataset can be used by other data mining tools for further sequential pattern analysis.

Keywords: Sequential dataset; Data preprocessing; Data mining; Tool.

1 Introduction

Pattern and association rule mining is one of the important topics in data mining. To date, it has been actively and widely studied [1-8] in various application domains. Sequential pattern mining, however, is somewhat different because it deals with discrete, ordered events. A sequential pattern can be defined as a consecutive or non-consecutive ordered subset of an event sequence [9]. Sequential pattern mining is an important task with a broad range of applications [10]. In the last ten years there have been many studies on sequential patterns and their applications. Although mining the complete set of sequential patterns has improved substantially, in many cases sequential pattern mining still faces tough challenges in both effectiveness and efficiency. On the one hand, there can be a large number of sequential patterns in a large database, while a user is often interested in only a small subset of them; presenting the complete set of sequential patterns may make the mining result hard to understand and hard to use [11]. Examples of application domains for sequential patterns are medicine, telecommunications, the World Wide Web [12], and so on. In educational contexts, the event sequences commonly correspond to the actions of individual students or groups of students logged by an e-Learning System.

A dataset is a collection of related sets of information that is composed of separate elements but can be manipulated as a unit [13]. A sequential dataset, in turn, corresponds to a collection of records that are written and read in sequential order from beginning to end. To date, educational data mining has focused specifically on extracting patterns for analyzing and understanding student behavior [14]. Indeed, data mining has come to be considered an integrated and functional part of e-Learning Systems, and it therefore requires adequate techniques to effectively retrieve and analyze the data.

A dataset can be derived from several data sources, including log files. A log file contains a list of events that go in and out of a particular server; its main purpose is to keep track of what is happening at the server. Most log files are saved in flat-file format, but certain application systems (e.g., e-Learning Systems) can also be configured to channel the events into a database system. In the educational context, the information in log files is very important because it can reveal regularities and deviations in groups of students. Moreover, it can provide educators with more information about learners' behavior and give recommendations on how to handle deviation cases [15].

In order to produce a valid dataset, data preprocessing becomes a necessary component. In data mining, preprocessing is considered the most important technique for transforming raw data into an understandable format for further processing. Several steps take place in data preprocessing, such as data cleaning, data integration, data transformation, data reduction, and data discretization. This technique is very important because it helps solve the problems of incomplete, noisy, and inconsistent data. In fact, data preprocessing of web log files plays an important role in web usage mining before the complete set of sequential patterns is produced [16].

Educational Data Mining (EDM) refers to techniques, tools, and research designed for automatically extracting meaning from data repositories based on learning activities in educational settings. E-Learning is among the most popular sources of data repositories in EDM. Generally, an e-Learning System is referred to as a Learning Management System (LMS), Course Management System (CMS), Learning Content Management System (LCMS), Managed Learning Environment (MLE), Learning Support System (LSS), or Web Based Training System (WBT-System) [17]. These systems collect a lot of information that can be further processed to analyze the behavior of students and educators. E-Learning systems facilitate communication between students and educators, sharing resources, producing content material, preparing assignments, and conducting online tests [18].

From the literature, the generation of sequential datasets from log files is a very interesting topic because such datasets can be used as input for further analysis by data mining tools. However, when dealing with discrete events, the data preprocessing task becomes more complicated and difficult.
Therefore, in order to mitigate these problems, we propose a Sequential Preprocessing Model (SPM) and a Sequential Preprocessing Tool (SPT) in the context of an e-Learning System. The contributions of this paper are as follows. First, we comprehensively study the preprocessing techniques for sequential patterns. Second, we propose the Sequential Preprocessing Model (SPM) and develop the Sequential Preprocessing Tool (SPT). Third, we evaluate the performance of the developed model against the log activities captured from UMT's e-Learning System, myLearn.

The remainder of this paper is organized as follows. Section 2 describes the related work. Section 3 presents some definitions and the proposed model. This is followed by the results and discussion in Section 4. Finally, the conclusions of this work are reported in Section 5.

2 Related Works

Data preprocessing is one of the important steps in the data mining process, and many works have been devoted to preprocessing the data in log files. However, only a few techniques, tools, and algorithms have been developed that focus on preprocessing log files of student activities.

Wahab et al. [19] elaborated the preprocessing techniques involved in extracting IIS Web Server logs before they can be fed into data mining algorithms. Data preprocessing acts as a filter, and only the appropriate information is extracted from the log file. In their experiment, the raw log files were collected from Portal Pendidikan Utusan Malaysia, popularly known as Tutor.com. Castellano et al. [20] proposed LODAP (log data preprocessor), a tool that performs preprocessing of log files. LODAP takes as input a log file related to a website and generates as output a database containing the pages visited by users and the identified user sessions. The tool can reduce the size of the web log file and group all web requests into a number of user sessions. Salama et al. [21] introduced a new approach to preprocessing web log files for web intrusion detection. The steps in this process differ in several respects from those of web usage mining, mainly in the combination of log files, user identification, session identification, and the post-preprocessing stage. In this approach, two algorithms are employed to combine log files in W3C format and NCSA format into a single file in XML format; the XML file then becomes the input to the mining algorithms rather than a relational database. Yan Li [22] suggested a path completion algorithm and an implementation of data preprocessing for web usage mining. A referrer-based method is employed to append the missing pages in user access paths, and the reference length of pages is modified according to the estimated average reference length of auxiliary pages. The algorithm appends the lost information and thus improves the reliability of the access data for further calculations in web usage mining. Patil et al. [23] focused on the first two parts of preprocessing, namely field extraction and data cleaning. Two algorithms are specifically designed to clean the raw web log files and finally insert them into a relational database: the field extraction algorithm extracts the web logs collected from the web server, and the data cleaning algorithm cleans the web logs and removes redundant information. Zhang et al. [24] proposed a new hybrid algorithm to perform data preprocessing in web log mining based on the Hadoop cluster framework. Hadoop is a distributed system infrastructure developed by the Apache Foundation; it is a software platform that can run on and analyze large-scale data. Their experimental results show that the improved data pretreatment algorithm can improve the efficiency of web data mining. Valsamidis et al. [25] introduced a methodology for analyzing LMS courses and students' activity. It consists of three main steps: a logging step, a preprocessing step, and a clustering step. The first step logs specific data from the e-learning platform, considering only the fields courseID, sessionID, and Uniform Resource Locator (URL), using an Apache module. The second step filters the recorded data by detecting outliers and removing extreme values. The third step applies a clustering method, namely the Markov Clustering algorithm (MCL), to separate users into different groups according to their usage patterns. Blagojevic [26] proposed Data Mining Extensions (DMX) queries for mining student data from an e-Learning System. Its main aim is to understand students' behavior more closely and to plan classes so as to maximize students' efficiency. Three phases are involved, namely data selection, preprocessing, and OnLine Analytical Processing (OLAP): the first phase retrieves data from the Moodle server, the second removes entries that contain errors, and the third generates the cube and dimensions. After these phases, DMX queries are employed to perform the analysis. Romero et al. [27] point out that, in most e-Learning systems, all the pages accessed by students are saved in log files (either one log file per student or one big log file for everyone) that contain all the information about the students' interaction with the system. Therefore, after preprocessing this information, it is possible to discover sequential patterns from these log files by using data mining algorithms.

Sequential mining was first proposed by Agrawal and Srikant [17]. They designed an Apriori-based algorithm called Generalized Sequential Patterns (GSP) to mine all sequential patterns based on a minimum support threshold, and many Apriori-based methods have since been proposed to mine sequential patterns. Han [28] introduced the Frequent pattern-projected Sequential pattern mining (FreeSpan) method, which integrates the mining of frequent sequences with the use of projected sequence databases. Besides mining the complete set of patterns, FreeSpan also reduces candidate subsequence generation and thus outperforms the Apriori-based GSP algorithm of Srikant and Agrawal [29]. Pei et al. [30] suggested the Web access pattern tree (WAP-tree) structure and the WAP-Mine algorithm to efficiently mine access patterns from web logs. Zaki [31] proposed the Sequential PAttern Discovery using Equivalence classes (SPADE) algorithm for fast discovery of sequential patterns; SPADE utilizes combinatorial properties to decompose the original problem into smaller sub-problems using lattice search techniques and simple join operations. Pei et al. [32] proposed the Prefix-projected Sequential pattern mining (PrefixSpan) method to efficiently mine sequential patterns by introducing ordered growth while at the same time reducing the projected database; in most cases, PrefixSpan outperforms the Apriori-based GSP algorithm, FreeSpan, and SPADE due to its lower memory consumption. Shie et al. [33] suggested UMSP to mine high utility mobile sequential patterns; it searches for all patterns within an MTS-Tree structure. Ahmed et al. [34] proposed the UWAS-tree (utility-based web access sequence tree) and the IUWAS-tree (incremental UWAS-tree) for mining web access sequences in static and dynamic databases, respectively; their extensive performance analyses show that the approach is very efficient for both static and incremental mining of high utility web access sequences.
Yin [35] introduced USpan, which uses a lexicographic quantitative sequence representation to extract high utility sequences, and designed concatenation mechanisms with two effective pruning strategies. USpan can identify high utility sequences in large-scale data even with low minimum utility thresholds.

3 Proposed Method

There are four major components involved in generating the sequential dataset from the MySQL database. All components are interrelated, and the process flows in one direction only. The dataset produced by this model is in flat-file format. A complete overview of the Sequential Preprocessing Model (SPM) is shown in Figure 1.

[Figure 1 depicts the SPM pipeline: Log File (MySQL) → Features Extraction → Features Mapping → User Patterns Determination → Outlier Elimination → Sequential Patterns Generation → Sequential Dataset File, with Threshold Duration and the Mapped File as supporting inputs and outputs.]

Fig. 1. The Overview of Sequential Preprocessing Model (SPM)

Log File

All of the students' activities in myLearn (the e-Learning System at Universiti Malaysia Terengganu) are automatically stored in a MySQL database rather than in the typical text-based server log files. The log activities contain a great amount of information that reflects the students' learning process and academic performance. In the context of myLearn, the crucial sources of information are the login timestamps and the requested pages. A timestamp is a sequence of characters identifying the time at which an event occurred.

Features Extraction

An SQL statement is executed to extract only the relevant information from the MySQL database. There are about 204 tables in myLearn; the twelve main tables are mdl_config, mdl_course, mdl_course_categories, mdl_course_modules, mdl_log, mdl_course_sections, mdl_log_display, mdl_modules, mdl_user, mdl_user_admins, mdl_user_students, and mdl_user_teachers. These tables are called the meta-tables. The most important table is mdl_log, which contains the attributes id, time, userid, course, module, action, url, and info. The desired attributes are obtained by merging and joining the different tables using SQL syntax. The most important attributes employed are id, userId, time, and action.
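As an illustration only, the following Python sketch shows how this extraction step could look. The paper names the mdl_log columns (id, time, userid, course, module, action, url, info) but does not publish the actual SQL, so the single-table query, the course filter, and the connection handling below are assumptions rather than the authors' SPT code.

# Minimal sketch of the Features Extraction step: pull (id, userid, time, action)
# for one course from the Moodle-style mdl_log table. The real SPT joins several
# meta-tables; those joins are not reproduced here.
import mysql.connector  # assumes MySQL Connector/Python is installed

EXTRACT_SQL = """
    SELECT id, userid, time, action
    FROM mdl_log
    WHERE course = %s
    ORDER BY userid, time
"""

def extract_features(course_id, db_config):
    """Return (id, userid, time, action) rows ordered by user and timestamp."""
    conn = mysql.connector.connect(**db_config)  # e.g. host, user, password, database
    try:
        cur = conn.cursor()
        cur.execute(EXTRACT_SQL, (course_id,))
        return cur.fetchall()
    finally:
        conn.close()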

Features Mapping

Each unique value of the attribute action is extracted and assigned a unique number. The attributes id, userId, time, and action are transferred into a temporary table, and the attribute action is mapped to its previously assigned number. All series of actions (now replaced by numbers) belonging to the same userId are sorted according to the time (timestamp), and actions with the same date are grouped together. This process continues until the last userId. Table 1 presents the mapping numbers of all actions. A minimal code sketch of this step and of the pattern formation step is given at the end of this section.

User Patterns Formation

The sequences of actions (patterns) for each userId are generated based on the date and on the minimum and maximum durations. For each userId, a set of action sequences is formed and written on a single line; that is, a single transaction line represents the set of action sequences performed by a student in one semester for one subject. Each pattern must comply with the specified threshold durations. After the complete set of patterns for a userId has been generated, the transaction is appended to a beta version of the sequential dataset.

Outlier Elimination

Typically, the log file may contain many redundant activities (actions) within a very short period of time, and these activities may occur many times in the same pattern. Another issue is that a pattern may contain only a single activity. These are among the outliers that are removed from the beta version of the sequential dataset. Outlier determination and elimination are very important because outliers can change the results of later data analysis.

Sequential Patterns Generation

The final sequential dataset (starting with #data in its first line) and the mapped actions file (starting with #map in its first line) are produced in this phase. The dataset is generated once all the outliers have been removed from the previous beta version of the dataset. The mapped actions file records each assigned number together with the corresponding action; it can be used to interpret the actual action values during data analysis.

Sequential Dataset

The final sequential dataset and mapped actions file are now ready to be used. Since few data mining tools are available to process these inputs, a flexible tool will be developed in the near future to perform sequential analysis based on this data format. The complete SPM algorithm is shown in Figure 2.
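To make the Features Mapping and User Patterns Formation steps concrete, here is a minimal Python sketch. It assumes that the extracted rows are (id, userId, time, action) tuples with Unix timestamps, that one candidate sequence is formed per user per calendar day, and that the duration thresholds are given in seconds; these names and grouping rules are illustrative assumptions, not the published SPT implementation.

# Minimal sketch of Features Mapping and User Patterns Formation.
# Assumptions (not from the paper): rows are (id, userid, time, action) tuples,
# time is a Unix timestamp, and one sequence is formed per user per calendar day
# whose total duration lies within [dur_min, dur_max] seconds.
from collections import defaultdict
from datetime import datetime


def map_actions(rows):
    """Assign a unique integer code to every distinct action value and
    replace the action strings with their codes."""
    action_map = {}                                  # action name -> unique number
    mapped = []
    for rec_id, userid, ts, action in rows:
        code = action_map.setdefault(action, len(action_map) + 1)
        mapped.append((rec_id, userid, ts, code))
    return action_map, mapped


def form_user_patterns(mapped_rows, dur_min, dur_max):
    """Group mapped actions per userId and per date, keeping only groups whose
    duration (last timestamp minus first) lies within the thresholds.
    SPT then writes all of a user's surviving sequences on a single transaction
    line; that serialization is not specified in this excerpt."""
    by_user_day = defaultdict(list)                  # (userid, date) -> [(ts, code), ...]
    for _, userid, ts, code in mapped_rows:
        day = datetime.fromtimestamp(ts).date()
        by_user_day[(userid, day)].append((ts, code))

    patterns_per_user = defaultdict(list)            # userid -> list of dated sequences
    for (userid, _day), events in sorted(by_user_day.items()):
        events.sort()                                # order actions by timestamp
        duration = events[-1][0] - events[0][0]
        if dur_min <= duration <= dur_max:
            patterns_per_user[userid].append([code for _, code in events])
    return patterns_per_user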

Fig. 2. The SPM Algorithm (a 19-line pseudocode listing, of which only the header survives in this copy):

Input: Logfile, DurMin, DurMax, CourseId
Output: SequentialPatterns, MappingActions
1: for all Logfile do uAction …
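Because the Figure 2 listing is incomplete in this copy, the remaining steps, Outlier Elimination and Sequential Patterns Generation, are sketched below in the same illustrative Python style. Treating immediate repetitions and single-action patterns as the outliers, and writing whitespace-separated #data and #map files, are assumptions about details the excerpt leaves unspecified, not the authors' exact format.

# Minimal sketch of Outlier Elimination and Sequential Patterns Generation.
# Assumptions (not from the paper): patterns are lists of integer action codes,
# outliers are immediate repetitions and single-action patterns, and the output
# files are whitespace-separated. The real SPT format may differ.
from itertools import groupby


def remove_outliers(patterns):
    """Drop immediate repetitions inside each pattern and discard
    patterns that contain only a single activity."""
    cleaned = []
    for pattern in patterns:
        deduped = [code for code, _ in groupby(pattern)]  # collapse repeated codes
        if len(deduped) > 1:                              # single-activity patterns are outliers
            cleaned.append(deduped)
    return cleaned


def write_outputs(patterns, action_map, data_path="sequential.data",
                  map_path="actions.map"):
    """Write the final sequential dataset (#data) and mapped actions file (#map)."""
    with open(data_path, "w") as f:
        f.write("#data\n")
        for pattern in patterns:
            f.write(" ".join(str(code) for code in pattern) + "\n")
    with open(map_path, "w") as f:
        f.write("#map\n")
        for action, code in sorted(action_map.items(), key=lambda kv: kv[1]):
            f.write(f"{code} {action}\n")


if __name__ == "__main__":
    # Toy usage: two patterns, one of which collapses to a single action and is dropped.
    toy_patterns = [[1, 1, 2, 3, 3, 2], [4, 4, 4]]
    toy_map = {"course view": 1, "resource view": 2, "quiz attempt": 3, "forum view": 4}
    final_patterns = remove_outliers(toy_patterns)        # -> [[1, 2, 3, 2]]
    write_outputs(final_patterns, toy_map)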