NoSQL for Census Data Analysis

May 28, 2017 | Author: J. Ijcsis | Category: Computer Science, Computer Engineering


Vol. 14 ICETCSE 2016 Special Issue International Journal of Computer Science and Information Security (IJCSIS) ISSN 1947-5500 [https://sites.google.com/site/ijcsis/]


M. Akka Lakshmi

G. Victor Daniel

D. Srinivasa Rao

ABSTRACT: Data is growing exponentially, with terabytes generated daily by social networks and millions of e-mails. Enterprises also generate voluminous data in the form of documents. Meaningful insights that help organizations improve their business can be derived from this huge amount of data. ‘NoSQL’ technologies provide a high-performance, scalable approach to analyzing large and non-structured datasets. In this paper, we analyze Census data using various ‘NoSQL’ technologies to gain insights into the workforce available in various states and age groups of India as per Census-2011.

Keywords- Big data, Non-structured data, NoSQL

I INTRODUCTION

The amount of data produced by mankind from the beginning of time until 2003 was only 5 billion gigabytes, but the same amount was created every 2 days in 2011 and every 10 minutes in 2013. It is estimated that by 2020 the total data will be about 35 ZB, with China accounting for more than 1/5th of it. It is also estimated that about 1/3rd of all data will reside in or pass through the cloud by 2020. Knowingly or unknowingly, every one of us is contributing to this voluminous data. There are more than a billion internet users generating zettabytes of internet traffic every day in the form of millions of e-mails, millions of blogs and hundreds of websites being created every minute. Users upload 48 hours of new video every minute. Social networks like Facebook, Twitter and LinkedIn also contribute terabytes of data daily. In addition, business organizations generate and accumulate many documents and other transactional data.

Traditionally, organizations store and preserve only structured data and use it for analysis to take business decisions. But this constitutes only a small portion of the total data, about 10 to 15%. A large percentage of enterprise data is text and multimedia data in semi-structured or unstructured form. Of the total data, about 85% to 90% is non-structured and cannot be processed by traditional database systems. Analyzing a small portion of the data may not provide precise insights into trends in the data sets. Large portions of data in unstructured or semi-structured form go unused. Usually organizations retain this data for a specific period of time and then dispose of it. If this data is also considered for analysis, it leads to more accurate results. Analysis of large data sets gives more meaningful long-term correlations among data sets. Such accurate analyses help organizations improve their business, which in turn generates more revenue.

In this paper, section II discusses the need for non-relational databases and popular NoSQL databases, section III analyzes the census data using Hive and Neo4J, and section IV is the conclusion.

II NON-RELATIONAL DATABASES

A. Need of Non-Relational databases

Use of the Internet has increased. Developments in web technologies and the proliferation of social networks are contributing to exponential growth of data. Most of this data is in non-structured form. Relational databases cannot handle data beyond a certain size and can only process structured data. The following challenges of the relational model have driven the emergence of new data models: (i) impedance mismatch: the way data is represented in relational databases is different from the way it is represented in memory; (ii) scalability: as the data size grows, we have to scale out rather than scale up; (iii) single points of failure. This calls for a technology that can store and process large datasets in structured and non-structured form while giving good performance and scalability and avoiding single points of failure. This has given rise to the development of NoSQL databases.
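The impedance mismatch named in (i) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the record layout, the placeholder state name and the counts are hypothetical and do not come from the Census data. The same in-memory aggregate is either flattened into separate relational-style rows or stored whole as one document, as an aggregate-oriented NoSQL store would do.

```python
import json

# Hypothetical in-memory aggregate: one state's worker record as the
# application sees it. Name and counts are placeholders, not Census data.
state_record = {
    "state": "ExampleState",
    "workers": [
        {"category": "Main", "age_group": "20-29", "count": 120},
        {"category": "Marginal", "age_group": "20-29", "count": 45},
    ],
}


def to_relational_rows(record):
    """Flatten the nested aggregate into two 'tables' of flat tuples."""
    state_rows = [(1, record["state"])]  # (state_id, name)
    worker_rows = [
        (1, w["category"], w["age_group"], w["count"])  # (state_id, ...)
        for w in record["workers"]
    ]
    return state_rows, worker_rows


def to_document(record):
    """Serialize the aggregate as a single document (one unit of storage)."""
    return json.dumps(record, sort_keys=True)


state_rows, worker_rows = to_relational_rows(state_record)
document = to_document(state_record)
```

In the relational form the application has to reassemble the record with a join on the state id; in the document form it reads back the same nested structure it wrote.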

Proceedings of 3rd International Conference on Emerging Technologies in Computer Science & Engineering (ICETCSE 2016) V. R. Siddhartha Engineering College, Vijayawada, India, October 17-18, 2016


B. A number of technological advancements and user needs have contributed to the development of NoSQL technologies for analyzing big data sets: (i) a dramatic decrease in storage costs: storage capacity has roughly doubled every 40 months since the 1980s [2], and organizations have started storing and retaining data for long periods; (ii) an increase in processing power; (iii) the emergence of data centers and cloud computing, which provide flexibility for storage and computing; (iv) current workloads demand scale-out, not scale-up; (v) today's data is large and unstructured. Many proprietary and open source NoSQL databases have emerged. Popular ones include Google’s Bigtable, Facebook’s Cassandra, Amazon’s DynamoDB, MongoDB and HBase.

C. Big data analytics

NoSQL databases are suited for Big Data analytics. Big data is characterized by (i) large Volume; (ii) Variety in terms of data sources and data types, such as structured, semi-structured and unstructured data and data generated by machines and networks; (iii) Velocity: the pace at which data is being generated. Atomic reactors generate 40 TB per second, and 640 TB of data is generated for one flight; (iv) Veracity and Validity: data to be analyzed should be clean and valid for the given application so that appropriate decisions can be made. Organizations should keep dirty data from accumulating. Big data analytics finds applications in many areas, such as predicting customer behavior, health care, developing security systems for crime and fraud detection, remote sensing data for weather prediction, and agricultural yield estimation to achieve food security.

Popular use cases of big data technologies include (i) security: intrusion and fraud detection and developing spam filters; (ii) resource optimization for internet and other organizational resources; (iii) medical data analysis to predict and prevent the spread of diseases and to predict patient readmissions; (iv) the manufacturing sector, to determine the optimal time for repair or replacement of machines; (v) the retail sector, for building recommendation systems and showing relevant ads to customers; (vi) agriculture, to estimate and predict yield, land cover and usage patterns so as to improve farming practices and achieve food security [3].

In big data analysis, social networks like Facebook, Twitter and LinkedIn play a key role through the information and messages posted by users. This information is used (i) by customer product companies to learn about product feedback, product defects, user preferences and customer behavior; (ii) by advertising and marketing agencies to understand responses to their campaigns and promotions; (iii) by sports teams to track ticket sales and learn team strategies; and (iv) to predict election results.

... Processing [7] and also expressing graph computations [5].

III DATA ANALYSIS

Census-2011 data pertaining to workers in India has been analyzed using Hadoop Hive and Neo4J, with the following results.

TABLE I.

Workers in India

Workers Category     Male      Female    Rural     Urban
Main workers         75.35%    24.65%    67.81%    32.19%
Marginal workers     49.22%    50.78%    86.22%    13.78%
Non-workers          39.97%    60.04%    66.53%    33.47%
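The paper does not list the Hive queries behind Table I. As a rough illustration of the kind of grouped aggregation that yields such percentage shares, the sketch below uses Python's built-in sqlite3 module as a stand-in for Hive. The table name, column names and counts are hypothetical placeholders, not Census figures, and the SQL is not claimed to be the authors' query.

```python
import sqlite3

# Hypothetical rows: (category, sex, count). Placeholder counts, not Census data.
rows = [
    ("Main", "Male", 300),
    ("Main", "Female", 100),
    ("Marginal", "Male", 50),
    ("Marginal", "Female", 50),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workers (category TEXT, sex TEXT, cnt INTEGER)")
conn.executemany("INSERT INTO workers VALUES (?, ?, ?)", rows)

# Share of each sex within a worker category, as a percentage:
# per-(category, sex) sum divided by the per-category total.
query = """
SELECT w.category,
       w.sex,
       ROUND(100.0 * SUM(w.cnt) / t.total, 2) AS pct
FROM workers AS w
JOIN (SELECT category, SUM(cnt) AS total
      FROM workers
      GROUP BY category) AS t
  ON w.category = t.category
GROUP BY w.category, w.sex, t.total
ORDER BY w.category, w.sex
"""
shares = {(cat, sex): pct for cat, sex, pct in conn.execute(query)}
conn.close()
```

With the placeholder counts above, the male share of the hypothetical "Main" category works out to 300 / 400, i.e. 75%. A rural/urban split would follow the same pattern with a different grouping column.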

Pre-processing: The following duplicate data is deleted. The 60+ age group is deleted since it is already covered by the 60-69 and 70-79 age groups. The 30-59 age group is deleted as it is covered by more specific age groups. From age group 40 onwards a range of 10 years was given, but for the younger age groups, starting from age 5, a range of 5 years was given. Since this is workers data, the analysis was done from age 20 onwards. Hence age groups 20-24 and 25-29 are merged to form a group with a 10-year range, and 30-34 and 35-39 are merged in the same way. The analysis is as follows. The overall population distribution is given below.

TABLE II. Analysis of data given in various age groups starting from 5 years to 80+

Male Population    Female Population    Rural Population    Urban Population
51.47%             48.53%               68.86%              31.14%
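The pre-processing rules described before Table II (drop the overlapping 60+ and 30-59 rows, merge the 5-year bands from age 20 into 10-year bands) can be sketched in Python as follows. This is a minimal sketch, not the authors' code: the dictionary layout and all counts are hypothetical placeholders chosen only so that the overlapping rows equal the sum of their finer groups.

```python
# Hypothetical counts per age group; placeholders, not Census figures.
raw_counts = {
    "20-24": 10, "25-29": 12, "30-34": 11, "35-39": 9,
    "40-49": 18, "50-59": 14, "60-69": 8, "70-79": 4, "80+": 2,
    "30-59": 52,  # overlapping aggregate row (11 + 9 + 18 + 14)
    "60+": 14,    # overlapping aggregate row (8 + 4 + 2)
}

# Rows already covered by more specific age groups.
OVERLAPPING = {"30-59", "60+"}

# Two 5-year bands are merged into one 10-year band.
MERGES = {
    "20-24": "20-29", "25-29": "20-29",
    "30-34": "30-39", "35-39": "30-39",
}


def preprocess(counts):
    """Drop overlapping aggregate rows and merge 5-year bands into 10-year bands."""
    cleaned = {}
    for group, value in counts.items():
        if group in OVERLAPPING:
            continue  # would double count people in finer groups
        target = MERGES.get(group, group)
        cleaned[target] = cleaned.get(target, 0) + value
    return cleaned


age_groups = preprocess(raw_counts)
```

After this step every remaining group from age 20 to 79 spans 10 years, and no person is counted twice.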


NoSQL databases such as key-value stores, document databases and column-family databases are based on the aggregate model. Data sets often contain relationships or connectedness among the data. Relational databases can model data relationships but are not efficient when the relationships are many. A graph data structure is the natural way of representing relationships, using edges or links. Graphs find a number of applications in areas such as social networks, computer networks, transport networks and protein-protein interactions. Neo4J is a popular graph database. The Census data is analyzed and represented as a graph, drawn using Neo4J, that depicts the states having the maximum and minimum percentage of Main, Marginal and Non-workers. The nodes are state names and categories of workers.
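The graph model described above, with state nodes and worker-category nodes linked by edges, can be sketched in plain Python without a Neo4J server. This is an assumption-laden illustration: the edge labels "MAX" and "MIN", the state names and the percentages are hypothetical placeholders, and the paper does not specify how its Neo4J relationships were named.

```python
from collections import defaultdict

# Hypothetical edges: (state, worker category, label, percent).
# State names and percentages are placeholders, not Census results.
edges = [
    ("StateA", "Main workers", "MAX", 45.0),
    ("StateB", "Main workers", "MIN", 20.0),
    ("StateB", "Non-workers", "MAX", 70.0),
]


def build_graph(edge_list):
    """Build an adjacency map: node -> list of (neighbour, label, percent)."""
    graph = defaultdict(list)
    for state, category, label, percent in edge_list:
        graph[state].append((category, label, percent))
        graph[category].append((state, label, percent))  # store both directions
    return graph


def states_for(graph, category, label):
    """Return the states linked to a category node by an edge with the given label."""
    return [node for node, lbl, _ in graph[category] if lbl == label]


graph = build_graph(edges)
max_main = states_for(graph, "Main workers", "MAX")
```

Answering "which state has the maximum share of main workers" is then a one-hop traversal from the category node, which is the access pattern a graph database is built for.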


Figure I. State-wise maximum and minimum percent of Main, Marginal and Non-workers

Workers data is further divided into male, female, rural and urban for main, marginal and non-workers. The rural population is more than double the urban population, and males outnumber females by about 3 percentage points. Indian workers are grouped into Main workers, Marginal workers and Non-workers. Marginal workers are further grouped into those working less than 3 months a year and those working 3 to 6 months a year. Within the Marginal and Non-worker categories, there are people who are available for work. Main workers make up about 1/3rd of the entire population.

Figure II. Distribution of workers in India

Main workers data is analyzed to find the distribution across male, female, rural and urban areas. In all the age groups, males account for 3/4th of total main workers and females for only 1/4th. The number of main workers in urban areas is close to that in rural India up to the retirement age of 60 years. Later, rural main workers are more numerous than urban main workers.

Figure III. Distribution of workers, age group wise

Female non-workers are about 90% in the 30 to 60 age range. Male non-workers are significant after the age of retirement. We can also find significant male non-workers in the age group of 20-29.

Figure IV. Distribution of non-workers, age group wise

Figure V. Comparison of different categories of workers, age group wise

From the above figure, we can see more Non-workers than Main workers in the age group of 20-29. Later, main workers significantly dominate Non-workers until the age of 60. A good number of main workers can also be found in the age group 60-69.

IV CONCLUSION

Big data technology does not require predefined formats for data representation and can process structured, semi-structured and unstructured data. With the reduction in storage costs, organizations can store data for long periods and analyze it to gain insights. These large data sets can contain valuable information that can be helpful to other agencies for their planning and policy making. This paper gives the distribution of different categories of workers across the male and female populations and between rural and urban areas.

References

[1] https://www.meridium.com/challenges/big-data0
[2] Alvaro A. Cardenas, Pratyusa K. Manadhata, Sreeranga P. Rajan, “Big Data Analytics for Security”, IEEE Computer and Reliability Societies, 2013, pp. 74-76.
[3] Jeff Markey, “How to Manage Big Data’s Big Security Challenges”

[4] “Cloud Security Alliance Top Ten Big Data Security and Privacy Challenges”, Cloud Security Alliance, 2012.


[5] “Cloud Security Alliance Big Data Analytics for Security Intelligence”, Cloud Security Alliance, 2013.
[6] Jon Oltsik, “The Big Data Security Analytics Era Is Here”, Enterprise
[7] Strategy Group, 2013
[8] http://www.brickmarketing.com/define-log-file.htm
[9] https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Big_Data_Analytics_for_Security_Intelligance.pdf
[10] http://www.censusindia.gov.in/2011census/B-series/B-Series-01.html

[11] Ian Robinson, Jim Webber and Emil Eifrem. Graph Databases. CA: O’Reilly Publishers
