A New Perspective to Data Processing: Big Data



Anhad Mathur
Akash Sihag
Anu Sharma
Kirti Sharma

Dept. of Computer Science, Vivekananda Institute of Technology, Jaipur, Rajasthan
[email protected]

Abstract -- Big data is data that exceeds the processing capacity of traditional database systems: it is too voluminous, moves too fast, or does not fit the structures of existing database architectures. Alternative ways of processing such data are therefore needed. This paper outlines the fundamental aspects of Big Data along with its opportunities and challenges, building on some of the most recent findings in data science. It does not aim to cover the entire spectrum of challenges, nor to offer definitive answers to those it addresses, but to serve as a reference for further reflection and discussion.

Keywords -- Big Data, Ameliorate, Semantics, Data Elucidation, Heterogeneity, Data Accession, Metadata, Scalability, Warehouse

I. INTRODUCTION

“The growth of data is never ending.” A recent survey states that we consume more bytes on the internet in 30 minutes than grains of rice are consumed in a year, i.e. 40 petabytes, and that number will keep increasing exponentially; one study estimates that the amount of data we will require in 2015 will be three times what we consume today. Handling such large amounts of data calls for a new kind of technology and architecture, known as Big Data. The term ‘Big Data’ was coined in the 1970s but gained pace in 2008. Until recently the Big Data phenomenon was restricted to the research field, but Big Data analysis is now required in many commercial fields such as retail, mobile services, manufacturing, and financial services, as all these services reflect our modern society.

Basically, Big Data is not just a technology but a phenomenon concerning how the retrieval and storage of data can be made more effective. We need algorithms that can manage and convert unstructured, imperfect, complex, machine-generated data (data records, web-log files, sensor data) into actionable information, so that effective automated conclusions can be drawn from real-time calculations, since data velocity is very high at any instant. For example, Walmart [1] handles more than 16 thousand customer transactions every minute, imported into databases estimated to contain more than 42 terabytes of data – the equivalent of 2.5 times the information contained in all the books in the US Library of Congress. Similarly, Amazon.com [2], an e-commerce website, handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 it had the world’s three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.

According to McKinsey, Big Data refers to datasets whose size is beyond the ability of typical database software tools to record, store, manage, and analyze. Big Data has no exact definition of how big a dataset must be to fall into this category. New technology is therefore required to manage it: Big Data technologies are advanced data-extraction technologies whose architectures are designed so that value can be drawn from the data economically and efficiently while accommodating the different characteristics of the datasets.

What is Big Data? Big Data is a term used for managing large datasets that are difficult to handle with on-hand database management tools or traditional data processing applications. Big Data is often regarded as a technology, but it is rather a phenomenon that represents both a challenge in utilizing this volume of data and an opportunity for organizations seeking to ameliorate their effectiveness.

II. CHARACTERISTICS

There are three important properties of Big Data: it is not only about the vastness of data but also about its variety and velocity. Together these attributes form the 3 V's of Big Data.

Volume :-

Here the generalized term for volume is “big”, which represents the size of the data. It is a relative term: some small organizations are likely to have a few gigabytes or terabytes of data, compared with big global organizations that have several petabytes or exabytes to handle. The only safe prediction about data is that its volume will keep increasing day by day, independent of the size of the organization. Companies now tend to store all kinds of datasets – medical, financial, environmental, statistical and analytical – and while today many organizations hold datasets in the range of terabytes, petabytes and exabytes are not far away.

Variety :-

Data can be obtained from a variety of sources and can be of different types, i.e. structured, semi-structured, and unstructured. With the rise of technology and the escalation of sensors, social media, networking, and smart devices, data has become more complex, because it now includes unstructured and semi-structured data.


Structured data: Here the data is organized into a relational schema, i.e. a conventional database of rows and columns. The consistency and configuration of the data allow usable information to be retrieved in response to basic queries, based on the parameters and operational needs of the organization.

Semi-structured data: Here the data is in structured form but has no fixed schema; its variable schematic design depends purely on the generating source. This category includes data inherited from hierarchies of records and fields, such as social media and weblogs.

Unstructured data: This category contains the most scattered data and is the most complex to process. A general problem that arises here is assigning proper indexes to data tables for analysis and querying. Examples include images, video, and many other multimedia files.

III. PHASES OF BIG DATA

Big Data processing includes the following phases in its pipelining model:

A. Data Accession and Recording :-

Big Data does not emerge out of a vacuum; it must be recorded from some source. Consider our real world, where we sense and observe the things around us – the presence of air and any smell in it, or the heart rate of a person – up to the big experimental analyses and simulations performed by scientists; all these activities produce up to millions of terabytes of data per day. For example, the Hubble telescope, one of the biggest telescopes in the world, generates a very large amount of raw data.

B. Information Pulling and Filtering :-

"It's not information overload. It's filter failure." The raw data collected in the data accession and recording phase is not in a format ready for analysis, so we cannot leave it in that form; it must be processed further. Consider, for example, the collection of health records in a hospital: structured data from sensors, transcribed dictations from physicians, and other measurements that may carry some uncertainty. We therefore require an information extraction process that can pull the required information from each source and express it in a structured form. Doing this completely and correctly is one of the biggest continuing technical challenges. Note also that the data may take different forms – today it may include only images, but in future also video – and the extraction is highly application-dependent: what we want to extract from an ordinary picture is totally different from what we want to pull out of an MRI.
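As a minimal illustration of such an extraction step (the field names and text patterns here are invented for the example, not a clinical standard):

```python
import re

# Minimal sketch of an extraction step: pull structured fields out of
# free-text clinical notes. The field names and patterns are illustrative
# assumptions, not a real hospital schema.
PATTERNS = {
    "heart_rate":   re.compile(r"\bHR\s*[:=]?\s*(\d{2,3})\b"),
    "systolic_bp":  re.compile(r"\bBP\s*[:=]?\s*(\d{2,3})/\d{2,3}\b"),
    "diastolic_bp": re.compile(r"\bBP\s*[:=]?\s*\d{2,3}/(\d{2,3})\b"),
}

def extract_record(note: str) -> dict:
    """Turn one unstructured note into a structured record.

    Missing fields are recorded as None rather than guessed, so the
    uncertainty in the source survives into the structured form."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(note)
        record[field] = int(match.group(1)) if match else None
    return record

notes = [
    "Patient stable. HR 72, BP 120/80, no complaints.",
    "Transcribed dictation: BP 135/90; pulse not recorded.",
]
records = [extract_record(n) for n in notes]
print(records)
```

Keeping un-extractable fields explicit, as `None` here, matters downstream: the analysis phase can then distinguish "not measured" from "measured as zero".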

C. Data Integration and Representation :-



Due to this non-uniformity and flood of data, it is not sufficient merely to record it and send it into a warehouse. A set of scientific experiments is a suitable example: if we simply dump a bundle of datasets into the warehouse, it becomes impossible to find, reuse, or utilize any of them. With appropriate metadata there is a chance, but even then challenges persist, due to differences in experimental results and data record structures.
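As a minimal illustration of the role metadata plays here (the dataset names and fields are invented, not a standard):

```python
from dataclasses import dataclass

# Illustrative sketch of why metadata matters in a warehouse: without it,
# datasets are opaque blobs; with it, they can be found and reused.
# The fields below are assumptions made for the example.
@dataclass
class DatasetMetadata:
    name: str
    source: str        # which experiment or sensor produced the data
    schema: tuple      # column names, so results stay interpretable
    assumptions: str   # crucial recording assumptions, kept explicit

catalog = [
    DatasetMetadata("trial_a", "lab-sensor-3", ("t", "temp_c"),
                    "sensor sampled every 10 s"),
    DatasetMetadata("trial_b", "simulation", ("t", "temp_c", "pressure"),
                    "synthetic data, idealized conditions"),
]

def find(catalog, column: str):
    """Locate every dataset that records a given column."""
    return [m.name for m in catalog if column in m.schema]

print(find(catalog, "pressure"))   # only trial_b records pressure
```

Note the `assumptions` field: recording the experimental assumptions alongside the data is what allows a later user to judge whether two datasets are actually comparable.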

Further challenges arise because many crucial assumptions lie behind how the data were recorded, and analytical pipelines often involve many steps, each with its own built-in assumptions.

D. Query Analysis :-

Data analysis is a lot more challenging than merely locating, understanding, recognizing, and citing data. For very large-scale analysis, all of these steps have to be performed in an automated way, which requires data syntax and semantics to be expressed in forms that are machine-readable and machine-resolvable. There is a powerful body of work on the integration of datasets, although additional work is still needed to achieve error-free automated solutions.

Techniques for mining and retrieving Big Data are fundamentally different from conventional statistical analysis of small-scale samples. Big data is generally noisy, inter-related, and untrustworthy. In spite of that, noisy big data can be more valuable than small samples, because basic statistics emerge from recurring patterns in the system. Mining Big Data requires filtered, integrated, trustworthy, efficiently accessible data, together with scalable algorithms and environments suitable for performing Big Data computations. Conversely, data mining can itself be used to improve the standard and trustworthiness of the data, interpret its semantics, and provide tools for intelligent querying.

E. Data Elucidation and Interpretation :-

One of the main tasks in the Big Data processing pipeline is to present the results of an analysis in a form the user can easily understand; if the user cannot draw an effective interpretation from the data, it is of limited value. Ultimately, a decision-maker has to interpret the results, which involves examining all the assumptions made and possibly retracing the analysis. Furthermore, there are many possible sources of error: almost all models rest on assumptions of some kind, computer systems and programs can have bugs, and results can be based on wrong or erroneous data. For all these reasons, no user will simply grant authority to the computer system; he or she will try to understand and verify the results through other means, and the system should make it easy to do so. Given the complexity of Big Data, this is one of its biggest challenges.

IV. CHALLENGES IN ANALYSIS OF BIG DATA

Big Data offers new opportunities to modern society and to scientists, as it promises to handle large amounts of heterogeneous data; but because of its high dimensionality and massive sample sizes it also faces unique statistical and computational challenges, including storage bottlenecks, scalability, noise accumulation, incidental endogeneity, spurious correlation, and other measurement errors. Some of these challenges are discussed here.

A. Heterogeneity, Diversity and Incompleteness :-

The data and information consumed by humans possess a lot of heterogeneity [4], but we tolerate it comfortably; indeed, the nuance and richness of natural language provide great depth of value. Machine algorithms are different: they cannot understand nuance and expect homogeneous data, so data must be carefully structured before these algorithms can process it effectively. Consider, for example, a patient in a hospital, or a consumer purchasing goods from a shop. For the patient we could create different records for different aspects of care: one record per laboratory test, one record per hospital stay, or a single record covering the patient's lifetime of interactions with the hospital. In the first design the number of procedures and tests per record differs from patient to patient, and the successive designs are less structured still, with correspondingly greater variety. The basic requirement of a traditional data analysis system, however, is data with a well-defined structure: computer systems work most efficiently when they store multiple items of the same size and structure, and hence this area requires further work.
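The three record designs described above can be sketched as follows (all names and values are invented for illustration):

```python
# Illustrative sketch of the three record designs the text describes for
# one hospital patient, from most to least structured.

# Design 1: one fixed-shape record per laboratory test -- homogeneous,
# easiest for machine algorithms to scan with uniform code.
per_test = [
    {"patient": "P1", "test": "glucose", "value": 5.4},
    {"patient": "P1", "test": "hemoglobin", "value": 13.9},
]

# Design 2: one record per hospital stay -- each stay nests a variable
# number of tests, so record sizes differ between patients.
per_stay = {
    "patient": "P1",
    "stay": "2013-04",
    "tests": per_test,
}

# Design 3: one record per patient lifetime -- least structured, with
# the greatest variety per record.
lifetime = {
    "patient": "P1",
    "stays": [per_stay],
    "notes": ["follow-up advised"],
}

# The flat design can be processed by uniform code; the nested designs
# require schema-aware traversal.
print(len(per_test), len(lifetime["stays"][0]["tests"]))
```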

B. Scalability :-

The first notable thing about Big Data is its size, which is why it is so called. Managing large, voluminous amounts of data has been a challenging problem for many years. In the past the problem was mitigated by processors getting faster every couple of years, following Moore's law, but the field has now shifted to the whole new scenario of cloud computing, in which the system takes the form of a distributed cluster. This kind of resource sharing requires new ways of deciding how to execute and run data processing jobs.
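The shift from a single fast processor to a distributed cluster can be illustrated with a small map-reduce-style sketch (run sequentially here for illustration; the partitioning and data are invented):

```python
from collections import Counter
from functools import reduce

# A minimal sketch of the map-reduce style of computation used on
# distributed clusters: split the input into partitions, process each
# partition independently (on a real cluster, on different machines),
# then merge the partial results.

def map_partition(lines):
    """Count words within one partition -- no shared state needed."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def merge(a, b):
    """Combine two partial results; associativity lets a cluster merge
    them in any order."""
    return a + b

log = ["error disk full", "ok", "error net down", "ok ok"]
partitions = [log[:2], log[2:]]     # pretend each half lives on a different node
partials = [map_partition(p) for p in partitions]
total = reduce(merge, partials)
print(total["ok"], total["error"])  # -> 3 2
```

Because `merge` is associative, the cluster scheduler is free to combine partial results in whatever order nodes finish, which is exactly the scheduling freedom the paragraph above refers to.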

Another dramatic shift under way is the change in the traditional input-output subsystem. For decades, hard disk drives (HDDs) were used to store data. HDDs had disadvantages such as slow random I/O performance, and they are now increasingly being replaced by solid-state drives, while other technologies such as phase-change memory (PCM) are around the corner. These newer technologies do not show the same large spread between random and sequential I/O performance that the older HDDs did, and so they require a rethink of how storage subsystems for data processing are designed. This changing storage landscape potentially touches every aspect of data processing, including database design, query scheduling, processing algorithms, recovery methods, and concurrency control.

C. Timeliness :-

When larger data sets are to be processed, a system requires more time to analyze them: larger data sets increase complexity, which in turn increases time complexity, because the flip side of size is speed. A design that can handle larger data sets at greater speed is therefore more suitable for these technologies. In many situations the result of an analysis is required immediately. Consider a fraudulent credit-card transaction, which should be flagged before the transaction completes, to prevent it from taking place at all. A full analysis of a user's purchase history is not feasible in real time; instead, the system needs to maintain a partial result about the user and the card, from which a quick determination can be made. The system should therefore be designed to possess this flexibility in computation.

D. Privacy :-

Data privacy is another big concern, and one that keeps growing in the context of Big Data. For electronic health records there are strict laws governing what can and cannot be done; for other data, regulations, particularly in the US, are less forceful. However, there is great public fear regarding the inappropriate use of personal data, particularly through the linking of data from multiple sources [3]. Privacy management is both a sociological and a technical problem, and it must be addressed from both perspectives together to realize the promise of Big Data.

Consider, for example, data collected from location-based services. These new architectures require a user to share his or her location with the service provider, resulting in obvious privacy concerns. Hiding only the user's identity, without hiding the location, does not adequately address them: a location-based server, or an attacker, can infer the identity of the query's source from its location information, since the user's location can be determined through the stationary connection points used. Over time, a user leaves "a trail of packet crumbs" which may be related to a certain residence or office location, and thereby used to identify the user. Several kinds of private information, such as religious preference or health problems, can also be unveiled just by observing an anonymous user's usage patterns and movement over time. Indeed, Barabási et al. [5] demonstrated a strong, close correlation between people's movement patterns and their identities. Note that concealing a user's location is much more challenging than concealing his or her identity: in location-based services, the user's location is required for data access or collection to succeed, while the user's identity is not essential.
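One common mitigation, shown here as a minimal sketch with invented parameters, is spatial cloaking: the device reports only a coarse grid cell rather than a precise position, so many users share each reported location:

```python
# A small sketch (parameters invented) of spatial cloaking, one common way
# to blunt the location-privacy risk described above: report only a coarse
# grid cell, so many nearby users share each reported location.

CELL = 0.05  # grid size in degrees; coarser cells -> more privacy, less utility

def cloak(lat: float, lon: float, cell: float = CELL):
    """Snap a precise position to the corner of its grid cell."""
    return (round(lat // cell * cell, 6), round(lon // cell * cell, 6))

home = (26.9124, 75.7873)    # a precise, identifying trail point (Jaipur)
nearby = (26.9101, 75.7899)  # a different user close by

# Both users now report the same cell, so the "trail of packet crumbs"
# no longer pins down a single residence.
print(cloak(*home) == cloak(*nearby))   # -> True
```

The trade-off the text describes is visible in the single parameter: the service still receives a usable location, but its precision, and hence its re-identification power, is bounded by the cell size.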

E. Human collaboration :-

In spite of the enormous advances made in computational analysis, many patterns remain that are not detected by computer algorithms yet are easily detected by human beings. Indeed, CAPTCHAs exploit precisely this fact to tell human web users apart from computer programs. Ideally, Big Data analysis will not be entirely computational; rather, it will be explicitly designed to have a human in the loop. The new sub-field of visual analytics seeks to do this, at least with respect to the analysis and modelling phase of the pipeline, and there is similar value in human input at all stages. In today's complicated world it usually takes multiple experts from different domains to really understand what is going on, so a Big Data analysis system must support shared exploration of results and input from multiple human experts. These experts may be separated in space and time when it is too costly to assemble an entire team in one room, and the data system has to support their collaboration and accept this distributed expert input.

V. CASE STUDY

McKinsey Global Institute [6] (MGI) studied big data in five domains – healthcare in the United States, the public sector in Europe, retail in the United States, and manufacturing and personal-location data globally. Big data can generate value in each. For example, a retailer using big data to the full could increase its operating margin by more than 60 percent. Harnessing big data in the public sector has enormous potential, too. If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year; two-thirds of that would come from reducing US healthcare expenditure by about 8 percent. In the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data, not including its use to reduce fraud and errors and boost the collection of tax revenues. And users of services enabled by personal-location data could capture $600 billion in consumer surplus.

The research offers seven key insights.

1. Data have swept into every industry and business function and are now an important factor of production, alongside labor and capital. We estimate that, by 2009, nearly all sectors in the US economy had on average at least 200 terabytes of stored data (twice the size of US retailer Wal-Mart's data warehouse in 1999) per company with more than 1,000 employees.

2. There are five broad ways in which using big data can create value. First, big data can unlock significant value by making information transparent and usable at much higher frequency. Second, as organizations create and store more transactional data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance. Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions; others are using data for everything from basic low-frequency forecasting to high-frequency nowcasting to adjust their business levers just in time. Third, big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services. Fourth, sophisticated analytics can substantially improve decision-making. Finally, big data can be used to improve the development of the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance (preventive measures that take place before a failure occurs or is even noticed).

3. The use of big data will become a key basis of competition and growth for individual firms. From the standpoint of competitiveness and the potential capture of value, all companies need to take big data seriously. In most industries, established competitors and new entrants alike will leverage data-driven strategies to innovate, compete, and capture value from deep and up-to-real-time information. Indeed, we found early examples of such use of data in every sector we examined.

4. The use of big data will underpin new waves of productivity growth and consumer surplus. For example, we estimate that a retailer using big data to the full has the potential to increase its operating margin by more than 60 percent. Big data offers considerable benefits to consumers as well as to companies and organizations. For instance, services enabled by personal-location data can allow consumers to capture $600 billion in economic surplus.

5. While the use of big data will matter across sectors, some sectors are set for greater gains. We compared the historical productivity of sectors in the United States with the potential of these sectors to capture value from big data (using an index that combines several quantitative metrics), and found that the opportunities and challenges vary from sector to sector. The computer and electronic products and information sectors, as well as finance and insurance, and government, are poised to gain substantially from the use of big data.

6. There will be a shortage of the talent organizations need to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

7. Several issues will have to be addressed to capture the full potential of big data. Policies related to privacy, security, intellectual property, and even liability will need to be addressed in a big data world. Organizations need not only to put the right talent and technology in place but also to structure workflows and incentives to optimize the use of big data. Access to data is critical – companies will increasingly need to integrate information from multiple data sources, often from third parties, and the incentives have to be in place to enable this.

VI. CONCLUSION

We are entering an era of Big Data. With the large-scale data available nowadays, there are great opportunities to make faster advances in many scientific fields and to enhance and enrich many organizations. Nevertheless, the many challenges discussed in this paper must be addressed before these opportunities can be realized. They include not only the salient issues of heterogeneity, diversity, incompleteness, scalability, timeliness, and human collaboration, but arise at all levels of the analysis flow, from data accession to result explanation. These challenges are common across large application domains, so it is not cost-effective to address them in the context of a single domain alone; nor will they be addressed naturally by the next generation of industrial products. We strongly support further research towards meeting these technical challenges, in order to realize the full benefits of Big Data.

VII. REFERENCES

[1] "Data, data everywhere". The Economist. 25 February 2010.
[2] Layton, Julia. "Amazon Technology". money.howstuffworks.com.
[3] "Top Ten Big Data Security and Privacy Challenges". Cloud Security Alliance. November 2012.
[4] "Challenges and Opportunities with Big Data". Cyber Center Technical Report 2011-1, Purdue University.
[5] M. C. González, C. A. Hidalgo, and A.-L. Barabási. "Understanding individual human mobility patterns". Nature 453, 779-782 (5 June 2008).
[6] McKinsey Global Institute. "Big data: The next frontier for innovation, competition, and productivity". May 2011.


