Big Data A 360 Overview attempt

July 25, 2017 | Author: Juvenal Chokogoue | Category: Big Data, Big Data / Analytics / Data Mining, Machine Learning Big Data, Big Data Technologies

Data

Information

Knowledge

Actionable plans

Performance

The New Infrastructure for Data Management : Hadoop
Taken alone, Big Data is technology-driven. If businesses want to capitalize on Big Data, they have to find a way to combine it with the traditional business analysis techniques they have used in the past to query and dig through their data.
But an extremely wide variety of data brings new challenges. Most traditional business analysis techniques are not suited to the new kinds of data sources we have today, and that is where Analytics comes into play!
Analytics designates the means by which businesses gain insight from data, whatever its source, its size, and even its format.
With all this said, you can now understand that Big Data Analytics is the concept that designates the new means by which we extract insights from data that is extremely large, extremely varied, and extremely fast-moving.
However, be aware, firstly, that the efficiency of Analytics depends fundamentally on the question you want to answer and on the quality of the data. Data quality issues must be considered before any analytics concern. As it is said in the field: "Garbage in, garbage out".
Secondly, analytical techniques must be handled with caution and require formal training in the field. You may consider hiring an analytics professional.
Thirdly, analytics is not a "silver bullet" that will always give you insights.
Fourthly, having insights does not guarantee that you have the power to act on them. Analytics can provide insights, but turning insights from numbers into competitive advantage may require changes that your business can't afford, or simply doesn't want to make. The Harvard Business Review describes a case in which a retailer learned through big data that "he could increase profits substantially by extending the time that items were on the floor before and after discounting. Implementing that change, however, would have required a complete redesign of the supply chain, which the retailer was reluctant to undertake." (source: https://hbr.org/2013/12/you-may-not-need-big-data-after-all/ar/1)
Finally, Analytics does not replace your business intuition. It just makes you more confident about your choice. In the end, you may still rely on your experience and your intuition as a manager to make the decision.
Analytical Techniques for Mining Big Data
In this part, I am going to talk only about techniques I am certified in. These techniques are used in most business scenarios and proved their worth long ago.
These techniques are: Regression (Linear and Logistic), Decision Trees, K-Means, Time Series, Neural Networks, Association Rules, Naive Bayes, Support Vector Machines (SVM), and Survival Analysis. In addition, I am going to present Text Analytics fundamentals, since in the Big Data age we are generating more and more text data (tweets, Facebook comments, ...).

- Regression
Regression focuses on the relationship between an outcome and its input variables. Here, we are predicting how changes in individual drivers affect the outcome. The outcome can be continuous or discrete. When it is discrete, we are predicting the probability that the outcome will occur (logistic regression). When it is continuous, we are predicting the value of the dependent variable given the independent variables (linear regression).
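To make the continuous case concrete, here is a minimal sketch of a one-variable linear regression fitted by ordinary least squares. The data (advertising spend and sales) and the function name `fit_line` are hypothetical and chosen only for illustration; a real project would normally rely on a statistics package.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: advertising spend (driver) and sales (outcome).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_line(ad_spend, sales)
# Predict the outcome for a new value of the driver.
predicted = intercept + slope * 6.0
```

The slope answers the question raised above: how much the outcome changes when the driver increases by one unit.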
a survey from TDWI
Big Data and Analytics: How are these two married together?
According to Gartner: "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (http://www.gartner.com/it-glossary/big-data/)
Of all the definitions provided for Big Data, Gartner's is the most widely adopted. And from that definition, one thing is clear: when one uses the term "Big Data", it is to designate data that is large in volume, has a high velocity, and is available in a wide variety. This is often referred to as the "3 Vs" or the three dimensions of Big Data.

Anyway! What is Big Data ?
BIG DATA: A 360° Overview
Juvénal CHOKOGOUE M
Consultant Business Analytics – Big Data

BD-DE-0005

11/23/2014
Module Overview
The Business Challenge
What this module Stands for ?
Who is this module for ?
Before the battle begins
Anyway! What is Big Data ?
Big Data and Analytics: How are these two married together?
Analytical Techniques for Mining Big Data
The New Infrastructure for Data Management : Hadoop
Big Data adoption : Now or Later ?
The Next Steps
What Should I remember ?
Some Big Data Providers
Bibliography & Resources
About me
The Business Challenge
Scaling operations up and down as conditions change and the ability to decrease "time to market" for decision-making have become critical competitive differentiators in today's economy.
Companies are gathering more and more data to stay competitive.
If they want to decrease their "time to market", they must make sense of the intersection of all the different kinds of data they have gathered.
Technically, when you are dealing with so much data in so many different forms, it is impossible to think about data management in traditional ways.
The challenges and opportunities associated with this new kind of data management problem are known today as "Big Data".
What this module Stands for ?
As with any other technological concept that pops up, software companies are framing definitions from an IT perspective in order to sell their products, leaving businesses confused about the concept and about where it fits among the issues they have to face. Big Data, like Cloud Computing, Virtualization, Data Mining and so on, is just one of these concepts.
When writing this paper, my main objective was to provide a true 360° conceptual overview of Big Data, that is, a clear understanding of where the term "Big Data" comes from, why the term is so popular now, what it really means, and what its implications for businesses can be. Because Analytics is another term associated with Big Data, I have provided a description of widely recognized and used analytical techniques to help you figure out how, used in conjunction with Big Data, analytics can boost business performance.

This paper does not intend to be a "how-to" for big data project management, for big data application development, or for statistical model building. Those will be the subjects of other papers. Rather, I expect that by the end of this paper:
you will smile the next time you read or hear the terms big data, Hadoop, or analytics :)
you will understand what is behind the scenes when one talks about "Big Data"
you will know how one can "make sense" of Big Data using Analytics
you will have a basic idea of the data mining techniques used in Business and in Big Data
you will be able to follow every update about Big Data

So, keep reading…

Before the battle begins
Information provided here is for informational purposes only and represents my point of view as of the date of this presentation. Because market conditions change, the information provided here may be modified or become obsolete; it should not be interpreted as a commitment, and I cannot guarantee its accuracy after the date of this presentation.

The contents of the websites referenced here may be modified, or the websites themselves may become unavailable, after the publication of this presentation. So I make no warranties, express, implied, or statutory, as to the information in this presentation.

In this presentation, I choose to call the "Analyst" the person who is responsible for data management, analytics, and programming jobs. It is just a simplification that I adopted so that you are not distracted by the new jobs and terms created by Big Data and can focus on the content of the paper.

Microsoft, SQL Server, Teradata, Oracle, Google, Hadoop, Cloudera, HortonWorks, SAS, EMC and other names and products cited here are or may be registered trademarks in the U.S. and/or in other countries.

Feel free to share this module with anyone you know, from your colleagues to your friends, but in that case, don't forget to mention the name of the author.

You can use and change the content of this module on your own, but I will not be responsible for its content in that case.

This module is not for sale. If you intend to use it on your own, please don't commercialize it!
- Decision Trees
Decision Trees are a flexible method very commonly deployed in classification and regression problems. Decision trees partition large amounts of data into smaller segments by applying a series of rules of the form "IF condition THEN expression" (e.g., IF age less than 30 AND revenue greater than 36000 THEN class = 'Rich'). Decision trees are visually represented as upside-down trees, with the root at the top and branches emanating from the root. There are two types of trees: Classification Trees and Regression Trees.
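The rule quoted above can be written as a tiny hand-built classification tree. This is only a sketch: the thresholds come from the example in the text, while the label 'Other' and the function name `classify` are assumptions added so that every branch returns something. In practice the splits are learned from data, not written by hand.

```python
def classify(age, revenue):
    """Walk a two-level tree; each `if` corresponds to one split."""
    if age < 30:                 # root split on age
        if revenue > 36000:      # second split on revenue
            return "Rich"
        return "Other"           # hypothetical label for the remaining leaf
    return "Other"               # hypothetical label for the remaining leaf

young_high_earner = classify(age=25, revenue=40000)
young_low_earner = classify(age=25, revenue=30000)
older_high_earner = classify(age=45, revenue=40000)
```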

- K-Means
K-means is a clustering method. It falls into the category of Exploratory Data Analysis methods called "Unsupervised Classification". The goal is to group data based on similarities in the input variables, with no target or specific outcome. It is the preferred method for segmentation and profiling.
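As an illustration of grouping by similarity, here is a minimal k-means sketch on a single numeric variable. The customer spend values and the two starting centers are hypothetical; real implementations work on several variables at once and usually choose the starting centers at random.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one numeric variable, starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical monthly spend of six customers: two visible segments.
spend = [10, 12, 11, 50, 52, 49]
centers, clusters = kmeans_1d(spend, centers=[10, 50])
```

Note that no target variable is involved: the segments emerge only from how close the values are to each other.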

- Time Series
Time Series Analysis provides a scientific methodology for forecasting. It is the analysis of a phenomenon that evolves over time. The main objectives of Time Series Analysis are:
To understand the underlying structure of the time series by breaking it into trend, seasonality, and noise.
To fit a mathematical model to forecast the future.
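The first objective can be sketched with a moving average: averaging over one full seasonal cycle cancels the seasonal swings and leaves an estimate of the trend. The quarterly sales figures below are hypothetical and built on purpose from a known trend plus a repeating seasonal pattern, without noise, so the effect is easy to see.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical quarterly sales: a rising trend plus a seasonal pattern of period 4.
trend = [100 + 2 * t for t in range(8)]
season = [5, -5, 10, -10] * 2
sales = [a + b for a, b in zip(trend, season)]

# A window of 4 quarters covers one full season, so the seasonal terms sum to zero.
smoothed = moving_average(sales, window=4)
```

Here `smoothed` rises by a constant amount at each step, which is the trend that was hidden under the seasonal pattern.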

- Neural Network
Artificial Neural Networks are a class of flexible non-linear models used for prediction problems. The power of neural networks comes from the fact that they can approximate virtually any continuous association between the inputs and the target, whatever the kind of relationship that links them. There are many kinds of neural networks, but the most widely used is the Multi-Layer Perceptron (MLP).
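To give a feel for why a hidden layer adds non-linear power, here is a minimal sketch of a two-layer perceptron that computes XOR, a relationship that no single linear unit can represent. The weights are set by hand for illustration and are an assumption of this sketch; a real MLP uses smooth activation functions and learns its weights from data.

```python
def step(x):
    """Threshold activation: outputs 1 when the weighted input is positive."""
    return 1 if x > 0 else 0

def mlp_xor(a, b):
    # Hidden layer: one unit detects "a OR b", the other detects "a AND b".
    h_or = step(a + b - 0.5)
    h_and = step(a + b - 1.5)
    # Output unit: fires when OR is true but AND is not.
    return step(h_or - h_and - 0.5)

outputs = [mlp_xor(a, b) for a in (0, 1) for b in (0, 1)]
```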

- Association Rules
Also known as association rule discovery, Market Basket Analysis, or affinity analysis, association rule mining is a popular data mining method for exploring associations between items (data). It is an unsupervised method for in-database mining over transactions in databases.
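Associations between items are usually quantified with two standard measures that the text above does not name: support (how often the items appear together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both on a hypothetical list of four shopping baskets.

```python
# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Share of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule under study: bread -> milk
rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})
```

A rule is typically kept only when both measures exceed thresholds chosen by the analyst.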

Centralized data processing is no longer efficient nowadays!

To deal with Big Data, the idea is to distribute the storage of data and parallelize the processing of that data across clusters of computers: the cluster computing infrastructure.

In cluster computing:
data files are stored redundantly.
computations are divided into tasks and parallelized.

The redundancy of the data on multiple hard disks is supported via a new kind of file system called the "Distributed File System" (DFS), and the parallelism of the processing is achieved via a new kind of programming model called "MapReduce".

The most popular (and most mature) implementation of MapReduce is called "Hadoop". Hadoop comes with HDFS (Hadoop Distributed File System).

Yes, you got it! You can use an implementation of MapReduce to manage many large-scale data computations in a way that is tolerant of hardware faults.
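The MapReduce programming model can be sketched on a single machine in a few lines. The example below is not Hadoop code: the function names and the sample lines are hypothetical, and everything runs in one process. It only shows the three logical steps (map, shuffle, reduce) that a framework such as Hadoop distributes across the nodes of a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one input line into a (key, value) pair."""
    year, temperature = line.split("\t")
    return year, int(temperature)

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, max(values)

# Hypothetical input: one "year<TAB>temperature" reading per line.
lines = ["1949\t111", "1949\t78", "1950\t0", "1950\t22", "1950\t-11"]
pairs = [map_phase(line) for line in lines]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, both steps can be run in parallel on different machines, and a failed task can simply be rerun.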
6.1 The New data management strategy
A cluster computing environment
Map Reduce Job Description
BIG DATA ADOPTION: NOW OR LATER?
- Sqoop
Sqoop (SQL-to-Hadoop) efficiently transfers data between Hadoop HDFS and the data warehouse, in both directions. Think of Sqoop as the ETL (Extract - Transform - Load) tool for a Hadoop environment.

- Zookeeper
Zookeeper provides a distributed configuration service, a synchronization service, and a naming registry for distributed applications. Zookeeper is Hadoop's way of coordinating all the elements of these distributed applications.

- Mahout
Mahout is a scalable machine learning and data mining library for Hadoop. Think of Mahout as the analytic software for a Hadoop environment. Mahout provides data mining and machine learning algorithms packaged in Java libraries to perform four types of analysis in a Hadoop environment: recommendation mining, classification, clustering, and association rules.

- Pig
Pig is an interactive data flow (or script-based) language and execution environment for Hadoop. Pig provides a data flow language called Pig Latin that lets you express a series of operations to apply to input data to produce output.

- Hive
Hive is an interactive and batch query language based on SQL for building MapReduce jobs. It provides users who know SQL with a simple SQL-like language called HiveQL.

- HBase
HBase is a distributed, column-oriented database that uses HDFS as its persistence store and supports MapReduce and point queries. It is capable of hosting very large tables (billions of rows and millions of columns) because it is layered on Hadoop clusters of commodity hardware.

E.g. of a Pig script: finding the maximum temperature by year
-- Load the tab-separated readings: year, temperature, quality code
records = LOAD 'data/sample.txt' AS (year:chararray, temperature:int, quality:int);
-- Drop missing readings (9999) and keep only the accepted quality codes
filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 4);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;

The same example written in HiveQL
CREATE TABLE records (year STRING, temperature INT, quality INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH 'data/sample.txt' OVERWRITE INTO TABLE records;
SELECT year, MAX(temperature) FROM records WHERE temperature != 9999 AND (quality = 0 OR quality = 1) GROUP BY year;
Hadoop is a platform that implements MapReduce and provides a redundant, reliable, and distributed file system optimized for large files.

In reality, Hadoop is just a set of Java classes (equivalent code can also be written in other programming languages such as Python, C#, C++, ...) for HDFS types and MapReduce job management.

These classes allow the analyst to write functions that extract insight from data without having to worry about how the code is distributed and parallelized in the cluster environment.

To get the most out of a Hadoop cluster, a set of technologies and tools has been developed. This set of tools forms what is today conveniently called the Hadoop Ecosystem.

The most foundational tools of the Hadoop Ecosystem are the following: Pig, Hive, HBase, Sqoop, Zookeeper & Mahout.
6.2 The Hadoop Ecosystem
The answer to this question must lie in the integration and operationalization of analytics as an integral part of the organization's business processes. This supposes that the organization is data-driven. The big data approach is best suited to addressing or solving business problems that meet one or more of the following criteria:
Data throttling
Computation-restricted throttling
Large data volumes
Significant data variety
Benefits from data parallelization

Even if we have always had a lot of data, the difference today is that significantly more of it exists, and it varies in type and timeliness. To cope with this problem, you have to think about managing data differently. That is where "Big Data" comes in.

Big Data is the name given to the data management challenges and opportunities that emerge when dealing with data that is extremely large in volume, has extremely high velocity, and is extremely wide in variety.

Big Data without Analytics is just data.

Just because you have insights doesn't guarantee you have the power to act on them.

Not every problem is suitable for Big Data.

MapReduce is a programming model that allows you to manage large-scale data computations in a way that is tolerant of hardware faults.

Hadoop is a platform that implements MapReduce and provides a redundant, reliable, and distributed file system optimized for large files.
What Should I remember ?
Some Big Data Providers
Cloudera, with its commercial distribution of Hadoop
HortonWorks, with its commercial distribution of Hadoop
SAS Institute with its SAS High Performance Suite and SAS Visual Analytics
HP with its platform called HP Vertica
EMC with its platform called GreenPlum Pivotal
Here are some Big Data providers I personally know. There are some others.

- Naive Bayes
Naive Bayes is a "classifier", that is, it is used to classify or assign labels to objects by applying Bayes' theorem with strong (naive) independence assumptions. Naive Bayes is especially suited to problems where you have categorical inputs with many levels, such as textual data.
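Here is a minimal sketch of a Naive Bayes text classifier. The four training sentences, the labels, and the function names are hypothetical. The sketch also uses add-one (Laplace) smoothing, a common choice that is not discussed above, so that a word never seen with a label does not force the probability to zero.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """Count words per label; `samples` is a list of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def predict(text, model):
    """Return the label with the highest log-probability, treating words as independent."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior probability of the label
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing avoids zero probabilities for unseen words.
            p = (word_counts[label][word] + 1) / (n_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training comments.
samples = [
    ("great product love it", "positive"),
    ("love this great service", "positive"),
    ("terrible product hate it", "negative"),
    ("hate this awful service", "negative"),
]
model = train(samples)
label = predict("love this product", model)
```

The "naive" assumption is visible in the inner loop: the probability of each word is multiplied in (added, in log form) as if the words were independent of each other.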

- Survival Analysis
Survival analysis is a class of statistical methods for studying the occurrence and timing of events. It is suitable for problems where you want to know WHEN a specific event will happen. The most common approaches to building a survival model are the following: life tables, Kaplan-Meier estimators, exponential regression, proportional hazards regression, competing risk models, and discrete-time methods.
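Of the approaches listed, the Kaplan-Meier estimator is the simplest to sketch. At each time where an event occurs, the survival probability is multiplied by the share of at-risk subjects who did not experience the event. The customer tenures below are hypothetical, and the function name `kaplan_meier` is chosen for this illustration only.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each time where an event occurred.

    durations: time until the event, or until the subject left the study.
    observed:  True if the event happened, False if the subject was censored.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical customer tenures in months; False means still a customer (censored).
months = [2, 3, 3, 5, 8]
churned = [True, True, False, True, False]
curve = kaplan_meier(months, churned)
```

Censored subjects still count in `at_risk` up to the time they leave the study, which is what distinguishes survival analysis from simply averaging the observed durations.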

- Text analytics fundamentals
Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can then be leveraged in various ways. The analysis and extraction processes take advantage of techniques that originated in computational linguistics (natural language processing), statistics, and other computer science disciplines.
Thank you for attending. I sincerely hope this module will be helpful to you!

The full version will be available soon!
Bibliography & Resources
http://www.cisjournal.org/archive/vol2no4/vol2no4_1.pdf
Hybrid Recommender System Using Naive Bayes Classifier and Collaborative Filtering
http://eprints.ecs.soton.ac.uk/18483/
Online applications : http://www.convo.co.uk/x02/
http://mahout.apache.org/
EMC Data Science & Big Data Analytics Training Module
https://education.emc.com/guest/campaign/data_science.aspx
SAS Official Predictive Modeling Training Course
https://support.sas.com/edu/schedules.html?id=1366&ctry=us
https://support.sas.com/edu/schedules.html?id=1220&ctry=US
Big Data for Dummies by Judith Hurwitz, Alan NUGENT, Dr. Fern Halper, Marcia Kaufman
ISBN : 978-1-118-50422-2 www.wiley.com
Gartner : http://www.gartner.com/it-glossary/big-data/
The Harvard Business Review :
https://hbr.org/2013/12/you-may-not-need-big-data-after-all/ar/1
MapReduce: Simplified Data Processing on Large Clusters (from Google)
http://static.googleusercontent.com/media/research.google.com/fr//archive/mapreduce-osdi04.pdf
Hadoop Apache Foundation
http://hadoop.apache.org/
TDWI : http://tdwi.org/

About Me
I am a freelance consultant who helps organisations leverage their data to improve their performance through the right tool, the right methodology, and the right technology. I have over 3 years of experience and 5 certifications. I am a highly certified SAS Professional and also a certified EMC² Data Scientist.

Contact
Mail : [email protected]
Twitter : @Juvenal_JVC
Linkedin : http://fr.linkedin.com/pub/juv%C3%A9nal-chokogoue/52/965/a8
