Enterprise Use Cases in Big Data Analytics

http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
http://barnraisersllc.com/2012/12/38-big-facts-big-data-companies/
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System, "Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles -SOSP '03 (2003): 29-43.
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation (2004)
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
https://www.cs.rutgers.edu/~pxk/417/notes/content/mapreduce.html
https://spark.apache.org/, https://spark.apache.org/docs/latest/
https://spark.apache.org/docs/latest/mllib-guide.html
http://cloudera.github.io/hue/docs-3.7.0/user-guide/introducing.html
http://ambari.apache.org/
www.pig.apache.org
www.sqoop.apache.org
www.hbase.apache.org
www.zookeeper.apache.org
www.oozie.apache.org
www.flume.apache.org
http://www.sas.com/en_us/insights/analytics/big-data-analytics.html
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
http://www.datameer.com/pdf/eBook-Top-Five-High-Impact-UseCases-for-Big-Data-Analytics.pdf
http://www.informationweek.com/big-data/big-data-analytics/5-big-data-use-cases-to-watch/d/d-id/1251031
http://storm.apache.org/
http://research.google.com/pubs/pub36632.html
http://drill.apache.org/
http://www.informationweek.com/big-data/big-data-analytics/16-top-big-data-analytics-platforms/d/d-id/1113609
http://www.datameer.com/solutions/industries/financial-services.html
Big Data in Financial Services and Banking (Oracle Enterprise Architecture White Paper, Feb 2015)
Big Data in Financial Services and Banking (Oracle Enterprise Architecture White Paper, Feb 2015)
http://www.ibmbigdatahub.com/blog/big-noise-telecommunications-companies
Improving Communications Service Provider Performance with Big Data (Oracle Enterprise Architecture White Paper, April 2015)
http://www.sqlstream.com/blog/2014/06/the-big-data-mindset-customer-experience-driving-streaming-big-data-adoption-in-telecoms/
http://cloudera.github.io/hue/docs-3.7.0/user-guide/introducing.html
https://cwiki.apache.org/confluence/display/Hive/HiveODBC
Big Data in Financial Services and Banking (Oracle Enterprise Architecture White Paper, Feb 2015)

https://developer.yahoo.com/yql
http://www.bollingerbands.com/
http://www.fxkeys.com/how-to-use-bollinger-bands-in-forex-and-stock-trading/
http://www.investinganswers.com/financial-dictionary/investing/compound-annual-growth-rate-cagr-1096
http://www.saasoptics.com/SaaS-opedia/saas-opedia/Churn/churn_definition.html

Ozioma Ikenna Ihekwoaba
BCS MEMBER ID - 990335580
Enterprise Use Cases in Big Data Analytics


A professional IT project submitted to the British Computer Society

In partial fulfillment of the requirements for the award of

POSTGRADUATE DIPLOMA
IN INFORMATION TECHNOLOGY

AUGUST 2015




Table of Contents
Chapter 1: ABSTRACT
Big Data definition
Characteristics of Big Data
Big Data in the business world
Sources of Big Data
Big Data Technologies
Introduction to Apache Hadoop Big Data Ecosystem
Introduction to Hadoop Distributed File System
Introduction to MapReduce
Introduction to Apache Hive
Introduction to Apache Spark
Introduction to Cloudera Hue
Introduction to Apache Ambari
Other tools in the Hadoop Ecosystem
Chapter 2: Literature review
Big Data Analytics Definition
Techniques in Big Data Analytics
Data Science as a profession
Potential Use Cases for Big Data Analytics
Tools for Big Data Analytics
Commercial Big Data Analytics Vendors
Big Data Analytics in Financial Services
Big Data Analytics in Telecoms
Chapter 3: Methodology & REQUIREMENTS SPECIFICATION
Methodology workflow
Overview of use cases
Sample Data Sets
Big Data Environment
Hadoop Server Environment
Client Environment
Analytical Tools Used
Expected results from data analysis
Chapter 4: Big Data Environment Setup
Hadoop Ecosystem Setup using Apache Ambari
Components Available for installation using Apache Ambari
Ambari cluster setup details
Hadoop Node Settings
Hadoop Ecosystem Access using Cloudera Hue
Apache Hive ODBC Access using Microsoft Hive ODBC Driver
Setting up Hive databases for the use cases
Setting up Hive ODBC driver data sources
Chapter 5: Financial Markets Use Case
Use case overview
Big Data Analytics challenge
Enterprise use cases for Acme Trading LLC
Use case workflow
Data sourcing
Downloading Historical Data from Yahoo Finance
Workflow for data loading
Step 1: Create HDFS Directory
Step 2: Load data files into HDFS directories
Step 3: Create Hive tables from uploaded data
Step 4: Create Hive views and summary tables from Hive tables
Workflow for creating use cases
Step 1: Add ODBC datasource
Step 2: Transform the imported data
Use Case 1: Bollinger bands moving averages dashboard
Dashboard usage & Testing
Use Case 2: Investment Analysis Dashboard
Dashboard usage & Testing
Use Case Source Code & Project Data
Chapter 6: Telecommunications Use Case
Introduction
Use case overview
Big Data Analytics challenge
Enterprise use cases for Acme Telecom
Use case workflow
Data sourcing
Workflow for data loading
Step 1: Create HDFS Directory
Step 2: Load data files into HDFS directories
Step 3: Create Hive tables from uploaded data
Workflow for creating use cases
Step 1: Add ODBC datasource
Step 2: Transform the imported data
Use Case 1: Services Revenue Tracking Dashboard
Dashboard usage & Testing
Use Case 2: Call Plan Revenue Tracking Dashboard
Dashboard usage & Testing
Use Case 3: Data Plan Revenue Tracking Dashboard
Dashboard usage & Testing
Use Case Source Code & Project Data
Chapter 7: Data Integration Use Case
Use Case Overview
REST Services overview
REST Service Development tools
REST Service creation workflow
Step 1: Define resources to be exposed
Step 2 & 3: Define resource endpoints and HTTP request methods
Step 4: Create Spring web application structure
REST Service Deployment
REST API usage
Using the REST API to build a dashboard in Excel
Use Case Source Code
Chapter 8: Conclusions and Recommendations
Conclusion
Recommendations
Works Cited





Chapter 1: ABSTRACT

Big Data definition

Big data is defined as data that exceeds the processing capacity of conventional database systems: the data is too big, moves too fast or does not fit the normal client-server data processing architectures used in many organizations. A new approach to data storage and processing is therefore needed to gain value from this data. (O'reilly Media Inc, 2012)
In other words, Big Data is a term used to describe data sets whose size is beyond the capability of current data processing software, such as relational database systems, to process.
A key factor in the definition of Big Data is the classification of the types of data that go into making a "Big Data system". Typically, big data volumes comprise the following:
Structured data: This includes data in formats like JSON, XML and CSV that can easily be interpreted and/or processed by common data processing software such as Microsoft Excel and relational database systems.
Unstructured data: This includes data that does not adhere to any specific format or schema; such data is typically derived from social media posts, log/sensor data etc. Unstructured data is not easily processed and/or interpreted by common data processing software.

Characteristics of Big Data

Three parameters are typically used to further define Big Data.



Figure 1: The three Vs of Big Data (Volume, Variety, Velocity)
Volume
Refers to the massive increase in data sizes, from gigabytes to terabytes to petabytes and beyond.
Variety
Refers to the different structures/formats of big data volumes. Big data typically comes in both structured and unstructured formats.
Velocity
Refers to the rate at which data is generated, streamed and processed. Current batch processing systems are giving way to real-time data streaming platforms powered by big data technologies.


Big Data in the business world

There has been a significant increase in the number of companies employing Big Data technologies as a means by which they can economically manage their ever-growing streams of data and analyze them for strategic decision making. The business environment of today demands that companies respond in real time to changes in prices, competitor operations, public sentiment etc. These are areas not previously catered to by the typical IT operations of most companies. The volume and velocity of the data involved necessitate a Big Data system.
The following statistics, culled from across the Internet, highlight the prevalence of Big Data in today's highly competitive business environments.

What happens in an Internet minute (2014): 80,000 Amazon sales; 50,200 apps downloaded from the Apple App Store; 306 hours of YouTube content uploaded; 67,000 Instagram photos uploaded. (Source: http://www.techspartan.co.uk/features/internet-minute-2013-vs-2014-infographic/)

4.1 million Google searches; 438,801 Wikipedia page views; 23,148 hours of Netflix video watched; 347,222 tweets sent on Twitter. (Source: http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html)

90% of the world's data has been created in the last two years. (Source: http://www.baselinemag.com/analytics-big-data/slideshows/surprising-statistics-about-big-data.html)

$600 billion: potential annual surplus from using personal location data globally. 60%: potential increase in retailers' operating margins possible with Big Data. (Source: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation)

Amazon draws data from 152 million customers' purchases to help users decide on items to purchase. (Source: http://cloudtweaks.com/2015/03/surprising-facts-and-stats-about-the-big-data-industry/)

10 billion mobile devices will be in use by 2020. (Source: http://www.ibmbigdatahub.com/gallery/quick-facts-and-stats-big-data)



The table below, adapted from (Soumendra, Madhu, & Harsha, 2013), outlines the potential of big data to add value across different industries.

Industry | Data volume | Data velocity | Data variety | Under-utilized data | Big Data potential value
Banking & Securities | High | High | Low | Medium | High
Communications & Media Services | High | High | High | Medium | High
Education | Very low | Very low | Very low | High | Medium
Government | High | Medium | High | High | High
Healthcare providers | Medium | High | Medium | Medium | High
Insurance | Medium | Medium | Medium | Medium | Medium
Manufacturing | High | High | High | High | High
Chemicals and Natural Resources | High | High | High | High | Medium
Retail | High | High | High | Low | High
Transportation | Medium | Medium | Medium | High | Medium
Utilities | Medium | Medium | Medium | Medium | Medium


Sources of Big Data

Big data is generated primarily by users, and then by machines ranging from mobile devices and personal computers to on-site and cloud servers. The list below enumerates possible sources of big data (Society, 2014), (Shawant & Shah, 2014):

Social Media e.g. photos, videos, tweets, articles etc.
Click stream data e.g. web browsing patterns, server logs etc.
Operational and transactional data from ERP systems
Data from data aggregators e.g. Bloomberg, Axiom, Reuters etc.
Data from mobile and web adverts
Demographic and economic data
Data from weather forecasting and monitoring systems

Big Data Technologies

Because traditional data management and processing systems, i.e. relational database systems, do not easily handle the massive volumes of data now handled by Big Data systems, new models for data storage and processing have been developed to handle these large data volumes more economically and easily than traditional systems. These new models encompass one or more of the following features (Frampton, 2014):
Distributed and scalable storage of Big Data
Distributed processing and querying capabilities
Distributed computation and streaming capabilities
Collectively, systems having these features are commonly called "NoSQL" or "NewSQL" databases or data stores, where NoSQL is generally understood to mean "Not Only SQL". NoSQL databases are typically classified into the following groups (Shawant & Shah, 2014):
Key-Value Data Stores: Data is stored using keys, which are identifiers for looking up values, i.e. the data associated with the keys. Suitable for shopping carts and session data. Examples: Memcached, Riak, Redis.
Column-oriented Data Stores: Data is stored as a collection of columns. Suitable for random access to huge data sets with multiple hierarchies. Examples: Apache Cassandra, Apache HBase, Amazon SimpleDB.
Document Data Stores: An extension of a key-value store where keys are used to look up values, but the stored values are typically collections of data items stored in a flexible structure. Suitable for real-time logging and analytics. Examples: MongoDB, Apache CouchDB, RavenDB.
Graph Data Stores: Used to model objects and the relationships between objects. Suitable for network modelling and recommendation systems. Examples: Neo4j, HypergraphDB, AllegroGraph.

In addition to the above, the dominant software typically referred to when using the term Big Data is the Apache Hadoop ecosystem of Big Data management and processing tools. This open source software from the Apache Software Foundation encompasses all the aforementioned types of NoSQL data stores as well as the accompanying software used to manage storage, data loading, processing, querying etc. The Apache Hadoop ecosystem is widely recognized as the most deployed Big Data management software in the industry. The focus of this write-up will be on this ecosystem, which is discussed in the next sections.

Introduction to Apache Hadoop Big Data Ecosystem

In 2002, Doug Cutting and Mike Cafarella began working on a project called Nutch; the goal was to crawl the Internet and produce a search engine from the indexed content. This project graduated into a platform called Hadoop, incorporating some of the ideas in two papers released by Google in 2003 and 2004. These papers, "The Google File System" and "MapReduce: Simplified Data Processing on Large Clusters", outlined a new storage model and a new programming model for handling big data volumes. (White, 2014)
Apache Hadoop is a platform comprising multiple components performing different tasks in the storage, processing, querying and management of big data volumes. The primary components are outlined below:
HDFS (Hadoop Distributed File System): designed for storing very large files on commodity hardware, with support for streaming data access.
MapReduce: the programming model for data processing, with inbuilt parallelization.
YARN (Yet Another Resource Negotiator): the distributed computation layer of Hadoop, which manages the cluster of servers running Hadoop and sits between HDFS and top-level data processing tools like Apache Hive and Spark.
Hadoop Ecosystem: tools for data extraction, transformation and loading; for data processing and querying; and for managing distributed computation tasks on big data.

Introduction to Hadoop Distributed File System

HDFS is the primary distributed storage used by Hadoop applications. HDFS in a Hadoop setup typically consists of a cluster of servers. An HDFS cluster consists of one or more NameNodes that manage the file system metadata and DataNodes that store the actual data.
The use cases cited in this project will store data on HDFS, to be accessed and processed by other tools.



Introduction to MapReduce

MapReduce is a programming model for developing applications on top of Hadoop; on the Hadoop platform, MapReduce applications are written in Java. MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on Hadoop clusters in a reliable, fault-tolerant manner.
A MapReduce job consists of two functions, MAP and REDUCE, which work on the input data in sequence: the output of the MAP function is fed to the REDUCE function.
The MAP function reads a stream of data and parses it into intermediate (key, value) pairs. When the MAP phase is done, the REDUCE function is called once for each unique key generated by the MAP function, and is given that key and the list of all values generated for it as parameters.
All data processing tasks run on Hadoop are broken down into MapReduce tasks and then executed on the nodes in the cluster.
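
As a concrete illustration, the listing below is a minimal sketch of the canonical word count job, closely following the word count example in the Apache MapReduce tutorial referenced in this project; class names and input/output paths are illustrative. The MAP function emits a (word, 1) pair for every word it reads, and the REDUCE function sums the counts for each unique word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // MAP: parse each input line into intermediate (word, 1) pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // REDUCE: called once per unique word with the list of counts emitted for it
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}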

Introduction to Apache Hive

Apache Hive is a framework for data warehousing on top of Hadoop. Hive was developed at Facebook and targeted at developers with significant SQL skills. MapReduce involved a lot of boilerplate code and could be tedious to write; Apache Hive made it possible for data analysts to run SQL queries on the huge data volumes stored in HDFS. (White, 2014, Chapter 17)
Hive defines a SQL-like query language called Hive Query Language (HQL), that enables users with SQL knowledge to query data stored in HDFS. Queries written in HQL are converted into MapReduce code by the Hive engine and then executed on the Hadoop cluster. (Sitto & Presser, 2015)
Apache Hive supports the creation of schemas/databases, tables and views on data stored in HDFS. Apache Hive will serve as the data warehousing tool for the use cases cited in this project.
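
In addition to the Hue and ODBC routes used later in this project, Hive can also be queried programmatically. The snippet below is a minimal sketch using the HiveServer2 JDBC driver that ships with Hive; the host name (node 1 of the cluster), the default port 10000, the blank credentials and the example statement are assumptions for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; hive-jdbc and its dependencies must be on the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Host, port (10000 is the HiveServer2 default) and credentials are assumptions
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://linuxserver1.ozzycloud.com:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        // Any HQL statement can be submitted this way; Hive compiles queries into MapReduce jobs
        ResultSet rs = stmt.executeQuery("SHOW DATABASES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}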

Introduction to Apache Spark

Apache Spark is a fast and general purpose engine for large-scale data processing, positioned as a general purpose cluster computing system. Apache Spark provides high-level APIs in the Java, Scala, Python and R programming languages. Spark's main programming abstraction is the Resilient Distributed Dataset (RDD), a collection of data items that can be partitioned in memory across compute nodes and operated upon in parallel.
Apache Spark consists of the following components (Zaharia, 2015):
Spark Core: This contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, IO etc. Spark Core provides APIs for building and manipulating RDDs.
Spark SQL & Data frames: This is Spark's module for working with structured data. Spark SQL allows querying data via SQL or Hive SQL. Apache Spark SQL can query data from a variety of sources, including Apache Hive tables, Apache Parquet, CSV, text files and JSON. The beauty of Spark SQL is that it allows the mixing of SQL queries with manipulation of RDDs in the same code. This makes for complex analytics integration in code with simple programming abstractions easy for developers to use.
Spark Streaming: This is a Spark component that enables the processing of live streams of data. Spark Streaming leverages on Spark Core and enables developers easily manipulate data stored in memory, on disk, or arriving in real time.
Spark GraphX: GraphX is Spark's component for graphs and graph-parallel computation.
Spark MLlib: MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
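
The short sketch below illustrates the RDD abstraction using the Spark 1.x Java API; the HDFS path and the filter condition are hypothetical examples, not part of the project code.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SparkRddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkRddSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Build an RDD from text files in HDFS (hypothetical path); the data is
        // partitioned across the cluster and each partition is processed in parallel
        JavaRDD<String> lines = sc.textFile("hdfs:///data/sample.csv");

        // Transformation: keep only lines mentioning "IBM"; action: count them
        JavaRDD<String> matches = lines.filter(new Function<String, Boolean>() {
            public Boolean call(String line) {
                return line.contains("IBM");
            }
        });
        System.out.println("Matching lines: " + matches.count());

        sc.stop();
    }
}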

Introduction to Cloudera Hue

Hue is a set of web applications that enable you to interact with a Hadoop cluster. Hue applications let you browse HDFS and jobs, manage a Hive metastore, run Hive, Cloudera Impala queries and Pig scripts, browse HBase, export data with Sqoop, submit MapReduce programs, build custom search engines with Solr, and schedule repetitive workflows with Oozie.

Introduction to Apache Ambari

A detailed introduction to Apache Ambari is given on the project site, quoted below.
The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

Ambari enables System Administrators to:
Provision a Hadoop Cluster
Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
Ambari handles configuration of Hadoop services for the cluster.
Manage a Hadoop Cluster
Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
Monitor a Hadoop Cluster
Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
Ambari leverages Ambari Metrics System for metrics collection.
Ambari leverages Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc).
Ambari enables Application Developers and System Integrators to:
Easily integrate Hadoop provisioning, management, and monitoring capabilities to their own applications with the Ambari REST APIs.
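
As a small illustration of that last point, the sketch below calls Ambari's REST endpoint GET /api/v1/clusters to list the clusters a server manages; the host name, the default port 8080 and the default admin/admin credentials are assumptions and would differ in a hardened installation.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.bind.DatatypeConverter;

public class AmbariRestSketch {
    public static void main(String[] args) throws Exception {
        // Ambari server host and port 8080 (the default) are assumptions for this sketch
        URL url = new URL("http://linuxserver1.ozzycloud.com:8080/api/v1/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Ambari's REST API uses HTTP Basic authentication; admin/admin is the out-of-the-box default
        String token = DatatypeConverter.printBase64Binary("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + token);

        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // JSON description of the managed clusters
        }
        reader.close();
    }
}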

Other tools in the Hadoop Ecosystem

Pig: high-level data flow language for processing data.
Sqoop: utility for transferring data between HDFS and relational databases.
HBase: column-oriented database built on top of HDFS, allowing random access to data.
Zookeeper: tool for sharing state and configuration data across nodes in a Hadoop cluster.
Oozie: workflow scheduler used for starting, suspending, stopping and coordinating tasks across jobs in a Hadoop cluster.
Flume: data collection and aggregation tool for feeding data into HDFS.
Avro, Parquet, ORC: file formats optimized for storing, compressing and serializing data.




Chapter 2: Literature review

Big Data Analytics Definition

Big data analytics can be defined as the application of advanced analytic techniques in extracting information or insight from huge volumes of both structured and unstructured data.
The end goal is to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions.
Techniques in Big Data Analytics

Different techniques can be employed in big data analytics, including but not limited to the following (Shawant & Shah, 2014), (Frampton, 2014):
Text analytics
Predictive analytics
Machine learning
Data mining
Natural language processing
Data Science as a profession

The explosion in the volume and speed of data generation has engendered a new role in most companies grappling with huge volumes of data: that of the data scientist. A data scientist is loosely defined as someone with the training and curiosity to make discoveries in the world of big data.
This includes advanced mastery of statistics and probability, the ability to develop and analyze software algorithms and, most importantly, expertise with the current crop of big data technologies.
Potential Use Cases for Big Data Analytics

The following list outlines recognized use cases for big data in enterprises. This recognition is driven by business needs and a changing competitive landscape:
360 degree customer view: Ability to track customers across all available engagement platforms and systems. This dovetails into revenue assurance and market growth initiatives.
Internet of Things (IoT): A phrase coined for the phenomenon of diverse connected devices encountered at home, in the workplace and in public settings. IoT-connected devices provide and consume an ever-growing volume of data.
Data warehouse consolidation and optimization
Efficient consolidation of big data storage and processing
Business capacity planning and optimization
Implementation of fraud and regulation compliance initiatives
Development of data driven products and services

Tools for Big Data Analytics

Popular tools for big data analytics include but are not limited to:
Apache Hadoop
Apache Hadoop has revolutionized data storage and processing with the introduction of tools that enable the use of commodity servers to store, process and analyze huge volumes of data.
MapReduce
MapReduce is the primary programming framework for developing applications on Apache Hadoop
Cloudera Hue
Cloudera Hue provides a web based UI for accessing data in various big data repositories like Apache Hive, HDFS, Cloudera Impala, Apache Solr and Apache Spark. In addition, Hue offers a graphical and intuitive environment for building data tables, charts and graphs from queries run on big data stores. Results of queries can be exported also.
Apache Spark
Apache Spark is positioned as a replacement for MapReduce. Owing to the inefficiencies reported in production usage of MapReduce, Apache Spark has gained widespread adoption on the strength of its simpler and faster programming model.
Apache Storm
Apache Storm is a real time distributed computation system. Storm was designed to reliably process unbounded streams of data in real time. This contrasts with the batch processing Hadoop framework.
Google Dremel
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds.
Apache Drill
Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL and cloud storage. Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores.
Microsoft Excel
Excel needs no introduction, as it is the primary data analysis tool for the majority of users all over the world.
Microsoft Power BI
Power BI Desktop is a self-service tool for data analysis and business intelligence offered by Microsoft. Power BI is offered both as an online service and as a desktop-based tool.

Commercial Big Data Analytics Vendors

The following vendors have been identified as leaders and innovators in big data technologies. They all have competing and complementary products in the big data analytics space.
Cloudera: CDH open source Hadoop distribution
Hortonworks: Hortonworks Data Platform
Amazon Web Services: Elastic MapReduce, Kinesis, Redshift
Actian: Actian Matrix
HP: Vertica
IBM: DB2, Netezza
Pivotal: Greenplum
MapR: MapR M3, M5 and M7 Hadoop distributions
Microsoft: SQL Server 2012 Parallel Data Warehouse, Azure HDInsight Hadoop distribution

Big Data Analytics in Financial Services

Identified areas where financial institutions are making use of Big Data Analytics include compliance and regulation, customer segmentation and fraud detection.
Other areas include risk management, broker and trade monitoring and social media sentiment monitoring for brand protection.
The benefits and opportunities that can be derived from implementing Big Data Analytics include:
Utilizing Big Data technologies like Hadoop and Spark for massively parallel computations
Improving the execution and cost-effectiveness of compute-intensive workloads like risk simulation and credit scoring
Using machine learning to increase the effectiveness of fraud and money-laundering detection systems
Correlating customer transactional data with social media insights, sentiments and trends to better offer products matching customers' needs
Identifying customers with long-term profitability potential

Big Data Analytics in Telecoms

Telecom companies generate massive volumes of data from the following sources:
Billions of call data records. Call data records (CDR) are the telecom industry standard for capturing service usage information about calls, SMS etc.
Social media broadband usage
Terabytes of location data generated every day
Big Data challenges for telecom companies amongst others are outlined below:
Respond to changing customer demand and offer superior customer service, thus preventing customer churn
Maximize product and services revenue and increase market share
Align product offerings and plans to customer usage and behavior
Understand the nature of poor customer experience such as slow internet access, poor VOIP voice quality, dropped calls and network congestion
Big Data comes into the mix in the areas of storage, processing and analytics. In analytics, the end goal is the ability to drive real-time business decisions with operational data in the following areas:
Customer experience: Real-time 360 degree view of customers, combining call data, location data, broadband usage, network, device and service data
Real time fraud monitoring of live data streams across different service offerings



Chapter 3: Methodology & REQUIREMENTS SPECIFICATION

This project employs use cases to demonstrate the ability to use Big Data technologies in analyzing business data.
The methodology adopted in this project is informed by the need to present practical solutions to data analysis challenges.
The primary methodology adopted is sample data analysis. The data samples used are obtained from different sources. Typical scenarios mirroring what is found in real-world companies are used to model the different use cases in this project.
The emphasis of this project is on the practical usage of big data technologies in analyzing business data, and thus on achieving results that enhance business value for organizations. Such results could be a factor of business insight gained or of competitive advantage derived from using newer and better technologies.
The end results are as follows:
An appreciation of how big data technologies could be efficiently used in storing and analyzing data for the purpose of enhancing business value for organizations,
A demonstration of possible use cases showing how big data technologies could be employed in analyzing business data,
Enabling data scientists in organizations to take advantage of the latest technologies in analyzing data efficiently.

Methodology workflow

The sample data analysis follows the steps below; the following sections describe each step.
Set up the big data environment
Import the sample dataset into HDFS or an accessible file system
Access the sample dataset using an analytical tool
Run queries or computations on the sample dataset
Output desired or simulated results demonstrating the use case in point
Draw conclusions from analysis of the sample dataset in a big data environment

Overview of use cases

A use case in this context is defined as an identified pattern of interaction between an enterprise software system and end users.
There are a total of 3 use cases in this project. The details are given below.
1. Financial Markets Use Case: Historical stock price analysis for selected securities on the NYSE
2. Telecommunications Use Case: Churn analysis to track call targets for a hypothetical telecom company
3. Data Integration Use Case: A Java REST service exposing data stored in Apache Hive tables to end users in JSON and XML formats

Sample Data Sets

1. NYSE-listed securities' historical stock prices (Financial Markets use case): intraday stock prices of 100 top securities on the NYSE. Source: Yahoo Finance API, http://ichart.finance.yahoo.com
2. Generated call data records (Telecommunications use case): dataset containing service usage statistics of a telecom provider's customers. Source: generated data based on real-life schemas; all data is entirely fictitious and the code for generating the data is provided.
3. NYSE-listed securities' historical stock prices (Data Integration use case): the same Yahoo Finance dataset as in the Financial Markets use case, republished through a REST service.

Big Data Environment

The architecture employed for this project is a client-server environment, where multiple clients connect to one or more servers hosted on the same or external networks. In the scenario for this project, a 3-node CentOS Linux cluster running in VirtualBox is used to run Apache Hadoop in distributed mode. The 3-node cluster is set up using Apache Ambari.
The Hadoop setup comprises the server environment, while the client environment consists of Microsoft Excel 2013, Microsoft Power BI Desktop and a Java web service servlet container. The operating system and hardware details are given below.


Hadoop Server Environment

Server configuration: 2 GB RAM, 25 GB disk, 2-CPU virtualized server running in VirtualBox 4.3
Operating system: CentOS 6.3 x64
Apache Hadoop: version 2.6, using default options with no optimizations
Apache Spark: versions 1.3.1 and 1.4, using default options with no optimizations
Apache Hive: version 1.2, using default options with no optimizations; MySQL metastore
MySQL metadata store for Apache Hive: version 5.5
Cloudera Hue: version 3.9
Java: Oracle JDK 1.7.0_51 64-bit
Number of instances: 3

Client Environment

Client hardware specs: 16 GB RAM, 1 TB disk, 4-CPU Dell Inspiron laptop
Operating system: Microsoft Windows 7 64-bit
Apache Spark: versions 1.3.1 and 1.4, using default options with no optimizations
Microsoft Power BI Desktop: version 2.25.4095.554 (formerly Power BI Designer Preview)
Microsoft Excel: version 2013
Servlet container for Java web service: Oracle GlassFish 4.0
Java: Oracle JDK 1.7.0_51 64-bit

Analytical Tools Used

Microsoft Excel 2013
Microsoft Power BI Desktop
Apache Spark 1.3.1+
Cloudera Hue 3.9










Expected results from data analysis

1. Financial Markets use case: A dashboard showing technical analyses of stock prices over different periods of interest. The analysis is restricted to observing trends and calculating variances; the employed statistical techniques include calculating Bollinger Bands for the stock closing prices using Hive SQL. A dashboard showing the trends of the moving averages and the upper and lower Bollinger band values for selected securities is displayed in Power BI. A second dashboard demonstrates the ability to run investment analysis on data stored in Big Data repositories.
2. Telecommunications use case: A dashboard analyzing subscriber usage targets for call services, data services and SMS services across different locations. Analysis of targets missed, met or exceeded for a particular time period is expected to help telecom marketers and planners analyze customer churn. The dashboard demonstrates the ability to use big data technologies to store and query the massive volume of call data records (CDRs) generated on a daily basis.
3. Data Integration use case: A Java REST service publishing historical stock dividend payments and prices from data stored in HDFS. XML and JSON endpoints are made available for content negotiation. Finally, a Power View dashboard incorporating data from the REST service is built to demonstrate the ability to publish big data stored in Hadoop to external or internal consumers.





Chapter 4: Big Data Environment Setup

The Big Data environment for the use cases in this project is set up using Apache Ambari. Ambari makes it easy to set up the diverse components that make up an operational Big Data environment.
Hadoop Ecosystem Setup using Apache Ambari

Apache Ambari 2.1.0 will be installed for this project from public repositories. A detailed installation guide is available at https://cwiki.apache.org/confluence/display/AMBARI/Install+Ambari+2.1.0+from+Public+Repositories
Apache Ambari's greatest feature is its ability to manage the installation, configuration, integration, monitoring and startup/shutdown of the diverse components that make up the Hadoop ecosystem.

Components Available for installation using Apache Ambari

The following components, amongst others, will be set up using Ambari:
HDFS: 2.7.1.3
YARN + MapReduce2: 2.7.1.3
Hive: 1.2.1.2.3
Oozie: 4.2.0.2.3
ZooKeeper: 3.4.6.2.3
Tez: 0.7.0.2.3



The screenshot below shows the components to be installed in the Ambari web interface.

















Ambari cluster setup details

A 3-node cluster will be set up for the Hadoop big data environment. The node details are given below.




Hadoop Node Settings

Node 1: fully qualified domain name linuxserver1.ozzycloud.com, IP address 169.254.160.201
Node 2: fully qualified domain name linuxserver2.ozzycloud.com, IP address 169.254.160.202
Node 3: fully qualified domain name linuxserver3.ozzycloud.com, IP address 169.254.160.203
Each node runs CentOS 6.3 64-bit with 2 GB RAM, a 25 GB HDD and 2 CPU cores.



The screenshots below show the node setup in Ambari and the Hadoop components installed on each node.













Hadoop Ecosystem Access using Cloudera Hue

Hue is a set of web applications that enable you to interact with a Hadoop cluster. Hue applications let you browse HDFS and jobs, manage a Hive metastore, run Hive, Cloudera Impala queries and Pig scripts, browse HBase, export data with Sqoop, submit MapReduce programs, build custom search engines with Solr, and schedule repetitive workflows with Oozie.
The screenshot below shows the Hue interface with installed Hadoop component client applications.

The main applications used in the Hue interface are the Hive query editor and the HDFS file browser. The Hive query editor is a web-based client used for accessing Hive, creating Hive tables, reading from them etc. The HDFS file browser is used for uploading files to and downloading files from HDFS.
The screenshots below show the two interfaces.




Apache Hive ODBC Access using Microsoft Hive ODBC Driver

Microsoft Excel and Microsoft Power BI Desktop will be used to connect to the Apache Hive data warehouse. To enable this connection, these tools need an appropriate interface for accessing Hive tables, which is provided by a Hive ODBC driver.
The Hive ODBC Driver is a software library that implements the Open Database Connectivity (ODBC) API standard for the Hive database management system, enabling ODBC compliant applications to interact seamlessly (ideally) with Hive through a standard interface.
The ODBC driver used in this project is the Microsoft Hive ODBC driver, available at http://go.microsoft.com/fwlink/?linkid=286698&clcid=0x409.

Setting up Hive databases for the use cases

The Hive databases for the use cases are outlined below:
Database name
Use case
stockdb
Financial markets
churndb
Telecommunications

Setting up Hive ODBC driver data sources

After installation of the Microsoft Hive ODBC driver, a Data Source Name ("DSN") needs to be created using the ODBC Data Source Administrator to enable applications to access the database configured in the DSN. The screenshot below shows the Hive ODBC configuration for the churndb database.





Chapter 5: Financial Markets Use Case

Use case overview

Acme Trading LLC is a fictional hedge fund that specializes in securities trading and analysis. Acme Trading buys and sells securities on various stock exchanges on behalf of private and institutional clients. These securities range from treasury bonds, stock market shares to commodities.
Big Data Analytics challenge

The volume of data handled by financial institutions has been increasing due to the number of financial products available in the international financial markets, and also the broad reach of the markets to many traders in different countries. (DURENARD, 2013)
With the introduction of high frequency trading, algorithmic trading and the explosion in derivatives products, massive data volumes are crunched by quantitative analysts and financial engineers to gain the slightest insight into the state of the markets.
It is a given that speed rules in today's world of electronic trading, and to achieve and maintain high levels of accurate, speedy processing of the incoming data streams broadcast in the financial markets, a big data approach is needed.
Enterprise use cases for Acme Trading LLC

In this use case, 2 dashboards will be built to demonstrate the ability to use Big Data technologies to analyze Acme Trading's financial and investment data. The focus of the dashboards will be on the following:
Monitoring of historical asset prices
View at a glance historical trends in asset prices
Analyze historical data to drive investment decisions and portfolio construction
The dashboards that will be created to address the cases above are listed below:
Bollinger Bands dashboard
Investment Analysis dashboard
Note that the focus of the analysis here is not on asset price prediction, but on the analysis of historical trading data, with the goal of gaining insights that drive investment decisions.
Use case workflow

Load financial market data from sources i.e. Yahoo Finance, clean up and format data into CSV, JSON or XML format
Load market data into HDFS and create Hive tables pointing to the HDFS folders
Create views, aggregations and summaries of Hive tables for reporting purposes
Load data into Microsoft Power BI Desktop using Hive ODBC driver
Transform loaded data by adding and removing columns etc.
Create Bollinger Bands dashboard
Create Investment Analysis dashboard
Data sourcing

The data to be used for the financial markets use case is a set of market data from the New York Stock Exchange (NYSE), the data includes daily prices of traded securities and quarterly dividend payments.
A selection of 100 stocks on the New York Stock Exchange is used for this chapter. The CSV file containing the selected stocks is included with the Java source code on GitHub given in the next section.

Downloading Historical Data from Yahoo Finance

The YQL (Yahoo! Query Language) platform enables you to query, filter, and combine data across the web through a single interface. It exposes a SQL-like syntax that is both familiar to developers and expressive enough for getting the right data.
The screenshot below shows a preview of the data download from the YQL Console.


YQL exposes several tables for querying. The tables of interest here are the following:
yahoo.finance.historicaldata: This table contains historical prices of securities.
yahoo.finance.dividendhistory: This table contains historical dividend payments of securities.
These tables are also exposed through REST endpoints. I have developed a Java application called yahooimport to download data from the two tables above. The Java project is hosted on GitHub at https://github.com/ozzioma/YahooImport.
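
For illustration, the sketch below shows one way such a download could be issued against the public YQL REST endpoint; it is not the actual yahooimport code (which is on GitHub), and the ticker symbol, date range and output handling are example values only.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class YqlDownloadSketch {
    public static void main(String[] args) throws Exception {
        // Example YQL query against the community finance table (symbol and dates are illustrative)
        String yql = "select * from yahoo.finance.historicaldata "
                + "where symbol = \"IBM\" and startDate = \"2014-01-01\" and endDate = \"2014-12-31\"";
        String endpoint = "https://query.yahooapis.com/v1/public/yql?q="
                + URLEncoder.encode(yql, "UTF-8")
                + "&format=json&env=" + URLEncoder.encode("store://datatables.org/alltableswithkeys", "UTF-8");

        BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(endpoint).openStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // raw JSON payload; a real importer would parse this and write CSV files
        }
        reader.close();
    }
}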
The screenshots below, taken in Excel, show a preview of the downloaded data from both tables.

Downloaded data from yahoo.finance.historicaldata showing IBM's intraday prices on the NYSE for 2014. Total rows is 252.

Downloaded data from yahoo.finance.dividendhistory showing Exxon Mobil's dividend payments from 1970 till date.
Workflow for data loading

Create directory in HDFS to store imported data files
Import data files into HDFS directory
Create Hive tables pointing to HDFS directory in step 2
Create views and summary tables

Step 1: Create HDFS Directory

HDFS directories organize the files whose underlying storage blocks are distributed across the Hadoop cluster. HDFS directories can be created from the command line using the HDFS client or through graphical tools such as Hue.
The required directories are given below:
Parent directory /marketdata
Directory /marketdata/prices for storing downloaded stock prices CSV files
Directory /marketdata/dividends for storing downloaded dividend payments CSV files
The HDFS client command line commands for creating the above directories are shown below:
hadoop fs -mkdir /marketdata
hadoop fs -mkdir /marketdata/prices
hadoop fs -mkdir /marketdata/dividends




Step 2: Load data files into HDFS directories

Data can be loaded into HDFS through the command line HDFS client or through user interfaces such as Hue.





The data for the stock prices is best loaded into HDFS from the command line. The script below describes this.
hadoop fs -copyFromLocal /media/sf_Datasets/yahoofinance/prices/csv/*.* /marketdata/prices
The above command loads all the files in the /media/sf_Datasets/yahoofinance/prices/csv folder into the /marketdata/prices HDFS folder. A total of 2,589 CSV files are copied into HDFS.
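
The same two steps can also be performed programmatically with the HDFS Java API. The sketch below uses the directory names created above; the local file name is a hypothetical example, and the code assumes it runs where the cluster's Hadoop configuration files are on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoadSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the Hadoop configuration files on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Step 1: create the target directories (no-op if they already exist)
        fs.mkdirs(new Path("/marketdata/prices"));
        fs.mkdirs(new Path("/marketdata/dividends"));

        // Step 2: copy a downloaded CSV file (hypothetical file name) from the local
        // file system into HDFS; the -copyFromLocal command above copies all files at once
        fs.copyFromLocalFile(new Path("/media/sf_Datasets/yahoofinance/prices/csv/IBM.csv"),
                new Path("/marketdata/prices"));

        fs.close();
    }
}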

Step 3: Create Hive tables from uploaded data

The scripts below show the Hive query language commands to create the database and associated tables for loading the NYSE stock market data.
--create database
create database stockdb location '/user/hadoop/stockdb';

--create stock prices table
create external table stockdb.prices (adjClose float,close float,tradedate date,high float,low float,open float,symbol string,volume int)
row format delimited fields terminated by ',' stored as textfile location '/marketdata/prices'
tblproperties ("skip.header.line.count"="1");

--create dividend payments table
create external table stockdb.dividends(paydate date, dividends float,symbol string)
row format delimited fields terminated by ',' stored as textfile location '/marketdata/dividends'
tblproperties ("skip.header.line.count"="1");


The screenshot below shows the scripts being executed using the Hive Query Editor in Hue.


The next screenshot shows the intraday stock prices for IBM for the year 2014.




Step 4: Create Hive views and summary tables from Hive tables

The following views and summary tables are needed to summarize the data contained in the Hive tables. The details of the views are given below:
--summary view for prices table
create view stockdb.pricesmasterview as
select row_number() over (order by symbol, tradedate) as rowid,
concat(symbol,year(tradedate),month(tradedate),day(tradedate)) as quoteid,
symbol,open,high,low,close,adjclose,volume,tradedate,
year(tradedate) as tradeyear,
month(tradedate) as trademonth,
weekofyear(tradedate) as tradeweek,
day(tradedate) as tradeday,
date_add(tradedate,1 - day(tradedate)) firstdate,
day(date_add(tradedate,1 - day(tradedate))) firstday,
last_day(tradedate) lastdate,
day(last_day(tradedate)) lastday

from stockdb.prices

--summary table for adding details to prices table
create table stockdb.pricesmaster
(rowid int,quoteid string,
symbol string,open float,high float,low float,close float,adjclose float,volume int,
tradedate date, tradeyear int,trademonth int, tradeweek int,
tradeday int,firstdate string,firstday int, lastdate string,lastday int)
row format delimited fields terminated by ',' stored as textfile location '/marketdata/summaries_year'

--populate summary prices table
insert overwrite table stockdb.pricesmaster
select * from stockdb.pricesmasterview

--summary table for yearly growth rate dashboard
create table stockdb.yearlyprices
(symbol string,open float,high float,low float,close float,adjclose float,volume int,tradedate date)
row format delimited fields terminated by ',' stored as textfile location '/marketdata/prices_year';

--populate summary table
insert overwrite table stockdb.yearlyprices
select d.symbol,d.open,d.high,d.low,d.close,d.adjclose,d.volume,d.tradedate from
(
SELECT ROW_NUMBER() OVER(partition by symbol,tradeyear order by tradedate asc) as yearid,
symbol,open,high,low,close,adjclose,volume,tradedate

FROM pricesmaster
) as d where d.yearid = 1

--summary table for growth rate dashboard
create table stockdb.yearlyreturns
(symbol string,tradedate string,adjclose float,prevadjclose float,returnpct float)
row format delimited fields terminated by ',' stored as textfile location '/marketdata/returns_year';

--populate summary table
insert overwrite table stockdb.yearlyreturns
SELECT a.symbol,a.tradedate,a.adjclose, b.adjclose,
cast(((a.adjclose- b.adjclose)/b.adjclose) as decimal(18,4))

from stockdb.yearlyprices a, stockdb.yearlyprices b
where (a.symbol=b.symbol) and (year(a.tradedate)=year(b.tradedate)+1)

--summary table for bollinger bands dashboard
create table stockdb.bollinger20(
rowid int,quoteid string,symbol string,tradedate string,tradeyear int,trademonth int,
open float, high float,low float,close float,adjclose float,
ma20 float,upperbollinger float,lowerbollinger float)
row format delimited fields terminated by ',' stored as textfile location '/marketdata/bollinger_daily';

--populate summary table
insert overwrite table stockdb.bollinger20
SELECT
rowid,quoteid,symbol
,tradedate,tradeyear,trademonth,open,high,low,close,adjclose
--20 days moving average
,AVG(close) OVER (PARTITION BY symbol ORDER BY rowid ROWS 19 PRECEDING)
--upper bollinger band
,AVG(close) OVER (PARTITION BY symbol ORDER BY rowid ROWS 19 PRECEDING)
+ (stddev_pop(close) OVER (PARTITION BY symbol ORDER BY rowid ROWS 19 PRECEDING) * 2)
--lower bollinger band
,AVG(close) OVER (PARTITION BY symbol ORDER BY rowid ROWS 19 PRECEDING)
- (stddev_pop(close) OVER (PARTITION BY symbol ORDER BY rowid ROWS 19 PRECEDING) * 2)

FROM stockdb.pricesmaster
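
In the query above, the window frame ROWS 19 PRECEDING covers the current row plus the 19 rows before it for the same symbol, ordered by rowid, i.e. a rolling 20-trading-day window. The upper and lower bands are this 20-day average plus and minus two standard deviations computed over the same window.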













The screenshot below shows the output from the Bollinger bands summary table for Exxon Mobil (XOM).


Workflow for creating use cases

The use cases in this chapter will be created using Microsoft Power BI Desktop. The workflow is outlined below
Add an ODBC data source to a new report in Power BI; this connects to the previously configured Hive ODBC DSN, from which one or more Hive tables are selected. The required tables in this use case are the yearlyreturns and bollinger20 summary tables from the stockdb database.
Transform the imported tables by adding new columns, changing column data types etc.
Create the dashboards by dragging visual chart elements from the toolbox and adding fields to them as needed







Step 1: Add ODBC datasource

The screenshot below shows the blank report that the ODBC DSN will be added to. Next are the steps involved in adding the Hive tables from the ODBC DSN.


The DSN to be added to the connection string box is specified by typing in: dsn=stockdb.



The bollinger20 and yearlyreturns tables are selected from the left panel. Click on edit to import the summary tables and transform the imported data.


Step 2: Transform the imported data

The screenshot below shows the Power BI Query Editor after importing the two tables.


The following changes need to be made to the imported yearlyreturns table.
Change the data type of the tradedate column from text to Date
Add a new column called Year extracted from the tradedate column
When done, the transformed table looks like the one in the below screenshot.

Four new measures and a calculated column need to be added to the yearlyreturns table. The calculated column is called "YearlyReturnFlag". The new measures will be used in calculating the annual growth rate of each stock in the table. The code for adding the column and measures is given below:
YearlyReturnFlag = IF([returnpct] >= 0,"Upward trend","Downward trend")

BeginningAdjClose = CALCULATE(SUM(yearlyreturns[adjclose]),FIRSTNONBLANK(yearlyreturns[tradedate],CALCULATE(SUM(yearlyreturns[adjclose]))) )

EndingAdjClose = CALCULATE(SUM(yearlyreturns[adjclose]),LASTNONBLANK(yearlyreturns[tradedate],CALCULATE(SUM(yearlyreturns[adjclose]))) )

TotalYears = COUNTROWS(yearlyreturns)

CAGR = POWER([EndingAdjClose]/[BeginningAdjClose],(1/[TotalYears]))-1
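
In these definitions, BeginningAdjClose and EndingAdjClose use FIRSTNONBLANK and LASTNONBLANK over tradedate to pick the adjusted closing price for the earliest and latest trading dates in the current filter context, TotalYears counts the number of yearly rows for the selected stock, and CAGR applies the compound annual growth rate formula described under the Investment Analysis dashboard below.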

Use Case 1: Bollinger bands moving averages dashboard

In this use case, a dashboard analyzing the 20-day moving averages of selected stocks is built. The moving averages are summarized using a technical trading tool called Bollinger Bands. A definition of Bollinger Bands is given below:
Bollinger Bands are a technical trading tool created by John Bollinger in the early 1980s. They arose from the need for adaptive trading bands and the observation that volatility was dynamic, not static as was widely believed at the time.

The purpose of Bollinger Bands is to provide a relative definition of high and low. By definition prices are high at the upper band and low at the lower band. This definition can aid in rigorous pattern recognition and is useful in comparing price action to the action of indicators to arrive at systematic trading decisions.
Bollinger Bands consist of a set of three curves drawn in relation to securities prices. The middle band is a measure of the intermediate-term trend, usually a simple moving average, that serves as the base for the upper band and lower band. The interval between the upper and lower bands and the middle band is determined by volatility, typically the standard deviation of the same data that were used for the average. The default parameters, 20 periods and two standard deviations, may be adjusted to suit your purposes.
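
In formula terms, with the default 20-period setting used in this project: the middle band is the 20-day simple moving average (SMA) of the closing price, the upper band is the SMA plus two standard deviations of the same 20 closing prices, and the lower band is the SMA minus two standard deviations. These correspond to the ma20, upperbollinger and lowerbollinger columns computed in the bollinger20 summary table earlier.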
Typical uses of Bollinger Bands include:
Trend trading for stock and forex
Reversal trading
The screenshot below shows a preview of the Bollinger bands dashboard created in this use case.


The screenshots that follow show the steps taken to create the Bollinger Bands Dashboard
Drag the Slicer visualization to the workspace and add the symbol column from the bollinger20 table to its property box, as shown in the first screenshot below
Drag the Line Chart visualization to the workspace and add the tradeyear column to the Shared Axis property box, and then add the ma20, close, upperbollinger and lowerbollinger columns to the Values property box. The dashboard now looks like the second screenshot below.











The final Bollinger Bands dashboard is shown below:
















Dashboard usage & Testing

The screenshot below shows the Bollinger bands chart for ticker symbol AIG (American International Group) intraday prices from 1980 till date. Typical usage of this chart is in day trading the listed securities of AIG on a stock exchange.


The next screenshot shows Bollinger bands chart for ticker symbol KO (Coca Cola) intraday prices from 1980 till date.



Use Case 2: Investment Analysis Dashboard

The compound annual growth rate (CAGR) is a useful measure of growth over multiple time periods. It can be thought of as the growth rate that gets you from the initial investment value to the ending investment value if you assume that the investment has been compounding over the time period.
The formula for CAGR is:
CAGR = (EV / BV)^(1/n) - 1
Where:
EV = Investment's ending value
BV = Investment's beginning value
n = Number of periods (months, years, etc.)
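For example, an investment that grows from $1,000,000 to $2,000,000 over 10 years has a CAGR of (2,000,000 / 1,000,000)^(1/10) - 1, which is approximately 7.18% per year.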
The above formula is calculated in Power BI Desktop from the yearlyreturns table using measures. The code earlier given is reproduced below:

BeginningAdjClose = CALCULATE(SUM(yearlyreturns[adjclose]),FIRSTNONBLANK(yearlyreturns[tradedate],CALCULATE(SUM(yearlyreturns[adjclose]))) )

EndingAdjClose = CALCULATE(SUM(yearlyreturns[adjclose]),LASTNONBLANK(yearlyreturns[tradedate],CALCULATE(SUM(yearlyreturns[adjclose]))) )

TotalYears = COUNTROWS(yearlyreturns)

CAGR = POWER([EndingAdjClose]/[BeginningAdjClose],(1/[TotalYears]))-1

Two other measures are added to the yearlyreturns table; these measures calculate the expected value of a $1,000,000 investment in each selected stock in our dashboard, compared to, say, a fixed-rate savings account offering 2.85%. The code to calculate these measures is given below:
SavingsRate = 0.0285

SavingsReturns = 1000000*(POWER(1+[SavingsRate],[TotalYears]))

StockReturns = 1000000*(POWER(1+[CAGR],[TotalYears]))


The SavingsReturns measure calculates the hypothetical returns on an investment in a fixed-rate savings account offering 2.85%, while the StockReturns measure uses the selected stock's annual growth rate (CAGR) to calculate the projected returns on a $1,000,000 investment over the period the stock has been listed on the NYSE.
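As a quick sanity check, with TotalYears equal to roughly 35 (the approximate 1980 to 2015 span of the price data), SavingsReturns evaluates to 1,000,000 x 1.0285^35, or about $2.67 million, which matches the fixed-rate figure shown for the GE example in the next section.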
The screenshots that follow show the steps taken to create the CAGR Dashboard:
Add a new page to the report as shown below
Drag the Slicer visualization to the workspace and add the symbol column from the yearlyreturns table to its property box, as shown in the first screenshot
Next, add the Table visualization to the workspace and add the symbol, TotalYears and EndingAdjClose columns to the Values property box.
Drag the Column Chart visualization to the workspace and add the Year column to the Shared Axis property box, add the YearlyReturnFlag column to the Legend property box, and finally add the returnpct column to the Values property box.
Drag the Area Chart visualization to the workspace and add the Year column to the Shared Axis property box, and then add the returnpct column to the Values property box. The dashboard now looks like the second screenshot below.
Next add 3 Card visualizations to the workspace and add the CAGR, SavingsReturns and StockReturns measures to them respectively. The final dashboard is shown in the third screenshot.





Dashboard usage & Testing

The screenshot below shows that a $1,000,000 investment in GE (General Electric) stock, at an annual growth rate of 12.08%, might have grown to $54.15 million, compared with a mere $2.67 million at the hypothetical fixed savings rate of 2.85%.



The next screenshot shows that the same $1,000,000 invested in ORCL (Oracle Corp) stock, at an annual growth rate of 25.54%, might have grown to $732.08 million, compared with a mere $2.26 million at the hypothetical fixed savings rate of 2.85%.



Use Case Source Code & Project Data

The source code for the Yahoo Finance API data import project is available on GitHub at https://github.com/ozzioma/YahooImport. A copy is also included with the accompanying disk.
The Power BI Desktop file for this use case (named Financial Dashboard.pbix) is included in the accompanying disk, located in the "Financial Markets use case" folder.
The SQL file for creating the Hive schema (financial schema.sql) is also included in the folder
The downloaded data from Yahoo Finance is also included in the use case folder.


Chapter 6: Telecommunications Use Case

Introduction

This use case is modelled on the operations of telecom companies in Nigeria, and all the data used is entirely fictitious. This is mandated by the telecom companies' need to protect the privacy of their customers. An attempt has been made to generate data similar to what may be obtainable in the actual operations of the telecom companies.
Use case overview

Acme Telecom is a fictional mobile services provider specializing in VOIP and mobile switching. Acme Telecom currently handles millions of voice calls a day for over 20 million customers, and operates in states across the USA and in Canada.
Acme Telecom provides 4 different types of services to end customers. These services are:
Voice (Call) services, which are grouped into
Outgoing calls (Mobile originated calls or MOC)
Incoming calls (Mobile terminated calls or MTC)
Text (SMS) services, which are grouped into
Incoming SMS (Mobile terminated SMS or SMS-MT)
Outgoing SMS (Mobile originated SMS or SMS-MO)
Data (Broadband) services
3G data services
LTE data services
For each type and group of service, Acme Telecom has a rate. The rates are listed below:
Service | Service type | Description | Rate (to end user)
MOC | Voice | Outgoing call | 15 cents/second
MTC | Voice | Incoming call | 0 cents/second
SMS-MO | Text | Outgoing SMS | 30 cents/160 characters
SMS-MT | Text | Incoming SMS | 0 cents
3G | Data | Broadband data session | 12 cents/kilobyte
LTE | Data | Broadband data session | 16 cents/kilobyte



Acme Telecom also has a tiered service delivery approach: for each type of service offered, different service plans are available for different classes of customers. The service plans are listed below:
Voice (Call) plans: Starter, Silver, Gold, Platinum, Elite
Data (3G and LTE) plans: Eazy Surf, Quick Surf, Biz Surf

Big Data Analytics challenge

Acme Telecom collects service usage data on the fly for billing, reconciliation and reporting purposes. The volume of data has grown exponentially over the years with the introduction of data services and the subscription of new customers. A key focus for telecom companies is the reduction of customer churn, where churn can be defined as the rate at which customers are lost over a given period (for example, the number of customers lost during a quarter divided by the number of customers at the start of that quarter). To enable telecom companies to monitor and reduce churn, the ability to store, process and analyze the huge volumes of data generated daily by telecom operations is an implicit requirement in today's telecom industry.
Enterprise use cases for Acme Telecom

In this use case, 3 dashboards will be built to demonstrate the ability to use Big Data technologies to analyze Acme Telecom's operational service data. The focus of the dashboards will be on the following:
Monitoring of trends in service usage and adoption
View at a glance increases and decreases in service usage by customers across locations, years and months
View at a glance increases and decreases in service and product revenue across locations, years and months
The dashboards that will be created to address the 3 cases above are listed below:
Services Revenue Tracking dashboard
Call Plan Revenue Tracking dashboard
Data Plan Revenue Tracking dashboard
Note that the focus of the analysis here is not on churn prediction but on the analysis of historical service usage data, with the goal of gaining a better understanding of historical churn across locations, periods and service plans.

Use case workflow

Load telecom operational data from its sources; in this case the source is generated CDR data
Load the CDR data into HDFS and create Hive tables pointing to the HDFS folders
Create views, aggregations and summaries of Hive tables for reporting purposes
Load data into Microsoft Power BI Desktop using the Power Query ODBC datasource
Create Services Revenue Tracking dashboard
Create Call Plan Revenue Tracking dashboard
Create Data Plan Revenue Tracking dashboard

Data sourcing

I have developed a Java application called cdrgenerator to generate sample CDR data. The Java project is hosted on GitHub at https://github.com/ozzioma/cdrgenerator.
A sample set of 30,000 fictitious customers is used for the data generation. The CSV file containing these generated customers is included with the Java source code on GitHub, referenced in the source code section at the end of this chapter.
The screenshot in Excel below shows a preview of the generated customer data.


The screenshot below shows a preview of the generated CDR data. A total of 1,840,218 CDRs is generated for this use case.


Workflow for data loading

Create directory in HDFS to store imported data files
Import data files into HDFS directory
Create Hive tables pointing to HDFS directory in step 2

Step 1: Create HDFS Directory

Files stored in HDFS are split into blocks that are distributed across the Hadoop cluster; HDFS directories organize these files within the HDFS namespace. Directories can be created from the command line using the HDFS client or through graphical tools such as Hue.
The required directories are given below:
Parent directory /churn
Directory /churn/customers for storing the generated customer CSV files
Directory /churn/log for storing the generated CDR (call data record) CSV files
The HDFS client command line scripts for creating the above directories are given below
hadoop fs -mkdir /churn
hadoop fs -mkdir /churn/customers
hadoop fs -mkdir /churn/log

Step 2: Load data files into HDFS directories

Data can be loaded into HDFS through the command line HDFS client or through user interfaces such as Hue.
The CDR data is best loaded into HDFS from the command line, using the command below:
hadoop fs -copyFromLocal /media/sf_Datasets/telecom/bcs/batch/*.* /churn/log
The above command loads all the files in the /media/sf_Datasets/telecom/bcs/batch folder into the /churn/log HDFS folder. A total of 1,840,218 rows of call data records is contained within the CSV files.
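For completeness, the same directory creation and upload could also be scripted against the HDFS Java API. The sketch below is an illustration only; the NameNode address is an assumption and error handling is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: create the /churn directories and copy the generated CDR files into HDFS.
public final class ChurnLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // assumed NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            fs.mkdirs(new Path("/churn/customers"));
            fs.mkdirs(new Path("/churn/log"));
            // roughly equivalent to the hadoop fs -copyFromLocal command above
            fs.copyFromLocalFile(new Path("/media/sf_Datasets/telecom/bcs/batch"),
                                 new Path("/churn/log"));
        }
    }
}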
The screenshot below shows the uploaded CDR files in HDFS using the Hue web interface.


Step 3: Create Hive tables from uploaded data

The scripts below show the Hive query language commands to create the database and associated table for the uploaded CDR data.
--create database
create database churndb location '/user/hadoop/churndb';

--create service log table
create external table churndb.servicelog (
callBegin timestamp,callEnd timestamp,callPlan string,callType string,
customerId string,customerPhoneNumber string,
dataKiloBytes float,dataPlan string,dataRate float,
inPhoneNumber string,outPhoneNumber string,ratePerSec float,
serviceDescription string,serviceType string,smsRate float,smsText string)
row format delimited fields terminated by '"' stored as textfile location '/churn/log'
tblproperties ("skip.header.line.count"="1");



The screenshot below shows the scripts being executed using the Hive Query Editor in Hue.


The next screenshot shows the CDR data for the "Elite" call plan, for the year 2014, month January.


Workflow for creating use cases

The use cases in this chapter will be created using Microsoft Power BI Desktop. The workflow is outlined below
Add an ODBC datasource to a new report in Power BI; this connects to the previously configured Hive ODBC DSN, from which one or more Hive tables are selected. The required tables in this use case are the 1) customers and 2) servicelog tables from the churndb database
Transform the imported tables by adding new columns, changing column data types etc.
Create the dashboards by dragging visual chart elements from the toolbox and adding fields to them as needed

Step 1: Add ODBC datasource

The screenshot below shows the blank report that the ODBC DSN will be added to. Next are the steps involved in adding the Hive tables from the ODBC DSN.


The DSN is specified by typing dsn=churndb into the connection string box.



The customers and servicelog tables are selected from the left panel. Click on edit to import the tables and transform the imported data.
Step 2: Transform the imported data

The screenshot below shows the Power BI Query Editor after importing the two tables.

Power BI also manages the relationship between the two imported tables; the relationship can be detected automatically or specified manually, as shown in the screenshot below:



The following changes need to be made to the servicelog table:
Change the data types of some columns
Add new columns derived from existing ones
Because of the number of steps involved, the full transformation code for the servicelog table is reproduced below; it can be pasted into the Advanced Editor as shown below:
let
Source = Odbc.DataSource("dsn=churndb"),
servicelog_churndb_HIVE = Source{[Item="servicelog",Schema="churndb",Catalog="HIVE"]}[Data],
#"Inserted Year" = Table.AddColumn(servicelog_churndb_HIVE, "Year", each Date.Year([callbegin]), type number),
#"Renamed Columns" = Table.RenameColumns(#"Inserted Year",{{"Year", "year"}}),
#"Inserted Month" = Table.AddColumn(#"Renamed Columns", "Month", each Date.Month([callbegin]), type number),
#"Renamed Columns1" = Table.RenameColumns(#"Inserted Month",{{"Month", "month"}}),
#"Added Custom" = Table.AddColumn(#"Renamed Columns1", "datacost", each ([datakilobytes]*0.001)*([datarate]/100)),
#"Inserted Hour" = Table.AddColumn(#"Added Custom", "Hour", each Time.Hour([callbegin]), type number),
#"Renamed Columns2" = Table.RenameColumns(#"Inserted Hour",{{"Hour", "callbegin_hour"}}),
#"Inserted Time" = Table.AddColumn(#"Renamed Columns2", "Time", each DateTime.Time([callbegin]), type time),
#"Removed Columns" = Table.RemoveColumns(#"Inserted Time",{"callbegin_hour"}),
#"Renamed Columns3" = Table.RenameColumns(#"Removed Columns",{{"Time", "callbegin_time"}}),
#"Inserted Time Subtraction" = Table.AddColumn(#"Renamed Columns3", "TimeDifference", each [callend] - [callbegin], type duration),
#"Inserted Total Days" = Table.AddColumn(#"Inserted Time Subtraction", "DurationTotalDays", each Duration.TotalDays([TimeDifference]), type number),
#"Inserted Total Hours" = Table.AddColumn(#"Inserted Total Days", "DurationTotalHours", each Duration.TotalHours([TimeDifference]), type number),
#"Inserted Total Minutes" = Table.AddColumn(#"Inserted Total Hours", "DurationTotalMinutes", each Duration.TotalMinutes([TimeDifference]), type number),
#"Inserted Total Seconds" = Table.AddColumn(#"Inserted Total Minutes", "DurationTotalSeconds", each Duration.TotalSeconds([TimeDifference]), type number),
#"Added Custom1" = Table.AddColumn(#"Inserted Total Seconds", "callcost", each [DurationTotalSeconds]*([ratepersec]/100)),
#"Inserted Text Length" = Table.AddColumn(#"Added Custom1", "Length", each Text.Length([smstext]), type number),
#"Renamed Columns4" = Table.RenameColumns(#"Inserted Text Length",{{"Length", "smslenght"}}),
#"Added Custom2" = Table.AddColumn(#"Renamed Columns4", "smscount", each [smslenght]/160),
#"Inserted Round Up" = Table.AddColumn(#"Added Custom2", "RoundUp", each Number.RoundUp([smscount]), Int64.Type),
#"Renamed Columns5" = Table.RenameColumns(#"Inserted Round Up",{{"RoundUp", "totalsms"}}),
#"Added Custom3" = Table.AddColumn(#"Renamed Columns5", "smscost", each [totalsms]*([smsrate]/100)),
#"Changed Type" = Table.TransformColumnTypes(#"Added Custom3",{{"smscost", type number}}),
#"Filtered Rows" = Table.SelectRows(#"Changed Type", each true)
in
#"Filtered Rows"


When done, the transformed table looks like the one in the screenshot below.

After the transformation, a number of new columns have been added to the servicelog table. Some are extracted directly from existing columns, while others are calculated and then used to derive further columns. The key calculated columns are summarized below:
Column | Data type | Calculation
year | Whole Number | Extracted from the callbegin column
month | Whole Number | Extracted from the callbegin column
callcost | Decimal number | DurationTotalSeconds * (ratepersec / 100)
datacost | Decimal number | (datakilobytes * 0.001) * (datarate / 100)
smscost | Decimal number | totalsms * (smsrate / 100)
DurationTotalDays | Decimal number | Duration.TotalDays(callend - callbegin)
DurationTotalHours | Decimal number | Duration.TotalHours(callend - callbegin)
DurationTotalMinutes | Decimal number | Duration.TotalMinutes(callend - callbegin)
DurationTotalSeconds | Decimal number | Duration.TotalSeconds(callend - callbegin)
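For clarity, the cost formulas above can be restated as a small Java sketch. The class and method names are hypothetical, and the sample values in main are illustrative only.

// Sketch: the per-CDR cost calculations mirrored from the Power Query steps above.
// Rates are expressed in cents, so dividing by 100 converts them to dollars.
public final class CdrCosts {

    static double callCost(double durationTotalSeconds, double ratePerSec) {
        return durationTotalSeconds * (ratePerSec / 100.0);
    }

    static double smsCost(String smsText, double smsRate) {
        int totalSms = (int) Math.ceil(smsText.length() / 160.0);   // 160 characters per SMS
        return totalSms * (smsRate / 100.0);
    }

    static double dataCost(double dataKiloBytes, double dataRate) {
        return (dataKiloBytes * 0.001) * (dataRate / 100.0);
    }

    public static void main(String[] args) {
        System.out.println(callCost(120, 15));               // a 2-minute outgoing call at 15 cents/second
        System.out.println(smsCost("Hello from Acme", 30));  // one outgoing SMS at 30 cents/160 characters
        System.out.println(dataCost(2048, 12));              // about 2 MB of 3G data at the 12-cent rate
    }
}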






















Use Case 1: Services Revenue Tracking Dashboard

The screenshot below shows the Services Revenue Tracking Dashboard, with the visualizations marked.

Section | Description | Columns/Fields
A | Slicer visualization for filtering by customer location/state | state column from the customers table
B | Slicer visualization for filtering by year | year column from the servicelog table
C | Line and clustered column chart displaying callcost, datacost and smscost by year | year from the servicelog table on Shared Axis; callcost, datacost and smscost in Column Values
D | Clustered Column chart displaying callcost by year and callplan | year column on Axis, callplan on Legend and callcost on Value
E | Clustered Bar chart displaying datacost by year and dataplan | year column on Axis, dataplan on Legend and datacost on Value
F | Bar Chart displaying smscost by year and calltype | year column on Axis, callplan on Legend and smscost on Value








Dashboard usage & Testing

The screenshot below shows the revenue for location California, year 2014 across Voice, Text and Data services.

The screenshot below shows the revenue for locations California, Arizona and Arkansas for the years 2010, 2011, 2012 across Voice, Text and Data services.

Use Case 2: Call Plan Revenue Tracking Dashboard

The screenshot below shows the Call Plan Revenue Tracking Dashboard, with the visualizations marked.

Section | Description | Columns/Fields
A | Slicer visualization for filtering by year | year column from the servicelog table
B | Slicer visualization for filtering by month | month column from the servicelog table
C | Slicer visualization for filtering by customer location/state | state column from the customers table
D | Matrix table displaying callcost by year and calltype across callplans | calltype, servicetype and year on Rows; servicedescription and callplan on Columns; callcost on Values
E | Bar chart displaying callcost by year and callplan | year column on Axis, month on Legend and callcost on Value
F | Funnel Chart displaying customers by callplan | callplan on Group, customerid count on Values
G | Bar chart showing DurationTotalMinutes by year and callplan | year column on Axis, callplan on Legend and DurationTotalMinutes on Value



Dashboard usage & Testing

The screenshot below shows the call revenue for all locations, year 2013, for the first 4 months.


The screenshot below shows the call revenue for location Alabama, years 2012, 2013 and 2014, for all 12 months.

Use Case 3: Data Plan Revenue Tracking Dashboard

The screenshot below shows the Data Plan Revenue Tracking Dashboard, with the visualizations marked.

Section | Description | Columns/Fields
A | Slicer visualization for filtering by customer location/state | state column from the customers table
B | Slicer visualization for filtering by year | year column from the servicelog table
C | Slicer visualization for filtering by month | month column from the servicelog table
D | Funnel Chart displaying datacost by servicetype | servicetype on Group, datacost on Values
E | Line and Stacked Column chart displaying datacost by dataplan and servicetype | dataplan column on Shared Axis, servicetype on Column Series and datacost on Value
F | Bar Chart displaying datacost by year and month | year on Axis, month on Legend, datacost on Values
G | Cards showing total customer base, total kilobytes used by customers, DurationTotalHours and DurationTotalMinutes





Dashboard usage & Testing

The screenshot below shows the data revenue for location Kansas, year 2014, for all 12 months.














The screenshot below shows the call revenue for locations Illinois, Indiana, Iowa and Kansas, years 2012 and 2013, for the first 3 months.


Use Case Source Code & Project Data

The source code for the CDR generator project is available on GitHub at https://github.com/ozzioma/cdrgenerator. A copy is also included with the accompanying disk.
The Power BI Desktop file for this use case (named Telecom Dashboard.pbix) is included in the accompanying disk, located in the "Telecom use case" folder.
The SQL file for creating the Hive schema (telecom schema.sql) is also included in the folder
The generated CDR data is also included in the use case folder.


Chapter 7: Data Integration Use Case

Use Case Overview

Acme Trading LLC manages assets for several institutional clients. A key factor in Acme Trading's success in asset management is accurate and well-tested algorithms for trading securities. Acme Trading collaborates with different partners to develop highly complex algorithms for trading securities in different markets. Among these partners is ABC Insite LLC, a software company specializing in algorithm development for high-frequency trading.
In this use case, a scenario is demonstrated where analysts at ABC Insite use Microsoft Excel to access a REST service exposed by Acme Trading LLC. ABC Insite wants to be able to create its own Annual Growth Dashboard for monitoring yearly returns of securities. This is a variation of the Investment Analysis dashboard created in Chapter 5.
REST Services overview

Representational State Transfer (REST) is an architectural style for designing distributed network applications. Applications that adhere to the principles underlying REST are said to be RESTful. These principles, or characteristics, of RESTful services include the following (Varanasi & Belida, 2015):
Separation of concerns between client and server
Communication between client and server should be stateless
Multiple hierarchical layers can exist between the client and server
Responses from the server can be declared as cacheable or non-cacheable
All interactions between the client and server are based on uniform interfaces
Note that no security features have been implemented in this use case, because the objective is primarily to demonstrate the ability to expose data stored in big data stores such as HDFS to end users. However, several authentication and authorization mechanisms exist for securing REST services, and these should be thoroughly tested and implemented before deploying web services internally or externally, to guard against breaches of privacy and unauthorized access to sensitive information.
These authentication mechanisms include:
OAuth
Digital certificates




REST Service Development tools

The requirements for building the REST service are as follows:
Java 7
Apache Hive JDBC data connectors
Spring framework 4.1.6

REST Service creation workflow

The following workflow is followed in creating the REST service:
Define the data entities/resources to be exposed
Define the endpoints and URLs for the exposed resources
Define the request methods for accessing the exposed resources
Create the Spring web application structure
Step 1: Define resources to be exposed

The resources to be exposed are the yearlyreturns, prices and dividends tables from the Financial Markets use case. The data exposed is for read-only purposes and cannot be modified.
Step 2 & 3: Define resource endpoints and HTTP request methods

Resource | URI | Service class | HTTP Request Method | Description
prices | /prices/{symbol} | PriceService | GET | Return all historical prices for the ticker symbol
prices | /prices/{symbol}/{year} | PriceService | GET | Return all historical prices for the ticker symbol in the specified year
prices | /prices/{symbol}/{year}/{month} | PriceService | GET | Return all historical prices for the ticker symbol in the specified year and month
yearlyreturns | /prices/yearly/{symbol} | PriceService | GET | Return all summarized yearly opening prices for the ticker symbol
dividends | /payments/{symbol} | DividendService | GET | Return all historical dividend figures for the ticker symbol
dividends | /payments/{symbol}/{year} | DividendService | GET | Return all historical dividend figures for the ticker symbol by year

Step 4: Create Spring web application structure

The listing below configures the exposed web services to use Spring's ContentNegotiationManager for content negotiation. This automatically enables serialization of web service responses to HTML, XML and JSON based on either the request headers or the URL path extension.
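Purely as an illustration, and not the project's actual configuration file, a roughly equivalent Java-based Spring MVC configuration for the same content negotiation behaviour might look like the sketch below (the class name WebConfig is hypothetical):

import org.springframework.context.annotation.Configuration;
import org.springframework.http.MediaType;
import org.springframework.web.servlet.config.annotation.ContentNegotiationConfigurer;
import org.springframework.web.servlet.config.annotation.EnableWebMvc;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurerAdapter;

// Hypothetical Java-config equivalent of the XML web configuration:
// responses are serialized to JSON or XML based on the URL extension or the Accept header.
@Configuration
@EnableWebMvc
public class WebConfig extends WebMvcConfigurerAdapter {

    @Override
    public void configureContentNegotiation(ContentNegotiationConfigurer configurer) {
        configurer.favorPathExtension(true)     // honour .json / .xml suffixes on the request URL
                  .ignoreAcceptHeader(false)    // also honour the Accept request header
                  .defaultContentType(MediaType.APPLICATION_JSON)
                  .mediaType("json", MediaType.APPLICATION_JSON)
                  .mediaType("xml", MediaType.APPLICATION_XML);
    }
}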
LISTING 7.1: Spring MVC Web config file










































LISTING 7.2: Connection string properties file

hive.driver=org.apache.hive.jdbc.HiveDriver
hive.url=jdbc:hive2://169.254.160.201:10000/stockdb
hive.username=hadoop
hive.password=hadoop

LISTING 7.3: Price ticker service

package com.ozzy.bcsproject.tickerservice;

//import com.wordnik.swagger.annotations.Api;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;
import org.springframework.web.bind.annotation.RestController;

import java.sql.SQLException;
import java.util.List;

/**
* Created by Ozzy on 007, Jul 07, 2015.
*/
@RestController
@RequestMapping("/prices")
public class PriceService
{

private final TickerRepository dataService;

@Autowired
public PriceService(TickerRepository dataRepo)
{
this.dataService = dataRepo;
}

@RequestMapping("/{symbol}")
@ResponseBody
public ResponseEntity getAllPricesForSymbol(@PathVariable("symbol") String symbol)
{
List prices = null;
TickerPrices response = null;

try
{
prices = dataService.getAllPrices(symbol);
response = new TickerPrices();
response.setPrices(prices);

} catch (SQLException e)
{
return new ResponseEntity(HttpStatus.INTERNAL_SERVER_ERROR);
}

if (!prices.isEmpty())
{
return new ResponseEntity(response, HttpStatus.OK);
}
return new ResponseEntity(HttpStatus.NOT_FOUND);

}


@RequestMapping("/{symbol}/{year}")
@ResponseBody
public ResponseEntity getPricesByYear(
@PathVariable("symbol") String symbol, @PathVariable("year") int year)
{
if (year < 1980 || year > 2015)
{
ErrorDetail error = new ErrorDetail();
error.setTitle("year " + year);
error.setErrorMessage("Validation failed");
error.setDetail("year must be between 1980 and 2015");

return new ResponseEntity(error, HttpStatus.INTERNAL_SERVER_ERROR);
}

List prices = null;
TickerPrices response = null;

try
{
prices = dataService.getPricesByYear(symbol, year);
response = new TickerPrices();
response.setPrices(prices);

} catch (SQLException e)
{
return new ResponseEntity(HttpStatus.INTERNAL_SERVER_ERROR);
}

if (!prices.isEmpty())
{
return new ResponseEntity(response, HttpStatus.OK);
}
return new ResponseEntity(HttpStatus.NOT_FOUND);
}

@RequestMapping("/{symbol}/{year}/{month}")
@ResponseBody
public ResponseEntity getPricesByMonth(
@PathVariable("symbol") String symbol, @PathVariable("year") int year, @PathVariable("month") int month)
{

if (year < 1980 || year > 2015)
{
ErrorDetail error = new ErrorDetail();
error.setTitle("year " + year);
error.setErrorMessage("Validation failed");
error.setDetail("year must be between 1980 and 2015");

return new ResponseEntity(error, HttpStatus.INTERNAL_SERVER_ERROR);
}

if (month < 1 || month > 12)
{
ErrorDetail error = new ErrorDetail();
error.setTitle("month " + month);
error.setErrorMessage("Validation failed");
error.setDetail("month must be between 1 and 12");

return new ResponseEntity(error, HttpStatus.INTERNAL_SERVER_ERROR);
}

List prices = null;
TickerPrices response = null;

try
{
prices = dataService.getPricesByMonth(symbol,year,month);
response = new TickerPrices();
response.setPrices(prices);

} catch (SQLException e)
{
return new ResponseEntity(HttpStatus.INTERNAL_SERVER_ERROR);
}

if (!prices.isEmpty())
{
return new ResponseEntity(response, HttpStatus.OK);
}
return new ResponseEntity(HttpStatus.NOT_FOUND);
}

@RequestMapping("/yearly/{symbol}")
@ResponseBody
public ResponseEntity getYearlyPricesForSymbol(@PathVariable("symbol") String symbol)
{
List prices = null;
YearlyPrices response = null;

try
{
prices = dataService.getYearlyPrices(symbol);
response = new YearlyPrices();
response.setPrices(prices);

} catch (SQLException e)
{
return new ResponseEntity(HttpStatus.INTERNAL_SERVER_ERROR);
}

if (!prices.isEmpty())
{
return new ResponseEntity(response, HttpStatus.OK);
}
return new ResponseEntity(HttpStatus.NOT_FOUND);

}

}


LISTING 7.4: Dividend ticker service

package com.ozzy.bcsproject.tickerservice;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;
import org.springframework.web.bind.annotation.RestController;

import java.sql.SQLException;
import java.util.List;

/**
* Created by Ozzy on 007, Jul 07, 2015.
*/
@RestController
@RequestMapping("/payments")
public class DividendService
{

private final TickerRepository dataService;

@Autowired
public DividendService(TickerRepository dataRepo)
{
this.dataService = dataRepo;
}

@RequestMapping("/{symbol}")
@ResponseBody
public ResponseEntity getAllPaymentsForSymbol(
@PathVariable("symbol") String symbol)
{
List payments = null;
DividendPayments response = null;

try
{
payments = dataService.getAllDividends(symbol);
response = new DividendPayments();
response.setPayments(payments);

} catch (SQLException e)
{
return new ResponseEntity(HttpStatus.INTERNAL_SERVER_ERROR);
}

if (!payments.isEmpty())
{
return new ResponseEntity(response, HttpStatus.OK);
}
return new ResponseEntity(HttpStatus.NOT_FOUND);

}


@RequestMapping("/{symbol}/{year}")
@ResponseBody
public ResponseEntity getPaymentsByYear(
@PathVariable("symbol") String symbol, @PathVariable("year") int year)
{
if (year < 1980 || year > 2015)
{
ErrorDetail error = new ErrorDetail();
error.setTitle("year " + year);
error.setErrorMessage("Validation failed");
error.setDetail("year must be between 1980 and 2015");

return new ResponseEntity(error, HttpStatus.INTERNAL_SERVER_ERROR);
}

List payments = null;
DividendPayments response = null;

try
{
payments = dataService.getDividendsByYear(symbol,year);
response = new DividendPayments();
response.setPayments(payments);

} catch (SQLException e)
{
return new ResponseEntity(HttpStatus.INTERNAL_SERVER_ERROR);
}

if (!payments.isEmpty())
{
return new ResponseEntity(response, HttpStatus.OK);
}
return new ResponseEntity(HttpStatus.NOT_FOUND);

}

}


REST Service Deployment

The REST service can be deployed in any Java EE 7 compliant servlet container. The servlet container used in this use case is Oracle GlassFish 4. Note that the JDBC driver used to connect to Hive comes from the Hive standalone JAR found in the Hive lib folder. A necessary addition is the hadoop-common JAR, which contains the core Hadoop libraries required by the JDBC driver connection.
The following screenshots show the REST service being deployed in the Glassfish admin console.




REST API usage

The REST service can be invoked from any appropriate client capable of issuing HTTP requests.
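As an illustration, a minimal Java client for the yearly prices endpoint might look like the sketch below. The host name and port mirror the deployment URL used later in this chapter, and JSON output is requested via the Accept header; the class name is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: invoke the yearly prices endpoint and print the JSON response.
public final class TickerClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://ozzypc:8083/tickerservice/prices/yearly/IBM");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");   // content negotiation via Accept header

        System.out.println("HTTP status: " + conn.getResponseCode());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}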
The screenshot below shows the response from issuing the request for Coca Cola's yearly opening price quotes in JSON format.


The next screenshot shows the response from issuing the request for IBM's yearly opening price quotes in XML format.


Using the REST API to build a dashboard in Excel

To conclude this use case, Microsoft Excel will be used to connect to the yearly prices web service and download data from the yearlyprices table. Specifically, the Annual Growth Rate Analysis dashboard will be created for IBM stock.
The steps are outlined below:
From the Power Query tab, add a datasource from the web as shown in the screenshot below; the web URL in this case is http://ozzypc:8083/tickerservice/prices/yearly/IBM.xml


The yearlyprice item is selected from the left pane; next, click Load as shown below:



Next, two new columns YearlyReturnFlag and Year are added as in the previous example as shown below:




Next, add a Power View sheet from the Home tab as shown below:


The next screenshot shows the completed Power View dashboard in Excel:



Use Case Source Code

The REST service source code for this chapter is available on GitHub at https://github.com/ozzioma/TickerService. A copy is also included in the disk.
The Microsoft Excel sheet containing the Power View dashboard (named cagr dashboard.xlsx) for this use case is included in the accompanying disk, located in the "Integration use case" folder.


Chapter 8: Conclusions and Recommendations

Conclusion

Big data technologies can function as data repositories for various business needs. These needs include but are not limited to:
Storage of historical/operational data for internal usage
Storage of transactional data for reporting and analysis
Integration with third-party systems used either internally or externally to the organization
Integration with partners/clients external to the organization

Recommendations

Enterprises grappling with rising data volumes from different sources can today take advantage of the latest Big Data technologies to achieve the following:
Store operational and transactional data cheaply
Process huge streams of data in real time
Utilize big data distributed computing tools to process and summarize data stored in Big Data stores
Achieve cost savings through the cost effectiveness of open source licensing



Works Cited
Durenard, E. A. (2013). Professional Automated Trading. Wiley.
Frampton, M. (2014). Big Data Made Easy (p. 3). Apress.
O'Reilly Media Inc. (2012). Big Data Now. O'Reilly Media.
Sawant, N., & Shah, H. (2014). Big Data Application Architecture Q&A (pp. 11-12). Apress.
Sitto, K., & Presser, M. (2015). Field Guide to Hadoop (p. 34). O'Reilly Media.
British Computer Society. (2014). Big Data Opportunities and Challenges (p. 1). BCS, The Chartered Institute for IT.
Mohanty, S., Jagadeesh, M., & Srivatsa, H. (2013). Big Data Imperatives (p. 10). Apress.
Varanasi, B., & Belida, S. (2015). Spring REST. Apress.
White, T. (2014). Hadoop: The Definitive Guide (4th ed., p. 6). O'Reilly Media.
Zaharia, M., et al. (2015). Learning Spark. O'Reilly Media.







