Big Data Analytics with Java Table of Contents Big Data Analytics with Java Credits About the Author About the Reviewers www.PacktPub.com eBooks, discount offers, and more Why subscribe? Customer Feedback Preface What this book covers What you need for this book Who this book is for Conventions Reader feedback Customer support Downloading the example code Downloading the color images of this book Errata Piracy Questions 1. Big Data Analytics with Java Why data analytics on big data? Big data for analytics Big data – a bigger pay package for Java developers Basics of Hadoop – a Java sub-project Distributed computing on Hadoop HDFS concepts Design and architecture of HDFS Main components of HDFS HDFS simple commands Apache Spark Concepts Transformations Actions Spark Java API Spark samples using Java 8 Loading data Data operations – cleansing and munging Analyzing data – count, projection, grouping, aggregation, and max/min Actions on RDDs Paired RDDs Transformations on paired RDDs Saving data Collecting and printing results Executing Spark programs on Hadoop Apache Spark sub-projects Spark machine learning modules MLlib Java API Other machine learning libraries Mahout – a popular Java ML library Deeplearning4j – a deep learning library Compressing data Avro and Parquet Summary 2. First Steps in Data Analysis Datasets Data cleaning and munging Basic analysis of data with Spark SQL Building SparkConf and context Dataframe and datasets Load and parse data Analyzing data – the Spark-SQL way Spark SQL for data exploration and analytics Market basket analysis – Apriori algorithm Full Apriori algorithm Implementation of the Apriori algorithm in Apache Spark Efficient market basket analysis using FP-Growth algorithm Running FP-Growth on Apache Spark Summary 3. Data Visualization Data visualization with Java JFreeChart Using charts in big data analytics Time Series chart All India seasonal and annual average temperature series dataset Simple single Time Series chart Multiple Time Series on a single chart window Bar charts Histograms When would you use a histogram? How to make histograms using JFreeChart? Line charts Scatter plots Box plots Advanced visualization technique Prefuse IVTK Graph toolkit Other libraries Summary 4. Basics of Machine Learning What is machine learning? Real-life examples of machine learning Type of machine learning A small sample case study of supervised and unsupervised learning Steps for machine learning problems Choosing the machine learning model What are the feature types that can be extracted from the datasets? How do you select the best features to train your models? How do you run machine learning analytics on big data? Getting and preparing data in Hadoop Preparing the data Formatting the data Storing the data Training and storing models on big data Apache Spark machine learning API The new Spark ML API Summary 5. Regression on Big Data Linear regression What is simple linear regression? Where is linear regression used? Predicting house prices using linear regression Dataset Data cleaning and munging Exploring the dataset Running and testing the linear regression model Logistic regression Which mathematical functions does logistic regression use? Where is logistic regression used? Predicting heart disease using logistic regression Dataset Data cleaning and munging Data exploration Running and testing the logistic regression model Summary 6. 
Naive Bayes and Sentiment Analysis Conditional probability Bayes theorem Naive Bayes algorithm Advantages of Naive Bayes Disadvantages of Naive Bayes Sentimental analysis Concepts for sentimental analysis Tokenization Stop words removal Stemming N-grams Term presence and Term Frequency TF-IDF Bag of words Dataset Data exploration of text data Sentimental analysis on this dataset SVM or Support Vector Machine Summary 7. Decision Trees What is a decision tree? Building a decision tree Choosing the best features for splitting the datasets Advantages of using decision trees Disadvantages of using decision trees Dataset Data exploration Cleaning and munging the data Training and testing the model Summary 8. Ensembling on Big Data Ensembling Types of ensembling Bagging Boosting Advantages and disadvantages of ensembling Random forests Gradient boosted trees (GBTs) Classification problem and dataset used Data exploration Training and testing our random forest model Training and testing our gradient boosted tree model Summary 9. Recommendation Systems Recommendation systems and their types Content-based recommendation systems Dataset Content-based recommender on MovieLens dataset Collaborative recommendation systems Advantages Disadvantages Alternating least square – collaborative filtering Summary 10. Clustering and Customer Segmentation on Big Data Clustering Types of clustering Hierarchical clustering K-means clustering Bisecting k-means clustering Customer segmentation Dataset Data exploration Clustering for customer segmentation Changing the clustering algorithm Summary 11. Massive Graphs on Big Data Refresher on graphs Representing graphs Common terminology on graphs Common algorithms on graphs Plotting graphs Massive graphs on big data Graph analytics GraphFrames Building a graph using GraphFrames Graph analytics on airports and their flights Datasets Graph analytics on flights data Summary 12. Real-Time Analytics on Big Data Real-time analytics Big data stack for real-time analytics Real-time SQL queries on big data Real-time data ingestion and storage Real-time data processing Real-time SQL queries using Impala Flight delay analysis using Impala Apache Kafka Spark Streaming Typical uses of Spark Streaming Base project setup Trending videos Sentiment analysis in real time Summary 13. Deep Learning Using Big Data Introduction to neural networks Perceptron Problems with perceptrons Sigmoid neuron Multi-layer perceptrons Accuracy of multi-layer perceptrons Deep learning Advantages and use cases of deep learning Flower species classification using multi-Layer perceptrons Deeplearning4j Hand written digit recognizition using CNN Diving into the code: More information on deep learning Summary Index Big Data Analytics with Java Big Data Analytics with Java Copyright © 2017 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. 
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: July 2017 Production reference: 1270717 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78728-898-0 www.packtpub.com Credits Author Rajat Mehta Reviewers Dave Wentzel Roberto Casati Commissioning Editor Veena Pagare Acquisition Editor Chandan Kumar Content Development Editor Deepti Thore Technical Editors Jovita Alva Sneha Hanchate Copy Editors Safis Editing Laxmi Subramanian Project Coordinator Shweta H Birwatkar Proofreader Safis Editing Indexer Pratik Shirodkar Graphics Tania Dutta Production Coordinator Shantanu N. Zagade Cover Work Shantanu N. Zagade About the Author Rajat Mehta is a VP (technical architect) in technology at JP Morgan Chase in New York. He is a Sun certified Java developer and has worked on Java-related technologies for more than 16 years. His current role for the past few years heavily involves the use of a big data stack and running analytics on it. He is also a contributor to various open source projects that are available on his GitHub repository, and is also a frequent writer for dev magazines. About the Reviewers Dave Wentzel is the CTO of Capax Global, a data consultancy specializing in SQL Server, cloud, IoT, data science, and Hadoop technologies. Dave helps customers with data modernization projects. For years, Dave worked at big independent software vendors, dealing with the scalability limitations of traditional relational databases. With the advent of Hadoop and big data technologies everything changed. Things that were impossible to do with data were suddenly within reach. Before joining Capax, Dave worked at Microsoft, assisting customers with big data solutions on Azure. Success for Dave is solving challenging problems at companies he respects, with talented people who he admires. Roberto Casati is a certified enterprise architect working in the financial services market. Roberto lives in Milan, Italy, with his wife, their daughter, and a dog. In a former life, after graduating in engineering, he worked as a Java developer, Java architect, and presales architect for the most important telecommunications, travel, and financial services companies. His interests and passions include data science, artificial intelligence, technology, and food. www.PacktPub.com eBooks, discount offers, and more Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. https://www.packtpub.com/mapt Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career. Why subscribe? Fully searchable across every book published by Packt Copy and paste, print, and bookmark content On demand and accessible via a web browser Customer Feedback Thanks for purchasing this Packt book. 
At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book’s Amazon page at https://www.amazon.com/dp/1787288986. If you’d like to join our team of regular reviewers, you can e-mail us at [email protected]. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products! This book is dedicated to my mother Kanchan, my wife Harpreet, my daughter Meher, my father Ashwini and my son Vivaan. Preface Even as you read this content, there is a revolution happening behind the scenes in the field of big data. From every coffee that you pick up from a coffee store to everything you click or purchase online, almost every transaction, click, or choice of yours is getting analyzed. From this analysis, a lot of deductions are now being made to offer you new stuff and better choices according to your likes. These techniques and associated technologies are picking up so fast that as developers we all should be a part of this new wave in the field of software. This would allow us better prospects in our careers, as well as enhance our skill set to directly impact the business we work for. Earlier technologies such as machine learning and artificial intelligence used to sit in the labs of many PhD students. But with the rise of big data, these technologies have gone mainstream now. So, using these technologies, you can now predict which advertisement the user is going to click on next, or which product they would like to buy, or it can also show whether the image of a tumor is cancerous or not. The opportunities here are vast. Big data in itself consists of a whole lot of technologies whether cluster computing frameworks such as Apache Spark or Tez or distributed filesystems such as HDFS and Amazon S3 or real-time SQL on underlying data using Impala or Spark SQL. This book provides a lot of information on big data technologies, including machine learning, graph analytics, real-time analytics and an introductory chapter on deep learning as well. I have tried to cover both technical and conceptual aspects of these technologies. In doing so, I have used many real world case studies to depict how these technologies can be used in real life. So this book will teach you how to run a fast algorithm on the transactional data available on an e-commerce site to figure out which items sell together, or how to run a page rank algorithm on a flight dataset to figure out the most important airports in a country based on air traffic. There are many content gems like these in the book for readers. What this book covers Chapter 1, Big Data Analytics with Java, starts with providing an introduction to the core concepts of Hadoop and provides information on its key components. In easy-to-understand explanations, it shows how the components fit together and gives simple examples on the usage of the core components HDFS and Apache Spark. This chapter also talks about the different sources of data that can put their data inside Hadoop, their compression formats, and the systems that are used to analyze that data. Chapter 2, First Steps in Data Analysis, takes the first steps towards the field of analytics on big data. We start with a simple example covering basic statistical analytic steps, followed by two popular algorithms for building association rules using the Apriori Algorithm and the FP-Growth Algorithm. 
For all case studies, we have used realistic examples of an online e-commerce store to give insights to users as to how these algorithms can be used in the real world. Chapter 3, Data Visualization, helps you to understand what different types of charts there are for data analysis, how to use them, and why. With this understanding, we can make better decisions when exploring our data. This chapter also contains lots of code samples to show the different types of charts built using Apache Spark and the JFreeChart library. Chapter 4, Basics of Machine Learning, helps you to understand the basic theoretical concepts behind machine learning, such as what exactly is machine learning, how it is used, examples of its use in real life, and the different forms of machine learning. If you are new to the field of machine learning, or want to brush up your existing knowledge on it, this chapter is for you. Here I will also show how, as a developer, you should approach a machine learning problem, including topics on feature extraction, feature selection, model testing, model selection, and more. Chapter 5, Regression on Big Data, explains how you can use linear regression to predict continuous values and how you can do binary classification using logistic regression. A real-world case study of house price evaluation based on the different features of the house is used to explain the concepts of linear regression. To explain the key concepts of logistic regression, a real-life case study of detecting heart disease in a patient based on different features is used. Chapter 6, Naive Bayes and Sentimental Analysis, explains a probabilistic machine learning model called Naive Bayes and also briefly explains another popular model called the support vector machine. The chapter starts with basic concepts such as Bayes Theorem and then explains how these concepts are used in Naive Bayes. I then use the model to predict the sentiment whether positive or negative in a set of tweets from Twitter. The same case study is then re-run using the support vector machine model. Chapter 7, Decision Trees, explains that decision trees are like flowcharts and can be programmatically built using concepts such as Entropy or Gini Impurity. The golden egg in this chapter is a case study that shows how we can predict whether a person's loan application will be approved or not using decision trees. Chapter 8, Ensembling on Big Data, explains how ensembling plays a major role in improving the performance of the predictive results. I cover different concepts related to ensembling in this chapter, including techniques such as how multiple models can be joined together using bagging or boosting thereby enhancing the predictive outputs. We also cover the highly popular and accurate ensemble of models, random forests and gradient-boosted trees. Finally, we predict loan default by users in a dataset of a real-world Lending Club (a real online lending company) using these models. Chapter 9, Recommendation Systems, covers the particular concept that has made machine learning so popular and it directly impacts business as well. In this chapter, we show what recommendation systems are, what they can do, and how they are built using machine learning. We cover both types of recommendation systems: content-based and collaborative, and also cover their good and bad points. Finally, we cover two case studies using the MovieLens dataset to show recommendations to users for movies that they might like to see. 
Chapter 10, Clustering and Customer Segmentation on Big Data, speaks about clustering and how it can be used by a real-world e-commerce store to segment their customers based on how valuable they are. I have covered both k-Means clustering and bisecting k-Means clustering, and used both of them in the corresponding case study on customer segmentation. Chapter 11, Massive Graphs on Big Data, covers an interesting topic, graph analytics. We start with a refresher on graphs, with basic concepts, and later go on to explore the different forms of analytics that can be run on the graphs, whether path-based analytics involving algorithms such as breadth-first search, or connectivity analytics involving degrees of connection. A real-world flight dataset is then used to explore the different forms of graph analytics, showing analytical concepts such as finding top airports using the page rank algorithm. Chapter 12, Real-Time Analytics on Big Data, speaks about real-time analytics by first seeing a few examples of real-time analytics in the real world. We also learn about the products that are used to build real-time analytics system on top of big data. We particularly cover the concepts of Impala, Spark Streaming, and Apache Kafka. Finally, we cover two real-life case studies on how we can build trending videos from data that is generated in real-time, and also do sentiment analysis on tweets by depicting a Twitter-like scenario using Apache Kafka and Spark Streaming. Chapter 13, Deep Learning Using Big Data, speaks about the wide range of applications that deep learning has in real life whether it's self-driving cars, disease detection, or speech recognition software. We start with the very basics of what a biological neural network is and how it is mimicked in an artificial neural network. We also cover a lot of the theory behind artificial neurons and finally cover a simple case study of flower species detection using a multi-layer perceptron. We conclude the chapter with a brief introduction to the Deeplearning4j library and also cover a case study on handwritten digit classification using convolution neural networks. What you need for this book There are a few things you will require to follow the examples in this book: a text editor (I use Sublime Text), internet access, admin rights to your machine to install applications and download sample code, and an IDE (I use Eclipse and IntelliJ). You will also need other software such as Java, Maven, Apache Spark, Spark modules, the GraphFrames library, and the JFreeChart library. We mention the required software in the respective chapters. You also need a good computer with a good RAM size, or you can also run the samples on Amazon AWS. Who this book is for If you already know some Java and understand the principles of big data, this book is for you. This book can be used by a developer who has mostly worked on web programming or any other field to switch into the world of analytics using machine learning on big data. A good understanding of Java and SQL is required. Some understanding of technologies such as Apache Spark, basic graphs, and messaging will also be beneficial. Conventions In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning. 
A block of code is set as follows:
Dataset<Row> rowDS = spark.read().csv("data/loan_train.csv");
rowDS.createOrReplaceTempView("loans");
Dataset<Row> loanAmtDS = spark.sql("select _c6 from loans");
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
Dataset<Row> data = spark.read().csv("data/heart_disease_data.csv");
System.out.println("Number of Rows -->" + data.count());
Note Warnings or important notes appear in a box like this. Tip Tips and tricks appear like this. Reader feedback Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of. To send us general feedback, simply send an e-mail to , and mention the book title via the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors. If you have any questions, don't hesitate to look me up on LinkedIn via my profile at https://www.linkedin.com/in/rajatm/; I will be more than glad to help a fellow software professional. Customer support Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase. Downloading the example code You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps: 1. Log in or register to our website using your e-mail address and password. 2. Hover the mouse pointer on the SUPPORT tab at the top. 3. Click on Code Downloads & Errata. 4. Enter the name of the book in the Search box. 5. Select the book for which you're looking to download the code files. 6. Choose from the drop-down menu where you purchased this book from. 7. Click on Code Download. You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account. Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of: WinRAR / 7-Zip for Windows Zipeg / iZip / UnRarX for Mac 7-Zip / PeaZip for Linux The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Big-Data-Analytics-with-Java. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out! Downloading the color images of this book We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from www.packtpub.com/sites/default/files/downloads/BigDataAnalyticswithJava_ColorImages.p Errata Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. 
If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support. Piracy Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at with a link to the suspected pirated material. We appreciate your help in protecting our authors, and our ability to bring you valuable content. Questions You can contact us at if you are having a problem with any aspect of the book, and we will do our best to address it. Chapter 1. Big Data Analytics with Java Big data is no more just a buzz word. In almost all the industries, whether it is healthcare, finance, insurance, and so on, it is heavily used these days. There was a time when all the data that was used in an organization was what was present in their relational databases. All the other kinds of data, for example, data present in the log files were all usually discarded. This discarded data could be extremely useful though, as it can contain information that can help to do different forms of analysis, for example, log files data can tell about patterns of user interaction with a particular website. Big data helps store all these kinds of data, whether structured or unstructured. Thus, all the log files, videos, and so on can be stored in big data storage. Since almost everything can be dumped into big data whether they are log files or data collected via sensors or mobile phones, the amount of data usage has exploded within the last few years. Three Vs define big data and they are volume, variety and velocity. As the name suggests, big data is a huge amount of data that can run into terabytes if not peta bytes of volume of storage. In fact, the size is so humongous that ordinary relational databases are not capable of handling such large volumes of data. Apart from data size, big data can be of any type of data be it the pictures that you took in the 20 years or the spatial data that a satellite sends, which can be of any type, be it text or in the form of images. Any type of data can be dumped into the big data storage and analyzed. Since the data is so huge it cannot fit on a single machine and hence it is stored on a group of machines. Many programs can be run in parallel on these machines and hence the speed or velocity of computation on big data. As the quantity of this data is very high, very insightful deductions can now be made from the data. 
Some of the use cases where big data is used are: In the case of an e-commerce store, based on a user's purchase history and likes, new set of products can be recommended to the users, thereby increasing the sales of the site Customers can be segmented into different groups for an e-commerce site and can then be presented with different marketing strategies On any site, customers can be presented with ads they might be most likely to click on Any regular ETL-like work (for example, as in finance or healthcare, and so on.) can be easily loaded into the big data stack and computed in parallel on several machines Trending videos, products, music, and so on that you see on various sites are all built using analytics on big data Up until few years back, big data was mostly batch. Therefore, any analytics job that was run on big data was run in a batch mode usually using MapReduce programs, and the job would run for hours if not for days and would then compute the output. With the creation of the cluster computing framework, Apache Spark, a lot of these batch computations that took lot of time earlier have tremendously improved now. Big data is not just Apache Spark. It is an ecosystem of various products such as Hive, Apache Spark, HDFS, and so on. We will cover these in the upcoming sections. This book is dedicated to analytics on big data using Java. In this book, we will be covering various techniques and algorithms that can be used to analyze our big data. In this chapter, we will cover: General details about what big data is all about An overview of the big data stack—Hadoop, HDFS, Apache Spark We will cover some simple HDFS commands and their usage We will provide an introduction to the core Spark API of RDDs using a few examples of its actions and transformations using Java We will also cover a general introduction on Spark packages such as MLlib, and compare them with other libraries such as Apache Mahout Finally, we will give a general description of data compression formats such as Avro and Parquet that are used in the big data world Why data analytics on big data? Relational databases are suitable for real-time crud operations such as order capture in e-commerce stores but they are not suitable for certain use cases for which big data is used. The data that is stored in relational databases is structured only but in big data stack (read Hadoop) both structured and unstructured data can be stored. Apart from this, the quantity of data that can be stored and parallelly processed in big data is massive. Facebook stores close to a tera byte of data in its big data stack on a daily basis. Thus, mostly in places where we need real-time crud operations on data, we can still continue to use relational databases, but in other places where we need to store and analyze almost any kind of data (whether log files, video files, web access logs, images, and so on.), we should use Hadoop (that is, big data). Since analytics run on Hadoop, it runs on top of massive amounts of data; it is thereby a no brainer that deductions made from this are way more different than can be made from small amounts of data. As we all know, analytic results from large data amounts beat any fancy algorithm results. Also you can run all kinds of analytics on this data whether it be stream processing, predictive analytics, or real-time analytics. The data on top of Hadoop is parallelly processed on multiple nodes. Hence the processing is very fast and the results are parallelly computed and combined. 
Big data for analytics Let's take a look at the following diagram to see what kinds of data can be stored in big data: As you can see, data from varied sources and of varied kinds can be dumped into Hadoop and later analyzed. As seen in the preceding image, there could be many existing applications that could serve as sources of data, whether providing CRM data, log data, or any other kind of data (for example, orders generated online or the audit history of purchase orders from existing web order entry applications). Also as seen in the image, data can be collected from social media or the web logs of HTTP servers like Apache, from any internal source like sensors deployed in a house or in the office, or from external sources like customers' mobile devices, messaging applications such as messengers, and so on. Big data – a bigger pay package for Java developers Java is a natural fit for big data. All the big data tools support Java. In fact, some of the core modules are written in Java only; for example, Hadoop is written in Java. Learning some of the big data tools is no different than learning a new API for Java developers. So, adding big data skills to their skillset is a healthy addition for all Java developers. Python and the R language are currently hot in the field of data science, mainly because of their ease of use and the availability of great libraries such as scikit-learn. Java, on the other hand, has picked up greatly due to big data. On the big data side, there is good software available on the Java stack that can be readily used for applying regular analytics or predictive analytics using machine learning libraries. Learning a combination of big data and analytics on big data would get you closer to apps that make a real impact on business, and hence they command a good pay too. Basics of Hadoop – a Java sub-project Hadoop is a free, Java-based programming framework that supports the processing of large datasets in a distributed computing environment. It is part of the Apache Software Foundation and was donated by Yahoo! It can be easily installed on a cluster of standard machines. Different computing jobs can then be parallelly run on these machines for faster performance. Hadoop has become very successful in companies to store all of their massive data in one system and perform analysis on this data. Hadoop runs in a master/slave architecture. The master controls the running of the entire distributed computing stack. Some of the main features of Hadoop are:
Failover support: If one or more slave machines go down, the task is transferred to another workable machine by the master.
Horizontal scalability: Just by adding a new machine, it comes within the network of the Hadoop framework and becomes part of the Hadoop ecosystem.
Lower cost: Hadoop runs on cheap commodity hardware and is much cheaper than the costly large data solutions of other companies. For example, some bigger firms have large data warehouse implementations such as Oracle Exadata or Teradata. These also let you store and analyze huge amounts of data, but their hardware and software are both expensive and require more maintenance. Hadoop, on the other hand, installs on commodity hardware and its software is open source.
Data locality: This is one of the most important features of Hadoop and is the reason why Hadoop is so fast. Any processing of large data is done on the same machine on which the data resides. This way, there is no time and bandwidth lost in transferring the data. 
There is an entire ecosystem of software that is built around Hadoop. Take a look at the following diagram to visualize the Hadoop ecosystem: As you can see in the preceding diagram, for different criteria we have a different set of products. The main categories of products in the big data space are shown as follows:
Analytical products: The whole purpose of big data usage is the ability to analyze and make use of this extensive data. For example, if you have clickstream data lying in the HDFS storage of big data and you want to find out the users with maximum hits or the users who made the most purchases, or based on the transaction history of users you want to figure out the best recommendations for your users, there are some popular products that help us to analyze this data to figure out these details. Some of these popular products are Apache Spark and Impala. These products are sophisticated enough to extract data from the distributed machines of big data storage and to transform and manipulate it to make it useful.
Batch products: In the initial stages, when it first came into the picture, the word "big data" was synonymous with batch processing. So you had jobs that ran on this massive data for hours and hours, cleaning and extracting the data to build useful reports for the users. As such, the initial set of products that shipped with Hadoop itself included "MapReduce", which is a parallel computing batch framework. Over time, more sophisticated products appeared, such as Apache Spark, which is also a cluster computing framework but is comparatively faster than MapReduce; even so, these are in actuality still batch products.
Streaming products: This category helps to fill the void of pulling and manipulating real-time data in the Hadoop space. So we have a set of products that can connect to sources of streaming data and act on it in real time. Using these kinds of products you can build things like trending videos on YouTube or trending hashtags on Twitter at this point in time. Some popular products in this space are Apache Spark (using the Spark Streaming module) and Apache Storm. We will be covering the Apache Spark streaming module in our chapter on real-time analytics.
Machine learning libraries: In the last few years there has been tremendous work in the predictive analytics space. Predictive analytics involves the usage of advanced machine learning libraries, and it's no wonder that some of these libraries are now included with the cluster computing frameworks as well. So a popular machine learning library such as Spark ML ships along with Apache Spark, and older libraries such as Apache Mahout are also supported on big data. This is a growing space with new libraries frequently entering the market.
NoSQL: There are times when we need frequent reads and updates of data even though big data is involved. Under these situations there are a lot of non-SQL products that can be readily used while analyzing your data, and some of the popular ones are Cassandra and HBase, both of which are open source.
Search: Quite often big data is in the form of plain text. There are many use cases where you would like to index certain words in the text to make them easily searchable. For example, if you are putting all the newspapers of a particular branch published for the last few years in HDFS in PDF format, you might want a proper index to be made over these documents so that they are readily searchable. 
There are products in the market that were previously used extensively for building search engines and they can now be integrated with big data as well. One of the popular and open source options is SOLR, and it can be easily established on top of big data to make the content easily searchable. The categories of products we have just depicted are not exhaustive. We have not covered messaging solutions, and there are many other products too apart from this. For an extensive list, refer to a book that specifically covers Hadoop in detail, for example, Hadoop: The Definitive Guide. We have covered the main categories of products, but let's now cover some of the important products themselves that are built on top of the big data stack:
HDFS: HDFS is a distributed filesystem that provides high-performance access to data across Hadoop clusters.
Spark: The Spark cluster computing framework is used for various purposes such as analytics, stream processing, machine learning analytics, and so on, as shown in the preceding diagram.
Impala: This lets you fire queries in real time on big data; it is used by data scientists and business analysts.
MapReduce: MapReduce is a programming model and an associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster.
Sqoop: This helps to pull data from structured databases such as Oracle and push the data into Hadoop or HDFS.
Oozie: This is a job scheduler for scheduling Hadoop jobs.
Flume: This is a tool to pull large amounts of streaming data into Hadoop/HDFS.
Kafka: Kafka is a real-time stream processing engine which provides very high throughput and low latency.
Yarn: This is the resource manager in Hadoop 2.
Distributed computing on Hadoop Suppose you put plenty of data on a disk and read it. Reading this entire data takes, for example, 24 hours. Now, suppose you distribute this data on 24 different machines of the same type and run the read program at the same time on all the machines. You might be able to parallelly read the entire data in an hour (an assumption just for the purpose of this example). This is what parallel computing is all about. It helps in processing large volumes of data parallelly on multiple machines, called nodes, and then combining the results to build a cumulative output. Disk input/output is so slow that we cannot rely on a single program running on one machine to do all this for us. There is an added advantage of data storage across multiple machines, which is failover and replication support of data. The bare bones of Hadoop are the base modules that are shipped with its default download option. Hadoop consists of three main modules:
Hadoop core: This is the main layer that has the code for the failover, data replication, data storage, and so on.
HDFS: The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS is a distributed filesystem that provides high-performance access to data across Hadoop clusters.
MapReduce: This is the data analysis framework that runs parallelly on top of data stored in HDFS.
As you saw in the options above, if you install the base Hadoop package you will get the core Hadoop library, the HDFS filesystem, and the MapReduce framework by default, but this is not extensive and the current use cases demand much more than the bare minimum products provided by the Hadoop default installation. It is due to this reason that a whole set of products has originated on top of this big data stack, be it streaming products such as Storm, messaging products such as Kafka, or search products such as SOLR. 
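To make the MapReduce module described above a little more concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The class name WordCount and the input/output paths passed on the command line are illustrative assumptions, not code from this book's bundle:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in the input split it processes.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums up the counts for each word across all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Such a job would typically be packaged as a JAR and submitted with something like hadoop jar wordcount.jar WordCount /input /output, with the map and reduce tasks running in parallel on the nodes that hold the data.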
HDFS concepts HDFS is Hadoop's implementation of a distributed filesystem. The way it is built, it can handle large amounts of data. It can scale to an extent that other types of distributed filesystems, for example NFS, cannot. It runs on plain commodity servers and any number of servers can be used. HDFS is a write once, read several times type of filesystem. Also, you can append to a file, but you cannot update a file. So if you need to make an update, you need to create a new file with a different version. If you need frequent updates and the amount of data is small, then you should use other software such as an RDBMS or HBase. Design and architecture of HDFS These are some of the features of HDFS:
Open source: HDFS is a completely open source distributed filesystem and is a very active open source project.
Immense scalability for the amount of data: You can store petabytes of data in it without any problem.
Failover support: Any file that is put in HDFS is broken into chunks (called blocks) and these blocks are distributed across different machines of the cluster. Apart from the distribution of this file data, the data is also replicated across the different machines depending upon the replication level. Thereby, in case any machine goes down, the data is not lost and is served from another machine.
Fault tolerance: This refers to the capability of a system to work in unfavorable conditions. HDFS handles faults by keeping replicated copies of data. So due to a fault, if one set of data on a machine gets corrupted, then the data can always be pulled from some other replicated copy. The replicas of the data are created on different machines, so even if an entire machine goes down, there is still no problem, as the replicated data can always be pulled from some other machine that has a copy of it.
Data locality: The way HDFS is designed, it allows the main data processing programs to run closer to the data where it resides, and hence they are faster as less network transfer is involved.
Main components of HDFS There are two main daemons that make up HDFS. They are depicted in the following diagram: As you can see in the preceding diagram, the main components are:
NameNode: This is the main program (master) of HDFS. A file in HDFS is broken into chunks or blocks and is distributed and replicated across the different machines in the Hadoop cluster. It is the responsibility of the NameNode to figure out which blocks go where and where the replicated blocks end up. It is also responsible for clubbing the data of the file when the full file is asked for by the client. It maintains the full metadata for the file.
DataNodes: These are the slave processes running on the other machines (other than the NameNode machine). They store the data and provide the data when the NameNode asks for it.
The most important advantage of this master/slave architecture of HDFS is failover support. Thereby, if any DataNode or slave machine is down, the NameNode figures this out using a heartbeat signal and then refers to another DataNode that has a replicated copy of that data. Before Hadoop 2, the NameNode was the single point of failure, but after Hadoop 2, NameNodes have better failover support. So you can run two NameNodes alongside one another so that if one NameNode fails, the other NameNode can quickly take over control. 
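The next section walks through the HDFS shell, but the same file operations can also be driven from Java through Hadoop's FileSystem API; behind the scenes the client asks the NameNode for block locations and then streams the data from the DataNodes. The following is a minimal sketch of that idea, reusing the /usr/etl paths from the command examples below; the class name is illustrative and this is not a listing from the book's code bundle:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // List the files in a directory (the equivalent of 'hdfs dfs -ls /usr/etl').
    for (FileStatus status : fs.listStatus(new Path("/usr/etl"))) {
      System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
    }

    // Read a file (the equivalent of 'hdfs dfs -cat /usr/etl/dataload1.txt').
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/usr/etl/dataload1.txt"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}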
HDFS simple commands Most of the commands on HDFS are for storing, retrieving, or discarding data on it. If you are used to working on Linux, then using the HDFS shell commands is simple, as almost all the commands are a replica of the Linux commands with similar functions. Though the HDFS commands can be executed from the browser as well as using Java programs, for the purpose of this book we will only be discussing the shell commands of HDFS, as shown in the following table:
mkdir: This helps you to make a directory in HDFS: hdfs dfs -mkdir /usr/etl. You always start the command with hdfs dfs and then the actual command, which is exactly similar to the Linux command. In this case, this command makes a directory etl inside the /usr directory in HDFS.
put: This helps you to copy a file from a local filesystem to HDFS: hdfs dfs -put dataload1.txt /usr/etl. This copies the file dataload1.txt to the /usr/etl directory inside HDFS.
ls: This helps you to list out all files inside a directory: hdfs dfs -ls /usr/etl (lists out files inside /usr/etl).
rm: This helps you to remove a file: hdfs dfs -rm /usr/etl/dataload.txt (deletes dataload.txt inside /usr/etl).
du -h: This helps you to check the file size: hdfs dfs -du -h /usr/etl/dataload.txt.
chmod: This helps you to change the permissions on a file: hdfs dfs -chmod 700 /usr/etl/dataload.txt. This gives only the owner of the file complete permissions; the rest of the users won't have any permissions on the file.
cat: This helps you to read the contents of a file: hdfs dfs -cat /usr/etl/dataload.txt.
head: This helps you to read the top content (a few lines from the top) of a file: hdfs dfs -head /usr/etl/dataload.txt. Similarly, we have the tail command to read a few lines from the bottom of a file.
mv: This helps you to move a file across different directories: hdfs dfs -mv /usr/etl/dataload.txt /usr/input/newdataload.txt.
Apache Spark Apache Spark is the younger brother to the MapReduce framework. It's a cluster computing framework that is getting much more traction now in comparison to MapReduce. It can run on a cluster of thousands of machines, distribute computations on massive datasets across these machines, and combine the results. There are a few main reasons why Spark has become more popular than MapReduce: It is way faster than MapReduce because of its approach of handling a lot of its work in memory. So on the individual nodes it is able to do a lot of work in memory, while MapReduce, on the other hand, has to touch the hard disk many times to get a computation done, and since hard disk reads/writes are slow, MapReduce is much slower. Spark has an extremely simple API and hence it can be learned very fast. The best documentation is the Apache page itself, which can be accessed at spark.apache.org. Running algorithms such as machine learning algorithms on MapReduce can be complex, but the same can be very simple to implement in Apache Spark. It has a plethora of sub-projects that can be used for various other operations. Concepts The main concept for understanding Spark is the concept of RDDs, or Resilient Distributed Datasets. So what exactly is an RDD? A resilient distributed dataset (RDD) is an immutable collection of objects. These objects are distributed across the different machines available in a cluster. 
To a Java developer, an RDD is just like another variable that they can use in their program, similar to an ArrayList. They can directly use it or call some actions on it, for example, count() to figure out the number of elements in it. Behind the scenes, Spark spawns tasks that get propagated to the different machines in the cluster and brings back the computed results in a single object, as shown in the following example:
JavaRDD<String> rows = sc.textFile("univ_rankings.csv");
System.out.println("Total no. of rows --->" + rows.count());
The preceding code is simple yet it depicts the two powerful concepts of Apache Spark. The first statement shows a Spark RDD object and the second statement shows a simple action. Both of them are explained as follows:
JavaRDD<String>: This is a simple RDD with the name rows. As shown in the generics parameter, it is of type String. So it shows that this immutable collection is filled with string objects. So, if Spark, in this case, is sitting on 10 machines, then this list of strings or RDD will be distributed across the 10 machines. But to the Java developer, this object is just available as another variable, and if they need to find the number of elements or rows in it, they just need to invoke an action on it.
rows.count(): This is the action that is performed on the RDD and it computes the total number of elements in the RDD. Behind the scenes, this method would run on the different machines of the cluster parallelly, club the computed result from each parallel node, and bring back the result to the end user.
Note An RDD can be filled with any kind of object, for example, Java or Scala objects.
Next we will cover the types of operations that can be run on RDDs. RDDs support two types of operations, and they are transformations and actions. We will be covering both in the next sections. Transformations These are used to transform an RDD into just another RDD. This new RDD can later be used in different operations. Let's try to understand this using an example as shown here:
JavaRDD<String> lines = sc.textFile("error.log");
As shown in the preceding code, we are pulling all the lines from a log file called error.log into a JavaRDD of strings. Now, suppose we need to only filter out and use the data rows with the word error in them. To do that, we would use a transformation and filter out the content from the lines RDD, as shown next:
JavaRDD<String> filtered = lines.filter(s -> s.contains("error"));
System.out.println("Total no. of rows --->" + filtered.count());
As you can see in the preceding code, we filtered the RDD based on whether the word error is present in its elements or not, and the new RDD filtered only contains the elements or objects that have the word error in them. So, a transformation on one RDD produces another RDD only. Actions The user can take some actions on the RDD. For example, if they want to know the total number of elements in the RDD, they can invoke the action count() on it. It's very important to understand that everything that happens on an RDD up to and including a transformation is in lazy mode only; that is to say, the underlying data remains untouched until that point. It's only when we invoke an action on an RDD that the underlying data gets touched and an operation is performed on it. This is a design-specific approach followed in Spark and this is what makes it so efficient. We actually need the data only when we execute some action on it. What if the user filtered the error log for errors but never uses it? Then storing this data in memory would be a waste; thereby, only when some action such as count() is invoked will the actual data underneath be touched. 
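As a minimal sketch of this lazy behavior, reusing the error.log example above (the SparkContext sc and the file path are assumed to exist as shown earlier):
// Nothing is read from disk yet: textFile() and filter() only record the lineage
// of transformations that will be needed later.
JavaRDD<String> lines = sc.textFile("error.log");
JavaRDD<String> errors = lines.filter(s -> s.contains("error"));

// Only now does Spark actually load the file, apply the filter on the worker
// nodes, and ship the per-partition counts back to the driver.
long errorCount = errors.count();
System.out.println("Lines containing 'error' ---> " + errorCount);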
Here are a few common questions: When an RDD is created, can it be reused again and again? An RDD on which no action has been performed but only transformations are performed can be directly reused again and again, as until that point no underlying data is actually touched. However, if an action has been performed on an RDD, then this RDD object is utilized and discarded as soon as it is used. As soon as an action is invoked on an RDD, the underlying transformations are then executed, or in other words, the actual computation then starts and a result is returned. So an action basically helps in the return of a value. What if I want to re-use the same RDD even after running some action on it? If you want to reuse the RDD across actions, then you need to persist it or, in other words, cache it and re-use it across operations. Caching an RDD is simple: just invoke the persist API call and specify the type of persistence, for example, in memory or on disk, and so on. Thereby, the RDD, if small, can be stored in the memory of the individual parallel machines, or it can be written to disk if it is too big to fit into memory. An RDD that is stored or cached in this way, as mentioned earlier, is reusable only within that SparkContext session. That is to say, if your program ends, the usage ends and all the temporary disk files used for RDD storage are deleted. So what would you do if you need an RDD again and again in multiple programs going forward, in different SparkContext sessions? In this case, you need to persist and store the RDD in external storage (such as a file or database) and reuse it. In the case of big data applications, we can store the RDD in the HDFS filesystem or we can store it in a database such as HBase and reuse it later when it is needed again. In real-world applications, you would almost always persist an RDD in memory and reuse it again and again to expedite the different computations you are working on. What does a general Spark program look like? Spark is used in massive ETL (extract, transform, and load), predictive analytics, or reporting applications. Usually the program would do the following: 1. Load some data into an RDD. 2. Do some transformations on it to make the data compatible with your operations. 3. Cache the reusable data across operations (by using persist). 4. Do some actions on the data; the actions can be ready-made or can be custom operations that you wrote in your programs. Spark Java API Since Spark is written in Scala, which itself runs on the Java Virtual Machine, Java is the big brother on the Apache Spark stack and is fully supported on all its products. It has an extensive API on the Apache Spark stack. On Apache Spark, Scala is a popular language of choice, but most enterprise projects within big corporations still heavily rely on Java. Thus, for existing Java developers on these projects, using Apache Spark and its modules through their Java APIs is relatively easy to pick up. Here are some of the Spark APIs that Java developers can easily use while doing their big data work: Accessing the core RDD framework and its functions Accessing Spark SQL code Accessing Spark Streaming code Accessing the Spark GraphX library Accessing Spark MLlib algorithms 
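Pulling together the four steps outlined above with the caching discussion, here is a minimal sketch of what such a program can look like through the core Java API. The application name, file path, and filter condition are illustrative assumptions, not a listing from this book's code bundle:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SparkProgramSkeleton {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("skeleton").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // 1. Load some data into an RDD.
    JavaRDD<String> rows = sc.textFile("data/input.csv");

    // 2. Transform it into a shape your operations can handle.
    JavaRDD<String> cleaned = rows.filter(s -> !s.trim().isEmpty());

    // 3. Persist the reusable RDD so that later actions do not recompute it.
    cleaned.persist(StorageLevel.MEMORY_ONLY());

    // 4. Run actions on the data; both of these reuse the cached RDD.
    System.out.println("Total rows ---> " + cleaned.count());
    cleaned.take(5).forEach(System.out::println);

    sc.stop();
  }
}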
Apart from this, Java is very strong on the other big data products as well. To show how strong Java is on the overall big data scene, let's see some examples of big data products that readily support Java: Working on HBase using Java: HBase has a very strong Java API and data can easily be manipulated on it using Java. Working on Hive using Java: Hive is a batch storage product and working on it using Java is easy as it has a good Java API. Even HDFS supports a Java API for regular file handling operations on HDFS. Spark samples using Java 8 All our samples in the book are written using Java 8 on Apache Spark 2.1. Java 8 is aptly suited for big data work, mainly because of its support for lambdas, due to which the code is very concise. In the older versions of Java, Apache Spark Java code was not concise, but Java 8 has changed that completely. We encourage the readers of this book to actively use the Java 8 API on Apache Spark, as it not only produces concise code but overall improves the readability and maintainability of the code. One of the main reasons why Scala is heavily used on Apache Spark is its concise and easy-to-use API, but with the usage of Java 8 on Apache Spark, this advantage of Scala is no longer applicable. Loading data Before we use Spark for data analysis, there is some boilerplate code that we always have to write for creating the SparkConf and creating the SparkContext. Once these objects are created, we can load data from a directory in HDFS.
Note For all real-world applications, your data would either reside in HDFS or in databases such as Hive/HBase for big data.
Spark lets you load a file in various formats. Let's see an example to load a simple CSV file and count the number of rows in it. We will first initialize a few parameters, namely the application name, the master (whether Spark is running this locally or on a cluster), and the data filename, as shown next:
private static String appName = "LOAD_DATA_APPNAME";
private static String master = "local";
private static String FILE_NAME = "univ_rankings.txt";
Next, we will create the Spark config object and the SparkContext:
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
Using the SparkContext, we will now load the data file:
JavaRDD<String> rowRdd = sc.textFile(FILE_NAME);
Data operations – cleansing and munging This is the task on which the data analyst would be spending the maximum amount of time. Most of the time, the data that you would be using for analytics will come from log files or will be generated from other data sources. The data won't be clean and some data entries might be missing or completely incorrect. Before any data analytic tasks can be run on the data, it has to be cleaned and prepared in good shape for the analytic algorithms to run on. We will be covering cleaning and munging in detail in the next chapter. Analyzing data – count, projection, grouping, aggregation, and max/min I assume that you already have Spark installed. If not, refer to the Spark documentation on the web for installing Spark on your machine. Let's now use some popular transformations and actions on Spark. For the purpose of the following samples, we have used a small dataset of university rankings from Kaggle.com. It can be downloaded from this link: https://www.kaggle.com/mylesoneill/world-university-rankings. It is a comma-separated dataset of university names followed by the country the university is located in. 
Analyzing data – count, projection, grouping, aggregation, and max/min

I assume that you already have Spark installed. If not, refer to the Spark documentation on the web for installing Spark on your machine. Let's now use some popular transformations and actions in Spark. For the following samples, we have used a small dataset of university rankings from Kaggle.com. It can be downloaded from this link: https://www.kaggle.com/mylesoneill/world-university-rankings. It is a comma-separated dataset of university names followed by the country in which the university is located. Some sample data rows are shown next:

Harvard University, United States of America
California Institute of Technology, United States of America
Massachusetts Institute of Technology, United States of America
…

Common transformations on Spark RDDs

We will now cover a few common transformation operations that we frequently run on the RDDs of Apache Spark:

1. Filter: This applies a function to each entry of the RDD and keeps only the entries for which the function returns true. For example:

JavaRDD<String> rowRdd = sc.textFile(FILE_NAME);
System.out.println(rowRdd.count());

As shown in the preceding code, we loaded the data file using the Spark context. Now, using the filter function, we will keep only the rows that contain the words Santa Barbara, as shown next:

JavaRDD<String> filteredRows = rowRdd.filter(s -> s.contains("Santa Barbara"));
System.out.println(filteredRows.count());

2. Map: This transformation applies a function to each entry of an RDD. In the RDD we read earlier, we will find the length of each row of data using the map function, as shown next:

JavaRDD<Integer> rowLengths = rowRdd.map(s -> s.length());

After computing the length of each row in the RDD, we can collect the data of the resulting RDD and print its content:

List<Integer> rows = rowLengths.collect();
for (Integer row : rows) {
    System.out.println(row);
}

3. FlatMap: This is similar to map except that, in this case, the function applied to each row of the RDD returns a list or sequence of values instead of just one value, as in the case of the preceding map. As an example, let's create a sample RDD of strings using the parallelize function (this is a handy function for quick testing by creating dummy RDDs):

JavaRDD<String> rddX = sc.parallelize(
        Arrays.asList("big data", "analytics", "using java"));

On this RDD, a plain map that splits the strings by the spaces between them would return an RDD of string arrays:

JavaRDD<String[]> rddY = rddX.map(e -> e.split(" "));

flatMap, on the other hand, flattens all of these words into a single list of objects, {"big", "data", "analytics", "using", "java"}:

JavaRDD<String> rddY2 = rddX.flatMap(e -> Arrays.asList(e.split(" ")).iterator());

We can now collect and print rddY2 in the same way as shown for the other RDDs.

4. Other common transformations on RDDs are as follows:

Union: This is a union of two RDDs that creates a single new RDD containing the elements of both the combined RDDs.
Distinct: This creates an RDD of only the distinct elements.
MapPartitions: This is similar to map, as shown earlier, but it runs separately on each partition (block) of the RDD.

A quick sketch of union and distinct is shown next.
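As a quick illustration of two of the transformations listed above, the following sketch uses small dummy RDDs built with parallelize (the data is made up), unions them, and then removes the duplicates with distinct:

JavaRDD<String> countriesA = sc.parallelize(
        Arrays.asList("United States", "Italy", "Japan"));
JavaRDD<String> countriesB = sc.parallelize(
        Arrays.asList("Japan", "Germany"));

// union keeps duplicates ("Japan" appears twice) ...
JavaRDD<String> allCountries = countriesA.union(countriesB);
// ... and distinct removes them.
JavaRDD<String> uniqueCountries = allCountries.distinct();

System.out.println("After union   : " + allCountries.collect());
System.out.println("After distinct: " + uniqueCountries.collect());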
Actions on RDDs

As mentioned earlier, the actual work on the data starts when an action is invoked. Until that time, all the transformations are only tracked in the driver program and sent to the data nodes as a set of tasks. We will now cover a few common actions that we frequently run on the RDDs of Apache Spark:

count: This is used to count the number of elements of an RDD. For example, the rowRdd.count() method counts the rows in the row RDD.

collect: This brings all the data back from the different nodes into an array on the driver program (it can cause the driver to run out of memory if the collected data does not fit there). It is good for quick testing on small RDDs:

JavaRDD<String> rddX = sc.parallelize(
        Arrays.asList("big data", "analytics", "using java"));
List<String> strs = rddX.collect();

This would return the following three strings: big data, analytics, using java.

reduce: This action takes a function of two parameters that returns a single value, and it is used to aggregate the data elements of an RDD. As an example, let's create a sample RDD using the parallelize function:

JavaRDD<String> rddX2 = sc.parallelize(Arrays.asList("1", "2", "3"));

After creating the RDD rddX2, we can sum up all its integer elements by invoking the reduce function on this RDD:

String sumResult = rddX2.reduce((String x, String y) -> {
    return "" + (Integer.parseInt(x) + Integer.parseInt(y));
});

Finally, we can print the sum of the RDD elements:

System.out.println("sumResult ==> " + sumResult);

foreach: Just as the foreach loop of Java iterates over a collection, this action accesses each element of the RDD:

JavaRDD<String> rddX3 = sc.parallelize(
        Arrays.asList("element-1", "element-2", "element-3"));
rddX3.foreach(f -> System.out.println(f));

This will print the output as follows:

element-1
element-2
element-3

Paired RDDs

Just as a HashMap is a collection of key-value pairs, paired RDDs are key-value pair collections, except that the collection is distributed. Spark treats these paired RDDs specially and provides special operations on them, as shown next.

An example of a paired RDD: let's create a sample key-value paired RDD using the parallelize function:

JavaRDD<String> rddX = sc.parallelize(
        Arrays.asList("videoName1,5", "videoName2,6", "videoName3,2", "videoName1,6"));

Now, using the mapToPair function, we extract the keys and values from the data rows and return them as a key-value pair, or simply a Tuple2:

JavaPairRDD<String, Integer> videoCountPairRdd = rddX.mapToPair((String s) -> {
    String[] arr = s.split(",");
    return new Tuple2<>(arr[0], Integer.parseInt(arr[1]));
});

Now, collect and print these tuples:

List<Tuple2<String, Integer>> testResults = videoCountPairRdd.collect();
for (Tuple2<String, Integer> tuple2 : testResults) {
    System.out.println(tuple2._1);
}

This will print the keys of the pairs (videoName1 appears twice because it occurs in two input rows):

videoName1
videoName2
videoName3
videoName1

Transformations on paired RDDs

Just as we can run transformations on plain RDDs, we can also run transformations on paired RDDs. Some of the transformations that we can run on paired RDDs are explained as follows:

reduceByKey: This is a transformation operation on a key-value paired RDD. The operation involves shuffling across different data partitions and creates a new RDD. Its parameter is a cumulative function, which is applied to the elements so that they are aggregated into a cumulative result per key. In the preceding RDD, we have data elements for video names and hit counts of the videos, as shown in the following table:

Video name    Hit counts
videoName1    5
videoName2    6
videoName3    2
videoName1    6

We will now run reduceByKey on the paired RDD to find the net hit count of each video. We load the data into an RDD in the same way as shown earlier. Once the data is loaded, we can do a reduceByKey to sum up the hit counts of the different videos:

JavaPairRDD<String, Integer> sumPairRdd = videoCountPairRdd.reduceByKey((x, y) -> x + y);

After the transformation, we can collect the results and print them as shown next:

List<Tuple2<String, Integer>> sumResults = sumPairRdd.collect();
for (Tuple2<String, Integer> tuple2 : sumResults) {
    System.out.println("Title : " + tuple2._1 + ", Hit Count : " + tuple2._2);
}

The results should be printed as follows:

Title : videoName2, Hit Count : 6
Title : videoName3, Hit Count : 2
Title : videoName1, Hit Count : 11

groupByKey: This is another important transformation on a paired RDD. Sometimes, you want to club all the data for a particular key into one iterable unit so that you can later go through it for some specific work. groupByKey does this for you, as shown next:

JavaPairRDD<String, Iterable<Integer>> grpPairRdd = videoCountPairRdd.groupByKey();

After invoking groupByKey on videoCountPairRdd, we can collect and print the result of this RDD:

List<Tuple2<String, Iterable<Integer>>> grpResults = grpPairRdd.collect();
for (Tuple2<String, Iterable<Integer>> tuple2 : grpResults) {
    System.out.println("Title : " + tuple2._1);
    Iterator<Integer> it = tuple2._2.iterator();
    int i = 1;
    while (it.hasNext()) {
        System.out.println("value " + i + " : " + it.next());
        i++;
    }
}

And the results should be printed as follows:

Title : videoName2
value 1 : 6
Title : videoName3
value 1 : 2
Title : videoName1
value 1 : 5
value 2 : 6

As you can see, the contents of the videoName1 key were grouped together, and both of its counts, 5 and 6, were printed together. These paired RDD operations compose naturally with the plain RDD transformations we saw earlier; a short end-to-end sketch follows.
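As a small end-to-end sketch (the input strings are dummy data created with parallelize), the following code strings together flatMap, mapToPair, and reduceByKey to count how often each word occurs. It is only an illustration of how the operations covered above fit together, not a prescribed recipe.

JavaRDD<String> lines = sc.parallelize(
        Arrays.asList("big data analytics", "big data using java"));

JavaPairRDD<String, Integer> wordCounts = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())  // split rows into words
        .mapToPair(word -> new Tuple2<>(word, 1))                    // pair each word with a count of 1
        .reduceByKey((a, b) -> a + b);                               // sum the counts per word

for (Tuple2<String, Integer> wc : wordCounts.collect()) {
    System.out.println(wc._1 + " : " + wc._2);
}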
Saving data

The contents of an RDD can be stored in external storage, and the RDD can later be rebuilt from this external storage. There are a few handy methods for pushing the contents of an RDD into external storage:

saveAsTextFile(path): This writes the elements of the dataset as text files to an external directory in HDFS.
saveAsSequenceFile(path): This writes the elements of the dataset as a Hadoop SequenceFile to a given path in the local filesystem, HDFS, or any other Hadoop-supported filesystem.

Collecting and printing results

We have already seen in multiple examples that, by invoking collect() on an RDD, we make the RDD gather its data from the different machines in the cluster and bring it to the driver, where we can then print it. When you fire a collect on an RDD, the data from the distributed nodes is pulled into the memory of the main node, or driver node. Once the data is available, you can iterate over it and print it on the screen. Since the entire data is brought into memory, this method is not suitable for pulling large amounts of data, as the data might not fit in the driver memory and an out of memory error might be thrown. If the amount of data is large and you want to peek into its elements, you can save your RDD to external storage in Parquet or text format and later analyze it using analytic tools such as Impala or Spark SQL.

There is also a method called take that you can invoke on a Spark RDD. This method returns only the first few elements of the RDD, so it can be used when you need to view just a few entries to check whether your computations are correct. A tiny sketch of both approaches is shown next.
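To make the last two sections concrete, here is a small sketch that saves an RDD as text files and peeks at a couple of elements with take. The output directory is hypothetical, and the snippet reuses the JavaSparkContext sc from the earlier examples.

JavaRDD<String> sampleRdd = sc.parallelize(
        Arrays.asList("big data", "analytics", "using java"));

// Persist the RDD contents to external storage (hypothetical output directory).
sampleRdd.saveAsTextFile("/tmp/sample_rdd_output");

// Peek at the first two elements only, instead of collecting everything.
List<String> firstTwo = sampleRdd.take(2);
for (String s : firstTwo) {
    System.out.println(s);
}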
Executing Spark programs on Hadoop

Apache Spark ships with a script, spark-submit, in its bin directory. Using this script, you can submit your program as a job to Spark's cluster manager (such as YARN), which then runs the program. These are the typical steps for running a Spark program:

1. Create a jar file of your Spark Java code.
2. Run the spark-submit job by giving the location of the jar file and the main class in it. An example of the command is shown next:

./bin/spark-submit --class <main-class> --master <master-url> <application-jar>

Some of the commonly used options of spark-submit are shown in the following table:

--class: The Java class that is the main entry point for your Spark code execution.
--master: The master URL for the cluster.
application-jar: The jar file containing your Apache Spark code.

Note: For additional spark-submit options, please refer to the Spark programming guide on the web. It has extensive information on them.

Apache Spark sub-projects

Apache Spark has now become a complete ecosystem of many sub-projects. For different operations, we have different modules, as shown next:

Core Spark: This is the foundation framework for all the other modules. It contains the implementation of the Spark computing engine, that is, RDDs, executors, storage, and so on.
Spark SQL: Spark SQL is a Spark module for structured data processing. Using it, you can fire SQL queries on your distributed datasets. It is very easy to use.
Spark Streaming: This module helps in processing live data streams, whether they are coming from products such as Kafka, Flume, or Twitter.
GraphX: This helps in building components for Spark parallel graph computations.
MLlib: This is a machine learning library built on top of the Spark core, and hence its algorithms can be distributed and run in parallel across massive datasets.

Spark machine learning modules

Spark MLlib is Spark's implementation of machine learning algorithms based on the RDD format. It consists of algorithms that can easily be run in parallel across a cluster of machines, so it is much faster and more scalable than single-node machine learning libraries such as scikit-learn. This module allows you to run machine learning algorithms on top of RDDs, but the API is not very user friendly and can sometimes be difficult to use.

Recently, Spark has come up with the new Spark ML package, which essentially builds on top of the Spark dataset API. As such, it inherits all the good features of datasets, namely massive scalability and extreme ease of use. Anybody who has used the very popular Python scikit-learn library for machine learning will notice that the API of the new Spark ML is quite similar to it. According to the Spark documentation, Spark ML is now the recommended way of doing machine learning tasks, and the old RDD-based Spark MLlib API will be deprecated in time. Because Spark ML is based on datasets, it lets us use Spark SQL along with it. Feature extraction and feature manipulation tasks become very easy, as a lot of that work can now be handled with Spark SQL alone, especially data manipulation using plain SQL queries.

Apart from this, Spark ML ships with an advanced feature called Pipeline. Plain data is usually in an extremely raw format, and it typically goes through a workflow in which it is cleaned, mutated, and transformed before it is consumed and used to train machine learning models. This entire workflow and its stages are encapsulated in the Pipeline feature of the Spark ML library. So you can build the different stages of the workflow, whether for feature extraction, feature transformation, or converting features to the mathematical vector format, and tie all of this code together using the pipeline API of Spark ML. This helps us maintain large code bases of machine learning stacks: if you later want to switch out some piece of code (for example, for feature extraction), you can change it separately and hook it back into the pipeline, and this works cleanly without impacting any other area of the code.

MLlib Java API

The MLlib module is completely supported in Java and is quite easy to use. A minimal sketch of a Spark ML pipeline written in Java is shown next.
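The following is only a minimal sketch of the Pipeline idea described above, not a complete program. It assumes a SparkSession named spark (built as shown in the next chapter) and uses a tiny made-up training dataset with a label column and a text column; the stages chained together (Tokenizer, HashingTF, and LogisticRegression) are standard Spark ML components.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Tiny, made-up training data: a label column and a free-text column.
List<Row> rows = Arrays.asList(
        RowFactory.create(1.0, "big data analytics with java"),
        RowFactory.create(0.0, "a book about gardening"));
StructType schema = new StructType(new StructField[] {
        new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("text", DataTypes.StringType, false, Metadata.empty()) });
// 'spark' is an assumed SparkSession.
Dataset<Row> trainingData = spark.createDataFrame(rows, schema);

// Stage 1: split the text column into words.
Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
// Stage 2: hash the words into a numeric feature vector.
HashingTF hashingTF = new HashingTF().setInputCol("words").setOutputCol("features");
// Stage 3: train a logistic regression model on the "features" and "label" columns.
LogisticRegression lr = new LogisticRegression().setMaxIter(10);

// Tie the three stages together into a single workflow.
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { tokenizer, hashingTF, lr });
PipelineModel model = pipeline.fit(trainingData);
Dataset<Row> predictions = model.transform(trainingData);
predictions.show();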
Other machine learning libraries

There are many other machine learning libraries out there; some of the popular ones are scikit-learn, PyBrain, and so on. But, as mentioned earlier, these are single-node libraries built to run on one machine, and their algorithms are not optimized to run in parallel across a stack of machines and then combine the results.

Note: How do you use these libraries on big data if there is a particular algorithm implementation that you want to use from them? On all the parallel nodes that run your Spark tasks, make sure the installation of the specific library is present. Also, any jars or executables required to run the algorithm must be available on the path of the spark-submit job.

Mahout – a popular Java ML library

Apache Mahout is also a popular open source library from the Apache stack. It contains scalable machine learning algorithms. Some of its algorithms can be used for tasks such as:

Recommendations
Classification
Clustering

Some important features of Mahout are as follows:

Its algorithms run on Hadoop, so they work well in a distributed environment
It has MapReduce implementations of several algorithms

Deeplearning4j – a deep learning library

This library is fully built on Java and is a deep learning library. We will cover it in our chapter on deep learning.

Compressing data

Big data is distributed data that is spread across many different machines, so for the various operations running on the data, data transfer across machines is a given. The compression formats supported on Hadoop for input data include gzip, bzip2, Snappy, and so on. While we won't go into detail on compression, it must be understood that, when you actually work on big data analytics tasks, compressing your data is almost always beneficial, providing a few main advantages as follows:

If the data is compressed, the data transfer bandwidth needed is lower, and as such the data transfers faster. The amount of storage needed for compressed data is also much less.
Hadoop ships with a set of compression formats that support easy distribution across a cluster of machines. So, even if the compressed files are chunked and distributed across a cluster of machines, you can run your programs on them without losing any information or important data points.

Avro and Parquet

Spark helps in writing data to Hadoop and in Hadoop input/output formats. Avro and Parquet are two popular Hadoop file formats that have specific advantages. For the purposes of our examples, apart from the usual file formats such as log and text files, data can also be present in the Avro or Parquet format. So what are Avro and Parquet, and what is special about them?

Avro is a row-based format and is also schema based. The schema for the structure of a data row is stored within the file; due to this, the schema can evolve independently and there is no impact on reading old files. Also, since it is a row-based format, the files can easily be split by rows, put on multiple machines, and processed in parallel. It has good failover support too.

Parquet is a columnar file format. Parquet is specifically suited for applications where, for analytics, you need only a subset of your columns and not all of them. So, for operations such as summing up or aggregating a specific column, Parquet is best suited. Since Parquet helps in reading only the columns that are needed, it reduces disk I/O tremendously, and hence it reduces the time taken to run analytics on the data. As a small illustration, a sketch of writing and reading Parquet data with Spark follows.
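The sketch below shows one way Spark can persist a dataframe in Parquet format and read it back. The paths are hypothetical, and it assumes a SparkSession named spark (built as shown in the next chapter); Avro works in a similar spirit but needs an external Spark package, so it is not shown here.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// 'spark' is an assumed SparkSession and the paths are hypothetical.
Dataset<Row> df = spark.read().json("/data/sample_records.json");

// Write the data out in the columnar Parquet format to an HDFS directory.
df.write().parquet("/data/sample_records_parquet");

// Read the Parquet files back; only the columns a query touches are scanned,
// which is what makes Parquet efficient for column-oriented analytics.
Dataset<Row> parquetDF = spark.read().parquet("/data/sample_records_parquet");
parquetDF.show();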
Summary

In this chapter, we covered what big data is all about and how we can analyze it. We showed the three Vs that constitute big data: volume, variety, and velocity. We also covered some ground on the big data stack, including Hadoop, HDFS, and Apache Spark. While learning Spark, we went through some examples of the Spark RDD API and also learned a few useful transformations and actions.

In the next chapter, we will get our first taste of running analytics on big data. For this, we will initially use Spark SQL, a very useful Spark module, to do simple yet powerful analysis of our data, and later we will go on to build more complex analytic tasks while learning market basket analysis.

Chapter 2. First Steps in Data Analysis

Let's take the first steps towards data analysis now. Apache Spark has a prebuilt module called Spark SQL, and this module is used for structured data processing. Using this module, we can execute SQL queries on our underlying data. Spark lets you read data from various data sources, whether text, CSV, or Parquet files on HDFS, or from Hive or HBase tables. For simple data analysis tasks, whether you are exploring your datasets initially or trying to analyze and prepare a report of simple statistics for your end users, this module is tremendously useful.

In this chapter, we will work on two datasets. The first dataset that we will analyze is a simple dataset, and the next one is a more complex, real-world dataset from an e-commerce store. In this chapter, we will cover the following topics:

Basic statistical analytic approaches using Spark SQL
Building association rules using the Apriori algorithm
Advantages and disadvantages of using the Apriori algorithm
Building association rules using the faster and more efficient FP-Growth algorithm

Datasets

Before we get our hands wet in the world of complex analytics, we will take small baby steps and learn some basic statistical analysis first. This will help us get familiar with the approach that we will be using on big data for the other solutions as well. For our initial analysis, we will take a simple cars JSON dataset that has details about a few cars from different countries. We will analyze it using Spark SQL and see how easy it is to query and analyze datasets using Spark SQL. Spark SQL is handy for basic analytics purposes and is nicely suited to big data; it can be run on massive datasets whose data resides in HDFS.

To start with a simple case study, we are using a cars dataset. This dataset can be obtained from http://www.carqueryapi.com/, specifically from the link http://www.carqueryapi.com/api/0.3/?callback=?&cmd=getMakes. The dataset contains data about cars in different countries and is in JSON format. It is not a very big dataset from the perspective of big data, but for our purpose of starting with a simple analytics case study it suits our requirements well. The dataset has four important attributes, shown as follows:

make_id: The make of the car, for example acura or mercedes
make_display: The display name for the make of the car
make_is_common: Whether the make is a common make (marked as 1 if it is a common make, else 0)
make_country: The country where the car is made

Note: We are using this data only for learning purposes. The cars dataset can be replaced by any other dataset for our learning purposes here; hence, we are not bothered about the accuracy of this data, and we are not using it for anything other than our simple learning case study.
Also, here is a sample of some of the data in the dataset:

{ "make_id": "acura", "make_display": "Acura", "make_is_common": "1", "make_country": "USA" }
Here, the make_id (or type) of the car is acura and it is made in the USA.

{ "make_id": "alfa romeo", "make_display": "Alfa Romeo", "make_is_common": "1", "make_country": "Italy" }
Here, the car is of the make Alfa Romeo and it is made in Italy.

Data cleaning and munging

The major part of the time spent by a developer performing a data analysis task goes into data cleaning or producing data in a particular format. Most of the time, while analyzing log file data or files pulled from some other system, there will be some data cleaning involved. Data cleaning can take many forms, whether it involves discarding a certain kind of data or converting bad data into a different format. Also note that most machine learning algorithms run on mathematical datasets, but most practical datasets don't contain only mathematical data. Converting text data into a mathematical form is another important task that many developers need to perform before they can apply data analysis tasks to the data.

If there are problems in the data that we need to resolve before we use it, then this approach of fixing the data is called data munging. One common data munging task is fixing null values in the data; these null values might represent either bad data or missing data. Bad or missing data is not good for our analysis, as it can lead to bad analytical results. Such data issues need to be fixed before we can use the data in the actual analysis. To learn how we can fix our data before using it, let's pick up the dataset we are using in this chapter and fix its data before analyzing it.

Most of your time as a developer performing data analysis on big data will be spent on making the data good enough for training the models. The general tasks include:

Filtering the unwanted data: There are times when some of the data in your dataset might be corrupted or bad. If you can fix this data somehow, then you should; otherwise, you will have to discard it. Sometimes the data might be good but contain attributes that you don't need; in this case, you can discard these extra attributes. You can also use Apache Spark's filter method to filter out the unwanted data.

Handling incomplete or missing data: Not all data points might be present in the data. In such a situation, the developer needs to figure out which value, or default value, should be used when a data point is not available. Filling in missing values well is a very important task, especially if you are using this data to analyze your dataset. We will look at some common strategies for handling missing data:

Discarding the data: If many attributes in a row of data are missing, one easy approach is to discard that row. This is not a very fruitful approach, especially if the row contains other meaningful attributes that we are using.

Filling in some constant value: You can fill in some constant, generic value for missing attributes. For example, in the cars dataset, you might have entries like the following row, with an empty make_id and an empty make_display:

{ "make_id": "", "make_display": "", "make_country": "JAPAN" }

If we simply discard such entries, it won't be a good approach.
For instance, if we are asked to find the total number of cars from JAPAN in this dataset, we would query with make_country = 'JAPAN', and this row would still count. To counter the missing values and still use this data, we can fill in some constant value, such as Unknown, in these fields. So the row will look like this:

{ "make_id": "Unknown", "make_display": "Unknown", "make_country": "JAPAN" }

As shown, we have filled in the Unknown keyword wherever we saw empty data, as in the case of make_id and make_display.

Populating with an average value: This might work in some cases. If you have a missing value in some column, you can take all the rows with good data in that column, find their average, and use this average as the value for the missing item.

Nearest neighbor approach: This is one of my favorite approaches, and once we cover the KNN algorithm in this book we will revisit this topic. Basically, you find the data points that are most similar to the one with missing attributes in your dataset, and you then replace the missing attributes with the attributes of the nearest data point that you found. So suppose you have the data from your dataset plotted on a scatter plot, as shown in the following screenshot:

The preceding screenshot shows some data points of a dataset plotted on the x and y axes of a scatter plot. Look at the data point indicated by the arrow labeled Point A. If this data point has some missing attributes, we find the data point nearest to it, which in this case is the point labeled Point B, as shown by the other arrow. From this data point, we now pull the missing attributes. For this approach, we use the KNN, or k-nearest neighbors, algorithm to figure out the distance of one data point from another based on some attributes.

Converting data to a proper format: Sometimes you might have to convert data from one format to another for your analytics task, for example, converting non-numeric values to numeric values, or converting a date field to a proper format.

A small sketch of the constant-fill strategy, using the Spark SQL API that the following sections introduce, is shown next.
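As a minimal sketch of the constant-fill strategy (treat it as a preview: it uses the carsBaseDF dataframe that the following sections will build from cars.json), the code below replaces empty make_id and make_display values with the constant Unknown using the dataframe column functions when and otherwise:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// carsBaseDF is the dataframe loaded from cars.json in the next section.
Dataset<Row> filledDF = carsBaseDF
        .withColumn("make_id",
                when(col("make_id").equalTo(""), "Unknown").otherwise(col("make_id")))
        .withColumn("make_display",
                when(col("make_display").equalTo(""), "Unknown").otherwise(col("make_display")));
filledDF.show();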
Basic analysis of data with Spark SQL

Spark SQL is a Spark module for structured data processing. Almost all developers know SQL, and Spark SQL provides an SQL interface to your Spark data (RDDs). Using Spark SQL, you can fire SQL or SQL-like queries on your big data set and fetch the data into objects called dataframes. A dataframe is like a relational database table: it has columns, and we can apply functions to these columns, such as groupBy, and so on. It is very easy to learn and use. In the next section, we will cover a few examples of how we can use dataframes and run regular analysis tasks.

Building SparkConf and context

This is just boilerplate code and is the entry point for the usage of our Spark SQL code. Every Spark program starts with this boilerplate code for initialization. In this code, we build the Spark configuration, apply the configuration parameters (such as the application name and the master location), and build the SparkSession object. This SparkSession object is the main object with which you can fire SQL queries on your dataset.

SparkConf sconf = new SparkConf().setAppName(APP_NAME).setMaster(APP_MASTER);
SparkSession spark = SparkSession.builder()
        .config(sconf)
        .getOrCreate();

Dataframe and datasets

A dataframe is a collection of distributed objects organized into named columns. It is similar to a table in a relational database, and you can use Spark SQL to query it in a similar way. You can build dataframes from various data sources such as JSON files, CSV files, Parquet files, or directly from Hive tables, and so on.

A dataset is also a distributed collection of objects, but it is essentially a hybrid of a Resilient Distributed Dataset (RDD) and a dataframe. An RDD, or resilient distributed dataset, is a distributed collection of objects; it is similar to an array list in Java, except that it is filled with objects distributed across multiple machines, and Spark provides a low-level API to interact with it. A dataframe, on the other hand, is a higher-level abstraction on top of RDDs; it is similar to a relational database table, and SQL queries can be fired on it. As mentioned before, a dataset object is a hybrid of the two: it supports firing SQL queries, similar to dataframes, and also applying RDD functions such as map, filter, and flatMap, similar to RDDs.

Load and parse data

The Spark API is very extensive. We can load data in different formats out of the box, clean or munge the data as required, and use it in our analysis tasks. The following code shows one way of loading a dataset. Here we are loading data from a JSON file. This builds a Dataset<Row>, which is similar to a table in a relational database in that it has a set of columns:

Dataset<Row> carsBaseDF = spark.read().json("src/resources/data/cars.json");
carsBaseDF.show();

Now we will register this dataframe as a temporary view. Registering it as a temp view in the Spark session means we can fire queries on it just as we execute queries on an RDBMS table. It's as simple as that. To use this dataset as a relational database table and fire queries on it, just use the createOrReplaceTempView method, shown as follows:

carsBaseDF.createOrReplaceTempView("cars");

Now this data is available as a table called cars, just like a relational database table, and you can fire any SQL query on it, such as select * from cars to pull all the rows.

Analyzing data – the Spark-SQL way

Let's now dive into a few examples. You can find more examples in the accompanying code in the GitHub repository too. For brevity, I am not showing the boilerplate code again and again; I will just refer to the SparkSession object as spark.

Simply select and print data: Here we will just execute a query on the cars table and print a sample of the results. It's exactly like firing a select query on a relational database table:

Dataset<Row> netDF = spark.sql("select * from cars");
netDF.show();

The first few rows of the result will be printed to the console.

Filtering and projecting data: Here we will select only two columns, make_country and make_display, from the cars table and print the first few rows of the result. For this, we use the Spark session to fire a SQL query on the cars table, and we use the handy show() method, which prints the first few rows of the result set:

Dataset<Row> singleColDF = spark.sql("select make_country, make_display from cars");
singleColDF.show();

The output is printed to the console.

Total count: Here we will find the total number of rows in our dataset. For this, we will use the count method on the dataset, which, when executed, returns the total number of rows in the dataset:
System.out.println("Total Rows in Data --) " + netDF.count(); The output is as follows: Total Rows in dataset :155 Selective data: Let's fetch some data based on some criteria: Fetch the cars made in Italy only: We will fire a query on our car view with a where clause specifying the make_country as 'Italy': Dataset italyCarsDF = spark.sql("select * from cars where make_country 'Italy'"}; italyCarsDF.show(}; //show the full content The result will be printed as follows: Fetch the count of cars from Italy: We will just use the count method on the dataset we received in the previous call where we fetched the rows that belonged only to country 'Italy': System.out.println("Data on Italy Cars"); System .out. println ("Number of cars from Italy in this data set --> " + italyCarsDF. count (); This will print the following: Collect all data and print it: Now discard the show() function as it is just a handy function for testing and instead of that let's use a function that we will use to get the data after firing the queries. List italyRows = italyCarsDF.collectAsList(); for (Row italyRow : italyRows) { System.out.println("Car type -> " + italyRow.getString(1); } This will print out all the types of cars that are made in Italy as shown (we are only showing the first few cars here) Total count of cars from Japan in the dataset: We selected records that belong to Italy. Let's find the total count of cars from Japan in the dataset. This time we will just pull the count and not the total data for Japanese cars: Dataset jpnCarsDF = spark.sql("select count(*) from cars where make_country = 'Japan'"); List jpnRows = jpnCarsDF.collectAsList(); System.out.println("Japan car dataset -----~> " + jpnRows.get(0).getLong(0); As shown, we build a dataframe by searching only for Japanese cars, and next we print the count of these rows. The result is as follows: Distinct countries and their count: Just like we use the distinct clause in SQL we can use the distinct clause in this big data Spark SQL query. If the result is small, as in this case, we can do a collect() and bring the data result in the memory of the driver program and print it there. Using the following code, we will print the distinct countries in this dataset of cars: Dataset distinctCntryDF = spark.sql("select distinct make_country from Cars"); List distinctCtry = distinctCntryDF.collectAsList(); System.out.println("Printing Distinct Countries below"); for (Row drow : distinctCtry) { System.out.println(drow.get(0).toString(); System. out. println( "Total Distinct Countries ; " + distinctCtry.length); } And the result is printed as follows: Group by country and find count: Now, let's try to find the number of cars from each country and sort it in descending order. As most Java developers have used SQL before, this is a simple group by clause along with an order by for ordering by count in descending order as shown: Dataset grpByCntryDF = spark.sql("select make_country,count(*) cnt from Cars order by cnt desc"); As seen we fired a simple group by query and counted the number of countries in the dataset and finally sorted by the count in descending order. We will now print the first few rows of this dataset: grpByCntryDF.show() The result should be printed as follows: