Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills
Advanced Analytics with Spark
by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills
Copyright © 2015 Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles. For more information, contact our corporate/institutional sales department:
800-998-9938.
Editor: Ann Spencer
Production Editor: FIX ME!
Copyeditor: FIX ME!
Proofreader: FIX ME!
Indexer: FIX ME!
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Rebecca Demarest
First Edition
Revision History for the First Edition:
2014-11-12: Early release revision 1
2015-01-05: Early release revision 2
2015-01-21: Early release revision 3
See for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly
Media, Inc. !!FILL THIS IN!! and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark
claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and authors assume
no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.
ISBN: 978-1-491-91269-0
Table of Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1. Analyzing Big Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
The Challenges of Data Science 3
Introducing Apache Spark 4
About This Book 6
2. Introduction to Data Analysis with Scala and Spark. . . . . . . . . . . . . . . . . . . . . . . . . 9
Scala for Data Scientists 10
The Spark Programming Model 11
Record Linkage 11
Getting Started: The Spark Shell and SparkContext 12
Bringing Data from the Cluster to the Client 18
Shipping Code from the Client to the Cluster 21
Structuring Data with Tuples and Case Classes 22
Aggregations 27
Creating Histograms 27
Summary Statistics For Continuous Variables 29
Creating Reusable Code For Computing Summary Statistics 30
Simple Variable Selection and Scoring 34
Where To Go From Here 36
3. Recommending Music and the Audioscrobbler Data Set. . . . . . . . . . . . . . . . . . . . . . . . . . 37
Data Set 38
The Alternating Least Squares Recommender Algorithm 39
Preparing the Data 42
Building a First Model 44
Spot Checking Recommendations 46
Evaluating Recommendation Quality 48
Computing AUC 49
Hyperparameter Selection 51
Making Recommendations 53
Where To Go From Here 54
4. Predicting Forest Cover with Decision Trees. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Fast Forward to Regression 57
Vectors and Features 58
Training Examples 59
Decision Trees and Forests 60
Covtype Data Set 63
Preparing the Data 64
A First Decision Tree 65
Decision Tree Hyperparameters 69
Tuning Decision Trees 71
Categorical Features Revisited 73
Random Decision Forests 75
Making Predictions 77
Where To Go From Here 77
5. Anomaly Detection in Network Traffic with K-means Clustering. . . . . . . . . . . . . . . . . . . 79
Anomaly Detection 80
K-means clustering 80
Network Intrusion 81
KDD Cup 1999 Data Set 82
A First Take on Clustering 83
Choosing k 85
Visualization in R 87
Feature Normalization 90
Categorical Variables 92
Using Labels with Entropy 93
Clustering in Action 94
Where To Go From Here 95
6. Understanding Wikipedia with Latent Semantic Analysis. . . . . . . . . . . . . . . . . . . . . . . . . 97
The Term-Document Matrix 98
Getting The Data 100
Parsing and Preparing the Data 100
Lemmatization 101
Computing the TF-IDFs 102
Singular Value Decomposition 105
Finding Important Concepts 106
Querying and Scoring with the Low-Dimensional Representation 109
Term-Term Relevance 109
Document-Document Relevance 111
Term-Document Relevance 112
Multiple-Term Queries 113
Where To Go From Here 114
7. Analyzing Co-occurrence Networks with GraphX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
The MEDLINE Citation Index: A Network Analysis 116
Getting The Data 117
Parsing XML Documents with Scala’s XML Library 119
Analyzing the MeSH Major Topics and their Co-occurrences 121
Constructing a Co-occurrence Network with GraphX 123
Understanding the Structure of Networks 126
Connected Components 126
Degree Distribution 129
Filtering Out Noisy Edges 132
Processing EdgeTriplets 133
Analyzing the Filtered Graph 134
Small World Networks 136
Cliques and Clustering Coefficients 137
Computing Average Path Length with Pregel 138
Where To Go From Here 143
8. Geospatial and Temporal Data Analysis on the New York City Taxicab Data. . . . . . . . . 145
Getting the Data 146
Working With Temporal And Geospatial Data in Spark 147
Temporal Data with JodaTime and NScalaTime 147
Geospatial Data with the Esri Geometry API and Spray 149
Exploring the Esri Geometry API 149
Intro to GeoJSON 151
Preparing the New York City Taxicab Data 153
Handling Invalid Records at Scale 154
Geospatial Analysis 158
Sessionization in Spark 161
Building Sessions: Secondary Sorts in Spark 162
Where To Go From Here 165
9. Financial Risk through Monte Carlo Simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Terminology 168
Methods for Calculating VaR 168
Variance-Covariance 168
Historical Simulation 169
Monte Carlo Simulation 169
Our Model 169
Getting the Data 171
Preprocessing 171
Determining the Factor Weights 174
Sampling 176
The Multivariate Normal Distribution 178
Running the Trials 179
Visualizing the Distribution of Returns 182
Evaluating Our Results 183
Where To Go From Here 185
10. Analyzing Genomics Data and the BDG Project. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Decoupling Storage from Modeling 188
Ingesting Genomics Data with the ADAM CLI 190
Parquet format and columnar storage 195
Example: Predicting Transcription Factor Binding Sites from ENCODE Data 197
Example: Querying Genotypes from the 1000 Genomes Project 204
Where to go from here 205
11. Analyzing Neuroimaging Data with PySpark and Thunder. . . . . . . . . . . . . . . . . . . . . . . 207
Overview of PySpark 207
PySpark Internals 209
Overview and Installation of the Thunder library 210
Loading data with Thunder 212
Thunder Core Datatypes 218
Example: Categorizing Neuron Types with Thunder 220
Where To Go From Here 225
12. Appendix: Deeper Into Spark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
The Spark Execution Model 227
Serialization 229
Accumulators 229
Spark and the Data Scientist’s Workflow 230
File Formats 232
Spark Subprojects 233
MLlib 233
Spark Streaming 234
Spark SQL 235
GraphX 235
13. Appendix: Upcoming MLlib Pipelines API. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Beyond Mere Modeling 237
The Pipelines API 238
Text Classification Example Walkthrough 239
Preface
I don’t like to think I have many regrets, but it’s hard to believe anything good came out
of a particular lazy moment in 2011 when I was looking into how to best distribute tough
discrete optimization problems over clusters of computers. My advisor explained this
newfangled Spark thing he had heard of, and I basically wrote off the concept as too
good to be true and promptly got back to writing my undergrad thesis in MapReduce.
Since then, Spark and I have both matured a bit, but one of us has seen a meteoric rise
that’s nearly impossible to avoid making “ignite” puns about. Cut to two years later, and
it has become crystal clear that Spark is something worth paying attention to.
Spark’s long lineage of predecessors, running from MPI to MapReduce, makes it possible
to write programs that take advantage of massive resources while abstracting away the
nitty-gritty details of distributed systems. As much as data processing needs have motivated
the development of these frameworks, in a way the field of big data has become so related
to them that its scope is defined by what they can handle. Spark’s promise is to take this
a little further: to make writing distributed programs feel like writing regular programs.
Spark will be great at giving ETL pipelines huge boosts in performance and at easing some
of the pain that feeds the MapReduce programmer’s daily chant of despair (“why?
whyyyyy?”) to the Hadoop gods. But the exciting thing for me has always been
what it opens up for complex analytics. With a paradigm that supports iterative
algorithms and interactive exploration, Spark is finally an open source framework that
allows a data scientist to be productive with large datasets.
I think the best way to teach data science is by example. To that end, my colleagues and
I have put together a book of applications, trying to touch on the interactions between
the most common algorithms, datasets, and design patterns in large-scale analytics. This
book isn’t meant to be read cover to cover: page to a chapter that looks like something
you’re trying to accomplish, or that simply ignites your interest.
What’s in this Book
The first chapter will place Spark within the wider context of data science and big data
analytics. After that, each chapter will comprise a self-contained analysis using Spark.
The second chapter will introduce the basics of data processing in Spark and Scala
through a use case in data cleansing. The next few chapters will delve into the meat and
potatoes of machine learning with Spark, applying some of the most common algorithms
in canonical applications. The remaining chapters are a bit more of a grab bag
and apply Spark in slightly more exotic applications: for example, querying Wikipedia
through latent semantic relationships in the text, or analyzing genomics data.
Acknowledgements
It goes without saying that you wouldn’t be reading this book if it were not for the
existence of Apache Spark and MLlib. We all owe thanks to the team that has built and
open-sourced it, and to the hundreds of contributors who have added to it.
We would like to thank everyone that helped review and improve the text and content
of the book: Michael Bernico, Chris Fregly, Debashish Ghosh, Juliet Hougland, Nick
Pentreath. Thanks all!
Thanks to Marie Beaugureau and O’Reilly, for the experience and great support in get‐
ting this book published and into your hands.
TODO: finish Acknowledgements
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download at
This book is here to help you get your job done. In general, if example code is offered
with this book, you may use it in your programs and documentation. You do not need
to contact us for permission unless you’re reproducing a significant portion of the code.
For example, writing a program that uses several chunks of code from this book does
not require permission. Selling or distributing a CD-ROM of examples from O’Reilly
books does require permission. Answering a question by citing this book and quoting
example code does not require permission. Incorporating a significant amount of
example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Advanced Analytics with Spark by Ryza,
Laserson, Owen, and Wills (O’Reilly). Copyright 2014 Some Copyright Holder,
978-1-491-91269-0.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.
Safari® Books Online
Safari Books Online is an on-demand digital library that
delivers expert content in both book and video form from
the world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative
professionals use Safari Books Online as their primary resource for research, problem
solving, learning, and certification training.
Safari Books Online offers a range of plans and pricing for enterprise, government,
education, and individuals.
Members have access to thousands of books, training videos, and prepublication
manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice
Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit
Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM
Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill,
Jones & Bartlett, Course Technology, and hundreds more. For more information about
Safari Books Online, please visit us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at <catalog page>.
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website
Find us on Facebook:
Follow us on Twitter:
Watch us on YouTube:
Analyzing Big Data
Sandy Ryza
- Build a model to detect credit card fraud using thousands of features and billions of transactions.
- Intelligently recommend millions of products to millions of users.
- Estimate financial risk through simulations of portfolios including millions of instruments.
- Easily manipulate data from thousands of human genomes to detect genetic associations with disease.
These are tasks that simply could not be accomplished five or ten years ago. When people
say that we live in an age of “big data,” they mean that we have tools for collecting, storing,
and processing information at a scale previously unheard of. Sitting behind these
capabilities is an ecosystem of open source software that can leverage clusters of
commodity computers to chug through massive amounts of data. Distributed systems like
Apache Hadoop have found their way into the mainstream and see widespread deployment
at organizations in nearly every field.
But just as a chisel and a block of stone do not make a statue, there is a gap between having
access to these tools and all this data, and doing something useful with it. This is where
“data science” comes in. As sculpture is the practice of turning tools and raw material
into something relevant to non-sculptors, data science is the practice of turning tools
and raw data into something that non-data scientists might care about.
Often “doing something useful” means placing a schema over the data and using SQL to answer
questions like “of the gazillion users who made it to the third page in our registration
process, how many are over 25?” The field of how to structure a data warehouse and
organize information to make answering these kinds of questions easy is a rich one, but
we will mostly avoid its intricacies in this book.
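A question of that shape needs nothing more exotic than an embedded SQL engine. The sketch below uses Python’s built-in sqlite3 module; the table name, columns, and rows are invented purely for illustration.

```python
import sqlite3

# An in-memory database standing in for a data warehouse (schema is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE registrations (user_id INTEGER, age INTEGER, furthest_page INTEGER)"
)
conn.executemany(
    "INSERT INTO registrations VALUES (?, ?, ?)",
    [(1, 31, 3), (2, 24, 3), (3, 40, 2), (4, 28, 4)],
)

# "Of the users who made it to the third page, how many are over 25?"
(count,) = conn.execute(
    "SELECT COUNT(*) FROM registrations "
    "WHERE furthest_page >= 3 AND age > 25"
).fetchone()
print(count)  # 2: users 1 and 4
```

The schema-plus-SQL style answers this kind of counting question in one declarative statement, which is exactly why it dominates when the analysis fits that mold.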
Sometimes, “doing something useful” takes a little extra. SQL still may be core to the
approach, but to work around idiosyncrasies in the data or perform complex analysis,
one needs a programming paradigm that’s a little more flexible, a little closer to the
ground, and with richer functionality in areas like machine learning and statistics. These
are the kinds of analyses we are going to talk about in this book.
For a long time, open source frameworks like R, the PyData stack, and Octave have
made rapid analysis and model building viable over small datasets. With fewer than ten
lines of code, one can throw together a machine learning model on half a dataset and
use it to predict labels on the other half. With a little more effort, one can impute missing
data, experiment with a few models to find the best one, or use the results of one model
as inputs to fit another. What should an equivalent look like that can leverage clusters
of computers to achieve the same outcomes on huge datasets?
The right approach might seem to be to simply extend these frameworks to run on multiple
machines: to retain their programming models and rewrite their guts to play well in
distributed settings. However, the challenges of distributed computing require us to
rethink many of the basic assumptions that we rely on in single-node systems. For
example, because data must be partitioned across many nodes on a cluster, algorithms that
have wide data dependencies will suffer from the fact that network transfer rates are
orders of magnitude slower than memory accesses. As the number of machines working
on a problem increases, the probability of a failure increases. These facts require a
programming paradigm that is sensitive to the characteristics of the underlying system:
one that discourages poor choices and makes it easy to write code that will execute in a
highly parallel manner.
Of course, single-machine tools like PyData and R that have come to recent prominence
in the software community are not the only tools used for data analysis. Scientific fields
like genomics that deal with large datasets have been leveraging parallel computing
frameworks for decades. Most people processing data in these fields today are familiar
with a cluster-computing environment called HPC (high-performance computing).
Where the difficulties with PyData and R lie in their inability to scale, the difficulties
with HPC lie in its relatively low level of abstraction and difficulty of use. For example,
to process a large file full of DNA sequencing reads in parallel, one must manually split
it up into smaller files and submit a job for each of those files to the cluster scheduler.
If some of these fail, the user must detect the failure and take care of manually
resubmitting them. If the analysis requires all-to-all operations like sorting the entire dataset,
the large dataset must be streamed through a single node, or the scientist must resort
to lower-level distributed frameworks like MPI, which are difficult to program without
extensive knowledge of C and distributed/networked systems. Tools written for HPC
environments often fail to decouple the in-memory data models from the lower-level
storage models. For example, many tools only know how to read data from a POSIX
filesystem in a single stream, making it difficult to make tools naturally parallelize, or
to use other storage backends, like databases. Recent systems in the Hadoop ecosystem
provide abstractions that allow users to treat a cluster of computers more like a single
computer: to automatically split up files and distribute storage over many machines,
to automatically divide work into smaller tasks and execute them in a distributed manner,
and to recover from failures automatically. The Hadoop ecosystem can automate a
lot of the hassle of working with large datasets, and is far cheaper than HPC to boot.
The Challenges of Data Science
A few hard truths come up so often in the practice of data science that evangelizing
these truths has become a large role of the data science team at Cloudera. For a system
that seeks to enable complex analytics on huge data to be successful, it needs to be
informed by, or at least not conflict with, these truths.
First, the vast majority of work that goes into conducting successful analyses lies in
preprocessing data. Data is messy, and cleaning, munging, fusing, mushing, and many
other verbs are prerequisites to doing anything useful with it. Large datasets in particular,
because they are not amenable to direct examination by humans, can require
computational methods to even discover what preprocessing steps are required. Even when
it comes time to optimize model performance, a typical data pipeline requires spending
far more time in feature engineering and selection than in choosing and writing algorithms.
For example, when building a model attempting to detect fraudulent purchases on a
website, the data scientist must choose from a wide variety of potential features: any
fields that users are required to fill out, IP location info, login times, and click logs as users
navigate the site. Each of these comes with its own challenges in converting to vectors
fit for machine learning algorithms. A system needs to support more flexible
transformations than turning a 2D array of doubles into a mathematical model.
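To make the vectorization point concrete, here is a hypothetical sketch of turning one such record into a flat vector of doubles; the field names, the category list, and the encoding choices are all invented for illustration.

```python
# One record from a hypothetical fraud dataset: a categorical field derived
# from the IP address, a login time, and whether an optional field was filled.
record = {"country": "DE", "login_hour": 23, "filled_phone_field": True}

countries = ["US", "DE", "FR"]  # categories observed in the training data

def to_vector(rec):
    # One-hot encode the categorical feature...
    one_hot = [1.0 if rec["country"] == c else 0.0 for c in countries]
    # ...then append the numeric features, scaled/cast to doubles.
    return one_hot + [rec["login_hour"] / 24.0, float(rec["filled_phone_field"])]

print(to_vector(record))  # [0.0, 1.0, 0.0, 0.9583333333333334, 1.0]
```

Every feature family (free text, timestamps, click sequences) needs its own such transformation, which is why feature engineering dominates the effort in practice.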
Second, iteration is a fundamental part of data science. Modeling and analysis typically
require multiple passes over the same data. One aspect of this lies within machine
learning algorithms and statistical procedures. Popular optimization procedures like
stochastic gradient descent and expectation maximization involve repeated scans over
their inputs to reach convergence. Iteration also matters within the data scientist’s own
workflow. When initially investigating and trying to get a feel for a dataset, usually the
results of a query inform the next query that should run. When building models, data
scientists do not try to get it right in one try. Choosing the right features, picking the
right algorithms, running the right significance tests, and finding the right hyperparameters
all require experimentation. A framework that requires reading the same dataset
from disk each time it is accessed adds delay that can slow down the process of
exploration and limit the number of things one gets to try.
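The flavor of iteration inside an optimization procedure can be sketched in a few lines. This toy stochastic gradient descent fits a one-parameter linear model and, like its industrial-strength relatives, rescans the full dataset once per epoch; the data, learning rate, and epoch count are made up.

```python
import random

random.seed(42)
# Synthetic data drawn from y = 3x plus a little noise; the true weight is 3.
xs = [i / 10 for i in range(1, 21)]
data = [(x, 3 * x + random.gauss(0, 0.1)) for x in xs]

w = 0.0    # model parameter to learn
lr = 0.05  # learning rate
for epoch in range(20):          # each epoch is one full pass over the dataset
    random.shuffle(data)
    for x, y in data:
        grad = 2 * (w * x - y) * x  # gradient of squared error on one example
        w -= lr * grad

print(round(w, 2))  # converges toward the true weight of 3
```

Twenty epochs means twenty complete scans of the input. If each scan meant rereading the data from disk, the cost of the reread, not the arithmetic, would dominate.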
Third, the task isn’t over when a well-performing model has been built. If the point of
data science is making data useful to non-data scientists, then a model stored as a list
of regression weights in a text file on the data scientist’s computer has not really
accomplished this goal. Uses of data like recommendation engines and real-time fraud detection
systems culminate in data applications. In these, models become part of a production
service and may need to be rebuilt periodically or even in real time.
For these situations, it is helpful to make a distinction between analytics in the lab and
analytics in the factory. In the lab, data scientists engage in exploratory analytics. They
try to understand the nature of the data they are working with. They visualize it and
test wild theories. They experiment with different classes of features and auxiliary sources
they can use to augment it. They cast a wide net of algorithms in the hopes that one
or two will work. In the factory, in building a data application, data scientists engage in
operational analytics. They package their models into services that can inform real-world
decisions. They track their models’ performance over time and obsess about how
they can make small tweaks to squeeze out another percentage point of accuracy. They
care about SLAs and uptime. Historically, exploratory analytics typically occurs in
languages like R, and when it comes time to build production applications, the data
pipelines are rewritten entirely in Java or C++.
Of course, everybody could save time if the original modeling code could actually be
used in the app it’s written for, but languages like R are slow and lack integration with
most of the production infrastructure stack, and languages like Java and C++ are
just poor tools for exploratory analytics. They lack REPL (read-evaluate-print loop)
environments for playing with data interactively and require large amounts of code to
express simple transformations. A framework that makes modeling easy but is also a
good fit for production systems is a huge win.
Introducing Apache Spark
Enter Apache Spark, an open source framework that combines an engine for distributing
programs across clusters of machines with an elegant model for writing programs
atop it. Spark, which originated at the UC Berkeley AMPLab and has since been
contributed to the Apache Software Foundation, is arguably the first open source software
that makes distributed programming truly accessible to data scientists.
One illuminating way to understand Spark is in terms of its advances over its predecessor,
MapReduce. MapReduce revolutionized computation over huge datasets by offering
a simple model for writing programs that could execute in parallel across hundreds
to thousands of machines. The MapReduce engine achieves near-linear scalability (as
the data size increases, one can throw more computers at it and see jobs complete in
the same amount of time) and is resilient to the fact that failures that occur rarely on a
single machine occur all the time on clusters of thousands. It breaks up work into
small tasks and can gracefully accommodate task failures without compromising the
job to which they belong.
Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in a
few important ways. First, rather than relying on a rigid map-then-reduce format, its
engine can execute a more general directed acyclic graph (DAG) of operators. This
means that, in situations where MapReduce must write out intermediate results to the
distributed filesystem, Spark can pass them directly to the next step in the pipeline. In
this way, it is similar to Dryad, a descendant of MapReduce that originated at Microsoft
Research. Spark complements this capability with a rich set of transformations that allow
users to express computation more naturally. It has a strong developer focus and a
streamlined API that can represent complex pipelines in a few lines of code.
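The difference between rigid stage boundaries and a pipelined chain of operators can be suggested with plain Python generators; this is a single-machine analogy for the idea, not Spark’s actual DAG engine, and the word data is invented.

```python
# A MapReduce-style job would materialize (and, across jobs, write to disk)
# the result of each stage; a chained pipeline hands each record straight to
# the next operator instead.
words = ["spark", "mapreduce", "dryad", "spark", "hadoop"]

# Three chained, lazy stages: normalize -> filter -> tag. No intermediate
# collection exists until the final reduction pulls records through.
normalized = (w.upper() for w in words)
short = (w for w in normalized if len(w) <= 6)
tagged = ((w, 1) for w in short)

counts = {}
for word, n in tagged:  # the terminal step drives the whole chain
    counts[word] = counts.get(word, 0) + n
print(counts)  # {'SPARK': 2, 'DRYAD': 1, 'HADOOP': 1}
```

In Spark the same shape is written as chained transformations (`map`, `filter`) ending in an action, and the engine schedules the whole graph at once rather than one stage-pair at a time.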
Finally, Spark extends its predecessors with in-memory processing. Its Resilient Distributed
Dataset (RDD) abstraction enables developers to materialize any point in a
processing pipeline into memory across the cluster, meaning that future steps that want
to deal with the same dataset need not recompute it or reload it from disk. This capability
opens up use cases that distributed processing engines could not previously approach.
Spark is well suited for highly iterative algorithms that require multiple passes over a
dataset, as well as reactive applications that quickly respond to user queries by scanning
large in-memory datasets.
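The payoff of materializing a dataset can be imitated on one machine with plain Python. The sketch below is an analogy for, not an implementation of, what Spark’s `cache()` buys between passes; the “expensive parse” step is invented.

```python
calls = {"n": 0}

def expensive_parse(record):
    calls["n"] += 1            # count how often raw records are reprocessed
    return record * 2

raw = range(5)

def pipeline():
    # A lazy pipeline: every consumer reprocesses the raw data, the way a
    # MapReduce-style job rereads its input for each pass.
    return (expensive_parse(r) for r in raw)

total, biggest = sum(pipeline()), max(pipeline())
uncached_calls = calls["n"]
print(uncached_calls)  # 10: two passes over five records

# "Caching": materialize the parsed dataset once and reuse it, the way
# rdd.cache() lets Spark keep a dataset in cluster memory between passes.
calls["n"] = 0
cached = [expensive_parse(r) for r in raw]
total, biggest = sum(cached), max(cached)
print(calls["n"])  # 5: each record was parsed only once
```

For an algorithm making twenty passes rather than two, the gap between recomputing and reusing is what turns an overnight job into an interactive one.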
Perhaps most importantly, Spark fits well with the aforementioned hard truths of data
science, acknowledging that the biggest bottleneck in building data applications is not
CPU, disk, or network, but analyst productivity. It perhaps cannot be overstated how
much collapsing the full pipeline, from preprocessing to model evaluation, into a single
programming environment can speed up development. By packaging an expressive
programming model with a set of analytic libraries under a REPL, Spark avoids the round
trips to IDEs required by frameworks like MapReduce and the challenges of subsampling
and moving data back and forth from HDFS required by frameworks like R. The
more quickly analysts can experiment with their data, the higher likelihood they have
of doing something useful with it.
With respect to the pertinence of munging and ETL, Spark strives to be something closer
to the Python of big data than the Matlab of big data. As a general purpose computation
engine, its core APIs provide a strong foundation for data transformation independent
of any functionality in statistics, machine learning, or matrix algebra. Its Scala and
Python APIs allow programming in expressive general purpose languages, as well as
access to existing libraries.
Spark’s in-memory caching makes it ideal for iteration at both the micro and macro
levels. Machine learning algorithms that make multiple passes over their training set can
cache it in memory. When exploring and getting a feel for a dataset, a data scientist can
keep it in memory while running queries, and easily cache transformed versions of it
as well without suffering a trip to disk.
Last, Spark spans the gap between systems designed for exploratory analytics and systems
designed for operational analytics. It is often quoted that a data scientist is someone
who is better at engineering than most statisticians and better at statistics than most
engineers. At the very least, Spark is better at being an operational system than most
exploratory systems and better for data exploration than the technologies commonly
used in operational systems. It is built for performance and reliability from the ground
up. Sitting atop the JVM, it can take advantage of many of the operational and debugging
tools built for the Java stack.
Spark boasts strong integration with the variety of tools in the Hadoop ecosystem. It can
read and write data in all of the formats supported by MapReduce, allowing it to
interact with the formats commonly used to store data on Hadoop, like Avro and Parquet
(and good old CSV). It can read from and write to NoSQL databases like HBase and
Cassandra. Its stream processing library, Spark Streaming, can ingest data continuously
from systems like Flume and Kafka. Its SQL library, Spark SQL, can interact with the
Hive Metastore, and a project that is in progress at the time of this writing seeks to
enable Spark to be used as an underlying execution engine for Hive, as an alternative
to MapReduce. It can run inside YARN, Hadoop’s scheduler and resource manager,
allowing it to share cluster resources dynamically and to be managed with the same policies
as other processing engines like MapReduce and Impala.
Of course, Spark isn’t all roses and petunias. While its core engine has progressed in
maturity even during the span of this book being written, it is still young compared to
MapReduce and hasn’t yet surpassed it as the workhorse of batch processing. Its
specialized subcomponents for stream processing, SQL, machine learning, and graph
processing lie at different stages of maturity and are undergoing large API upgrades. For
example, MLlib’s pipelines and transformer API model is in progress while this book is
being written. Its statistics and modeling functionality comes nowhere near that of
single-machine languages like R. Its SQL functionality is rich, but still lags far behind that of
About This Book
The rest of this book is not going to be about Spark’s merits and disadvantages. There
are a few other things that it will not be, either. It will introduce the Spark programming
model and Scala basics, but it will not attempt to be a Spark reference or provide a
comprehensive guide to all its nooks and crannies. It will not try to be a machine learning,
statistics, or linear algebra reference, although many of the chapters will provide
some background on these before using them.
Instead, it will try to help the reader get a feel for what it’s like to use Spark for complex
analytics on large datasets. It will cover the entire pipeline: not just building and evaluating
models, but cleaning, preprocessing, and exploring data, with attention paid to
turning results into production applications. We believe that the best way to teach this
is by example, so, after a quick chapter describing Spark and its ecosystem, the rest of
the chapters will be self-contained illustrations of what it looks like to use Spark for
analyzing data from different domains.
When possible, we will attempt not just to provide a “solution,” but to demonstrate the full data science workflow, with all of its iteration, dead ends, and restarts. This book will be useful for getting more comfortable with Scala, more comfortable with Spark, and more comfortable with machine learning and data analysis. However, these are in service of a larger goal, and we hope that most of all, this book will teach how to approach tasks like those described in the first words of this chapter. Each chapter, in about twenty measly pages, will try to get as close as possible to demonstrating how to build one of these pieces of data applications.
Introduction to Data Analysis with Scala
and Spark
Josh Wills
If you are immune to boredom, there is literally nothing you cannot accomplish.
— David Foster Wallace
Data cleansing is the first step in any data science project, and often the most important. Many clever analyses have been undone because the data analyzed had fundamental quality problems or underlying artifacts that biased the analysis or led the data scientist to see things that weren't really there.
Despite its importance, most textbooks and classes on data science either don't cover data cleansing or give it only a passing mention. The explanation for this is simple: cleaning data is really boring. It is the tedious, dull work that you have to do before you can get to the really cool machine learning algorithm that you've been dying to apply to a new problem. Many new data scientists tend to rush past it, getting their data into a minimally acceptable state, only to discover that the data has major quality issues after they apply their (potentially computationally intensive) algorithm and get a nonsense answer as output.
Everyone has heard the saying “garbage in, garbage out,” but there is something even more pernicious: getting reasonable-looking answers from a reasonable-looking data set that has major (but not obvious at first glance) quality issues. Drawing significant conclusions based on this kind of mistake is the sort of thing that gets data scientists fired.
One of the most important talents that you can develop as a data scientist is the ability
to discover interesting and worthwhile problems in every phase of the data analytics
lifecycle. The more skill and brainpower that you can apply early on in an analysis
project, the stronger your confidence will be in your final product.
Of course, it's easy to say all that; it's the data science equivalent of telling a child to eat their vegetables. It's much more fun to play with a new tool like Spark that lets us build fancy machine learning algorithms, develop streaming data processing engines, and analyze web-scale graphs. So what better way to introduce you to working with data using Spark and Scala than a data cleansing exercise?
Scala for Data Scientists
Most data scientists have a favorite tool, like R or Python, for performing interactive
data munging and analysis. Although they’re willing to work in other environments
when they have to, data scientists tend to get very attached to their favorite tool, and are
always looking to find a way to carry out whatever work they can using it. Introducing
them to a new tool that has a new syntax and a new set of patterns to learn can be
challenging under the best of circumstances.
There are libraries and wrappers for Spark that allow you to use it from R or Python. The Python wrapper, which is called PySpark, is actually quite good, and we'll cover some recipes that involve using it in one of the later chapters in the book. But the vast majority of our recipes will be written in Scala, because we think that learning how to work with Spark in the same language in which the underlying framework is written has a number of advantages for you as a data scientist:
1. No impedance mismatch. Whenever we're running an algorithm in R or Python on top of a JVM-based language like Scala, we have to do some work to pass code and data across the different environments, and oftentimes, things can get lost in translation. When you're writing your data analysis algorithms in Spark with the Scala API, you can be far more confident that your program will run as intended.
2. Get access to the latest and greatest. All of Spark's machine learning, stream processing, and graph analytics libraries are written in Scala, and the Python and R bindings can get support for this new functionality much later. If you want to take advantage of all of the features that Spark has to offer (without waiting for a port to other language bindings), you're going to need to learn at least a little bit of Scala, and if you want to be able to extend those functions to solve new problems you encounter, you'll need to learn a little bit more.
3. It will help you understand the Spark philosophy. Even when using Spark from Python or R, the APIs reflect the underlying philosophy of computation that Spark inherited from the language in which it was developed: Scala. If you know how to use Spark in Scala, even if you primarily use it from other languages, you'll have a better understanding of the system and will be in a better position to “think in Spark.”
There is another advantage to learning how to use Spark from Scala, but it's a bit more difficult to explain because of how different it is from any other data analysis tool. If you've ever analyzed data that you pulled from a database in R or Python, you're used
to working with languages like SQL to retrieve the information you want, and then switching into R or Python to manipulate and visualize the data you've retrieved. You're used to using one language (SQL) for retrieving and manipulating lots of data stored in a remote cluster and another language (Python/R) for manipulating and visualizing information stored on your own machine. If you've been doing it for long enough, you probably don't even think about it anymore.
With Spark and Scala, the experience is different, because you're using the same language for everything. You're writing Scala to retrieve data from the cluster via Spark. You're writing Scala to manipulate that data locally on your own machine. And then (and this is the really neat part) you can send Scala code into the cluster so that you can perform the exact same transformations that you performed locally on data that is still stored in the cluster. It's difficult to express how transformative it is to do all of your data munging and analysis in a single environment, regardless of where the data itself is stored and processed. It's the sort of thing that you have to experience for yourself to understand, and we wanted to be sure that our recipes captured some of that same magic feeling that we felt when we first started using Spark.
The Spark Programming Model
Spark programming starts with a dataset or few, usually residing in some form of distributed, persistent storage like the Hadoop Distributed File System (HDFS). Writing a Spark program typically consists of a few related things:
- Defining a set of transformations on input datasets.
- Invoking actions that output the transformed datasets to persistent storage or return results to the driver's local memory.
- Running local computations that operate on the results computed in a distributed fashion. These can help decide what transformations and actions to undertake next.
Understanding Spark means understanding the intersection between the two sets of
abstractions the framework offers: storage and execution. Spark pairs these abstractions
in an elegant way that essentially allows any intermediate step in a data processing
pipeline to be cached in memory for later use.
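As a rough sketch of how these pieces fit together, the following shell session combines all three activities, assuming the SparkContext sc provided by spark-shell; the HDFS path, the CSV layout, and the 0.9 threshold are hypothetical stand-ins, not the chapter's actual analysis:

```scala
// A minimal sketch of the Spark programming model, assuming the
// SparkContext `sc` provided by spark-shell. The path, the CSV
// layout, and the threshold are hypothetical stand-ins.
val lines = sc.textFile("hdfs:///data/scores.csv")

// Transformations lazily define new RDDs; nothing executes yet.
val scores = lines.map(line => line.split(',')(1).toDouble)
val high = scores.filter(s => s > 0.9)

// Ask Spark to keep this intermediate RDD in memory for reuse.
high.cache()

// Actions trigger distributed execution and return results to the driver.
val count = high.count()
val sample = high.take(10)

// A local computation on the driver, informing what to try next.
println(s"$count high scores; sample mean: ${sample.sum / sample.length}")
```

Note how the cache() call pins the intermediate filtered RDD in memory, so the second action (take) does not recompute the whole pipeline from the raw file.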
Record Linkage
The problem that we're going to study in this chapter goes by a lot of different names in the literature and in practice: entity resolution, record deduplication, merge-and-purge, and list washing. Ironically, this makes it difficult to find all of the research papers on this topic across the literature in order to get a good overview of solution techniques; we need a data scientist to de-duplicate the references to this data cleansing problem!
For our purposes in the rest of this chapter, we're going to refer to this problem as record linkage.
The general structure of the problem is something like this: we have a large collection of records from one or more source systems, and it is likely that some of the records refer to the same underlying entity, such as a customer, a patient, or the location of a business or an event. Each of the entities has a number of attributes, such as a name, an address, or a birthday, and we will need to use these attributes to find the records that refer to the same entity. Unfortunately, the values of these attributes aren't perfect: values might have different formatting, or typos, or missing information, meaning that a simple equality test on the values of the attributes will cause us to miss a significant number of duplicate records. For example, let's compare the following business listings:
Table 2-1. The challenge of record linkage

Name                          Address                   City            State       Phone
Josh's Coffee Shop            1234 Sunset Boulevard     West Hollywood  CA          (213)-555-1212
Josh Cofee                    1234 Sunset Blvd          Hollywood       CA          555-1212
Coffee Chain #1234            1400 Sunset Blvd #2       Hollywood       CA          206-555-1212
Coffee Chain Regional Office  1400 Sunset Blvd Suite 2  Hollywood       California  206-555-1212
The first two entries in this table refer to the same small coffee shop, even though a data entry error makes it look as if they are in two different cities (West Hollywood vs. Hollywood). The second two entries, on the other hand, actually refer to different business locations of the same chain of coffee shops that happen to share a common address: one of the entries refers to an actual coffee shop, and the other one refers to a local corporate office location. Both of the entries give the official phone number of corporate headquarters in Seattle.
This example illustrates everything that makes record linkage so difficult: even though both pairs of entries look similar to each other, the criteria that we use to make the duplicate/not-duplicate decision are different for each pair. This is the kind of distinction that is easy for a human to understand and identify at a glance, but difficult for a computer to learn.
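To make the fuzziness concrete, here is a small illustrative similarity measure in plain Scala: Jaccard similarity over character bigrams. It is a sketch of the kind of per-field score a record linkage system might compute, not the measure used to build the data set we analyze below, and the function names are our own:

```scala
// Jaccard similarity over character bigrams: 1.0 for identical strings,
// near 0.0 for unrelated ones. A hypothetical per-field score of the
// kind record linkage systems compute; not the exact measure used to
// build the data set analyzed in this chapter.
def bigrams(s: String): Set[String] =
  s.toLowerCase.replaceAll("[^a-z0-9]", "").sliding(2).toSet

def similarity(a: String, b: String): Double = {
  val (x, y) = (bigrams(a), bigrams(b))
  if (x.isEmpty && y.isEmpty) 1.0
  else (x & y).size.toDouble / (x | y).size
}

// Exact equality misses the near-duplicate, but the score does not:
val shopPair = similarity("Josh's Coffee Shop", "Josh Cofee")
val chainPair = similarity("Josh's Coffee Shop", "Coffee Chain #1234")
```

Despite the typo and abbreviation, the first two listings from Table 2-1 score noticeably higher against each other than either does against the chain entries, which is exactly the signal an equality test throws away.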
Getting Started: The Spark Shell and SparkContext
We’re going to use a sample data set from the UC Irvine Machine Learning Repository,
which is a fantastic source for a variety of interesting (and free) data sets for research
and education. The data set we’ll be analyzing was curated from a record linkage study
that was performed at a German hospital in 2010, and it contains several million pairs
of patient records that were matched according to several different criteria, such as the
patient’s name (first and last), their address, and their birthday. Each matching field was
assigned a numerical score from 0.0 to 1.0 based on how similar the strings were, and
the data was then hand labeled to identify which pairs represented the same person and
which did not. The underlying values of the fields themselves that were used to create
the data set were removed to protect the privacy of the patients, and numerical identifiers, the match scores for the fields, and the label for each pair (match vs. non-match)
were published for use in record linkage research.
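As a preview of the parsing we will do with this data, a single line can be mapped onto a small case class in plain Scala. The column layout below (two identifiers, a handful of scores, a label) is a simplified assumption for illustration; the real records carry more score fields, and missing scores appear as "?" in the raw data:

```scala
// A sketch of parsing one line of the linkage data. The column layout
// here (two ids, four scores, a label) is a simplified assumption;
// the real records carry more score fields. "?" marks a missing score.
case class MatchRecord(id1: Int, id2: Int, scores: Array[Double], matched: Boolean)

def parse(line: String): MatchRecord = {
  val pieces = line.split(',')
  val id1 = pieces(0).toInt
  val id2 = pieces(1).toInt
  // Everything between the identifiers and the label is a match score.
  val scores = pieces.slice(2, pieces.length - 1)
    .map(s => if (s == "?") Double.NaN else s.toDouble)
  val matched = pieces.last.toBoolean
  MatchRecord(id1, id2, scores, matched)
}

val rec = parse("39086,47614,1.0,?,1.0,1.0,TRUE")
```

Representing missing scores as NaN keeps the scores field a uniform Array[Double], at the cost of needing NaN-aware logic when we later aggregate over the fields.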
From the shell, let's pull the data from the repository:
$ mkdir linkage
$ cd linkage/
$ curl -o
$ unzip
$ unzip 'block*.zip'
If you have a Hadoop cluster handy, you can create a directory for the block data in
HDFS and copy the files from the data set there:
$ hadoop fs -mkdir linkage
$ hadoop fs -put block*csv linkage
Now we're ready to launch the spark-shell, which is a REPL (read-eval-print loop) for the Scala language that also has some Spark-specific extensions. If you've never seen the term REPL before, you can think of it as something similar to the R environment: it's a place where you can define functions and manipulate data in the Scala programming language.
If you have a Hadoop cluster that runs a version of Hadoop that supports YARN, you can launch the Spark jobs on the cluster by using the value of yarn-client for the Spark master:
$ spark-shell --master yarn-client
However, if you're just running these examples on your personal computer, you can launch a local Spark cluster by specifying local[N], where N is the number of threads to run, or * to match the number of cores available on your machine. For example, to launch a local cluster that uses 8 threads on an 8-core machine:
$ spark-shell --master local[*]
The examples will work the same way locally. You will simply pass paths to local files,
rather than paths on HDFS beginning with hdfs://.
The rest of the examples in this book will not show a --master argument to spark-shell, but you will typically need to specify this argument as appropriate for your environment.
You may need to specify additional arguments to make the Spark shell fully utilize your resources. For example, when running with a local master, you can use --driver-
memory 2g to let the single local process use 2 gigabytes of memory. YARN memory configuration is more complex, and relevant options like --executor-memory are explained in the Spark on YARN documentation.
After running one of these commands, you will see a lot of log messages from Spark as
it initializes itself, but you should also see a bit of ASCII art, followed by some additional
log messages and a prompt:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.2.0
Using Scala version 2.10.4
(Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
If this is your first time using the Spark shell (or any Scala REPL, for that matter), you should run the :help command to list available commands in the shell. :history and :h? can be helpful for finding the names that you gave to variables or functions that you wrote during a session but can't seem to find at the moment. :paste can help you correctly insert code from the clipboard, something you may well want to do while following along with the book and its accompanying source code.
In addition to the note about :help, the Spark log messages indicated that “Spark context available as sc.” This is a reference to the SparkContext, which coordinates the execution of Spark jobs on the cluster. Go ahead and type sc at the command line:
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@DEADBEEF
The REPL will print the string form of the object, and for the SparkContext object, this is simply its name plus the hexadecimal address of the object in memory. (DEADBEEF is a placeholder; the exact value you see here will vary from run to run.)
It’s good that the sc variable exists, but what exactly do we do with it? SparkContext is
an object, and as an object, it has methods associated with it. We can see what those
methods are in the Scala REPL by typing the name of a variable, followed by a period,
followed by tab:
accumulable accumulableCollection
accumulator addFile
addJar addSparkListener
appName asInstanceOf
broadcast cancelAllJobs
cancelJobGroup clearCallSite
clearFiles clearJars
clearJobGroup defaultMinPartitions
defaultMinSplits defaultParallelism
emptyRDD files
getAllPools getCheckpointDir
getConf getExecutorMemoryStatus
getExecutorStorageStatus getLocalProperty
getPersistentRDDs getPoolForName
getRDDStorageInfo getSchedulingMode
hadoopConfiguration hadoopFile
hadoopRDD initLocalProperties
isInstanceOf isLocal
jars makeRDD
master newAPIHadoopFile
newAPIHadoopRDD objectFile
parallelize runApproximateJob
runJob sequenceFile
setCallSite setCheckpointDir
setJobDescription setJobGroup
startTime stop
submitJob tachyonFolderName
textFile toString
union version
The SparkContext has a long list of methods, but the ones that we're going to use most often allow us to create Resilient Distributed Datasets, or RDDs. An RDD is Spark's fundamental abstraction for representing a collection of objects that can be distributed across multiple machines in a cluster. There are two ways to create an RDD in Spark:
- Using the SparkContext to create an RDD from an external data source, like a file in HDFS, a database table via JDBC, or a local collection of objects that we create in the Spark shell.
- Performing a transformation on one or more existing RDDs, like filtering records, aggregating records by a common key, or joining multiple RDDs together.
RDDs are a convenient way to describe the computations that we want to perform on
our data as a sequence of small, independent steps.
Resilient Distributed Datasets
An RDD is laid out across the cluster of machines as a collection of partitions, each
including a subset of the data. Partitions define the unit of parallelism in Spark. The
framework processes the objects within a partition in sequence, and processes multiple
partitions in parallel. One of the simplest ways to create an RDD is to use the parallelize method on SparkContext with a local collection of objects:
val rdd = sc.parallelize(Array(1, 2, 2, 4), 4)
rdd: org.apache.spark.rdd.RDD[Int] = ...
The first argument is the collection of objects to parallelize. The second is the number
of partitions. When the time comes to compute the objects within a partition, Spark
fetches a subset of the collection from the driver process.
To create an RDD from a text file or directory of text files residing in a distributed file
system like HDFS, we can pass the name of the file or directory to the textFile method:
val rdd2 = sc.textFile("hdfs:///some/path.txt")
rdd2: org.apache.spark.rdd.RDD[String] = ...
When running Spark in local mode, textFile can access paths that reside on the local
filesystem. If Spark is given a directory instead of an individual file, it will consider all
of the files in that directory as part of the given RDD. Finally, note that no actual data
has been read by Spark or loaded into memory yet, either on our client machine or the
cluster. When the time comes to compute the objects within a partition, Spark reads a
section (also known as a split) of the input file, and then applies any subsequent trans‐
formations (filtering, aggregation, etc.) that we defined via other RDDs.
Our record linkage data is stored in a text file, with one observation on each line. We will use the textFile method on SparkContext to get a reference to this data as an RDD of strings:
val rawblocks = sc.textFile("linkage")
rawblocks: org.apache.spark.rdd.RDD[String] = ...
There are a few things happening on this line that are worth going over. First, we're declaring a new variable called rawblocks. As we can see from the shell, the rawblocks variable has a type of RDD[String], even though we never specified that type information in our variable declaration. This is a feature of the Scala programming language called type inference, and it saves us a lot of typing when we're working with the language. Whenever possible, Scala figures out what type a variable has based on its context. In this case, Scala looks up the return type from the textFile function on the SparkContext object, sees that it returns an RDD[String], and assigns that type to the rawblocks variable.
Whenever we create a new variable in Scala, we must preface the name of the variable
with either val or var. Variables that are prefaced with val are immutable, and may not
be changed to refer to another value once they are assigned, while variables that are
prefaced with var may be changed to refer to different objects of the same type. Watch
what happens when we execute the following code:
rawblocks = sc.textFile("linkage")
<console>: error: reassignment to val
var varblocks = sc.textFile("linkage")
varblocks = sc.textFile("linkage")
Attempting to re-assign the linkage data to the rawblocks val threw an error, but re-assigning the varblocks var is fine. Within the Scala REPL, there is an exception to the rule against re-assigning vals: we are allowed to re-declare the same immutable variable, like the following:
val rawblocks = sc.textFile("linakge")
val rawblocks = sc.textFile("linkage")
In this case, no error is thrown on the second declaration of rawblocks. This isn't typically allowed in normal Scala code, but it's fine to do in the shell, and we will make extensive use of this feature throughout the recipes in the book.
The REPL and Compilation
In addition to its interactive shell, Spark also supports compiled applications. We typically recommend using Maven for compiling and managing dependencies. The code samples included with this book hold a self-contained Maven project setup under the simplesparkproject/ directory to help with getting started.
With both the shell and compilation as options, which should one use when testing out
and building a data pipeline? It is often useful to start working entirely in the REPL. This
enables quick prototyping, faster iteration, and less lag time between ideas and results.
However, as the program builds in size, maintaining a monolithic file of code becomes more onerous, and Scala interpretation eats up more time. This can be exacerbated by the fact that, when dealing with massive data, it is not uncommon for an attempted operation to cause a Spark application to crash or otherwise render a SparkContext unusable, meaning that any work and code typed in so far is lost. At this point, it is often useful to take a hybrid approach. Keep the frontier of development in the
REPL, and, as pieces of code harden, move them over into a compiled library. The
compiled jar can be made available to spark-shell by passing it to the --jars property.
When done right, the compiled jar only needs to be rebuilt infrequently, and the REPL
allows for fast iteration on code and approaches that still need ironing out.