CHAPTER 1 • INTRODUCTION: WHY LOOK BEYOND HADOOP MAP-REDUCE? 11
in the financial domain and found it to be scalable, though not without
issues. (I had to tweak the source significantly.) One observation
about Mahout is that it implements only a small subset of ML algorithms
over Hadoop: only 25 algorithms are of production quality,
with only 8 or 9 usable over Hadoop, meaning scalable over large data
sets. These include linear regression, linear SVM, K-means
clustering, and so forth. It does provide a fast sequential implementation
of logistic regression, with parallelized training. However, as
several others have also noted (see Quora.com, for instance), it does
not have implementations of nonlinear SVMs or of multivariate logistic
regression (otherwise known as the discrete choice model).
Overall, this book is not intended to bash Mahout. My point, rather,
is that it is quite hard to implement certain ML algorithms, including
the kernel SVM and CGD (note that Mahout does have an implementation
of stochastic gradient descent), over Hadoop. This has been
pointed out by several others as well—for instance, see the paper by
Professor Srirama (Srirama et al. 2012). This paper makes detailed
comparisons between Hadoop and Twister MR (Ekanayake et al.
2010) with regard to iterative algorithms such as CGD and shows
that the overheads can be significant for Hadoop. What do I mean by
iterative? A computation in which a set of entities perform some work,
wait for results from their neighbors or other entities, and then start
the next iteration.
CGD is a perfect example of an iterative ML algorithm: each iteration
can be broken down into the primitives daxpy, ddot, and matmul.
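To make these primitives concrete, here is a minimal plain-Python sketch (my own illustration, not Mahout code) of daxpy, ddot, and matmul, plus a conjugate gradient solver built from them. Counting the primitive calls inside the loop (one matmul, two ddots, three daxpys) shows where the six jobs per iteration come from once each primitive call becomes an MR job.

```python
# Illustrative sketch only: the three BLAS-style primitives behind CGD,
# and a conjugate gradient solver composed from them.

def daxpy(k, x, y):
    """Return k*x + y, elementwise."""
    return [k * xi + yi for xi, yi in zip(x, y)]

def ddot(x, y):
    """Return the dot product of vectors x and y."""
    return sum(xi * yi for xi, yi in zip(x, y))

def matmul(A, x):
    """Return the matrix-vector product A x."""
    return [ddot(row, x) for row in A]

def cg_solve(A, b, iters=25, tol=1e-12):
    """Solve A x = b for a symmetric positive-definite A."""
    n = len(b)
    x = [0.0] * n
    r = list(b)            # residual b - A x (x starts at zero)
    p = list(r)            # search direction
    rs_old = ddot(r, r)
    for _ in range(iters):
        q = matmul(A, p)                  # 1 matmul
        alpha = rs_old / ddot(p, q)       # 1 ddot
        x = daxpy(alpha, p, x)            # 1 daxpy
        r = daxpy(-alpha, q, r)           # 1 daxpy
        rs_new = ddot(r, r)               # 1 ddot
        if rs_new < tol:
            break
        p = daxpy(rs_new / rs_old, p, r)  # 1 daxpy (r + beta*p)
        rs_old = rs_new
    return x

# Tiny demo: the exact solution of this system is [1/11, 7/11].
print(cg_solve([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

On a single machine these calls are trivially cheap; the trouble described in this chapter arises when each of the six calls per iteration is executed as its own Hadoop job, with HDFS reads and job setup repeated every time.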
I will explain these three primitives: daxpy is an operation that takes
a vector x, multiplies it by a constant k, and adds another vector y
to it; ddot computes the dot product of two vectors x and y; matmul
multiplies a matrix by a vector and produces a vector output. This
means one MR job per primitive invocation, and a single CG iteration
uses six of them (one matmul, two ddots, and three daxpys), leading to
6 MRs per iteration and eventually hundreds of MRs per CG computation,
as well as a few gigabytes (GB) of communication even for small
matrices. In essence, the setup cost per iteration (which includes
reading from HDFS into