Lucky Number 0.13.0

Apache Mahout has just released the long-awaited 0.13.0, which introduces modular native solvers (and with them, GPU support!).

TensorFlow has done a great job of driving the conversation around bringing GPU-accelerated linear algebra to the masses for implementing custom algorithms. However, it has a major drawback: it prefers to manage its own cluster (though it can be made to work on top of Apache Spark). Mahout, on the other hand, was designed to work on top of Spark or any user-defined distributed back-end (Spark just happens to be the generic one we recommend for people trying out Mahout, so if you already use Spark, great, you're all set!).

Mahout for Hybrid Spark-GPU clusters

In Mahout, we strive to make everything modular: use something off the shelf, or write your own that is optimized for your environment. In 0.13.0 Mahout has introduced the concept of modular native solvers. The native solver is what gets called when you do any BLAS-like operation (e.g. matrix-matrix multiplication, matrix-vector multiplication, etc.). Many machine learning packages such as R, Python’s sklearn, and Spark’s MLLib utilize Fortran-based solvers to speed up these kinds of operations. The two native solvers currently included in Mahout are based on ViennaCL and let the user utilize GPUs (via the org.apache.mahout:mahout-native-viennacl_2.10:0.13.0 artifact) or CPUs (via the org.apache.mahout:mahout-native-viennacl-omp_2.10:0.13.0 artifact).
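To pull one of the native solvers into your own project, you might add something like the following to your build. This is a minimal sketch: the mahout-spark dependency and the Scala version suffix are assumptions, so adjust them to match your setup.

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.mahout" % "mahout-spark_2.10" % "0.13.0",             // assumed: Mahout's Spark bindings
  "org.apache.mahout" % "mahout-native-viennacl_2.10" % "0.13.0"    // GPU solver
  // or: "org.apache.mahout" % "mahout-native-viennacl-omp_2.10" % "0.13.0"  // CPU (OpenMP) solver
)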

A quick review of how Mahout works under the hood

To understand how this gets used in practice, let’s review how Mahout does math on a Distributed Row Matrix (DRM). A DRM is really a wrapper around an underlying data structure, for example, in Spark, an RDD[org.apache.mahout.math.Vector]. When we do some math in Mahout on a DRM, for example

drmA %*% incoreV

we are taking a DRM (which wraps an RDD) and taking the dot product of each of its rows with the vector. In Spark, for instance, each executor multiplies every row of each partition it holds locally by the vector (taking the dot product), and this happens in the JVM.
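Concretely, a minimal end-to-end sketch of the operation above looks like this (assuming the Mahout Spark shell, or any session that provides an implicit distributed context; the values are made up for illustration):

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// a small 3 x 2 DRM, distributed over 2 partitions
val drmA = drmParallelize(dense(
  (1.0, 2.0),
  (3.0, 4.0),
  (5.0, 6.0)), numPartitions = 2)

// an in-core org.apache.mahout.math.Vector
val incoreV = dvec(0.5, 2.0)

// each executor dots the rows of its local partitions with incoreV, inside the JVM
val drmAv = drmA %*% incoreV

// bring the 3 x 1 result back in-core
val result = drmAv.collect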

When a native solver such as ViennaCL is in use, each executor uses the ViennaCL library to move the data out of the JVM (where the executor is running), and, if a GPU exists, loads the data onto the GPU and executes the operation there. (GPUs are supercharged for BLAS operations like this.)

This works in a similar way with ViennaCL-OMP; the difference is that all available CPUs are used instead.

On a Spark-GPU hybrid cluster

Suppose, then, that you have a Spark cluster and GPUs are available on each node: what you now have is GPU-accelerated statistics/machine learning on Spark. If you have some other distributed engine that executes in the JVM (such as Apache Flink), then you get GPU acceleration there too!

If you don’t have GPUs on your Spark cluster, you can still use the ViennaCL-OMP package and see significant performance gains. It should be noted, though, that the optimal ‘tuning’ of one’s Spark jobs changes when using OMP. Since ViennaCL-OMP will utilize all available CPUs, it is advised to set the partitioning of the DRM equal to the number of nodes you have, as sketched below. This is because the way Spark normally parallelizes jobs is by using one core per partition to do the work; however, since OMP will invoke all cores on a node, you don’t gain anything by ‘over-parallelizing’ your job. (This assumes that setting partitions = number of executors will still allow your full data set to fit in memory.)
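For instance, on a hypothetical four-node cluster with one executor per node, you might create the DRM with one partition per node (numNodes and the data here are placeholders for illustration):

val numNodes = 4   // assumption: one executor per node; OMP will already use every core on it

// one partition per node rather than one per core
val drmA = drmParallelize(dense(
  (1.0, 2.0, 3.0),
  (4.0, 5.0, 6.0),
  (7.0, 8.0, 9.0),
  (10.0, 11.0, 12.0)), numPartitions = numNodes)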

Other cool stuff

In addition to the super-awesome GPU stuff, there are also a couple of other fun little additions.

Algorithms Framework

Currently this framework is fairly sparse, but it represents Mahout moving toward offering a large collection of contributed algorithms behind a consistent API, similar to Python’s sklearn.

Consider classic OLS.

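// Assumes the Mahout Spark shell (bin/mahout spark-shell), where the Samsara DSL
// imports (org.apache.mahout.math.scalabindings._, org.apache.mahout.math.drm._, etc.)
// and an implicit distributed context are already in scope.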
val drmData = drmParallelize(dense(
  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
  (2, 1, 11,   13, 32.207582),  // Froot Loops
  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
  (6, 2, 17,   1,  50.764999),  // Cheerios
  (3, 2, 13,   7,  40.400208),  // Clusters
  (3, 3, 13,   4,  45.811716)), numPartitions = 2)


val drmX = drmData(::, 0 until 4)
val drmY = drmData(::, 4 until 5)

import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares

val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY)
model.summary

Which returns:

res3: String =
Coef.        Estimate             Std. Error           t-score                Pr(Beta=0)
X0           -1.336265388326865   2.6878127323908942   -0.49715717625097144   1.354836239139669
X1           -13.157701320678825  5.393984138816236    -2.4393288860442244    1.9287400286811958
X2           -4.152654199019935   1.7849055635870432   -2.326540005105108     1.9194430753543341
X3           -5.67990809423236    1.886871957793384    -3.0102244462177334    1.960458377021527
(Intercept)  163.17932687840948   51.91529676169986    3.143183937239744      0.03474107366050938

Feels just like R, with the one caveat that X0 is not the intercept, a slight cosmetic issue for the time being.

The following algorithms are included as of 0.13.0:

Preprocessing
– AsFactor: Given a column of integers, returns a sparse matrix (also sometimes called “OneHot” encoding).
– MeanCenter: Centers a column on its mean.
– StandardScaler: Scales a column to mean = 0 and unit variance.

Regression
– OrdinaryLeastSquares: Closed-form ordinary least squares regression.

Remedial Measures
– CochraneOrcutt: When serial correlation is present in the error terms, the test statistics are biased; this procedure attempts to ‘fix’ the estimates.

Tests
– Coefficient of Determination: Also known as R-squared.
– MeanSquareError
– DurbinWatson: Tests for the presence of serial correlation.

MahoutCollections

This is a small package that adds Scala-like methods to org.apache.mahout.math.Vector, specifically the toArray and toMap methods.

While these may seem trivial, I for one have found them extremely convenient (which is why I contributed them 🙂 )
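A minimal sketch of how they get used (the package containing the MahoutCollections implicits is my assumption here, so check it against your Mahout build):

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.MahoutCollections._   // assumed location of the implicits

val v = dvec(1.0, 2.0, 3.0)              // an org.apache.mahout.math.Vector
val asArray: Array[Double] = v.toArray   // plain Scala Array
val asMap = v.toMap                      // index -> value map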

Spark DataFrame / MLLib convenience wrappers

Spark can be a really useful tool for getting your data ready for Mahout; however, there was a notable lack of methods for easily creating DRMs from DataFrames or from RDDs of org.apache.spark.mllib.linalg.Vector (the kind of RDD that MLLib likes).

For convenience, the following were added:
– drmWrapMLLibLabeledPoint
– drmWrapDataFrame
– drmWrapMLLibVector

drmWrapMLLibLabeledPoint

val myRDD: RDD[org.apache.spark.mllib.regression.LabeledPoint] = ...
val myDRM = drmWrapMLLibLabeledPoint(myRDD)
// myDRM is a DRM where the 'label' is the last column

drmWrapMLLibVector

val myRDD: RDD[org.apache.spark.mllib.linalg.Vector] = ...
val myDRM = drmWrapMLLibVector(myRDD)

drmWrapDataFrame

val myDF = ... // a dataframe where all of the values are Doubles
val myDRM = drmWrapDataFrame(myDF)

Conclusions

This was a really hard-fought release, and an extra special thanks goes to Andrew Palumbo and Andrew Musselman. If you see them at a bar, buy them a beer; if you see them at a taco stand, buy them some extra guac.
