Apache Mahout has just released the long awaited 0.13.0 which introduces modular native solvers (e.g. GPU support!).
TensorFlow has done a great job driving the conversation around bringing GPU accelerated linear algebra to the masses for implementing custom algorithms, however it has a major draw back that it prefers to manage its own cluster or can work on top of Apache Spark. Mahout on the other hand was designed to work on top of Spark or with any user defined distributed back-end (it just so happens Spark is the generic one we recommend for people trying out Mahout- but if you happen to use Spark great, you’re all set!)
Mahout for Hybrid Spark-GPU clusters
In Mahout, we strive to make everything modular, use something off the shelf or write your own that is optimized to you environment. In 0.13.0 Mahout has introduced a concept of modular native solvers. The native solver is a what gets called when you do any BLAS Like operations (e.g. matrix-matrix multiplication, matrix-vector multiplication, etc.). Many machine learning packages such as R, Python’s sklearn, and Spark’s MLLib utilize Fortran based solvers for speeding up these types of operations. The two native solvers currently included in Mahout are based on ViennaCL, which allow the user to utilize GPUs (via the
org.apache.mahout:mahout-native-viennacl_2.10:0.13.0 artifact) or CPUs (via the
A quick review on how Mahout works under the hood
To understand how this gets used in practice, let’s review how Mahout does math on a Distributed Row Matrix (DRM). A DRM is really a wrapper for an underlying data structure, for example in Spark a
RDD[org.apache.mahout.math.Vector]. When we do some math in Mahout on a DRM, for example
drmA %*% incoreV
We are taking a DRM (which is wrapping an RDD) and taking the dot product of each row. In Spark for instance, each executor is multiplying every row of each partition it has locally by a vector (taking the dot product), this happens in the JVM.
When a native solver such as ViennaCL is in use, each executor attempts to use the ViennaCL library to dump the data out of the JVM (where the exectutor is running), utilize a GPU if one exists, load the data into the GPU and execute the operation there. (GPUs are super charged for BLAS operations like this).
This works in a similar way using ViennaCL-OMP, the difference is that all available CPUs are used.
On a Spark-GPU hybrid cluster
Supposing then you have a Spark cluster, and on each node there are GPUs available- then what you now have is statistics/machine learning on Spark which is GPU accelerated. If you have some other distributed engine that executed in the JVM (such as Apache Flink) then you also get GPU acceleration there!
If you don’t have GPUs on your Spark Cluster, you can also use the ViennaCL-OMP package to see significant performance gains. It should be noted though, the optimal ‘tuning’ of one’s Spark jobs changes when using OMP. Since ViennaCL-OMP will utilize all available CPUs, it is advised to set the partitioning of the DRM to be equal to the number of nodes you have. This is because they way Spark normally paralellizes jobs is by using one core per partition to do work- however, since OMP will invoke all cores, you don’t gain anything by ‘over parallelizing’ your job. (This assumes that setting partitions = number of executors will still allow your full data set to fit in memory).
Other cool stuff
In addition to the super-awesome GPU stuff, there are also a couple of other fun little additions.
Currently this is very sparse, but represents Mahout moving toward offering a large collection of contributed algorithms via a consistent API similar to Python’s sklearn.
Consider classic OLS.
val drmData = drmParallelize(dense( (2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios (1, 2, 12, 12, 18.042851), // Cap'n'Crunch (1, 1, 12, 13, 22.736446), // Cocoa Puffs (2, 1, 11, 13, 32.207582), // Froot Loops (1, 2, 12, 11, 21.871292), // Honey Graham Ohs (2, 1, 16, 8, 36.187559), // Wheaties Honey Gold (6, 2, 17, 1, 50.764999), // Cheerios (3, 2, 13, 7, 40.400208), // Clusters (3, 3, 13, 4, 45.811716)), numPartitions = 2) val drmX = drmData(::, 0 until 4) val drmY = drmData(::, 4 until 5) import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares var model = new OrdinaryLeastSquares[Int]() model.fit(drmX, drmY) model.summary
res3: String = Coef. Estimate Std. Error t-score Pr(Beta=0) X0 -1.336265388326865 2.6878127323908942 -0.49715717625097144 1.354836239139669 X1 -13.157701320678825 5.393984138816236 -2.4393288860442244 1.9287400286811958 X2 -4.152654199019935 1.7849055635870432 -2.326540005105108 1.9194430753543341 X3 -5.67990809423236 1.886871957793384 -3.0102244462177334 1.960458377021527 (Intercept) 163.17932687840948 51.91529676169986 3.143183937239744 0.03474107366050938
Feels just like R, with the one caveat that X0 is not the intercept- a slight cosmetic issue for the time being.
The following algorithms are included as of 0.13.0:
– AsFactor: Given a column of integers, returns a sparse matrix (also sometimes called “OneHot” encoding).
– MeanCenter: Centers a column on its mean
– StandardScaler: Scales a column to mean= 0 and unit variance.
– OrdinaryLeastSquares: Closed form ordinary least squares regression
– CochraneOrcutt: When serial correlation in the error terms is present the test statistics are biased- this procedure attempts to ‘fix’ the estimates
– Coefficent of Determination: Also known as R-Squared
– DurbinWatson: Tests for presence of serial correlation
This is a small package that adds Scala-like methods to
org.apache.mahout.math.Vector. Specifically it adds the
While these may seem trivial, I for one have found them extremely convenient (which is why I contributed them 🙂 )
Spark DataFrame / MLLib convenience wrappers.
Spark can be a really useful tool for getting your data ready for Mahout- however there was a notable lack of methods for easily working with DataFrames or RDDs of org.apache.spark.mllib.linalg.Vector (the kinds of RDD that MLLib likes).
For convenience, the following were added:
val myRDD: RDD[org.apache.spark.mllib.regression.LabeledPoint] = ... val myDRM = drmWrapMLLibLabeledPoint(myRDD) // myDRM is a DRM where the 'label' is the last column
val myRDD: RDD[org.apache.spark.mllib.linalg.Vector] = ... val myDRM = drmWrapMLLibVector(myRDD)
val myDF = ... // a dataframe where all of the values are Doubles val myDRM = drmWrapDataFrame(myDF)
This was a really hard fought release, and an extra special thanks to Andrew Palumbo and Andrew Musselman. If you see them at a bar, buy them a beer; if you see them at a taco stand, buy them some extra guac.