org.apache.spark.mllib
DataFrame: Spark ML uses the DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
Transformer: A Transformer is an algorithm that transforms one DataFrame into another. For example, an ML model is a Transformer that turns a DataFrame with features into a DataFrame with predictions.
Estimator: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model.
Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
Parameter: All Transformers and Estimators now share a common API for specifying parameters.
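The way these concepts fit together can be sketched with a small text-classification pipeline. This is a minimal sketch, not taken from these notes: it assumes a Spark 1.x `spark-shell` session where `sqlContext` is predefined, and the toy data, column names, and parameter values are made up for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Hypothetical toy data: (id, text, label). Assumes `sqlContext` from spark-shell.
val docs = sqlContext.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Two Transformers (Tokenizer, HashingTF) followed by one Estimator (LogisticRegression).
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val pipelineLr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// The Pipeline is itself an Estimator: fit() returns a PipelineModel, which is a Transformer.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, pipelineLr))
val pipelineModel = pipeline.fit(docs)

// Apply the fitted pipeline to unlabeled documents.
val unlabeled = sqlContext.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n")
)).toDF("id", "text")
val predictions = pipelineModel.transform(unlabeled)
predictions.select("id", "text", "prediction").show()
```

Fitting the Pipeline runs the stages in order, so the same tokenization and hashing are applied at training time and at prediction time.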
Classification
  Logistic regression
  Decision tree classifier
  Random forest classifier
  Gradient-boosted tree classifier
  Multilayer perceptron classifier
  One-vs-Rest classifier (a.k.a. One-vs-All)
Regression
  Linear regression
  Decision tree regression
  Random forest regression
  Gradient-boosted tree regression
  Survival regression
Decision trees
Tree Ensembles
  Random Forests
  Gradient-Boosted Trees (GBTs)
K-means
Latent Dirichlet allocation (LDA)
MLlib Data Types
Local vector
Labeled point
Local matrix
Distributed matrix
  RowMatrix
  IndexedRowMatrix
  CoordinateMatrix
  BlockMatrix
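A short sketch of how the basic data types listed above are constructed may help. It assumes a Spark 1.x `spark-shell` session where `sc` is predefined; the values are arbitrary and chosen only for illustration.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.regression.LabeledPoint

// Local vector: the same vector (1.0, 0.0, 3.0) in dense and sparse form.
val denseVec: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sparseVec: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Labeled point: a label paired with a feature vector.
val positive = LabeledPoint(1.0, denseVec)

// Local matrix: 3 rows x 2 columns, values given in column-major order.
val localMat: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

// Distributed matrix: a RowMatrix backed by an RDD of local vectors.
// Assumes `sc` (the SparkContext) from spark-shell.
val rowMat = new RowMatrix(sc.parallelize(Seq(denseVec, sparseVec)))
println(s"${rowMat.numRows()} x ${rowMat.numCols()}")
```

The sparse form stores only the non-zero entries (indices and values), which matters when feature vectors are mostly zeros.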
MLlib Classification and Regression
Binary Classification: linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes
Multiclass Classification: logistic regression, decision trees, random forests, naive Bayes
Regression: linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression
MLlib Clustering
K-means
Gaussian mixture
Power iteration clustering (PIC, mostly used for image recognition)
Latent Dirichlet allocation (LDA, mostly used for topic classification)
Bisecting k-means
Streaming k-means
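As an illustration of the first entry in the list, here is a minimal K-means sketch using the RDD-based MLlib API. It assumes a Spark 1.x `spark-shell` session where `sc` is predefined; the six points and the choice of k = 2 are made up for the example.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical toy data: two well-separated groups of 2-D points.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.0),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2), Vectors.dense(9.2, 9.0)
)).cache()

// Train with k = 2 clusters and at most 20 iterations.
val kmeansModel = KMeans.train(points, 2, 20)
kmeansModel.clusterCenters.foreach(println)

// Within-set sum of squared errors: lower means tighter clusters.
val wssse = kmeansModel.computeCost(points)
println(s"WSSSE = $wssse")

// Assign new points to the nearest learned center.
val nearOrigin = kmeansModel.predict(Vectors.dense(0.0, 0.1))
val farAway = kmeansModel.predict(Vectors.dense(9.0, 9.1))
```

Cluster indices are arbitrary, so only whether two points share an index is meaningful, not the index value itself.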
MLlib Models
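import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

// Training data as (label, features) tuples.
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// LogisticRegression is an Estimator; print its parameters and their documentation.
val lr = new LogisticRegression()
println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

// Set parameters with setter methods, then fit. The result is a model, i.e. a Transformer.
lr.setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(training)
println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

// Alternatively, specify parameters with a ParamMap.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30) // overwrites the maxIter set above
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)

// ParamMaps can be combined; this one renames the probability output column.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")
val paramMapCombined = paramMap ++ paramMap2

// Parameters passed to fit() override those set earlier through the setters.
val model2 = lr.fit(training, paramMapCombined)
println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

// Test data.
val test = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Predict with model2; the probability column is "myProbability" because of paramMap2.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }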