import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Top 20 singular values/vectors of a distributed RowMatrix.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = matrix.computeSVD(20, computeU = true)
val U: RowMatrix = svd.U // left singular vectors (distributed)
val s: Vector = svd.s    // singular values, in descending order
val V: Matrix = svd.V    // right singular vectors (local dense matrix)
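A common next step is dimensionality reduction: project the original rows onto the top singular vectors. A minimal sketch, assuming `matrix` is the same RowMatrix used above:

```scala
// Project each row onto the 20 right singular vectors, yielding a
// 20-dimensional representation of every row.
val projected: RowMatrix = matrix.multiply(V)
```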
// A Pipeline chains transformers and an estimator (stages defined below).
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTermFreq, logisticRegression))
// `~==` (from Spark's TestingUtils) checks the fitted intercept against a
// reference value computed in R (hence `interceptR`) to a relative tolerance of 1e-3.
assert(model.intercept ~== interceptR relTol 1E-3)
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Dense and sparse encodings of the same vector (1.0, 0.0, 3.0).
val denseVec: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sparseVec: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// A LabeledPoint pairs a label with a feature vector.
val denseLP = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val sparseLP = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
// Local matrices are column-major; sparse ones use CSC storage
// (column pointers of length numCols + 1, row indices, values).
val denseMat: Matrix = Matrices.dense(3, 2,
  Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
val sparseMat: Matrix = Matrices.sparse(3, 2,
  Array(0, 1, 2), Array(0, 1), Array(2.3, -1.0))
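Both encodings materialize to the same values, which makes for an easy sanity check; a small sketch using the vectors above:

```scala
// A sparse vector expands to the same dense array as its dense twin.
assert(sparseVec.toArray.sameElements(denseVec.toArray))
```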
| Structure | RDD Type | Uses |
|---|---|---|
| BlockMatrix | ((Int, Int), Matrix) | Intuitively similar to block-parallel computation with MPI; e.g. phase-space multi-grid discretization, iterative approximations. |
| RowMatrix | (Vector) | When row order doesn't matter and rows need no index; the matrix is simply a collection of Vectors. |
| IndexedRowMatrix | (Long, Vector) | When row order does matter, e.g. multivariate time-series aggregation. |
| CoordinateMatrix | (Long, Long, Double) | High-dimensional, sparse data. |
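These types convert into one another, which is often the easiest way to build the denser structures from sparse entries. A minimal sketch, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Build a sparse 3x3 matrix from (row, col, value) entries...
val entries = sc.parallelize(Seq(
  MatrixEntry(0L, 0L, 1.0), MatrixEntry(1L, 2L, 2.0), MatrixEntry(2L, 1L, 3.0)))
val coordMat = new CoordinateMatrix(entries)

// ...then convert to whichever representation the algorithm at hand expects.
val rowMat = coordMat.toRowMatrix()
val indexedMat = coordMat.toIndexedRowMatrix()
val blockMat = coordMat.toBlockMatrix()
```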
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

case class Document(id: Long, text: String)

// Stage 1: split raw text into words.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
// Stage 2: hash the words into a fixed-size term-frequency feature vector.
val hashingTF = new HashingTF().setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
// Stage 3: a logistic regression classifier.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
// Chain the three stages into a single Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
// `training` (elided here) is the labeled training set; `toDF` assumes
// a SQLContext with implicits in scope.
val model = pipeline.fit(training.toDF)

val test = sc.parallelize(Seq(
  Document(4L, "spark i j k"),
  Document(5L, "l m n"),
  Document(6L, "mapreduce spark"),
  Document(7L, "apache hadoop")))
val predictions = model.transform(test.toDF)
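The fitted model appends prediction columns to the test DataFrame; one quick way to inspect the output (column names as produced by LogisticRegression):

```scala
// Print each document with its class probabilities and predicted class.
predictions.select("id", "text", "probability", "prediction")
  .collect()
  .foreach(println)
```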
Problem: features on different scales distort distance-based clustering, so standardize and re-weight them before running k-means, and persist the fitted parameters so the same transformation can be replayed at prediction time.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{ElementwiseProduct, StandardScaler}
import org.apache.spark.mllib.linalg.Vectors

// Standardize each feature of `dataRDD` (an RDD[Vector], defined elsewhere)
// to zero mean and unit variance.
val standardizer = new StandardScaler(withMean = true, withStd = true)
val model = standardizer.fit(dataRDD)
val standardizedDataRDD = model.transform(dataRDD)

// Re-weight the standardized features elementwise.
val weightVec = Vectors.dense(2.0, 0.5, 0.0, 0.25)
val weightTransformer = new ElementwiseProduct(weightVec) // added in Spark 1.4
val weightedDataRDD = weightTransformer.transform(standardizedDataRDD)

// Cluster the weighted data: k = 2, maxIterations = 2, runs = 1.
val initMode = KMeans.K_MEANS_PARALLEL
val kModel = KMeans.train(weightedDataRDD, 2, 2, 1, initMode)
val predictions = kModel.predict(weightedDataRDD)

// Persist these for reuse at prediction time (along with weightVec above):
val means = model.mean
val stds = model.std // per-feature standard deviations
val centers = kModel.clusterCenters
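To score new data later, the persisted pieces can be reassembled without refitting. A minimal sketch, assuming the `means`, `stds`, `weightVec`, and `centers` saved above, and using the single-`Vector` `transform`/`predict` overloads for one incoming point:

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.feature.StandardScalerModel

// Rebuild the fitted transformers and model from the persisted parameters.
val restoredScaler = new StandardScalerModel(stds, means)
  .setWithMean(true)
  .setWithStd(true)
val restoredWeigher = new ElementwiseProduct(weightVec)
val restoredKMeans = new KMeansModel(centers)

// Replay the exact training-time transformation on a new observation.
val newPoint = Vectors.dense(1.0, 2.0, 3.0, 4.0) // hypothetical 4-feature point
val cluster = restoredKMeans.predict(
  restoredWeigher.transform(restoredScaler.transform(newPoint)))
```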