
MLeap Transformer schema is wrong

See original GitHub issue

After creating a PySpark model and serializing it to a bundle, I try to read the MLeap transformer back in and make predictions, but the predictions are wrong.

Upon investigation, I've found that the model's inputSchema has been modified, so the features are in the wrong order. Simply printing the PipelineModel shows the features in the correct order, but calling inputSchema returns them in an incorrect order. @abaveja313 Code to reproduce:

Model

from sklearn.datasets import load_breast_cancer
from pyspark import SparkConf, SparkContext, SQLContext
from pyspark.sql import Row
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml import Pipeline

# The code below relies on sqlContext; create one if the notebook
# doesn't already provide it
sc = SparkContext.getOrCreate(SparkConf())
sqlContext = SQLContext(sc)

# Build a Spark DataFrame from the sklearn breast-cancer dataset
data = load_breast_cancer()
X, y = data['data'], data['target']
cols = [str(i) for i in data['feature_names']] + ['label']
sample = Row(*cols)
dataframe = []
for X_sample, y_sample in zip(X, y):
    X_data = [float(i) for i in X_sample]
    label = float(y_sample)
    sample_data = X_data + [label]
    dataframe.append(sample(*sample_data))
df = sqlContext.createDataFrame(dataframe)

# Assemble every feature column into a single vector and train a random forest
features = df.columns
features.remove('label')
assembler = VectorAssembler(inputCols=features, outputCol='features')
model = RandomForestClassifier()
pipeline = Pipeline(stages=[assembler, model])
train, test = df.randomSplit([0.7, 0.3])
fittedPipeline = pipeline.fit(train)
predictions = fittedPipeline.transform(test)
print(predictions.select('prediction').limit(10).collect())

Serialization and movement to HDFS
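The bundle-export step itself isn't shown above; a minimal sketch of what it typically looks like with MLeap's PySpark support (assuming the mleap Python package is installed, and that /tmp/mleap-rftest.zip is the local path the bash cell below copies from) would be:

import mleap.pyspark  # noqa: F401 -- registers serializeToBundle on Transformers
from mleap.pyspark.spark_support import SimpleSparkSerializer

# Export the fitted pipeline to a local zip bundle; passing the transformed
# DataFrame lets MLeap infer the data schema
fittedPipeline.serializeToBundle("jar:file:/tmp/mleap-rftest.zip",
                                 dataset=predictions)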

%%bash 
hdfs dfs -copyFromLocal -f /tmp/mleap-rftest.zip /tmp/mleap-rftest.zip

Reading in the model

import java.net.URI

import ml.bundle.hdfs.HadoopBundleFileSystem
import ml.combust.mleap.runtime.MleapContext
import ml.combust.mleap.runtime.frame.Transformer
import ml.combust.mleap.runtime.MleapSupport._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object HDFSRetriever {
  val config = new Configuration()
  // Create the hadoop file system
  val fs: FileSystem = FileSystem.get(config)
  // Create the hadoop bundle file system
  val bundleFs = new HadoopBundleFileSystem(fs)
  // Create an implicit custom mleap context for saving/loading
  implicit val customMleapContext: MleapContext = MleapContext.defaultContext.copy(
    registry = MleapContext.defaultContext.bundleRegistry.registerFileSystem(bundleFs)
  )

  /**
   * Load a given model from HDFS using
   * the configuration specified in the
   * MLeapContext
   *
   * @param path hdfs path to load from
   */
  def loadBundleFromHDFS(path: String): Transformer = {
    new URI(path).loadMleapBundle().get.root
  }
}
val model = HDFSRetriever.loadBundleFromHDFS("hdfs:///tmp/mleap-rftest.zip")
print(model)
out>> Pipeline(PipelineModel_515b263296a1,NodeShape(Map(),Map()),
PipelineModel(List(VectorAssembler(VectorAssembler_b62fd25850d3,NodeShape(Map(
input0 -> Socket(input0,mean radius), 
input1 -> Socket(input1,mean texture), 
input2 -> Socket(input2,mean perimeter), 
input3 -> Socket(input3,mean area),
input4 -> Socket(input4,mean smoothness), 
input5 -> Socket(input5,mean compactness), 
input6 -> Socket(input6,mean concavity), 
input7 -> Socket(input7,mean concave points), 
input8 -> Socket(input8,mean symmetry),
input9 -> Socket(input9,mean fractal dimension), 
input10 -> Socket(input10,radius error), 
input11 -> Socket(input11,texture error), 
input12 -> Socket(input12,perimeter error), 
input13 -> Socket(input13,area error), 
input14 -> Socket(input14,smoothness error),
input15 -> Socket(input15,compactness error), 
input16 -> Socket(input16,concavity error), 
input17 -> Socket(input17,concave points error), 
input18 -> Socket(input18,symmetry error), 
input19 -> Socket(input19,fractal dimension error), 
input20 -> Socket(input20,worst radius), 
input21 -> Socket(input21,worst texture), 
input22 -> Socket(input22,worst perimeter), 
input23 -> Socket(input23,worst area), 
input24 -> Socket(input24,worst smoothness), 
input25 -> Socket(input25,worst compactness),
input26 -> Socket(input26,worst concavity), 
input27 -> Socket(input27,worst concave points), 
input28 -> Socket(input28,worst symmetry), 
input29 -> Socket(input29,worst fractal dimension)),Map(output -> Socket(output,features))),VectorAssemblerModel(List(ScalarShape(true)

Printing the inputSchema

model.inputSchema.fields.zipWithIndex.foreach { case (field, idx) =>
  println(s"$idx $field")
}
out>> 0 StructField(mean texture,ScalarType(double,true))
1 StructField(concavity error,ScalarType(double,true))
2 StructField(mean compactness,ScalarType(double,true))
3 StructField(mean radius,ScalarType(double,true))
4 StructField(texture error,ScalarType(double,true))
5 StructField(mean smoothness,ScalarType(double,true))
6 StructField(concave points error,ScalarType(double,true))
7 StructField(worst concavity,ScalarType(double,true))
8 StructField(mean concavity,ScalarType(double,true))
9 StructField(compactness error,ScalarType(double,true))
10 StructField(mean area,ScalarType(double,true))
11 StructField(worst fractal dimension,ScalarType(double,true))
12 StructField(worst concave points,ScalarType(double,true))
13 StructField(worst perimeter,ScalarType(double,true))
14 StructField(area error,ScalarType(double,true))
15 StructField(worst compactness,ScalarType(double,true))
16 StructField(worst texture,ScalarType(double,true))
17 StructField(mean concave points,ScalarType(double,true))
18 StructField(mean symmetry,ScalarType(double,true))
19 StructField(worst area,ScalarType(double,true))
20 StructField(symmetry error,ScalarType(double,true))
21 StructField(fractal dimension error,ScalarType(double,true))
22 StructField(worst radius,ScalarType(double,true))
23 StructField(worst smoothness,ScalarType(double,true))
24 StructField(mean fractal dimension,ScalarType(double,true))
25 StructField(radius error,ScalarType(double,true))
26 StructField(smoothness error,ScalarType(double,true))
27 StructField(mean perimeter,ScalarType(double,true))
28 StructField(worst symmetry,ScalarType(double,true))
29 StructField(perimeter error,ScalarType(double,true))

As you can see, the inputSchema is in the wrong order, which makes every prediction wrong. I've reproduced the same behavior with LogisticRegression models. I'm stuck here: since I can't trust the generated schema, I have to specify the field order by hand for every model, which makes the code non-reproducible.
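One workaround sketch, not from the original issue: because each StructField in model.inputSchema carries its feature name, a LeapFrame can be populated by name instead of by the original column order, so the scrambled ordering stops mattering. The valuesByName map and its sample entry below are hypothetical placeholders:

import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}

// Hypothetical feature values, keyed by feature name
val valuesByName: Map[String, Double] = Map(
  "mean radius" -> 17.99 /* ..., one entry per input feature */
)

// Lay the row out in whatever order the runtime actually reports
val row = Row(model.inputSchema.fields.map(f => valuesByName(f.name): Any): _*)
val frame = DefaultLeapFrame(model.inputSchema, Seq(row))
val scored = model.transform(frame).get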

Is there something I’m doing wrong here or missing? Help would be greatly appreciated!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

1 reaction
ancasarb commented, Jan 27, 2021

closing this, please re-open if you’re still struggling with it.

0 reactions
bhrigs commented, Aug 24, 2020

Worked fine for me with the latest version ml.combust.mleap:mleap-spark_2.11:0.16.0
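For anyone pinning that version in a build, the equivalent sbt dependency (assuming a Scala 2.11 build) would be:

libraryDependencies += "ml.combust.mleap" %% "mleap-spark" % "0.16.0"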


Top Results From Across the Web

Unable to serialize a apache spark transformer in mleap
I use Spark 2.1.0 and Scala 2.11.8. I am trying to build a twitter sentiment analysis model in apache spark and service it...

combust/mleap - Gitter
Am I wrong? How should I use this, I created the bundle file Using spark but I'm trying to avoid using spark datasets...

Package 'mleap'
An MLeap model object. mleap_model_schema. MLeap model schema. Description. Returns the schema of an MLeap transformer.

Custom Transformer · GitBook
Every transformer in MLeap can be considered a custom transformer. ... String).get override def outputSchema: StructType = StructType("output" -> ScalarType ...

Source code for mlflow.mleap
This is required by MLeap for data schema inference. ... This model must be MLeap-compatible and cannot contain any custom transformers.
