MLeap Transformer schema is wrong
Issue Description
After creating a PySpark model and serializing it to a bundle, I load the MLeap transformer and make a prediction, but the prediction is wrong.
Upon investigation, I’ve found that the inputSchema of the loaded model lists the features in a different order than the one used in training. Printing the PipelineModel shows the features in the correct order, but calling inputSchema returns them in an incorrect order.
@abaveja313
Code to reproduce:
Model
from pyspark import SparkConf, SparkContext, SQLContext
from pyspark.sql import Row
from pyspark.ml.feature import VectorAssembler,StringIndexer
from pyspark.ml.classification import LogisticRegression,RandomForestClassifier
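from pyspark.ml import Pipeline
from sklearn.datasets import load_breast_cancer
# sqlContext is assumed to be provided by the notebook session
data = load_breast_cancer()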
X, y = data['data'], data['target']
cols = [str(i) for i in data['feature_names']] + ['label']
sample = Row(*cols)
dataframe = []
for X_sample, y_sample in zip(X, y):
    X_data = [float(i) for i in X_sample]
    label = float(y_sample)
    sample_data = X_data + [label]
    dataframe.append(sample(*sample_data))
df = sqlContext.createDataFrame(dataframe)
features = df.columns
features.remove('label')
assembler = VectorAssembler(inputCols=features, outputCol='features')
model = RandomForestClassifier()
pipeline = Pipeline(stages=[assembler, model])
train, test = df.randomSplit([0.7, 0.3])
fittedPipeline = pipeline.fit(train)
predictions = fittedPipeline.transform(test)
print(predictions.select('prediction').limit(10).collect())
Serialization and movement to HDFS:
%%bash
hdfs dfs -copyFromLocal -f /tmp/mleap-rftest.zip /tmp/mleap-rftest.zip
Reading in the model
import java.net.URI
import ml.bundle.hdfs.HadoopBundleFileSystem
import ml.combust.mleap.runtime.MleapContext
import ml.combust.mleap.runtime.frame.Transformer
import ml.combust.mleap.runtime.MleapSupport._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
object HDFSRetriever {
  val config = new Configuration()
  // Create the hadoop file system
  val fs: FileSystem = FileSystem.get(config)
  // Create the hadoop bundle file system
  val bundleFs = new HadoopBundleFileSystem(fs)
  // Create an implicit custom mleap context for saving/loading
  implicit val customMleapContext: MleapContext = MleapContext.defaultContext.copy(
    registry = MleapContext.defaultContext.bundleRegistry.registerFileSystem(bundleFs)
  )

  /**
   * Load a given model from HDFS using
   * the configuration specified in the
   * MLeapContext
   *
   * @param path hdfs path to load from
   */
  def loadBundleFromHDFS(path: String): Transformer = {
    new URI(path).loadMleapBundle().get.root
  }
}
val model = HDFSRetriever.loadBundleFromHDFS("hdfs:///tmp/mleap-rftest.zip");
print(model)
out>> Pipeline(PipelineModel_515b263296a1,NodeShape(Map(),Map()),
PipelineModel(List(VectorAssembler(VectorAssembler_b62fd25850d3,NodeShape(Map(
input0 -> Socket(input0,mean radius),
input1 -> Socket(input1,mean texture),
input2 -> Socket(input2,mean perimeter),
input3 -> Socket(input3,mean area),
input4 -> Socket(input4,mean smoothness),
input5 -> Socket(input5,mean compactness),
input6 -> Socket(input6,mean concavity),
input7 -> Socket(input7,mean concave points),
input8 -> Socket(input8,mean symmetry),
input9 -> Socket(input9,mean fractal dimension),
input10 -> Socket(input10,radius error),
input11 -> Socket(input11,texture error),
input12 -> Socket(input12,perimeter error),
input13 -> Socket(input13,area error),
input14 -> Socket(input14,smoothness error),
input15 -> Socket(input15,compactness error),
input16 -> Socket(input16,concavity error),
input17 -> Socket(input17,concave points error),
input18 -> Socket(input18,symmetry error),
input19 -> Socket(input19,fractal dimension error),
input20 -> Socket(input20,worst radius),
input21 -> Socket(input21,worst texture),
input22 -> Socket(input22,worst perimeter),
input23 -> Socket(input23,worst area),
input24 -> Socket(input24,worst smoothness),
input25 -> Socket(input25,worst compactness),
input26 -> Socket(input26,worst concavity),
input27 -> Socket(input27,worst concave points),
input28 -> Socket(input28,worst symmetry),
input29 -> Socket(input29,worst fractal dimension)),Map(output -> Socket(output,features))),VectorAssemblerModel(List(ScalarShape(true)
Printing the inputSchema:
model.inputSchema.fields.zipWithIndex.foreach { case (field, idx) =>
  println(s"$idx $field")
}
out>> 0 StructField(mean texture,ScalarType(double,true))
1 StructField(concavity error,ScalarType(double,true))
2 StructField(mean compactness,ScalarType(double,true))
3 StructField(mean radius,ScalarType(double,true))
4 StructField(texture error,ScalarType(double,true))
5 StructField(mean smoothness,ScalarType(double,true))
6 StructField(concave points error,ScalarType(double,true))
7 StructField(worst concavity,ScalarType(double,true))
8 StructField(mean concavity,ScalarType(double,true))
9 StructField(compactness error,ScalarType(double,true))
10 StructField(mean area,ScalarType(double,true))
11 StructField(worst fractal dimension,ScalarType(double,true))
12 StructField(worst concave points,ScalarType(double,true))
13 StructField(worst perimeter,ScalarType(double,true))
14 StructField(area error,ScalarType(double,true))
15 StructField(worst compactness,ScalarType(double,true))
16 StructField(worst texture,ScalarType(double,true))
17 StructField(mean concave points,ScalarType(double,true))
18 StructField(mean symmetry,ScalarType(double,true))
19 StructField(worst area,ScalarType(double,true))
20 StructField(symmetry error,ScalarType(double,true))
21 StructField(fractal dimension error,ScalarType(double,true))
22 StructField(worst radius,ScalarType(double,true))
23 StructField(worst smoothness,ScalarType(double,true))
24 StructField(mean fractal dimension,ScalarType(double,true))
25 StructField(radius error,ScalarType(double,true))
26 StructField(smoothness error,ScalarType(double,true))
27 StructField(mean perimeter,ScalarType(double,true))
28 StructField(worst symmetry,ScalarType(double,true))
29 StructField(perimeter error,ScalarType(double,true))
As you can see, the field order returned by inputSchema does not match the order the VectorAssembler was trained with, so all predictions come out wrong. I’ve reproduced the same behavior with LogisticRegression models as well. I’m stuck because, without a reliable way to derive the schema from the model, I have to specify it by hand each time, which makes the code hard to reproduce.
Is there something I’m doing wrong or missing here? Help would be greatly appreciated!
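A minimal sketch of one possible workaround, under the assumption (not confirmed in this issue) that the MLeap runtime resolves input columns by name rather than by position. If that holds, the order of inputSchema is harmless as long as each row's values are arranged to match whatever order the schema reports. The helper name `align_row` and the sample values below are hypothetical and only illustrate the idea of aligning values by field name:

```python
from typing import Dict, List, Sequence


def align_row(record: Dict[str, float], schema_fields: Sequence[str]) -> List[float]:
    """Return the record's values in the order given by schema_fields.

    Values are looked up by name, so the order in which the record was
    built does not matter. (Hypothetical helper, not part of MLeap.)
    """
    missing = [name for name in schema_fields if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return [record[name] for name in schema_fields]


# Order used when the pipeline was trained (first three features only).
training_order = ["mean radius", "mean texture", "mean perimeter"]
values = [17.99, 10.38, 122.8]  # illustrative values
record = dict(zip(training_order, values))

# Relative order in which the same fields appear in the printed inputSchema.
schema_order = ["mean texture", "mean radius", "mean perimeter"]

row = align_row(record, schema_order)
print(row)  # [10.38, 17.99, 122.8]
```

The same name-based lookup could be applied to the field names read from model.inputSchema before building a frame, instead of hard-coding the column order.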
Closing this; please re-open if you’re still struggling with it.
Worked fine for me with the latest version ml.combust.mleap:mleap-spark_2.11:0.16.0