
Unable to deserialize in Spark a bundle that was created in Python

See original GitHub issue

I successfully created an MLeap bundle in Python, but when I try to deserialize it in Spark, I get the error below. The Python and Spark scripts I used to serialize/deserialize the model are pasted below.

Can you suggest how to modify the deserialization script in Spark to make it work? Thanks!

I have Spark 2.2.1 with Hadoop 2.6 and Scala 2.11.

scala.MatchError: null
  at ml.combust.bundle.BundleFile$.apply(BundleFile.scala:57)
  at ml.combust.bundle.BundleFile$.apply(BundleFile.scala:40)
  at $anonfun$1.apply(<console>:38)
  at $anonfun$1.apply(<console>:38)
  at resource.DefaultManagedResource.open(AbstractManagedResource.scala:110)
  at resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:87)
  at resource.DeferredExtractableManagedResource.either(AbstractManagedResource.scala:29)
  at resource.DeferredExtractableManagedResource.opt(AbstractManagedResource.scala:31)
  ... 56 elided

This is how I start the Spark shell: ./spark-shell --packages ml.combust.mleap:mleap-spark_2.11:0.11.0 --driver-memory 4G --verbose

Python code to create a Random Forest model and serialize it to a bundle:

import pandas as pd
import mleap.sklearn.preprocessing.data
import mleap.sklearn.pipeline
from mleap.sklearn.preprocessing.data import FeatureExtractor
from mleap.sklearn.ensemble import forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Create a test dataframe in Python
test_df = pd.DataFrame({'a': [1,2,3,4,5,6,7,8,9,10],
                        'class': [1,1,0,1,0,1,1,0,0,0]})

test_df['b'] = test_df['a'] * 10


# Create an MLEAP bundle
rf_params = {'n_estimators': 3,
             'max_depth': 8,
             'criterion': 'gini'}

model = RandomForestClassifier(**rf_params)

# Assemble features in a vector (exclude the label column from the features)
features_list = [c for c in test_df.columns if c != 'class']

feature_assembler = FeatureExtractor(input_scalars=features_list,
                                     output_vector='input_features',
                                     output_vector_items=['f_' + x for x in features_list])

# Assemble a pipeline with features and initialize
model.mlinit(input_features='input_features',
             prediction_column='prediction_python',
             feature_names=['f_' + x for x in features_list])

model_pipeline = Pipeline([
    (feature_assembler.name, feature_assembler),
    (model.name, model)])

model_pipeline.mlinit()

# Train the pipeline
model_pipeline.fit(test_df, test_df['class'])

# Serialize the random forest model and save the bundle
model_pipeline.serialize_to_bundle('/path/to/bundle/', 'test_mleap', init=True)
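
Before moving to Spark, a quick local sanity check (not part of the original script) can confirm that the fitted pipeline predicts in Python:

# Hypothetical check: score the training frame with the fitted pipeline
print(model_pipeline.predict(test_df))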

Spark (Scala) code to deserialize the bundle:

import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import ml.combust.mleap.spark.SparkSupport._
import ml.combust.bundle.serializer.SerializationFormat

import org.apache.spark.ml.feature.{StringIndexerModel, VectorAssembler}
import org.apache.spark.ml.mleap.SparkUtil
import resource._


val bundle_path = "file:/path/to/bundle/test_mleap/"

// Deserialize a directory bundle
val bundle = (for(bundleFile <- managed(BundleFile(bundle_path))) yield {
  bundleFile.loadMleapBundle().get
}).opt.get
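
For what it's worth, MLeap's documented examples load zip bundles via a jar:file: URI, and a scala.MatchError at BundleFile.apply can occur when the URI doesn't take a form BundleFile recognizes (this is an assumption, not a confirmed diagnosis of this trace). A minimal sketch, back on the Python side, of zipping the directory bundle so the zip form can be tried:

import shutil

# Create /path/to/bundle/test_mleap.zip from the directory bundle.
# make_archive() puts the directory's contents (bundle.json, root/)
# at the top level of the archive, which is where MLeap looks for them.
shutil.make_archive('/path/to/bundle/test_mleap', 'zip',
                    '/path/to/bundle/test_mleap')

The resulting archive can then be tried in the Spark shell with BundleFile("jar:file:/path/to/bundle/test_mleap.zip").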

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

2 reactions
paulochf commented, Oct 19, 2018

… did you first import MLeap? I don’t see it in the code snippet you’ve posted.

Did you mean

import mleap.pyspark  # << this?
from pyspark.ml import PipelineModel

lr_model = PipelineModel.deserializeFromBundle('jar:file:/some-path/mymodel.zip')

?

Actually, I was doing so. I just didn’t paste it in my example. Sorry about that.

~I don’t know what was causing the problem, but I just got it fixed by reinstalling the Maven packages.~

Update: I just found out that to load the model you have to do

import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer   # < should import this
from pyspark.ml import PipelineModel

lr_model = PipelineModel.deserializeFromBundle('jar:file:/some-path/mymodel.zip')

The docs don’t mention this import; their example only works because of the notebook’s hidden state, which imports all of those things at the beginning.

0 reactions
maggiex commented, Mar 23, 2022

@PowerToThePeople111 were you able to make the example I pasted above work? It doesn’t seem to work for me. It looks for a num_classes key in model.json but does not find it in the zip file. It’s a long error, but here are the first few lines:

java.util.NoSuchElementException: key not found: num_classes
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:59)

A colleague of mine found a workaround for the issue with deserializing the directory bundle in Spark; it works, but it’s not ideal. The MLeap bundle code seems to have an issue that prevents it from writing a necessary key to the bundle output. Specifically, there are model.json files throughout the bundle structure; for Random Forest they are in the directory /root/random_forest_classifier[uid].node/ and in each of the tree directories. All of those model.json files are missing the num_classes key, which is required when deserializing the bundle in Spark. I added the key manually to each model.json, and it worked.
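
For anyone else hitting this, here is a minimal sketch of that manual patch, with the caveat that the op names and the {'long': ...} value encoding are assumptions based on how MLeap bundles typically serialize typed attributes; compare against a model.json from a bundle that deserializes correctly before trusting it:

import json
import os

BUNDLE_ROOT = '/path/to/bundle/test_mleap'  # directory bundle to patch
NUM_CLASSES = 2                             # label classes in the training data
PATCH_OPS = {'random_forest_classifier', 'decision_tree_classifier'}  # op names assumed

# Walk the bundle and add num_classes to every forest/tree model.json
# that is missing it. The {'long': ...} shape is an assumption; verify
# it against a known-good bundle for your MLeap version.
for dirpath, _, filenames in os.walk(BUNDLE_ROOT):
    if 'model.json' not in filenames:
        continue
    path = os.path.join(dirpath, 'model.json')
    with open(path) as f:
        model = json.load(f)
    if model.get('op') not in PATCH_OPS:
        continue
    attributes = model.setdefault('attributes', {})
    if 'num_classes' not in attributes:
        attributes['num_classes'] = {'long': NUM_CLASSES}
        with open(path, 'w') as f:
            json.dump(model, f, indent=2)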

There appears to be a # TODO comment in the MLeap source code that needs to be fixed. It is called out here:

        if isinstance(transformer, RandomForestClassifier):
            attributes.append(('num_classes', transformer.n_classes_)) # TODO: get number of classes from the transformer

Could you show an example of how you add the key to the model.json file? I am new to MLeap, I’m deserializing a legacy model, and I’m running into the same issue. Thanks!

I added “op”: “missing_key_name” to the model.json files, but then I got the following error:

Py4JJavaError: An error occurred while calling o46.deserializeFromBundle.
: java.nio.file.NoSuchFileException: /bundle.json
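
That NoSuchFileException suggests the loader could not find bundle.json at the root of the zip. A quick diagnostic (an assumption about the cause, not a fix) is to list what the archive actually contains:

import zipfile

# A loadable MLeap zip bundle should have bundle.json at the top level,
# not nested inside an extra directory entry.
with zipfile.ZipFile('/some-path/mymodel.zip') as zf:
    for name in zf.namelist():
        print(name)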

