
Error using MLeap with PySpark


I am experiencing errors while trying to set up MLeap, similar to those reported in https://github.com/combust/mleap/issues/172, which is now marked as closed.

I am trying to run the simple Spark example from http://mleap-docs.combust.ml/py-spark/ on an AWS EMR cluster. After logging into the master node, I run this shell script to install the necessary packages:

sudo pip install --upgrade pip
cd /usr/local/bin
sudo pip install ipython

# Add the Bintray sbt repository, then install sbt and git
curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
sudo yum install sbt
sudo yum install git

# Build and test MLeap from source
cd ~
git clone https://github.com/combust/mleap.git
cd mleap
git submodule init
git submodule update
sbt compile
sbt test

# Install the MLeap Python package
cd /usr/local/bin
sudo pip install mleap
cd ~

# Launch PySpark with the MLeap Spark package on the classpath
export PYSPARK_DRIVER_PYTHON=ipython
pyspark --packages ml.combust.mleap:mleap-spark_2.11:0.8.1

Then I run the following code from the simple tutorial:

import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row

l = [('Alice', 1), ('Bob', 2)]
rdd = sc.parallelize(l)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
df2.collect()

string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
feature_assembler = VectorAssembler(inputCols=[string_indexer.getOutputCol()], outputCol="features")
feature_pipeline = [string_indexer, feature_assembler]
featurePipeline = Pipeline(stages=feature_pipeline)
featurePipeline.fit(df2)  # fit() returns a PipelineModel; the result is not captured here
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")  # AttributeError raised here

However I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-9bed2dd48ce6> in <module>()
     21 featurePipeline = Pipeline(stages=feature_pipeline)
     22 featurePipeline.fit(df2)
---> 23 featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")

AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'

This error has been raised in other issues, and a common solution is to check that the mleap import statements are executed first. I have ensured that this is the case, but I am still unable to run this code. I would be grateful for any advice on resolving this.
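For reference, the call pattern in the comments below points at a second thing to check: Pipeline.fit() returns a new PipelineModel rather than modifying the Pipeline in place, and MLeap appears to attach serializeToBundle only to fitted transformers, not to unfitted Pipelines. A minimal sketch of the sequence that would be expected to work, assuming the two-argument serializeToBundle signature used in lie-yan's comment below:

import mleap.pyspark  # must run before serializeToBundle is available
from mleap.pyspark.spark_support import SimpleSparkSerializer

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([Row(name='Alice', age=1), Row(name='Bob', age=2)])

string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
feature_assembler = VectorAssembler(inputCols=[string_indexer.getOutputCol()],
                                    outputCol='features')
pipeline = Pipeline(stages=[string_indexer, feature_assembler])

# Serialize the fitted PipelineModel returned by fit(), not the Pipeline
fitted_pipeline = pipeline.fit(df2)
fitted_pipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                                  fitted_pipeline.transform(df2))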

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 19 (4 by maintainers)

Top GitHub Comments

3 reactions
lie-yan commented, Sep 4, 2018

I encountered a similar problem. The error message is as follows.

py4j.protocol.Py4JJavaError: An error occurred while calling o94.serializeToBundle.
: java.lang.NoClassDefFoundError: resource/package$
	at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:25)
	at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: resource.package$
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 13 more

When I run my script, an exception is raised at the statement:

fitted_pipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip",fitted_pipeline.transform(df2))

Here is my script.

import mleap.pyspark

from mleap.pyspark.spark_support import SimpleSparkSerializer

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row, SparkSession
from pyspark import SparkContext
from pprint import pprint

spark = SparkSession \
    .builder \
    .getOrCreate()

sc = spark.sparkContext

l = [('Alice', 10), ('Bob', 12), ('Alice', 13)]
rdd = sc.parallelize(l)

Person = Row('name', 'age')

person = rdd.map(lambda r: Person(*r))

df2 = spark.createDataFrame(person)

string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')

pprint(string_indexer.getOutputCol())

feature_assembler = VectorAssembler(inputCols=[string_indexer.getOutputCol()],
                                    outputCol='features')

feature_pipeline = Pipeline(stages=[string_indexer, feature_assembler])
fitted_pipeline = feature_pipeline.fit(df2)

fitted_pipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                                  fitted_pipeline.transform(df2))
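
A hedged guess about the NoClassDefFoundError above: resource.package$ belongs to the scala-arm library that MLeap's bundle serialization depends on, so this error usually means MLeap's transitive dependencies never reached the Spark classpath, for example when a single MLeap JAR is attached by hand instead of being resolved via --packages. A minimal sketch of a session that lets Spark resolve the full dependency tree (the version string is taken from the pyspark command in the issue above and should match the installed mleap Python package):

from pyspark.sql import SparkSession

# Let Spark resolve mleap-spark plus its transitive dependencies
# (including scala-arm, which provides resource.package$), rather
# than attaching a single JAR by hand.
spark = (SparkSession.builder
         .config("spark.jars.packages", "ml.combust.mleap:mleap-spark_2.11:0.8.1")
         .getOrCreate())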

1 reaction
nathanaelmouterde commented, Mar 22, 2018

@dvaldivia Hey, I have the same error as @priyeshkap. It does not look like the way serializeToBundle is called is wrong; rather, the featurePipeline object, which is a Pipeline, does not have any method called serializeToBundle, so we can't call it in the first place. I've tried both syntaxes, and neither works.

Any other suggestions?

Since we are still referencing the plain pyspark object, I guess mleap is supposed to alter part of the code to attach this serializeToBundle function, but that does not seem to be happening here.
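
For what it's worth, that reading matches how the patching appears to work: importing mleap.pyspark.spark_support attaches serializeToBundle to Spark ML Transformer classes at import time, and a Pipeline is an Estimator rather than a Transformer, so it never gains the method. An illustrative sketch of that mechanism, simplified and not MLeap's exact code, with the Py4J bridge call left as a placeholder:

from pyspark.ml import Pipeline, PipelineModel, Transformer

def serializeToBundle(self, path, dataset=None):
    # Stand-in for what MLeap actually does here: delegate to the
    # JVM-side ml.combust.mleap.spark.SimpleSparkSerializer via Py4J.
    raise NotImplementedError("placeholder for the Py4J bridge call")

# Patch the method onto Transformer. PipelineModel subclasses
# Transformer, so fitted pipelines gain the method; Pipeline is an
# Estimator and stays unpatched, which is consistent with the
# AttributeError reported above.
Transformer.serializeToBundle = serializeToBundle

assert hasattr(PipelineModel, 'serializeToBundle')
assert not hasattr(Pipeline, 'serializeToBundle')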


Top Results From Across the Web

  • MLeap serializeToBundle error for Pyspark custom Transformer: I have a Pyspark custom Transformer that I am trying to serialize to an mLeap bundle object for later model scoring but I'm...
  • combust/mleap - Gitter: When running a simple example is throw the following error: ... I'm working in databricks, but I need to attach mleap to pyspark,...
  • Train an ML Model using Apache Spark in EMR and deploy in ...: If this step fails with an error - ``JavaPackage is not callable``, it means you have not setup the MLeap JAR in the...
  • Source code for mlflow.mleap: The ``mlflow.mleap`` module provides an API for saving Spark MLLib models using the `MLeap <https://github.com/combust/mleap>`_ persistence mechanism.
  • MLeap: Quickly Release Spark ML Pipelines - Medium: MLeap allows you to quickly deploy your ... this problem may be all too familiar: data scientists...
