
OneHotEncoder does not serialize if meta data is missing from DataFrame

See original GitHub issue

Hi! I’m trying to use mleap-pyspark:

import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
model.serializeToBundle("jar:file:/tmp/pysparklr.zip", dataset=test_model)

and the resulting ZIP file looks like this:

Contents of /tmp/pysparklr.zip#uzip/root:

    Bucketizer_46cf8e4ebed70d3332d0.node
    Bucketizer_48dcb1b53992160f3822.node
    LinearRegression_4d6185c2e480b21990be.node
    MinMaxScaler_4b99b614a7c0d0aa7695.node
    OneHotEncoder_425fa08282a386439839.node
    OneHotEncoder_42a78b6ab211dd2dd127.node
    OneHotEncoder_43449ed2fc2300bc5663.node
    OneHotEncoder_43cf8e3711aba38cd281.node
    OneHotEncoder_43dcb205b2f6fb04f91e.node
    OneHotEncoder_453cb84de080d6deed33.node
    OneHotEncoder_45d39ec3561a9a81c512.node
    OneHotEncoder_4648a83a58352a13f7c4.node
    OneHotEncoder_46af8a56e8f7348b7d82.node
    OneHotEncoder_4be5875451f4ce2754cc.node
    OneHotEncoder_4c6da30961c990b9b207.node
    OneHotEncoder_4cee8923b7658cd30c59.node
    OneHotEncoder_4df6a6a32fb79fba831d.node
    OneHotEncoder_4f85bbf6c8707f7a7560.node
    PolynomialExpansion_4b0e955c0cb66c94097c.node
    StringIndexer_49fdbb5d6b5b7df454e8.node
    StringIndexer_4f3da329ccaf7d55738c.node
    VectorAssembler_4cac83e941ced21754e7.node
    VectorAssembler_4f95906e884daeb48887.node

There is nothing at the root level of the archive, only the node files inside the root folder. So when I try to upload it to mleap-server, I get:

[ERROR] [09/29/2017 15:09:47.957] [MleapServing-akka.actor.default-dispatcher-2] [akka.actor.ActorSystemImpl(MleapServing)] Error during processing of request: 'bundle.json'. Completing with 500 Internal Server Error response.

Any suggestions on how to fix this?

Thanks!

P.S. I’m working with 0.8.0-SNAPSHOT.
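As the issue title suggests, the serializer depends on Spark ML attribute metadata being present on the encoder's input columns. A minimal sketch of how to spot columns that are missing it, in plain Python modeling the metadata dict that `df.schema["col"].metadata` returns in PySpark (`has_ml_attr` and the column names are illustrative, not an MLeap API):

```python
# Sketch: Spark ML stores per-column attribute metadata under the "ml_attr" key.
# In PySpark you can read it as a plain dict via df.schema["col"].metadata.
# has_ml_attr is a hypothetical helper for illustration, not part of MLeap.

def has_ml_attr(metadata: dict) -> bool:
    """Return True if the column carries ML attribute metadata."""
    return "ml_attr" in metadata

# A StringIndexer output column typically carries metadata like this
# (column name and labels here are made up for the example):
indexed_col_meta = {
    "ml_attr": {"type": "nominal", "name": "cityidx", "vals": ["NY", "SF", "LA"]}
}
# A raw numeric column usually has empty metadata:
raw_col_meta = {}

assert has_ml_attr(indexed_col_meta)
assert not has_ml_attr(raw_col_meta)

# In a real pipeline you would check every OneHotEncoder input, e.g.:
# missing = [c for c in encoder_inputs if not has_ml_attr(df.schema[c].metadata)]
```

Columns flagged this way are the ones likely to trip the bundle serializer.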

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 17 (8 by maintainers)

Top GitHub Comments

liaosimin commented, Aug 8, 2018 (2 reactions)

@hollinwilkins Have you solved this bug? I’m hitting the same exception: “unsupported attribute for field **”

inardini commented, Oct 3, 2021 (0 reactions)

I get the same error using pyspark 3.0.2 and mleap 0.18.1.

Py4JJavaError: An error occurred while calling o99817.serializeToBundle.
: java.lang.RuntimeException: unsupported attribute for field loan_termidximputed
    at org.apache.spark.ml.bundle.ops.feature.OneHotEncoderOp$.sizeForField(OneHotEncoderOp.scala:31)
    at org.apache.spark.ml.bundle.ops.feature.OneHotEncoderOp$$anon$1.$anonfun$store$2(OneHotEncoderOp.scala:47)
    at org.apache.spark.ml.bundle.ops.feature.OneHotEncoderOp$$anon$1.$anonfun$store$2$adapted(OneHotEncoderOp.scala:47)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:238)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at org.apache.spark.ml.bundle.ops.feature.OneHotEncoderOp$$anon$1.store(OneHotEncoderOp.scala:47)
    at org.apache.spark.ml.bundle.ops.feature.OneHotEncoderOp$$anon$1.store(OneHotEncoderOp.scala:37)
    at ml.combust.bundle.serializer.ModelSerializer.$anonfun$write$1(ModelSerializer.scala:87)
    at scala.util.Try$.apply(Try.scala:213)
    at ml.combust.bundle.serializer.ModelSerializer.write(ModelSerializer.scala:83)
    at ml.combust.bundle.serializer.NodeSerializer.$anonfun$write$1(NodeSerializer.scala:85)
    at scala.util.Try$.apply(Try.scala:213)
    at ml.combust.bundle.serializer.NodeSerializer.write(NodeSerializer.scala:81)
    at ml.combust.bundle.serializer.GraphSerializer.$anonfun$writeNode$1(GraphSerializer.scala:34)
    at scala.util.Try$.apply(Try.scala:213)
    at ml.combust.bundle.serializer.GraphSerializer.writeNode(GraphSerializer.scala:30)
    at ml.combust.bundle.serializer.GraphSerializer.$anonfun$write$2(GraphSerializer.scala:21)
    at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
    at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:38)
    at ml.combust.bundle.serializer.GraphSerializer.write(GraphSerializer.scala:21)
    at org.apache.spark.ml.bundle.ops.PipelineOp$$anon$1.store(PipelineOp.scala:21)
    at org.apache.spark.ml.bundle.ops.PipelineOp$$anon$1.store(PipelineOp.scala:14)
    at ml.combust.bundle.serializer.ModelSerializer.$anonfun$write$1(ModelSerializer.scala:87)
    at scala.util.Try$.apply(Try.scala:213)
    at ml.combust.bundle.serializer.ModelSerializer.write(ModelSerializer.scala:83)
    at ml.combust.bundle.serializer.NodeSerializer.$anonfun$write$1(NodeSerializer.scala:85)
    at scala.util.Try$.apply(Try.scala:213)
    at ml.combust.bundle.serializer.NodeSerializer.write(NodeSerializer.scala:81)
    at ml.combust.bundle.serializer.BundleSerializer.$anonfun$write$1(BundleSerializer.scala:34)
    at scala.util.Try$.apply(Try.scala:213)
    at ml.combust.bundle.serializer.BundleSerializer.write(BundleSerializer.scala:29)
    at ml.combust.bundle.BundleWriter.save(BundleWriter.scala:34)
    at ml.combust.mleap.spark.SimpleSparkSerializer.$anonfun$serializeToBundleWithFormat$4(SimpleSparkSerializer.scala:26)
    at resource.AbstractManagedResource.$anonfun$acquireFor$1(AbstractManagedResource.scala:88)
    at scala.util.control.Exception$Catch.$anonfun$either$1(Exception.scala:252)
    at scala.util.control.Exception$Catch.apply(Exception.scala:228)
    at scala.util.control.Exception$Catch.either(Exception.scala:252)
    at resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88)
    at resource.ManagedResourceOperations.apply(ManagedResourceOperations.scala:26)
    at resource.ManagedResourceOperations.apply$(ManagedResourceOperations.scala:26)
    at resource.AbstractManagedResource.apply(AbstractManagedResource.scala:50)
    at resource.DeferredExtractableManagedResource.$anonfun$tried$1(AbstractManagedResource.scala:33)
    at scala.util.Try$.apply(Try.scala:213)
    at resource.DeferredExtractableManagedResource.tried(AbstractManagedResource.scala:33)
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundleWithFormat(SimpleSparkSerializer.scala:25)
    at ml.combust.mleap.spark.SimpleSparkSerializer.serializeToBundle(SimpleSparkSerializer.scala:17)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

But this time I have the following pipeline:

numerical_imputer --> categorical_indexer --> categorical_imputer --> target_indexer --> one_hot_encoder --> realtime_vector_assembler --> realtime_scaler --> features_vector_assembler --> best_rfor

where categorical_indexer is a StringIndexer.

Any insights?
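The failing field here, loan_termidximputed, looks like an imputer output, and imputer output columns generally lose the nominal metadata the original indexed column carried. One possible workaround (an assumption on my part, not a fix confirmed in this thread) is to rebuild that metadata and reattach it before the encoder, since PySpark's `Column.alias` accepts a `metadata` keyword. A pure-Python sketch of building the metadata dict, with the PySpark call shown in comments (`build_nominal_metadata` and the example labels are hypothetical):

```python
# Sketch: rebuild the nominal "ml_attr" metadata that an imputer output column
# loses, so MLeap's OneHotEncoderOp can find the category count again.
# build_nominal_metadata is a hypothetical helper, not an MLeap or Spark API,
# and the label values below are made up for the example.

def build_nominal_metadata(name: str, labels: list) -> dict:
    """Return a metadata dict in the shape Spark ML uses for indexed columns."""
    return {"ml_attr": {"type": "nominal", "name": name, "vals": list(labels)}}

meta = build_nominal_metadata("loan_termidximputed", ["36 months", "60 months"])
assert meta["ml_attr"]["type"] == "nominal"
assert len(meta["ml_attr"]["vals"]) == 2

# In PySpark, the dict could then be reattached to the column (untested sketch):
# df = df.withColumn(
#     "loan_termidximputed",
#     df["loan_termidximputed"].alias("loan_termidximputed", metadata=meta),
# )
```

Alternatively, reordering the pipeline so the StringIndexer runs after imputation would keep the metadata intact without any manual reattachment.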


Top Results From Across the Web

OneHot encoder and storing Pipeline (feature dimension issue)
If metadata is missing, like in your case, it uses fallback strategy and assumes there is max(input_column) levels. Serialization is irrelevant ...

Obtaining consistent one-hot encoding of train / production data
I'm building an app that will require user input. Currently, on the training set, I run the following code, in which data is...

Input contains NaN when onehotencoding | Data Science and ...
Hi peeps,. I'm trying to work my way through the categorical variables exercise, but when I want to one hot encode X_test, I...

Glossary of Common Terms and API Elements - Scikit-learn
This glossary hopes to definitively represent the tacit and explicit conventions applied in Scikit-learn and its API, while providing a reference for users ...

Release Notes — EvalML 0.64.0 documentation - Alteryx
Fixed bug where One Hot Encoder would error out if a non-categorical feature ... Updated TargetDistributionDataCheck to return metadata details as floats ...
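The first result above describes the fallback Spark applies when column metadata is missing: the number of levels is guessed from the largest index seen in the data. A tiny sketch of that idea in plain Python (`infer_levels` is illustrative, not Spark's actual implementation):

```python
# Sketch of the fallback described above: with no metadata, an encoder can only
# guess the category count from the data itself, e.g. max(index) + 1 levels
# for zero-based indices. infer_levels is illustrative, not Spark's real code.

def infer_levels(indices: list) -> int:
    """Guess the number of categories from observed integer-valued indices."""
    return int(max(indices)) + 1

assert infer_levels([0.0, 2.0, 1.0]) == 3
# Danger: if the largest category never appears in a given batch, the guess
# is too low, which is why serialized feature dimensions can come out wrong:
assert infer_levels([0.0, 1.0]) == 2  # even if 3 real categories exist
```

This is why keeping the real metadata attached matters: the fallback depends on what happens to be in the data rather than on the true category set.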
