Using pretrained pipelines in PySpark throws java.util.NoSuchElementException
Trying to run a minimal example including a pretrained pipeline in PySpark results in an exception thrown in com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline.
Description
Running the minimal example provided at https://github.com/JohnSnowLabs/spark-nlp/blob/master/python/example/quick-start.ipynb crashes at pipeline = PretrainedPipeline('pipeline_vivekn'). The full interpreter session, ending in the exception, is below:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'PretrainedPipeline' is not defined
>>> from sparknlp import PretrainedPipeline
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'PretrainedPipeline'
>>> from sparknlp.pretrained import PretrainedPipeline
>>> pipeline = PretrainedPipeline('explain_document_ml')
19/03/21 10:45:54 WARN org.apache.spark.sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline.
: java.lang.IllegalArgumentException: requirement failed: Was not found appropriate resource to download for request: ResourceRequest(explain_document_ml,Some(en),public/models,2.0.0,2.4.0) with downloader: com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader@4f5dba48
at scala.Predef$.require(Predef.scala:224)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadResource(ResourceDownloader.scala:102)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:133)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:128)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadPipeline(ResourceDownloader.scala:197)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/hadoop/spark/tmp/spark-aec8140d-aa84-4905-9d08-641b49024bcc/userFiles-488b2b9f-bdd2-4ce4-af07-a70d10fb4d51/JohnSnowLabs_spark-nlp-2.0.0.jar/sparknlp/pretrained.py", line 30, in __init__
File "/hadoop/spark/tmp/spark-aec8140d-aa84-4905-9d08-641b49024bcc/userFiles-488b2b9f-bdd2-4ce4-af07-a70d10fb4d51/JohnSnowLabs_spark-nlp-2.0.0.jar/sparknlp/pretrained.py", line 18, in downloadPipeline
File "/hadoop/spark/tmp/spark-aec8140d-aa84-4905-9d08-641b49024bcc/userFiles-488b2b9f-bdd2-4ce4-af07-a70d10fb4d51/JohnSnowLabs_spark-nlp-2.0.0.jar/sparknlp/internal.py", line 65, in __init__
File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Was not found appropriate resource to download for request: ResourceRequest(explain_document_ml,Some(en),public/models,2.0.0,2.4.0) with downloader: com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader@4f5dba48'
>>> pipeline = PretrainedPipeline('pipeline_vivekn')
[Stage 0:> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/hadoop/spark/tmp/spark-aec8140d-aa84-4905-9d08-641b49024bcc/userFiles-488b2b9f-bdd2-4ce4-af07-a70d10fb4d51/JohnSnowLabs_spark-nlp-2.0.0.jar/sparknlp/pretrained.py", line 30, in __init__
File "/hadoop/spark/tmp/spark-aec8140d-aa84-4905-9d08-641b49024bcc/userFiles-488b2b9f-bdd2-4ce4-af07-a70d10fb4d51/JohnSnowLabs_spark-nlp-2.0.0.jar/sparknlp/pretrained.py", line 18, in downloadPipeline
File "/hadoop/spark/tmp/spark-aec8140d-aa84-4905-9d08-641b49024bcc/userFiles-488b2b9f-bdd2-4ce4-af07-a70d10fb4d51/JohnSnowLabs_spark-nlp-2.0.0.jar/sparknlp/internal.py", line 65, in __init__
File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline.
: java.util.NoSuchElementException: Param patterns does not exist.
at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.ml.param.Params$class.getParam(params.scala:728)
at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42)
at org.apache.spark.ml.util.DefaultParamsReader$Metadata$$anonfun$setParams$1.apply(ReadWrite.scala:591)
at org.apache.spark.ml.util.DefaultParamsReader$Metadata$$anonfun$setParams$1.apply(ReadWrite.scala:589)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.ml.util.DefaultParamsReader$Metadata.setParams(ReadWrite.scala:589)
at org.apache.spark.ml.util.DefaultParamsReader$Metadata.getAndSetParams(ReadWrite.scala:572)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:497)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:12)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:652)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:274)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:272)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:272)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:134)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:128)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadPipeline(ResourceDownloader.scala:197)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Expected Behavior
All necessary resources should be downloaded without running into an exception.
Current Behavior
The downloader crashes when attempting to download the resources for a pretrained pipeline. This happens at least for the pretrained pipelines pipeline_vivekn and explain_document_ml.
Possible Solution
No idea, but I’m running Spark on a Google Dataproc cluster in standalone mode. Perhaps that introduces some complications?
Steps to Reproduce
- Set up PySpark for Python 3.
- Run Spark in standalone mode.
- Run pyspark --packages JohnSnowLabs:spark-nlp:2.0.0.
- Run the code in https://github.com/JohnSnowLabs/spark-nlp/blob/master/python/example/quick-start.ipynb (a sketch of the failing session follows below).
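For reference, this is the shape of the failing session (a minimal sketch; the shell command and pipeline names are taken from the steps above, and the comments summarize the log):

pyspark --packages JohnSnowLabs:spark-nlp:2.0.0
>>> # PretrainedPipeline lives in sparknlp.pretrained, not in the top-level sparknlp module
>>> from sparknlp.pretrained import PretrainedPipeline
>>> pipeline = PretrainedPipeline('explain_document_ml')  # IllegalArgumentException: resource not found
>>> pipeline = PretrainedPipeline('pipeline_vivekn')      # NoSuchElementException: Param patterns does not exist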
Context
Goal: Run the minimal example without crashing. https://github.com/JohnSnowLabs/spark-nlp/issues/299 seems to be about the same issue, but I’m running version 2.0.0, so I figured this bug should be fixed by now.
Your Environment
- Version used: 2.0.0
- Operating System and version (desktop or mobile): Debian 9.8 on a Google Dataproc cluster.
Comments
Hi @jamshaidsohail5, I hope you are safe and well. In 2020 we no longer need _noncontrib to be compatible with Windows; all the models and pipelines in the 2.4.x releases are cross-platform. The current and updated names of models and pipelines are here:
https://github.com/JohnSnowLabs/spark-nlp-models
So if you are on the latest version (which I can see you are), just go ahead and remove the _noncontrib part from the name and use only: recognize_entities_dl
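For example (a minimal sketch; annotate is the standard PretrainedPipeline method, and the sample sentence is only illustrative):

>>> from sparknlp.pretrained import PretrainedPipeline
>>> # download the cross-platform 2.4.x pipeline by its updated name
>>> pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')
>>> pipeline.annotate('Google was founded by Larry Page and Sergey Brin.')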
Let me know if you have any issues.
Hi, it works in our latest official release, 2.1.0, as you can see in this example:
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/quick_start.ipynb
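The relevant part of that notebook looks roughly like this (a sketch assuming Spark NLP 2.1.0; the sample text is illustrative):

>>> import sparknlp
>>> spark = sparknlp.start()  # starts a SparkSession with Spark NLP on the classpath
>>> from sparknlp.pretrained import PretrainedPipeline
>>> pipeline = PretrainedPipeline('explain_document_ml')
>>> pipeline.annotate('We are very happy about Spark NLP!')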