spark-nlp won't download pretrained model on Hadoop Cluster
Description
I am using the code below to get sentence embeddings using a BERT model.
from pyspark.sql import SparkSession  # missing from the original snippet
from pyspark.ml import Pipeline       # missing from the original snippet
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

spark = SparkSession.builder \
    .master("yarn") \
    .config("spark.locality.wait", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.0") \
    .config("spark.sql.autoBroadcastJoinThreshold", -1) \
    .config("spark.sql.codegen.aggregate.map.twolevel.enabled", "false") \
    .getOrCreate()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setLazyAnnotator(False)

embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
    .setInputCols("sentence") \
    .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
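For context, once the model is fitted, applying it to real text would look roughly like this (a usage sketch, not part of the original report):

data = spark.createDataFrame([["LaBSE produces language-agnostic sentence embeddings."]]).toDF("text")
result = pipeline_model.transform(data)
# each row of "embeddings" holds annotations whose "embeddings" field is the vector
result.select("embeddings.embeddings").show(truncate=False)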
The script works fine in Spark local mode, but when I deploy it on the Hadoop cluster (using YARN as the resource manager) I get the following error:
labse download started this may take some time.
Traceback (most recent call last):
File "testing_bert_hadoop.py", line 138, in <module>
embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
File "/usr/local/lib/python3.6/site-packages/sparknlp/annotator.py", line 1969, in pretrained
return ResourceDownloader.downloadModel(BertSentenceEmbeddings, name, lang, remote_loc)
File "/usr/local/lib/python3.6/site-packages/sparknlp/pretrained.py", line 32, in downloadModel
file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 192, in __init__
"com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)
File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 129, in __init__
self._java_obj = self.new_java_obj(java_obj, *args)
File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 139, in new_java_obj
return self._new_java_obj(java_class, *args)
File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/pyspark.zip/pyspark/ml/wrapper.py", line 63, in _new_java_obj
File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:61)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:90)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:89)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.collection.Iterator$$anon$14.next(Iterator.scala:541)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at scala.collection.AbstractIterator.toList(Iterator.scala:1336)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:92)
at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:84)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:70)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:399)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:496)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
I tried manually updating the jars json4s-native, json4s-scalap, and several others, but the error still persists.
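A NoSuchMethodError on org.json4s.jackson.JsonMethods.parse usually means an incompatible json4s version on the cluster's classpath is shadowing the one Spark NLP was compiled against, so swapping individual jars rarely helps. A driver-side way to see which json4s jars are actually in play (a diagnostic sketch, not from the original thread):

# Diagnostic sketch: inspect the driver JVM's classpath for competing
# json4s (and Spark NLP) jars; entries come from the cluster's Spark
# install as well as anything added via spark.jars / spark.jars.packages.
jvm = spark.sparkContext._jvm
classpath = jvm.System.getProperty("java.class.path")
for entry in classpath.split(":"):
    if "json4s" in entry or "spark-nlp" in entry:
        print(entry)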
Expected Behavior
The pretrained model should be downloaded and loaded, leaving a fitted pipeline in the pipeline_model variable
Current Behavior
The above-mentioned error is raised while running on the Hadoop cluster
Possible Solution
I tried manually updating the jars json4s-native, json4s-scalap, and several others, but the error still persists; maybe I am lacking some knowledge or misunderstanding the problem.
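One workaround that sidesteps the runtime download entirely is to load the model offline: download the LaBSE archive from the John Snow Labs Models Hub, unpack it to a path reachable from every node (e.g. HDFS), and point load() at it. A sketch, where the HDFS path and unpacked directory name are assumptions:

# Offline workaround (sketch): load a manually downloaded model instead of
# calling .pretrained(), which needs to reach the remote model repository.
# The path below is hypothetical; use wherever you unpacked the archive.
embeddings = BertSentenceEmbeddings.load("hdfs:///models/labse_xx_3.0.0") \
    .setInputCols("sentence") \
    .setOutputCol("embeddings")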
Context
I was trying to get sentence embeddings from the LaBSE model for a classification problem
Your Environment
- Spark NLP version: 3.0.0 (on all nodes)
- Apache Spark version: 2.3.0.2.6.5.1175-1
- Java version: OpenJDK Runtime Environment (build 1.8.0_292-b10), OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)
- Setup and installation: Spark comes bundled with the Hadoop installation
- Operating System and version: CentOS 7
- Cluster Manager: Ambari (HDP 2.6.5.1175-1)
Please let me know if you need any more info. Thanks
@maziyarpanahi much love and blessings to the Spark NLP team
@maziyarpanahi I was literally stuck on this problem for 2 days and didn't know the fat JAR was the answer. I really appreciate your support. From here I can take my script onwards.
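For anyone landing here later: the fat (assembly) JAR bundles Spark NLP together with shaded copies of its dependencies, so it no longer competes with the cluster's own json4s. A minimal sketch of the session config; the local path is an assumption, and the exact assembly artifact name for your Spark NLP and Spark versions should be taken from the release notes:

# Sketch: reference the downloaded fat JAR directly via spark.jars instead of
# resolving com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.0 at runtime.
spark = SparkSession.builder \
    .master("yarn") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars", "/opt/jars/spark-nlp-spark23-assembly-3.0.0.jar") \
    .getOrCreate()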