
spark-nlp won't download pretrained model on Hadoop Cluster

See original GitHub issue

Description

I am using the code below to get sentence embeddings with a BERT model (LaBSE).

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

spark = SparkSession.builder\
    .master("yarn")\
    .config("spark.locality.wait", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.0")\
    .config("spark.sql.autoBroadcastJoinThreshold", -1)\
    .config("spark.sql.codegen.aggregate.map.twolevel.enabled", "false")\
    .getOrCreate()

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setLazyAnnotator(False)

embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

The script works fine in Spark local mode, but when I deployed it on the Hadoop cluster (using YARN as the resource manager) I got the following error:

labse download started this may take some time.
Traceback (most recent call last):
  File "testing_bert_hadoop.py", line 138, in <module>
    embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
  File "/usr/local/lib/python3.6/site-packages/sparknlp/annotator.py", line 1969, in pretrained
    return ResourceDownloader.downloadModel(BertSentenceEmbeddings, name, lang, remote_loc)
  File "/usr/local/lib/python3.6/site-packages/sparknlp/pretrained.py", line 32, in downloadModel
    file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
  File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 192, in __init__
    "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)
  File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 129, in __init__
    self._java_obj = self.new_java_obj(java_obj, *args)
  File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 139, in new_java_obj
    return self._new_java_obj(java_class, *args)
  File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/pyspark.zip/pyspark/ml/wrapper.py", line 63, in _new_java_obj
  File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:61)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:90)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:89)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at scala.collection.Iterator$$anon$14.next(Iterator.scala:541)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
	at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
	at scala.collection.AbstractIterator.toList(Iterator.scala:1336)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:92)
	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:84)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:70)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:399)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:496)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)

I tried manually updating the jars json4s-native, json4s-scalap, and several others, but the error still persists.

Expected Behavior

The pretrained model should be downloaded and loaded into the pipeline_model variable.

Current Behavior

The above error is thrown when running on the Hadoop cluster.

Possible Solution

I tried manually updating the jars json4s-native, json4s-scalap, and several others, but the error still persists. Maybe I am lacking some knowledge or misunderstanding the problem.
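A NoSuchMethodError on org.json4s.jackson.JsonMethods usually means an older, binary-incompatible json4s from the Hadoop/HDP stack is winning on the classpath over the version Spark NLP was compiled against, rather than a jar being missing. One way to check for such a conflict is to list every json4s jar the cluster distribution ships; the search paths below are assumptions for an HDP 2.6.5 / CentOS layout, so adjust them for your installation.

```shell
# Diagnostic sketch (paths are assumptions for an HDP/CentOS layout):
# list every json4s jar that could end up on the driver/executor classpath.
# Multiple different versions in the output point to a binary conflict.
find /usr/hdp /usr/lib /usr/local -name 'json4s*.jar' 2>/dev/null || true
```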

Context

I was trying to get sentence embeddings with the LaBSE model for a classification problem.

Your Environment

  • Spark NLP version: 3.0.0 on all nodes
  • Apache Spark version: 2.3.0.2.6.5.1175-1
  • Java version: OpenJDK Runtime Environment (build 1.8.0_292-b10), OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)
  • Setup and installation: Spark comes bundled with the Hadoop installation
  • Operating System and version: CentOS 7
  • Cluster Manager: Ambari (HDP 2.6.5.1175-1)

Please let me know if you need any more info. Thanks!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 39 (15 by maintainers)

Top GitHub Comments

DanielOX commented on Jun 8, 2021 (2 reactions)

@maziyarpanahi much love and blessings to SPARK-NLP Team

DanielOX commented on Jun 8, 2021 (2 reactions)

@maziyarpanahi I was literally stuck on this problem for 2 days; I didn't know the fat JAR was the answer. I really appreciate your support. From here I can take my script onwards.
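The fix the comments point to (using Spark NLP's self-contained assembly, or "fat", jar instead of resolving the library through spark.jars.packages) can be sketched as below. The jar path and file name are assumptions; download the assembly jar that matches your Spark and Scala versions from the Spark NLP release/installation page before submitting.

```shell
# Sketch, not a verified command: ship Spark NLP's assembly ("fat") jar,
# which bundles its own json4s classes, instead of letting
# spark.jars.packages pull com.johnsnowlabs.nlp and its dependencies
# alongside the HDP-provided json4s on the YARN classpath.
# The local jar path below is an assumption for illustration.
spark-submit \
  --master yarn \
  --jars /opt/jars/spark-nlp-assembly-3.0.0.jar \
  --conf spark.kryoserializer.buffer.max=2000M \
  testing_bert_hadoop.py
```

With --jars the assembly is distributed to every executor, so the spark.jars.packages line in the SparkSession builder can be dropped from the script.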
