
Using Fat Jars behind company's firewall not viable.


Description

I have started this conversation:

https://spark-nlp.slack.com/archives/CA118BWRM/p1617225602087300

and based on the response, I tried the Fat Jars on my work laptop. Using the Fat Jars, it moved past the session-start step, but it fell short at sentence detection, and there are big differences between spark-nlp 2.7.x and 3.0.x, as detailed below:

1.1. On Spark NLP version 2.7.5: I got a timeout when the company's VPN is enabled (on my work macOS laptop):

spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-2.7.5.jar")\
    .getOrCreate()
spark

Apache Spark version: 2.4.4
Spark NLP version: 2.7.5

sentence_detector_dl download started this may take some time.

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-8-ee9de09a890f> in <module>
      1 sentencerDL = SentenceDetectorDLModel
----> 2     .pretrained("sentence_detector_dl", "en")
      3     .setInputCols(["document"])
      4     .setOutputCol("sentences")

~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)
   3095     def pretrained(name="sentence_detector_dl", lang="en", remote_loc=None):
   3096         from sparknlp.pretrained import ResourceDownloader
-> 3097         return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)

~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
     30     def downloadModel(reader, name, language, remote_loc=None, j_dwn='PythonResourceDownloader'):
     31         print(name + " download started this may take some time.")
---> 32         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
     33         if file_size == "-1":
     34             print("Can not find the model to download please check the name!")

~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, name, language, remote_loc)
    190     def __init__(self, name, language, remote_loc):
    191         super(_GetResourceSize, self).__init__(
--> 192             "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)

~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, java_obj, *args)
    127         super(ExtendedJavaWrapper, self).__init__(java_obj)
    128         self.sc = SparkContext._active_spark_context
--> 129         self._java_obj = self.new_java_obj(java_obj, *args)
    130         self.java_obj = self._java_obj

~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
    138     def new_java_obj(self, java_class, *args):
--> 139         return self._new_java_obj(java_class, *args)

~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
     65             java_obj = getattr(java_obj, name)
     66         java_args = [_py2java(sc, arg) for arg in args]
---> 67         return java_obj(*java_args)

~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)

~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: com.amazonawsShadedAmazonClientException: Unable to execute HTTP request: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
        at com.amazonawsShadedhttp.AmazonHttpClient.executeHelper(AmazonHttpClient.java:454)
        at com.amazonawsShadedhttp.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonawsShadedservices.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonawsShadedservices.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
        at com.amazonawsShadedservices.s3.AmazonS3Client.getObject(AmazonS3Client.java:984)
        at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
        at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
        at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
        at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:401)
        at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:501)
        at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.httpShadedconn.ConnectTimeoutException: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
        at org.apache.httpShadedconn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:551)
        at org.apache.httpShadedimpl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
        at org.apache.httpShadedimpl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
        at org.apache.httpShadedimpl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:641)
        at org.apache.httpShadedimpl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:480)
        at org.apache.httpShadedimpl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
        at org.apache.httpShadedimpl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
        at com.amazonawsShadedhttp.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
        ... 21 more

1.2. However, once I disable the company's VPN, the above call to SentenceDetectorDLModel works!
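The timeout in 1.1 suggests the VPN blocks direct outbound HTTPS to the model repository. As a quick diagnostic before starting Spark at all, you can check whether the S3 endpoint is reachable from plain Python. This is only a sketch of mine, not part of Spark NLP; the endpoint name is copied from the stack trace above:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection handles DNS resolution and the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # DNS failure, refusal, or timeout all land here
        return False

# Endpoint taken from the timeout stack trace above:
if not tcp_reachable("auxdata.johnsnowlabs.com.s3.amazonaws.com", 443):
    print("S3 endpoint not reachable -- downloads via .pretrained() will time out")
```

If this returns False with the VPN on and True with it off, the problem is network policy rather than Spark NLP itself, and the fix belongs with whoever manages the proxy/firewall.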

2.1. Using Spark NLP version 3.0.1, I get a NullPointerException:

spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")\
    .getOrCreate()
spark

Apache Spark version: 3.1.1
Spark NLP version: 3.0.1

sentence_detector_dl download started this may take some time.

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-9-ee9de09a890f> in <module>
      1 sentencerDL = SentenceDetectorDLModel
----> 2     .pretrained("sentence_detector_dl", "en")
      3     .setInputCols(["document"])
      4     .setOutputCol("sentences")

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)
   3107     def pretrained(name="sentence_detector_dl", lang="en", remote_loc=None):
   3108         from sparknlp.pretrained import ResourceDownloader
-> 3109         return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
     30     def downloadModel(reader, name, language, remote_loc=None, j_dwn='PythonResourceDownloader'):
     31         print(name + " download started this may take some time.")
---> 32         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
     33         if file_size == "-1":
     34             print("Can not find the model to download please check the name!")

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, name, language, remote_loc)
    190     def __init__(self, name, language, remote_loc):
    191         super(_GetResourceSize, self).__init__(
--> 192             "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, java_obj, *args)
    127         super(ExtendedJavaWrapper, self).__init__(java_obj)
    128         self.sc = SparkContext._active_spark_context
--> 129         self._java_obj = self.new_java_obj(java_obj, *args)
    130         self.java_obj = self._java_obj

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
    138     def new_java_obj(self, java_class, *args):
--> 139         return self._new_java_obj(java_class, *args)

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
     64             java_obj = getattr(java_obj, name)
     65         java_args = [_py2java(sc, arg) for arg in args]
---> 66         return java_obj(*java_args)

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NullPointerException
        at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsernameEnvironment(ClientConfiguration.java:874)
        at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsername(ClientConfiguration.java:902)
        at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.getProxyUsername(HttpClientSettings.java:90)
        at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.isAuthenticatedProxy(HttpClientSettings.java:182)
        at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.addProxyConfig(ApacheHttpClientFactory.java:96)
        at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:75)
        at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:38)
        at com.amazonaws.ShadedByJSLhttp.AmazonHttpClient.<init>(AmazonHttpClient.java:324)
        at com.amazonaws.ShadedByJSLhttp.AmazonHttpClient.<init>(AmazonHttpClient.java:308)
        at com.amazonaws.ShadedByJSLAmazonWebServiceClient.<init>(AmazonWebServiceClient.java:229)
        at com.amazonaws.ShadedByJSLAmazonWebServiceClient.<init>(AmazonWebServiceClient.java:181)
        at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:617)
        at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:597)
        at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:575)
        at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:542)
        at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.client$lzycompute(S3ResourceDownloader.scala:45)
        at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.client(S3ResourceDownloader.scala:36)
        at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
        at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
        at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
        at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:401)
        at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:501)
        at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

2.2. If I disable the company's VPN, I get the same NullPointerException as above (2.1).
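The 3.0.x NPE happens inside the shaded AWS client while it reads proxy settings (getProxyUsernameEnvironment in the trace), which hints at a partially configured proxy environment rather than pure connectivity. One thing worth trying — an assumption on my part, not a confirmed fix — is to pass the corporate proxy to the driver JVM explicitly via spark.driver.extraJavaOptions so the AWS client sees a complete proxy configuration. The helper below only builds the option string; proxy.mycompany.com and the port are placeholders:

```python
def proxy_java_options(host: str, port: int,
                       user: str = None, password: str = None) -> str:
    """Build a -D option string setting the standard JVM proxy properties
    for both http and https."""
    opts = []
    for scheme in ("http", "https"):
        opts.append(f"-D{scheme}.proxyHost={host}")
        opts.append(f"-D{scheme}.proxyPort={port}")
        if user is not None:
            opts.append(f"-D{scheme}.proxyUser={user}")
        if password is not None:
            opts.append(f"-D{scheme}.proxyPassword={password}")
    return " ".join(opts)

# Hypothetical usage when building the session (substitute your
# company's actual proxy host/port):
#
# spark = SparkSession.builder \
#     .config("spark.driver.extraJavaOptions",
#             proxy_java_options("proxy.mycompany.com", 8080)) \
#     ...
```

Whether the shaded client honors these properties in 3.0.1 would need confirming with the maintainers, but it is a cheap experiment.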

Expected Behavior

I would like to use your code behind the company's firewall, and more importantly from AWS SageMaker. I test it first on my work laptop, so I would like to have it working there as well.

Current Behavior

Not working. I got a temporary healthcare license, which expires in a couple of days, and so far I have not been able to run any of your code behind the company's firewall. When I set up the spark-nlp session using the Fat Jars and then use a pretrained model such as:

sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

it fails.

Possible Solution

I like the idea of using Fat Jars, but I need them to be functional.
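Another workaround that sidesteps S3 entirely (a sketch, assuming you can download the archive from a machine outside the firewall): fetch the model manually from the Models Hub at https://nlp.johnsnowlabs.com/models, copy the zip inside the firewall, unpack it, and point the annotator's load() at the local directory instead of calling .pretrained():

```python
import zipfile

def unpack_model(zip_path: str, dest_dir: str) -> str:
    """Unzip a manually downloaded Spark NLP model archive into dest_dir
    and return the destination path."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
    return dest_dir

# With the archive unpacked, load from disk instead of S3 (requires a
# running Spark session with the spark-nlp fat jar on the classpath;
# the path below is a placeholder for the unpacked model directory):
#
#   from sparknlp.annotator import SentenceDetectorDLModel
#   sentencerDL = SentenceDetectorDLModel.load("/path/to/sentence_detector_dl_en") \
#       .setInputCols(["document"]) \
#       .setOutputCol("sentences")
```

This keeps the Fat Jar approach for the library itself and only moves the model download out of band.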

Steps to Reproduce

Tested on my work macOS Catalina (latest version), using the installation instructions at https://nlp.johnsnowlabs.com/docs/en/install#python, for both:

$ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
$ pip install spark-nlp==3.0.1 pyspark==3.1.1
$ pip install jupyter
$ jupyter notebook

and

$ java -version
$ conda create -n spark-nlp python=3.7 -y
$ conda activate spark-nlp
$ pip install spark-nlp==2.7.5 pyspark==2.4.4
$ pip install jupyter
$ jupyter notebook

Pretty much follow the code from: https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb#scrollTo=KvNuyGXpD7Nt

but using the Fat Jars instead:

spark = SparkSession.builder\
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0")\
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")\
    .getOrCreate()

and the moment I hit this code:

sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

I get the above errors (a NullPointerException for spark-nlp 3.0.x and a timeout for spark-nlp 2.7.x).
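As a sanity check while reproducing, it is worth confirming that every local path listed in spark.jars actually exists before blaming the downloader — a mistyped fat-jar path fails much later with confusing errors. A small sketch of mine (spark.jars is a comma-separated list per the Spark configuration docs):

```python
import os

def missing_jars(spark_jars_conf: str) -> list:
    """Return the local entries of a comma-separated spark.jars value
    that do not exist on disk (remote URLs are skipped)."""
    paths = [p.strip() for p in spark_jars_conf.split(",") if p.strip()]
    return [p for p in paths
            if not p.startswith(("http://", "https://", "hdfs:"))
            and not os.path.exists(p)]

# Usage against a live session (assumes `spark` already exists):
#
#   missing = missing_jars(spark.sparkContext.getConf().get("spark.jars", ""))
#   if missing:
#       print("spark.jars entries not found on disk:", missing)
```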

Context

Your Environment

  • Spark NLP version (sparknlp.version()): 3.0.1
  • Apache Spark version (spark.version): 3.1.1
  • Java version (java -version): openjdk version "1.8.0_282", OpenJDK Runtime Environment (build 1.8.0_282-bre_2021_01_20_16_37-b00), OpenJDK 64-Bit Server VM (build 25.282-b00, mixed mode)
  • Conda: latest release
  • Operating System and version: macOS Catalina, latest release

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 22 (10 by maintainers)

Top GitHub Comments

1 reaction
Octavian-act commented, Apr 20, 2021

@maziyarpanahi I still hope to be able to use JSL software. Please have someone contact me to test/debug further. Thank you.

1 reaction
maziyarpanahi commented, Apr 9, 2021

Oh, I sent you the Fat Jar via Slack. I’ll put it on S3 and share the link. All of our models/pipelines are here: https://nlp.johnsnowlabs.com/models

They all have examples and download links inside.

