Using Fat Jars behind the company's firewall is not viable
See original GitHub issue.
Description
I have started this conversation:
https://spark-nlp.slack.com/archives/CA118BWRM/p1617225602087300
and based on the response, I tried the Fat Jars on my work laptop. Using the Fat Jars, it did move past the session-start step, but it fell short in sentence detection, and there are big differences between Spark NLP 2.7.x and 3.0.x, as detailed below:
1.1. On Spark NLP version 2.7.5: I got a timeout when the company's VPN is enabled (on my work macOS laptop):
spark = SparkSession.builder\
.appName("Spark NLP")\
.master("local[4]")\
.config("spark.driver.memory","16G")\
.config("spark.driver.maxResultSize", "0")\
.config("spark.kryoserializer.buffer.max", "2000M")\
.config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-2.7.5.jar")\
.getOrCreate()
spark
Apache Spark version: 2.4.4
Spark NLP version 2.7.5
sentence_detector_dl download started this may take some time.
Py4JJavaError Traceback (most recent call last)
<ipython-input-8-ee9de09a890f> in <module>
1 sentencerDL = SentenceDetectorDLModel
----> 2 .pretrained("sentence_detector_dl", "en")
3 .setInputCols(["document"])
4 .setOutputCol("sentences")
5
~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)
3095 def pretrained(name="sentence_detector_dl", lang="en", remote_loc=None):
3096 from sparknlp.pretrained import ResourceDownloader
-> 3097 return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)
3098
3099
~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
30 def downloadModel(reader, name, language, remote_loc=None, j_dwn='PythonResourceDownloader'):
31 print(name + " download started this may take some time.")
---> 32 file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
33 if file_size == "-1":
34 print("Can not find the model to download please check the name!")
~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, name, language, remote_loc)
190 def __init__(self, name, language, remote_loc):
191 super(_GetResourceSize, self).__init__(
---> 192 "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)
193
194
~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, java_obj, *args)
127 super(ExtendedJavaWrapper, self).__init__(java_obj)
128 self.sc = SparkContext._active_spark_context
---> 129 self._java_obj = self.new_java_obj(java_obj, *args)
130 self.java_obj = self._java_obj
131
~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
137
138 def new_java_obj(self, java_class, *args):
---> 139 return self._new_java_obj(java_class, *args)
140
141 def new_java_array(self, pylist, java_class):
~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
65 java_obj = getattr(java_obj, name)
66 java_args = [_py2java(sc, arg) for arg in args]
---> 67 return java_obj(*java_args)
68
69 @staticmethod
~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
---> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: com.amazonawsShadedAmazonClientException: Unable to execute HTTP request: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
at com.amazonawsShadedhttp.AmazonHttpClient.executeHelper(AmazonHttpClient.java:454)
at com.amazonawsShadedhttp.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonawsShadedservices.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonawsShadedservices.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
at com.amazonawsShadedservices.s3.AmazonS3Client.getObject(AmazonS3Client.java:984)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:401)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:501)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.httpShadedconn.ConnectTimeoutException: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
at org.apache.httpShadedconn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:551)
at org.apache.httpShadedimpl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
at org.apache.httpShadedimpl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
at org.apache.httpShadedimpl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:641)
at org.apache.httpShadedimpl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:480)
at org.apache.httpShadedimpl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.httpShadedimpl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at com.amazonawsShadedhttp.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
… 21 more
1.2. However, once I disable the company’s VPN, the above call to SentenceDetectorDLModel works!
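For what it's worth, the timeout suggests the shaded AWS client inside the Fat Jar cannot reach S3 directly while the VPN is up. One possible (untested) workaround is to point the JVM at the corporate proxy when building the session; the host and port below are placeholders I made up, not real values:

```python
# Sketch: build the JVM proxy system properties that could be passed to the
# Spark driver so the shaded AWS S3 client routes downloads through the proxy.
# "proxy.mycompany.com" and "8080" are hypothetical -- substitute your own.
proxy_host = "proxy.mycompany.com"
proxy_port = "8080"

jvm_proxy_opts = (
    f"-Dhttp.proxyHost={proxy_host} -Dhttp.proxyPort={proxy_port} "
    f"-Dhttps.proxyHost={proxy_host} -Dhttps.proxyPort={proxy_port}"
)

# These options would then be added to the SparkSession builder shown above:
#   .config("spark.driver.extraJavaOptions", jvm_proxy_opts)
print(jvm_proxy_opts)
```

Whether the shaded client honors these standard Java proxy properties would still need to be confirmed with the maintainers.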
2.1. Using Spark NLP version 3.0.1 I get a NullPointerException back:
spark = SparkSession.builder\
.appName("Spark NLP")\
.master("local[4]")\
.config("spark.driver.memory","16G")\
.config("spark.driver.maxResultSize", "0")\
.config("spark.kryoserializer.buffer.max", "2000M")\
.config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")\
.getOrCreate()
spark
Apache Spark version: 3.1.1 Spark NLP version 3.0.1
sentence_detector_dl download started this may take some time.
Py4JJavaError Traceback (most recent call last)
<ipython-input-9-ee9de09a890f> in <module>
1 sentencerDL = SentenceDetectorDLModel
----> 2 .pretrained("sentence_detector_dl", "en")
3 .setInputCols(["document"])
4 .setOutputCol("sentences")
5
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)
3107 def pretrained(name="sentence_detector_dl", lang="en", remote_loc=None):
3108 from sparknlp.pretrained import ResourceDownloader
-> 3109 return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)
3110
3111
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
30 def downloadModel(reader, name, language, remote_loc=None, j_dwn='PythonResourceDownloader'):
31 print(name + " download started this may take some time.")
---> 32 file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
33 if file_size == "-1":
34 print("Can not find the model to download please check the name!")
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, name, language, remote_loc)
190 def __init__(self, name, language, remote_loc):
191 super(_GetResourceSize, self).__init__(
---> 192 "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)
193
194
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, java_obj, *args)
127 super(ExtendedJavaWrapper, self).__init__(java_obj)
128 self.sc = SparkContext._active_spark_context
---> 129 self._java_obj = self.new_java_obj(java_obj, *args)
130 self.java_obj = self._java_obj
131
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
137
138 def new_java_obj(self, java_class, *args):
---> 139 return self._new_java_obj(java_class, *args)
140
141 def new_java_array(self, pylist, java_class):
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
64 java_obj = getattr(java_obj, name)
65 java_args = [_py2java(sc, arg) for arg in args]
---> 66 return java_obj(*java_args)
67
68 @staticmethod
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
109 def deco(*a, **kw):
110 try:
---> 111 return f(*a, **kw)
112 except py4j.protocol.Py4JJavaError as e:
113 converted = convert_exception(e.java_exception)
~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
---> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: java.lang.NullPointerException
at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsernameEnvironment(ClientConfiguration.java:874)
at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsername(ClientConfiguration.java:902)
at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.getProxyUsername(HttpClientSettings.java:90)
at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.isAuthenticatedProxy(HttpClientSettings.java:182)
at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.addProxyConfig(ApacheHttpClientFactory.java:96)
at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:75)
at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:38)
at com.amazonaws.ShadedByJSLhttp.AmazonHttpClient.<init>(AmazonHttpClient.java:324)
at com.amazonaws.ShadedByJSLhttp.AmazonHttpClient.<init>(AmazonHttpClient.java:308)
at com.amazonaws.ShadedByJSLAmazonWebServiceClient.<init>(AmazonWebServiceClient.java:229)
at com.amazonaws.ShadedByJSLAmazonWebServiceClient.<init>(AmazonWebServiceClient.java:181)
at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:617)
at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:597)
at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:575)
at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:542)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.client$lzycompute(S3ResourceDownloader.scala:45)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.client(S3ResourceDownloader.scala:36)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:401)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:501)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
2.2. If I disable the company's VPN, I get the same NullPointerException as above in 2.1.
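The NPE is thrown while the shaded AWS client reads proxy credentials from the environment, so one plausible trigger is a partially populated proxy configuration inherited by the JVM. As a quick diagnostic (a sketch, not a fix), the proxy-related environment variables could be listed before starting the session:

```python
import os

# List the proxy-related environment variables the JVM might inherit.
# A proxy host set without matching credentials (or vice versa) is one
# plausible trigger for the NPE seen in the shaded ClientConfiguration.
PROXY_VARS = ["http_proxy", "https_proxy", "no_proxy",
              "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"]

set_proxy_vars = {v: os.environ[v] for v in PROXY_VARS if v in os.environ}
print(set_proxy_vars)
```

If any of these are set, temporarily unsetting them before launching Jupyter might help isolate the cause.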
Expected Behavior
I would like to use your code behind the company's firewall, and more importantly from AWS SageMaker. I test it first on my work laptop, so I would like to have it working there as well.
Current Behavior
Not working. I got a Healthcare temp license, which expires in a couple of days, and so far I have not been able to run any of your code behind the company's firewall.
So, after setting up the Spark NLP session using the Fat Jars, when using a pretrained model such as:
sentencerDL = SentenceDetectorDLModel \
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")
it fails.
Possible Solution
I like the idea of using Fat Jars, but I need them to be functional.
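Another workaround worth exploring (a sketch, not verified against the Fat Jars): download the pretrained model archive from the Models Hub on a machine with internet access, copy it behind the firewall, and load it by local path so no S3 call is made. The path below is hypothetical:

```python
# Hypothetical local directory where the extracted model archive was copied.
model_dir = "/Users/filotio/models/sentence_detector_dl_en"

# With the archive extracted there, the .pretrained() call that triggers the
# S3 download could be replaced by a local load, e.g.:
#
#   from sparknlp.annotator import SentenceDetectorDLModel
#   sentencerDL = SentenceDetectorDLModel.load(model_dir) \
#       .setInputCols(["document"]) \
#       .setOutputCol("sentences")
print(model_dir)
```

This sidesteps the downloader entirely, which may be the only option on networks where outbound S3 traffic is blocked.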
Steps to Reproduce
Tested on my work macOS (Catalina, latest version) using the installation instructions at https://nlp.johnsnowlabs.com/docs/en/install#python for both setups:
$ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
$ pip install spark-nlp==3.0.1 pyspark==3.1.1
$ pip install jupyter
$ jupyter notebook
and
$ java -version
$ conda create -n spark-nlp python=3.7 -y
$ conda activate spark-nlp
$ pip install spark-nlp==2.7.5 pyspark==2.4.4
$ pip install jupyter
$ jupyter notebook
I pretty much follow the code from https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb#scrollTo=KvNuyGXpD7Nt, but using the Fat Jars instead:
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[4]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar") \
    .getOrCreate()
and the moment I hit this code:
sentencerDL = SentenceDetectorDLModel \
    .pretrained("sentence_detector_dl", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")
I get the errors above (a NullPointerException for spark-nlp 3.0.x and a timeout for spark-nlp 2.7.x).
Context
Your Environment
- Spark NLP version (sparknlp.version()): 3.0.1
- Apache Spark version (spark.version): 3.1.1
- Java version (java -version): openjdk version "1.8.0_282"; OpenJDK Runtime Environment (build 1.8.0_282-bre_2021_01_20_16_37-b00); OpenJDK 64-Bit Server VM (build 25.282-b00, mixed mode)
- Conda: latest release.
- Operating System and version: macOS Catalina, latest release.
@maziyarpanahi I still hope to be able to use JSL software. Please have someone contact me to test/debug further. Thank you.
Oh, I sent you the Fat Jar via Slack. I’ll put it on S3 and share the link. All of our models/pipelines are here: https://nlp.johnsnowlabs.com/models
They all have examples and download link inside.