AccessDenied errors loading healthcare models

When creating annotators from pretrained models in versions 2.7.4 and 2.7.5, we get AccessDenied errors when running Spark locally (master = “local”). Version 2.7.3 works fine, and running 2.7.4/2.7.5 on an AWS EMR cluster also works.

We specify the AWS credentials in the .aws/credentials file under the [spark_nlp] profile.

Current Behavior

We get the following error when trying to load a pretrained model from the clinical/models location.

Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 7JGS2EJM96C6SHGG, AWS Error Code: AccessDenied, AWS Error Message: Access Denied, S3 Extended Request ID: KgsQ87K74+KaiKhc7arCbcuVtc8POle+iUPrzaJ2XG5ECu37HiN5VDNzqKBs4DunvOG1yqEJlwg=
	at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:984)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:403)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:501)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Steps to Reproduce

import sparknlp
from pyspark.sql import SparkSession
from sparknlp.annotator import SentenceDetectorDLModel

# Local Spark session that pulls the 2.7.5 jar, one of the two affected versions.
spark = (
    SparkSession.builder
    .appName("Spark NLP")
    .master("local[4]")
    .config("spark.driver.memory", "16G")
    .config("spark.driver.maxResultSize", "0")
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.5")
    .config("spark.kryoserializer.buffer.max", "1000M")
    .getOrCreate()
)

# Loading a pretrained model from clinical/models raises the AccessDenied error.
sent_detect = (
    SentenceDetectorDLModel
    .pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols(["document"])
    .setOutputCol("sentences")
)

Context

Your Environment

Spark NLP version sparknlp.version(): 2.7.4
Apache Spark version spark.version: 2.4.5
Java version java -version: openjdk version “1.8.0_282”

Top GitHub Comments

dkincaid commented, Mar 22, 2021

I tried the new 3.0.0 version that was released today and this issue seems to be fixed in that version.
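
Based only on the versions named in this thread (2.7.4 and 2.7.5 fail locally, 2.7.3 and 3.0.0 work), a small guard could flag an affected install before any model download is attempted. This is a rough sketch: the helper names `parse_version` and `is_reported_affected` are hypothetical, and it assumes the version string is purely numeric, such as the value printed by sparknlp.version().

```python
def parse_version(text):
    """Turn a dotted numeric version string such as '2.7.5' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Versions reported as failing in this thread; not an official list.
REPORTED_AFFECTED = {(2, 7, 4), (2, 7, 5)}


def is_reported_affected(version_text):
    """Return True if the version matches one reported as failing in this thread."""
    return parse_version(version_text) in REPORTED_AFFECTED


for candidate in ("2.7.3", "2.7.4", "2.7.5", "3.0.0"):
    print(candidate, is_reported_affected(candidate))
```

In a real session the argument would be the installed version string rather than a hard-coded literal.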

dkincaid commented, Mar 10, 2021

Versions 2.7.3 and earlier of the Spark NLP jar work with AWS credentials stored in the .aws/credentials file. The problem with specifying them in environment variables is that only one set of credentials can be in effect at a time. That does not work for us, since we also need our own credentials to access our AWS resources. As far as I know, the only way to have several sets of credentials set up is to use profiles, which is what the .aws/credentials file provides. I suspect the problem with 2.7.4 and 2.7.5 has to do with reverting to a really old version of the AWS SDK (1.7.4). Version 2.7.3 uses 1.11.603, and you were using 1.11.603 at least as far back as 2.5.0.
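
To illustrate the point about profiles, here is a minimal sketch of how several credential sets can sit side by side in one credentials file and be selected by name. It uses only Python's standard configparser module. The helper `read_profile`, the `company_data` profile name, the placeholder key values, and the temporary file are all hypothetical, and this is not how Spark NLP or the AWS SDK resolves credentials internally.

```python
import configparser
import os
import tempfile

# Hypothetical contents in the style of ~/.aws/credentials; all values are placeholders.
SAMPLE_CREDENTIALS = """\
[company_data]
aws_access_key_id = OUR_OWN_KEY_ID
aws_secret_access_key = OUR_OWN_SECRET

[spark_nlp]
aws_access_key_id = SPARK_NLP_KEY_ID
aws_secret_access_key = SPARK_NLP_SECRET
"""


def read_profile(path, profile):
    """Return the key pair stored under one named profile of a credentials file."""
    parser = configparser.ConfigParser()
    parser.read(path)
    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found in {path}")
    section = parser[profile]
    return {
        "aws_access_key_id": section["aws_access_key_id"],
        "aws_secret_access_key": section["aws_secret_access_key"],
    }


# Write the sample to a temporary file so the sketch never touches a real credentials file.
with tempfile.NamedTemporaryFile("w", suffix=".ini", delete=False) as handle:
    handle.write(SAMPLE_CREDENTIALS)
    credentials_path = handle.name

own = read_profile(credentials_path, "company_data")
licensed = read_profile(credentials_path, "spark_nlp")
os.remove(credentials_path)

# Both credential sets are available at once, which a single pair of
# environment variables cannot provide.
print(own["aws_access_key_id"], licensed["aws_access_key_id"])
```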