boto3 seems to be breaking with apache spark in yarn mode. - `NoCredentialsError: Unable to locate credentials`.
This is a bit weird, and I cannot rule out that I am doing something stupid.
With Apache Spark 2.0.0 on Hortonworks Data Platform 2.5 (HDP 2.5), the parallelised tasks of jobs running through YARN are not able to locate credentials. I am very sure that the user I am using (centos) has the credentials stored in the right place (~/.aws); I have tested this thoroughly with vanilla Python boto3 and the awscli.
I have a couple of boto calls. The first one runs before the parallelism and works:
for object in my_bucket.objects.filter(Prefix='1971-01'):
The second one is supposed to run in parallel, downloading each object. This seems to be the call that fails:
s3obj = boto3.resource('s3').Object(bucket_name='time-waits-for-no-man', key=s3Key)
The job fails with `NoCredentialsError: Unable to locate credentials`.
Stacktrace:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, hadoop002.dbszod.aws.db.de): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/yarn/local/usercache/centos/appcache/application_1480271222291_0048/container_1480271222291_0048_01_000020/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/hadoop/yarn/local/usercache/centos/appcache/application_1480271222291_0048/container_1480271222291_0048_01_000020/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/hadoop/yarn/local/usercache/centos/appcache/application_1480271222291_0048/container_1480271222291_0048_01_000020/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1306, in takeUpToNumLeft
File "/home/centos/fun-functions/spark-parrallel-read-from-s3/tick.py", line 38, in distributedJsonRead
File "/usr/lib/python2.7/site-packages/boto3/resources/factory.py", line 520, in do_action
response = action(self, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/boto3/resources/action.py", line 83, in __call__
response = getattr(parent.meta.client, operation_name)(**params)
File "/usr/lib/python2.7/site-packages/botocore/client.py", line 251, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/lib/python2.7/site-packages/botocore/client.py", line 526, in _make_api_call
operation_model, request_dict)
File "/usr/lib/python2.7/site-packages/botocore/endpoint.py", line 141, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/lib/python2.7/site-packages/botocore/endpoint.py", line 166, in _send_request
request = self.create_request(request_dict, operation_model)
File "/usr/lib/python2.7/site-packages/botocore/endpoint.py", line 150, in create_request
operation_name=operation_model.name)
File "/usr/lib/python2.7/site-packages/botocore/hooks.py", line 227, in emit
return self._emit(event_name, kwargs)
File "/usr/lib/python2.7/site-packages/botocore/hooks.py", line 210, in _emit
response = handler(**kwargs)
File "/usr/lib/python2.7/site-packages/botocore/signers.py", line 90, in handler
return self.sign(operation_name, request)
File "/usr/lib/python2.7/site-packages/botocore/signers.py", line 147, in sign
auth.add_auth(request)
File "/usr/lib/python2.7/site-packages/botocore/auth.py", line 678, in add_auth
raise NoCredentialsError
NoCredentialsError: Unable to locate credentials
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
I am not sure it is relevant, but the last thing I can see in the botocore debug output is:
2016-11-30 22:36:47,955 botocore.hooks [DEBUG] Event needs-retry.s3.ListObjects: calling handler <botocore.retryhandler.RetryHandler object at 0x20f7310>
2016-11-30 22:36:47,955 botocore.retryhandler [DEBUG] No retry needed.
2016-11-30 22:36:47,955 botocore.hooks [DEBUG] Event needs-retry.s3.ListObjects: calling handler <bound method S3RegionRedirector.redirect_from_error of <botocore.utils.S3RegionRedirector object at 0x223a6d0>>
2016-11-30 22:36:47,955 botocore.hooks [DEBUG] Event after-call.s3.ListObjects: calling handler <function decode_list_object at 0x16c3b90>
2016-11-30 22:36:47,956 botocore.hooks [DEBUG] Event creating-resource-class.s3.ObjectSummary: calling handler <function _handler at 0x1bd7488>
The full code (please excuse the mess):
import boto3
import ujson
import arrow
import sys
import os
from pyspark.sql import SQLContext
from pyspark import SparkContext

boto3.set_stream_logger('botocore', level='DEBUG')
sc = SparkContext()
version = sys.version
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")
LOGGER.info("Python Version: " + version)

s3_list = []
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('time-waits-for-no-man')
for object in my_bucket.objects.filter(Prefix='1971-01'):
    s3_list.append(object.key)

def add_timestamp(dict):
    dict['timestamp'] = arrow.get(
        int(dict['year']),
        int(dict['month']),
        int(dict['day']),
        int(dict['hour']),
        int(dict['minute']),
        int(dict['second'])
    ).timestamp
    return dict

def distributedJsonRead(s3Key):
    s3obj = boto3.resource('s3').Object(bucket_name='time-waits-for-no-man', key=s3Key)
    contents = s3obj.get()['Body'].read().decode()
    meow = contents.splitlines()
    result_wo_timestamp = map(ujson.loads, meow)
    result_wi_timestamp = map(add_timestamp, result_wo_timestamp)
    return result_wi_timestamp

sqlContext = SQLContext(sc)
job = sc.parallelize(s3_list)
foo = job.flatMap(distributedJsonRead)
df = foo.toDF()
#df.show()
blah = df.count()
print(blah)
df.printSchema()
#df.write.parquet('dates_by_seconds', mode="overwrite", partitionBy=["second"])
sc.stop()
exit()
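Since `distributedJsonRead` builds its boto3 resource inside the executor processes, one possible variation (not something confirmed in this thread) is to resolve the credentials once on the driver, where they are known to work, and hand them to the workers explicitly. The sketch below is only that: the helper names `session_kwargs`, `driver_session_kwargs` and `make_reader` are made up for illustration, and it assumes boto3 is importable on every executor. Note that the keys end up serialized into the task closure, which may not be acceptable on a shared cluster.

```python
def session_kwargs(access_key, secret_key, token=None, region=None):
    """Build keyword arguments for boto3.session.Session from plain strings."""
    kwargs = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    if token:
        kwargs["aws_session_token"] = token
    if region:
        kwargs["region_name"] = region
    return kwargs


def driver_session_kwargs(region=None):
    """Resolve credentials on the driver (hypothetical helper)."""
    import boto3  # imported lazily so session_kwargs has no dependencies

    creds = boto3.session.Session().get_credentials()
    if creds is None:
        raise RuntimeError("no credentials found on the driver either")
    frozen = creds.get_frozen_credentials()
    return session_kwargs(frozen.access_key, frozen.secret_key, frozen.token, region)


def make_reader(kwargs, bucket):
    """Return a function for rdd.flatMap that reads one S3 key into lines."""
    def read_lines(s3_key):
        import boto3  # runs on the executor

        # Explicit keys bypass the ~/.aws lookup that fails in the traceback.
        s3 = boto3.session.Session(**kwargs).resource("s3")
        body = s3.Object(bucket_name=bucket, key=s3_key).get()["Body"].read()
        return body.decode().splitlines()
    return read_lines


# Possible use in the script above (assumption, untested on HDP 2.5):
# kwargs = driver_session_kwargs(region="eu-central-1")
# lines = sc.parallelize(s3_list).flatMap(make_reader(kwargs, "time-waits-for-no-man"))
```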
[centos@hadoop003 ~]$ cat .aws/config
[default]
region = eu-central-1
[Boto]
proxy = webproxy.foo.de
proxy_port = 8080
[centos@hadoop003 ~]$ cat .aws/credentials
[default]
aws_access_key_id = XXXXXXXXXXXXXXXXXX
aws_secret_access_key = XxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXx
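The files above live in the home directory of centos on hadoop003. On YARN, the executor processes may run under a different OS account, or on hosts where that file was never created, so it can help to ask the executors what they actually see. The following is a minimal diagnostic sketch using only the standard library; the function name `describe_credentials_environment` is made up for this example, and it only checks the shared credentials file and the environment variables, not every source boto3 can use.

```python
import getpass
import os
import socket


def describe_credentials_environment(_=None):
    """Report where this process would look for the shared AWS credentials file."""
    # AWS_SHARED_CREDENTIALS_FILE overrides the default ~/.aws/credentials path.
    path = os.environ.get(
        "AWS_SHARED_CREDENTIALS_FILE",
        os.path.join(os.path.expanduser("~"), ".aws", "credentials"),
    )
    try:
        user = getpass.getuser()
    except Exception:
        # Some containers have no passwd entry for the current uid.
        user = None
    return {
        "host": socket.gethostname(),
        "user": user,
        "home": os.path.expanduser("~"),
        "credentials_file": path,
        "file_readable": os.access(path, os.R_OK),
        "env_keys_set": "AWS_ACCESS_KEY_ID" in os.environ,
    }


# On the driver:
#   print(describe_credentials_environment())
# On the executors (the argument is the ignored RDD element):
#   print(sc.parallelize(range(8), 8).map(describe_credentials_environment).collect())
```

If the executor output shows a different user or home directory than the driver, that would be consistent with the `NoCredentialsError` appearing only inside the parallel tasks.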
Issue Analytics
- State:
- Created 7 years ago
- Comments:13 (2 by maintainers)
Top GitHub Comments
Please have a look at this SO thread: https://stackoverflow.com/questions/37950728/boto3-cannot-create-client-on-pyspark-worker/42102858#42102858
This happens because boto3 downloads some files into the location where it is installed.
If you package your Python app and its dependencies together as a zip, you are likely to encounter this issue, because those files cannot be saved inside a zip archive.
The workaround is to install boto3 on every instance of your Spark cluster.
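To see whether this explanation applies to a given job, one could check whether the boto3 module that a worker imports is being loaded from inside a zip archive. This is a small sketch using only the standard library; `archive_containing` is a hypothetical helper name, not part of boto3 or PySpark.

```python
import os
import zipfile


def archive_containing(path):
    """Return the zip archive that `path` lives inside, or None if it is on disk."""
    current = os.path.abspath(path)
    while True:
        # A module inside a zip has a path like /x/deps.zip/boto3/__init__.py,
        # so one of its parent components is a regular file that is a zip.
        if os.path.isfile(current) and zipfile.is_zipfile(current):
            return current
        parent = os.path.dirname(current)
        if parent == current:
            return None
        current = parent


# Possible use inside a task running on an executor:
#   import boto3
#   print(archive_containing(boto3.__file__))
```

A result of `None` suggests boto3 is installed as ordinary files on that node; a path ending in `.zip` suggests it was shipped inside an archive.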
I am having the same problem when I run this in AWS EMR.
Here is my error: