
boto3 seems to be breaking with apache spark in yarn mode. - `NoCredentialsError: Unable to locate credentials`.


This is a bit weird and I cannot rule out that I am doing something stupid.

With Apache Spark 2.0.0 on Hortonworks Data Platform 2.5 (HDP 2.5), I am seeing that parallelised tasks of jobs running through YARN are unable to locate credentials. I am very sure that the user I am running as (centos) has the credentials stored in the right place (~/.aws); I have tested this very thoroughly with vanilla Python boto3 and the awscli.

I have a couple of boto3 calls. One runs before the parallelism and works:

for object in my_bucket.objects.filter(Prefix='1971-01'):

and this one is supposed to run in parallel, downloading the object. This is the call that seems to be failing:

s3obj = boto3.resource('s3').Object(bucket_name='time-waits-for-no-man', key=s3Key)

The job fails with:

NoCredentialsError: Unable to locate credentials.

Stacktrace:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, hadoop002.dbszod.aws.db.de): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/yarn/local/usercache/centos/appcache/application_1480271222291_0048/container_1480271222291_0048_01_000020/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/hadoop/yarn/local/usercache/centos/appcache/application_1480271222291_0048/container_1480271222291_0048_01_000020/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/yarn/local/usercache/centos/appcache/application_1480271222291_0048/container_1480271222291_0048_01_000020/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1306, in takeUpToNumLeft
  File "/home/centos/fun-functions/spark-parrallel-read-from-s3/tick.py", line 38, in distributedJsonRead
  File "/usr/lib/python2.7/site-packages/boto3/resources/factory.py", line 520, in do_action
    response = action(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(**params)
  File "/usr/lib/python2.7/site-packages/botocore/client.py", line 251, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/lib/python2.7/site-packages/botocore/client.py", line 526, in _make_api_call
    operation_model, request_dict)
  File "/usr/lib/python2.7/site-packages/botocore/endpoint.py", line 141, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/lib/python2.7/site-packages/botocore/endpoint.py", line 166, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/usr/lib/python2.7/site-packages/botocore/endpoint.py", line 150, in create_request
    operation_name=operation_model.name)
  File "/usr/lib/python2.7/site-packages/botocore/hooks.py", line 227, in emit
    return self._emit(event_name, kwargs)
  File "/usr/lib/python2.7/site-packages/botocore/hooks.py", line 210, in _emit
    response = handler(**kwargs)
  File "/usr/lib/python2.7/site-packages/botocore/signers.py", line 90, in handler
    return self.sign(operation_name, request)
  File "/usr/lib/python2.7/site-packages/botocore/signers.py", line 147, in sign
    auth.add_auth(request)
  File "/usr/lib/python2.7/site-packages/botocore/auth.py", line 678, in add_auth
    raise NoCredentialsError
NoCredentialsError: Unable to locate credentials

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:85)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


I am not sure whether it is relevant, but the last thing I can see in the botocore debug output is:

2016-11-30 22:36:47,955 botocore.hooks [DEBUG] Event needs-retry.s3.ListObjects: calling handler <botocore.retryhandler.RetryHandler object at 0x20f7310>
2016-11-30 22:36:47,955 botocore.retryhandler [DEBUG] No retry needed.
2016-11-30 22:36:47,955 botocore.hooks [DEBUG] Event needs-retry.s3.ListObjects: calling handler <bound method S3RegionRedirector.redirect_from_error of <botocore.utils.S3RegionRedirector object at 0x223a6d0>>
2016-11-30 22:36:47,955 botocore.hooks [DEBUG] Event after-call.s3.ListObjects: calling handler <function decode_list_object at 0x16c3b90>
2016-11-30 22:36:47,956 botocore.hooks [DEBUG] Event creating-resource-class.s3.ObjectSummary: calling handler <function _handler at 0x1bd7488>

The full code (please excuse the mess):

import boto3
import ujson
import arrow
import sys
import os
from pyspark.sql import SQLContext
from pyspark import SparkContext

boto3.set_stream_logger('botocore', level='DEBUG')
sc = SparkContext()

version = sys.version
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")
LOGGER.info("Python Version: " + version)

s3_list = []
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('time-waits-for-no-man')
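# Driver-side listing: this loop succeeds, since ~/.aws is readable here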
for object in my_bucket.objects.filter(Prefix='1971-01'):
    s3_list.append(object.key)

def add_timestamp(dict):
    dict['timestamp'] = arrow.get(
                        int(dict['year']),
                        int(dict['month']),
                        int(dict['day']),
                        int(dict['hour']),
                        int(dict['minute']),
                        int(dict['second'])
                        ).timestamp
    return dict

def distributedJsonRead(s3Key):
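    # Executor-side call: this is where NoCredentialsError is raised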
    s3obj = boto3.resource('s3').Object(bucket_name='time-waits-for-no-man', key=s3Key)
    contents = s3obj.get()['Body'].read().decode()
    meow = contents.splitlines()
    result_wo_timestamp = map(ujson.loads, meow)
    result_wi_timestamp = map(add_timestamp, result_wo_timestamp)
    return result_wi_timestamp

sqlContext = SQLContext(sc)
job = sc.parallelize(s3_list)
foo = job.flatMap(distributedJsonRead)
df = foo.toDF()
#df.show()
blah = df.count()
print(blah)
df.printSchema()

#df.write.parquet('dates_by_seconds', mode="overwrite", partitionBy=["second"])
sc.stop()
exit()

[centos@hadoop003 ~]$ cat .aws/config

[default]
region = eu-central-1

[Boto]

proxy = webproxy.foo.de
proxy_port = 8080

[centos@hadoop003 ~]$ cat .aws/credentials

[default]
aws_access_key_id = XXXXXXXXXXXXXXXXXX
aws_secret_access_key = XxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXx
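
One thing that might be worth checking in this setup (my assumption, not something raised in the thread): YARN containers do not necessarily run with HOME set to /home/centos, so ~/.aws may not resolve to the same place on the executors as it does on the driver. A quick probe, reusing the sc from the script above:

import os

def probe(_):
    # Report what the executor thinks its home directory and
    # credentials path are, and whether the file actually exists.
    path = os.path.expanduser('~/.aws/credentials')
    return [(os.environ.get('HOME'), path, os.path.exists(path))]

print(sc.parallelize([0], 1).flatMap(probe).collect())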

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 13 (2 by maintainers)

Top GitHub Comments

5 reactions
tly1980 commented, Dec 30, 2017

Please have a look at this SO thread: https://stackoverflow.com/questions/37950728/boto3-cannot-create-client-on-pyspark-worker/42102858#42102858

This happens because botocore needs to load data files from the location where it is installed.

If you package your Python app and its dependencies together as a zip, you are likely to hit this issue, because those files cannot be read from inside a zip archive.

The workaround is to install boto3 on every instance of your Spark cluster.
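
For completeness, here is a minimal sketch of the other workaround discussed in that SO thread: resolve the credentials once on the driver (where ~/.aws is readable) and hand them to the executors explicitly, so the workers never touch the default credential chain. The bucket name and function name are taken from the issue above; treat this as an illustration rather than a drop-in fix.

import boto3
from botocore.session import get_session

# Resolve credentials on the driver; these are plain strings and
# pickle cleanly into the task closures.
creds = get_session().get_credentials()
ACCESS_KEY, SECRET_KEY = creds.access_key, creds.secret_key

def distributedJsonRead(s3Key):
    # Build the resource with explicit keys so the executor skips the
    # default credential chain entirely. (For temporary credentials you
    # would also need creds.token as aws_session_token.)
    s3 = boto3.resource('s3',
                        aws_access_key_id=ACCESS_KEY,
                        aws_secret_access_key=SECRET_KEY)
    obj = s3.Object(bucket_name='time-waits-for-no-man', key=s3Key)
    return obj.get()['Body'].read().decode().splitlines()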

2 reactions
ucrkarthik commented, Dec 23, 2017

I am having the same problem when I run this on AWS EMR.

#!/usr/bin/python

import boto3

def sendTestBoto3S3():
    client = boto3.client('s3', region_name='us-east-1')
    response = client.list_buckets()

    print(response)

def mainTest():
    sendTestBoto3S3()

if __name__ == '__main__':
    mainTest()

Here is my error:

Traceback (most recent call last):
  File "mainTest.py", line 18, in <module>
    mainTest()
  File "mainTest.py", line 15, in mainTest
    sendTestBoto3S3()
  File "mainTest.py", line 9, in sendTestBoto3S3
    client = boto3.client('s3', region_name='us-east-1')
  File "/mnt/yarn/usercache/hadoop/appcache/application_1514061511432_0008/container_1514061511432_0008_01_000001/emsencodinglibs.zip/boto3/__init__.py", line 83, in client
    
  File "/mnt/yarn/usercache/hadoop/appcache/application_1514061511432_0008/container_1514061511432_0008_01_000001/emsencodinglibs.zip/boto3/session.py", line 263, in client
  File "/mnt/yarn/usercache/hadoop/appcache/application_1514061511432_0008/container_1514061511432_0008_01_000001/emsencodinglibs.zip/botocore/session.py", line 851, in create_client
  File "/mnt/yarn/usercache/hadoop/appcache/application_1514061511432_0008/container_1514061511432_0008_01_000001/emsencodinglibs.zip/botocore/session.py", line 726, in get_component
  File "/mnt/yarn/usercache/hadoop/appcache/application_1514061511432_0008/container_1514061511432_0008_01_000001/emsencodinglibs.zip/botocore/session.py", line 922, in get_component
  File "/mnt/yarn/usercache/hadoop/appcache/application_1514061511432_0008/container_1514061511432_0008_01_000001/emsencodinglibs.zip/botocore/session.py", line 189, in create_default_resolver
  File "/mnt/yarn/usercache/hadoop/appcache/application_1514061511432_0008/container_1514061511432_0008_01_000001/emsencodinglibs.zip/botocore/loaders.py", line 132, in _wrapper
  File "/mnt/yarn/usercache/hadoop/appcache/application_1514061511432_0008/container_1514061511432_0008_01_000001/emsencodinglibs.zip/botocore/loaders.py", line 424, in load_data
botocore.exceptions.DataNotFoundError: Unable to load data for: endpoints
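
That DataNotFoundError is consistent with the zip explanation above: botocore loads its JSON data files (endpoint definitions, service models) with plain file reads, which fail when the package ships inside emsencodinglibs.zip. A quick diagnostic sketch, assuming a stock botocore install:

import botocore.loaders

# The loader's search paths should be real directories on disk;
# entries pointing inside a .zip cannot be read and lead to
# DataNotFoundError: Unable to load data for: endpoints.
loader = botocore.loaders.create_loader()
print(loader.search_paths)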
