`google.auth.exceptions.RefreshError` with excessive concurrent requests.
gcsfs propagates a `google.auth.exceptions.RefreshError` when executing many concurrent requests from a single node using the `google_default` credentials class. This is most likely caused by a repeated, excessive number of requests to the internal metadata service; this is a known bug in the external library, tracked at GoogleCloudPlatform/google-auth-library-python#211.
Anecdotally, I have primarily observed this in `dask.distributed` workers and believe it may be caused by the way `GCSFile` objects are distributed. It typically occurs when a large number of small files are being read from storage and many worker threads perform concurrent reads. I believe the `GCSFile`s serialized into dask tasks then each instantiate a separate `GCSFileSystem`, resolve credentials, and open a session.
If this is the case, it would be preferable to store a fixed set of `AuthenticatedSession` handles, ideally via a cache on the `GCSFileSystem` class, and dispatch to an auth-method-specific session in the `GCSFileSystem._connect_*` connection functions.
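A minimal sketch of what such a cache could look like. The class name `SessionCache`, the key layout, and the `factory` callback are all hypothetical illustrations, not gcsfs API; the point is only that every deserialized file object would dispatch through one shared, already-authenticated session per parameter set instead of refreshing credentials independently:

```python
import threading


class SessionCache:
    """Hypothetical process-wide cache of authenticated sessions.

    Keyed by auth parameters so that, e.g., all GCSFile objects
    deserialized in one dask worker share a single session instead
    of each hitting the metadata service to refresh credentials.
    """

    _lock = threading.Lock()
    _sessions = {}

    @classmethod
    def get(cls, key, factory):
        # Double-checked locking: only one thread per key pays the
        # (expensive, rate-limited) authentication cost.
        if key not in cls._sessions:
            with cls._lock:
                if key not in cls._sessions:
                    cls._sessions[key] = factory()
        return cls._sessions[key]
```

A `_connect_*` function could then call something like `SessionCache.get(("google_default", project), make_session)` rather than building a fresh session each time.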
As a more targeted fix, `google.auth.exceptions.RefreshError` (or its base class) could be added to the retriable exception list in `_call`; however, this may mask legitimate authentication errors. The credentials should probably be "tested" during session initialization via a call that does not retry this error. This may be as simple as calling `session.credentials.refresh` or performing a single authenticated request after the session is initialized.
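One way the retry-plus-upfront-validation idea could look. This is a self-contained sketch, not gcsfs's actual `_call`: `RefreshError` here is a local stand-in for `google.auth.exceptions.RefreshError`, and the retry policy is illustrative:

```python
import time


class RefreshError(Exception):
    """Stand-in for google.auth.exceptions.RefreshError."""


# The proposal: include RefreshError among the transient errors.
RETRIABLE = (ConnectionError, RefreshError)


def call_with_retries(func, *args, retries=5, base_delay=0.1):
    """Retry transient errors with exponential backoff.

    Because retrying RefreshError can mask genuine auth failures,
    credentials should be validated once, WITHOUT this retry, at
    session-initialization time (e.g. a direct credentials.refresh
    or one authenticated request).
    """
    for attempt in range(retries):
        try:
            return func(*args)
        except RETRIABLE:
            if attempt == retries - 1:
                raise  # exhausted: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)
```

With this split, a `RefreshError` raised during the validated init is a real credential problem, while one raised mid-workload is treated as metadata-service throttling and retried.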
Issue Analytics
- Created 6 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
gcsfs is used routinely with Dask, but does not guarantee thread-safety. Specifically, if you use the same set of parameters when instantiating (which would be true for your example), only one instance is created and shared, so only one auth request is sent. However, the underlying library, `requests`, is almost, but not entirely, thread-safe: apparently it is possible for connections to be dropped if a pool fills up, but that seems very unlikely in this kind of use (and should be covered by internal retries). Directory listings could also potentially fall out of sync, but the code aggressively purges the cache when writing, and in the dask scenario listings are usually done just once in the client.
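A toy illustration of the instance-sharing behaviour described above: constructing with the same parameters returns the same shared object, so only one auth handshake happens per parameter set. gcsfs gets this from its base class; the class below is purely illustrative, not gcsfs code:

```python
import threading


class CachedFileSystem:
    """Toy model of same-parameters instance caching.

    Two constructions with identical parameters yield the SAME
    object, so the expensive auth step runs once per parameter set.
    """

    _instances = {}
    _lock = threading.Lock()

    def __new__(cls, **params):
        key = tuple(sorted(params.items()))
        with cls._lock:
            if key not in cls._instances:
                inst = super().__new__(cls)
                inst.params = params  # first caller would authenticate here
                cls._instances[key] = inst
            return cls._instances[key]
```

So in the questioner's scenario, every thread in a dask worker that builds a filesystem with the same parameters ends up talking through one shared instance.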
Is `gcsfs` thread-safe? A dask worker could be running multiple threads. For example:
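The original snippet did not survive extraction; what follows is a hedged stand-in for the usage pattern being asked about, many worker threads reading small objects through one shared handle. `read_file` is an injected callable standing in for something like `fs.cat(path)` on a single shared `GCSFileSystem`:

```python
from concurrent.futures import ThreadPoolExecutor


def read_many(read_file, paths, max_workers=8):
    """Read many small objects concurrently from one shared handle.

    In a dask worker, `read_file` would be e.g. `fs.cat` bound to a
    single shared filesystem instance, with each worker thread
    issuing its own read. If that instance is not thread-safe, this
    is exactly the pattern that would expose it.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_file, paths))
```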