azure.identity.ClientSecretCredential object can't be (cloud)pickled
- Package Name: azure.identity
- Package Version: 1.7.1
- Operating System: Ubuntu 20.04.3 LTS
- Python Version: 3.9.7
Describe the bug
An azure.identity.ClientSecretCredential object can't be (cloud)pickled, which makes it unusable in a multiprocessing context, e.g. with Dask.
To Reproduce
Steps to reproduce the behavior:
```python
import os

from azure.identity import ClientSecretCredential
import cloudpickle  # version 2.0.0

credential = ClientSecretCredential(
    tenant_id=os.getenv('ADLFS_TENANT_ID'),  # with valid env variables set
    client_id=os.getenv('ADLFS_CLIENT_ID'),
    client_secret=os.getenv('ADLFS_CLIENT_SECRET'),
    authority='login.microsoftonline.com/')

cloudpickle.dumps(credential)
```
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_31443/3356912537.py in <module>
      9     client_secret=os.getenv('ADLFS_CLIENT_SECRET'),
     10     authority='login.microsoftonline.com/')
---> 11 cloudpickle.dumps(credential)

/usr/local/continuum/miniconda3/envs/py39_ymp576_latest/lib/python3.9/site-packages/cloudpickle/cloudpickle_fast.py in dumps(obj, protocol, buffer_callback)
     71             file, protocol=protocol, buffer_callback=buffer_callback
     72         )
---> 73         cp.dump(obj)
     74         return file.getvalue()
     75

/usr/local/continuum/miniconda3/envs/py39_ymp576_latest/lib/python3.9/site-packages/cloudpickle/cloudpickle_fast.py in dump(self, obj)
    600     def dump(self, obj):
    601         try:
--> 602             return Pickler.dump(self, obj)
    603         except RuntimeError as e:
    604             if "recursion" in e.args[0]:

TypeError: cannot pickle '_thread._local' object
```
Note that sometimes the error is `TypeError: cannot pickle '_thread.RLock' object` instead. This seems to be due to two different offending locks in the ClientSecretCredential object, namely `_cache` (a `_thread.RLock`) and `_client` (a `_thread._local`).
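A quick way to confirm which attributes are the offenders is to probe them one by one. This is only a sketch: the attribute names are azure-identity internals, may differ between versions, and the unpicklable locks may also sit deeper inside nested objects.

```python
import pickle

# Try to pickle each attribute of the credential individually to find
# the unpicklable ones. `credential` is the ClientSecretCredential
# constructed in the repro snippet above.
for name, value in vars(credential).items():
    try:
        pickle.dumps(value)
    except TypeError as exc:
        print(f'{name}: {exc}')
```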
Expected behavior
I expect to be able to (cloud)pickle the ClientSecretCredential object such that it can be used in a multiprocessing setting.
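Concretely, a round-trip like the one below should succeed (today the `dumps` call raises `TypeError`); the scope string is just an example:

```python
import cloudpickle

# The round-trip that should work: serialize, deserialize, and the
# restored credential should still be able to fetch tokens.
restored = cloudpickle.loads(cloudpickle.dumps(credential))
token = restored.get_token('https://storage.azure.com/.default')
```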
Additional context
In data science tasks, multiprocessing via e.g. Dask is often essential. I therefore think it is absolutely necessary that objects like ClientSecretCredential can be pickled, so that Azure plays nicely with the scientific Python ecosystem for doing data science.
Specifically, my use case is saving Xarray datasets to an Azure Data Lake, e.g. something like:
```python
import os

from azure.identity import ClientSecretCredential
from azure.storage.blob import ContainerClient
import dask.array
import numpy as np
import xarray as xr
import zarr

# Set up the ADL connection
credential = ClientSecretCredential(
    tenant_id=os.getenv('ADLFS_TENANT_ID'),
    client_id=os.getenv('ADLFS_CLIENT_ID'),
    client_secret=os.getenv('ADLFS_CLIENT_SECRET'),
    authority='login.microsoftonline.com/')
container_client = ContainerClient(
    account_url='https://{my_storage_account}.blob.core.windows.net',
    container_name='{my_container}',
    credential=credential)
store = zarr.storage.ABSStore(
    client=container_client,
    prefix='test/')
group = 'ds.zarr'

# Create dummy data
ds = xr.Dataset(
    data_vars={'var_1': (('x'), dask.array.from_array(np.random.rand(10), chunks=2))},
    coords={'x': np.arange(10)})

# Save xarray to Zarr in ADL; compute using processes (not threads)
ds.to_zarr(store=store, group=group, compute=False).compute(scheduler='processes')
```
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_33027/4096524395.py in <module>
     28
     29 # Save xarray to Zarr in ADL
---> 30 ds.to_zarr(store=store, group=group, compute=False).compute(scheduler='processes')  # Compute using processes (not threads)

/usr/local/continuum/miniconda3/envs/py39_ymp576_latest/lib/python3.9/site-packages/dask/base.py in compute(self, **kwargs)
    286         dask.base.compute
    287         """
--> 288         (result,) = compute(self, traverse=False, **kwargs)
    289         return result
    290

/usr/local/continuum/miniconda3/envs/py39_ymp576_latest/lib/python3.9/site-packages/dask/base.py in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    569         postcomputes.append(x.__dask_postcompute__())
    570
--> 571     results = schedule(dsk, keys, **kwargs)
    572     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    573

/usr/local/continuum/miniconda3/envs/py39_ymp576_latest/lib/python3.9/site-packages/dask/multiprocessing.py in get(dsk, keys, num_workers, func_loads, func_dumps, optimize_graph, pool, chunksize, **kwargs)
    217     try:
    218         # Run
--> 219         result = get_async(
    220             pool.submit,
    221             pool._max_workers,

/usr/local/continuum/miniconda3/envs/py39_ymp576_latest/lib/python3.9/site-packages/dask/local.py in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    493             # Main loop, wait on tasks to finish, insert new ones
    494             while state["waiting"] or state["ready"] or state["running"]:
--> 495                 fire_tasks(chunksize)
    496                 for key, res_info, failed in queue_get(queue).result():
    497                     if failed:

/usr/local/continuum/miniconda3/envs/py39_ymp576_latest/lib/python3.9/site-packages/dask/local.py in fire_tasks(chunksize)
    475                 (
    476                     key,
--> 477                     dumps((dsk[key], data)),
    478                     dumps,
    479                     loads,

/usr/local/continuum/miniconda3/envs/py39_ymp576_latest/lib/python3.9/site-packages/cloudpickle/cloudpickle_fast.py in dumps(obj, protocol, buffer_callback)
     71             file, protocol=protocol, buffer_callback=buffer_callback
     72         )
---> 73         cp.dump(obj)
     74         return file.getvalue()
     75

/usr/local/continuum/miniconda3/envs/py39_ymp576_latest/lib/python3.9/site-packages/cloudpickle/cloudpickle_fast.py in dump(self, obj)
    600     def dump(self, obj):
    601         try:
--> 602             return Pickler.dump(self, obj)
    603         except RuntimeError as e:
    604             if "recursion" in e.args[0]:

TypeError: cannot pickle '_thread._local' object
```
A workaround for this problem is to delay the instantiation of the ClientSecretCredential object until it is actually needed. That way we can simply pass around the client ID and secret (plain strings, which pickle fine) in the Dask cluster.

Implementing this workaround for my use case is not trivial though, as the current interface to the Xarray/Zarr setup requires an instantiated credential object. However, I have come up with a very ugly ProxyCredential class that lets me pretend to have an instantiated credential object while really instantiating the ClientSecretCredential only when needed. This obviously comes with some overhead, but it doesn't seem too bad to be useful.
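The original class isn't reproduced in this thread, but a minimal sketch of the idea looks something like the following (illustrative only; the real ProxyCredential may differ):

```python
from azure.identity import ClientSecretCredential


class ProxyCredential:
    """Picklable stand-in for ClientSecretCredential.

    Holds only plain strings, so it pickles cleanly; the real (and
    unpicklable) credential is built lazily on each get_token() call,
    i.e. after the object has been shipped to a worker process.
    Illustrative sketch -- not the exact class from this issue.
    """

    def __init__(self, tenant_id, client_id, client_secret, authority=None):
        self.tenant_id = tenant_id
        self.client_id = client_id
        self.client_secret = client_secret
        self.authority = authority

    def get_token(self, *scopes, **kwargs):
        # Build the real credential only now, inside the worker.
        credential = ClientSecretCredential(
            tenant_id=self.tenant_id,
            client_id=self.client_id,
            client_secret=self.client_secret,
            authority=self.authority)
        return credential.get_token(*scopes, **kwargs)
```

Since the azure.storage clients duck-type token credentials (any object with a `get_token` method), this proxy can be passed as `credential=` to the ContainerClient above.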
Using this ProxyCredential, my use case works as expected, i.e. the example from above now runs.

Doing a /unresolved to get some feedback on @dhirschfeld's suggestion for making ManagedIdentityCredential picklable, which seems very reasonable to me.
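For reference, a common way to make such an object picklable is to drop the lock-holding state in `__getstate__` and rebuild the credential in `__setstate__`. A minimal sketch of that pattern on a subclass (hypothetical; not the fix actually proposed in the issue):

```python
from azure.identity import ClientSecretCredential


class PicklableClientSecretCredential(ClientSecretCredential):
    """Hypothetical subclass showing the general pattern: pickle only
    the constructor arguments and recreate the credential (locks,
    caches and all) on the receiving side."""

    def __init__(self, tenant_id, client_id, client_secret, **kwargs):
        # Assumes all constructor arguments are picklable (strings etc.).
        self._ctor_args = (tenant_id, client_id, client_secret, kwargs)
        super().__init__(tenant_id, client_id, client_secret, **kwargs)

    def __getstate__(self):
        # Drop the live state containing _thread.RLock / _thread._local
        # members; keep only what is needed to rebuild the credential.
        return self._ctor_args

    def __setstate__(self, state):
        # Re-run __init__ on the unpickled shell, recreating the locks.
        tenant_id, client_id, client_secret, kwargs = state
        self.__init__(tenant_id, client_id, client_secret, **kwargs)
```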