AttributeError when trying to perform a distributed read_parquet (maybe serialization issues)
What happened: I have a project where I dump tweet information to Parquet files and use Dask in a distributed environment to read those files and perform computations (although I usually write the files with Pandas/Arrow). In the dev environment, each worker runs in a Docker container, and so does the scheduler. From version 2.17.0 onwards, the Dask read_parquet function stopped working: I get an AttributeError on the scheduler and what seem to be serialization issues on the workers' side.
What you expected to happen: To be able to read/load/compute the contents of the Parquet file using a distributed Dask client.
Minimal Complete Verifiable Example: Probably not easily verifiable because of the specifics of my environment, but still:
import pandas
from distributed import Client
import dask.dataframe as dd
dask_client = Client("dask-scheduler:8786", set_as_default=False)
df = pandas.DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df.to_parquet("./test.parquet", engine="pyarrow")
dask_client.compute(dd.read_parquet("./test.parquet"), sync=True)
The traceback I get in the Jupyter notebook:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-c51f04ec0f2f> in <module>
6 df = pandas.DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
7 df.to_parquet("./test.parquet", engine="pyarrow")
----> 8 dask_client.compute(dd.read_parquet("./test.parquet"), sync=True)
/usr/local/lib/python3.8/site-packages/distributed/client.py in compute(self, collections, sync, optimize_graph, workers, allow_other_workers, resources, retries, priority, fifo_timeout, actors, traverse, **kwargs)
2928
2929 if sync:
-> 2930 result = self.gather(futures)
2931 else:
2932 result = futures
/usr/local/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1984 else:
1985 local_worker = None
-> 1986 return self.sync(
1987 self._gather,
1988 futures,
/usr/local/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
830 return future
831 else:
--> 832 return sync(
833 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
834 )
/usr/local/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
338 if error[0]:
339 typ, exc, tb = error[0]
--> 340 raise exc.with_traceback(tb)
341 else:
342 return result[0]
/usr/local/lib/python3.8/site-packages/distributed/utils.py in f()
322 if callback_timeout is not None:
323 future = asyncio.wait_for(future, callback_timeout)
--> 324 result[0] = yield future
325 except Exception as exc:
326 error[0] = sys.exc_info()
/usr/local/lib/python3.8/site-packages/tornado/gen.py in run(self)
760
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
/usr/local/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1849 exc = CancelledError(key)
1850 else:
-> 1851 raise exception.with_traceback(traceback)
1852 raise exc
1853 if errors == "skip":
/usr/local/lib/python3.8/site-packages/distributed/protocol/pickle.py in loads()
73 return pickle.loads(x, buffers=buffers)
74 else:
---> 75 return pickle.loads(x)
76 except Exception as e:
77 logger.info("Failed to deserialize %s", x[:10000], exc_info=True)
/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py in unpickleMethod()
97 return getattr(im_class, im_name)
98 try:
---> 99 methodFunction = _methodFunction(im_class, im_name)
100 except AttributeError:
101 log.msg("Method", im_name, "not on class", im_class)
/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py in _methodFunction()
74 @rtype: L{types.FunctionType}
75 """
---> 76 methodObject = getattr(classObject, methodName)
77 if _PY3:
78 return methodObject
AttributeError: type object 'type' has no attribute 'read_partition'
And the worker logs:
2021-01-13 12:44:58,673 | INFO | -------------------------------------------------
2021-01-13 12:44:58,677 | INFO | Registered to: tcp://dask-scheduler:8786
2021-01-13 12:44:58,677 | INFO | -------------------------------------------------
2021-01-13 12:45:17,298 | WARNING | Could not deserialize task
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/worker.py", line 2445, in _maybe_deserialize_task
    function, args, kwargs = _deserialize(*self.tasks[key])
  File "/usr/local/lib/python3.8/site-packages/distributed/worker.py", line 3281, in _deserialize
    args = pickle.loads(args)
  File "/usr/local/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 75, in loads
    return pickle.loads(x)
  File "/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py", line 99, in unpickleMethod
    methodFunction = _methodFunction(im_class, im_name)
  File "/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py", line 76, in _methodFunction
    methodObject = getattr(classObject, methodName)
AttributeError: type object 'type' has no attribute 'read_partition'
2021-01-13 12:45:45,310 | ERROR | type object 'type' has no attribute 'read_partition'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/worker.py", line 915, in handle_scheduler
    await self.handle_stream(
  File "/usr/local/lib/python3.8/site-packages/distributed/core.py", line 579, in handle_stream
    msgs = await comm.read()
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/tcp.py", line 204, in read
    msg = await from_frames(
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/utils.py", line 87, in from_frames
    res = _from_frames()
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/utils.py", line 65, in _from_frames
    return protocol.loads(
  File "/usr/local/lib/python3.8/site-packages/distributed/protocol/core.py", line 151, in loads
    value = _deserialize(head, fs, deserializers=deserializers)
  File "/usr/local/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 335, in deserialize
    return loads(header, frames)
  File "/usr/local/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 71, in pickle_loads
    return pickle.loads(x, buffers=buffers)
  File "/usr/local/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py", line 99, in unpickleMethod
    methodFunction = _methodFunction(im_class, im_name)
  File "/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py", line 76, in _methodFunction
    methodObject = getattr(classObject, methodName)
AttributeError: type object 'type' has no attribute 'read_partition'
2021-01-13 12:45:45,311 | INFO | Connection to scheduler broken. Reconnecting...
Anything else we need to know?:
- Reading the Parquet file with Pandas, pyarrow, or a Dask local cluster works fine.
- Downgrading the Dask/distributed packages to 2.16.0 makes it work again, although I read the 2.17.0 changelog and didn't find anything that looked likely to break this.
- Searching for similar issues/questions, I saw that in some cases a mismatch of package versions could be the cause, but I checked the key packages with get_versions() and they all seemed to match (both checks are sketched in the code after this list).
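For reference, here is a minimal sketch of how I run those two sanity checks (assuming the same ./test.parquet file and dask-scheduler address as in the example above; the LocalCluster sizing is arbitrary):

import dask.dataframe as dd
from distributed import Client, LocalCluster

# Check 1: the same read works against a purely local cluster.
local_client = Client(LocalCluster(n_workers=2, threads_per_worker=1))
print(dd.read_parquet("./test.parquet").compute())
local_client.close()

# Check 2: compare package versions across client, scheduler, and workers;
# check=True raises if key packages differ between them.
remote_client = Client("dask-scheduler:8786", set_as_default=False)
versions = remote_client.get_versions(check=True)
print(versions["scheduler"])
print(versions["workers"])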
Environment:
- Dask version: 2020.12.0 (distributed too)
- Python version: 3.8.6
- Operating System: Linux (Manjaro)
- Install method (conda, pip, source): pip
Most probably this is not a Dask issue but an error on my part somewhere, but I'm really stuck, and the question I asked a while ago on Stack Overflow went unanswered. I'll gladly provide any other information that could help (package versions, how the scheduler/workers are launched, etc.). Thanks for the help and for the Dask project in general!
Top GitHub Comments
The latest version of Dask refactored Parquet and CSV reading into so-called Blockwise layers. The ultimate goal is to make it easier to merge IO operations with follow-up operations on the same partition. Unfortunately, we didn't handle serialization correctly. dask#7048 has reverted these changes for the next Dask release. dask#7042 will add back the Blockwise changes in a later release (once we are more confident that the serialization is handled correctly). I'll close this once I can confirm that the dask reversion fixed this issue. I'll also be sure to check that 7042 works before it is merged.
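For anyone who wants to check locally whether the tasks produced by read_parquet survive serialization, here is a minimal sketch; this is only an approximation, since distributed uses its own serialization (plain pickle with a cloudpickle fallback) rather than a straight pickle round-trip of the graph:

import pickle

import dask.dataframe as dd

ddf = dd.read_parquet("./test.parquet")

# Materialize the low-level graph and try to round-trip every task through
# plain pickle; any task whose callable cannot be pickled will show up here.
graph = dict(ddf.__dask_graph__())
for key, task in graph.items():
    try:
        pickle.loads(pickle.dumps(task))
    except Exception as exc:
        print(f"task {key!r} does not survive a pickle round-trip: {exc!r}")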
Hi again!
I have had some time today to look at this issue once again, debugging step by step what happens inside client.compute(). I should add that, although I hit the issue in my Django project and in the linked Jupyter kernel (based on Django-Shell-Plus), I did not hit it when executing the same code from a Python shell in any of the containers that make up the environment.
Thus, I finally found that the core of the problem arises from the pickling of the task done in warn_dumps, specifically in the pickling of ArrowLegacyEngine.read_partition.
In my Jupyter notebook, pickling the function with pickle.dumps(ArrowLegacyEngine.read_partition) results in this:
b'\x80\x04\x95\x90\x00\x00\x00\x00\x00\x00\x00\x8c\x18twisted.persisted.styles\x94\x8c\x0eunpickleMethod\x94\x93\x94\x8c\x0eread_partition\x94\x8c\x1fdask.dataframe.io.parquet.arrow\x94\x8c\x11ArrowLegacyEngine\x94\x93\x94\x8c\x08builtins\x94\x8c\x04type\x94\x93\x94\x87\x94R\x94.'
This seems to be wrong, because trying to pickle.loads() that bytestream results in the AttributeError shown above.
On the other hand, pickle.dumps(ArrowLegacyEngine.read_partition), when done from a regular Python shell in a container, returns this:
b'\x80\x04\x95e\x00\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x07getattr\x94\x93\x94\x8c\x1fdask.dataframe.io.parquet.arrow\x94\x8c\x11ArrowLegacyEngine\x94\x93\x94\x8c\x0eread_partition\x94\x86\x94R\x94.'
Which, when passed to pickle.loads(), gives the expected bound method:
<bound method ArrowDatasetEngine.read_partition of <class 'dask.dataframe.io.parquet.arrow.ArrowLegacyEngine'>>
Any ideas about what could be causing this? I checked that pickle.format_version and dask.__version__ were the same in both environments. Sorry, but I don't know how to make this reproducible (it only seems to happen within my project's environment).
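In case it helps narrow this down: the bad payload routes through twisted.persisted.styles, and importing that module registers a reducer for bound methods via copyreg, so one thing to compare between the two environments is whether that reducer is installed. A minimal diagnostic sketch, assuming the same ArrowLegacyEngine import path as above:

import copyreg
import pickle
import types

from dask.dataframe.io.parquet.arrow import ArrowLegacyEngine

# twisted.persisted.styles calls copyreg.pickle(types.MethodType, ...), so a
# non-None entry here means bound methods are pickled via unpickleMethod
# instead of the default getattr-based reduction.
print("MethodType reducer:", copyreg.dispatch_table.get(types.MethodType))

payload = pickle.dumps(ArrowLegacyEngine.read_partition)
try:
    print(pickle.loads(payload))
except AttributeError as exc:
    print("round-trip failed:", exc)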