
AttributeError when trying to perform a distributed read_parquet (maybe serialization issues)

See original GitHub issue

What happened: I have a project where I dump tweet information to Parquet files in a distributed environment and use Dask to read those files and perform computation (although I usually write the files with pandas/Arrow). In the dev setup, each of the workers runs in a Docker container, and so does the scheduler. Since version 2.17.0, the Dask read_parquet function has stopped working: I get an AttributeError on the scheduler and what look like serialization issues on the workers' side.

What you expected to happen: To be able to read/load/compute the contents of the Parquet file using a distributed Dask client.

Minimal Complete Verifiable Example: Probably not easily verifiable because of the specifics of my environment, but here it is:

import pandas
from distributed import Client
import dask.dataframe as dd

dask_client = Client("dask-scheduler:8786", set_as_default=False)
df = pandas.DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df.to_parquet("./test.parquet", engine="pyarrow")
dask_client.compute(dd.read_parquet("./test.parquet"), sync=True)

The traceback I get in the Jupyter notebook:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-c51f04ec0f2f> in <module>
      6 df = pandas.DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
      7 df.to_parquet("./test.parquet", engine="pyarrow")
----> 8 dask_client.compute(dd.read_parquet("./test.parquet"), sync=True)

/usr/local/lib/python3.8/site-packages/distributed/client.py in compute(self, collections, sync, optimize_graph, workers, allow_other_workers, resources, retries, priority, fifo_timeout, actors, traverse, **kwargs)
   2928 
   2929         if sync:
-> 2930             result = self.gather(futures)
   2931         else:
   2932             result = futures

/usr/local/lib/python3.8/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1984             else:
   1985                 local_worker = None
-> 1986             return self.sync(
   1987                 self._gather,
   1988                 futures,

/usr/local/lib/python3.8/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    830             return future
    831         else:
--> 832             return sync(
    833                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    834             )

/usr/local/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    338     if error[0]:
    339         typ, exc, tb = error[0]
--> 340         raise exc.with_traceback(tb)
    341     else:
    342         return result[0]

/usr/local/lib/python3.8/site-packages/distributed/utils.py in f()
    322             if callback_timeout is not None:
    323                 future = asyncio.wait_for(future, callback_timeout)
--> 324             result[0] = yield future
    325         except Exception as exc:
    326             error[0] = sys.exc_info()

/usr/local/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

/usr/local/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1849                             exc = CancelledError(key)
   1850                         else:
-> 1851                             raise exception.with_traceback(traceback)
   1852                         raise exc
   1853                     if errors == "skip":

/usr/local/lib/python3.8/site-packages/distributed/protocol/pickle.py in loads()
     73             return pickle.loads(x, buffers=buffers)
     74         else:
---> 75             return pickle.loads(x)
     76     except Exception as e:
     77         logger.info("Failed to deserialize %s", x[:10000], exc_info=True)

/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py in unpickleMethod()
     97         return getattr(im_class, im_name)
     98     try:
---> 99         methodFunction = _methodFunction(im_class, im_name)
    100     except AttributeError:
    101         log.msg("Method", im_name, "not on class", im_class)

/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py in _methodFunction()
     74     @rtype: L{types.FunctionType}
     75     """
---> 76     methodObject = getattr(classObject, methodName)
     77     if _PY3:
     78         return methodObject

AttributeError: type object 'type' has no attribute 'read_partition'

And the worker logs:

2021-01-13 12:44:58,673 | INFO | -------------------------------------------------
2021-01-13 12:44:58,677 | INFO | Registered to: tcp://dask-scheduler:8786
2021-01-13 12:44:58,677 | INFO | -------------------------------------------------
2021-01-13 12:45:17,298 | WARNING | Could not deserialize task
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/worker.py", line 2445, in _maybe_deserialize_task
    function, args, kwargs = _deserialize(*self.tasks[key])
  File "/usr/local/lib/python3.8/site-packages/distributed/worker.py", line 3281, in _deserialize
    args = pickle.loads(args)
  File "/usr/local/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 75, in loads
    return pickle.loads(x)
  File "/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py", line 99, in unpickleMethod
    methodFunction = _methodFunction(im_class, im_name)
  File "/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py", line 76, in _methodFunction
    methodObject = getattr(classObject, methodName)
AttributeError: type object 'type' has no attribute 'read_partition'

2021-01-13 12:45:45,310 | ERROR | type object 'type' has no attribute 'read_partition'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/worker.py", line 915, in handle_scheduler
    await self.handle_stream(
  File "/usr/local/lib/python3.8/site-packages/distributed/core.py", line 579, in handle_stream
    msgs = await comm.read()
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/tcp.py", line 204, in read
    msg = await from_frames(
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/utils.py", line 87, in from_frames
    res = _from_frames()
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/utils.py", line 65, in _from_frames
    return protocol.loads(
  File "/usr/local/lib/python3.8/site-packages/distributed/protocol/core.py", line 151, in loads
    value = _deserialize(head, fs, deserializers=deserializers)
  File "/usr/local/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 335, in deserialize
    return loads(header, frames)
  File "/usr/local/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 71, in pickle_loads
    return pickle.loads(x, buffers=buffers)
  File "/usr/local/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 73, in loads
    return pickle.loads(x, buffers=buffers)
  File "/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py", line 99, in unpickleMethod
    methodFunction = _methodFunction(im_class, im_name)
  File "/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py", line 76, in _methodFunction
    methodObject = getattr(classObject, methodName)
AttributeError: type object 'type' has no attribute 'read_partition'

2021-01-13 12:45:45,311 | INFO | Connection to scheduler broken. Reconnecting...

Anything else we need to know?:

  • Doing the read of the Parquet file with Pandas, pyarrow or a Dask local cluster works fine.
  • Downgrading the Dask/distributed packages to 2.16.0 makes it work again, although I read the 2.17.0 changelog and didn’t find anything that seemed guilty of breaking this.
  • Searching for similar issues/questions, I saw that in some cases a version mismatch between client, scheduler and workers could be the origin, but I checked the key packages with get_versions() and they all seemed to be the same.
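For reference, the version check mentioned in the last bullet can be automated: Client.get_versions(check=True) raises when package versions differ between client, scheduler and workers, instead of leaving the comparison to eyeballing. A minimal sketch, assuming the distributed package is installed; it spins up an in-process cluster purely so the snippet is self-contained, whereas against the real deployment you would pass the scheduler address:

```python
from distributed import Client

# In-process cluster just for illustration; in the real setup this
# would be Client("dask-scheduler:8786").
client = Client(processes=False)

# With check=True, get_versions raises if any key package version
# differs between client, scheduler and workers.
versions = client.get_versions(check=True)
print(list(versions))

client.close()
```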

Environment:

  • Dask version: 2020.12.0 (distributed too)
  • Python version: 3.8.6
  • Operating System: Linux (Manjaro)
  • Install method (conda, pip, source): pip

Most probably this is not a bug but an error on my part somewhere, but I'm really stuck, and the question I asked a while ago on Stack Overflow went unanswered. I'll gladly provide any other information that could help (package versions, how the scheduler/workers are launched, etc.). Thanks for the help and for the Dask project in general!

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
rjzamora commented, Jan 13, 2021

The latest version of Dask refactored the Parquet and CSV readers into so-called Blockwise layers. The ultimate goal is to make it easier to merge IO operations with follow-up operations on the same partition… Unfortunately, we didn't handle serialization correctly. dask#7048 has reverted these changes for the next Dask release. dask#7042 will add the Blockwise changes back in a later release (once we are more confident that the serialization is handled correctly).

I’ll close this once I can confirm that the dask reversion fixed this issue. I’ll also be sure to check that 7042 works before it is merged.

0 reactions
Serbaf commented, Feb 16, 2021

Hi again!

I have had some time today to look at this issue once again, stepping through what happens inside client.compute() with a debugger. I should add that, although I hit the issue from my Django project and the linked Jupyter kernel (based on Django-Shell-Plus), executing the same code from a Python shell in any of the containers that form the environment did not reproduce it.

Thus, I finally found that the core of the problem is the pickling of the task done in warn_dumps; specifically, the pickling of the function ArrowLegacyEngine.read_partition.

In my Jupyter notebook, the pickling of the function:

import pickle
from dask.dataframe.io.parquet.arrow import ArrowLegacyEngine

pickle.dumps(ArrowLegacyEngine.read_partition)

Results in this: b'\x80\x04\x95\x90\x00\x00\x00\x00\x00\x00\x00\x8c\x18twisted.persisted.styles\x94\x8c\x0eunpickleMethod\x94\x93\x94\x8c\x0eread_partition\x94\x8c\x1fdask.dataframe.io.parquet.arrow\x94\x8c\x11ArrowLegacyEngine\x94\x93\x94\x8c\x08builtins\x94\x8c\x04type\x94\x93\x94\x87\x94R\x94.'

Which seems to be wrong, because calling pickle.loads() on that bytestream raises the said AttributeError:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-157-3a7606b542c9> in <module>
----> 1 pickle.loads(pickle.dumps(ArrowLegacyEngine.read_partition))

/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py in unpickleMethod(im_name, im_self, im_class)
     97         return getattr(im_class, im_name)
     98     try:
---> 99         methodFunction = _methodFunction(im_class, im_name)
    100     except AttributeError:
    101         log.msg("Method", im_name, "not on class", im_class)

/usr/local/lib/python3.8/site-packages/twisted/persisted/styles.py in _methodFunction(classObject, methodName)
     74     @rtype: L{types.FunctionType}
     75     """
---> 76     methodObject = getattr(classObject, methodName)
     77     if _PY3:
     78         return methodObject

AttributeError: type object 'type' has no attribute 'read_partition'
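As a side note, the failing payload can be inspected without importing twisted or dask: pickletools.dis from the standard library disassembles the pickle opcodes without executing any of the referenced callables. A sketch using the bytestream quoted above:

```python
import io
import pickletools

# The pickle payload produced in the failing Jupyter environment
# (copied from the dump above).
payload = (
    b'\x80\x04\x95\x90\x00\x00\x00\x00\x00\x00\x00'
    b'\x8c\x18twisted.persisted.styles\x94\x8c\x0eunpickleMethod\x94\x93\x94'
    b'\x8c\x0eread_partition\x94'
    b'\x8c\x1fdask.dataframe.io.parquet.arrow\x94\x8c\x11ArrowLegacyEngine\x94\x93\x94'
    b'\x8c\x08builtins\x94\x8c\x04type\x94\x93\x94\x87\x94R\x94.'
)

buf = io.StringIO()
pickletools.dis(payload, out=buf)
listing = buf.getvalue()
print(listing)

# The listing shows that, on load, the payload calls
# twisted.persisted.styles.unpickleMethod('read_partition',
# ArrowLegacyEngine, builtins.type) -- twisted's method reducer,
# not pickle's default getattr-based one.
```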

On the other hand, pickle.dumps(ArrowLegacyEngine.read_partition), when run from a regular Python shell in a container, returns this: b'\x80\x04\x95e\x00\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x07getattr\x94\x93\x94\x8c\x1fdask.dataframe.io.parquet.arrow\x94\x8c\x11ArrowLegacyEngine\x94\x93\x94\x8c\x0eread_partition\x94\x86\x94R\x94.'

Which, when passed to pickle.loads(), gives: <bound method ArrowDatasetEngine.read_partition of <class 'dask.dataframe.io.parquet.arrow.ArrowLegacyEngine'>>

Any ideas about what could be causing this? I checked that pickle.format_version was the same in both environments, and dask.__version__ too. Sorry, but I don't know how to make it reproducible (it only seems to happen within my Dask project).
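One plausible mechanism, an assumption on my part rather than something confirmed in the thread, though it matches the twisted.persisted.styles frames in the bad payload above: importing twisted.persisted.styles registers a global copyreg reducer for bound methods. For a classmethod such as ArrowLegacyEngine.read_partition, __self__ is the class itself, so a reducer that records __self__.__class__ as the owning class ends up recording the metaclass, type, and unpickling then looks read_partition up on type. A self-contained sketch of that mechanism using only the standard library; Engine, pickle_method and unpickle_method are hypothetical, simplified stand-ins for ArrowLegacyEngine and twisted's pickleMethod/unpickleMethod:

```python
import copyreg
import pickle
import types


def unpickle_method(name, obj, cls):
    # Simplified stand-in for twisted.persisted.styles.unpickleMethod.
    return getattr(cls, name)


def pickle_method(method):
    # Simplified stand-in for twisted.persisted.styles.pickleMethod:
    # it records the method name, the bound object, and the *class of
    # the bound object* (method.__self__.__class__).
    return unpickle_method, (
        method.__func__.__name__,
        method.__self__,
        method.__self__.__class__,
    )


class Engine:
    # Hypothetical stand-in for ArrowLegacyEngine.
    @classmethod
    def read_partition(cls):
        return "partition"


# A global registration like this is the kind of thing importing
# twisted.persisted.styles does for types.MethodType.
copyreg.pickle(types.MethodType, pickle_method)

# For a classmethod, __self__ is the class, so __self__.__class__ is
# the metaclass `type` -- and `type` has no 'read_partition' attribute.
blob = pickle.dumps(Engine.read_partition)

err = None
try:
    pickle.loads(blob)
except AttributeError as exc:
    err = str(exc)
print(err)  # → type object 'type' has no attribute 'read_partition'
```

Under this assumption, the environments behave differently simply because the Django/Jupyter process has imported twisted (directly or via a dependency) before pickling, while the plain Python shell has not, so the latter falls back to pickle's default getattr-based reduction seen in the good bytestream.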
