
Support Arrow Table/RecordBatch types

xref: #614

I’m sure this is possible with #606, so this issue is mostly just to document my attempt to get it working. Out of the box (1.22.0+9.gdb758d0f), attempting to pass an Arrow Table or RecordBatch results in a TypeError (no default __reduce__ due to non-trivial __cinit__):

from distributed import Client
import pandas as pd
import pyarrow as pa

client = Client()

df = pd.DataFrame({'A': list('abc'), 'B': [1,2,3]})
tbl = pa.Table.from_pandas(df, preserve_index=False)

def echo(arg):
    return arg
>>> client.submit(echo, df).result().equals(df)
True
>>> client.submit(echo, tbl).result()
distributed.protocol.pickle - INFO - Failed to serialize (pyarrow.Table
A: string
B: int64
metadata
--------
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
            b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
            b'pandas_type": "int64", "numpy_type": "int64", "metadata": null}]'
            b', "pandas_version": "0.23.1"}'},). Exception: no default __reduce__ due to non-trivial __cinit__
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Miniconda3\lib\site-packages\distributed\protocol\pickle.py in dumps(x)
     37     try:
---> 38         result = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
     39         if len(result) < 1000:

C:\Miniconda3\lib\site-packages\pyarrow\lib.cp36-win_amd64.pyd in pyarrow.lib.RecordBatch.__reduce_cython__()

TypeError: no default __reduce__ due to non-trivial __cinit__

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-48-4e8e2ea90e79> in <module>()
----> 1 client.submit(echo, tbl).result()

C:\Miniconda3\lib\site-packages\distributed\client.py in submit(self, func, *args, **kwargs)
   1236                                          resources={skey: resources} if resources else None,
   1237                                          retries=retries,
-> 1238                                          fifo_timeout=fifo_timeout)
   1239 
   1240         logger.debug("Submit %s(...), %s", funcname(func), key)

C:\Miniconda3\lib\site-packages\distributed\client.py in _graph_to_futures(self, dsk, keys, restrictions, loose_restrictions, priority, user_priority, resources, retries, fifo_timeout)
   2093 
   2094             self._send_to_scheduler({'op': 'update-graph',
-> 2095                                      'tasks': valmap(dumps_task, dsk3),
   2096                                      'dependencies': dependencies,
   2097                                      'keys': list(flatkeys),

C:\Miniconda3\lib\site-packages\cytoolz\dicttoolz.pyx in cytoolz.dicttoolz.valmap()

C:\Miniconda3\lib\site-packages\cytoolz\dicttoolz.pyx in cytoolz.dicttoolz.valmap()

C:\Miniconda3\lib\site-packages\distributed\worker.py in dumps_task(task)
    799         elif not any(map(_maybe_complex, task[1:])):
    800             return {'function': dumps_function(task[0]),
--> 801                     'args': warn_dumps(task[1:])}
    802     return to_serialize(task)
    803 

C:\Miniconda3\lib\site-packages\distributed\worker.py in warn_dumps(obj, dumps, limit)
    808 def warn_dumps(obj, dumps=pickle.dumps, limit=1e6):
    809     """ Dump an object to bytes, warn if those bytes are large """
--> 810     b = dumps(obj)
    811     if not _warn_dumps_warned[0] and len(b) > limit:
    812         _warn_dumps_warned[0] = True

C:\Miniconda3\lib\site-packages\distributed\protocol\pickle.py in dumps(x)
     49     except Exception:
     50         try:
---> 51             return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
     52         except Exception as e:
     53             logger.info("Failed to serialize %s. Exception: %s", x, e)

C:\Miniconda3\lib\site-packages\cloudpickle\cloudpickle.py in dumps(obj, protocol)
    893     try:
    894         cp = CloudPickler(file, protocol=protocol)
--> 895         cp.dump(obj)
    896         return file.getvalue()
    897     finally:

C:\Miniconda3\lib\site-packages\cloudpickle\cloudpickle.py in dump(self, obj)
    266         self.inject_addons()
    267         try:
--> 268             return Pickler.dump(self, obj)
    269         except RuntimeError as e:
    270             if 'recursion' in e.args[0]:

C:\Miniconda3\lib\pickle.py in dump(self, obj)
    407         if self.proto >= 4:
    408             self.framer.start_framing()
--> 409         self.save(obj)
    410         self.write(STOP)
    411         self.framer.end_framing()

C:\Miniconda3\lib\pickle.py in save(self, obj, save_persistent_id)
    474         f = self.dispatch.get(t)
    475         if f is not None:
--> 476             f(self, obj) # Call unbound method with explicit self
    477             return
    478 

C:\Miniconda3\lib\pickle.py in save_tuple(self, obj)
    734         if n <= 3 and self.proto >= 2:
    735             for element in obj:
--> 736                 save(element)
    737             # Subtle.  Same as in the big comment below.
    738             if id(obj) in memo:

C:\Miniconda3\lib\pickle.py in save(self, obj, save_persistent_id)
    494             reduce = getattr(obj, "__reduce_ex__", None)
    495             if reduce is not None:
--> 496                 rv = reduce(self.proto)
    497             else:
    498                 reduce = getattr(obj, "__reduce__", None)

C:\Miniconda3\lib\site-packages\pyarrow\lib.cp36-win_amd64.pyd in pyarrow.lib.RecordBatch.__reduce_cython__()

TypeError: no default __reduce__ due to non-trivial __cinit__
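
Until pyarrow implements the pickle protocol for these types, one possible workaround (a sketch only, using current pyarrow IPC APIs; the helper names here are made up for illustration) is to register a reducer with the standard library's copyreg module, so that pickle knows how to round-trip a Table via Arrow's IPC stream format:

import copyreg
import pyarrow as pa

def rebuild_table(data):
    # Reconstruct the Table from IPC stream bytes.
    return pa.ipc.open_stream(data).read_all()

def reduce_table(table):
    # Serialize the Table to Arrow's IPC stream format so that
    # pickle has a (callable, args) pair to rebuild it with.
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, table.schema)
    writer.write_table(table)
    writer.close()
    return rebuild_table, (sink.getvalue().to_pybytes(),)

copyreg.pickle(pa.Table, reduce_table)

Note that rebuild_table has to be importable (or shipped by cloudpickle) in whatever process does the unpickling, so on a cluster the registration would need to run on the workers as well.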

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 29 (29 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Jul 8, 2018

Interesting.

I can reproduce this with just Pickle. I recommend raising this as an issue with Arrow.

In [1]: import pandas as pd
   ...: import pyarrow as pa
   ...: df = pd.DataFrame({'A': list('abc'), 'B': [1,2,3]})
   ...: tbl = pa.Table.from_pandas(df, preserve_index=False)

In [2]: import pickle

In [3]: b = pickle.dumps(tbl)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-264061ce888e> in <module>()
----> 1 b = pickle.dumps(tbl)

~/Software/anaconda/envs/arrow/lib/python3.6/site-packages/pyarrow/lib.cpython-36m-x86_64-linux-gnu.so in pyarrow.lib.Table.__reduce_cython__()

TypeError: no default __reduce__ due to non-trivial __cinit__

Alternatively, if you wanted, you could also help implement a custom serialization solution for Arrow in Dask (see docs here), which would be useful for avoiding memory copies during transfer, though that is probably a minor performance concern in the common case. Getting other projects to implement the pickle protocol is probably the first step.
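
For reference, a minimal sketch of what such a registration could look like with the dask_serialize/dask_deserialize hooks described in those docs, again leaning on Arrow's IPC stream format (illustrative only, not the implementation that eventually landed upstream):

import pyarrow as pa
from distributed.protocol import dask_serialize, dask_deserialize

@dask_serialize.register(pa.Table)
def serialize_table(table):
    # Write the table to Arrow's IPC stream format; the resulting
    # pa.Buffer supports the buffer protocol, so it can travel as
    # a frame without an extra copy on this side.
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, table.schema)
    writer.write_table(table)
    writer.close()
    return {}, [sink.getvalue()]

@dask_deserialize.register(pa.Table)
def deserialize_table(header, frames):
    # Rebuild the table from the single IPC frame.
    return pa.ipc.open_stream(frames[0]).read_all()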

0 reactions
dhirschfeld commented, Jul 18, 2018

Closing, as this has been implemented in #2115; further discussion can take place in #2110.
