Error in converting byte array to list of floats using dask dataframe
What happened: the byte array is corrupted when reading a parquet file with a Dask dataframe, but not when reading it directly with pandas. The trailing NUL bytes (the final `\x00\x00\x00\x00`) are missing from the Dask output below but present in the pandas output.
What you expected to happen: both readers should return the same bytes. Please advise whether I am doing something wrong or whether this is a bug.
Minimal Complete Verifiable Example:
# Pandas dataframe
In [1]: import pandas as pd
In [2]: df = pd.read_parquet("Testing.parquet")
In [8]: df.ByteArray[0]
Out[8]: b"\x1a\xfb\xf2>\x92\x06\xdf\xbe\xeb\xe3\x99\xbe\xf1\xd4\x13\xbe\xed\r~>\xe9\x9d\n\xbd.\xab0\xbdr\xa4\x93\xbd\x96B\xa8>\xfe\x7f\x00?\x9c\xf9\x9d\xbe\t\xa7\xa5=\x13a\x93>\xc8^??\x0c\x1e\xae>\xd0\x80:=\xdf\xbf)>\xd8~\xf2\xbdy\xc9\xbf\xbd\x88\x81\x02?6\x01f\xbe\xde\xb0\xad\xbe\x12\x12\xd9\xbe\xdc\xb8u>pB\x89>h<q\xbe3\xc0\xed\xbe!\x8f\xf0>\xdfN\xba>W\xeb\xc4<\xfa\xb7K\xbc\xfaG\x07?]\xc5\xb2>FC\x06=\xe0\xf59\xbe\x1a\x18\x99\xbe\xf4\xf8m>\x02g9>\xc7\xbb\x83=\xc6\xbf?\xbe\xf8P\x82\xbeA\xf4\xe4>\x0b\x98\xf0\xbeM\xf4-\xbf\x9b=@>h\x1f\x93>\x85]\x04>N|u=\xed\xf3(>\xaaf&\xbe<\xf8\r\xbf\x83\xbeD\xbe\xfa\x9b\xd0\xbe\x8e\\\x1b?UP\xa1>\xd3N}\xbe\xb8\xcb\x9e=\xe4g\x8b>Y\xc40\xbe1\x97\xfc\xbe\x1aNy>\xd8\x9c\x03=Y\xc2\xfa\xbe4G\xce\xbe\xa8\x19\x8a>#0\xd6\xbe\xcf\xc0(\xbe\x8f\x17\xb2\xbd\xf9\xf3\xad=bi\xa0=\x9d\xf4\x06\xbf+\x16\x87>r\x8by\xbc\xd9vZ\xbeu9\x95\xbeJ^m>\xafC\xa5>\xf1\xa1t>\xa6E\x95>\xcf\xa1\xcc\xbd*\xc6I>\x95\x9d\xfe\xbdC\xac\xfe\xba\xce5,\xbe\xe8\xa0\xe3>\x8bRB>\xefoP=<K\xe0\xbe\xbcy\xba>\xd1\x05\xed\xbe\xf7\x94\xec>\x93\x8b\xa1\xbe\xec\xc1t>F[\xdd\xbe\x15\x1e\x04\xbe\xc9\x8f\x98\xbd\x03'\xdb:\x8d{s\xbc\xfc\xc4\x81\xbc\x87\xc0\x81\xbe\xb1Op\xbeX\xfe\\\xbeM\x10\x95\xbd\xbd\xe4/>\xae\xbb\xb9>\t\xa5\x7f\xbe\xf5\x0e/?\xaf\x94\x8d>\xca\x87\xe0>\xcd\xadP\xbe5\xeb\xb4\xbe\xe4\x13r\xbd%\xce*>\xaf\x98\xd9\xbe\x86u\x83\xbd\xa6\r\x87=\xf7\x05\x84\xbez\x1aP\xbe\xef\x91\r=JE\x1f\xbf\xe6\xadJ\xbf\xf5\x0f2\xbe\xcbK\x1e>\x92\xcc\xaa\xbe\xb0\xe4\xea=lz\x88\xbe\x81\xce\xa4=\x00\x00\x00\x00"
# Dask dataframe
In [6]: import dask.dataframe as dd
In [10]: ddf = dd.read_parquet("Testing.parquet")  # use a separate name so the dd module is not shadowed
In [13]: ddf.compute().ByteArray[0]
Out[13]: b"\x1a\xfb\xf2>\x92\x06\xdf\xbe\xeb\xe3\x99\xbe\xf1\xd4\x13\xbe\xed\r~>\xe9\x9d\n\xbd.\xab0\xbdr\xa4\x93\xbd\x96B\xa8>\xfe\x7f\x00?\x9c\xf9\x9d\xbe\t\xa7\xa5=\x13a\x93>\xc8^??\x0c\x1e\xae>\xd0\x80:=\xdf\xbf)>\xd8~\xf2\xbdy\xc9\xbf\xbd\x88\x81\x02?6\x01f\xbe\xde\xb0\xad\xbe\x12\x12\xd9\xbe\xdc\xb8u>pB\x89>h<q\xbe3\xc0\xed\xbe!\x8f\xf0>\xdfN\xba>W\xeb\xc4<\xfa\xb7K\xbc\xfaG\x07?]\xc5\xb2>FC\x06=\xe0\xf59\xbe\x1a\x18\x99\xbe\xf4\xf8m>\x02g9>\xc7\xbb\x83=\xc6\xbf?\xbe\xf8P\x82\xbeA\xf4\xe4>\x0b\x98\xf0\xbeM\xf4-\xbf\x9b=@>h\x1f\x93>\x85]\x04>N|u=\xed\xf3(>\xaaf&\xbe<\xf8\r\xbf\x83\xbeD\xbe\xfa\x9b\xd0\xbe\x8e\\\x1b?UP\xa1>\xd3N}\xbe\xb8\xcb\x9e=\xe4g\x8b>Y\xc40\xbe1\x97\xfc\xbe\x1aNy>\xd8\x9c\x03=Y\xc2\xfa\xbe4G\xce\xbe\xa8\x19\x8a>#0\xd6\xbe\xcf\xc0(\xbe\x8f\x17\xb2\xbd\xf9\xf3\xad=bi\xa0=\x9d\xf4\x06\xbf+\x16\x87>r\x8by\xbc\xd9vZ\xbeu9\x95\xbeJ^m>\xafC\xa5>\xf1\xa1t>\xa6E\x95>\xcf\xa1\xcc\xbd*\xc6I>\x95\x9d\xfe\xbdC\xac\xfe\xba\xce5,\xbe\xe8\xa0\xe3>\x8bRB>\xefoP=<K\xe0\xbe\xbcy\xba>\xd1\x05\xed\xbe\xf7\x94\xec>\x93\x8b\xa1\xbe\xec\xc1t>F[\xdd\xbe\x15\x1e\x04\xbe\xc9\x8f\x98\xbd\x03'\xdb:\x8d{s\xbc\xfc\xc4\x81\xbc\x87\xc0\x81\xbe\xb1Op\xbeX\xfe\\\xbeM\x10\x95\xbd\xbd\xe4/>\xae\xbb\xb9>\t\xa5\x7f\xbe\xf5\x0e/?\xaf\x94\x8d>\xca\x87\xe0>\xcd\xadP\xbe5\xeb\xb4\xbe\xe4\x13r\xbd%\xce*>\xaf\x98\xd9\xbe\x86u\x83\xbd\xa6\r\x87=\xf7\x05\x84\xbez\x1aP\xbe\xef\x91\r=JE\x1f\xbf\xe6\xadJ\xbf\xf5\x0f2\xbe\xcbK\x1e>\x92\xcc\xaa\xbe\xb0\xe4\xea=lz\x88\xbe\x81\xce\xa4="
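The trailing `\x00\x00\x00\x00` in the pandas output suggests the column packs little-endian float32 values, which would make stripped NUL bytes a silent data-loss bug rather than a decode error. A minimal sketch of that failure mode, independent of the parquet file (the `rstrip` call stands in for the NUL-stripping that fastparquet is suspected of doing; the values are arbitrary):

```python
import struct

import numpy as np

# Pack four little-endian float32 values; the last is 0.0, so the
# buffer ends in four NUL bytes, like the column in the report.
values = [1.5, -2.25, 0.75, 0.0]
buf = struct.pack("<4f", *values)
assert buf.endswith(b"\x00\x00\x00\x00")

# Simulate trailing NUL bytes being stripped on read.
stripped = buf.rstrip(b"\x00")

# The intact buffer round-trips cleanly.
assert np.frombuffer(buf, dtype="<f4").tolist() == values

# The stripped buffer still decodes without error, but the trailing
# 0.0 is silently gone -- only three floats come back.
print(np.frombuffer(stripped, dtype="<f4").tolist())
```

Note that if the stripped length were not a multiple of 4 (which can happen when a non-zero float merely ends in a NUL byte), `np.frombuffer` would instead raise a `ValueError`, so both silent loss and hard failure are possible outcomes.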
Anything else we need to know?:
Environment:
- Dask version: 2.1.0
- Python version: 3.7.3
- Operating System: Linux
- Install method (conda, pip, source): conda most probably
Issue analytics: created 3 years ago; 6 comments (3 by maintainers).
This is a duplicate of #504, where there is also a suggested fix that has not been implemented yet.
Okay, so using 'pyarrow' as the engine should work for me for now. Thanks.