Error in converting byte array to list of floats using dask dataframe
What happened: the byte array is corrupted when reading a parquet file with a Dask dataframe, but not when reading it directly with pandas. The trailing NUL bytes (the final `\x00\x00\x00\x00`) are missing from the Dask output below but present in the pandas output.
What you expected to happen: both readers should return the same bytes. Please advise whether I am doing something wrong or whether this is a bug.
Minimal Complete Verifiable Example:
# Pandas dataframe
In [1]: import pandas as pd
In [2]: df = pd.read_parquet("Testing.parquet")
In [8]: df.ByteArray[0]
Out[8]: b"\x1a\xfb\xf2>\x92\x06\xdf\xbe\xeb\xe3\x99\xbe\xf1\xd4\x13\xbe\xed\r~>\xe9\x9d\n\xbd.\xab0\xbdr\xa4\x93\xbd\x96B\xa8>\xfe\x7f\x00?\x9c\xf9\x9d\xbe\t\xa7\xa5=\x13a\x93>\xc8^??\x0c\x1e\xae>\xd0\x80:=\xdf\xbf)>\xd8~\xf2\xbdy\xc9\xbf\xbd\x88\x81\x02?6\x01f\xbe\xde\xb0\xad\xbe\x12\x12\xd9\xbe\xdc\xb8u>pB\x89>h<q\xbe3\xc0\xed\xbe!\x8f\xf0>\xdfN\xba>W\xeb\xc4<\xfa\xb7K\xbc\xfaG\x07?]\xc5\xb2>FC\x06=\xe0\xf59\xbe\x1a\x18\x99\xbe\xf4\xf8m>\x02g9>\xc7\xbb\x83=\xc6\xbf?\xbe\xf8P\x82\xbeA\xf4\xe4>\x0b\x98\xf0\xbeM\xf4-\xbf\x9b=@>h\x1f\x93>\x85]\x04>N|u=\xed\xf3(>\xaaf&\xbe<\xf8\r\xbf\x83\xbeD\xbe\xfa\x9b\xd0\xbe\x8e\\\x1b?UP\xa1>\xd3N}\xbe\xb8\xcb\x9e=\xe4g\x8b>Y\xc40\xbe1\x97\xfc\xbe\x1aNy>\xd8\x9c\x03=Y\xc2\xfa\xbe4G\xce\xbe\xa8\x19\x8a>#0\xd6\xbe\xcf\xc0(\xbe\x8f\x17\xb2\xbd\xf9\xf3\xad=bi\xa0=\x9d\xf4\x06\xbf+\x16\x87>r\x8by\xbc\xd9vZ\xbeu9\x95\xbeJ^m>\xafC\xa5>\xf1\xa1t>\xa6E\x95>\xcf\xa1\xcc\xbd*\xc6I>\x95\x9d\xfe\xbdC\xac\xfe\xba\xce5,\xbe\xe8\xa0\xe3>\x8bRB>\xefoP=<K\xe0\xbe\xbcy\xba>\xd1\x05\xed\xbe\xf7\x94\xec>\x93\x8b\xa1\xbe\xec\xc1t>F[\xdd\xbe\x15\x1e\x04\xbe\xc9\x8f\x98\xbd\x03'\xdb:\x8d{s\xbc\xfc\xc4\x81\xbc\x87\xc0\x81\xbe\xb1Op\xbeX\xfe\\\xbeM\x10\x95\xbd\xbd\xe4/>\xae\xbb\xb9>\t\xa5\x7f\xbe\xf5\x0e/?\xaf\x94\x8d>\xca\x87\xe0>\xcd\xadP\xbe5\xeb\xb4\xbe\xe4\x13r\xbd%\xce*>\xaf\x98\xd9\xbe\x86u\x83\xbd\xa6\r\x87=\xf7\x05\x84\xbez\x1aP\xbe\xef\x91\r=JE\x1f\xbf\xe6\xadJ\xbf\xf5\x0f2\xbe\xcbK\x1e>\x92\xcc\xaa\xbe\xb0\xe4\xea=lz\x88\xbe\x81\xce\xa4=\x00\x00\x00\x00"
# Dask dataframe
In [6]: import dask.dataframe as dd
In [10]: ddf = dd.read_parquet("Testing.parquet")  # use a separate name so the dd module is not shadowed
In [13]: ddf.compute().ByteArray[0]
Out[13]: b"\x1a\xfb\xf2>\x92\x06\xdf\xbe\xeb\xe3\x99\xbe\xf1\xd4\x13\xbe\xed\r~>\xe9\x9d\n\xbd.\xab0\xbdr\xa4\x93\xbd\x96B\xa8>\xfe\x7f\x00?\x9c\xf9\x9d\xbe\t\xa7\xa5=\x13a\x93>\xc8^??\x0c\x1e\xae>\xd0\x80:=\xdf\xbf)>\xd8~\xf2\xbdy\xc9\xbf\xbd\x88\x81\x02?6\x01f\xbe\xde\xb0\xad\xbe\x12\x12\xd9\xbe\xdc\xb8u>pB\x89>h<q\xbe3\xc0\xed\xbe!\x8f\xf0>\xdfN\xba>W\xeb\xc4<\xfa\xb7K\xbc\xfaG\x07?]\xc5\xb2>FC\x06=\xe0\xf59\xbe\x1a\x18\x99\xbe\xf4\xf8m>\x02g9>\xc7\xbb\x83=\xc6\xbf?\xbe\xf8P\x82\xbeA\xf4\xe4>\x0b\x98\xf0\xbeM\xf4-\xbf\x9b=@>h\x1f\x93>\x85]\x04>N|u=\xed\xf3(>\xaaf&\xbe<\xf8\r\xbf\x83\xbeD\xbe\xfa\x9b\xd0\xbe\x8e\\\x1b?UP\xa1>\xd3N}\xbe\xb8\xcb\x9e=\xe4g\x8b>Y\xc40\xbe1\x97\xfc\xbe\x1aNy>\xd8\x9c\x03=Y\xc2\xfa\xbe4G\xce\xbe\xa8\x19\x8a>#0\xd6\xbe\xcf\xc0(\xbe\x8f\x17\xb2\xbd\xf9\xf3\xad=bi\xa0=\x9d\xf4\x06\xbf+\x16\x87>r\x8by\xbc\xd9vZ\xbeu9\x95\xbeJ^m>\xafC\xa5>\xf1\xa1t>\xa6E\x95>\xcf\xa1\xcc\xbd*\xc6I>\x95\x9d\xfe\xbdC\xac\xfe\xba\xce5,\xbe\xe8\xa0\xe3>\x8bRB>\xefoP=<K\xe0\xbe\xbcy\xba>\xd1\x05\xed\xbe\xf7\x94\xec>\x93\x8b\xa1\xbe\xec\xc1t>F[\xdd\xbe\x15\x1e\x04\xbe\xc9\x8f\x98\xbd\x03'\xdb:\x8d{s\xbc\xfc\xc4\x81\xbc\x87\xc0\x81\xbe\xb1Op\xbeX\xfe\\\xbeM\x10\x95\xbd\xbd\xe4/>\xae\xbb\xb9>\t\xa5\x7f\xbe\xf5\x0e/?\xaf\x94\x8d>\xca\x87\xe0>\xcd\xadP\xbe5\xeb\xb4\xbe\xe4\x13r\xbd%\xce*>\xaf\x98\xd9\xbe\x86u\x83\xbd\xa6\r\x87=\xf7\x05\x84\xbez\x1aP\xbe\xef\x91\r=JE\x1f\xbf\xe6\xadJ\xbf\xf5\x0f2\xbe\xcbK\x1e>\x92\xcc\xaa\xbe\xb0\xe4\xea=lz\x88\xbe\x81\xce\xa4="
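The trailing `\x00\x00\x00\x00` in the pandas output suggests the column packs little-endian float32 values, which would make stripped NUL bytes a silent data-loss bug rather than a decode error. A minimal sketch of that failure mode, independent of the parquet file (the `rstrip` call stands in for the NUL-stripping that fastparquet is suspected of doing; the values are arbitrary):

```python
import struct

import numpy as np

# Pack four little-endian float32 values; the last is 0.0, so the
# buffer ends in four NUL bytes, like the column in the report.
values = [1.5, -2.25, 0.75, 0.0]
buf = struct.pack("<4f", *values)
assert buf.endswith(b"\x00\x00\x00\x00")

# Simulate trailing NUL bytes being stripped on read.
stripped = buf.rstrip(b"\x00")

# The intact buffer round-trips cleanly.
assert np.frombuffer(buf, dtype="<f4").tolist() == values

# The stripped buffer still decodes without error, but the trailing
# 0.0 is silently gone -- only three floats come back.
print(np.frombuffer(stripped, dtype="<f4").tolist())
```

Note that if the stripped length were not a multiple of 4 (which can happen when a non-zero float merely ends in a NUL byte), `np.frombuffer` would instead raise a `ValueError`, so both silent loss and hard failure are possible outcomes.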
Anything else we need to know?:
Environment:
- Dask version: 2.1.0
- Python version: 3.7.3
- Operating System: Linux
- Install method (conda, pip, source): conda most probably
Issue analytics: created 3 years ago; 6 comments (3 by maintainers).
This is a duplicate of #504, where there is also a suggested fix that has not been implemented yet.
Okay, so using 'pyarrow' as the engine should work for me for now. Thanks.