Unexpected Behavior in read_csv() with max_result_size
I'm using read_csv() with a max_result_size of 128 MB to process a 135 MB CSV file. When I attempt to write Parquet, it returns the following error:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 815, in _data_to_s3_dataset_writer_remote
    isolated_dataframe=True))
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 758, in _data_to_s3_dataset_writer
    isolated_dataframe=isolated_dataframe)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 856, in _data_to_s3_object_writer
    isolated_dataframe=isolated_dataframe)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 902, in write_parquet_dataframe
    table = pa.Table.from_pandas(df=dataframe, preserve_index=preserve_index, safe=False)
  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/pyarrow/pandas_compat.py", line 501, in dataframe_to_arrays
    convert_fields))
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/pyarrow/pandas_compat.py", line 487, in convert_column
    raise e
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/pyarrow/pandas_compat.py", line 481, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 95, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ("Expected a bytes object, got a 'Timestamp' object", 'Conversion failed for column xxxxx with type object')
When I process the same file using read_csv() without max_result_size, it works fine.
Code Snippet:
df_iter = session.pandas.read_csv(csv, sep='|', header=0, dtype=file_meta, parse_dates=dates_cols, max_result_size=128*1024*1024)
for df in df_iter:
    df_data = df_data.append(df)
session.pandas.to_parquet(dataframe=df_data, database=db_name, table=tbl_name, preserve_index=False, mode='overwrite', path=path)
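As a hedged aside (nothing beyond the call above is assumed), one way to confirm the chunk-to-chunk dtype drift discussed in the comments below is to print each chunk's dtypes for the parsed date columns:

```python
# Diagnostic sketch: re-run the chunked read and report the dtype of each
# parsed date column per chunk (datetime64[ns] vs. an object fallback).
df_iter = session.pandas.read_csv(
    csv,
    sep='|',
    header=0,
    dtype=file_meta,
    parse_dates=dates_cols,
    max_result_size=128 * 1024 * 1024,
)
for i, chunk in enumerate(df_iter):
    print(i, {col: str(chunk[col].dtype) for col in dates_cols})
```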
Hi,
> Does the total number of rows and columns between all the dataframes make sense?

Yes, the total number of rows and columns is intact.
> Are the data types between all the dataframes equal?

No. One of the date fields is classified as datetime64[ns] on the first iteration, then becomes object on the second iteration. For the dataframe created from read_csv() without max_result_size, the data type is classified as object.
In this case you should try casting to the most general type between the two -> object (string).
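For illustration, here is a minimal sketch of that casting approach, assuming the same variables as the snippet above (csv, file_meta, dates_cols, db_name, tbl_name, path) and using a hypothetical column name date_col for the affected field:

```python
import pandas as pd

# Re-read the file in chunks, exactly as in the original snippet.
df_iter = session.pandas.read_csv(
    csv,
    sep='|',
    header=0,
    dtype=file_meta,
    parse_dates=dates_cols,
    max_result_size=128 * 1024 * 1024,
)

df_data = pd.DataFrame()
for df in df_iter:
    # Cast the problematic column to the most general type (object/string)
    # in every chunk so all chunks agree. 'date_col' is a placeholder name.
    df['date_col'] = df['date_col'].astype(str)
    df_data = df_data.append(df)  # .append() as in the original snippet (pandas < 2.0)

session.pandas.to_parquet(
    dataframe=df_data,
    database=db_name,
    table=tbl_name,
    preserve_index=False,
    mode='overwrite',
    path=path,
)
```

With every chunk carrying the same object dtype for the date column, pa.Table.from_pandas should no longer hit the mixed Timestamp/bytes conversion error.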