Unexpected Behavior in read_csv() with max_result_size
I'm using read_csv() with a max_result_size of 128 MB to process a 135 MB CSV file. When I attempt to write Parquet, it returns the following error:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 815, in _data_to_s3_dataset_writer_remote
    isolated_dataframe=True))
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 758, in _data_to_s3_dataset_writer
    isolated_dataframe=isolated_dataframe)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 856, in _data_to_s3_object_writer
    isolated_dataframe=isolated_dataframe)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 902, in write_parquet_dataframe
    table = pa.Table.from_pandas(df=dataframe, preserve_index=preserve_index, safe=False)
  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/pyarrow/pandas_compat.py", line 501, in dataframe_to_arrays
    convert_fields))
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/pyarrow/pandas_compat.py", line 487, in convert_column
    raise e
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/pyarrow/pandas_compat.py", line 481, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 95, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ("Expected a bytes object, got a 'Timestamp' object", 'Conversion failed for column xxxxx with type object')
When I process the same file using read_csv() without max_result_size, it works fine.
Code Snippet:
df_iter = session.pandas.read_csv(csv, sep='|', header=0, dtype=file_meta, parse_dates=dates_cols, max_result_size=128*1024*1024)
for df in df_iter:
    df_data = df_data.append(df)
session.pandas.to_parquet(dataframe=df_data, database=db_name, table=tbl_name, preserve_index=False, mode='overwrite', path=path)
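As a hedged aside (nothing beyond the call above is assumed), one way to confirm the chunk-to-chunk dtype drift discussed in the comments below is to print each chunk's dtypes for the parsed date columns:

```python
# Diagnostic sketch: re-run the chunked read and report the dtype of each
# parsed date column per chunk (datetime64[ns] vs. an object fallback).
df_iter = session.pandas.read_csv(
    csv,
    sep='|',
    header=0,
    dtype=file_meta,
    parse_dates=dates_cols,
    max_result_size=128 * 1024 * 1024,
)
for i, chunk in enumerate(df_iter):
    print(i, {col: str(chunk[col].dtype) for col in dates_cols})
```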
Hi,
> Does the total number of rows and columns between all the dataframes make sense?

Yes, the total number of rows and columns is intact.
> Are the data types between all the dataframes equal?

No. One of the date fields is classified as datetime64[ns] on the first iteration, then becomes object on the second iteration. For the dataframe created from read_csv() without max_result_size, the data type is classified as object.
In this case you should try casting to the most general type between the two -> object (string).
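For illustration, here is a minimal sketch of that casting approach, assuming the same variables as the snippet above (csv, file_meta, dates_cols, db_name, tbl_name, path) and using a hypothetical column name date_col for the affected field:

```python
import pandas as pd

# Re-read the file in chunks, exactly as in the original snippet.
df_iter = session.pandas.read_csv(
    csv,
    sep='|',
    header=0,
    dtype=file_meta,
    parse_dates=dates_cols,
    max_result_size=128 * 1024 * 1024,
)

df_data = pd.DataFrame()
for df in df_iter:
    # Cast the problematic column to the most general type (object/string)
    # in every chunk so all chunks agree. 'date_col' is a placeholder name.
    df['date_col'] = df['date_col'].astype(str)
    df_data = df_data.append(df)  # .append() as in the original snippet (pandas < 2.0)

session.pandas.to_parquet(
    dataframe=df_data,
    database=db_name,
    table=tbl_name,
    preserve_index=False,
    mode='overwrite',
    path=path,
)
```

With every chunk carrying the same object dtype for the date column, pa.Table.from_pandas should no longer hit the mixed Timestamp/bytes conversion error.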