
Unexpected Behavior in read_csv() with max_result_size

See original GitHub issue

I’m using read_csv() with a max_result_size of 128 MB to process a 135 MB CSV file. When attempting to write Parquet, it returns the following error:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 815, in _data_to_s3_dataset_writer_remote
    isolated_dataframe=True))
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 758, in _data_to_s3_dataset_writer
    isolated_dataframe=isolated_dataframe)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 856, in _data_to_s3_object_writer
    isolated_dataframe=isolated_dataframe)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/awswrangler/pandas.py", line 902, in write_parquet_dataframe
    table = pa.Table.from_pandas(df=dataframe, preserve_index=preserve_index, safe=False)
  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/pyarrow/pandas_compat.py", line 501, in dataframe_to_arrays
    convert_fields))
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/pyarrow/pandas_compat.py", line 487, in convert_column
    raise e
  File "/home/root/environment/demo/lib64/python3.6/dist-packages/pyarrow/pandas_compat.py", line 481, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 95, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: ("Expected a bytes object, got a 'Timestamp' object", 'Conversion failed for column xxxxx with type object')

When I process the same file using read_csv() without max_result_size, it works fine.
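
For context on what follows in the comments: with max_result_size set, read_csv() parses the file in independent chunks, and pandas type inference runs separately on each chunk. A minimal standalone sketch of how that can give the same column a different dtype per chunk (hypothetical data, with plain pandas chunksize standing in for awswrangler's chunking):

import io
import pandas as pd

# Three clean dates, then one bad value. With chunksize=3 the bad value
# lands in the second chunk only.
csv_text = "ts\n2019-11-01\n2019-11-02\n2019-11-03\nnot-a-date\n"

for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=3, parse_dates=["ts"])):
    print(i, chunk["ts"].dtype)
# 0 datetime64[ns]  -- every value in this chunk parsed as a date
# 1 object          -- "not-a-date" defeats parsing; pandas falls back to object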

Code Snippet:

df_iter = session.pandas.read_csv(csv, sep='|', header=0, dtype=file_meta, parse_dates=dates_cols, max_result_size=128 * 1024 * 1024)

df_data = pd.DataFrame()  # accumulator must exist before the loop
for df in df_iter:
    df_data = df_data.append(df)

session.pandas.to_parquet(dataframe=df_data, database=db_name, table=tbl_name, preserve_index=False, mode='overwrite', path=path)
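
As an aside, growing a dataframe with append() in a loop copies all the accumulated rows on every iteration (and DataFrame.append was later deprecated and removed from pandas); the usual pattern is to concatenate the chunks once. A sketch reusing the names from the snippet above:

import pandas as pd

# pd.concat accepts any iterable of dataframes, including the chunk iterator.
df_data = pd.concat(df_iter, ignore_index=True)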

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
josecw commented, Nov 28, 2019

Hi,

  • Do the total number of rows and columns across all the dataframes make sense? Yes, the total number of rows and columns is intact.

  • Are the data types the same between all the dataframes? No. One of the date fields is classified as the datetime64[ns] data type on the first iteration, then becomes object on the second iteration. For the dataframe created by read_csv() without max_result_size, the data type is classified as object. (A diagnostic sketch follows this list.)
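
One hypothetical way to confirm this kind of drift yourself (assuming df_iter is a freshly created iterator from the read_csv() call in the snippet above, before any chunks have been consumed):

# Compare every chunk's dtypes against the first chunk's to spot columns
# whose inferred type changes between iterations.
first = next(df_iter)
expected = first.dtypes
for i, df in enumerate(df_iter, start=1):
    drift = df.dtypes[df.dtypes != expected]
    if not drift.empty:
        print("chunk {} dtype drift: {}".format(i, dict(drift)))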

0 reactions
igorborgest commented, Nov 29, 2019

In this case you should try to cast to the most general type shared by both -> object (string).
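
Applied to the snippet from the issue, that advice would look roughly like this (a sketch, not the library's own recipe; dates_cols, df_iter, and the to_parquet call are all from the original snippet):

import pandas as pd

# Cast the drifting date columns to the most general type (string) in every
# chunk, so the concatenated frame has one consistent dtype per column and
# the Arrow conversion no longer sees mixed Timestamp/str values.
chunks = []
for df in df_iter:
    for col in dates_cols:
        df[col] = df[col].astype(str)
    chunks.append(df)
df_data = pd.concat(chunks, ignore_index=True)

session.pandas.to_parquet(dataframe=df_data, database=db_name, table=tbl_name, preserve_index=False, mode='overwrite', path=path)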


Top Results From Across the Web

Unexpected Behavior in read_csv() with max_result_size #71
This problem is intrinsic to reading CSV files. Unfortunately CSV has neither a built-in schema nor hard data types. After all, every value on...

write.csv() extremely unexpected behavior - Stack Overflow
The below does the trick: do.call(data.frame, x) and write.csv2(x, file="xxxx.csv", row.names=FALSE).

Configuration - Spark 3.3.1 Documentation - Apache Spark
SparkConf allows you to configure some of the common properties (e.g. master URL and application name), as well as arbitrary key-value pairs through...

having difficulty in reading CSV file - RStudio Community
choose() in the console. A new window appears; select the file. The console shows the path or directory of the file. Copy...

Things I Wish I'd Known About Spark When I Started (One ...
"maxResultSize" could mean you've set your number of partitions too high and results can't fit onto a particular worker. "Column x is...
