EOF error when writing to S3
First, thanks for this awesome library! I was testing out some of the features yesterday and hit this error. The workflow was:
- Read a geospatial dataset off S3 using geopandas (see the sketch after the write snippet below)
- Create a GeoDataFrame (a subclass of a pandas DataFrame)
- Write to S3 using aws-data-wrangler, like so:
session = awswrangler.Session()
session.pandas.to_parquet(
    dataframe=final_gdf.astype({'geometry': str}),
    database='foo',
    table='bar',
    path="s3://bucket/path/to/file/",
    partition_cols=['partition_col'],
    compression='gzip',
    mode='overwrite'
)
It looks like it writes many files, but eventually I hit an EOFError:
---------------------------------------------------------------------------
EOFError Traceback (most recent call last)
<ipython-input-40-0f6d02d0cf6c> in <module>
5 partition_cols=['STATEFP', 'COUNTYFP', 'TRACTCE'],
6 compression='gzip',
----> 7 mode='overwrite'
8 )
/usr/local/lib/python3.7/site-packages/awswrangler/pandas.py in to_parquet(self, dataframe, path, database, table, partition_cols, preserve_index, mode, compression, procs_cpu_bound, procs_io_bound, cast_columns, inplace)
555 procs_io_bound=procs_io_bound,
556 cast_columns=cast_columns,
--> 557 inplace=inplace)
558
559 def to_s3(self,
/usr/local/lib/python3.7/site-packages/awswrangler/pandas.py in to_s3(self, dataframe, path, file_format, database, table, partition_cols, preserve_index, mode, compression, procs_cpu_bound, procs_io_bound, cast_columns, extra_args, inplace)
632 procs_io_bound=procs_io_bound,
633 cast_columns=cast_columns,
--> 634 extra_args=extra_args)
635 if database:
636 self._session.glue.metadata_to_glue(dataframe=dataframe,
/usr/local/lib/python3.7/site-packages/awswrangler/pandas.py in data_to_s3(self, dataframe, path, file_format, partition_cols, preserve_index, mode, compression, procs_cpu_bound, procs_io_bound, cast_columns, extra_args)
685 receive_pipes.append(receive_pipe)
686 for i in range(len(procs)):
--> 687 objects_paths += receive_pipes[i].recv()
688 procs[i].join()
689 receive_pipes[i].close()
/usr/local/lib/python3.7/multiprocessing/connection.py in recv(self)
248 self._check_closed()
249 self._check_readable()
--> 250 buf = self._recv_bytes()
251 return _ForkingPickler.loads(buf.getbuffer())
252
/usr/local/lib/python3.7/multiprocessing/connection.py in _recv_bytes(self, maxsize)
405
406 def _recv_bytes(self, maxsize=None):
--> 407 buf = self._recv(4)
408 size, = struct.unpack("!i", buf.getvalue())
409 if maxsize is not None and size > maxsize:
/usr/local/lib/python3.7/multiprocessing/connection.py in _recv(self, size, read)
381 if n == 0:
382 if remaining == size:
--> 383 raise EOFError
384 else:
385 raise OSError("got end of file during message")
EOFError:
Looks like the connection is getting reset somehow? I’ll note that I’m testing this in a Jupyter notebook running in a Docker container.
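For what it’s worth, this EOFError is what multiprocessing’s Connection.recv raises when the process on the other end of the pipe dies before sending anything, so a worker being killed (for example by the container’s memory limit) would surface exactly like this. A toy sketch, not awswrangler code, that reproduces the same exception:

import multiprocessing as mp

def worker(conn):
    # Simulate a worker that dies (e.g. OOM-killed) before sending its result.
    raise SystemExit(1)

if __name__ == "__main__":
    parent_end, child_end = mp.Pipe()
    proc = mp.Process(target=worker, args=(child_end,))
    proc.start()
    child_end.close()   # the parent drops its copy of the child's end
    proc.join()
    parent_end.recv()   # raises EOFError: pipe closed before anything was sent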
Current configuration:
- macOS with a Docker container running Debian (stretch)
- Python 3.7
- awswrangler 0.0.12
Any ideas what’s going on here?
Hi @koshy1123, thanks for contributing here!
Peculiar case; I’ve never faced that before, so I’ll need more time to understand and troubleshoot it.
But in the meantime, could you test the same call without parallelism (procs_cpu_bound=1)?
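That is, the same call as above with the multiprocessing fan-out disabled (procs_cpu_bound appears in the to_parquet signature in your traceback):

session.pandas.to_parquet(
    dataframe=final_gdf.astype({'geometry': str}),
    database='foo',
    table='bar',
    path="s3://bucket/path/to/file/",
    partition_cols=['partition_col'],
    compression='gzip',
    mode='overwrite',
    procs_cpu_bound=1  # force a single writer process
)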
I got parallelization working after casting the geometry column to str and forcing the dataframe to be an instance of a plain pandas DataFrame, à la the snippet below.
Edit: This works some of the time 😕
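Roughly, the workaround looked like this (a sketch of what was described above, reusing the placeholder table and path from the first snippet):

import pandas as pd

# Cast the geometry column to str, then downcast the GeoDataFrame to a
# plain pandas DataFrame before handing it to awswrangler.
plain_df = pd.DataFrame(final_gdf.astype({'geometry': str}))

session.pandas.to_parquet(
    dataframe=plain_df,
    database='foo',
    table='bar',
    path="s3://bucket/path/to/file/",
    partition_cols=['partition_col'],
    compression='gzip',
    mode='overwrite'
)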