
EOF error when writing to S3

See the original GitHub issue (aws/aws-sdk-pandas #52)

First, thanks for this awesome library! I was testing out some of the features yesterday and hit this error. The workflow was:

  • Read a geospatial dataset off S3 using geopandas
  • Create a geodataframe (a special case of a pandas dataframe)
  • Write to S3 using aws-data-wrangler like so:
session = awswrangler.Session()
session.pandas.to_parquet(
    dataframe=final_gdf.astype({'geometry': str}),
    database='foo',
    table='bar',
    path="s3://bucket/path/to/file/",
    partition_cols=['partition_col'],
    compression='gzip',
    mode='overwrite'
)

It looks like it writes many files, but eventually I hit an EOFError:

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
<ipython-input-40-0f6d02d0cf6c> in <module>
      5     partition_cols=['STATEFP', 'COUNTYFP', 'TRACTCE'],
      6     compression='gzip',
----> 7     mode='overwrite'
      8 )

/usr/local/lib/python3.7/site-packages/awswrangler/pandas.py in to_parquet(self, dataframe, path, database, table, partition_cols, preserve_index, mode, compression, procs_cpu_bound, procs_io_bound, cast_columns, inplace)
    555                           procs_io_bound=procs_io_bound,
    556                           cast_columns=cast_columns,
--> 557                           inplace=inplace)
    558 
    559     def to_s3(self,

/usr/local/lib/python3.7/site-packages/awswrangler/pandas.py in to_s3(self, dataframe, path, file_format, database, table, partition_cols, preserve_index, mode, compression, procs_cpu_bound, procs_io_bound, cast_columns, extra_args, inplace)
    632                                         procs_io_bound=procs_io_bound,
    633                                         cast_columns=cast_columns,
--> 634                                         extra_args=extra_args)
    635         if database:
    636             self._session.glue.metadata_to_glue(dataframe=dataframe,

/usr/local/lib/python3.7/site-packages/awswrangler/pandas.py in data_to_s3(self, dataframe, path, file_format, partition_cols, preserve_index, mode, compression, procs_cpu_bound, procs_io_bound, cast_columns, extra_args)
    685                 receive_pipes.append(receive_pipe)
    686             for i in range(len(procs)):
--> 687                 objects_paths += receive_pipes[i].recv()
    688                 procs[i].join()
    689                 receive_pipes[i].close()

/usr/local/lib/python3.7/multiprocessing/connection.py in recv(self)
    248         self._check_closed()
    249         self._check_readable()
--> 250         buf = self._recv_bytes()
    251         return _ForkingPickler.loads(buf.getbuffer())
    252 

/usr/local/lib/python3.7/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    405 
    406     def _recv_bytes(self, maxsize=None):
--> 407         buf = self._recv(4)
    408         size, = struct.unpack("!i", buf.getvalue())
    409         if maxsize is not None and size > maxsize:

/usr/local/lib/python3.7/multiprocessing/connection.py in _recv(self, size, read)
    381             if n == 0:
    382                 if remaining == size:
--> 383                     raise EOFError
    384                 else:
    385                     raise OSError("got end of file during message")

EOFError: 

Looks like the connection is getting reset somehow? I’ll note that I’m testing this in a Jupyter notebook running in a Docker container.
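For context on the traceback: `data_to_s3` fans work out to child processes and collects results over `multiprocessing.Pipe` objects. A bare `EOFError` from `Connection.recv()` means the write end of the pipe was closed before anything was sent, which typically happens when a child process crashes or is killed (for instance by the OOM killer inside a memory-constrained Docker container) before it can report back. A minimal, standard-library-only reproduction of that failure mode (this is a sketch, not awswrangler code):

```python
import multiprocessing as mp

def worker(conn):
    # Simulate a worker that dies (or crashes) before sending its result.
    conn.close()

if __name__ == "__main__":
    recv_conn, send_conn = mp.Pipe(duplex=False)
    proc = mp.Process(target=worker, args=(send_conn,))
    proc.start()
    send_conn.close()  # parent must close its copy, or recv() would block forever
    try:
        recv_conn.recv()  # same call as receive_pipes[i].recv() in the traceback
    except EOFError:
        print("EOFError: the child closed the pipe without sending anything")
    proc.join()
```

So the `EOFError` itself is only a symptom; the interesting question is why the child process died, which is consistent with the failure being intermittent.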

Current configuration:

  • macOS with a Docker container running Debian (stretch)
  • Python 3.7
  • awswrangler 0.0.12

Any ideas what’s going on here?

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
igorborgest commented, Oct 24, 2019

Hi @koshy1123, thanks for contributing here!

Peculiar case; I’ve never faced that before, so I will need more time to understand and troubleshoot it.

But in the meantime, could you test the same call without parallelism (procs_cpu_bound=1)?

session = awswrangler.Session()
session.pandas.to_parquet(
    dataframe=final_gdf.astype({'geometry': str}),
    database='foo',
    table='bar',
    path="s3://bucket/path/to/file/",
    partition_cols=['partition_col'],
    compression='gzip',
    mode='overwrite',
    procs_cpu_bound=1
)

0 reactions
koshy1123 commented, Oct 25, 2019

I got parallelization working after casting the geometry column to str and forcing the dataframe to be an instance of a plain pandas DataFrame, like so:

df = df.astype({'geometry': str})
df = pandas.DataFrame(df)

Edit: This works some of the time 😕
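For what it’s worth, the workaround above can be sketched against a plain pandas DataFrame (the column names here are hypothetical stand-ins; in the real workflow the geometry column holds shapely objects from geopandas). Casting to str and re-wrapping in a base `pandas.DataFrame` means everything the worker processes pickle back through the pipes is a plain pandas type:

```python
import pandas as pd

# Hypothetical stand-in for a GeoDataFrame; in the real workflow the
# 'geometry' column would hold shapely geometry objects.
df = pd.DataFrame({"geometry": [object(), object()], "value": [1, 2]})

# Cast the geometry column to str, then drop any DataFrame subclass by
# re-wrapping in a plain pandas.DataFrame before handing it to awswrangler.
df = pd.DataFrame(df.astype({"geometry": str}))

assert type(df) is pd.DataFrame
assert all(isinstance(v, str) for v in df["geometry"])
```

Since the failure is intermittent even with this cast, the cast may only be reducing the per-row payload rather than removing the underlying cause.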

