EOF error when writing to S3
First, thanks for this awesome library! I was testing out some of the features yesterday and hit this error. The workflow was:
- Read a geospatial dataset off S3 using geopandas (see the sketch after the write snippet below)
- Create a GeoDataFrame (a subclass of a pandas DataFrame)
- Write to S3 using aws-data-wrangler, like so:
session = awswrangler.Session()
session.pandas.to_parquet(
    dataframe=final_gdf.astype({'geometry': str}),
    database='foo',
    table='bar',
    path="s3://bucket/path/to/file/",
    partition_cols=['partition_col'],
    compression='gzip',
    mode='overwrite'
)
It looks like it writes many files, but eventually I hit an EOFError:
---------------------------------------------------------------------------
EOFError Traceback (most recent call last)
<ipython-input-40-0f6d02d0cf6c> in <module>
5 partition_cols=['STATEFP', 'COUNTYFP', 'TRACTCE'],
6 compression='gzip',
----> 7 mode='overwrite'
8 )
/usr/local/lib/python3.7/site-packages/awswrangler/pandas.py in to_parquet(self, dataframe, path, database, table, partition_cols, preserve_index, mode, compression, procs_cpu_bound, procs_io_bound, cast_columns, inplace)
555 procs_io_bound=procs_io_bound,
556 cast_columns=cast_columns,
--> 557 inplace=inplace)
558
559 def to_s3(self,
/usr/local/lib/python3.7/site-packages/awswrangler/pandas.py in to_s3(self, dataframe, path, file_format, database, table, partition_cols, preserve_index, mode, compression, procs_cpu_bound, procs_io_bound, cast_columns, extra_args, inplace)
632 procs_io_bound=procs_io_bound,
633 cast_columns=cast_columns,
--> 634 extra_args=extra_args)
635 if database:
636 self._session.glue.metadata_to_glue(dataframe=dataframe,
/usr/local/lib/python3.7/site-packages/awswrangler/pandas.py in data_to_s3(self, dataframe, path, file_format, partition_cols, preserve_index, mode, compression, procs_cpu_bound, procs_io_bound, cast_columns, extra_args)
685 receive_pipes.append(receive_pipe)
686 for i in range(len(procs)):
--> 687 objects_paths += receive_pipes[i].recv()
688 procs[i].join()
689 receive_pipes[i].close()
/usr/local/lib/python3.7/multiprocessing/connection.py in recv(self)
248 self._check_closed()
249 self._check_readable()
--> 250 buf = self._recv_bytes()
251 return _ForkingPickler.loads(buf.getbuffer())
252
/usr/local/lib/python3.7/multiprocessing/connection.py in _recv_bytes(self, maxsize)
405
406 def _recv_bytes(self, maxsize=None):
--> 407 buf = self._recv(4)
408 size, = struct.unpack("!i", buf.getvalue())
409 if maxsize is not None and size > maxsize:
/usr/local/lib/python3.7/multiprocessing/connection.py in _recv(self, size, read)
381 if n == 0:
382 if remaining == size:
--> 383 raise EOFError
384 else:
385 raise OSError("got end of file during message")
EOFError:
Looks like the connection is getting reset somehow? I’ll note that I’m testing this in a Jupyter notebook running in a Docker container.
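For what it’s worth, this EOFError is what multiprocessing’s Connection.recv raises when the process on the other end of the pipe dies before sending anything, so a worker being killed (for example by the container’s memory limit) would surface exactly like this. A toy sketch, not awswrangler code, that reproduces the same exception:

import multiprocessing as mp

def worker(conn):
    # Simulate a worker that dies (e.g. OOM-killed) before sending its result.
    raise SystemExit(1)

if __name__ == "__main__":
    parent_end, child_end = mp.Pipe()
    proc = mp.Process(target=worker, args=(child_end,))
    proc.start()
    child_end.close()   # the parent drops its copy of the child's end
    proc.join()
    parent_end.recv()   # raises EOFError: pipe closed before anything was sent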
Current configuration:
- macOS with a Docker container running Debian (stretch)
- Python 3.7
- awswrangler 0.0.12
Any ideas what’s going on here?
Hi @koshy1123, thanks for contributing here!
Peculiar case; I’ve never faced that before, so I’ll need more time to understand and troubleshoot it.
But in the meantime, could you test the same call without parallelism (procs_cpu_bound=1)?
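That is, the same call as above with the multiprocessing fan-out disabled (procs_cpu_bound appears in the to_parquet signature in your traceback):

session.pandas.to_parquet(
    dataframe=final_gdf.astype({'geometry': str}),
    database='foo',
    table='bar',
    path="s3://bucket/path/to/file/",
    partition_cols=['partition_col'],
    compression='gzip',
    mode='overwrite',
    procs_cpu_bound=1  # force a single writer process
)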
I got parallelization working after casting the geometry column to str and forcing the dataframe to be an instance of a plain pandas DataFrame, à la the snippet below.
Edit: This works some of the time 😕
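Roughly, the workaround looked like this (a sketch of what was described above, reusing the placeholder table and path from the first snippet):

import pandas as pd

# Cast the geometry column to str, then downcast the GeoDataFrame to a
# plain pandas DataFrame before handing it to awswrangler.
plain_df = pd.DataFrame(final_gdf.astype({'geometry': str}))

session.pandas.to_parquet(
    dataframe=plain_df,
    database='foo',
    table='bar',
    path="s3://bucket/path/to/file/",
    partition_cols=['partition_col'],
    compression='gzip',
    mode='overwrite'
)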