
Can't load file from s3 bucket

See original GitHub issue

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS (Darwin 18.7.0)
  • Modin version (modin.__version__): ‘0.7.4’
  • Python version: 3.7
  • Code we can use to reproduce:
import modin.pandas as mpd
import pandas as pd

path = "s3://bucket_name/data/dataframe.snappy.parquet"
df = pd.read_parquet(path)    # works
df2 = mpd.read_parquet(path)  # raises FileNotFoundError (see log below)

Describe the problem

I can’t load snappy.parquet files from S3, while plain pandas reads the same path fine. Does Modin support snappy.parquet files?
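
One possible workaround while this is open, based only on the observation above that plain pandas reads the path fine: load the file with pandas and wrap the result in a Modin DataFrame. This is a sketch rather than a proper fix, and it gives up Modin's parallel read of the file.

import pandas as pd
import modin.pandas as mpd

path = "s3://bucket_name/data/dataframe.snappy.parquet"

# Plain pandas handles the s3:// URL (as reported above), so read with it
# first, then hand the result to Modin to keep the rest of the workflow
# on the Modin API.
pdf = pd.read_parquet(path)
df = mpd.DataFrame(pdf)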

Source code / logs

Error log:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-14-814dc08ef229> in <module>
----> 1 df2 = mpd.read_parquet(path)

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/modin/pandas/io.py in read_parquet(path, engine, columns, **kwargs)
     40     return DataFrame(
     41         query_compiler=EngineDispatcher.read_parquet(
---> 42             path=path, columns=columns, engine=engine, **kwargs
     43         )
     44     )

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/modin/data_management/dispatcher.py in read_parquet(cls, **kwargs)
    105     @classmethod
    106     def read_parquet(cls, **kwargs):
--> 107         return cls.__engine._read_parquet(**kwargs)
    108 
    109     @classmethod

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/modin/data_management/factories.py in _read_parquet(cls, **kwargs)
     46     @classmethod
     47     def _read_parquet(cls, **kwargs):
---> 48         return cls.io_cls.read_parquet(**kwargs)
     49 
     50     @classmethod

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/modin/engines/base/io/file_reader.py in read(cls, *args, **kwargs)
     27     @classmethod
     28     def read(cls, *args, **kwargs):
---> 29         query_compiler = cls._read(*args, **kwargs)
     30         # TODO (devin-petersohn): Make this section more general for non-pandas kernel
     31         # implementations.

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/modin/engines/base/io/column_stores/parquet_reader.py in _read(cls, path, engine, columns, **kwargs)
     68                 column_names = pd.schema.names
     69             else:
---> 70                 meta = ParquetFile(path).metadata
     71                 column_names = meta.schema.names
     72             if meta is not None:

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata, read_dictionary, memory_map, buffer_size)
    135         self.reader.open(source, use_memory_map=memory_map,
    136                          buffer_size=buffer_size,
--> 137                          read_dictionary=read_dictionary, metadata=metadata)
    138         self.common_metadata = common_metadata
    139         self._nested_paths_by_prefix = self._build_nested_paths()

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.open()

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/pyarrow/io.pxi in pyarrow.lib.get_reader()

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/pyarrow/io.pxi in pyarrow.lib._get_native_file()

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/pyarrow/io.pxi in pyarrow.lib.OSFile.__cinit__()

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/pyarrow/io.pxi in pyarrow.lib.OSFile._open_readable()

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/anaconda3/envs/myenv/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

FileNotFoundError: [Errno 2] Failed to open local file 's3://bucket_name/data/dataframe.snappy.parquet'. Detail: [errno 2] No such file or directory
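
The traceback points at the root cause: parquet_reader.py passes the raw s3:// string straight to pyarrow.parquet.ParquetFile, which resolves it as a local OSFile and therefore raises FileNotFoundError. Handing pyarrow an already opened file object avoids that local-path lookup. The sketch below only illustrates that point, not Modin's eventual fix; it assumes s3fs is installed and AWS credentials are configured.

import s3fs
from pyarrow.parquet import ParquetFile

path = "s3://bucket_name/data/dataframe.snappy.parquet"

# What the traceback shows Modin doing -- the URL is treated as a local path:
#   ParquetFile(path)  ->  FileNotFoundError: Failed to open local file ...

# Opening the object through s3fs first gives pyarrow a readable handle,
# so the metadata/schema read that failed above succeeds.
fs = s3fs.S3FileSystem()
with fs.open(path, "rb") as f:
    meta = ParquetFile(f).metadata
    column_names = meta.schema.names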

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
devin-petersohn commented, Jul 22, 2020

@DenisVorotyntsev I am going to reopen this so we don’t lose track of it 😄

0 reactions
prutskov commented, Sep 23, 2020

I’ll try to use open_file for parquet files.
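
The open_file helper mentioned here is not shown in this thread, so the snippet below only sketches the general idea using fsspec (the helper name and exact API are assumptions, not Modin's actual code): resolve the URL through a filesystem layer so pyarrow gets a file handle instead of a path string. For s3:// URLs this still requires s3fs to be installed.

import fsspec
from pyarrow.parquet import ParquetFile

def read_parquet_column_names(path):
    # fsspec dispatches on the URL scheme (s3://, file://, ...) and returns
    # a file-like object that pyarrow can read the footer/metadata from.
    with fsspec.open(path, "rb") as f:
        return ParquetFile(f).metadata.schema.names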

Read more comments on GitHub >

Top Results From Across the Web

Resolve errors uploading data to or downloading data ... - AWS
Your file does not exist. Confirm that the file exists in your S3 bucket, and that the name you specified in your script...

Troubleshoot Amazon S3 content loading issue - AWS re:Post
I'm using an Amazon Simple Storage Service (Amazon S3) bucket to store content for my website. A user from another AWS account uploaded...

unable to read large csv file from s3 bucket to python
Make sure the region of the S3 bucket is the same as your AWS configure. · Make sure the...

Can't download individual files from S3 bucket #13586
Connect to the S3 bucket · Select a directory/folder · Right click > Download.

Bulk Loading from Amazon S3 - Snowflake Documentation
If the S3 bucket referenced by your external stage is in the same region as your Snowflake account, your network traffic does not...
