Error reading parquet files made by AWS Athena
I made a bunch of parquet files using an Amazon Athena CTAS query. I downloaded these files to test locally first (the end goal is to access the data from S3).
If I run the code below:
import s3fs
from petastorm.reader import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# Local copy of the files produced by the Athena CTAS query
dataset_url = "file:///Data/test-parquet"

with make_batch_reader(dataset_url) as reader:
    dataset = make_petastorm_dataset(reader)
    # Pull a single batch, then stop
    for batch in dataset:
        break

batch.correct
I receive a lot of warnings and then an error at "for batch in dataset":
pyarrow.lib.ArrowIOError: The file only has 1 row groups, requested metadata for row group: 1
If I look at dataset.take(1) or something similar, I do see the correct schema of the table. However, I don’t seem to be able to access the data.
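One way to narrow this down is to check how many row groups each downloaded file actually contains, since the error reports a request for row group 1 from a file that only has one row group (index 0). The following is a minimal sketch, assuming pyarrow is installed and the files sit directly in /Data/test-parquet. The function names row_group_counts and is_valid_row_group are hypothetical helpers for illustration and are not part of petastorm.

```python
from pathlib import Path


def row_group_counts(dataset_dir):
    """Return {file name: number of row groups} for each file in dataset_dir."""
    # Imported lazily so the module loads even where pyarrow is missing.
    import pyarrow.parquet as pq

    counts = {}
    for path in sorted(Path(dataset_dir).iterdir()):
        # Skip sub-directories and marker/hidden files (e.g. _SUCCESS, .crc).
        if not path.is_file() or path.name.startswith(("_", ".")):
            continue
        counts[path.name] = pq.ParquetFile(str(path)).metadata.num_row_groups
    return counts


def is_valid_row_group(index, num_row_groups):
    """Row groups are zero-indexed: a file with N groups accepts 0..N-1."""
    return 0 <= index < num_row_groups


# Example usage (path assumed from the dataset_url in the report):
# for name, n in row_group_counts("/Data/test-parquet").items():
#     print(f"{name}: {n} row group(s)")
```

If every file reports a single row group, a request for index 1 is out of range for all of them, which would suggest the reader is building its row-group list incorrectly rather than the files being damaged.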
Issue Analytics
- Created 4 years ago
- Comments:14
Top GitHub Comments
Thank you. This is a serious bug. I really appreciate you looking into this. I am in the middle of 0.8.1 release and will add this fix to the release. If you don’t mind, I’ll add you to the reviewers of the fix.
I’ve run a couple of tests and it appears to be working. I’m continuing to test with a full train and I’ll report back by the end of the week once it has run.