`datasets` can't read a Parquet file in Python 3.9.13
Describe the bug
I have an error when trying to load this dataset (it's private, but I can add you to the bigcode org). `datasets` can't read one of the Parquet files in the Java subset:
```python
from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack-dedup-pjj",
    data_dir="data/java",
    split="train",
    revision="v1.1.a1",
    use_auth_token=True,
)
```
```
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
```
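The error means PyArrow could not find the `PAR1` marker that every valid Parquet file carries at its start and end. As a quick sanity check on a downloaded shard, a small helper (a sketch, not part of `datasets` or PyArrow) can test a local file for those magic bytes:

```python
# Sketch: check whether a local file carries the Parquet magic bytes ("PAR1")
# at both ends, which is what PyArrow validates before reading the footer.
def has_parquet_magic(path: str) -> bool:
    """Return True if the file starts and ends with the b'PAR1' marker."""
    with open(path, "rb") as f:
        header = f.read(4)
        f.seek(-4, 2)  # seek to 4 bytes before end-of-file
        footer = f.read(4)
    return header == b"PAR1" and footer == b"PAR1"
```

A file that fails this check was truncated or corrupted at download time rather than mis-parsed by PyArrow.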
It seems to be an issue with newer Python versions, because loading works in these two environments:
- `datasets` version: 2.6.1
- Platform: Linux-5.4.0-131-generic-x86_64-with-glibc2.31
- Python version: 3.9.7
- PyArrow version: 9.0.0
- Pandas version: 1.3.4
- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- PyArrow version: 9.0.0
- Pandas version: 1.3.4
But not in this one:
- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.3.4
Steps to reproduce the bug
Load the dataset under Python 3.9.13.
Expected behavior
The dataset loads without the PyArrow error.
Environment info
- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.3.4
Issue Analytics
- Created: 10 months ago
- Comments: 15 (7 by maintainers)
Top GitHub Comments
Cool!
We don't perform integrity verifications if we don't know the hash of the file to download in advance. `datasets` caches the files by URL and ETag. If the content of a file changes, then the ETag changes and the file is re-downloaded.

I think you have to try them all 😕
Alternatively, you can add a try/except in `parquet.py` in `datasets` to report the name of the file that fails at `parquet_file = pq.ParquetFile(f)` when you run your initial code, but it will still iterate over all the files until it fails.