
`datasets` can't read a Parquet file in Python 3.9.13

See original GitHub issue

Describe the bug

I get an error when trying to load this dataset (it's private, but I can add you to the bigcode org). `datasets` can't read one of the Parquet files in the Java subset:

```python
from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-dedup-pjj", data_dir="data/java", split="train", revision="v1.1.a1", use_auth_token=True)
```

This raises:

```
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
```
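For context on what that error means: a valid Parquet file begins and ends with the 4-byte marker `PAR1`, and "magic bytes not found in footer" means the trailing marker is missing, i.e. the file is truncated or isn't Parquet at all. A minimal sketch for checking a local file by hand (the function name is ours, not part of `datasets` or PyArrow):

```python
def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # seek to the last 4 bytes of the file
        tail = f.read(4)
    # A well-formed Parquet file is framed by b"PAR1" on both ends.
    return head == b"PAR1" and tail == b"PAR1"
```

A file that fails this check was most likely cut short during download, which matches the caching discussion in the comments below.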

It seems to be an issue with newer Python versions, because it works in these two environments:

Environment 1:

- `datasets` version: 2.6.1
- Platform: Linux-5.4.0-131-generic-x86_64-with-glibc2.31
- Python version: 3.9.7
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

Environment 2:

- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

But not in this one:

- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

Steps to reproduce the bug

Load the dataset in Python 3.9.13.

Expected behavior

The dataset loads without the PyArrow error.

Environment info

- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments: 15 (7 by maintainers)

Top GitHub Comments

lhoestq commented on Nov 22, 2022 (1 reaction)

Cool!

> But I thought if something went wrong with a download, datasets creates a new cache for all the files

We don't perform integrity verifications if we don't know the hash of the file to download in advance.

> at some point I even changed dataset versions, so it was still using that cache?

`datasets` caches the files by URL and ETag. If the content of a file changes, then the ETag changes and so it redownloads the file.
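To illustrate the ETag mechanism in general terms (this is a sketch of the HTTP concept, not the internal `datasets` code): a client records the `ETag` header the server returns for a URL, and a cached copy is considered stale whenever the server later reports a different ETag. The function names below are our own placeholders:

```python
import urllib.request

def fetch_etag(url):
    """Ask the server for the current ETag of a URL via a HEAD request."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("ETag")

def is_stale(cached_etag, current_etag):
    """A cached copy is stale when the server-reported ETag has changed."""
    return cached_etag != current_etag
```

The corollary for this issue: if a file was truncated mid-download, the cached ETag still matches the server's, so the broken cached copy is reused rather than redownloaded.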

lhoestq commented on Nov 21, 2022 (1 reaction)

I think you have to try them all 😕

Alternatively, you can add a try/except in `parquet.py` in `datasets` to raise the name of the file that fails at `parquet_file = pq.ParquetFile(f)` when you run your initial code:

```python
load_dataset("bigcode/the-stack-dedup-pjj", data_dir="data/java", split="train", revision="v1.1.a1", use_auth_token=True)
```

but it will still iterate over all the files until it fails.
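The suggested try/except could also be run as a standalone script over the cached files instead of patching `datasets` itself. A minimal sketch, assuming you can collect the downloaded file paths yourself (the function name and the injectable `open_fn` parameter are ours, for testing without PyArrow installed):

```python
def find_broken_parquet(paths, open_fn=None):
    """Return the first path that PyArrow rejects, or None if all open cleanly."""
    if open_fn is None:
        # Default to the real PyArrow reader, mirroring pq.ParquetFile(f)
        # from the comment above.
        import pyarrow.parquet as pq
        open_fn = pq.ParquetFile
    for path in paths:
        try:
            open_fn(path)
        except Exception as err:
            print(f"Failed to read {path}: {err}")
            return path
    return None
```

As the comment notes, this still has to iterate over every file until it hits the broken one, but it at least tells you which file to redownload.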

