
`datasets` can't read a Parquet file in Python 3.9.13

See original GitHub issue

Describe the bug

I get an error when trying to load this dataset (it's private, but I can add you to the bigcode org). `datasets` can't read one of the Parquet files in the Java subset:

```python
from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-dedup-pjj", data_dir="data/java", split="train", revision="v1.1.a1", use_auth_token=True)
```

This raises:

```
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
```
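For context on what that error means: a valid Parquet file begins and ends with the 4-byte marker `PAR1`, and "magic bytes not found in footer" means the trailing marker is missing, i.e. the file is truncated or isn't Parquet at all. A minimal sketch for checking a local file by hand (the function name is ours, not part of `datasets` or PyArrow):

```python
def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # seek to the last 4 bytes of the file
        tail = f.read(4)
    # A well-formed Parquet file is framed by b"PAR1" on both ends.
    return head == b"PAR1" and tail == b"PAR1"
```

A file that fails this check was most likely cut short during download, which matches the caching discussion in the comments below.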

It seems to be an issue with newer Python versions, because it works in these two environments:

Environment 1:

- `datasets` version: 2.6.1
- Platform: Linux-5.4.0-131-generic-x86_64-with-glibc2.31
- Python version: 3.9.7
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

Environment 2:

- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-debian-10.13
- Python version: 3.7.12
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

But not in this one:

- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

Steps to reproduce the bug

Load the dataset in Python 3.9.13.

Expected behavior

The dataset loads without the PyArrow error.

Environment info

- `datasets` version: 2.6.1
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- PyArrow version: 9.0.0
- Pandas version: 1.3.4

Issue Analytics

  • State: closed
  • Created 10 months ago
  • Comments: 15 (7 by maintainers)

Top GitHub Comments

lhoestq commented on Nov 22, 2022 (1 reaction)

Cool!

> But I thought if something went wrong with a download, datasets creates a new cache for all the files

We don't perform integrity verifications if we don't know the hash of the file to download in advance.

> at some point I even changed dataset versions, so it was still using that cache?

`datasets` caches the files by URL and ETag. If the content of a file changes, then the ETag changes and so it redownloads the file.
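To illustrate the ETag mechanism in general terms (this is a sketch of the HTTP concept, not the internal `datasets` code): a client records the `ETag` header the server returns for a URL, and a cached copy is considered stale whenever the server later reports a different ETag. The function names below are our own placeholders:

```python
import urllib.request

def fetch_etag(url):
    """Ask the server for the current ETag of a URL via a HEAD request."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("ETag")

def is_stale(cached_etag, current_etag):
    """A cached copy is stale when the server-reported ETag has changed."""
    return cached_etag != current_etag
```

The corollary for this issue: if a file was truncated mid-download, the cached ETag still matches the server's, so the broken cached copy is reused rather than redownloaded.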

lhoestq commented on Nov 21, 2022 (1 reaction)

I think you have to try them all 😕

Alternatively, you can add a try/except in `parquet.py` in `datasets` to raise the name of the file that fails at `parquet_file = pq.ParquetFile(f)` when you run your initial code:

```python
load_dataset("bigcode/the-stack-dedup-pjj", data_dir="data/java", split="train", revision="v1.1.a1", use_auth_token=True)
```

but it will still iterate over all the files until it fails.
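The suggested try/except could also be run as a standalone script over the cached files instead of patching `datasets` itself. A minimal sketch, assuming you can collect the downloaded file paths yourself (the function name and the injectable `open_fn` parameter are ours, for testing without PyArrow installed):

```python
def find_broken_parquet(paths, open_fn=None):
    """Return the first path that PyArrow rejects, or None if all open cleanly."""
    if open_fn is None:
        # Default to the real PyArrow reader, mirroring pq.ParquetFile(f)
        # from the comment above.
        import pyarrow.parquet as pq
        open_fn = pq.ParquetFile
    for path in paths:
        try:
            open_fn(path)
        except Exception as err:
            print(f"Failed to read {path}: {err}")
            return path
    return None
```

As the comment notes, this still has to iterate over every file until it hits the broken one, but it at least tells you which file to redownload.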

