Dataset order is not deterministic with ZIP archives and `iter_files`
Describe the bug
For the beans dataset (did not try on others), the order of samples is not the same on different machines. Tested on my local laptop, a GitHub Actions machine, and an EC2 instance. The three yield a different order.
Steps to reproduce the bug
In a clean Docker container or conda environment with `datasets==2.6.1`, run:
from datasets import load_dataset
from pprint import pprint
data = load_dataset("beans", split="validation")
pprint(data["image_file_path"])
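Comparing long printed lists by eye across three machines is error-prone, so it may help to reduce the order to a single short digest. The helper below is only a sketch: `order_fingerprint` and its `keep_parts` parameter are hypothetical names, not part of `datasets`, and it assumes the paths differ between machines only in their cache prefix, so it hashes just the last two path components.

```python
import hashlib
from pathlib import PurePosixPath


def order_fingerprint(paths, keep_parts=2):
    """Return a SHA-256 hex digest that depends on the order of `paths`.

    Only the last `keep_parts` components of each path are hashed, so a
    machine-specific cache prefix does not change the result (assumption:
    the class folder and file name are the same on every machine).
    """
    digest = hashlib.sha256()
    for path in paths:
        tail = "/".join(PurePosixPath(path).parts[-keep_parts:])
        digest.update(tail.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent entries cannot merge
    return digest.hexdigest()
```

With the snippet above, `print(order_fingerprint(data["image_file_path"]))` can be run on each machine; differing digests would confirm that the sample order differs.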
Expected behavior
The order of the images is the same on all machines.
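One plausible source of the difference, stated here as an assumption rather than a confirmed diagnosis, is that files extracted from the ZIP archive are enumerated in whatever order the filesystem returns them, which `os.walk` does not guarantee. A minimal sketch of a deterministic walk follows; `iter_files_sorted` is a hypothetical name and is not the library's `iter_files`.

```python
import os


def iter_files_sorted(root):
    """Yield every file path under `root` in a reproducible order.

    os.walk returns directory entries in filesystem order, which can vary
    between machines. Sorting the names at each level removes that variation.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in place, so os.walk descends in sorted order
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

Sorting `dirnames` in place matters because `os.walk` (in its default top-down mode) consults that same list to decide which subdirectories to visit next.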
Environment info
On the EC2 instance:
- `datasets` version: 2.6.1
- Platform: Linux-4.14.291-218.527.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.7.10
- PyArrow version: 9.0.0
- Pandas version: 1.3.5
- Numpy version: not checked
On my local laptop:
- `datasets` version: 2.6.1
- Platform: Linux-5.15.0-50-generic-x86_64-with-glibc2.35
- Python version: 3.9.12
- PyArrow version: 7.0.0
- Pandas version: 1.3.5
- Numpy version: 1.23.1
On GitHub Actions:
- `datasets` version: 2.6.1
- Platform: Linux-5.15.0-1022-azure-x86_64-with-glibc2.2.5
- Python version: 3.8.14
- PyArrow version: 9.0.0
- Pandas version: 1.5.1
- Numpy version: 1.23.4
Issue Analytics
- State:
- Created a year ago
- Comments:8 (8 by maintainers)
Top GitHub Comments
@albertvillanova Thanks for the fix!
This is still a bug, so I’d keep this one open if you don’t mind 😉