Dataset order is not deterministic with ZIP archives and `iter_files`

Describe the bug

For the beans dataset (I did not try others), the order of samples differs between machines. I tested on my local laptop, a GitHub Actions runner, and an EC2 instance, and each of the three yields a different order.

Steps to reproduce the bug

In a clean Docker container or conda environment with datasets==2.6.1, run:

from datasets import load_dataset
from pprint import pprint

data = load_dataset("beans", split="validation")

pprint(data["image_file_path"])
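
To compare the order across machines without eyeballing the printed list, one option is to reduce the paths to a short fingerprint. The sketch below is not part of the original report: `order_fingerprint` is a hypothetical helper, and it hashes only the last two path components on the assumption that the leading cache directory differs between machines while the trailing directory and file name do not.

```python
import hashlib
from pathlib import PurePosixPath


def order_fingerprint(paths, keep_parts=2):
    """Return a SHA-256 hex digest that depends on the order of `paths`.

    Only the last `keep_parts` components of each path are hashed, so a
    machine-specific prefix (for example a cache directory) is ignored.
    """
    digest = hashlib.sha256()
    for p in paths:
        tail = "/".join(PurePosixPath(p).parts[-keep_parts:])
        digest.update(tail.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent entries cannot merge
    return digest.hexdigest()
```

Running `print(order_fingerprint(data["image_file_path"]))` on each machine would then give one string per machine to compare; differing digests indicate a differing order.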

Expected behavior

The order of the images is the same on all machines.
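
One common source of machine-dependent ordering, and plausibly what the `iter_files` in the title runs into once a ZIP archive has been extracted to disk, is that `os.walk` yields directory entries in whatever order the filesystem returns them. The sketch below is not the `datasets` implementation; `iter_files_sorted` is a hypothetical helper that only illustrates the general technique of sorting at every directory level.

```python
import os


def iter_files_sorted(root):
    """Yield file paths under `root` in an order independent of the filesystem."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Sort in place: in top-down mode os.walk descends into
        # subdirectories in the order they appear in `dirnames`.
        dirnames.sort()
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

Sorting by name trades the on-disk order for a lexicographic one, which is the same on every machine as long as the extracted file names are the same.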

Environment info

On the EC2 instance:

- `datasets` version: 2.6.1
- Platform: Linux-4.14.291-218.527.amzn2.x86_64-x86_64-with-glibc2.2.5
- Python version: 3.7.10
- PyArrow version: 9.0.0
- Pandas version: 1.3.5
- Numpy version: not checked

On my local laptop:

- `datasets` version: 2.6.1
- Platform: Linux-5.15.0-50-generic-x86_64-with-glibc2.35
- Python version: 3.9.12
- PyArrow version: 7.0.0
- Pandas version: 1.3.5
- Numpy version: 1.23.1

On GitHub Actions:

- `datasets` version: 2.6.1
- Platform: Linux-5.15.0-1022-azure-x86_64-with-glibc2.2.5
- Python version: 3.8.14
- PyArrow version: 9.0.0
- Pandas version: 1.5.1
- Numpy version: 1.23.4

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

fxmarty commented, Oct 27, 2022 (1 reaction)

@albertvillanova Thanks for the fix!

lhoestq commented, Oct 21, 2022 (1 reaction)

This is still a bug, so I’d keep this one open if you don’t mind 😉
