Dataset sharding non-contiguous?
Describe the bug
I'm not sure if this is a bug; it is more likely normal behavior, but I wanted to double-check.
Is it normal that datasets.shard does not produce chunks that, when concatenated, reproduce the original ordering of the sharded dataset?
This might be related to this pull request (https://github.com/huggingface/datasets/pull/4466) but I have to admit I did not properly look into the changes made.
Steps to reproduce the bug
import os

# Import path as of datasets 2.3.x
from datasets.utils.py_utils import convert_file_size_to_int

# `dataset` is an already loaded datasets.Dataset
max_shard_size = convert_file_size_to_int('300MB')
dataset_nbytes = dataset.data.nbytes
num_shards = int(dataset_nbytes / max_shard_size) + 1
num_shards = max(num_shards, 1)
print(f"{num_shards=}")
for shard_index in range(num_shards):
    shard = dataset.shard(num_shards=num_shards, index=shard_index)
    shard.to_parquet(f"tokenized/tokenized-{shard_index:03d}.parquet")
os.listdir('tokenized/')
Expected results
I expected the shards to match the order of the data in the original dataset, i.e. dataset[10] being the same as shard_1[10], for example.
Actual results
Only the first element is the same, i.e. dataset[0] is the same as shard_1[0].
Environment info
- datasets version: 2.3.2
- Platform: Linux-4.15.0-176-generic-x86_64-with-glibc2.31
- Python version: 3.10.4
- PyArrow version: 8.0.0
- Pandas version: 1.4.2
Issue Analytics
- State:
- Created a year ago
- Comments:5 (5 by maintainers)
Top GitHub Comments
This project started as a fork of TFDS, and contiguous=False is the default behavior there.

Hi! You can pass contiguous=True to .shard() to get contiguous shards. More info on this and the default behavior can be found in the docs.
EDIT: Answered as you closed the thread 😄
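
To make the difference concrete, here is a minimal pure-Python sketch of which row indices each shard would hold in the two modes. It assumes that contiguous=False takes every num_shards-th row starting at the shard index, and that contiguous=True splits the rows into consecutive blocks, with the first shards taking one extra row when the split is uneven. The helper shard_indices is hypothetical and only illustrates the index pattern; it is not part of the datasets library.

```python
def shard_indices(n_rows, num_shards, index, contiguous=False):
    """Row indices held by shard `index` (illustrative helper, not a datasets API)."""
    if contiguous:
        # Consecutive blocks; the first `mod` shards get one extra row.
        div, mod = divmod(n_rows, num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Strided selection: every num_shards-th row, starting at `index`.
    return list(range(index, n_rows, num_shards))


n_rows, num_shards = 10, 3
strided = [shard_indices(n_rows, num_shards, i) for i in range(num_shards)]
blocks = [shard_indices(n_rows, num_shards, i, contiguous=True) for i in range(num_shards)]

print(strided)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
print(blocks)   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Under that assumption, only row 0 of the first strided shard lines up with the original dataset, which matches the behavior reported above. In the reproduction loop, the corresponding change would be dataset.shard(num_shards=num_shards, index=shard_index, contiguous=True).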