Dataset sharding non-contiguous?

Describe the bug

I’m not sure if this is a bug; it is more likely normal behavior, but I wanted to double-check. Is it normal that Dataset.shard does not produce chunks that, when concatenated, reproduce the original ordering of the sharded dataset?

This might be related to this pull request (https://github.com/huggingface/datasets/pull/4466) but I have to admit I did not properly look into the changes made.

Steps to reproduce the bug

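import os
from datasets.utils.py_utils import convert_file_size_to_int

# `dataset` is assumed to be an already loaded datasets.Dataset
max_shard_size = convert_file_size_to_int('300MB')
dataset_nbytes = dataset.data.nbytes
num_shards = int(dataset_nbytes / max_shard_size) + 1
num_shards = max(num_shards, 1)
print(f"{num_shards=}")
os.makedirs('tokenized', exist_ok=True)  # to_parquet does not create the directory
for shard_index in range(num_shards):
    # contiguous is left at its default here, which leads to the ordering described below
    shard = dataset.shard(num_shards=num_shards, index=shard_index)
    shard.to_parquet(f"tokenized/tokenized-{shard_index:03d}.parquet")
os.listdir('tokenized/')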

Expected results

I expected the shards to preserve the order of the original dataset, e.g. dataset[10] being the same as shard_1[10].

Actual results

Only the first element is the same, i.e. dataset[0] is the same as shard_1[0].

Environment info

datasets version: 2.3.2
Platform: Linux-4.15.0-176-generic-x86_64-with-glibc2.31
Python version: 3.10.4
PyArrow version: 8.0.0
Pandas version: 1.4.2

Issue Analytics

State:
Created a year ago
Comments: 5 (5 by maintainers)

Top GitHub Comments

mariosasko commented, Jun 30, 2022

This project started as a fork of TFDS, and contiguous=False is the default behavior there.

mariosasko commented, Jun 26, 2022

Hi! You can pass contiguous=True to .shard() to get contiguous shards. More info on this and the default behavior can be found in the docs.

EDIT: Answered as you closed the thread 😄
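
To see why only the first element lined up, here is a minimal pure-Python sketch of the two index-selection schemes. It does not call the datasets library; shard_indices and concatenated_order are hypothetical helper names, and the way the remainder is spread across contiguous shards is an assumption made for illustration. The strided rule (shard i takes every num_shards-th row starting at row i) follows the documented behavior for contiguous=False.

```python
def shard_indices(num_rows, num_shards, index, contiguous=False):
    """Return the row indices shard `index` would hold (illustrative model only)."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    if contiguous:
        # Consecutive blocks; assumption: the first `mod` shards get one extra row.
        div, mod = divmod(num_rows, num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Strided selection: every num_shards-th row, starting at row `index`.
    return list(range(index, num_rows, num_shards))


def concatenated_order(num_rows, num_shards, contiguous):
    """Row order obtained by concatenating shards 0..num_shards-1."""
    order = []
    for i in range(num_shards):
        order.extend(shard_indices(num_rows, num_shards, i, contiguous))
    return order


print(shard_indices(10, 3, 0))                      # [0, 3, 6, 9]
print(concatenated_order(10, 3, contiguous=False))  # [0, 3, 6, 9, 1, 4, 7, 2, 5, 8]
print(concatenated_order(10, 3, contiguous=True))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With strided selection, element k of the first shard is row k * num_shards of the original dataset, so only k = 0 matches. In the reproduction script above, passing contiguous=True in the dataset.shard call should write Parquet files that concatenate back to the original order.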
