Dataset sharding non-contiguous?

Describe the bug

I’m not sure if this is a bug; it is more likely normal behavior, but I wanted to double-check. Is it normal that datasets.shard does not produce chunks that, when concatenated, reproduce the original ordering of the sharded dataset?

This might be related to this pull request (https://github.com/huggingface/datasets/pull/4466) but I have to admit I did not properly look into the changes made.

Steps to reproduce the bug

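import os

from datasets.utils.py_utils import convert_file_size_to_int

# `dataset` is an already loaded datasets.Dataset (not shown in the original report).
max_shard_size = convert_file_size_to_int('300MB')
dataset_nbytes = dataset.data.nbytes
# Number of shards needed so that each stays under max_shard_size (at least one).
num_shards = int(dataset_nbytes / max_shard_size) + 1
num_shards = max(num_shards, 1)
print(f"{num_shards=}")
for shard_index in range(num_shards):
    # contiguous defaults to False here, so each shard takes every num_shards-th row.
    shard = dataset.shard(num_shards=num_shards, index=shard_index)
    shard.to_parquet(f"tokenized/tokenized-{shard_index:03d}.parquet")
os.listdir('tokenized/')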

Expected results

I expected the shards to preserve the order of the original dataset, i.e. dataset[10] being the same as shard_1[10], for example.

Actual results

Only the first element is the same, i.e. dataset[0] is the same as shard_1[0].

Environment info

  • datasets version: 2.3.2
  • Platform: Linux-4.15.0-176-generic-x86_64-with-glibc2.31
  • Python version: 3.10.4
  • PyArrow version: 8.0.0
  • Pandas version: 1.4.2

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
mariosasko commented, Jun 30, 2022

This project started as a fork of TFDS, and contiguous=False is the default behavior there.

1 reaction
mariosasko commented, Jun 26, 2022

Hi! You can pass contiguous=True to .shard() to get contiguous shards. More info on this and the default behavior can be found in the docs.

EDIT: Answered as you closed the thread 😄
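
To illustrate the difference between the two modes discussed above, here is a minimal pure-Python sketch of which row indices end up in each shard. It does not call the datasets library; the helper names strided_shard_indices and contiguous_shard_indices are hypothetical, and the index arithmetic is an assumption about how non-contiguous (every num_shards-th row) and contiguous (consecutive blocks) sharding behave, not the library's actual code.

```python
def strided_shard_indices(n_rows, num_shards, index):
    """Row indices for a non-contiguous shard: every num_shards-th row,
    starting at `index` (assumed to mirror contiguous=False)."""
    return list(range(index, n_rows, num_shards))


def contiguous_shard_indices(n_rows, num_shards, index):
    """Row indices for a contiguous shard: one consecutive block per shard,
    with the first (n_rows % num_shards) shards getting one extra row
    (assumed to mirror contiguous=True)."""
    div, mod = divmod(n_rows, num_shards)
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return list(range(start, end))


n_rows, num_shards = 10, 3
strided = [strided_shard_indices(n_rows, num_shards, i) for i in range(num_shards)]
contiguous = [contiguous_shard_indices(n_rows, num_shards, i) for i in range(num_shards)]

print(strided)     # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
print(contiguous)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Concatenating the shards restores the original order only in the contiguous case.
strided_concat = [i for shard in strided for i in shard]
contiguous_concat = [i for shard in contiguous for i in shard]
print(strided_concat == list(range(n_rows)))     # False
print(contiguous_concat == list(range(n_rows)))  # True
```

In the reproduction script, the corresponding change would be to call dataset.shard(num_shards=num_shards, index=shard_index, contiguous=True), so that the Parquet files, read back in index order, follow the original row order.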
