Features of IterableDataset set to None by remove column

Describe the bug

The remove_columns method of IterableDataset sets the dataset's features to None.

Steps to reproduce the bug

from datasets import load_dataset

# load LS in streaming mode
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

# check original features
print("Original features: ", dataset.features.keys())

# define features to remove: we KEEP audio and text
COLUMNS_TO_REMOVE = ['chapter_id', 'speaker_id', 'file', 'id']

dataset = dataset.remove_columns(COLUMNS_TO_REMOVE)

# check processed features, uh-oh!
print("Processed features: ", dataset.features)

# streaming the first audio sample still works
print("First sample:", next(iter(dataset)))

Print Output:

Original features:  dict_keys(['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'])
Processed features:  None
First sample: {'audio': {'path': '2277-149896-0000.flac', 'array': array([ 0.00186157,  0.0005188 ,  0.00024414, ..., -0.00097656,
       -0.00109863, -0.00146484]), 'sampling_rate': 16000}, 'text': "HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE"}
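
The output above suggests that only the metadata is lost: the streamed example itself contains exactly the kept columns. A minimal sketch of that check follows, using a hand-written stand-in dict rather than a real streamed example. The helper name `leaked_columns` is hypothetical and not part of the datasets library.

```python
from typing import Dict, Iterable, List


def leaked_columns(sample: Dict[str, object], removed: Iterable[str]) -> List[str]:
    """Return the removed column names that still appear in an example."""
    return sorted(set(removed) & sample.keys())


# Stand-in for the first streamed example (audio array omitted, text abbreviated).
first_sample = {
    "audio": {"path": "2277-149896-0000.flac", "sampling_rate": 16000},
    "text": "HE WAS IN A FEVERED STATE OF MIND",
}

COLUMNS_TO_REMOVE = ["chapter_id", "speaker_id", "file", "id"]

leaked = leaked_columns(first_sample, COLUMNS_TO_REMOVE)
print("Leaked columns:", leaked)                   # []
print("Remaining columns:", sorted(first_sample))  # ['audio', 'text']
```

An empty result means the column removal itself worked on the data, so the problem is confined to the features attribute.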

Expected behavior

The features should be those not removed by the remove_columns method, i.e. audio and text.
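
The expected behavior can be sketched with a plain dict standing in for the features mapping. This is only an illustration of the intended result, not the library's implementation. The helper `drop_columns` is hypothetical, and the type labels are placeholders rather than the real feature types.

```python
from typing import Dict, Iterable, Optional


def drop_columns(
    features: Optional[Dict[str, object]], columns: Iterable[str]
) -> Optional[Dict[str, object]]:
    """Return a copy of `features` without `columns`; unknown features stay None."""
    if features is None:
        return None
    to_drop = set(columns)
    # Keep the original column order and skip only the removed names.
    return {name: ftype for name, ftype in features.items() if name not in to_drop}


# Placeholder type labels; the column names come from the output above.
original = {
    "file": "placeholder",
    "audio": "placeholder",
    "text": "placeholder",
    "speaker_id": "placeholder",
    "chapter_id": "placeholder",
    "id": "placeholder",
}

remaining = drop_columns(original, ["chapter_id", "speaker_id", "file", "id"])
print("Expected features:", list(remaining))  # ['audio', 'text']
```

The point of the sketch is that known features should be filtered rather than discarded. None is the right result only when the features were unknown to begin with.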

Environment info

  • datasets version: 2.7.1
  • Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.15
  • PyArrow version: 9.0.0
  • Pandas version: 1.3.5

(Running on Google Colab for a blog post: https://colab.research.google.com/drive/1ySCQREPZEl4msLfxb79pYYOWjUZhkr9y#scrollTo=8pRDGiVmH2ml)

cc @polinaeterna @lhoestq

Issue Analytics

  • State: closed
  • Created: 10 months ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

alvarobartt commented, Nov 26, 2022 (1 reaction)

@sanchit-gandhi PR is ready and open for review at #5287, but there’s still one issue I may need @lhoestq’s input 🤗

sanchit-gandhi commented, Nov 25, 2022 (1 reaction)

Awesome - thank you so much for this PR @alvarobartt! Is much appreciated!
