Features of IterableDataset set to None by remove column

Describe the bug

Calling remove_columns on an IterableDataset sets the dataset's features to None, even though the remaining columns still stream correctly.

Steps to reproduce the bug

from datasets import load_dataset

# load LS in streaming mode
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

# check original features
print("Original features: ", dataset.features.keys())

# define features to remove: we KEEP audio and text
COLUMNS_TO_REMOVE = ['chapter_id', 'speaker_id', 'file', 'id']

dataset = dataset.remove_columns(COLUMNS_TO_REMOVE)

# check processed features, uh-oh!
print("Processed features: ", dataset.features)

# streaming the first audio sample still works
print("First sample:", next(iter(dataset)))

Print Output:

Original features:  dict_keys(['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'])
Processed features:  None
First sample: {'audio': {'path': '2277-149896-0000.flac', 'array': array([ 0.00186157,  0.0005188 ,  0.00024414, ..., -0.00097656,
       -0.00109863, -0.00146484]), 'sampling_rate': 16000}, 'text': "HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE"}

Expected behavior

The features should be the ones not removed by the remove_columns call, i.e. audio and text.
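
To make the expected behavior concrete, here is a minimal sketch of the filtering that remove_columns would need to apply to the features. It is not the library's implementation: expected_features is a hypothetical helper, and a plain dict with placeholder type strings stands in for dataset.features (which is dict-like in datasets), so the snippet runs without downloading anything.

```python
from typing import Any, Iterable, Mapping


def expected_features(features: Mapping[str, Any], columns_to_remove: Iterable[str]) -> dict:
    """Hypothetical helper: keep every feature whose column was not removed."""
    removed = set(columns_to_remove)
    missing = removed - set(features)
    if missing:
        # Fail loudly on typos instead of silently ignoring unknown columns.
        raise KeyError(f"columns not present in features: {sorted(missing)}")
    return {name: feature for name, feature in features.items() if name not in removed}


# Stand-in for dataset.features; the values are placeholders, not real feature types.
original = {
    "file": "string",
    "audio": "audio",
    "text": "string",
    "speaker_id": "int64",
    "chapter_id": "int64",
    "id": "string",
}

kept = expected_features(original, ["chapter_id", "speaker_id", "file", "id"])
print(list(kept))  # ['audio', 'text']
```

With the real dataset, the same dict comprehension over the features captured before the remove_columns call would give the mapping one would expect dataset.features to hold afterwards.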

Environment info

datasets version: 2.7.1
Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.7.15
PyArrow version: 9.0.0
Pandas version: 1.3.5

(Running on Google Colab for a blog post: https://colab.research.google.com/drive/1ySCQREPZEl4msLfxb79pYYOWjUZhkr9y#scrollTo=8pRDGiVmH2ml)

cc @polinaeterna @lhoestq

Issue Analytics

State:
Created 10 months ago
Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction

alvarobartt commented, Nov 26, 2022

@sanchit-gandhi PR is ready and open for review at #5287, but there’s still one issue I may need @lhoestq’s input 🤗

1 reaction

sanchit-gandhi commented, Nov 25, 2022

Awesome - thank you so much for this PR @alvarobartt! Is much appreciated!
