Features of IterableDataset set to None by remove column
Describe the bug
The remove_columns method of IterableDataset sets the dataset features to None.
Steps to reproduce the bug
from datasets import load_dataset
# load LS in streaming mode
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
# check original features
print("Original features: ", dataset.features.keys())
# define features to remove: we KEEP audio and text
COLUMNS_TO_REMOVE = ['chapter_id', 'speaker_id', 'file', 'id']
dataset = dataset.remove_columns(COLUMNS_TO_REMOVE)
# check processed features, uh-oh!
print("Processed features: ", dataset.features)
# streaming the first audio sample still works
print("First sample:", next(iter(dataset)))
Print Output:
Original features: dict_keys(['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'])
Processed features: None
First sample: {'audio': {'path': '2277-149896-0000.flac', 'array': array([ 0.00186157, 0.0005188 , 0.00024414, ..., -0.00097656,
-0.00109863, -0.00146484]), 'sampling_rate': 16000}, 'text': "HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE"}
Expected behavior
The features should be those not removed by the remove_columns method, i.e. audio and text.
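The expected result can be sketched without touching the library internals. In the snippet below, `expected_features` is a hypothetical helper (not part of the `datasets` API), and a plain dict with placeholder type names stands in for the dataset's features mapping. It only illustrates the behavior asked for here: drop the removed columns and keep everything else.

```python
from typing import Any, Dict, Iterable, Optional


def expected_features(
    features: Optional[Dict[str, Any]], columns_to_remove: Iterable[str]
) -> Optional[Dict[str, Any]]:
    """Return the features that should survive a column removal.

    Hypothetical helper for illustration only; not part of the datasets API.
    """
    if features is None:
        # Unknown features stay unknown; there is nothing to filter.
        return None
    removed = set(columns_to_remove)
    # Keep the original column order, skipping the removed names.
    return {name: dtype for name, dtype in features.items() if name not in removed}


# Plain-dict stand-in for the original features (type names are placeholders).
original = {
    "file": "string",
    "audio": "Audio",
    "text": "string",
    "speaker_id": "int64",
    "chapter_id": "int64",
    "id": "string",
}
kept = expected_features(original, ["chapter_id", "speaker_id", "file", "id"])
print(list(kept))
```

With the stand-in mapping above, the remaining keys are audio and text, which is what dataset.features would be expected to report after the removal.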
Environment info
- datasets version: 2.7.1
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.15
- PyArrow version: 9.0.0
- Pandas version: 1.3.5
(Running on Google Colab for a blog post: https://colab.research.google.com/drive/1ySCQREPZEl4msLfxb79pYYOWjUZhkr9y#scrollTo=8pRDGiVmH2ml)
Issue Analytics
- State:
- Created 10 months ago
- Comments:8 (8 by maintainers)
Top GitHub Comments
@sanchit-gandhi PR is ready and open for review at #5287, but there’s still one issue I may need @lhoestq’s input 🤗
Awesome - thank you so much for this PR @alvarobartt! It is much appreciated!