Second concatenation of datasets produces errors

Hi,

I need to concatenate my dataset with others several times. After I concatenate it for the second time, the nested feature definitions (the "features of features", e.g. tag names) are collapsed: the ClassLabel columns become plain int64 values. This prevents, for instance, using a tokenize function with data.map.

from datasets import load_dataset, concatenate_datasets

data = load_dataset('trec')['train']
concatenated = concatenate_datasets([data, data])
concatenated_2 = concatenate_datasets([concatenated, concatenated])
print('True features of features:', concatenated.features)
print('\nProduced features of features:', concatenated_2.features)

Output:

True features of features: {'label-coarse': ClassLabel(num_classes=6, names=['DESC', 'ENTY', 'ABBR', 'HUM', 'NUM', 'LOC'], names_file=None, id=None), 'label-fine': ClassLabel(num_classes=47, names=['manner', 'cremat', 'animal', 'exp', 'ind', 'gr', 'title', 'def', 'date', 'reason', 'event', 'state', 'desc', 'count', 'other', 'letter', 'religion', 'food', 'country', 'color', 'termeq', 'city', 'body', 'dismed', 'mount', 'money', 'product', 'period', 'substance', 'sport', 'plant', 'techmeth', 'volsize', 'instru', 'abb', 'speed', 'word', 'lang', 'perc', 'code', 'dist', 'temp', 'symbol', 'ord', 'veh', 'weight', 'currency'], names_file=None, id=None), 'text': Value(dtype='string', id=None)}

Produced features of features: {'label-coarse': Value(dtype='int64', id=None), 'label-fine': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None)}

I am using datasets v1.11.0.
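
The collapse reported above can be detected programmatically by comparing the feature mapping before and after concatenation. The following is a minimal sketch, not part of the original report: `changed_columns` is a hypothetical helper name, and the demo uses plain strings as stand-ins for the real feature objects so that it runs without the datasets library or a network connection.

```python
from typing import Any, List, Mapping


def changed_columns(expected: Mapping[str, Any], actual: Mapping[str, Any]) -> List[str]:
    """Return the sorted column names whose feature definition is missing or different."""
    return sorted(
        name
        for name in expected
        if name not in actual or actual[name] != expected[name]
    )


# Stand-in strings (an assumption for illustration) instead of real
# ClassLabel / Value objects from the datasets library.
expected = {
    "label-coarse": "ClassLabel(num_classes=6)",
    "text": "Value(dtype='string')",
}
actual = {
    "label-coarse": "Value(dtype='int64')",
    "text": "Value(dtype='string')",
}

print(changed_columns(expected, actual))  # ['label-coarse']
```

Since dataset.features behaves like a dictionary, the same helper should accept concatenated.features and concatenated_2.features from the snippet above. One could then try restoring the lost types with concatenated_2.cast(concatenated.features), although whether that cast succeeds on v1.11.0 is not confirmed here.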

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
albertvillanova commented, Aug 4, 2021

Hi @Aktsvigun, thanks for reporting.

I’m investigating this.

1 reaction
Aktsvigun commented, Aug 4, 2021