
DatasetDict save/load: test fails in 1.6 but not in 1.5

See original GitHub issue

Describe the bug

We have a test that saves a DatasetDict to disk and then loads it back from disk. On 1.6 the load step fails because the saved features no longer match the schema inferred from the Arrow table.

Downgrading to a version below 1.6 (i.e. back to 1.5.x) fixes the problem.
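
For anyone pinning as a stopgap until a fixed release is out, the usual form (standard pip usage, not a command from the thread) is:

pip install "datasets<1.6"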

Steps to reproduce the bug


from datasets import DatasetDict, load_dataset

# Load a dataset dict from jsonl (the real data files are project-specific;
# "data.jsonl" below is a hypothetical stand-in)
ds_dict = load_dataset("json", data_files="data.jsonl")

path = '/test/foo'
ds_dict.save_to_disk(path)
ds_from_disk = DatasetDict.load_from_disk(path)  # <-- this is where the error appears on 1.6

Expected results

Upgrading to 1.6 shouldn’t break that test. We should be able to serialize to and from disk.

Actual results

        # Infer features if None
        inferred_features = Features.from_arrow_schema(arrow_table.schema)
        if self.info.features is None:
            self.info.features = inferred_features
    
        # Infer fingerprint if None
    
        if self._fingerprint is None:
            self._fingerprint = generate_fingerprint(self)
    
        # Sanity checks
    
        assert self.features is not None, "Features can't be None in a Dataset object"
        assert self._fingerprint is not None, "Fingerprint can't be None in a Dataset object"
        if self.info.features.type != inferred_features.type:
>           raise ValueError(
                "External features info don't match the dataset:\nGot\n{}\nwith type\n{}\n\nbut expected something like\n{}\nwith type\n{}".format(
                    self.info.features, self.info.features.type, inferred_features, inferred_features.type
                )
            )
E           ValueError: External features info don't match the dataset:
E           Got
E           {'_input_hash': Value(dtype='int64', id=None), '_task_hash': Value(dtype='int64', id=None), '_view_id': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None), 'encoding__ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'encoding__offsets': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), 'encoding__overflowing': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None), 'encoding__tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'encoding__words': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'relations': [{'child': Value(dtype='int64', id=None), 'child_span': {'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None)}, 'color': Value(dtype='string', id=None), 'head': Value(dtype='int64', id=None), 'head_span': {'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None)}, 'label': Value(dtype='string', id=None)}], 'spans': [{'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'type': Value(dtype='string', id=None)}], 'text': Value(dtype='string', id=None), 'tokens': [{'disabled': Value(dtype='bool', id=None), 'end': Value(dtype='int64', id=None), 'id': Value(dtype='int64', id=None), 'start': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None), 'ws': Value(dtype='bool', id=None)}]}
E           with type
E           struct<_input_hash: int64, _task_hash: int64, _view_id: string, answer: string, encoding__ids: list<item: int64>, encoding__offsets: list<item: list<item: int64>>, encoding__overflowing: list<item: null>, encoding__tokens: list<item: string>, encoding__words: list<item: int64>, ner_ids: list<item: int64>, ner_labels: list<item: string>, relations: list<item: struct<child: int64, child_span: struct<end: int64, label: string, start: int64, token_end: int64, token_start: int64>, color: string, head: int64, head_span: struct<end: int64, label: string, start: int64, token_end: int64, token_start: int64>, label: string>>, spans: list<item: struct<end: int64, label: string, start: int64, text: string, token_end: int64, token_start: int64, type: string>>, text: string, tokens: list<item: struct<disabled: bool, end: int64, id: int64, start: int64, text: string, ws: bool>>>
E           
E           but expected something like
E           {'_input_hash': Value(dtype='int64', id=None), '_task_hash': Value(dtype='int64', id=None), '_view_id': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None), 'encoding__ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'encoding__offsets': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), 'encoding__overflowing': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None), 'encoding__tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'encoding__words': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'relations': [{'head': Value(dtype='int64', id=None), 'child': Value(dtype='int64', id=None), 'head_span': {'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None)}, 'child_span': {'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None)}, 'color': Value(dtype='string', id=None), 'label': Value(dtype='string', id=None)}], 'spans': [{'text': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'type': Value(dtype='string', id=None), 'label': Value(dtype='string', id=None)}], 'text': Value(dtype='string', id=None), 'tokens': [{'text': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'id': Value(dtype='int64', id=None), 'ws': Value(dtype='bool', id=None), 'disabled': Value(dtype='bool', id=None)}]}
E           with type
E           struct<_input_hash: int64, _task_hash: int64, _view_id: string, answer: string, encoding__ids: list<item: int64>, encoding__offsets: list<item: list<item: int64>>, encoding__overflowing: list<item: null>, encoding__tokens: list<item: string>, encoding__words: list<item: int64>, ner_ids: list<item: int64>, ner_labels: list<item: string>, relations: list<item: struct<head: int64, child: int64, head_span: struct<start: int64, end: int64, token_start: int64, token_end: int64, label: string>, child_span: struct<start: int64, end: int64, token_start: int64, token_end: int64, label: string>, color: string, label: string>>, spans: list<item: struct<text: string, start: int64, token_start: int64, token_end: int64, end: int64, type: string, label: string>>, text: string, tokens: list<item: struct<text: string, start: int64, end: int64, id: int64, ws: bool, disabled: bool>>>

../../../../../.virtualenvs/tf_ner_rel_lib/lib/python3.8/site-packages/datasets/arrow_dataset.py:274: ValueError
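
Reading the two feature dumps closely, the field names and types are identical; only the ordering of the keys inside the nested structs (relations, spans, tokens and their *_span children) differs between the saved features and the ones inferred from the Arrow schema. Arrow treats struct types with the same fields in a different order as unequal, which is exactly what the check above trips on. A minimal sketch of this behavior with plain pyarrow, independent of datasets:

import pyarrow as pa

# Two struct types with identical fields but different field order
a = pa.struct([("start", pa.int64()), ("end", pa.int64())])
b = pa.struct([("end", pa.int64()), ("start", pa.int64())])

print(a == b)  # False: struct type equality in Arrow is order-sensitive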

Versions

  • Datasets: 1.6.1
  • Python: 3.8.5 (default, Jan 26 2021, 10:01:04) [Clang 12.0.0 (clang-1200.0.32.2)]
  • Platform: macOS-10.15.7-x86_64-i386-64bit

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

3 reactions
lhoestq commented, May 28, 2021

I just pushed a fix on master. We’ll do a new release soon!

Thanks for reporting
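
Until the new release lands, the fix on master can be picked up by installing datasets from source; the standard form for that (general pip usage, not a command from the thread) is:

pip install git+https://github.com/huggingface/datasets.git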

0 reactions
lhoestq commented, May 28, 2021

It looks like this issue comes from the order of the fields in the ‘idx’ struct, which differs for some reason. I’m looking into it. Note that as a workaround you can also flatten the nested features with ds = ds.flatten()
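
Applied to the reproduction above, the suggested workaround looks roughly like this (a sketch; flatten() exists on both Dataset and DatasetDict, and ds_dict/path are the names from the snippet above):

# Flatten nested struct columns into top-level columns before saving,
# so field ordering inside structs can no longer mismatch on reload.
ds_dict = ds_dict.flatten()
ds_dict.save_to_disk(path)
ds_from_disk = DatasetDict.load_from_disk(path)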
