DatasetDict save load Failing test in 1.6 not in 1.5

Describe the bug

We have a test that saves a DatasetDict to disk and then loads it from disk. In 1.6 there is an incompatibility in the schema.

Downgrading to a version below 1.6 (for example 1.5) fixes the problem.

Steps to reproduce the bug


from datasets import DatasetDict

# Load a dataset dict from jsonl into ds_dict (loading code omitted in the report)

path = '/test/foo'

ds_dict.save_to_disk(path)

ds_from_disk = DatasetDict.load_from_disk(path)  # <-- this is where I see the error on 1.6

Expected results

Upgrading to 1.6 shouldn’t break that test. We should be able to serialize to and from disk.

Actual results

        # Infer features if None
        inferred_features = Features.from_arrow_schema(arrow_table.schema)
        if self.info.features is None:
            self.info.features = inferred_features
    
        # Infer fingerprint if None
    
        if self._fingerprint is None:
            self._fingerprint = generate_fingerprint(self)
    
        # Sanity checks
    
        assert self.features is not None, "Features can't be None in a Dataset object"
        assert self._fingerprint is not None, "Fingerprint can't be None in a Dataset object"
        if self.info.features.type != inferred_features.type:
>           raise ValueError(
                "External features info don't match the dataset:\nGot\n{}\nwith type\n{}\n\nbut expected something like\n{}\nwith type\n{}".format(
                    self.info.features, self.info.features.type, inferred_features, inferred_features.type
                )
            )
E           ValueError: External features info don't match the dataset:
E           Got
E           {'_input_hash': Value(dtype='int64', id=None), '_task_hash': Value(dtype='int64', id=None), '_view_id': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None), 'encoding__ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'encoding__offsets': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), 'encoding__overflowing': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None), 'encoding__tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'encoding__words': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'relations': [{'child': Value(dtype='int64', id=None), 'child_span': {'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None)}, 'color': Value(dtype='string', id=None), 'head': Value(dtype='int64', id=None), 'head_span': {'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None)}, 'label': Value(dtype='string', id=None)}], 'spans': [{'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'type': Value(dtype='string', id=None)}], 'text': Value(dtype='string', id=None), 'tokens': [{'disabled': Value(dtype='bool', id=None), 'end': Value(dtype='int64', id=None), 'id': Value(dtype='int64', id=None), 'start': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None), 'ws': Value(dtype='bool', id=None)}]}
E           with type
E           struct<_input_hash: int64, _task_hash: int64, _view_id: string, answer: string, encoding__ids: list<item: int64>, encoding__offsets: list<item: list<item: int64>>, encoding__overflowing: list<item: null>, encoding__tokens: list<item: string>, encoding__words: list<item: int64>, ner_ids: list<item: int64>, ner_labels: list<item: string>, relations: list<item: struct<child: int64, child_span: struct<end: int64, label: string, start: int64, token_end: int64, token_start: int64>, color: string, head: int64, head_span: struct<end: int64, label: string, start: int64, token_end: int64, token_start: int64>, label: string>>, spans: list<item: struct<end: int64, label: string, start: int64, text: string, token_end: int64, token_start: int64, type: string>>, text: string, tokens: list<item: struct<disabled: bool, end: int64, id: int64, start: int64, text: string, ws: bool>>>
E           
E           but expected something like
E           {'_input_hash': Value(dtype='int64', id=None), '_task_hash': Value(dtype='int64', id=None), '_view_id': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None), 'encoding__ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'encoding__offsets': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), 'encoding__overflowing': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None), 'encoding__tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'encoding__words': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'relations': [{'head': Value(dtype='int64', id=None), 'child': Value(dtype='int64', id=None), 'head_span': {'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None)}, 'child_span': {'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None)}, 'color': Value(dtype='string', id=None), 'label': Value(dtype='string', id=None)}], 'spans': [{'text': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'type': Value(dtype='string', id=None), 'label': Value(dtype='string', id=None)}], 'text': Value(dtype='string', id=None), 'tokens': [{'text': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'id': Value(dtype='int64', id=None), 'ws': Value(dtype='bool', id=None), 'disabled': Value(dtype='bool', id=None)}]}
E           with type
E           struct<_input_hash: int64, _task_hash: int64, _view_id: string, answer: string, encoding__ids: list<item: int64>, encoding__offsets: list<item: list<item: int64>>, encoding__overflowing: list<item: null>, encoding__tokens: list<item: string>, encoding__words: list<item: int64>, ner_ids: list<item: int64>, ner_labels: list<item: string>, relations: list<item: struct<head: int64, child: int64, head_span: struct<start: int64, end: int64, token_start: int64, token_end: int64, label: string>, child_span: struct<start: int64, end: int64, token_start: int64, token_end: int64, label: string>, color: string, label: string>>, spans: list<item: struct<text: string, start: int64, token_start: int64, token_end: int64, end: int64, type: string, label: string>>, text: string, tokens: list<item: struct<text: string, start: int64, end: int64, id: int64, ws: bool, disabled: bool>>>

../../../../../.virtualenvs/tf_ner_rel_lib/lib/python3.8/site-packages/datasets/arrow_dataset.py:274: ValueError

Versions

Datasets: 1.6.1
Python: 3.8.5 (default, Jan 26 2021, 10:01:04) [Clang 12.0.0 (clang-1200.0.32.2)]
Platform: macOS-10.15.7-x86_64-i386-64bit

Top GitHub Comments

3 reactions

lhoestq commented, May 28, 2021

I just pushed a fix on master. We’ll do a new release soon!

Thanks for reporting

0 reactions

lhoestq commented, May 28, 2021

It looks like this issue comes from the order of the fields in the ‘idx’ struct, which differs for some reason. I’m looking into it. Note that as a workaround you can also flatten the nested features with ds = ds.flatten().
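The field-order explanation can be checked against the traceback in the report. The sketch below uses plain Python only, with the field lists copied from the `tokens` struct in the "Got" and "expected" types. The `struct_type` helper is hypothetical and only imitates how the traceback prints a struct type; it is not the comparison that `datasets` or PyArrow actually runs.

```python
# Field lists copied from the `tokens` struct in the traceback.
got_tokens = [
    ("disabled", "bool"), ("end", "int64"), ("id", "int64"),
    ("start", "int64"), ("text", "string"), ("ws", "bool"),
]
expected_tokens = [
    ("text", "string"), ("start", "int64"), ("end", "int64"),
    ("id", "int64"), ("ws", "bool"), ("disabled", "bool"),
]


def struct_type(fields):
    """Hypothetical helper: render fields the way the traceback prints a struct type."""
    return "struct<" + ", ".join(f"{name}: {dtype}" for name, dtype in fields) + ">"


# Order-sensitive comparison (an assumption standing in for the type check
# in the traceback): the rendered types differ.
order_sensitive_match = struct_type(got_tokens) == struct_type(expected_tokens)

# Order-insensitive comparison: same field names with the same dtypes.
same_fields = dict(got_tokens) == dict(expected_tokens)

# The "Got" side lists the same field names in alphabetical order.
got_is_sorted = [name for name, _ in got_tokens] == sorted(
    name for name, _ in expected_tokens
)

print(order_sensitive_match, same_fields, got_is_sorted)
```

In this sketch the two structs hold identical fields and differ only in their order, which is consistent with the maintainer's diagnosis.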
