DatasetDict save/load: test fails in 1.6 but not in 1.5
Describe the bug
We have a test that saves a DatasetDict to disk and then loads it back from disk. In 1.6 the load fails because of a schema incompatibility.
Downgrading to a version below 1.6 (<1.6) fixes the problem.
Steps to reproduce the bug
from datasets import DatasetDict

# Load a dataset dict from jsonl into `ds_dict` (not shown)
path = '/test/foo'
ds_dict.save_to_disk(path)
ds_from_disk = DatasetDict.load_from_disk(path)  # <-- this is where I see the error on 1.6
Expected results
Upgrading to 1.6 shouldn’t break that test. We should be able to serialize to and from disk.
Actual results
        # Infer features if None
        inferred_features = Features.from_arrow_schema(arrow_table.schema)
        if self.info.features is None:
            self.info.features = inferred_features
        # Infer fingerprint if None
        if self._fingerprint is None:
            self._fingerprint = generate_fingerprint(self)
        # Sanity checks
        assert self.features is not None, "Features can't be None in a Dataset object"
        assert self._fingerprint is not None, "Fingerprint can't be None in a Dataset object"
        if self.info.features.type != inferred_features.type:
>           raise ValueError(
                "External features info don't match the dataset:\nGot\n{}\nwith type\n{}\n\nbut expected something like\n{}\nwith type\n{}".format(
                    self.info.features, self.info.features.type, inferred_features, inferred_features.type
                )
            )
E ValueError: External features info don't match the dataset:
E Got
E {'_input_hash': Value(dtype='int64', id=None), '_task_hash': Value(dtype='int64', id=None), '_view_id': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None), 'encoding__ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'encoding__offsets': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), 'encoding__overflowing': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None), 'encoding__tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'encoding__words': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'relations': [{'child': Value(dtype='int64', id=None), 'child_span': {'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None)}, 'color': Value(dtype='string', id=None), 'head': Value(dtype='int64', id=None), 'head_span': {'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None)}, 'label': Value(dtype='string', id=None)}], 'spans': [{'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'type': Value(dtype='string', id=None)}], 'text': Value(dtype='string', id=None), 'tokens': [{'disabled': Value(dtype='bool', id=None), 'end': Value(dtype='int64', id=None), 'id': Value(dtype='int64', id=None), 'start': Value(dtype='int64', id=None), 'text': Value(dtype='string', 
id=None), 'ws': Value(dtype='bool', id=None)}]}
E with type
E struct<_input_hash: int64, _task_hash: int64, _view_id: string, answer: string, encoding__ids: list<item: int64>, encoding__offsets: list<item: list<item: int64>>, encoding__overflowing: list<item: null>, encoding__tokens: list<item: string>, encoding__words: list<item: int64>, ner_ids: list<item: int64>, ner_labels: list<item: string>, relations: list<item: struct<child: int64, child_span: struct<end: int64, label: string, start: int64, token_end: int64, token_start: int64>, color: string, head: int64, head_span: struct<end: int64, label: string, start: int64, token_end: int64, token_start: int64>, label: string>>, spans: list<item: struct<end: int64, label: string, start: int64, text: string, token_end: int64, token_start: int64, type: string>>, text: string, tokens: list<item: struct<disabled: bool, end: int64, id: int64, start: int64, text: string, ws: bool>>>
E
E but expected something like
E {'_input_hash': Value(dtype='int64', id=None), '_task_hash': Value(dtype='int64', id=None), '_view_id': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None), 'encoding__ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'encoding__offsets': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), 'encoding__overflowing': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None), 'encoding__tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'encoding__words': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'relations': [{'head': Value(dtype='int64', id=None), 'child': Value(dtype='int64', id=None), 'head_span': {'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None)}, 'child_span': {'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None)}, 'color': Value(dtype='string', id=None), 'label': Value(dtype='string', id=None)}], 'spans': [{'text': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'type': Value(dtype='string', id=None), 'label': Value(dtype='string', id=None)}], 'text': Value(dtype='string', id=None), 'tokens': [{'text': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'id': Value(dtype='int64', id=None), 'ws': Value(dtype='bool', 
id=None), 'disabled': Value(dtype='bool', id=None)}]}
E with type
E struct<_input_hash: int64, _task_hash: int64, _view_id: string, answer: string, encoding__ids: list<item: int64>, encoding__offsets: list<item: list<item: int64>>, encoding__overflowing: list<item: null>, encoding__tokens: list<item: string>, encoding__words: list<item: int64>, ner_ids: list<item: int64>, ner_labels: list<item: string>, relations: list<item: struct<head: int64, child: int64, head_span: struct<start: int64, end: int64, token_start: int64, token_end: int64, label: string>, child_span: struct<start: int64, end: int64, token_start: int64, token_end: int64, label: string>, color: string, label: string>>, spans: list<item: struct<text: string, start: int64, token_start: int64, token_end: int64, end: int64, type: string, label: string>>, text: string, tokens: list<item: struct<text: string, start: int64, end: int64, id: int64, ws: bool, disabled: bool>>>
../../../../../.virtualenvs/tf_ner_rel_lib/lib/python3.8/site-packages/datasets/arrow_dataset.py:274: ValueError
Versions
- Datasets: 1.6.1
- Python: 3.8.5 (default, Jan 26 2021, 10:01:04) [Clang 12.0.0 (clang-1200.0.32.2)]
- Platform: macOS-10.15.7-x86_64-i386-64bit
I just pushed a fix on master. We’ll do a new release soon! Thanks for reporting.
It looks like this issue comes from the order of the fields in the ‘idx’ struct, which differs for some reason. I’m looking into it. Note that as a workaround you can also flatten the nested features with:
ds = ds.flatten()
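
To check the field-order explanation against the traceback above, here is a minimal sketch in plain Python. It is not the datasets or pyarrow API: structs are modelled as ordered lists of (name, type) pairs, and `canonical` is a hypothetical helper written for this illustration. The two `tokens` orderings are copied from the "Got" and "expected something like" parts of the error; the nested wrapper around them is made up to exercise the recursion.

```python
# Illustrative stand-in, NOT the datasets/pyarrow API: a struct is modelled
# as an ordered list of (field_name, field_type) pairs, where field_type is
# a type name such as "int64" or another list of pairs for a nested struct.

def canonical(fields):
    """Return the struct description with fields sorted by name at every level."""
    if isinstance(fields, str):
        return fields
    return sorted(
        ((name, canonical(ftype)) for name, ftype in fields),
        key=lambda pair: pair[0],
    )

# Field order of the 'tokens' struct as it appears in the "Got" part of the traceback.
tokens_from_info = [
    ("disabled", "bool"), ("end", "int64"), ("id", "int64"),
    ("start", "int64"), ("text", "string"), ("ws", "bool"),
]
# Field order of the same struct in the "expected something like" part.
tokens_inferred = [
    ("text", "string"), ("start", "int64"), ("end", "int64"),
    ("id", "int64"), ("ws", "bool"), ("disabled", "bool"),
]

# Hypothetical wrapper structs (made up for this sketch) to exercise the recursion.
nested_from_info = [("tokens", tokens_from_info), ("text", "string")]
nested_inferred = [("text", "string"), ("tokens", tokens_inferred)]

# An order-sensitive comparison, like the type check in the traceback, reports a mismatch.
order_sensitive_match = tokens_from_info == tokens_inferred
# After sorting fields by name, the two descriptions are identical.
order_insensitive_match = canonical(tokens_from_info) == canonical(tokens_inferred)
nested_match = canonical(nested_from_info) == canonical(nested_inferred)

print(order_sensitive_match, order_insensitive_match, nested_match)
# prints: False True True
```

If the same holds for the `relations` and `spans` structs, the saved features and the inferred features contain the same fields and types and differ only in ordering, which is consistent with the maintainer's diagnosis.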