Hash is not recalculated when writing data package metadata

Overview

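When you change a resource’s hashing algorithm and then write the data package metadata with to_yaml or to_json, the resource hash is not recalculated. The written metadata pairs the new algorithm name with the old hash, so validation later fails with a checksum error.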

How to reproduce

In a clean environment, install Frictionless 3.14.0:

(env) $ pip install ipython frictionless==3.14.0

Open a Python (or IPython) terminal, then:

In [1]: from frictionless import describe_package, validate

In [2]: from pprint import pprint

In [3]: csv = 'a,b\n0,1'

In [4]: with open('test.csv', 'w') as f:
   ...:     f.write(csv)
   ...: 

In [5]: package = describe_package('test.csv')

In [6]: resource = package.get_resource('test')

In [7]: resource.hashing = 'sha256'

In [8]: package.to_json('test.json')

In [9]: report = validate('test.json', source_type='package')

In [10]: pprint(report)
{'errors': [],
 'stats': {'errors': 1, 'tables': 1},
 'tables': [{'compression': 'no',
             'compressionPath': '',
             'dialect': {},
             'encoding': 'utf-8',
             'errors': [{'code': 'checksum-error',
                         'description': 'This error can happen if the data is '
                                        'corrupted.',
                         'message': 'The data source does not match the '
                                    'expected checksum: expected hash in '
                                    'sha256 is '
                                    '"a316a7ac2a0f3a69719cb532b31a6788" and '
                                    'actual is '
                                    '"14d6e4164bb209ee74f10b8182da85f913a636c233690ebd80cc8aa4cbc53491"',
                         'name': 'Checksum Error',
                         'note': 'expected hash in sha256 is '
                                 '"a316a7ac2a0f3a69719cb532b31a6788" and '
                                 'actual is '
                                 '"14d6e4164bb209ee74f10b8182da85f913a636c233690ebd80cc8aa4cbc53491"',
                         'tags': ['#table', '#checksum']}],
             'format': 'csv',
             'hashing': 'sha256',
             'header': ['a', 'b'],
             'partial': False,
             'path': 'test.csv',
             'query': {},
             'schema': {'fields': [{'name': 'a', 'type': 'integer'},
                                   {'name': 'b', 'type': 'integer'}]},
             'scheme': 'file',
             'scope': ['dialect-error',
                       'schema-error',
                       'field-error',
                       'extra-header',
                       'missing-header',
                       'blank-header',
                       'duplicate-header',
                       'non-matching-header',
                       'extra-cell',
                       'missing-cell',
                       'blank-row',
                       'type-error',
                       'constraint-error',
                       'unique-error',
                       'primary-key-error',
                       'foreign-key-error',
                       'checksum-error'],
             'stats': {'bytes': 7,
                       'errors': 1,
                       'fields': 2,
                       'hash': '14d6e4164bb209ee74f10b8182da85f913a636c233690ebd80cc8aa4cbc53491',
                       'rows': 1},
             'time': 0.005,
             'valid': False}],
 'time': 0.019,
 'valid': False,
 'version': '3.14.0'}

Expected behavior

I believe I should be able to choose the hashing type when describing a data package. Changing the hashing type should mean the hash gets recalculated with the new hashing algorithm.

The generated data package should then pass validation.

Actual behavior

We get a validation error because of the differing checksum.
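
The two hashes in the report differ in length (32 versus 64 hexadecimal characters), which suggests the expected value is still the default MD5 computed when the package was described, while validation computes SHA-256. The following is a minimal sketch using only Python's standard hashlib, not Frictionless itself. It assumes test.csv contains exactly the 7 bytes written in the steps above (no newline translation), and the two report values are copied from the output shown earlier.

```python
import hashlib

# Assumption: the exact bytes of test.csv from the reproduction steps.
data = b"a,b\n0,1"

md5_digest = hashlib.md5(data).hexdigest()
sha256_digest = hashlib.sha256(data).hexdigest()

# Values copied from the validation report above.
expected_in_report = "a316a7ac2a0f3a69719cb532b31a6788"
actual_in_report = "14d6e4164bb209ee74f10b8182da85f913a636c233690ebd80cc8aa4cbc53491"

# MD5 digests are 32 hex characters long, SHA-256 digests are 64.
print("digest lengths:", len(md5_digest), len(sha256_digest))
print("expected has MD5 length:", len(expected_in_report) == len(md5_digest))

# Compare the locally computed digests with the ones in the report.
print("md5 equals expected:", md5_digest == expected_in_report)
print("sha256 equals actual:", sha256_digest == actual_in_report)
```

If both comparisons print True on your machine, the stored hash was never recomputed after resource.hashing was changed.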


Please preserve this line to notify @roll (lead of this repository)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
augusto-herrmann commented, Oct 7, 2020

Awesome! Looking forward to trying it out soon!

0 reactions
roll commented, Oct 7, 2020

Hi @augusto-herrmann,

“You might also want to include a parameter in describe_package to allow the user to choose the hashing algorithm, instead of calculating the default MD5 and then changing it later. Especially considering that, as you mentioned, the resources might be huge, it should only be calculated once.”

It’s a great idea. I’m releasing frictionless@3.18 with this argument available for describe/describe_package (it was implemented only for describe_resource).

“Also, won’t resource.infer() reset the whole schema of the resource?”

No, it will not change existing properties, except for recalculating resource.stats.
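
The point about hashing a huge resource only once can be illustrated with a small standard-library sketch. This is not how Frictionless implements it; file_hash and its parameters are hypothetical names chosen for illustration. The idea is to pick the algorithm up front and read the file in a single streaming pass, so memory use stays bounded regardless of file size.

```python
import hashlib
import os
import tempfile


def file_hash(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in one streaming pass with the chosen algorithm.

    Hypothetical helper for illustration; not part of Frictionless.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demo on a temporary copy of the CSV from the issue.
data = b"a,b\n0,1"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "test.csv")
    with open(demo_path, "wb") as f:
        f.write(data)
    sha256_hex = file_hash(demo_path, "sha256")

print(sha256_hex)
```

Choosing the algorithm before the first read avoids computing an MD5 that is later discarded, which is the cost the suggestion above aims to remove.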
