Incorrect detection of encoding gives out a source-error

We have a CSV file that is encoded in UTF-8 but contains only ASCII characters until partway through the file. Goodtables is probably misinterpreting it as an ASCII file, because it reports a source-error:

$ goodtables --infer-schema data/nz.csv 
DATASET
=======
{'error-count': 1,
 'preset': 'nested',
 'table-count': 1,
 'time': 0.003,
 'valid': False}

TABLE [1]
=========
{'error-count': 1,
 'headers': [],
 'row-count': 0,
 'source': 'data/nz.csv',
 'time': 0.002,
 'valid': False}
---------
[-,-] [source-error] 'charmap' codec can't decode byte 0x81 in position 1780: character maps to <undefined>

File: data/nz.csv

The line that triggers the error is line 18 of the file:

nz/archives-new-zealand-te-rua-mahara-o-te-kwanatanga,Archives New Zealand Te Rua Mahara o te Kāwanatanga,,,,,,,,,http://www.archives.govt.nz,NZ,,,,,

Apparently, this was supposed to be fixed in #45 by implementing a decode_strategy keyword, but I could not find any mention of it in the documentation. With the default settings, users of Goodtables are still left confused.
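
The reported message can be reproduced with the standard library alone. The sketch below takes the one non-ASCII word from line 18 and decodes its UTF-8 bytes three ways. The choice of cp1252 is an assumption made for illustration: the issue does not say which encoding was actually detected, only that the failure came from a 'charmap' codec, which is how Python names errors from single-byte code pages. The helper name try_decode is hypothetical.

```python
# "ā" is U+0101, which UTF-8 encodes as the two bytes 0xC4 0x81.
raw = "K\u0101wanatanga".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a description of the decoding error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc}"


# Decoding with the real encoding succeeds.
utf8_result = try_decode(raw, "utf-8")

# ASSUMPTION: cp1252 stands in for whatever single-byte encoding was guessed.
# Python's cp1252 table has no character for byte 0x81, so decoding fails.
cp1252_result = try_decode(raw, "cp1252")

# A plain ASCII decode fails too, but with an 'ascii' codec message instead.
ascii_result = try_decode(raw, "ascii")

for label, result in [("utf-8", utf8_result),
                      ("cp1252", cp1252_result),
                      ("ascii", ascii_result)]:
    print(f"{label:>7}: {result}")
```

If this reading is right, the file was more likely treated as a single-byte code page than as strict ASCII, since an ASCII decode would have named the 'ascii' codec in the message.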

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 3
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
AntoineAugusti commented, Feb 27, 2020

I can confirm

goodtables.validate(
    [{"source": "exemple-valide.csv", "schema": "schema.json", "encoding": "utf-8"}]
)

works as expected. Thanks!

0 reactions
roll commented, Apr 23, 2020

It works only if encoding=utf-8 is provided. We’re going to improve the quality of detection here: https://github.com/frictionlessdata/tabulator-py/issues/308.

For now, since this is a tabulator/chardet problem, source-error is the proper error code for goodtables to report.
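
Until detection improves, one way to avoid hard-coding the encoding is to check the whole file before validating. The sketch below is not part of goodtables or tabulator: pick_encoding is a hypothetical helper, and the cp1252 fallback is an arbitrary assumption. It reads every byte, so a file that is ASCII for its first thousands of bytes and only later contains multi-byte UTF-8 sequences is still recognised.

```python
import os
import tempfile
from pathlib import Path


def pick_encoding(path, fallback="cp1252"):
    """Hypothetical helper: 'utf-8' if the entire file decodes as UTF-8, else `fallback`."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")  # strict decode of the full contents, not a sample
    except UnicodeDecodeError:
        return fallback  # ASSUMPTION: the caller knows a sensible fallback
    return "utf-8"


# Demo: a long ASCII-only prefix followed by one line with a non-ASCII character.
ascii_rows = "id,name\n" + "nz/example,Example\n" * 100
utf8_bytes = (ascii_rows + "nz/archives,K\u0101wanatanga\n").encode("utf-8")
# The same text stored as Latin-1 is not valid UTF-8 (0xE4 followed by "w").
latin1_bytes = (ascii_rows + "nz/archives,K\u00e4wanatanga\n").encode("latin-1")

results = {}
for name, payload in [("utf8", utf8_bytes), ("latin1", latin1_bytes)]:
    with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
        handle.write(payload)
    results[name] = pick_encoding(handle.name)
    os.remove(handle.name)

print(results)
```

The returned value could then be supplied as the "encoding" entry of the source dictionary, in the form shown in the first comment above.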
