Incorrect detection of encoding gives out a source-error
We have a CSV file that uses UTF-8 encoding but contains only ASCII characters up to some point in the middle of the file. Goodtables is probably incorrectly interpreting it as an ASCII file, because it reports a source-error:
$ goodtables --infer-schema data/nz.csv
DATASET
=======
{'error-count': 1,
'preset': 'nested',
'table-count': 1,
'time': 0.003,
'valid': False}
TABLE [1]
=========
{'error-count': 1,
'headers': [],
'row-count': 0,
'source': 'data/nz.csv',
'time': 0.002,
'valid': False}
---------
[-,-] [source-error] 'charmap' codec can't decode byte 0x81 in position 1780: character maps to <undefined>
File: data/nz.csv
The line that gives an error is L18:
nz/archives-new-zealand-te-rua-mahara-o-te-kwanatanga,Archives New Zealand Te Rua Mahara o te Kāwanatanga,,,,,,,,,http://www.archives.govt.nz,NZ,,,,,
Apparently, this would have been fixed in #45 by implementing a decode_strategy keyword. But I could not find any mention of it in the documentation. The default settings still leave users of Goodtables confused.
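The failure can be reproduced with the standard library alone. The sketch below is a hypothetical reconstruction, not goodtables' or tabulator's actual detection code: it assumes the detector only inspects a leading sample of the file (the 100-byte sample size is made up) and that the codec eventually used was cp1252, one of the Windows code pages that Python reports as 'charmap'. The character ā is encoded in UTF-8 as the bytes 0xC4 0x81, and 0x81 has no mapping in cp1252.

```python
# Hypothetical reconstruction of the failure; not goodtables' real code path.
row = "Archives New Zealand Te Rua Mahara o te Kāwanatanga"

# An ASCII-only prefix followed by one row containing a non-ASCII character.
data = ("a,b\n" * 50 + row + "\n").encode("utf-8")

# Assumed behaviour: detection looks only at the first bytes of the file.
sample = data[:100]
sample_is_ascii = all(byte < 0x80 for byte in sample)

# Decoding the whole file with cp1252 (an assumed wrong guess) fails on 0x81.
try:
    data.decode("cp1252")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the real encoding succeeds.
decoded = data.decode("utf-8")

print(sample_is_ascii)
print(error)
```

The sample contains nothing that distinguishes UTF-8 from any ASCII-compatible encoding, so a guess based on it alone cannot be reliable for a file like this one.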
Top GitHub Comments
I can confirm
works as expected. Thanks!
It works only if encoding=utf-8 is provided. We're going to improve quality of detection here - https://github.com/frictionlessdata/tabulator-py/issues/308. For now, as it's a tabulator/chardet problem, source-error is a proper error code from goodtables.
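The idea behind the workaround, stating the encoding instead of relying on detection, can be sketched with the standard library. This is only an illustration: read_rows is a hypothetical helper, and the snippet does not show the goodtables or tabulator API.

```python
import csv
import io


def read_rows(raw: bytes, encoding: str = "utf-8"):
    """Parse CSV bytes using a caller-supplied encoding (no detection)."""
    text = raw.decode(encoding)  # raises UnicodeDecodeError if the encoding is wrong
    return list(csv.reader(io.StringIO(text, newline="")))


# Sample data made up for this sketch.
raw = "id,title\nnz,Kāwanatanga\n".encode("utf-8")
rows = read_rows(raw, encoding="utf-8")
print(rows)
```

Because the encoding is given explicitly, the result does not depend on where the first non-ASCII character appears in the file.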