Improve dialect sniffing

Back in June 2017, @petersonjr pointed out some test cases that cause csvkit/agate to misinterpret the dialect of files: https://github.com/wireservice/csvkit/issues/751#issuecomment-310803282

The reasonable workaround @jpmckinney pointed out was to pass --snifflimit 0 on the command line. While this helps, I’d like to point out a couple of things:

a) Those test cases work fine without the override to the Sniffer class introduced in https://github.com/wireservice/agate/commit/3b9ceea131ba143cc72b6d1a9f7871d059188b52

b) I dug into the default Sniffer in the Python standard library. It’s not very robust: it uses regular expressions to parse each line, which means it gets confused by, say, commas inside quoted strings.

c) While agate has a reasonable default sniff_limit of 0, the command-line argument parsing in csvkit overrides it to None, which has the effect of sniffing the entire file. As the input file grows, it becomes increasingly likely that the sample will contain something the Python Sniffer can’t handle.
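To make the three sniff-limit settings in (c) concrete, here is a minimal sketch using only the standard library. `sniff_dialect` is a hypothetical helper, not agate's or csvkit's actual code; it assumes the semantics described above (0 skips sniffing, None sniffs the whole input, a positive number sniffs only that many leading characters).

```python
import csv


def sniff_dialect(text, sniff_limit=0):
    """Hypothetical helper illustrating the sniff-limit semantics.

    sniff_limit == 0    -> do not sniff at all (caller uses its default dialect)
    sniff_limit is None -> sniff the entire text
    sniff_limit > 0     -> sniff only the first `sniff_limit` characters
    """
    if sniff_limit == 0:
        return None
    sample = text if sniff_limit is None else text[:sniff_limit]
    try:
        return csv.Sniffer().sniff(sample)
    except csv.Error:
        # The standard-library Sniffer raises csv.Error when it cannot decide.
        return None


DATA = "id;name;score\n1;Ada;90\n2;Grace;85\n3;Alan;77\n"

for limit in (0, 1024, None):
    dialect = sniff_dialect(DATA, sniff_limit=limit)
    detected = None if dialect is None else dialect.delimiter
    print(f"sniff_limit={limit!r}: delimiter={detected!r}")
```

With None, the sample handed to the Sniffer grows with the file, which is the situation (c) describes; a positive limit keeps the sample size fixed regardless of file size.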

These observations indicate a few places in the stack where a change may be beneficial. I’m not sure if any of them is a good idea (I may well be missing context as an outsider), but thought I’d try to start a discussion. Is there a test suite for the custom Sniffer in agate? Is it an option to default --snifflimit to something smaller, say 1024? (Obviously, changing the Python standard library is a bigger task. The maintainer of the csv module is also no longer active.)
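As a starting point for the test-suite question, one possible shape for such a test is a round-trip check: write rows with a known delimiter, sniff a bounded sample, and compare. This is only a sketch against the standard-library Sniffer; `roundtrip_delimiter` and `ROWS` are made-up names, and the 1024-character limit is the value floated above, not an existing default.

```python
import csv
import io


def roundtrip_delimiter(rows, delimiter, sniff_limit=1024):
    """Write rows with a known delimiter, then sniff the first
    `sniff_limit` characters and return the delimiter that was detected
    (or None if the Sniffer gives up)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    sample = buffer.getvalue()[:sniff_limit]
    try:
        return csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        return None


# Deliberately simple rows: no quoting, no embedded delimiters.
ROWS = [
    ["id", "name", "score"],
    ["1", "Ada", "90"],
    ["2", "Grace", "85"],
    ["3", "Alan", "77"],
]

for delimiter in (",", ";", "\t"):
    detected = roundtrip_delimiter(ROWS, delimiter)
    print(f"wrote {delimiter!r}, sniffed {detected!r}")
```

The problematic files from the linked comment could be added as further cases in the same style, which would also show whether a given limit changes the outcome for them.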

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

jpmckinney commented, Sep 26, 2018 (1 reaction)

Sure, we can start a 2.0 branch if you contribute a patch.

jpmckinney commented, Sep 14, 2021 (0 reactions)

I consider this closed by #1140


