Quotation mark at start of line corrupts data

When I read in a .tsv file containing the following two lines, each with two fields (input and output):

a " c    b
" b c	 a

the first example results in the fields [‘a’, ‘"’, ‘c’], [‘b’], while the second results in an input field [‘b’, ‘c’, ‘a’] and no output field. I think the desired behaviour would be for the second line to read as [‘"’, ‘b’, ‘c’], [‘a’], and for the reader to be agnostic of the specific characters or tokens used.
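The effect can be reproduced outside torchtext. The sketch below assumes, as a later comment in this thread suggests, that the file ends up in Python's standard `csv.reader` with its default quoting, and that a single tab separates the two fields. It shows only the raw split into fields, not torchtext's tokenization.

```python
import csv

# The two lines from the report, assuming one tab between input and output.
lines = ['a " c\tb\n', '" b c\ta\n']

# Default reader settings: quotechar='"', quoting=csv.QUOTE_MINIMAL.
rows = list(csv.reader(lines, delimiter="\t"))

for row in rows:
    # Print the number of fields the reader found on each record.
    print(len(row), row)
```

The first line keeps both fields because its quote sits in the middle of a field. The second line collapses into a single field: the leading quote opens a quoted field, so the tab after it is no longer treated as a separator.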

I’ve seen this go wrong in multiple ways. It seems as if the character is not properly escaped.

Produced with torchtext 0.2.3 and Python 3.6.5. The problem does not seem to appear with torchtext 0.2.1.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
mttk commented, Sep 29, 2018

@keitakurita please do file a PR

0 reactions
wcollins-ebsco commented, Feb 6, 2019

I just got bit by this big time, in two different ways. Given the severity of the problems it can cause, I would urge adding a warning note to the documentation, if not making ignoring quotes the default.

Lines in my dataset that started with double quotes were a fringe case, but the csv reader selects everything up to the next double quote, which could be dozens or hundreds of records! When this bloated record tensor was put onto CUDA it blew out the memory. Since I was shuffling the data each time, this conglomerate record would vary in position and size, so the problem was intermittent. It took me a long time to diagnose this as the cause of the CUDA memory problems.
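A small sketch of how one stray leading quote can merge several records, again assuming the data goes through Python's `csv.reader` with default quoting. The five records are hypothetical and only illustrate the mechanism.

```python
import csv

# Hypothetical five-record TSV. Record 2 opens with a quote, and the next
# quote only appears at the end of the first field of record 4.
lines = [
    'r1\tx\n',
    '"r2\ty\n',
    'r3\tz\n',
    'r4"\tw\n',
    'r5\tv\n',
]

rows = list(csv.reader(lines, delimiter="\t"))

# Fewer rows than lines means some records were merged.
print(f"{len(lines)} lines in, {len(rows)} rows out")
for row in rows:
    print(row)
```

Records 2 to 4 come back as one row whose first field contains the embedded tabs and newlines. Comparing the number of rows with the number of lines in the file is a cheap way to catch this before training.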

When the training/testing did happen to complete I realized that my record count for the test data was not as expected. I discovered this the hard way when trying to post a Kaggle submission, which was rejected due to the bad record count.

I wasn’t able to diagnose this problem until I read the source code in this repo and looked into each of the components that could have caused what I thought were dropped records. Eventually I found out about csv reader’s default treatment of double quotes and discovered this keyword argument for addressing the issue.
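The comment does not name the keyword argument, and how a given torchtext version forwards options to the reader is not shown here. As a sketch at the level of Python's `csv` module only, quote handling can be switched off with `quoting=csv.QUOTE_NONE`, assuming tab-separated input like the lines in the original report.

```python
import csv

# The two lines from the original report, assuming one tab between fields.
lines = ['a " c\tb\n', '" b c\ta\n']

# QUOTE_NONE makes the reader treat '"' as an ordinary character,
# so a leading quote no longer opens a quoted field.
rows = list(csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE))

for row in rows:
    print(row)
```

With quoting disabled, both lines split into exactly two fields and the quote character stays in the data. The trade-off is that fields which legitimately rely on quoting, for example to contain a tab, are no longer supported.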

That’s my sob story. I hope it can help others.
