Quotation mark at start of line corrupts data
When I read in a .tsv file containing the following two lines, each with two fields (input and output):
a " c b
" b c a
the first example results in the fields ['a', '"', 'c'], ['b'], while the second results in an input field ['b', 'c', 'a'] and no output field at all. The desired behaviour would be to read ['"', 'b', 'c'], ['a'], agnostic of the specific characters or tokens used.
I've seen this go wrong in multiple ways; it seems the quote character is not being handled properly.
Reproduced with torchtext 0.2.3 and Python 3.6.5. The problem does not appear with torchtext 0.2.1.
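The behaviour can be reproduced with Python's built-in csv module alone (which, as I understand it, torchtext uses internally for .tsv parsing); the data below is just the two example lines above. This is a minimal sketch, not torchtext's actual code path:

```python
import csv
import io

# The two example lines, tab-separated: input field, output field.
data = 'a " c\tb\n" b c\ta\n'

# Default csv behaviour: a quote at the START of a field opens a quoted
# field, and the reader consumes input (including tabs and newlines)
# until it finds a closing quote or hits EOF.
default_rows = list(csv.reader(io.StringIO(data), delimiter='\t'))

# With quoting disabled, the quote is just an ordinary character and
# the reader only splits on tabs.
raw_rows = list(csv.reader(io.StringIO(data), delimiter='\t',
                           quoting=csv.QUOTE_NONE))

print(default_rows)  # second line collapses into a single field
print(raw_rows)      # [['a " c', 'b'], ['" b c', 'a']]
```

With QUOTE_NONE, whitespace-tokenizing the second row's fields gives exactly the desired ['"', 'b', 'c'], ['a'].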
Issue Analytics
- State:
- Created 5 years ago
- Comments: 7 (2 by maintainers)
Top GitHub Comments
@keitakurita please do file a PR
I just got bit by this big time, in two different ways. Given the severity of the problems it can cause, I would urge at least a warning note in the documentation, if not making quote-ignoring the default.
Lines in my dataset that started with double quotes were a fringe case, but the csv reader selects everything up to the next double quote, which could be dozens or hundreds of records! When this bloated record tensor was put onto CUDA it blew out the memory. Since I was shuffling the data each time, this conglomerate record would vary in position and size, so the problem was intermittent. It took me a long time to diagnose this as the cause of the CUDA memory problems.
When the training/testing did happen to complete I realized that my record count for the test data was not as expected. I discovered this the hard way when trying to post a Kaggle submission, which was rejected due to the bad record count.
I wasn’t able to diagnose this problem until I read the source code in this repo and looked into each of the components that could have caused what I thought were dropped records. Eventually I found out about csv reader’s default treatment of double quotes and discovered this keyword argument for addressing the issue.
That’s my sob story. I hope it can help others.
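The blow-up described above is easy to reproduce with the standard csv module: a single line that happens to start with a double quote swallows every following line until the next quote (or EOF), which explains both the giant intermittent record and the short record count. A minimal sketch (the file contents here are made up for illustration):

```python
import csv
import io

# One corrupted line (leading quote) followed by 100 well-formed records.
lines = ['" oops\tlabel'] + [f'token {i}\tlabel {i}' for i in range(100)]
data = '\n'.join(lines) + '\n'

# Default quoting: the unterminated quote swallows ALL remaining lines,
# so 101 input lines collapse into one enormous record.
default_rows = list(csv.reader(io.StringIO(data), delimiter='\t'))

# QUOTE_NONE: quotes are ordinary characters; every line is one record.
safe_rows = list(csv.reader(io.StringIO(data), delimiter='\t',
                            quoting=csv.QUOTE_NONE))

print(len(default_rows))  # 1
print(len(safe_rows))     # 101
```

If your torchtext version exposes it, the analogous fix for TabularDataset is passing csv_reader_params={'quoting': csv.QUOTE_NONE}, which is forwarded to csv.reader; check the API of the release you're on.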