Quotation mark at start of line corrupts data
When I read in a .tsv file containing the following two lines, each with two fields (input and output):
a " c b
" b c a
the first example results in the fields ['a', '"', 'c'], ['b'], while the second results in an input field ['b', 'c', 'a'] and no output field at all. The desired behaviour would be to read ['"', 'b', 'c'], ['a'], agnostic of the specific characters or tokens used.
I've seen this go wrong in multiple ways; it seems the quote character is not being handled properly.
Reproduced with torchtext 0.2.3 and Python 3.6.5. The problem does not appear with torchtext 0.2.1.
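The behaviour can be reproduced with Python's built-in csv module alone (which, as I understand it, torchtext uses internally for .tsv parsing); the data below is just the two example lines above. This is a minimal sketch, not torchtext's actual code path:

```python
import csv
import io

# The two example lines, tab-separated: input field, output field.
data = 'a " c\tb\n" b c\ta\n'

# Default csv behaviour: a quote at the START of a field opens a quoted
# field, and the reader consumes input (including tabs and newlines)
# until it finds a closing quote or hits EOF.
default_rows = list(csv.reader(io.StringIO(data), delimiter='\t'))

# With quoting disabled, the quote is just an ordinary character and
# the reader only splits on tabs.
raw_rows = list(csv.reader(io.StringIO(data), delimiter='\t',
                           quoting=csv.QUOTE_NONE))

print(default_rows)  # second line collapses into a single field
print(raw_rows)      # [['a " c', 'b'], ['" b c', 'a']]
```

With QUOTE_NONE, whitespace-tokenizing the second row's fields gives exactly the desired ['"', 'b', 'c'], ['a'].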
Issue Analytics
- State:
- Created 5 years ago
- Comments: 7 (2 by maintainers)
Top GitHub Comments
@keitakurita please do file a PR
I just got bit by this big time, in two different ways. Given the severity of the problems it can cause, I would urge at least a warning note in the documentation, if not making quote-ignoring the default.
Lines in my dataset that started with double quotes were a fringe case, but the csv reader selects everything up to the next double quote, which could be dozens or hundreds of records! When this bloated record tensor was put onto CUDA it blew out the memory. Since I was shuffling the data each time, this conglomerate record would vary in position and size, so the problem was intermittent. It took me a long time to diagnose this as the cause of the CUDA memory problems.
When the training/testing did happen to complete I realized that my record count for the test data was not as expected. I discovered this the hard way when trying to post a Kaggle submission, which was rejected due to the bad record count.
I wasn’t able to diagnose this problem until I read the source code in this repo and looked into each of the components that could have caused what I thought were dropped records. Eventually I found out about csv reader’s default treatment of double quotes and discovered this keyword argument for addressing the issue.
That’s my sob story. I hope it can help others.
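The blow-up described above is easy to reproduce with the standard csv module: a single line that happens to start with a double quote swallows every following line until the next quote (or EOF), which explains both the giant intermittent record and the short record count. A minimal sketch (the file contents here are made up for illustration):

```python
import csv
import io

# One corrupted line (leading quote) followed by 100 well-formed records.
lines = ['" oops\tlabel'] + [f'token {i}\tlabel {i}' for i in range(100)]
data = '\n'.join(lines) + '\n'

# Default quoting: the unterminated quote swallows ALL remaining lines,
# so 101 input lines collapse into one enormous record.
default_rows = list(csv.reader(io.StringIO(data), delimiter='\t'))

# QUOTE_NONE: quotes are ordinary characters; every line is one record.
safe_rows = list(csv.reader(io.StringIO(data), delimiter='\t',
                            quoting=csv.QUOTE_NONE))

print(len(default_rows))  # 1
print(len(safe_rows))     # 101
```

If your torchtext version exposes it, the analogous fix for TabularDataset is passing csv_reader_params={'quoting': csv.QUOTE_NONE}, which is forwarded to csv.reader; check the API of the release you're on.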