UnicodeError while creating TabularDataset

See original GitHub issue

I am following the tutorial, which can be found here. The code is as follows:

>>> import torch
>>> from torchtext import data, datasets
>>> from torch.autograd import Variable
>>> import torch.nn as nn
>>> import torch.nn.functional as F
>>> import sys
>>> 
>>> 
>>> text_field = data.Field(lower=True, tokenize='spacy',tensor_type=torch.LongTensor)
>>> label_field = data.Field(sequential=False)
>>> 
>>> text_field.preprocessing = lambda x:x
>>> 
>>> pr_data = data.TabularDataset(path='polarity.tsv',format='tsv',fields=[('text',text_field),('label',label_field)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rahulbhalley/miniconda2/lib/python2.7/site-packages/torchtext/data/dataset.py", line 235, in __init__
    examples = [make_example(line, fields) for line in reader]
  File "/Users/rahulbhalley/miniconda2/lib/python2.7/site-packages/torchtext/utils.py", line 60, in unicode_csv_reader
    for row in csv_reader:
  File "/Users/rahulbhalley/miniconda2/lib/python2.7/site-packages/torchtext/utils.py", line 69, in utf_8_encoder
    for line in unicode_csv_data:
  File "/Users/rahulbhalley/miniconda2/lib/python2.7/codecs.py", line 314, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
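
The traceback reports a byte that is not valid UTF-8 but does not say which line of the file it sits on. A minimal sketch for locating such bytes follows; it is written for Python 3 (the session above runs Python 2.7), and the helper names `find_invalid_utf8` and `scan_file` are hypothetical, not part of torchtext. The sample rows are made up for illustration and are not taken from polarity.tsv.

```python
def find_invalid_utf8(raw_lines):
    """Return (line_number, byte_offset, byte_value) for each line that is not valid UTF-8.

    raw_lines is any iterable of bytes objects, for example a file opened in "rb" mode.
    Only the first invalid byte of each offending line is reported.
    """
    problems = []
    for line_number, raw in enumerate(raw_lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the offending byte within this line.
            problems.append((line_number, err.start, raw[err.start]))
    return problems


def scan_file(path):
    # Read in binary mode so nothing is decoded before it is checked.
    with open(path, "rb") as handle:
        return find_invalid_utf8(handle)


if __name__ == "__main__":
    # Made-up rows: the second one contains a stray 0xd1 byte.
    sample = [b"good movie\tpos\n", b"bad film\xd1 plot\tneg\n"]
    for line_number, offset, value in find_invalid_utf8(sample):
        print("line %d, byte offset %d: 0x%02x" % (line_number, offset, value))
    # prints: line 2, byte offset 8: 0xd1
```

Calling `scan_file('polarity.tsv')` on the file that fails to load should list the rows to inspect; an empty list means every line decodes as UTF-8.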

Issue Analytics

Created 5 years ago
Comments: 7 (4 by maintainers)

Top GitHub Comments

3 reactions

RahulBhalley commented, Oct 10, 2018

Thanks @mttk. I guess the problem was that I opened the raw version of polarity.tsv on GitHub and then saved it. Now I just ran

git clone https://github.com/DaehanKim/torchtext_tutorial.git

and loaded polarity.tsv & it fa*king worked! Thanks a lot!

2 reactions

mttk commented, Oct 10, 2018

Could you try downloading polarity.tsv again and running again with the fresh download (without opening the downloaded file in any editor)?
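
One way to confirm that a saved copy differs from a fresh download is to compare the two files byte for byte. The sketch below does this with SHA-256 digests; the function names `sha256_of` and `same_bytes` are hypothetical, and the temporary files in the demo stand in for the two copies of polarity.tsv.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in binary chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True if both files contain exactly the same bytes."""
    return sha256_of(path_a) == sha256_of(path_b)


if __name__ == "__main__":
    # Stand-ins for a fresh download and a copy that was re-saved with a changed byte.
    workdir = tempfile.mkdtemp()
    fresh = os.path.join(workdir, "fresh.tsv")
    resaved = os.path.join(workdir, "resaved.tsv")
    with open(fresh, "wb") as handle:
        handle.write("caf\u00e9\tpos\n".encode("utf-8"))
    with open(resaved, "wb") as handle:
        handle.write(b"caf\xd1\tpos\n")
    print(same_bytes(fresh, fresh))    # prints: True
    print(same_bytes(fresh, resaved))  # prints: False
```

If the digests differ, the saved copy was altered somewhere between download and load, which is consistent with the advice to use an untouched download.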
