UnicodeError while creating TabularDataset

I am following the tutorial, which can be found here. The code is as follows:

>>> import torch
>>> from torchtext import data, datasets
>>> from torch.autograd import Variable
>>> import torch.nn as nn
>>> import torch.nn.functional as F
>>> import sys
>>> 
>>> 
>>> text_field = data.Field(lower=True, tokenize='spacy',tensor_type=torch.LongTensor)
>>> label_field = data.Field(sequential=False)
>>> 
>>> text_field.preprocessing = lambda x:x
>>> 
>>> pr_data = data.TabularDataset(path='polarity.tsv',format='tsv',fields=[('text',text_field),('label',label_field)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rahulbhalley/miniconda2/lib/python2.7/site-packages/torchtext/data/dataset.py", line 235, in __init__
    examples = [make_example(line, fields) for line in reader]
  File "/Users/rahulbhalley/miniconda2/lib/python2.7/site-packages/torchtext/utils.py", line 60, in unicode_csv_reader
    for row in csv_reader:
  File "/Users/rahulbhalley/miniconda2/lib/python2.7/site-packages/torchtext/utils.py", line 69, in utf_8_encoder
    for line in unicode_csv_data:
  File "/Users/rahulbhalley/miniconda2/lib/python2.7/codecs.py", line 314, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
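For context, the message in the traceback can be reproduced in isolation. The sketch below is written for Python 3 (the session above uses Python 2.7) and the sample bytes are hypothetical, not taken from polarity.tsv. It places the byte 0xd1 at offset 8 and follows it with a space. In UTF-8, 0xd1 opens a two-byte sequence, so the decoder expects a continuation byte in the range 0x80-0xbf and fails when it finds something else.

```python
# Hypothetical sample: eight ASCII bytes, then 0xd1 followed by a space.
sample = b"a great \xd1 film\tpos\n"

try:
    sample.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # Keep a reference so the details can be inspected after the handler.
    error = exc

# Offset of the offending byte, the byte itself, and the codec's reason.
print(error.start, hex(sample[error.start]), error.reason)
# → 8 0xd1 invalid continuation byte

# The same bytes decode without error as Latin-1, where 0xd1 is "Ñ".
# This only illustrates that the bytes may be valid in another encoding;
# the actual encoding of the reporter's file is not known from the issue.
as_latin1 = sample.decode("latin-1")
```

An error of this kind usually means the file contains bytes that were not written as UTF-8, rather than a fault in the code that reads it.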

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

3 reactions
RahulBhalley commented, Oct 10, 2018

Thanks @mttk. I guess the problem was that I opened the raw version of polarity.tsv on GitHub and then saved it. Now I just ran

git clone https://github.com/DaehanKim/torchtext_tutorial.git

and loaded polarity.tsv & it fa*king worked! Thanks a lot!

2 reactions
mttk commented, Oct 10, 2018

Could you try downloading polarity.tsv again and running again with the fresh download (without opening the downloaded file in any editor)?
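One way to check a downloaded copy before handing it to TabularDataset is to read it as raw bytes and attempt a UTF-8 decode. The sketch below is only an illustration in Python 3: the helper name first_invalid_utf8_offset is hypothetical and is not part of torchtext, and the two temporary files stand in for a clean and a damaged polarity.tsv.

```python
import os
import tempfile


def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Demonstration with two throwaway files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "good.tsv")
    bad_path = os.path.join(tmp, "bad.tsv")

    # Valid UTF-8, including a non-ASCII character.
    with open(good_path, "wb") as handle:
        handle.write("caf\u00e9 scene\tpos\n".encode("utf-8"))

    # A lone 0xd1 at offset 8 followed by a space, which is not valid UTF-8.
    with open(bad_path, "wb") as handle:
        handle.write(b"a great \xd1 film\tpos\n")

    good_offset = first_invalid_utf8_offset(good_path)
    bad_offset = first_invalid_utf8_offset(bad_path)

print(good_offset, bad_offset)
```

If the helper returns an offset for a file, fetching a fresh copy (for example with git clone, as in the comment above) and checking again is a reasonable next step.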

Top Results From Across the Web

python - AttributeError:module 'torchtext.data' has no attribute ...
I want to create a dataset from a tsv file with pytorch. I was thinking of using torchtext.data.TabularDataset.splits.

How to Avoid UnicodeDecodeError while loading data into ...
I am trying load CSV data into DataFrame using python. while doing it, below error is occurred. I have found below command resolve...

azureml.data.tabular_dataset.TabularDataset class
Represents a tabular dataset to use in Azure Machine Learning. A TabularDataset defines a series of lazily-evaluated, immutable operations to load data from ...

'utf-8' codec can't decode byte 0x93 in position 364 - Intellipaat
'utf-8' codec can't decode byte 0x93 in position 364: invalid start byte. my codes. df = pd.read_csv(path, encoding = 'unicode_escape').

Azure/azureml-sdk-for-r source: R/run.R - RDRR.io
... dots if we get here due to unicode error on windows rstudio console # terminals while (run$get_status() %in% azureml$core$run$RUNNING_STATES) { cat(".
