UnicodeError while creating TabularDataset
I am following the tutorial which can be found here. The code is as follows:
>>> import torch
>>> from torchtext import data, datasets
>>> from torch.autograd import Variable
>>> import torch.nn as nn
>>> import torch.nn.functional as F
>>> import sys
>>>
>>>
>>> text_field = data.Field(lower=True, tokenize='spacy',tensor_type=torch.LongTensor)
>>> label_field = data.Field(sequential=False)
>>>
>>> text_field.preprocessing = lambda x:x
>>>
>>> pr_data = data.TabularDataset(path='polarity.tsv',format='tsv',fields=[('text',text_field),('label',label_field)])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/rahulbhalley/miniconda2/lib/python2.7/site-packages/torchtext/data/dataset.py", line 235, in __init__
examples = [make_example(line, fields) for line in reader]
File "/Users/rahulbhalley/miniconda2/lib/python2.7/site-packages/torchtext/utils.py", line 60, in unicode_csv_reader
for row in csv_reader:
File "/Users/rahulbhalley/miniconda2/lib/python2.7/site-packages/torchtext/utils.py", line 69, in utf_8_encoder
for line in unicode_csv_data:
File "/Users/rahulbhalley/miniconda2/lib/python2.7/codecs.py", line 314, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd1 in position 8: invalid continuation byte
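The last line of the traceback says the file contains a byte (0xd1) that does not form a valid UTF-8 sequence. As a rough diagnostic sketch, not part of torchtext, the file can be checked on its own before it is handed to TabularDataset. The helper name `find_invalid_utf8` is hypothetical, and the snippet assumes Python 3 (the session above is Python 2.7). Note that the "position 8" in the traceback is relative to the chunk the decoder was reading, whereas this helper reports an offset into the whole file.

```python
def find_invalid_utf8(path):
    """Return (offset, byte_value) for the first byte that breaks UTF-8
    decoding, or None if the whole file decodes cleanly.

    Hypothetical helper for illustration; it is not a torchtext function.
    """
    # Read raw bytes so that no implicit decoding happens on open().
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the offending byte within `data`.
        return err.start, data[err.start]
    return None
```

If `find_invalid_utf8('polarity.tsv')` returns something other than `None`, the local copy of the file is probably not the UTF-8 file the tutorial expects.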
Issue Analytics
- State:
- Created 5 years ago
- Comments:7 (4 by maintainers)
Top GitHub Comments
Thanks @mttk. I guess the problem was that I opened the raw version of polarity.tsv in GitHub and then saved it. Now I just ran and loaded polarity.tsv & it fa*king worked! Thanks a lot!

Could you try downloading polarity.tsv again and running again with the fresh download (without opening the downloaded file in any editor)?
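One way to get a fresh copy without any editor or browser "save as" step touching the bytes is to fetch the file from a script. The following is only a sketch under assumptions: `download_raw` is a hypothetical helper, the URL of the raw polarity.tsv is not given in this thread and must be supplied by the reader, and the snippet targets Python 3.

```python
import shutil
import urllib.request


def download_raw(url, dest_path):
    """Copy the bytes served at `url` to `dest_path` without decoding them.

    Hypothetical helper; `url` should point at the raw file, which this
    thread does not specify.
    """
    # Both files are handled in binary mode, so the bytes are written
    # exactly as received and no re-encoding can take place.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

After calling, for example, `download_raw(raw_url, 'polarity.tsv')` with the raw file's URL in `raw_url`, the same TabularDataset call from the question can be retried on the untouched file.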