Customized torchtext.data.Dataset takes a long time to generate a dataset
See original GitHub issue

❓ Questions and Help
Description: I wrote a customized data.Dataset for multilabel classification. When I processed the data, I found that generating the train and test sets with the customized dataset is very slow (about 1.5 s per example). I am wondering whether this is normal or whether something is wrong with my customized dataset.
The customized data.Dataset for multilabel classification is as follows:

    from torchtext import data
    from tqdm import tqdm


    class TextMultiLabelDataset(data.Dataset):
        # n_labels is only needed at test time (lbls is None) to build dummy
        # labels; the original code left it undefined on that path
        def __init__(self, text, text_field, label_field, lbls=None,
                     n_labels=0, **kwargs):
            # torchtext Field objects
            fields = [('text', text_field), ('label', label_field)]
            is_test = lbls is None
            if not is_test:
                # labels per example (len(lbls) would be the example count)
                n_labels = len(lbls[0])
            examples = []
            for i, txt in enumerate(tqdm(text)):
                lbl = lbls[i] if not is_test else [0.0] * n_labels
                examples.append(data.Example.fromlist([txt, lbl], fields))
            super(TextMultiLabelDataset, self).__init__(examples, fields, **kwargs)
where text is a list of documents, each a list containing the document's strings, and lbls is a list of binary label vectors (total number of labels ≈ 20,000).
examples of text:
[["There are few factors more important to the mechanisms of evolution than stress. The stress response has formed as a result of natural selection..."], ["A 46-year-old female patient presenting with unspecific lower back pain, diffuse abdominal pain, and slightly elevated body temperature"], ...]
examples of lbls:
[[1 1 1 1 0 0 0 1 0 ...], [1 0 1 0 1 1 1 1 ...], ...]
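One likely contributor to the slowness is that data.Example.fromlist runs each field's preprocessing pipeline per example. Since the labels shown above are already binary indicator vectors, they need no tokenization or vocabulary; a label Field declared with sequential=False and use_vocab=False passes them through almost unchanged. A minimal sketch of the label conversion (the helper name to_float_labels is hypothetical, not part of torchtext):

```python
def to_float_labels(raw):
    # raw may be a space-separated string like "1 1 0" (as in the example
    # above) or a list of ints; either way, return a plain float list that
    # a non-sequential, no-vocab label Field can consume directly
    if isinstance(raw, str):
        raw = raw.split()
    return [float(v) for v in raw]


print(to_float_labels("1 1 0"))    # -> [1.0, 1.0, 0.0]
print(to_float_labels([1, 0, 1]))  # -> [1.0, 0.0, 1.0]
```

With ~20,000 labels per example, keeping this conversion trivial matters: any per-label work inside the Field's preprocessing pipeline is multiplied by the label count for every example.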
Issue Analytics
- Created 3 years ago
- Comments: 18 (8 by maintainers)
Top Results From Across the Web

torchtext.data - Read the Docs
Create a dataset from a list of Examples and Fields. ... Default: torch.long. preprocessing – The Pipeline that will be applied to examples...

Creating a Custom torchtext Dataset from a Text File
In a non-demo scenario, preparing data for NLP can take many days or weeks. Three country's flags from an Internet search for “red...

Use torchtext to Load NLP Datasets — Part I | by Ceshine Lee
There is a significant problem in the approach presented above. It's really slow. The entire dataset loading process takes around seven minutes ...

Pytorch Torchtext Tutorial 1: Custom Datasets and loading ...
In this video I show you how to load different file formats (json, csv, tsv) in Pytorch Torchtext using Fields, TabularDataset, ...

How to use the torchtext.data.Dataset function in ... - Snyk
To help you get started, we've selected a few torchtext.data.Dataset examples, based on popular ways it is used in public projects.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments

> In that case, you don't need to load the vectors into model.embedding. Instead, just convert a list of tokens into a tensor by calling Vectors' __getitem__ func here and send the tensor to the model.

> It really helps! Thank you!
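The pattern in that comment can be illustrated without torchtext. ToyVectors below is a hypothetical stand-in (not the real API) for torchtext.vocab.Vectors, which likewise returns one vector per token via __getitem__; the point is to look embeddings up outside the model and feed the resulting matrix in, rather than copying the whole vector table into model.embedding:

```python
class ToyVectors:
    """Hypothetical stand-in for torchtext.vocab.Vectors."""

    def __init__(self, table, dim):
        self.table = table  # token -> vector (list of floats)
        self.dim = dim

    def __getitem__(self, token):
        # unknown tokens map to a zero vector, mirroring Vectors' default
        return self.table.get(token, [0.0] * self.dim)


vecs = ToyVectors({"cat": [1.0, 2.0], "sat": [3.0, 4.0]}, dim=2)
tokens = ["the", "cat", "sat"]
# with real torch tensors this would be torch.stack([...]) before the model
matrix = [vecs[t] for t in tokens]
print(matrix)  # -> [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
```

This keeps the model itself small: only the vectors for tokens that actually appear in a batch are materialized, which is useful when the full vector table is too large to hold in an embedding layer.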