
Customized torchtext.data.Dataset takes a long time to generate dataset

See original GitHub issue

❓ Questions and Help

Description

I wrote a customized data.Dataset for multilabel classification. When I processed the data, I found that it is very slow to generate the train and test sets using the customized dataset (it takes about 1.5s per example). I am wondering whether this is normal or whether something is wrong with my customized dataset.

The customized data.Dataset for multilabel classification is as follows:

from torchtext import data  # legacy torchtext API (torchtext <= 0.8)
from tqdm import tqdm


class TextMultiLabelDataset(data.Dataset):
    def __init__(self, text, text_field, label_field, lbls=None, n_labels=None, **kwargs):
        # torchtext Field objects
        fields = [('text', text_field), ('label', label_field)]

        is_test = lbls is None
        if not is_test:
            # length of one binary label vector, not the number of examples
            n_labels = len(lbls[0])

        examples = []
        for i, txt in enumerate(tqdm(text)):
            # real label vector for training data; an all-zero placeholder of
            # the same length for test data (pass n_labels explicitly then)
            l = lbls[i] if not is_test else [0.0] * n_labels
            examples.append(data.Example.fromlist([txt, l], fields))

        super(TextMultiLabelDataset, self).__init__(examples, fields, **kwargs)

where text is a list of lists of strings (the documents), and lbls is a list of lists of binary labels. (Total number of labels ~ 20000)

examples of text:

[["There are few factors more important to the mechanisms of evolution than stress. The stress response has formed as a result of natural selection..."], ["A 46-year-old female patient presenting with unspecific lower back pain, diffuse abdominal pain, and slightly elevated body temperature"], ...]

examples of lbls:

[[1 1 1 1 0 0 0 1 0 ...], [1 0 1 0 1 1 1 1 ...], ...]
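
For context, here is a minimal usage sketch of the class above. The toy data, the Field settings, and the n_labels keyword (added in the fix above) are illustrative assumptions, not code from the original report:

from torchtext import data

# toy stand-ins for the report's text / lbls structures
train_texts = [["the first document"], ["the second document"]]
train_lbls = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

# illustrative Field settings; the original report does not show them
TEXT = data.Field(sequential=True, lower=True)
LABEL = data.Field(sequential=False, use_vocab=False)

train_ds = TextMultiLabelDataset(train_texts, TEXT, LABEL, lbls=train_lbls)
# test data carries no labels, so the label-vector length is passed explicitly
test_ds = TextMultiLabelDataset([["an unlabeled document"]], TEXT, LABEL,
                                n_labels=len(train_lbls[0]))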

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 18 (8 by maintainers)

Top GitHub Comments

1 reaction
zhangguanheng66 commented, Jul 9, 2020

That one might not be the best resource (it’s based on the legacy code). If you are using the new dataset abstraction, you can take a look at the text classification example here; see the train.py file, which shows how to use DataLoader.
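
As a rough illustration of that pattern, here is a minimal sketch assuming a map-style dataset whose items are (label-vector, token-id-tensor) pairs; the toy data, collate function, and padding value are assumptions, not the linked example’s exact code:

import torch
from torch.utils.data import DataLoader

# toy stand-in for a processed dataset: (label-vector, token-id sequence) pairs
train_dataset = [
    (torch.tensor([1.0, 0.0, 1.0]), torch.tensor([4, 8, 15, 16])),
    (torch.tensor([0.0, 1.0, 0.0]), torch.tensor([23, 42])),
]

def collate_batch(batch):
    # stack the fixed-size label vectors; pad the variable-length token-id
    # sequences to the longest sequence in the batch
    labels = torch.stack([label for label, _ in batch])
    texts = torch.nn.utils.rnn.pad_sequence(
        [text for _, text in batch], batch_first=True, padding_value=0)
    return labels, texts

train_loader = DataLoader(train_dataset, batch_size=2, collate_fn=collate_batch)
labels, texts = next(iter(train_loader))  # texts: shape (2, 4), zero-padded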

It helps, thank you! One last question: in the previous code, the field has a build_vocab function which allows loading pre-trained word2vec using the vocab created from the training set, i.e. TEXT.build_vocab(train, vectors=vectors). Is there any function that does a similar thing, or is nn.Embedding.from_pretrained the right way? Thank you!

These are our new pre-trained word vectors (FastText and GloVe) (link). cc @Nayef211
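
For reference, a minimal sketch of loading pre-trained vectors. The linked vectors lived in an experimental module at the time, so this uses the long-standing torchtext.vocab loader as a stand-in, and the name/dim choice is arbitrary:

from torchtext.vocab import GloVe

# downloads and caches the vectors (under ~/.vector_cache) on first use
vectors = GloVe(name='6B', dim=100)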

When we have the pretrained vectors, I am wondering how to align them with the vocabulary built from the training set and load them into the model. Previously (TEXT is the desired field in the previous abstraction), aligning them with the vocabulary was:

TEXT.build_vocab(train, vectors=pretrained_embeddings)

and loading them into the model was:

model.embedding.weight.data.copy_(TEXT.vocab.vectors)

Now that the field has been removed in the new abstraction, I am wondering how to do these things without Field? Thank you!

In that case, you don’t need to load the vectors into model.embedding. Instead, just convert a list of tokens into a tensor by calling Vectors’ __getitem__ func here and send the tensor to the model.
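
Concretely, that suggestion looks roughly like this (a sketch assuming a torchtext.vocab.Vectors instance such as the GloVe object above; the token list is made up):

import torch
from torchtext.vocab import GloVe

vectors = GloVe(name='6B', dim=100)  # cached after the first download
tokens = ['a', 'sample', 'sentence']
# Vectors.__getitem__ returns one embedding per token (zeros for OOV words);
# stacking yields a (seq_len, 100) tensor to send straight to the model
embedded = torch.stack([vectors[t] for t in tokens])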

0 reactions
xdwang0726 commented, Jul 16, 2020


Hi, I found that in the torchtext repo only the text classification tasks use the new dataset abstraction; for the other tasks, fields still exist in the dataset settings (referenced here). I am wondering, if I want to use torchtext to create a dataset for a summarization task with BERT, which resource is better to refer to, the translation dataset? Thank you!

I just merged the BERT pipeline under the example folder. #767

It really helps! Thank you!

Read more comments on GitHub >

Top Results From Across the Web

  • torchtext.data - Read the Docs
    Create a dataset from a list of Examples and Fields. ... Default: torch.long. preprocessing – The Pipeline that will be applied to examples...
  • Creating a Custom torchtext Dataset from a Text File
    In a non-demo scenario, preparing data for NLP can take many days or weeks. Three country's flags from an Internet search for “red...
  • Use torchtext to Load NLP Datasets — Part I | by Ceshine Lee
    There is a significant problem in the approach presented above. It's really slow. The entire dataset loading process takes around seven minutes...
  • Pytorch Torchtext Tutorial 1: Custom Datasets and loading ...
    In this video I show you how to load different file formats (json, csv, tsv) in Pytorch Torchtext using Fields, TabularDataset, ...
  • How to use the torchtext.data.Dataset function in ... - Snyk
    To help you get started, we've selected a few torchtext.data.Dataset examples, based on popular ways it is used in public projects.
