
Customized torchtext.data.Dataset takes a long time to generate dataset

See original GitHub issue

❓ Questions and Help

Description

I wrote a customized data.Dataset for multilabel classification. When I processed the data, I found that it is very slow to generate the train and test sets using the customized dataset (it takes about 1.5s per example). I am wondering whether this is normal or whether something is wrong with my customized dataset.

The customized data.Dataset for multilabel classification is as follows:

from torchtext import data  # legacy torchtext API (torchtext <= 0.8)
from tqdm import tqdm


class TextMultiLabelDataset(data.Dataset):
    def __init__(self, text, text_field, label_field, lbls=None, n_labels=None, **kwargs):
        # torchtext Field objects
        fields = [('text', text_field), ('label', label_field)]

        is_test = lbls is None
        if not is_test:
            # length of one binary label vector, not the number of examples
            n_labels = len(lbls[0])

        examples = []
        for i, txt in enumerate(tqdm(text)):
            # real label vector for training data; an all-zero placeholder of
            # the same length for test data (pass n_labels explicitly then)
            l = lbls[i] if not is_test else [0.0] * n_labels
            examples.append(data.Example.fromlist([txt, l], fields))

        super(TextMultiLabelDataset, self).__init__(examples, fields, **kwargs)

where text is a list of lists of strings (the documents), and lbls is a list of lists of binary labels. (Total number of labels ~ 20000)

examples of text:

[["There are few factors more important to the mechanisms of evolution than stress. The stress response has formed as a result of natural selection..."], ["A 46-year-old female patient presenting with unspecific lower back pain, diffuse abdominal pain, and slightly elevated body temperature"], ...]

examples of lbls:

[[1 1 1 1 0 0 0 1 0 ...], [1 0 1 0 1 1 1 1 ...], ...]
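
For context, here is a minimal usage sketch of the class above. The toy data, the Field settings, and the n_labels keyword (added in the fix above) are illustrative assumptions, not code from the original report:

from torchtext import data

# toy stand-ins for the report's text / lbls structures
train_texts = [["the first document"], ["the second document"]]
train_lbls = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

# illustrative Field settings; the original report does not show them
TEXT = data.Field(sequential=True, lower=True)
LABEL = data.Field(sequential=False, use_vocab=False)

train_ds = TextMultiLabelDataset(train_texts, TEXT, LABEL, lbls=train_lbls)
# test data carries no labels, so the label-vector length is passed explicitly
test_ds = TextMultiLabelDataset([["an unlabeled document"]], TEXT, LABEL,
                                n_labels=len(train_lbls[0]))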

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 18 (8 by maintainers)

Top GitHub Comments

1 reaction
zhangguanheng66 commented, Jul 9, 2020

That one might not be the best resource (it’s based on the legacy code). If you are using the new dataset abstraction, you can take a look at the text classification example here; see the train.py file, which shows how to use DataLoader.
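
As a rough illustration of that pattern, here is a minimal sketch assuming a map-style dataset whose items are (label-vector, token-id-tensor) pairs; the toy data, collate function, and padding value are assumptions, not the linked example’s exact code:

import torch
from torch.utils.data import DataLoader

# toy stand-in for a processed dataset: (label-vector, token-id sequence) pairs
train_dataset = [
    (torch.tensor([1.0, 0.0, 1.0]), torch.tensor([4, 8, 15, 16])),
    (torch.tensor([0.0, 1.0, 0.0]), torch.tensor([23, 42])),
]

def collate_batch(batch):
    # stack the fixed-size label vectors; pad the variable-length token-id
    # sequences to the longest sequence in the batch
    labels = torch.stack([label for label, _ in batch])
    texts = torch.nn.utils.rnn.pad_sequence(
        [text for _, text in batch], batch_first=True, padding_value=0)
    return labels, texts

train_loader = DataLoader(train_dataset, batch_size=2, collate_fn=collate_batch)
labels, texts = next(iter(train_loader))  # texts: shape (2, 4), zero-padded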

It helps, thank you! One last question: in the previous code, the field has a build_vocab function which allows loading pre-trained word2vec using the vocab created from the training set, i.e. TEXT.build_vocab(train, vectors=vectors). Is there any function that does a similar thing, or is nn.Embedding.from_pretrained the right way? Thank you!

These are our new pre-trained word vectors (FastText and GloVe) (link). cc @Nayef211
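
For reference, a minimal sketch of loading pre-trained vectors. The linked vectors lived in an experimental module at the time, so this uses the long-standing torchtext.vocab loader as a stand-in, and the name/dim choice is arbitrary:

from torchtext.vocab import GloVe

# downloads and caches the vectors (under ~/.vector_cache) on first use
vectors = GloVe(name='6B', dim=100)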

When we have the pretrained vectors, I am wondering how to align them with the vocabulary built from the training set and load them into the model. Previously (TEXT is the desired field in the previous abstraction), aligning them with the vocabulary was:

TEXT.build_vocab(train, vectors=pretrained_embeddings)

and loading them into the model was:

model.embedding.weight.data.copy_(TEXT.vocab.vectors)

Now that the field has been removed in the new abstraction, I am wondering how to do these things without Field? Thank you!

In that case, you don’t need to load the vectors into model.embedding. Instead, just convert a list of tokens into a tensor by calling Vectors’ __getitem__ func here and send the tensor to the model.
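
Concretely, that suggestion looks roughly like this (a sketch assuming a torchtext.vocab.Vectors instance such as the GloVe object above; the token list is made up):

import torch
from torchtext.vocab import GloVe

vectors = GloVe(name='6B', dim=100)  # cached after the first download
tokens = ['a', 'sample', 'sentence']
# Vectors.__getitem__ returns one embedding per token (zeros for OOV words);
# stacking yields a (seq_len, 100) tensor to send straight to the model
embedded = torch.stack([vectors[t] for t in tokens])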

0 reactions
xdwang0726 commented, Jul 16, 2020


Hi, I found that in the torchtext repo only the text classification tasks use the new dataset abstraction; for the other tasks, fields still exist in the dataset settings (referenced here). I am wondering, if I want to use torchtext to create a dataset for a summarization task with BERT, which resource is better to refer to, the translation dataset? Thank you!

I just merged the BERT pipeline under the example folder. #767

It really helps! Thank you!

Read more comments on GitHub >

Top Results From Across the Web

  • torchtext.data - Read the Docs
    Create a dataset from a list of Examples and Fields. ... Default: torch.long. preprocessing – The Pipeline that will be applied to examples...
  • Creating a Custom torchtext Dataset from a Text File
    In a non-demo scenario, preparing data for NLP can take many days or weeks. Three country's flags from an Internet search for “red...
  • Use torchtext to Load NLP Datasets — Part I | by Ceshine Lee
    There is a significant problem in the approach presented above. It's really slow. The entire dataset loading process takes around seven minutes...
  • Pytorch Torchtext Tutorial 1: Custom Datasets and loading ...
    In this video I show you how to load different file formats (json, csv, tsv) in Pytorch Torchtext using Fields, TabularDataset, ...
  • How to use the torchtext.data.Dataset function in ... - Snyk
    To help you get started, we've selected a few torchtext.data.Dataset examples, based on popular ways it is used in public projects.
