
Multiple Fields on LanguageModeling Dataset

See original GitHub issue

Currently we only accept a single field in the LanguageModelingDataset constructor: https://github.com/pytorch/text/blob/499e327ea53bdf67c648f5747ed26764283b968a/torchtext/datasets/language_modeling.py#L8
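
For reference, a minimal sketch of the current single-field usage (assuming the legacy torchtext API at the time of this issue; the path and field settings are only illustrative):

from torchtext import data
from torchtext.datasets import LanguageModelingDataset

# The constructor takes exactly one Field, so the whole corpus is
# preprocessed with a single (word-level) representation.
TEXT = data.Field(lower=True)
train_data = LanguageModelingDataset('example.txt', TEXT)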

This assumes people will only use word embeddings to parse the dataset. To give one example, according to this paper https://www.aaai.org/ocs/index.php/AAAI16/paper/download/12489/12017, instead of using words to predict words we can use composed characters (via a CNN) to predict words. We could enable that by using two different fields, one for the character-level and one for the word-level representation, but that is hard to do with the current LanguageModelingDataset. For now I am using my own code to replicate LanguageModelingDataset with extra fields and the logic to enable this; below is the snippet:

class LanguageModelingDataset(data.Dataset):
    """Defines a dataset for language modeling."""

    def __init__(self, path, fields, newline_eos=True,
                 encoding='utf-8', **kwargs):
        """Create a LanguageModelingDataset given a path and a field.
        Arguments:
            path: Path to the data file.
            fields: Dictionary containing keyword 
            newline_eos: Whether to add an <eos> token for every newline in the
                data file. Default: True.
            Remaining keyword arguments: Passed to the constructor of
                data.Dataset.
        """
        if isinstance(fields, dict):
            raise ValueError("`fields` must be an instance of dictionary!")
        if "text" not in fields.keys():
            raise AttributeError("Field with key `text` is required to preprocess the data!")
        text_field = fields["text"]
        text = []
        with io.open(path, encoding=encoding) as f:
            for line in f:
                text += text_field.preprocess(line)
                if newline_eos:
                    text.append(u'<eos>')

        tuple_fields = [(k, v) for k, v in fields.items()]
        examples = [data.Example.fromlist([text], tuple_fields)]
        super(LanguageModelingDataset, self).__init__(
            examples, fields, **kwargs)

Here fields contains both the word-level and the character-level representation fields. I am not sure if this is the best design, and not sure if it is worth a PR, but let me know if you have any thoughts!

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments:6 (4 by maintainers)

Top GitHub Comments

1 reaction
akurniawan commented, Oct 11, 2018

@bentrevett sorry, I think I put the wrong code. This should be the right one

import io

from torchtext import data


class LanguageModelingDataset(data.Dataset):
    """Defines a dataset for language modeling."""

    def __init__(self,
                 path,
                 fields,
                 newline_eos=True,
                 encoding='utf-8',
                 **kwargs):
        """Create a LanguageModelingDataset given a path and a field.
        Arguments:
            path: Path to the data file.
            fields: Dictionary mapping field names (keys) to the Fields
                used to preprocess the data.
            newline_eos: Whether to add an <eos> token for every newline in the
                data file. Default: True.
            Remaining keyword arguments: Passed to the constructor of
                data.Dataset.
        """
        if not isinstance(fields, dict):
            raise ValueError("`fields` must be an instance of dictionary!")
        if "text" not in fields.keys():
            raise AttributeError(
                "Field with key `text` is required to preprocess the data!")

        text_field = fields["text"]
        text = []
        with io.open(path, encoding=encoding) as f:
            for line in f:
                text += text_field.preprocess(line)
                if newline_eos:
                    text.append(u'<eos>')

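        # Pair the tuple of all field names with the tuple of all Fields,
        # e.g. [(('text', 'chars'), (TEXT, CHARS))], so Example.fromlist
        # applies every Field to the same single text column.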
        tuple_fields = [tuple(zip(*fields.items()))]
        examples = [data.Example.fromlist([text], tuple_fields)]
        super(LanguageModelingDataset, self).__init__(examples, fields,
                                                      **kwargs)
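
The key change is in tuple_fields: grouping every field name with its Field lets data.Example.fromlist apply all Fields to the same single text column. A minimal sketch of that behavior (assuming the legacy torchtext API; the tokens are only illustrative):

from torchtext import data

TEXT = data.Field()
CHARS = data.NestedField(data.Field(tokenize=list))

tuple_fields = [(("text", "chars"), (TEXT, CHARS))]
ex = data.Example.fromlist([["hello", "world"]], tuple_fields)

# Both Fields preprocess the same token list:
# ex.text  -> ['hello', 'world']
# ex.chars -> [['h', 'e', 'l', 'l', 'o'], ['w', 'o', 'r', 'l', 'd']]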

and this is how you can use it

from torchtext.data import Field, NestedField

TEXT = Field()
special_tokens = [
    TEXT.unk_token, TEXT.pad_token, TEXT.init_token, TEXT.eos_token,
    "<eos>"
]

def tokenize_fn(word):
    if word in special_tokens:
        return [word]
    else:
        return list(word)

char_nesting = Field(tokenize=tokenize_fn)
CHARS = NestedField(char_nesting)

fields = {'text': TEXT, 'chars': CHARS}

train_data = LanguageModelingDataset('example.txt', fields)
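
Each Field then presumably needs its own vocabulary built before batching, e.g. (a sketch assuming the legacy torchtext API):

TEXT.build_vocab(train_data)
CHARS.build_vocab(train_data)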

and we may need to create a new BPTTIterator (or update the existing one), since for now the BPTTIterator still hardcodes the TEXT and TARGET fields
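
For what it is worth, a minimal sketch of what BPTT-style batching over the two aligned representations could look like, using plain tensor slicing rather than the existing BPTTIterator (the tensor names and shapes here are assumptions, not the torchtext API):

import torch

def bptt_slices(word_ids, char_ids, bptt_len):
    """Yield (text, chars, target) windows over aligned word- and char-level tensors.

    word_ids: LongTensor of shape (seq_len,), word indices for the corpus
    char_ids: LongTensor of shape (seq_len, max_word_len), per-word character indices
    """
    for i in range(0, word_ids.size(0) - 1, bptt_len):
        seq_len = min(bptt_len, word_ids.size(0) - 1 - i)
        text = word_ids[i:i + seq_len]            # word-level inputs
        chars = char_ids[i:i + seq_len]           # char-level inputs (e.g. fed to a char-CNN)
        target = word_ids[i + 1:i + 1 + seq_len]  # next-word prediction targets
        yield text, chars, target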

0 reactions
abhinavarora commented, Jan 25, 2022

Closing this issue as we have gotten rid of LanguageModelingDataset and the issue is no longer relevant. #1537

