
Multiple Fields on LanguageModeling Dataset

See original GitHub issue

Currently we only accept a single field in the LanguageModelingDataset constructor: https://github.com/pytorch/text/blob/499e327ea53bdf67c648f5747ed26764283b968a/torchtext/datasets/language_modeling.py#L8
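
For reference, a minimal sketch of the current single-field usage (assuming the legacy torchtext API at the time of this issue; the path and field settings are only illustrative):

from torchtext import data
from torchtext.datasets import LanguageModelingDataset

# The constructor takes exactly one Field, so the whole corpus is
# preprocessed with a single (word-level) representation.
TEXT = data.Field(lower=True)
train_data = LanguageModelingDataset('example.txt', TEXT)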

This assumes people will only use word embeddings to parse the dataset. To give one example, according to this paper https://www.aaai.org/ocs/index.php/AAAI16/paper/download/12489/12017, instead of using words to predict words we can use composed characters (via a CNN) to predict words. We could enable that by using two different fields, one for the character-level and one for the word-level representation, but that is hard to do with the current LanguageModelingDataset. For now I am using my own code to replicate LanguageModelingDataset with extra fields and the logic to enable this; below is the snippet:

class LanguageModelingDataset(data.Dataset):
    """Defines a dataset for language modeling."""

    def __init__(self, path, fields, newline_eos=True,
                 encoding='utf-8', **kwargs):
        """Create a LanguageModelingDataset given a path and a field.
        Arguments:
            path: Path to the data file.
            fields: Dictionary containing keyword 
            newline_eos: Whether to add an <eos> token for every newline in the
                data file. Default: True.
            Remaining keyword arguments: Passed to the constructor of
                data.Dataset.
        """
        if isinstance(fields, dict):
            raise ValueError("`fields` must be an instance of dictionary!")
        if "text" not in fields.keys():
            raise AttributeError("Field with key `text` is required to preprocess the data!")
        text_field = fields["text"]
        text = []
        with io.open(path, encoding=encoding) as f:
            for line in f:
                text += text_field.preprocess(line)
                if newline_eos:
                    text.append(u'<eos>')

        tuple_fields = [(k, v) for k, v in fields.items()]
        examples = [data.Example.fromlist([text], tuple_fields)]
        super(LanguageModelingDataset, self).__init__(
            examples, fields, **kwargs)

Here fields contains both the word-level and the character-level representation fields. I am not sure if this is the best design, and not sure if it is worth a PR, but let me know if you have any thoughts!

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments:6 (4 by maintainers)

Top GitHub Comments

1 reaction
akurniawan commented, Oct 11, 2018

@bentrevett sorry, I think I put the wrong code. This should be the right one

import io

from torchtext import data


class LanguageModelingDataset(data.Dataset):
    """Defines a dataset for language modeling."""

    def __init__(self,
                 path,
                 fields,
                 newline_eos=True,
                 encoding='utf-8',
                 **kwargs):
        """Create a LanguageModelingDataset given a path and a field.
        Arguments:
            path: Path to the data file.
            fields: Dictionary mapping field names (keys) to the Fields
                used to preprocess the data.
            newline_eos: Whether to add an <eos> token for every newline in the
                data file. Default: True.
            Remaining keyword arguments: Passed to the constructor of
                data.Dataset.
        """
        if not isinstance(fields, dict):
            raise ValueError("`fields` must be an instance of dictionary!")
        if "text" not in fields.keys():
            raise AttributeError(
                "Field with key `text` is required to preprocess the data!")

        text_field = fields["text"]
        text = []
        with io.open(path, encoding=encoding) as f:
            for line in f:
                text += text_field.preprocess(line)
                if newline_eos:
                    text.append(u'<eos>')

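        # Pair the tuple of all field names with the tuple of all Fields,
        # e.g. [(('text', 'chars'), (TEXT, CHARS))], so Example.fromlist
        # applies every Field to the same single text column.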
        tuple_fields = [tuple(zip(*fields.items()))]
        examples = [data.Example.fromlist([text], tuple_fields)]
        super(LanguageModelingDataset, self).__init__(examples, fields,
                                                      **kwargs)
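
The key change is in tuple_fields: grouping every field name with its Field lets data.Example.fromlist apply all Fields to the same single text column. A minimal sketch of that behavior (assuming the legacy torchtext API; the tokens are only illustrative):

from torchtext import data

TEXT = data.Field()
CHARS = data.NestedField(data.Field(tokenize=list))

tuple_fields = [(("text", "chars"), (TEXT, CHARS))]
ex = data.Example.fromlist([["hello", "world"]], tuple_fields)

# Both Fields preprocess the same token list:
# ex.text  -> ['hello', 'world']
# ex.chars -> [['h', 'e', 'l', 'l', 'o'], ['w', 'o', 'r', 'l', 'd']]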

and this is how you can use it

from torchtext.data import Field, NestedField

TEXT = Field()
special_tokens = [
    TEXT.unk_token, TEXT.pad_token, TEXT.init_token, TEXT.eos_token,
    "<eos>"
]

def tokenize_fn(word):
    if word in special_tokens:
        return [word]
    else:
        return list(word)

char_nesting = Field(tokenize=tokenize_fn)
CHARS = NestedField(char_nesting)

fields = {'text': TEXT, 'chars': CHARS}

train_data = LanguageModelingDataset('example.txt', fields)
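
Each Field then presumably needs its own vocabulary built before batching, e.g. (a sketch assuming the legacy torchtext API):

TEXT.build_vocab(train_data)
CHARS.build_vocab(train_data)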

and we may need to create a new BPTTIterator (or update the existing one), since for now the BPTTIterator still hardcodes the TEXT and TARGET fields
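
For what it is worth, a minimal sketch of what BPTT-style batching over the two aligned representations could look like, using plain tensor slicing rather than the existing BPTTIterator (the tensor names and shapes here are assumptions, not the torchtext API):

import torch

def bptt_slices(word_ids, char_ids, bptt_len):
    """Yield (text, chars, target) windows over aligned word- and char-level tensors.

    word_ids: LongTensor of shape (seq_len,), word indices for the corpus
    char_ids: LongTensor of shape (seq_len, max_word_len), per-word character indices
    """
    for i in range(0, word_ids.size(0) - 1, bptt_len):
        seq_len = min(bptt_len, word_ids.size(0) - 1 - i)
        text = word_ids[i:i + seq_len]            # word-level inputs
        chars = char_ids[i:i + seq_len]           # char-level inputs (e.g. fed to a char-CNN)
        target = word_ids[i + 1:i + 1 + seq_len]  # next-word prediction targets
        yield text, chars, target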

0 reactions
abhinavarora commented, Jan 25, 2022

Closing this issue as we have gotten rid of LanguageModelingDataset and the issue is no longer relevant. #1537

