Multiple Fields on LanguageModeling Dataset
Currently we only accept a single field in the `LanguageModelingDataset` constructor:
https://github.com/pytorch/text/blob/499e327ea53bdf67c648f5747ed26764283b968a/torchtext/datasets/language_modeling.py#L8
That assumes people will only use word embeddings to parse the dataset. To give one example, according to this paper (https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12489/12017), instead of using words to predict words, we can use character compositions (built with a CNN) to predict words. We could enable that by using two different fields, one for the character-level and one for the word-level representation, but that is hard to do with the current `LanguageModelingDataset`. For now I am using my own code to replicate `LanguageModelingDataset` with extra fields and the logic to enable that; below is the snippet:
    import io

    from torchtext import data  # legacy API; lives in torchtext.legacy.data from 0.9


    class LanguageModelingDataset(data.Dataset):
        """Defines a dataset for language modeling."""

        def __init__(self, path, fields, newline_eos=True,
                     encoding='utf-8', **kwargs):
            """Create a LanguageModelingDataset given a path and fields.

            Arguments:
                path: Path to the data file.
                fields: Dictionary mapping field names to fields; it must
                    contain the key `text`.
                newline_eos: Whether to add an <eos> token for every newline in the
                    data file. Default: True.
                Remaining keyword arguments: Passed to the constructor of
                    data.Dataset.
            """
            # The original check was inverted and rejected every dict.
            if not isinstance(fields, dict):
                raise ValueError("`fields` must be an instance of dictionary!")
            if "text" not in fields:
                raise AttributeError("Field with key `text` is required to preprocess the data!")
            text_field = fields["text"]
            text = []
            with io.open(path, encoding=encoding) as f:
                for line in f:
                    text += text_field.preprocess(line)
                    if newline_eos:
                        text.append(u'<eos>')
            tuple_fields = [(k, v) for k, v in fields.items()]
            # Pass the token stream once per field so that every field gets a value.
            examples = [data.Example.fromlist([text] * len(tuple_fields), tuple_fields)]
            super(LanguageModelingDataset, self).__init__(
                examples, fields, **kwargs)
Here `fields` contains both the word-level and the character-level representation fields.
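As a rough, library-free illustration of what the two fields would see for the same token stream, here is a minimal sketch. The helper name `build_word_and_char_views` and the whitespace tokenization are assumptions made for the example; neither is part of torchtext.

```python
def build_word_and_char_views(lines, newline_eos=True):
    """Return (words, chars) views of the same token stream.

    Hypothetical helper, not a torchtext API: `words` is a flat list of
    tokens and `chars` holds one list of characters per token.
    """
    words = []
    for line in lines:
        words += line.split()      # whitespace tokenization (an assumption)
        if newline_eos:
            words.append('<eos>')  # mirrors the newline_eos option above
    # Keep the special <eos> token whole instead of splitting it into characters.
    chars = [[tok] if tok == '<eos>' else list(tok) for tok in words]
    return words, chars


words, chars = build_word_and_char_views(["the cat", "sat"])
```

Both views have one entry per token, so a model could read `chars[i]` as input while still predicting at the word level.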
I'm not sure if this is the best design, or whether it is worth a PR, but let me know if you have any thoughts!
Issue Analytics
- Created 5 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
@bentrevett sorry, I think I put the wrong code. This should be the right one
and this is how you can use it
and we may need to create a new `BPTTIterator` (or update the existing one), since for now the `BPTTIterator` still hardcodes the fields `TEXT` and `TARGET`.
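To sketch what a field-agnostic BPTT iterator could look like, the plain-Python generator below slices several aligned streams into chunks and shifts one chosen stream by a single step to form the target. The function `bptt_chunks` and its parameters are hypothetical; this is not torchtext's `BPTTIterator`, and it leaves out batching and numericalization.

```python
def bptt_chunks(streams, target_key, bptt_len):
    """Yield (inputs, target) chunks from aligned token streams.

    Hypothetical sketch: `streams` maps a field name to a sequence, all of
    the same length. Every field is sliced as an input, and `target_key`
    selects the stream whose next-step values are to be predicted.
    """
    lengths = {len(seq) for seq in streams.values()}
    if len(lengths) != 1:
        raise ValueError("all streams must have the same length")
    total = lengths.pop()
    for start in range(0, total - 1, bptt_len):
        # The last chunk may be shorter because the target is shifted by one.
        seq_len = min(bptt_len, total - 1 - start)
        inputs = {name: seq[start:start + seq_len]
                  for name, seq in streams.items()}
        target = streams[target_key][start + 1:start + 1 + seq_len]
        yield inputs, target


words = ['the', 'cat', 'sat', 'down', 'now']
streams = {'word': words, 'char': [list(w) for w in words]}
chunks = list(bptt_chunks(streams, target_key='word', bptt_len=2))
```

Because the field names come from the dictionary keys, nothing in the loop depends on a fixed pair of field names.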
Closing this issue, as we have gotten rid of `LanguageModelingDataset` and the issue is no longer relevant. #1537