Sequence tagging custom dataset

❓ Questions and Help

Description

Hi, I have a custom dataset that has the following format:

Word1	O	O	N	s: 1	Sentence: 1	Doc: 1
Word2	O	O	N	s: 1	Sentence: 1	Doc: 1
Word3	O	O	N	s: 1	Sentence: 1	Doc: 1
Word4	O	O	N	s: 1	Sentence: 1	Doc: 1

I want to use column 0 as my sentences and the next three consecutive columns as my labels (label1, label2, label3). I can afford to ignore the other fields. (In the future I might consider using the last column. For example, I have an idea to zero the gradient only when I switch documents, not when I switch sentences, and I would like to test it, if that makes sense.)

Could you help me figure out how to read this dataset, for example by pointing me to a similar example in the documentation? Thank you for your support!
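As a library-independent illustration of the format above, here is a minimal plain-Python sketch that keeps column 0 and the three label columns and groups rows into sentences. It assumes the rows are tab-separated with exactly the seven columns shown, that the sentence id sits in column 5 and the document id in column 6, and the function name `read_tagged_sentences` is made up for this example.

```python
def read_tagged_sentences(lines):
    """Group tab-separated token rows into sentences.

    Assumed row layout (from the example in the question):
    word, label1, label2, label3, "s: N", "Sentence: N", "Doc: N".
    Rows that share the same (Doc, Sentence) pair form one sentence.
    """
    sentences = {}  # dicts keep insertion order, so sentence order is preserved
    for raw in lines:
        row = raw.rstrip("\n")
        if not row.strip():
            continue  # skip blank lines
        cols = row.split("\t")
        if len(cols) < 7:
            raise ValueError("expected 7 tab-separated columns, got %d" % len(cols))
        word, lab1, lab2, lab3 = cols[:4]
        key = (cols[6], cols[5])  # assumption: (document id, sentence id)
        sent = sentences.setdefault(
            key, {"word": [], "lab1": [], "lab2": [], "lab3": []}
        )
        sent["word"].append(word)
        sent["lab1"].append(lab1)
        sent["lab2"].append(lab2)
        sent["lab3"].append(lab3)
    return list(sentences.values())
```

Keeping the document id in the grouping key also makes it possible to detect document boundaries later, which is what the idea about resetting only between documents would need.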

Issue Analytics

Created 4 years ago
Comments: 10 (6 by maintainers)

Top GitHub Comments

mttk commented, Nov 29, 2019

Yeah. Although I am pretty sure that just doing logits[1:-1] and not setting init and eos tokens for labels will work. I’m honestly confused right now why the labels in the example also have init and eos tokens, since you should never predict those.
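To make the logits[1:-1] suggestion concrete, here is a small sketch using plain Python lists instead of tensors. The scores and tag ids are invented for illustration, and it assumes the time dimension comes first, so slicing drops the positions that correspond to the init and eos tokens of the input.

```python
def strip_boundary_steps(logits):
    """Drop the first and last time steps (the <bos> and <eos> positions)."""
    return logits[1:-1]


def argmax(scores):
    """Return the index of the highest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-time-step scores for the input "<bos> Word1 Word2 <eos>"
# over a tag set with 3 tags. Real logits would come from the model.
logits = [
    [0.1, 0.2, 0.7],  # <bos>, never scored against a label
    [2.0, 0.1, 0.3],  # Word1
    [0.2, 1.5, 0.1],  # Word2
    [0.9, 0.0, 0.1],  # <eos>, never scored against a label
]

# Gold tag ids for Word1 and Word2 only: the label field has no init/eos token.
labels = [0, 1]

inner = strip_boundary_steps(logits)
predictions = [argmax(step) for step in inner]

# After slicing, predictions and labels line up one-to-one.
assert len(predictions) == len(labels)
```

With a batched tensor the same slice would be applied along the sequence dimension, so the label fields can be defined without init and eos tokens.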

antgr commented, Nov 28, 2019

Here is what I have done, following the answer and the test (below I have changed the specific names to lab1, lab2, lab3 for more generality):

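from torchtext import data
from torchtext.datasets import SequenceTaggingDataset


class MySeqLabelDS(SequenceTaggingDataset):
    # Thin wrapper that only sets default file names for the three splits.
    @classmethod
    def splits(cls, fields, root=".data", train="train.txt",
               validation="dev.txt",
               test="test.txt", **kwargs):

        return super(MySeqLabelDS, cls).splits(
            fields=fields, root=root, train=train, validation=validation,
            test=test, **kwargs)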

and

WORD = data.Field(init_token="<bos>", eos_token="<eos>")
LAB1 = data.Field(init_token="<bos>", eos_token="<eos>")
LAB2 = data.Field(init_token="<bos>", eos_token="<eos>")
LAB3 = data.Field(init_token="<bos>", eos_token="<eos>")

and finally:

train, val, test = MySeqLabelDS.splits(
    fields=(('word', WORD), ('lab1', LAB1), ('lab2', LAB2), ('lab3', LAB3)),
    path="./dataset",
    train="train.txt",
    validation="dev.txt",
    test="test.txt")

WORD.build_vocab(train.word, min_freq=3)
LAB1.build_vocab(train.lab1)
LAB2.build_vocab(train.lab2)
LAB3.build_vocab(train.lab3)

train_iter, val_iter = data.BucketIterator.splits(
    (train, val), batch_size=3, device="cuda:0")

Do we always need to use init_token="<bos>", eos_token="<eos>"?
