Sequence tagging custom dataset

❓ Questions and Help

Description

Hi, I have a custom dataset that has the following format:

Word1	O	O	N	s: 1	Sentence: 1	Doc: 1
Word2	O	O	N	s: 1	Sentence: 1	Doc: 1
Word3	O	O	N	s: 1	Sentence: 1	Doc: 1
Word4	O	O	N	s: 1	Sentence: 1	Doc: 1

I want to use column 0 as the input tokens and the next three consecutive columns as my labels (label1, label2, label3). The remaining fields can be ignored. (In the future I may want to use the last column: for example, I have an idea to zero the gradient only when the document changes, not at every sentence boundary, and I would like to test it, if that makes sense.)

Could you help me read this dataset, for example by pointing me to a similar example in the documentation? Thank you for your support!
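
For reference, the column layout described above can also be read without any library. The following is a minimal sketch, not the torchtext approach discussed in the comments. It assumes tab-separated rows, and it assumes that the `Sentence:` and `Doc:` columns mark sentence boundaries, since the sample does not show how sentences are delimited. The function name `read_tagged_sentences` is hypothetical.

```python
from itertools import groupby


def read_tagged_sentences(lines, num_labels=3):
    """Group rows into sentences of words plus one list per label column.

    Assumes tab-separated rows whose last two columns identify the
    sentence and the document, as in the sample above.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    sentences = []
    # Consecutive rows sharing the same (document, sentence) pair form one sentence.
    for (doc, _sent), group in groupby(rows, key=lambda r: (r[-1], r[-2])):
        group = list(group)
        words = [r[0] for r in group]
        # Columns 1..num_labels hold the labels; build one list per column.
        labels = [[r[i] for r in group] for i in range(1, num_labels + 1)]
        sentences.append({"doc": doc, "words": words, "labels": labels})
    return sentences


sample = [
    "Word1\tO\tO\tN\ts: 1\tSentence: 1\tDoc: 1",
    "Word2\tO\tO\tN\ts: 1\tSentence: 1\tDoc: 1",
    "Word3\tO\tO\tN\ts: 1\tSentence: 2\tDoc: 1",
]
parsed = read_tagged_sentences(sample)
```

Keeping the document identifier with each sentence would make it possible to detect document changes later, which is what the idea about the last column needs.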

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
mttk commented, Nov 29, 2019

Yeah. Although I am pretty sure that just doing logits[1:-1] and not setting init and eos tokens for labels will work. I’m honestly confused right now why the labels in the example also have init and eos tokens, since you should never predict those.
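
The slicing suggested here can be sketched in plain Python. This is only an illustration under assumptions: `logits` stands for one sentence's per-token scores (one score vector per time step, including the `<bos>` and `<eos>` positions), and the helper name `strip_boundary_steps` is hypothetical. With real tensors, the same `[1:-1]` slice would apply along the time dimension.

```python
def strip_boundary_steps(logits):
    """Drop the first and last time steps (the <bos> and <eos> positions)."""
    return logits[1:-1]


# The words carry <bos>/<eos> but the labels do not,
# so there are two more score rows than labels.
words = ["<bos>", "Word1", "Word2", "Word3", "<eos>"]
labels = ["O", "O", "N"]
logits = [[0.1, 0.9] for _ in words]  # one dummy score vector per time step

trimmed = strip_boundary_steps(logits)
```

After the slice, there is one score vector per label. In a padded batch, a shorter sentence has its `<eos>` before the final time step, so that position lines up with label padding instead; this is harmless only if the loss ignores padded label positions.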

1 reaction
antgr commented, Nov 28, 2019

Here is what I have done, following the answer and the test (I have renamed the specific labels to lab1, lab2, lab3 for generality):

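# Imports assume the torchtext API as it existed when this issue was written (late 2019).
from torchtext import data
from torchtext.datasets import SequenceTaggingDataset


class MySeqLabelDS(SequenceTaggingDataset):
    # Thin wrapper that only sets default file names for the three splits.
    @classmethod
    def splits(cls, fields, root=".data", train="train.txt",
               validation="dev.txt",
               test="test.txt", **kwargs):

        return super(MySeqLabelDS, cls).splits(
            fields=fields, root=root, train=train, validation=validation,
            test=test, **kwargs)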

and

WORD = data.Field(init_token="<bos>", eos_token="<eos>")
LAB1 = data.Field(init_token="<bos>", eos_token="<eos>")
LAB2 = data.Field(init_token="<bos>", eos_token="<eos>")
LAB3 = data.Field(init_token="<bos>", eos_token="<eos>")

and finally:

# The (name, field) pairs are matched to the file's columns in order.
train, val, test = MySeqLabelDS.splits(
    fields=(('word', WORD), ('lab1', LAB1), ('lab2', LAB2), ('lab3', LAB3)),
    path="./dataset",
    train="train.txt",
    validation="dev.txt",
    test="test.txt")

# Build each vocabulary from the training split only.
WORD.build_vocab(train.word, min_freq=3)
LAB1.build_vocab(train.lab1)
LAB2.build_vocab(train.lab2)
LAB3.build_vocab(train.lab3)

# Batch the training and validation splits (requires a CUDA device as written).
train_iter, val_iter = data.BucketIterator.splits(
    (train, val), batch_size=3, device="cuda:0")

Do we always have to use init_token="<bos>" and eos_token="<eos>"?
