Sequence tagging custom dataset
❓ Questions and Help
Description
Hi, I have a custom dataset that has the following format:
Word1 O O N s: 1 Sentence: 1 Doc: 1
Word2 O O N s: 1 Sentence: 1 Doc: 1
Word3 O O N s: 1 Sentence: 1 Doc: 1
Word4 O O N s: 1 Sentence: 1 Doc: 1
I want to use column 0 as my sentences and the next three columns as my labels (label1, label2, label3). I can afford to ignore the other fields. (In the future I might use the last column: for example, I would like to test zeroing the gradient only when the document changes rather than at every sentence boundary, if that makes sense.)
Could you help me read this dataset, for example by pointing me to a similar example in the documentation? Thank you for your support!
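For reference, the format described in the question can be read without any library support. The following is a minimal sketch, not the library's own loader: it assumes the fields are whitespace-delimited and that the trailing fields are "key: value" pairs exactly as in the sample rows. The helper names parse_line and read_sentences are hypothetical.

```python
from itertools import groupby

def parse_line(line):
    """Split one whitespace-delimited row into word, three labels, and ids."""
    parts = line.split()
    word, lab1, lab2, lab3 = parts[:4]
    # Assumption: the trailing fields are "key: value" pairs, as in the sample.
    sent_id = parts[parts.index("Sentence:", 4) + 1]
    doc_id = parts[parts.index("Doc:", 4) + 1]
    return word, (lab1, lab2, lab3), doc_id, sent_id

def read_sentences(lines):
    """Group rows into sentences keyed by (document id, sentence id)."""
    rows = [parse_line(line) for line in lines if line.strip()]
    sentences = []
    # groupby only merges adjacent rows, so rows must be in document order.
    for (doc_id, sent_id), group in groupby(rows, key=lambda r: (r[2], r[3])):
        group = list(group)
        sentences.append({
            "doc": doc_id,
            "sentence": sent_id,
            "words": [r[0] for r in group],
            "lab1": [r[1][0] for r in group],
            "lab2": [r[1][1] for r in group],
            "lab3": [r[1][2] for r in group],
        })
    return sentences

sample = [
    "Word1 O O N s: 1 Sentence: 1 Doc: 1",
    "Word2 O O N s: 1 Sentence: 1 Doc: 1",
    "Word3 O O N s: 1 Sentence: 1 Doc: 1",
    "Word4 O O N s: 1 Sentence: 1 Doc: 1",
]
sentences = read_sentences(sample)
print(sentences[0]["words"], sentences[0]["lab3"])
```

Keeping the document id on each sentence leaves room for the idea of resetting state only when the document changes, since consecutive sentences can be compared on their "doc" value.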
Issue Analytics
- State:
- Created 4 years ago
- Comments:10 (6 by maintainers)
Top GitHub Comments
Yeah. Although I am pretty sure that just doing
logits[1:-1]
and not setting init and eos tokens for labels will work. I’m honestly confused right now why the labels in the example also have init and eos tokens, since you should never predict those.

What I have done, following the answer and the test (below I have changed the specific names to lab1, lab2, lab3 for more generality):
and
and finally:
Do we always use the following?
init_token="<bos>", eos_token="<eos>"
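
As a rough illustration of the logits[1:-1] suggestion in the first comment above, the sketch below uses plain Python lists in place of model tensors. The score values and the helper names add_boundaries and trim_boundaries are hypothetical. With a real tensor, the slice would have to be taken along the sequence dimension, which depends on how the batch is laid out.

```python
BOS, EOS = "<bos>", "<eos>"

def add_boundaries(words):
    """Wrap the input words with init and eos tokens (inputs only, not labels)."""
    return [BOS] + list(words) + [EOS]

def trim_boundaries(per_position_scores):
    """Drop the first and last positions, the same idea as logits[1:-1]."""
    return per_position_scores[1:-1]

words = ["Word1", "Word2", "Word3", "Word4"]
labels = ["O", "O", "O", "O"]  # labels carry no init/eos entries

model_input = add_boundaries(words)
# Stand-in for model output: one score row per input position (made-up values).
logits = [[0.9, 0.1] for _ in model_input]

trimmed = trim_boundaries(logits)
# After trimming, scores and labels line up one-to-one.
assert len(trimmed) == len(labels)
```

Under this arrangement the model never has to predict a label for the boundary tokens, which is the concern raised in that comment.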