How can I use sciBERT for Token Classification?

I tried with the code below:

from transformers import AutoTokenizer, AutoModel,AutoModelForTokenClassification
import torch

#I am getting the label list from labels.txt file present in the Pytorch Huggingface model(scibert-scivocab-uncased)
def read_label_list():
    f = open('labels.txt','r')
    label_list = []
    for line in f:
        label_list.append(line)
    return label_list

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

sequence = 'Effectiveness of current drug treatments for hospitalized patients with SARS-CoV-2 infection (COVID-19 patients) in routine clinical practice|Risk factors or modifiers of pharmacological effect such as demographic characteristics, comorbidity or underlying pathology, concomitant medication.'

label_list = read_label_list()
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim = 2)

for token, prediction in zip(tokens,predictions[0].numpy()):
    print((token, label_list[prediction]))

I am getting the following output which is not making sense:

('[CLS]', '##.49\n')
('effectiveness', '##.49\n')
('of', '##.49\n')
('current', '##.49\n')
('drug', '##.49\n')
('treatments', '##.49\n')
('for', '##.49\n')
('hospitalized', '##.49\n')
('patients', '##.49\n')
('with', '##.49\n')
('sar', '##.49\n')
('##s', '##.49\n')
('-', '##.49\n')
('cov', '##.49\n')
('-', '##.49\n')
('2', '##.49\n')
('infection', '##.49\n')
('(', '##.49\n')
('cov', '##.49\n')
('##id', '##.49\n')
('-', '##.49\n')
('19', '##.49\n')
('patients', '##.49\n')
(')', '##.49\n')
('in', '##.49\n')
('routine', '##.49\n')
('clinical', '##.49\n')
('practice', '##.49\n')
('|', '##.49\n')
('risk', '##.49\n')
('factors', '##.49\n')
('or', '##.49\n')
('modi', '##.49\n')
('##fi', '##.49\n')
('##ers', '##.49\n')
('of', '##.49\n')
('pharmacological', '##.49\n')
('effect', '##.49\n')
('such', '##.49\n')
('as', '##.49\n')
('demographic', '##.49\n')
('characteristics', '##.49\n')
(',', '##.49\n')
('comorbidity', '##.49\n')
('or', '##.49\n')
('underlying', '##.49\n')
('pathology', '##.49\n')
(',', '##.49\n')
('concomitant', '##.49\n')
('medication', '##.49\n')
('.', '##1-4\n')
('[SEP]', '##.49\n')

Top GitHub Comments

zcyzhuangzhou commented, Mar 30, 2021

Hello, why is there no labels.txt in the model files I downloaded? I want to fine-tune on my own data, but I don't know the data format SciBERT expects or the full set of labels.
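
On the labels question: as far as I can tell, the pretrained checkpoint is a plain language model that has not been fine-tuned for any tagging task, so the label set would come from your own annotated data rather than from the download. A minimal sketch of loading such a file, assuming one label per line; the tag names (O, B-DRUG, I-DRUG) and the in-memory file are hypothetical and only for illustration:

```python
import io

def read_label_list(handle):
    """Return one label per non-empty line, with surrounding whitespace removed."""
    return [line.strip() for line in handle if line.strip()]

# Hypothetical label file content; a real dataset defines its own tags.
example_file = io.StringIO("O\nB-DRUG\nI-DRUG\n")
label_list = read_label_list(example_file)

# Mappings in both directions between class indices and label names.
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

print(label_list)
print(id2label[1], label2id["I-DRUG"])
```

Stripping each line also avoids the trailing "\n" that shows up in the printed labels in the question.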

stefan-it commented, Jul 24, 2020

@Sachit1137 just use the fine-tuning example for token-classification from Transformers:

https://github.com/huggingface/transformers/tree/master/examples/token-classification

There are two examples given which you just need to adapt for your dataset.

Later, you can just use the Transformers Pipelines feature to make predictions, see this example.

If you need help with the token classification example, just ping me 😃
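
For context on why fine-tuning is needed first: AutoModel loads the encoder without a task head, so the first element of its output is a hidden-state vector per token, and taking argmax over it gives an index into the hidden dimension rather than a class id. A fine-tuned token-classification model instead produces one score per label for each token. The sketch below is a hand-rolled, pure-Python illustration of the post-processing idea only (not the Transformers Pipelines implementation); the labels, the scores, and the helper names best_label and merge_wordpieces are all made up for the example:

```python
def best_label(scores, label_list):
    """Pick the label whose score is highest for one token."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return label_list[best_index]

def merge_wordpieces(tokens, token_scores, label_list,
                     special=("[CLS]", "[SEP]", "[PAD]")):
    """Join '##' continuation pieces into words; a word keeps its first piece's label."""
    words = []
    for token, scores in zip(tokens, token_scores):
        if token in special:
            continue  # special tokens carry no label of interest
        if token.startswith("##") and words:
            text, label = words[-1]
            words[-1] = (text + token[2:], label)
        else:
            words.append((token, best_label(scores, label_list)))
    return words

# Hypothetical labels and made-up scores: one score per label for every token.
label_list = ["O", "B-DRUG", "I-DRUG"]
tokens = ["[CLS]", "modi", "##fi", "##ers", "of", "drug", "[SEP]"]
token_scores = [
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.6, 0.2, 0.2],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
]
print(merge_wordpieces(tokens, token_scores, label_list))
```

With these made-up scores the result is [('modifiers', 'O'), ('of', 'O'), ('drug', 'B-DRUG')]. Note that each score row has exactly len(label_list) entries, which is the shape a token-classification head provides and a bare encoder does not.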
