How can I use sciBERT for Token Classification?
I tried with the code below:
```python
from transformers import AutoTokenizer, AutoModel, AutoModelForTokenClassification
import torch

# I am getting the label list from the labels.txt file that ships with the
# PyTorch Hugging Face model (scibert-scivocab-uncased).
def read_label_list():
    label_list = []
    with open('labels.txt', 'r') as f:
        for line in f:
            label_list.append(line)
    return label_list

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

sequence = ('Effectiveness of current drug treatments for hospitalized patients '
            'with SARS-CoV-2 infection (COVID-19 patients) in routine clinical '
            'practice|Risk factors or modifiers of pharmacological effect such as '
            'demographic characteristics, comorbidity or underlying pathology, '
            'concomitant medication.')
label_list = read_label_list()

tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, label_list[prediction]))
```
I am getting the following output, which does not make sense:

```
('[CLS]', '##.49\n') ('effectiveness', '##.49\n') ('of', '##.49\n') ('current', '##.49\n')
('drug', '##.49\n') ('treatments', '##.49\n') ('for', '##.49\n') ('hospitalized', '##.49\n')
('patients', '##.49\n') ('with', '##.49\n') ('sar', '##.49\n') ('##s', '##.49\n')
('-', '##.49\n') ('cov', '##.49\n') ('-', '##.49\n') ('2', '##.49\n')
('infection', '##.49\n') ('(', '##.49\n') ('cov', '##.49\n') ('##id', '##.49\n')
('-', '##.49\n') ('19', '##.49\n') ('patients', '##.49\n') (')', '##.49\n')
('in', '##.49\n') ('routine', '##.49\n') ('clinical', '##.49\n') ('practice', '##.49\n')
('|', '##.49\n') ('risk', '##.49\n') ('factors', '##.49\n') ('or', '##.49\n')
('modi', '##.49\n') ('##fi', '##.49\n') ('##ers', '##.49\n') ('of', '##.49\n')
('pharmacological', '##.49\n') ('effect', '##.49\n') ('such', '##.49\n') ('as', '##.49\n')
('demographic', '##.49\n') ('characteristics', '##.49\n') (',', '##.49\n') ('comorbidity', '##.49\n')
('or', '##.49\n') ('underlying', '##.49\n') ('pathology', '##.49\n') (',', '##.49\n')
('concomitant', '##.49\n') ('medication', '##.49\n') ('.', '##1-4\n') ('[SEP]', '##.49\n')
```
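A note on what is likely going wrong: `AutoModel` loads the bare encoder, so `outputs` here is the 768-dimensional hidden state for every token, not per-label logits. `torch.argmax(outputs, dim=2)` therefore returns indices into the hidden dimension (0 to 767), and using those to index `label_list` yields arbitrary entries like the ones above. A token-classification head has to be loaded instead; below is a minimal sketch, where `num_labels=9` is a made-up placeholder for the size of your own tag set, and the freshly initialized head still has to be fine-tuned before its predictions mean anything:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
# num_labels=9 is an illustrative placeholder; use the size of your own tag set.
# The classification head is freshly initialized, so its predictions are random
# until the model has been fine-tuned on labeled data.
model = AutoModelForTokenClassification.from_pretrained(
    'allenai/scibert_scivocab_uncased', num_labels=9)

inputs = tokenizer.encode('Effectiveness of current drug treatments',
                          return_tensors='pt')
with torch.no_grad():
    logits = model(inputs)[0]              # shape: (1, seq_len, num_labels)
predictions = torch.argmax(logits, dim=2)  # label IDs in [0, num_labels)
```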
Top GitHub Comments
Hello, why is there no labels.txt in the model files I downloaded? I want to fine-tune on my own data, but I don't know the expected data format for SciBERT or the full label set.
@Sachit1137 just use the fine-tuning example for token-classification from Transformers:
https://github.com/huggingface/transformers/tree/master/examples/token-classification
There are two examples given there; you just need to adapt them to your dataset.
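If it helps, the data these example scripts expect (to the best of my knowledge at the time of writing) is CoNLL-style plain text: one token and its label per line, separated by a space, with a blank line between sentences, plus a labels.txt listing one tag per line. The tags in this sample are invented purely for illustration:

```
Effectiveness O
of O
current O
drug B-TREATMENT
treatments I-TREATMENT

Risk O
factors O
```

For this sample, labels.txt would then contain O, B-TREATMENT and I-TREATMENT, one per line.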
Later, you can just use the Transformers pipelines feature to make predictions; see this example.
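As a rough sketch of that prediction step, assuming you have saved your fine-tuned checkpoint to a local directory (the path `my-scibert-ner` here is hypothetical):

```python
from transformers import pipeline

# 'my-scibert-ner' is a hypothetical output directory from the fine-tuning
# run; it must contain a model fine-tuned for token classification.
ner = pipeline('ner', model='my-scibert-ner', tokenizer='my-scibert-ner')

for entity in ner('Effectiveness of current drug treatments for COVID-19 patients'):
    print(entity['word'], entity['entity'], round(entity['score'], 3))
```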
If you need help with the token classification example, just ping me 😃