
Proposal: Offset based Token Classification utilities

See original GitHub issue

🚀 Feature request

Hi. We work a lot with span annotations on text that isn’t tokenized, and we want a “canonical” way to work with that. I have some ideas and rough implementations, so I’m looking for feedback on whether this belongs in the library, and whether the proposed implementation is reasonable.

I also think there is a good chance that everything I want already exists, and the only thing needed is slightly clearer documentation. I hope that’s the case, and I’m happy to write the documentation if someone can point me in the right direction.

The Desired Capabilities

What I’d like is a canonical way to:

  • Tokenize the examples in the dataset
  • Align my annotations with the output tokens (see notes below)
  • Have the tokens and labels correctly padded to the max length of an example in the batch or max_sequence_length
  • Have a convenient function that returns predicted offsets

Some Nice To Haves

  • It would be nice if such a utility handled tagging schemes like IOB or BIOES internally, and optionally exposed them in the output or “folded” them back to the core entities.
  • It would be nice if there was a recommended/default strategy for handling examples that are longer than the max_sequence_length.
  • It would be amazing if we could pass labels to the tokenizer and have the alignment happen in Rust (in parallel). But I don’t know Rust and I have a sense this is complicated, so I won’t take that on myself; I’m assuming the alignment happens in Python.
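On the folding point, here is a minimal pure-Python sketch of what “folding” BIOES tags back to core entity spans could look like. The function name and dict shapes are my own for illustration, not an existing transformers API:

```python
def fold_bioes_to_spans(tags, offsets):
    """Collapse per-token BIOES tags back into core entity spans.

    tags    : list of strings like 'O', 'B-drug', 'I-drug', 'E-drug', 'S-drug'
    offsets : list of (start, end) character offsets, parallel to tags
    """
    spans = []
    current = None  # (label, start_char) of the entity being built
    for tag, (start, end) in zip(tags, offsets):
        if tag == 'O':
            current = None
            continue
        prefix, label = tag.split('-', 1)
        if prefix in ('B', 'S') or current is None or current[0] != label:
            # start of a new entity
            current = (label, start)
        if prefix in ('E', 'S'):
            # end of the entity: emit a span with character offsets
            spans.append({'label': current[0], 'start': current[1], 'end': end})
            current = None
    return spans

tags = ['O', 'B-drug', 'I-drug', 'E-drug', 'O']
offsets = [(0, 2), (3, 6), (7, 9), (10, 14), (15, 17)]
print(fold_bioes_to_spans(tags, offsets))
# [{'label': 'drug', 'start': 3, 'end': 14}]
```

The same shape of function would work for IOB by treating a B or an O as the end of the previous entity instead of relying on E/S markers.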

Current State and what I’m missing

  • The docs and examples for Token Classification assume that the text is pre-tokenized.
  • For a word that has a label and is tokenized to multiple tokens, it is recommended to place the label on the first token and ā€œignoreā€ the following tokens
  • The example pads all examples to max_sequence_length which is a big performance hit (as opposed to bucketing by length and padding dynamically)
  • The example loads the entire dataset at once in memory. I’m not sure if this is a real problem or I’m being nitpicky, but I think ā€œthe right wayā€ to do this would be to lazy load a batch or a few batches.
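On the “label the first token and ignore the rest” convention: with a fast tokenizer, the `word_ids()` output of the encoding makes this mechanical. A sketch of the idea, using a hand-written word_ids list so it runs without a tokenizer (the helper name is mine):

```python
def labels_from_word_ids(word_ids, word_labels, ignore_index=-100):
    """Assign each word's label to its first sub-token; special tokens
    (word_id None) and continuation sub-tokens get ignore_index, which
    PyTorch's CrossEntropyLoss skips by default."""
    labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            labels.append(ignore_index)
        else:
            labels.append(word_labels[word_id])
        previous = word_id
    return labels

# word_ids as a fast tokenizer would report them for a sequence like
# [CLS] washing ##ton went [SEP]  ->  words: Washington(0), went(1)
print(labels_from_word_ids([None, 0, 0, 1, None], [1, 0]))
# [-100, 1, -100, 0, -100]
```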

Alignment

The path to aligning tokens with span annotations is the return_offsets_mapping flag on the tokenizer (which is awesome!). There are probably a few possible strategies; I’ve been using logic like this:

def align_tokens_to_annos(offsets, annos):
    anno_ix = 0
    results = []
    for start, end in offsets:
        # Skip past any annotation that ends at or before this token's start.
        # Offsets from return_offsets_mapping are half-open: [start, end).
        while anno_ix < len(annos) and annos[anno_ix]['end'] <= start:
            anno_ix += 1
        if anno_ix >= len(annos):
            results.append(dict(offset=(start, end), tag='O'))
            continue
        anno = annos[anno_ix]
        if end <= anno['start']:
            # the token lies entirely before the next annotation
            results.append(dict(offset=(start, end), tag='O'))
        elif start <= anno['start'] and end <= anno['end']:
            # the token covers the annotation's start
            results.append(dict(offset=(start, end), tag=f'B-{anno["tag"]}'))
        elif start >= anno['start'] and end <= anno['end']:
            # the token lies inside the annotation
            results.append(dict(offset=(start, end), tag=f'I-{anno["tag"]}'))
        elif start >= anno['start'] and end > anno['end']:
            # the token covers the annotation's end
            anno_ix += 1
            results.append(dict(offset=(start, end), tag=f'E-{anno["tag"]}'))
        else:
            raise ValueError(f"Funny overlap {(start, end)}, {anno}")
    return results

And then call that function from inside add_labels here:

res_batch = tokenizer(
    [s['text'] for s in pre_batch],
    return_offsets_mapping=True,
    padding=True,
)
offsets_batch = res_batch.pop('offset_mapping')
res_batch['labels'] = []
for i in range(len(offsets_batch)):
    labels = add_labels(res_batch['input_ids'][i], offsets_batch[i], pre_batch[i]['annotations'])
    res_batch['labels'].append(labels)

This works, and it’s nice because the padding only goes to the longest sentence in the batch, so bucketing by length gives a big boost. But the add_labels work happens in Python and thus runs sequentially over the examples, which isn’t super fast. I haven’t measured this to confirm it’s a problem; I’m just bringing it up.
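The label-padding half of the batching described above is small enough to sketch. `pad_batch_labels` is a name I made up; it assumes the tokenizer was called with padding=True, so input_ids are already padded to the batch max:

```python
def pad_batch_labels(label_seqs, pad_to=None, ignore_index=-100):
    """Pad per-example label lists to the batch maximum length (matching
    tokenizer(..., padding=True)) so they stack into one tensor;
    padded positions get ignore_index so the loss skips them."""
    if pad_to is None:
        pad_to = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (pad_to - len(seq)) for seq in label_seqs]

print(pad_batch_labels([[0, 1, 2], [0, 1]]))
# [[0, 1, 2], [0, 1, -100]]
```

Padding to the batch max rather than to max_sequence_length is exactly what makes length-bucketing pay off: short batches stay short.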

Desired Solution

I need most of this stuff, so I’m going to build it either way; the open question is where it should live.

The current “NER” examples and issues assume that text is pre-tokenized. Our use case is one where the full text is not tokenized and the labels for “NER” come as character offsets. I propose a utility/example to handle that scenario, because I haven’t been able to find one.

In practice, most values of X don’t need any modification, and doing what I propose (below) in Rust is beyond me, so this might boil down to a utility class and documentation.

Motivation

I make text annotation tools and our output is span annotations on untokenized text. I want our users to be able to easily use transformers. I suspect from my (limited) experience that in many non-academic use cases, span annotations on untokenized text is the norm and that others would benefit from this as well.

Possible ways to address this

I can imagine a few scenarios here

  • **This is out of scope.** Maybe this isn’t something that should be handled by transformers at all, and it should be delegated to an external library and a blog post.
  • **This is in scope and just needs documentation.** E.g. all the things I mentioned are things transformers should and can already do. In that case the solution would be pointing someone (me) to the right functions and adding some documentation.
  • **This is in scope and should be a set of utilities.** Solving this could be as simple as making a file similar to utils_ner.py. I think that would be the simplest way to get something usable, gather feedback, and see if anyone else cares.
  • **This is in scope but should be done in Rust soon.** If we want to be performance purists, it would make sense to handle the alignment of span-based labels in Rust. I don’t know Rust, so I can’t help much, and I don’t know if there is any appetite or capacity from someone who does, or if it’s worth the (presumably) additional effort.

Your contribution

I’d be happy to implement and submit a PR, or make an external library or add to a relevant existing one.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 38
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

3 reactions
talolard commented, Sep 21, 2020

So, after going down quite a rabbit hole, I’ve written a blog post about the considerations involved in alignment/padding/batching, and another walking through an implementation.

It even comes with a repo

So if we have annotated data like this:

[{'annotations': [],
  'content': 'No formal drug interaction studies of Aranesp? have been '
             'performed.',
  'metadata': {'original_id': 'DrugDDI.d390.s0'}},
 {'annotations': [{'end': 13, 'label': 'drug', 'start': 6, 'tag': 'drug'},
                  {'end': 60, 'label': 'drug', 'start': 43, 'tag': 'drug'},
                  {'end': 112, 'label': 'drug', 'start': 105, 'tag': 'drug'},
                  {'end': 177, 'label': 'drug', 'start': 164, 'tag': 'drug'},
                  {'end': 194, 'label': 'drug', 'start': 181, 'tag': 'drug'},
                  {'end': 219, 'label': 'drug', 'start': 211, 'tag': 'drug'},
                  {'end': 238, 'label': 'drug', 'start': 227, 'tag': 'drug'}],
  'content': 'Since PLETAL is extensively metabolized by cytochrome P-450 '
             'isoenzymes, caution should be exercised when PLETAL is '
             'coadministered with inhibitors of C.P.A. such as ketoconazole '
             'and erythromycin or inhibitors of CYP2C19 such as omeprazole.',
  'metadata': {'original_id': 'DrugDDI.d452.s0'}},
 {'annotations': [{'end': 58, 'label': 'drug', 'start': 47, 'tag': 'drug'},
                  {'end': 75, 'label': 'drug', 'start': 62, 'tag': 'drug'},
                  {'end': 135, 'label': 'drug', 'start': 124, 'tag': 'drug'},
                  {'end': 164, 'label': 'drug', 'start': 152, 'tag': 'drug'}],
  'content': 'Pharmacokinetic studies have demonstrated that omeprazole and '
             'erythromycin significantly increased the systemic exposure of '
             'cilostazol and/or its major metabolites.',
  'metadata': {'original_id': 'DrugDDI.d452.s1'}}]

We can do this:

from sequence_aligner.labelset import LabelSet
from sequence_aligner.dataset import TrainingDataset
from sequence_aligner.containers import TraingingBatch
import json

raw = json.load(open('./data/ddi_train.json'))
for example in raw:
    for annotation in example['annotations']:
        # We expect the key to be 'label', but the data has 'tag'
        annotation['label'] = annotation['tag']

# ... construct `dataset` (a TrainingDataset) from `raw` here ...

from torch.utils.data import DataLoader
from transformers import BertForTokenClassification, AdamW

model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(dataset.label_set.ids_to_label.values())
)
optimizer = AdamW(model.parameters(), lr=5e-6)

dataloader = DataLoader(
    dataset,
    collate_fn=TraingingBatch,
    batch_size=4,
    shuffle=True,
)
for num, batch in enumerate(dataloader):
    optimizer.zero_grad()  # clear gradients accumulated from the previous step
    loss, logits = model(
        input_ids=batch.input_ids,
        attention_mask=batch.attention_masks,
        labels=batch.labels,
    )
    loss.backward()
    optimizer.step()


-------------------------------

I think most of this is out of scope for the transformers library itself, so I’m all for closing this issue if no one objects.
0 reactions
stale[bot] commented, Nov 21, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
