Proposal: Offset-based Token Classification utilities
🚀 Feature request
Hi. So we work a lot with span annotations on text that isn't tokenized, and we want a "canonical" way to work with that. I have some ideas and rough implementations, so I'm looking for feedback on whether this belongs in the library, and whether the proposed implementation is more or less right.
I also think there is a good chance that everything I want already exists, and the only solution needed is slightly clearer documentation. I hope that's the case, and I'm happy to write the docs if someone can point me in the right direction.
The Desired Capabilities
What I'd like is a canonical way to do the following (a rough, hypothetical API sketch follows this list):
- Tokenize the examples in the dataset
- Align my annotations with the output tokens (see notes below)
- Have the tokens and labels correctly padded to the max length of an example in the batch or max_sequence_length
- Have a convenient function that returns predicted offsets
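To make that concrete, here is a sketch of what such an API might look like. This is purely hypothetical: `tokenize_and_align` and `decode_to_offsets` don't exist anywhere; they're just names for the two ends of the workflow above.

```python
# Hypothetical API sketch -- none of these helpers exist in transformers today.
texts = ["Jane Smith lives in Paris."]
annotations = [[
    {"start": 0, "end": 10, "tag": "PER"},   # "Jane Smith"
    {"start": 20, "end": 25, "tag": "LOC"},  # "Paris"
]]

features = tokenize_and_align(   # hypothetical: tokenize + align + pad in one call
    tokenizer,
    texts,
    annotations,
    scheme="BIOES",              # tagging scheme handled internally
    padding="longest",           # pad to the longest example in the batch
)
# ... run the model on `features` ...
predicted_spans = decode_to_offsets(logits, features)  # hypothetical
# -> [[{"start": 0, "end": 10, "tag": "PER"}, {"start": 20, "end": 25, "tag": "LOC"}]]
```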
Some Nice To Haves
- It would be nice if such a utility handled tagging schemes like IOB or BIOES internally, and optionally either exposed them in the output or "folded" them down to the core entities.
- It would be nice if there were a recommended/default strategy for handling examples that are longer than max_sequence_length (see the tokenizer sketch after this list).
- It would be amazing if we could pass labels to the tokenizer and have the alignment happen in Rust (in parallel). But I don't know Rust, and I have a sense this is complicated, so I won't be taking that on myself; I'm assuming the alignment happens in Python.
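For the long-example nice-to-have specifically, fast tokenizers already expose building blocks a default strategy could lean on. A minimal sketch, assuming a sliding-window split is an acceptable strategy (`return_overflowing_tokens` and `stride` are real tokenizer arguments; the rest is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
long_text = " ".join(["word"] * 2000)  # stand-in for an over-long example

enc = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,                      # consecutive windows overlap by 128 tokens
    return_overflowing_tokens=True,  # emit one window per max_length chunk
    return_offsets_mapping=True,     # offsets stay relative to the original string
)
# enc["overflow_to_sample_mapping"][i] says which input example window i came
# from, so span labels can be aligned per window and stitched back together.
```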
Current State and what I'm missing
- The docs and examples for Token Classification assume that the text is pre-tokenized.
- For a word that has a label and is tokenized into multiple tokens, the recommendation is to place the label on the first token and "ignore" the following tokens (sketched in code after this list)
- However, it is not clear where that recommendation came from, and it has edge cases that seem quite nasty
- The example pads all examples to max_sequence_length, which is a big performance hit (as opposed to bucketing by length and padding dynamically)
- The example loads the entire dataset into memory at once. I'm not sure if this is a real problem or I'm being nitpicky, but I think "the right way" to do this would be to lazily load a batch or a few batches.
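For reference, the "label the first token, ignore the rest" recipe from the docs boils down to something like the sketch below, assuming word-level labels and the `word_ids()` mapping that fast tokenizers provide; `-100` works as the "ignore" value because it is the default `ignore_index` of PyTorch's cross-entropy loss:

```python
def first_subtoken_labels(word_ids, word_labels, ignore_index=-100):
    """word_ids: encoding.word_ids() from a fast tokenizer;
    word_labels: one label id per original word."""
    labels, prev = [], None
    for wid in word_ids:
        if wid is None:
            labels.append(ignore_index)      # special tokens / padding
        elif wid != prev:
            labels.append(word_labels[wid])  # first sub-token keeps the label
        else:
            labels.append(ignore_index)      # continuation sub-tokens ignored
        prev = wid
    return labels

# e.g. word_ids=[None, 0, 1, 1, 2, None], word_labels=[3, 0, 0]
# -> [-100, 3, 0, -100, 0, -100]
```

Part of why the edge cases get nasty in our setting is that labels come as character offsets rather than per-word tags, which is what the alignment code below deals with.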
Alignment
The path to aligning tokens to span annotations is the return_offsets_mapping flag on the tokenizer (which is awesome!). There are probably a few strategies; I've been using logic like this:
```python
def align_tokens_to_annos(offsets, annos):
    anno_ix = 0
    results = []
    done = len(annos) == 0
    for offset in offsets:
        if done:
            results.append(dict(offset=offset, tag='O'))
        else:
            anno = annos[anno_ix]
            start, end = offset
            if end < anno['start']:
                # the offset is before the next annotation
                results.append(dict(offset=offset, tag='O'))
            elif start <= anno['start'] and end <= anno['end']:
                # token begins the annotation
                results.append(dict(offset=offset, tag=f'B-{anno["tag"]}'))
            elif start >= anno['start'] and end <= anno['end']:
                # token is inside the annotation
                results.append(dict(offset=offset, tag=f'I-{anno["tag"]}'))
            elif start >= anno['start'] and end > anno['end']:
                # token runs past the annotation's end: tag it as the end
                # and move on to the next annotation
                anno_ix += 1
                results.append(dict(offset=offset, tag=f'E-{anno["tag"]}'))
            else:
                raise Exception(f"Funny Overlap {offset},{anno}")
            if anno_ix >= len(annos):
                done = True
    return results
```
And then call that function from inside an add_labels helper here:
```python
res_batch = tokenizer(
    [s['text'] for s in pre_batch],
    return_offsets_mapping=True,
    padding=True,
)
offsets_batch = res_batch.pop('offset_mapping')
res_batch['labels'] = []
for i in range(len(offsets_batch)):
    # add_labels calls align_tokens_to_annos internally and converts the
    # resulting tags to label ids (its definition is omitted here)
    labels = add_labels(res_batch['input_ids'][i], offsets_batch[i], pre_batch[i]['annotations'])
    res_batch['labels'].append(labels)
```
This works, and it's nice because the padding matches the longest sentence in the batch, so bucketing gives a big boost. But the add_labels work happens in Python, and is thus sequential over the examples and not super fast. I haven't measured this to confirm it's a problem; I'm just bringing it up.
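If that Python loop ever does become a bottleneck, one stopgap short of Rust is to run it across processes. A sketch using the datasets library (assuming `tokenizer`, `align_tokens_to_annos`, and a `tag_to_id` label mapping are already defined):

```python
from datasets import Dataset

def encode_batch(batch):
    enc = tokenizer(batch["text"], return_offsets_mapping=True, padding=True)
    offsets_batch = enc.pop("offset_mapping")
    enc["labels"] = [
        [tag_to_id[r["tag"]] for r in align_tokens_to_annos(offsets, annos)]
        for offsets, annos in zip(offsets_batch, batch["annotations"])
    ]
    return enc

ds = Dataset.from_dict({"text": texts, "annotations": all_annotations})
ds = ds.map(encode_batch, batched=True, num_proc=4)  # alignment in 4 processes
```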
Desired Solution
I need most of this stuff, so I'm going to build it one way or another; the open question is whether it lives in transformers or somewhere else (see below).
The current "NER" examples and issues assume that text is pre-tokenized. Our use case is such that the full text is not tokenized and the labels for "NER" come as character offsets. I propose a utility/example to handle that scenario, because I haven't been able to find one.
In practice, most of this doesn't need any modification to transformers itself, and doing what I propose (below) in Rust is beyond me, so this might boil down to a utility class and documentation.
Motivation
I make text annotation tools, and our output is span annotations on untokenized text. I want our users to be able to use transformers easily. From my (limited) experience, I suspect that in many non-academic use cases span annotations on untokenized text are the norm, and that others would benefit from this as well.
Possible ways to address this
I can imagine a few scenarios here:
- **This is out of scope.** Maybe this isn't something that should be handled by transformers at all, and it should be delegated to a library and a blog post.
- **This is in scope and just needs documentation.** That is, all the things I mentioned are things transformers should and can already do. In that case the solution would be pointing someone (me) to the right functions and adding some documentation.
- **This is in scope and should be a set of utilities.** Solving this could be as simple as making a file similar to utils_ner.py. I think that would be the simplest way to get something usable, gather feedback, and see if anyone else cares.
- **This is in scope but should be done in Rust soon.** If we want to be performance purists, it would make sense to handle the alignment of span-based labels in Rust. I don't know Rust, so I can't help much, and I don't know if there is appetite or capacity from someone who does, or whether it's worth the (presumably) additional effort.
Your contribution
I'd be happy to implement this and submit a PR, or to make an external library, or add it to a relevant existing one.
So, after much rabbit-holing, I've written a blog post about the considerations when doing alignment/padding/batching, and another walking through an implementation. It even comes with a repo.