Proposal: Offset-based Token Classification utilities
🚀 Feature request
Hi. So we work a lot with span annotations on text that isn't tokenized, and we want a "canonical" way to work with that. I have some ideas and rough implementations, so I'm looking for feedback on whether this belongs in the library, and whether the proposed implementation is more or less right.
I also think there is a good chance that everything I want already exists, and the only solution needed is slightly clearer documentation. I hope that's the case, and I'm happy to write the docs if someone can point me in the right direction.
The Desired Capabilities
What I'd like is a canonical way to do the following (a rough, hypothetical API sketch follows this list):
- Tokenize the examples in the dataset
- Align my annotations with the output tokens (see notes below)
- Have the tokens and labels correctly padded to the max length of an example in the batch or max_sequence_length
- Have a convenient function that returns predicted offsets
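To make that concrete, here is a sketch of what such an API might look like. This is purely hypothetical: `tokenize_and_align` and `decode_to_offsets` don't exist anywhere; they're just names for the two ends of the workflow above.

```python
# Hypothetical API sketch -- none of these helpers exist in transformers today.
texts = ["Jane Smith lives in Paris."]
annotations = [[
    {"start": 0, "end": 10, "tag": "PER"},   # "Jane Smith"
    {"start": 20, "end": 25, "tag": "LOC"},  # "Paris"
]]

features = tokenize_and_align(   # hypothetical: tokenize + align + pad in one call
    tokenizer,
    texts,
    annotations,
    scheme="BIOES",              # tagging scheme handled internally
    padding="longest",           # pad to the longest example in the batch
)
# ... run the model on `features` ...
predicted_spans = decode_to_offsets(logits, features)  # hypothetical
# -> [[{"start": 0, "end": 10, "tag": "PER"}, {"start": 20, "end": 25, "tag": "LOC"}]]
```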
Some Nice To Haves
- It would be nice if such a utility handled tagging schemes like IOB or BIOES internally, and optionally either exposed them in the output or "folded" them down to the core entities.
- It would be nice if there were a recommended/default strategy for handling examples that are longer than max_sequence_length (see the tokenizer sketch after this list).
- It would be amazing if we could pass labels to the tokenizer and have the alignment happen in Rust (in parallel). But I don't know Rust, and I have a sense this is complicated, so I won't be taking that on myself; I'm assuming the alignment happens in Python.
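For the long-example nice-to-have specifically, fast tokenizers already expose building blocks a default strategy could lean on. A minimal sketch, assuming a sliding-window split is an acceptable strategy (`return_overflowing_tokens` and `stride` are real tokenizer arguments; the rest is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
long_text = " ".join(["word"] * 2000)  # stand-in for an over-long example

enc = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,                      # consecutive windows overlap by 128 tokens
    return_overflowing_tokens=True,  # emit one window per max_length chunk
    return_offsets_mapping=True,     # offsets stay relative to the original string
)
# enc["overflow_to_sample_mapping"][i] says which input example window i came
# from, so span labels can be aligned per window and stitched back together.
```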
Current State and what I'm missing
- The docs and examples for Token Classification assume that the text is pre-tokenized.
- For a word that has a label and is tokenized into multiple tokens, the recommendation is to place the label on the first token and "ignore" the following tokens (sketched in code after this list)
- However, it is not clear where that recommendation came from, and it has edge cases that seem quite nasty
- The example pads all examples to max_sequence_length, which is a big performance hit (as opposed to bucketing by length and padding dynamically)
- The example loads the entire dataset into memory at once. I'm not sure if this is a real problem or I'm being nitpicky, but I think "the right way" to do this would be to lazily load a batch or a few batches.
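For reference, the "label the first token, ignore the rest" recipe from the docs boils down to something like the sketch below, assuming word-level labels and the `word_ids()` mapping that fast tokenizers provide; `-100` works as the "ignore" value because it is the default `ignore_index` of PyTorch's cross-entropy loss:

```python
def first_subtoken_labels(word_ids, word_labels, ignore_index=-100):
    """word_ids: encoding.word_ids() from a fast tokenizer;
    word_labels: one label id per original word."""
    labels, prev = [], None
    for wid in word_ids:
        if wid is None:
            labels.append(ignore_index)      # special tokens / padding
        elif wid != prev:
            labels.append(word_labels[wid])  # first sub-token keeps the label
        else:
            labels.append(ignore_index)      # continuation sub-tokens ignored
        prev = wid
    return labels

# e.g. word_ids=[None, 0, 1, 1, 2, None], word_labels=[3, 0, 0]
# -> [-100, 3, 0, -100, 0, -100]
```

Part of why the edge cases get nasty in our setting is that labels come as character offsets rather than per-word tags, which is what the alignment code below deals with.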
Alignment
The path to aligning tokens to span annotations is the return_offsets_mapping flag on the tokenizer (which is awesome!). There are probably a few strategies; I've been using logic like this:
```python
def align_tokens_to_annos(offsets, annos):
    anno_ix = 0
    results = []
    done = len(annos) == 0
    for offset in offsets:
        if done:
            results.append(dict(offset=offset, tag='O'))
        else:
            anno = annos[anno_ix]
            start, end = offset
            if end < anno['start']:
                # the offset is before the next annotation
                results.append(dict(offset=offset, tag='O'))
            elif start <= anno['start'] and end <= anno['end']:
                # token begins the annotation
                results.append(dict(offset=offset, tag=f'B-{anno["tag"]}'))
            elif start >= anno['start'] and end <= anno['end']:
                # token is inside the annotation
                results.append(dict(offset=offset, tag=f'I-{anno["tag"]}'))
            elif start >= anno['start'] and end > anno['end']:
                # token runs past the annotation's end: tag it as the end
                # and move on to the next annotation
                anno_ix += 1
                results.append(dict(offset=offset, tag=f'E-{anno["tag"]}'))
            else:
                raise Exception(f"Funny Overlap {offset},{anno}")
            if anno_ix >= len(annos):
                done = True
    return results
```
And then call that function from inside an add_labels helper here:
```python
res_batch = tokenizer(
    [s['text'] for s in pre_batch],
    return_offsets_mapping=True,
    padding=True,
)
offsets_batch = res_batch.pop('offset_mapping')
res_batch['labels'] = []
for i in range(len(offsets_batch)):
    # add_labels calls align_tokens_to_annos internally and converts the
    # resulting tags to label ids (its definition is omitted here)
    labels = add_labels(res_batch['input_ids'][i], offsets_batch[i], pre_batch[i]['annotations'])
    res_batch['labels'].append(labels)
```
This works, and it's nice because the padding matches the longest sentence in the batch, so bucketing gives a big boost. But the add_labels work happens in Python, and is thus sequential over the examples and not super fast. I haven't measured this to confirm it's a problem; I'm just bringing it up.
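If that Python loop ever does become a bottleneck, one stopgap short of Rust is to run it across processes. A sketch using the datasets library (assuming `tokenizer`, `align_tokens_to_annos`, and a `tag_to_id` label mapping are already defined):

```python
from datasets import Dataset

def encode_batch(batch):
    enc = tokenizer(batch["text"], return_offsets_mapping=True, padding=True)
    offsets_batch = enc.pop("offset_mapping")
    enc["labels"] = [
        [tag_to_id[r["tag"]] for r in align_tokens_to_annos(offsets, annos)]
        for offsets, annos in zip(offsets_batch, batch["annotations"])
    ]
    return enc

ds = Dataset.from_dict({"text": texts, "annotations": all_annotations})
ds = ds.map(encode_batch, batched=True, num_proc=4)  # alignment in 4 processes
```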
Desired Solution
I need most of this stuff, so I'm going to build it one way or another; the open question is whether it lives in transformers or somewhere else (see below).
The current "NER" examples and issues assume that text is pre-tokenized. Our use case is such that the full text is not tokenized and the labels for "NER" come as character offsets. I propose a utility/example to handle that scenario, because I haven't been able to find one.
In practice, most of this doesn't need any modification to transformers itself, and doing what I propose (below) in Rust is beyond me, so this might boil down to a utility class and documentation.
Motivation
I make text annotation tools, and our output is span annotations on untokenized text. I want our users to be able to use transformers easily. From my (limited) experience, I suspect that in many non-academic use cases span annotations on untokenized text are the norm, and that others would benefit from this as well.
Possible ways to address this
I can imagine a few scenarios here:
- **This is out of scope.** Maybe this isn't something that should be handled by transformers at all, and it should be delegated to a library and a blog post.
- **This is in scope and just needs documentation.** That is, all the things I mentioned are things transformers should and can already do. In that case the solution would be pointing someone (me) to the right functions and adding some documentation.
- **This is in scope and should be a set of utilities.** Solving this could be as simple as making a file similar to utils_ner.py. I think that would be the simplest way to get something usable, gather feedback, and see if anyone else cares.
- **This is in scope but should be done in Rust soon.** If we want to be performance purists, it would make sense to handle the alignment of span-based labels in Rust. I don't know Rust, so I can't help much, and I don't know if there is appetite or capacity from someone who does, or whether it's worth the (presumably) additional effort.
Your contribution
I'd be happy to implement this and submit a PR, or to make an external library, or add it to a relevant existing one.
So, after much rabbit-holing, I've written a blog post about the considerations when doing alignment/padding/batching, and another walking through an implementation. It even comes with a repo.