Pipelines
I’m using *pipeline* to mean a chain of interdependent linguistic processors, a common data structure in NLP packages. As I understand things, CLTK tasks do not currently depend on a pipeline: text in hand, one calls a tokenizer, a tagger of some sort, or a lemmatizer, each independently.
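To make the contrast concrete, here’s a toy sketch; every name in it is hypothetical, not a real CLTK or stanfordnlp interface:

```python
# Toy illustration of a pipeline: a chain of processors that each decorate
# a shared document object. All names here are hypothetical.

class Doc:
    def __init__(self, text):
        self.text = text
        self.tokens, self.pos, self.lemmas = [], [], []

class Tokenizer:
    def process(self, doc):
        doc.tokens = doc.text.split()            # placeholder tokenization
        return doc

class POSTagger:
    def process(self, doc):
        doc.pos = ["NOUN"] * len(doc.tokens)     # placeholder tags
        return doc

class Lemmatizer:
    def process(self, doc):
        # Depends on earlier steps: reads doc.tokens (and, in a real
        # system, doc.pos) rather than raw text.
        doc.lemmas = [t.lower() for t in doc.tokens]
        return doc

pipeline = [Tokenizer(), POSTagger(), Lemmatizer()]
doc = Doc("Hwæt we Gardena in geardagum")
for processor in pipeline:
    doc = processor.process(doc)
```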
Context:
In an attempt to improve the lemmatizer for Old English (OE), I trained a neural lemmatizer using Stanford’s new PyTorch models (stanfordnlp). Their lemmatizer is a bit complicated: an ensemble of a dictionary, a Seq2Seq neural model, and a string-edit classifier.
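My rough understanding of the control flow, paraphrased as a sketch (the method names are made up, and the actual stanfordnlp implementation differs in its details):

```python
# Rough paraphrase of a dictionary + edit-classifier + Seq2Seq lemmatizer
# ensemble; not Stanford's actual code.
def lemmatize(word, pos, dictionary, edit_classifier, seq2seq):
    if (word, pos) in dictionary:          # (form, POS) pairs seen in training
        return dictionary[(word, pos)]
    edit = edit_classifier.predict(word)   # cheap string-edit prediction
    if edit == "identity":
        return word
    if edit == "lowercase":
        return word.lower()
    return seq2seq.decode(word, pos)       # character-level generation fallback
```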
Training on the small ISWOC treebank (which needed some preprocessing, e.g. mapping its POS tags to the UD tagset), I’m seeing a reproducible 85% accuracy score. By comparison, the current dictionary-based lemmatizer gets ~75%, but since its recall is imperfect (it cannot lemmatize forms absent from the dictionary), the fairer comparison is F1: 91% vs. 75% in favor of the neural model.
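To spell out why F1 is the fairer comparison when one system can abstain, here’s a toy calculation with invented counts (not the actual ISWOC results):

```python
# Illustrative only: why F1 is the fairer metric when one system can abstain.
def precision_recall_f1(n_correct, n_answered, n_total):
    p = n_correct / n_answered          # correct among lemmas actually returned
    r = n_correct / n_total             # correct among all test tokens
    return p, r, 2 * p * r / (p + r)

# Dictionary lemmatizer: answers 800 of 1000 tokens, 720 of them correctly.
print(precision_recall_f1(720, 800, 1000))    # p=0.90, r=0.72, F1=0.80
# Neural lemmatizer: answers all 1000 tokens, 850 correctly.
print(precision_recall_f1(850, 1000, 1000))   # p=r=F1=0.85 (== accuracy)
```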
The integration question
So far so good. The trouble is: how to integrate the neural lemmatizer into CLTK? Several issues obtain.
- Are we prepared to depend on Stanford? It feels like a hefty decision to me, but I don’t see a simpler path to adding dependency parsing to CLTK. Note that this also introduces a dependency on PyTorch (which runs fine on plain CPUs).
- The lemmatizer takes word forms and POS tags as input. As implemented in `stanfordnlp`, the lemma processor executes after the POS processor in the pipeline, so that it computes over Word objects already decorated with POS tags. A simple integration would mean that we would need either to expose a pipeline API to the user, or to accept some degree of redundancy by running the required pipeline under the hood, even though the user may have separately completed a POS-tagging run (see the sketch after this list).
- The need for POS tags obviously entails a POS tagger. The best (and slowest) POS tagger for OE gets around 84% accuracy on a test set (the same test set used for lemmatization). I have a neural one, not tied to Stanford’s tools, that gets 91%. We’ll soon see whether training with stanfordnlp improves these scores, but whatever the tagger’s error rate, it is bound to propagate to the lemmatizer to some empirically determinable degree. If the resulting degradation is as large as 10%, there may be little cause to integrate the neural model!
- Another integration option would be to borrow code from Stanford to run the model outside of a pipeline context; the user would supply tokens and POS tags themselves. This approach is messy at both the code level and the API level, though it might seem closer to the current spirit of pipeline-free processing.
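Here’s a sketch of what the under-the-hood option could look like: each process declares its dependencies, and asking for lemmatization transparently runs the tagger first. None of these classes exist in CLTK; this is just to fix ideas.

```python
# Hypothetical dependency-declaring processes; not an existing CLTK API.
class Doc:
    def __init__(self, tokens):
        self.tokens, self.pos, self.lemmas = tokens, [], []

class Process:
    depends_on = []

class OEPOSTagger(Process):
    def run(self, doc):
        doc.pos = ["NOUN"] * len(doc.tokens)   # stand-in for a real tagger
        return doc

class NeuralLemmatizer(Process):
    depends_on = [OEPOSTagger]                 # consumes POS-decorated words
    def run(self, doc):
        # A real implementation would feed (form, POS) pairs to the model.
        doc.lemmas = [t.lower() for t in doc.tokens]
        return doc

def resolve(cls, ordered=None):
    """Topologically order a process class and its dependencies."""
    ordered = [] if ordered is None else ordered
    for dep in cls.depends_on:
        resolve(dep, ordered)
    if cls not in ordered:
        ordered.append(cls)
    return ordered

doc = Doc("Hwæt we Gardena".split())
for cls in resolve(NeuralLemmatizer):          # [OEPOSTagger, NeuralLemmatizer]
    doc = cls().run(doc)
```

The redundancy worry from above remains, of course: if the user has already run a tagger separately, this design re-runs it unless results are cached on the document.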
I realize these points aren’t very well organized. I guess I’m mostly looking for ideas about how to proceed.
Top GitHub Comments
Ah, I have a lot I’d like to contribute to this but no time to respond until after next week! One thing: I have a book chapter coming out next month on CLTK and pipelines. I’ll see if I can distribute a preprint to you all soon…
Congrats on undertaking this investigation. It sounds like it has already been fruitful.
I’m excited about PyTorch and am looking forward to contributing some PyTorch-based models to CLTK. If possible, we should probably consider using the scikit-learn-compatible transformer architecture of skorch: https://github.com/skorch-dev/skorch. This would give us reusable, standardized pipeline interfaces that we could also easily wrap to provide spaCy-like functionality.
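For instance, skorch wraps a torch nn.Module as a scikit-learn estimator, which then drops straight into an sklearn Pipeline. This follows the pattern in skorch’s README; the module’s sizes here are placeholders:

```python
import numpy as np
import torch.nn as nn
from skorch import NeuralNetClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class ClassifierModule(nn.Module):
    def __init__(self, num_units=10):
        super().__init__()
        self.dense = nn.Linear(20, num_units)
        self.output = nn.Linear(num_units, 2)

    def forward(self, X):
        X = nn.functional.relu(self.dense(X))
        return nn.functional.softmax(self.output(X), dim=-1)

# The wrapped net behaves like any sklearn estimator: fit, predict, grid search.
net = NeuralNetClassifier(ClassifierModule, max_epochs=10, lr=0.1)
pipe = Pipeline([("scale", StandardScaler()), ("net", net)])

X = np.random.randn(200, 20).astype(np.float32)
y = np.random.randint(0, 2, 200).astype(np.int64)
pipe.fit(X, y)
```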
We should also consider using fastText embeddings, since they capture subword variation better and are well suited to inflected languages. I’ll post an example soon.
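Until then, a minimal sketch of the subword point, querying a Facebook-format fastText model through gensim (the file name is an assumption, e.g. fastText’s Wikipedia-trained Old English vectors or a locally trained model):

```python
# Sketch: fastText subword vectors via gensim. The model path is an assumption.
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors("wiki.ang.bin")

# Because fastText composes vectors from character n-grams, inflected or even
# unseen forms still get sensible vectors near their relatives:
print(wv.similarity("cyning", "cyninges"))   # 'king', nominative vs. genitive
print(wv.most_similar("cyning")[:5])
```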