
I’m using “pipeline” to mean a chain of interdependent linguistic processors, a common architecture in NLP packages. As I understand it, CLTK tasks currently do not depend on a pipeline: text in hand, one calls a tokenizer, a tagger of some sort, or a lemmatizer directly.
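As a toy illustration of the contrast (these are invented stand-in functions, not the actual CLTK API), the pipeline-free style wires each step by hand, while a pipeline runs the processors in a fixed order:

```python
# Hypothetical sketch -- toy stand-ins, NOT real CLTK functions.

def tokenize(text):
    return text.split()

def pos_tag(tokens):
    # toy tagger: calls everything a NOUN
    return [(t, "NOUN") for t in tokens]

def lemmatize(tagged):
    # toy lemmatizer: strips a trailing "s"
    return [(w, tag, w.rstrip("s")) for w, tag in tagged]

# Pipeline-free: the user composes each step explicitly.
tokens = tokenize("stanas stodon")
tagged = pos_tag(tokens)
lemmas = lemmatize(tagged)

# Pipeline: processors run in a declared dependency order.
def run_pipeline(text, steps):
    result = text
    for step in steps:
        result = step(result)
    return result

lemmas2 = run_pipeline("stanas stodon", [tokenize, pos_tag, lemmatize])
assert lemmas == lemmas2
```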

Context:

In an attempt to improve the lemmatizer for OE, I trained a neural lemmatizer using Stanford’s new PyTorch models (stanfordnlp). Their lemmatizer is a bit complicated: an ensemble of a dictionary, a Seq2Seq neural model, and a string-edit classifier.

Training on the small ISWOC treebank (which needed some preprocessing, e.g. mapping POS tags to the UD tagset), I’m seeing a reproducible 85% accuracy score. By comparison, the current dictionary-based lemmatizer gets ~75% accuracy, but since its recall is imperfect, F1 is the better comparison: 91% vs. 75% in favor of the neural model.
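To make the accuracy-vs.-F1 point concrete: a dictionary lemmatizer can be quite precise on the words it covers while its coverage gap drags F1 down. The numbers below are illustrative only, not the actual ISWOC results:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative figures: a lemmatizer that is 95% precise on the tokens
# it covers but only covers 60% of tokens scores far lower on F1 than
# a model with balanced 91% precision and recall.
print(round(f1(0.95, 0.60), 3))  # 0.735
print(round(f1(0.91, 0.91), 3))  # 0.91
```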

The integration question

So far, so good. The trouble is how to integrate the neural lemmatizer into CLTK. Several issues arise.

  1. Are we prepared to depend on Stanford? It feels like a hefty decision to me. But I don’t see a simpler path to adding dependency parsing to CLTK. Note that this will also introduce a dependency on PyTorch (which runs fine on plain CPUs).
  2. The lemmatizer takes word forms and POS tags as input. As implemented in stanfordnlp, the lemma processor executes after the POS processor in the pipeline, so it computes over Word objects already decorated with POS tags. A simple integration would mean we would need to either expose a pipeline API to the user, or accept some redundancy by running the required pipeline under the hood, even though the user may have separately completed a POS-tagging run.
  3. The need for POS tags obviously entails a POS tagger. The best (and slowest) POS tagger for OE gets around 84% accuracy on a test set (the same test set as used in lemmatization). I have a neural one, not tied to Stanford’s tools, that gets 91%. We’ll soon see if training using stanfordnlp improves these scores, but whatever the error rate, it’s bound to propagate to the lemmatizer to some empirically-determinable degree. If it’s as large as 10%, there may be little cause to integrate the neural model!
  4. Another integration option would be to borrow code from Stanford to run the model, outside of a pipeline context. The user would need to supply tokens and POS tags. This approach is messy at the code level and at the API level, though it might seem closer to the current spirit of pipeline-free processing.
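One way to frame option 2 is a minimal pipeline object in which each processor declares its dependencies, so the lemmatizer can require POS tags without re-running a tagger that has already executed. This is a design sketch with invented names, not stanfordnlp’s or CLTK’s actual API:

```python
class Word:
    """Toy document unit that processors decorate in place."""
    def __init__(self, text):
        self.text = text
        self.pos = None
        self.lemma = None

class Processor:
    name = ""
    requires = ()  # names of processors that must run first
    def run(self, words): ...

class ToyTagger(Processor):
    name = "pos"
    def run(self, words):
        for w in words:
            w.pos = "NOUN"  # placeholder tagging logic

class ToyLemmatizer(Processor):
    name = "lemma"
    requires = ("pos",)  # consumes POS-decorated Word objects
    def run(self, words):
        for w in words:
            assert w.pos is not None, "tagger must run first"
            w.lemma = w.text.rstrip("s")  # placeholder lemmatization

class Pipeline:
    """Runs processors in dependency order, skipping finished steps."""
    def __init__(self, processors):
        self.processors = {p.name: p for p in processors}
        self.done = set()

    def process(self, words):
        for name in self.processors:
            self._run(name, words)
        return words

    def _run(self, name, words):
        if name in self.done:
            return  # avoids the redundancy worried about above
        for dep in self.processors[name].requires:
            self._run(dep, words)
        self.processors[name].run(words)
        self.done.add(name)

words = Pipeline([ToyLemmatizer(), ToyTagger()]).process(
    [Word("stanas"), Word("stodon")])
print([(w.text, w.pos, w.lemma) for w in words])
```

Note that the lemmatizer is listed before the tagger, yet the dependency declaration still forces the tagger to run first; that is the property a pipeline API would buy over ad hoc call ordering.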

I realize these points aren’t very well organized. I guess I’m mostly looking for ideas about how to proceed.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

2 reactions
diyclassics commented, Jun 21, 2019

Ah—have a lot I’d like to contribute to this but no time to respond until after next week! One thing is that I have a book chapter coming out next month on CLTK and pipelines. I’ll see if I can distribute a preprint to you all soon…

1 reaction
todd-cook commented, Jun 20, 2019

Congrats on undertaking this investigation. It sounds like it has already been fruitful.

I’m excited about PyTorch and am looking forward to contributing some PyTorch-based models to CLTK. If possible, we should probably consider adopting the scikit-learn-compatible transformer architecture of skorch: https://github.com/skorch-dev/skorch This would give us reusable, standardized pipeline interfaces that we could also easily wrap to provide spaCy-like functionality.
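As a rough sketch of what the scikit-learn transformer idiom buys us (pure Python here, not actual skorch or sklearn code): every step exposes fit/transform, so stateless and stateful steps compose into a single estimator with one uniform interface:

```python
class Lowercaser:
    """Stateless transformer in the scikit-learn fit/transform idiom."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [x.lower() for x in X]

class VocabEncoder:
    """Stateful transformer: fit learns a vocabulary, transform applies it."""
    def fit(self, X, y=None):
        self.vocab_ = {tok: i for i, tok in enumerate(sorted(set(X)))}
        return self
    def transform(self, X):
        return [self.vocab_.get(x, -1) for x in X]

class ToyPipeline:
    """Chains transformers the way sklearn.pipeline.Pipeline does."""
    def __init__(self, steps):
        self.steps = steps
    def fit_transform(self, X, y=None):
        for step in self.steps:
            X = step.fit(X, y).transform(X)
        return X

pipe = ToyPipeline([Lowercaser(), VocabEncoder()])
print(pipe.fit_transform(["Se", "cyning", "se"]))  # [1, 0, 1]
```

Because every step implements the same protocol, any step can be swapped for a skorch-wrapped PyTorch model without changing the surrounding pipeline code.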

We should also probably consider using FastText embeddings, since they capture subword variation better and are well suited to inflected languages. I’ll post an example soon.
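To make the subword point concrete: FastText represents a word as a bag of character n-grams with boundary markers, so inflected forms of one stem share most of their representation. A minimal sketch of just the n-gram extraction (the real library additionally hashes the n-grams and sums learned vectors):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with FastText-style < > boundary markers."""
    w = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    return grams

# Two inflected forms of the same stem share many subword units,
# so their embeddings (sums of n-gram vectors) stay close.
shared = char_ngrams("stanas") & char_ngrams("stane")
print(sorted(g for g in shared if g.startswith("<st")))  # ['<st', '<sta', '<stan']
```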
