Documentation of how and what of NLTK pre-trained models
There are several pre-trained models that NLTK provides, and it is unclear:
- what the models are trained on
- how the models are trained
These pre-trained models include:
- `sent_tokenize`: Punkt sentence tokenizers trained on _____ using `nltk.tokenize.punkt.PunktSentenceTokenizer` with ____ settings/parameters
- `pos_tag`: @honnibal's perceptron POS tagger trained on ____ using `nltk.tag.perceptron.PerceptronTagger` with ____ settings/parameters
- `ne_tag`: Named entity tagger trained on ____ (is it ACE? If so, which ACE?) using `nltk.classify.maxent.MaxentClassifier` with ____ settings/parameters?
It would be great if anyone who knows the ____ information above could help answer this issue. And it'll be awesome if it gets documented somewhere so that we avoid another wave of https://news.ycombinator.com/item?id=10173669 and https://explosion.ai/blog/dead-code-should-be-buried ;P
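For context, all three models are loaded implicitly behind the standard convenience functions (the NE tagger surfaces as `nltk.ne_chunk`), so users hit them without ever seeing a model file. A quick sketch, assuming the `punkt`, `averaged_perceptron_tagger`, `maxent_ne_chunker` and `words` data packages are installed:

```python
import nltk

# Each call below silently loads one of the pre-trained models whose
# provenance this issue is asking about.
text = "NLTK was created at the University of Pennsylvania."

sentences = nltk.sent_tokenize(text)        # Punkt sentence tokenizer model
tokens = nltk.word_tokenize(sentences[0])
tagged = nltk.pos_tag(tokens)               # averaged perceptron POS model
tree = nltk.ne_chunk(tagged)                # MaxEnt named-entity chunker model
print(tree)
```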
@alvations I’d be really surprised if the Treebank and MUC6 chunker models still work, considering that pickling is so tightly coupled to the class definitions, and that at the time it would have been Python 2.4 or 2.6. They should be pickles of either the HMM and MEMM classes or the NEChunkParser, possibly along with a feature extractor. It’s possible the extractor might have just been a lambda, or it could have been a class from some long-abandoned code in contrib. Sorry if that’s not super helpful.
They were basic BIO style taggers, trained on the standard training sections of PTB and MUC6.
The hard part of creating these in the first place was really getting the HMM and MEMM code so the models pickle correctly, so presuming the current NLTK codebase produces objects that serialize ok, I don’t think there’s much (anything?) lost by just declaring bankruptcy on these models and re-training on newer datasets and keeping better records.
As this sort of demonstrates, serialized objects (in any language) are an awful model serialization format, but, yeah, we all do it.
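To make that coupling concrete, here's a tiny illustration (not NLTK-specific, names are made up) of why a pickled model breaks the moment the class it references is renamed, moved, or deleted:

```python
import pickle

# A pickle stores no code, only a (module, name) reference to the class.
class FeatureExtractor:              # stand-in for a long-gone contrib class
    def extract(self, token):
        return {"lower": token.lower()}

blob = pickle.dumps(FeatureExtractor())

del FeatureExtractor                 # simulate the class vanishing from the codebase
try:
    pickle.loads(blob)
except AttributeError as err:        # "Can't get attribute 'FeatureExtractor' ..."
    print(err)
```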
I noticed that, other than the `maxent_ne_chunker` used by `nltk.ne_chunk`, there are 2 other chunkers in `nltk_data`. It'll be good to understand what they are and whether they are still relevant to the current NLTK codebase. If they are still relevant, we should document what they're trained on and how they were trained. Otherwise, I think it's better to remove them from `nltk_data`.
@jfrazee It'll be great if you could help tell us more about these chunkers =)
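In the meantime, one way to check whether those old chunker pickles even deserialize under the current codebase is just to walk `nltk_data` and try loading them. A rough sketch, assuming the chunker packages have been unzipped into one of the directories on `nltk.data.path`:

```python
import os
import pickle

import nltk

# Try to unpickle every model found under a "chunkers/" directory in any
# nltk_data location. Pickles that reference classes no longer in the NLTK
# codebase will raise (typically AttributeError or ImportError) on load.
for data_dir in nltk.data.path:
    chunker_dir = os.path.join(data_dir, "chunkers")
    if not os.path.isdir(chunker_dir):
        continue  # not installed here, or still zipped up
    for root, _dirs, files in os.walk(chunker_dir):
        for name in sorted(files):
            if not name.endswith(".pickle"):
                continue
            try:
                with open(os.path.join(root, name), "rb") as fh:
                    obj = pickle.load(fh)
                print("OK  ", name, "->", type(obj).__module__ + "." + type(obj).__name__)
            except Exception as err:
                print("FAIL", name, "->", repr(err))
```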