Documentation of how and what of NLTK pre-trained models
There are several pre-trained models that NLTK provides, and it is unclear:
- what the models are trained on
- how the models are trained
These pre-trained models include:
- `sent_tokenize`: Punkt sentence tokenizers trained on _____ using `nltk.tokenize.punkt.PunktSentenceTokenizer` with ____ settings/parameters
- `pos_tag`: @honnibal's perceptron POS tagger trained on ____ using `nltk.tag.perceptron.PerceptronTagger` with ____ settings/parameters
- `ne_tag`: Named entity tagger trained on ____ (is it ACE? If so, which ACE?) using `nltk.classify.maxent.MaxentClassifier` with ____ settings/parameters?
It would be great if anyone who knows the ____ information above could help answer this issue. And it'll be awesome if it gets documented somewhere so that we avoid another wave of https://news.ycombinator.com/item?id=10173669 and https://explosion.ai/blog/dead-code-should-be-buried ;P
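For context, all three models are loaded implicitly behind the standard convenience functions (the NE tagger surfaces as `nltk.ne_chunk`), so users hit them without ever seeing a model file. A quick sketch, assuming the `punkt`, `averaged_perceptron_tagger`, `maxent_ne_chunker` and `words` data packages are installed:

```python
import nltk

# Each call below silently loads one of the pre-trained models whose
# provenance this issue is asking about.
text = "NLTK was created at the University of Pennsylvania."

sentences = nltk.sent_tokenize(text)        # Punkt sentence tokenizer model
tokens = nltk.word_tokenize(sentences[0])
tagged = nltk.pos_tag(tokens)               # averaged perceptron POS model
tree = nltk.ne_chunk(tagged)                # MaxEnt named-entity chunker model
print(tree)
```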
@alvations I’d be really surprised if the Treebank and MUC6 chunker models still work, considering that pickling is so tightly coupled to the class definitions, and that at the time it would have been Python 2.4 or 2.6. They should be pickles of either the HMM and MEMM classes or the NEChunkParser, possibly along with a feature extractor. It’s possible the extractor might have just been a lambda, or it could have been a class from some long-abandoned code in contrib. Sorry if that’s not super helpful.
They were basic BIO style taggers, trained on the standard training sections of PTB and MUC6.
The hard part of creating these in the first place was really getting the HMM and MEMM code so the models pickle correctly, so presuming the current NLTK codebase produces objects that serialize ok, I don’t think there’s much (anything?) lost by just declaring bankruptcy on these models and re-training on newer datasets and keeping better records.
As this sort of demonstrates, serialized objects (in any language) are an awful model serialization format, but, yeah, we all do it.
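To make that coupling concrete, here's a tiny illustration (not NLTK-specific, names are made up) of why a pickled model breaks the moment the class it references is renamed, moved, or deleted:

```python
import pickle

# A pickle stores no code, only a (module, name) reference to the class.
class FeatureExtractor:              # stand-in for a long-gone contrib class
    def extract(self, token):
        return {"lower": token.lower()}

blob = pickle.dumps(FeatureExtractor())

del FeatureExtractor                 # simulate the class vanishing from the codebase
try:
    pickle.loads(blob)
except AttributeError as err:        # "Can't get attribute 'FeatureExtractor' ..."
    print(err)
```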
I noticed that, other than the `maxent_ne_chunker` used by `nltk.ne_chunk`, there are 2 other chunkers in `nltk_data`. It'll be good to understand what they are and whether they are still relevant to the current NLTK codebase. If they are still relevant, we should document what they're trained on and how they were trained. Otherwise, I think it's better to remove them from `nltk_data`.
@jfrazee It'll be great if you could help tell us more about these chunkers =)
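In the meantime, one way to check whether those old chunker pickles even deserialize under the current codebase is just to walk `nltk_data` and try loading them. A rough sketch, assuming the chunker packages have been unzipped into one of the directories on `nltk.data.path`:

```python
import os
import pickle

import nltk

# Try to unpickle every model found under a "chunkers/" directory in any
# nltk_data location. Pickles that reference classes no longer in the NLTK
# codebase will raise (typically AttributeError or ImportError) on load.
for data_dir in nltk.data.path:
    chunker_dir = os.path.join(data_dir, "chunkers")
    if not os.path.isdir(chunker_dir):
        continue  # not installed here, or still zipped up
    for root, _dirs, files in os.walk(chunker_dir):
        for name in sorted(files):
            if not name.endswith(".pickle"):
                continue
            try:
                with open(os.path.join(root, name), "rb") as fh:
                    obj = pickle.load(fh)
                print("OK  ", name, "->", type(obj).__module__ + "." + type(obj).__name__)
            except Exception as err:
                print("FAIL", name, "->", repr(err))
```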