
Documentation of how and what of NLTK pre-trained models


NLTK provides several pre-trained models, and it is unclear:

  • what the models are trained on
  • how the models are trained

These pre-trained models include:

  • sent_tokenize: Punkt sentence tokenizers trained on _____ using nltk.tokenize.punkt.PunktSentenceTokenizer with ____ settings/parameters

  • pos_tag: @honnibal's perceptron POS tagger trained on ____ using nltk.tag.perceptron.PerceptronTagger with ____ settings/parameters

  • ne_tag: Named entity tagger trained on ____ (is it ACE? If so, which ACE?) using nltk.classify.maxent.MaxentClassifier with ____ settings/parameters?

It would be great if anyone who knows the ____ information above could help answer this issue. And it’ll be awesome if it gets documented somewhere so that we avoid another wave of https://news.ycombinator.com/item?id=10173669 and https://explosion.ai/blog/dead-code-should-be-buried ;P

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

jfrazee commented, Aug 20, 2017 (1 reaction)

@alvations I’d be really surprised if the Treebank and MUC6 chunker models still work, considering that pickling is so tightly coupled to the class definitions, and that at the time it would have been Python 2.4 or 2.6. They should be pickles of either the HMM and MEMM classes or the NEChunkParser, possibly along with a feature extractor. The extractor might have just been a lambda, or it could have been a class from some long-abandoned code in contrib. Sorry if that’s not super helpful.

They were basic BIO style taggers, trained on the standard training sections of PTB and MUC6.
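As a small illustration of what a "basic BIO style tagger" produces, the hypothetical helper below (not NLTK's actual code) converts chunk spans into per-token B-/I-/O labels:

```python
def to_bio(tokens, spans, label):
    """Convert (start, end) token spans into BIO tags: B- marks the
    beginning of a chunk, I- its continuation, O everything outside."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["Pierre", "Vinken", "joined", "the", "board"]
tags = to_bio(tokens, [(0, 2)], "PER")
# tags == ["B-PER", "I-PER", "O", "O", "O"]
```

A tagger trained on PTB or MUC6 in this scheme predicts one such tag per token, and contiguous B-/I- runs are then read back as chunks.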

The hard part of creating these in the first place was really getting the HMM and MEMM code into a state where the models pickle correctly, so presuming the current NLTK codebase produces objects that serialize OK, I don’t think there’s much (anything?) lost by just declaring bankruptcy on these models, re-training on newer datasets, and keeping better records.

As this sort of demonstrates, serialized objects (in any language) are an awful model serialization format, but, yeah, we all do it.
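The coupling between pickles and class definitions can be demonstrated with a tiny stand-in (a hypothetical FeatureExtractor class, not actual NLTK code):

```python
import pickle

# Hypothetical stand-in for a chunker's feature extractor class.
class FeatureExtractor:
    def __init__(self, window=2):
        self.window = window

blob = pickle.dumps(FeatureExtractor(window=3))

# The pickle stores the class's import path, not its code: the
# bytestream references FeatureExtractor by name only.
assert b"FeatureExtractor" in blob

# If the class is later renamed, moved, or deleted (as with
# long-abandoned contrib code), loading fails even though the
# serialized bytes themselves are intact.
del FeatureExtractor
load_failed = False
try:
    pickle.loads(blob)
except AttributeError:
    load_failed = True
assert load_failed
```

This is why a pickle from the Python 2.4/2.6 era is unlikely to load against today's class hierarchy, and why re-training with better records is the cleaner path.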

alvations commented, Aug 20, 2017 (1 reaction)

I noticed that, other than the maxent_ne_chunker used by nltk.ne_chunk, there are 2 other chunkers in nltk_data:

It’ll be good to understand what they are and whether they’re still relevant to the current NLTK codebase.

If they are still relevant, we should document what they’re trained on and how they were trained. Otherwise, I think it’s better to remove them from nltk_data.

@jfrazee It’ll be great if you could help to tell us more about these chunkers =)
