Model differences
For English alone, there are three available models. However, there isn't much detail on how they differ:
| Model            | Size   | Contents                              |
|------------------|--------|---------------------------------------|
| en_core_web_sm   | 50 MB  | Vocab, syntax, entities, word vectors |
| en_core_web_md   | 1 GB   | Vocab, syntax, entities, word vectors |
| en_depent_web_md | 328 MB | Vocab, syntax, entities               |
Could you provide some details on their accuracy, entity type recognition, and suitable use cases? That would be helpful.
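As an illustration, the feature table above can be turned into a small lookup that picks the lightest model covering a given set of needs. The selection logic and data structure here are hypothetical; the model names and sizes come from the table.

```python
# Feature sets as listed in the table above (illustrative only).
MODELS = {
    "en_core_web_sm":   {"size_mb": 50,   "features": {"vocab", "syntax", "entities", "word vectors"}},
    "en_core_web_md":   {"size_mb": 1024, "features": {"vocab", "syntax", "entities", "word vectors"}},
    "en_depent_web_md": {"size_mb": 328,  "features": {"vocab", "syntax", "entities"}},
}

def smallest_model_with(required):
    """Return the smallest listed model providing all required features."""
    candidates = [
        (info["size_mb"], name)
        for name, info in MODELS.items()
        if set(required) <= info["features"]
    ]
    return min(candidates)[1] if candidates else None

# Example: the smallest model that ships word vectors.
choice = smallest_model_with({"word vectors", "entities"})
```

Size alone doesn't settle the question of accuracy, of course, which is what the maintainers address below.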
Issue Analytics
- Created 6 years ago
- Comments: 8 (2 by maintainers)
Top GitHub Comments
Yes, that’s a good idea! I think the model releases would be a good place for this info as well, and it could be combined with the accuracy numbers.
Just edited the release notes of the new French model as an example: https://github.com/explosion/spacy-models/releases/tag/fr_depvec_web_lg-1.0.0
Will start updating the other models as well.
Thanks for opening this issue – since this question has come up before, I agree that this should definitely be more clear in the docs. I’ll just post all notes here so we can discuss them and add them to the docs.
Differences and accuracy
Most differences are obviously statistical. In general, we do expect larger models to be “better” and more accurate overall. Ultimately, it depends on your use case and requirements. People have reported pretty good results with the smaller model, so we usually recommend trying that first, writing a few tests specific to your use case and then comparing the results to a larger model, if necessary.
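A minimal sketch of such a use-case test: score each candidate model's entity predictions against a small hand-annotated gold set, and keep the smaller model unless the larger one is clearly better. The prediction lists below are stand-ins for real model output, and the 0.05 threshold is an arbitrary example.

```python
def entity_f1(gold, predicted):
    """F1 over exact (start, end, label) entity spans."""
    gold, predicted = set(gold), set(predicted)
    if not gold and not predicted:
        return 1.0
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hand-annotated gold entities for a few sentences from your own domain.
gold = [(0, 5, "ORG"), (16, 22, "GPE"), (30, 34, "DATE")]

# Stand-ins for the spans each candidate model predicted on those sentences.
small_model_pred = [(0, 5, "ORG"), (16, 22, "GPE")]
large_model_pred = [(0, 5, "ORG"), (16, 22, "GPE"), (30, 34, "DATE")]

f1_small = entity_f1(gold, small_model_pred)
f1_large = entity_f1(gold, large_model_pred)

# Prefer the smaller model unless the larger one is clearly better.
choice = "small" if f1_large - f1_small < 0.05 else "large"
```

The same harness works for tags or dependencies; only the span-extraction step changes.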
We’re also going to compile a better list of accuracy numbers and distribute them with each model, for example in its `meta.json`:
- en_core_web_sm
- en_core_web_md
- en_depent_web_md
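Once accuracy numbers ship with each model, they can be read programmatically. The sketch below parses a toy `meta.json`; the shape and the accuracy keys (`ents_f` etc.) are assumptions for illustration, not values from a real model.

```python
import json

# A toy meta.json in the general shape spaCy model packages ship.
# The accuracy figures and key names here are made up for the example.
meta_json = """
{
  "lang": "en",
  "name": "core_web_sm",
  "version": "1.2.0",
  "accuracy": {"ents_f": 84.6, "uas": 91.5, "tags_acc": 97.0}
}
"""

meta = json.loads(meta_json)
model_name = "{}_{}".format(meta["lang"], meta["name"])
ner_f = meta.get("accuracy", {}).get("ents_f")
```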
Model releases and release notes
All models are published as GitHub releases and their release notes contain more detailed info. Going forward, we’ll also add a “Changes” section to new model releases that’ll list all updates since the last release, to give you a better idea of how that model is different. You can see an example of that in the pre-release of an alpha model we’re currently testing.
Model naming conventions
In general, spaCy expects all model packages to follow the naming convention of `[lang]_[name]`. For our models, we also chose to divide the name into three components:

- Type: `core` for a general-purpose model with vocabulary, syntax, entities and word vectors, or `depent` for only vocab, syntax and entities.
- Genre: `web` for web text, `news` for news text.
- Size: `sm`, `md` or `lg`.

For example, `en_depent_web_md` is a medium-sized English model trained on written web text (blogs, news, comments) that includes vocabulary, syntax and entities.

I hope those naming conventions aren’t too confusing – but we felt it was necessary to decide on a scheme like this upfront to make sure we don’t end up with confusing or indistinguishable model names. Especially since there will be many more models in the future – either published by us, or by the community. (For example, if you were to train a Spanish NER model on dialog text, you’d call it `es_ent_dialog_md` and it’d be clear what it is.)

✅ TODO: add accuracy numbers to the models’ `meta.json` files, docs and releases.
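The naming scheme described above is regular enough to split mechanically. This is a hypothetical helper, not part of spaCy's API; the component labels (type, genre, size) follow the convention described in the comment above.

```python
def parse_model_name(name):
    """Split a model name like en_depent_web_md into its components."""
    lang, model_type, genre, size = name.split("_")
    return {"lang": lang, "type": model_type, "genre": genre, "size": size}

parsed = parse_model_name("en_depent_web_md")
```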