Model differences
For English alone, there are three available models. However, there isn't much detail on how they differ:
| Model            | Size   | Contents                              |
|------------------|--------|---------------------------------------|
| en_core_web_sm   | 50 MB  | Vocab, syntax, entities, word vectors |
| en_core_web_md   | 1 GB   | Vocab, syntax, entities, word vectors |
| en_depent_web_md | 328 MB | Vocab, syntax, entities               |
Could you provide some details on their accuracy, entity type recognition, and suitable use cases? That would be helpful.
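As an illustration, the feature table above can be turned into a small lookup that picks the lightest model covering a given set of needs. The selection logic and data structure here are hypothetical; the model names and sizes come from the table.

```python
# Feature sets as listed in the table above (illustrative only).
MODELS = {
    "en_core_web_sm":   {"size_mb": 50,   "features": {"vocab", "syntax", "entities", "word vectors"}},
    "en_core_web_md":   {"size_mb": 1024, "features": {"vocab", "syntax", "entities", "word vectors"}},
    "en_depent_web_md": {"size_mb": 328,  "features": {"vocab", "syntax", "entities"}},
}

def smallest_model_with(required):
    """Return the smallest listed model providing all required features."""
    candidates = [
        (info["size_mb"], name)
        for name, info in MODELS.items()
        if set(required) <= info["features"]
    ]
    return min(candidates)[1] if candidates else None

# Example: the smallest model that ships word vectors.
choice = smallest_model_with({"word vectors", "entities"})
```

Size alone doesn't settle the question of accuracy, of course, which is what the maintainers address below.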
Issue Analytics
- Created 6 years ago
- Comments: 8 (2 by maintainers)
Top GitHub Comments
Yes, that’s a good idea! I think the model releases would be a good place for this info as well, and it could be combined with the accuracy numbers.
Just edited the release notes of the new French model as an example: https://github.com/explosion/spacy-models/releases/tag/fr_depvec_web_lg-1.0.0
Will start updating the other models as well.
Thanks for opening this issue – since this question has come up before, I agree that this should definitely be more clear in the docs. I’ll just post all notes here so we can discuss them and add them to the docs.
Differences and accuracy
Most differences are obviously statistical. In general, we do expect larger models to be “better” and more accurate overall. Ultimately, it depends on your use case and requirements. People have reported pretty good results with the smaller model, so we usually recommend trying that first, writing a few tests specific to your use case and then comparing the results to a larger model, if necessary.
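A minimal sketch of such a use-case test: score each candidate model's entity predictions against a small hand-annotated gold set, and keep the smaller model unless the larger one is clearly better. The prediction lists below are stand-ins for real model output, and the 0.05 threshold is an arbitrary example.

```python
def entity_f1(gold, predicted):
    """F1 over exact (start, end, label) entity spans."""
    gold, predicted = set(gold), set(predicted)
    if not gold and not predicted:
        return 1.0
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hand-annotated gold entities for a few sentences from your own domain.
gold = [(0, 5, "ORG"), (16, 22, "GPE"), (30, 34, "DATE")]

# Stand-ins for the spans each candidate model predicted on those sentences.
small_model_pred = [(0, 5, "ORG"), (16, 22, "GPE")]
large_model_pred = [(0, 5, "ORG"), (16, 22, "GPE"), (30, 34, "DATE")]

f1_small = entity_f1(gold, small_model_pred)
f1_large = entity_f1(gold, large_model_pred)

# Prefer the smaller model unless the larger one is clearly better.
choice = "small" if f1_large - f1_small < 0.05 else "large"
```

The same harness works for tags or dependencies; only the span-extraction step changes.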
We’re also going to compile a better list of accuracy numbers and distribute them with each model, for example in its `meta.json`:
- en_core_web_sm
- en_core_web_md
- en_depent_web_md
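Once accuracy numbers ship with each model, they can be read programmatically. The sketch below parses a toy `meta.json`; the shape and the accuracy keys (`ents_f` etc.) are assumptions for illustration, not values from a real model.

```python
import json

# A toy meta.json in the general shape spaCy model packages ship.
# The accuracy figures and key names here are made up for the example.
meta_json = """
{
  "lang": "en",
  "name": "core_web_sm",
  "version": "1.2.0",
  "accuracy": {"ents_f": 84.6, "uas": 91.5, "tags_acc": 97.0}
}
"""

meta = json.loads(meta_json)
model_name = "{}_{}".format(meta["lang"], meta["name"])
ner_f = meta.get("accuracy", {}).get("ents_f")
```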
Model releases and release notes
All models are published as GitHub releases and their release notes contain more detailed info. Going forward, we’ll also add a “Changes” section to new model releases that’ll list all updates since the last release, to give you a better idea of how that model is different. You can see an example of that in the pre-release of an alpha model we’re currently testing.
Model naming conventions
In general, spaCy expects all model packages to follow the naming convention of `[lang]_[name]`. For our models, we also chose to divide the name into three components:

- Type: `core` for a general-purpose model with vocabulary, syntax, entities and word vectors, or `depent` for only vocab, syntax and entities.
- Genre: `web` for web text, `news` for news text.
- Size: `sm`, `md` or `lg`.

For example, `en_depent_web_md` is a medium-sized English model trained on written web text (blogs, news, comments) that includes vocabulary, syntax and entities.

I hope those naming conventions aren’t too confusing – but we felt it was necessary to decide on a scheme like this upfront to make sure we don’t end up with confusing or indistinguishable model names. Especially since there will be many more models in the future – either published by us, or by the community. (For example, if you were to train a Spanish NER model on dialog text, you’d call it `es_ent_dialog_md` and it’d be clear what it is.)

✅ TODO: add accuracy numbers to the models’ `meta.json` files, docs and releases.
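The naming scheme described above is regular enough to split mechanically. This is a hypothetical helper, not part of spaCy's API; the component labels (type, genre, size) follow the convention described in the comment above.

```python
def parse_model_name(name):
    """Split a model name like en_depent_web_md into its components."""
    lang, model_type, genre, size = name.split("_")
    return {"lang": lang, "type": model_type, "genre": genre, "size": size}

parsed = parse_model_name("en_depent_web_md")
```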