Workaround for spacy.en.English() load time?
Our file includes the following:
from spacy.en import English
nlp = English()
The English constructor takes quite some time, depending on the machine.
Is there some workaround to speed it up, or something we’re doing wrong?
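Whatever the load cost is, it only needs to be paid once per process. A minimal sketch of that pattern, assuming the spacy.en API from the question (the process() helper is just for illustration):

from spacy.en import English

nlp = English()  # slow: pay the model-load cost once, at startup

def process(texts):
    # reuse the already-loaded pipeline for every document
    return [nlp(text) for text in texts]

Constructing English() inside a per-document function repeats the full model load on every call.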
Issue Analytics
- Created: 8 years ago
- Comments: 12 (8 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Solved!
On my laptop models now load in 13s, down from 90s.
This turned out to be a stupid problem =/. At some point in the many revisions of this code, I lost an important patch: before loading the model, I wasn’t pre-sizing the hash table! If the table is sized exactly up front, insertions are sequential and never trigger a resize. But resizing is very expensive, because we use open addressing and linear probing: when the hash table is resized, all keys must be reinserted.
There are 9 million entries in the table, so this is very expensive.
I feel stupid for not realising that such an extreme loading time had to mean something was wrong. But the important thing is: this will be fixed in the next version.
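To make the cost concrete, here is a minimal sketch in plain Python (not the actual Cython implementation) of an open-addressing table with linear probing. The point is that _resize() must reinsert every existing key, because each key’s slot depends on the table size, so a reserve() call up front — like the lost patch — makes the bulk load resize-free:

class OpenAddressingTable:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = [None] * capacity  # (key, value) pairs or None
        self.filled = 0

    def reserve(self, n):
        # Pre-size for n entries (keeping the load factor under ~0.7)
        # so that no resize ever fires during the bulk insert.
        needed = int(n / 0.7) + 1
        if needed > self.capacity:
            self._resize(needed)

    def _resize(self, new_capacity):
        # The expensive part: every existing entry must be reinserted,
        # since its slot is computed from the (new) table size.
        old_slots = self.slots
        self.capacity = new_capacity
        self.slots = [None] * new_capacity
        self.filled = 0
        for entry in old_slots:
            if entry is not None:
                self.insert(*entry)

    def insert(self, key, value):
        if (self.filled + 1) > 0.7 * self.capacity:
            self._resize(self.capacity * 2)  # rehashes everything so far
        i = hash(key) % self.capacity
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.capacity  # linear probing
        if self.slots[i] is None:
            self.filled += 1
        self.slots[i] = (key, value)

# Without reserve(), a bulk load grows the table by doubling: roughly
# twenty resizes on the way to 9 million entries, each re-hashing
# everything inserted so far. With reserve(n), the load never resizes.
table = OpenAddressingTable()
table.reserve(100000)
for k in range(100000):
    table.insert(k, k)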
I’d love to have more insight into why it takes so long, and what the variance is due to. If you do any benchmarking, let me know! The part that loads the parser model is here:
https://github.com/honnibal/thinc/blob/master/thinc/model.pyx#L89
What we’re doing is looping over successive calls to Reader.read:
https://github.com/honnibal/thinc/blob/master/thinc/model.pyx#L155
The memory is being allocated via this cymem.Pool class:
https://github.com/honnibal/cymem/blob/master/cymem/cymem.pyx#L31
The model.load code is called by spaCy here:
https://github.com/honnibal/spaCy/blob/master/spacy/syntax/parser.pyx#L85
You could verify that this part is indeed the slow part for you by loading nlp = English(parser=False). It’d be good to know whether it’s really the disk reads that are slow, or something else we could do more about, like the hash table insertions or the memory allocations.
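A quick timing sketch for that check, using only the parser=False flag mentioned above (timings will vary by machine and disk):

import time
from spacy.en import English

start = time.time()
nlp_full = English()
print('full pipeline: %.1fs' % (time.time() - start))

start = time.time()
nlp_light = English(parser=False)
print('parser disabled: %.1fs' % (time.time() - start))

If the difference between the two numbers accounts for most of the load time, the model.load path linked above is the bottleneck; if both are slow, look instead at the hash table insertions or the memory allocations.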
But, don’t spend too long on it 😃. As I said, I hope to be replacing this soon.