2.3.0 models don't work as expected

How to reproduce the behaviour

After upgrading spaCy and the corresponding models to 2.3.0, the loaded models seem to have a very limited vocabulary:

import spacy
nlp = spacy.load('en_core_web_sm') # same with 'en_core_web_md' and 'en_core_web_lg'
len(nlp.vocab) # outputs 478

The output of [w.orth_ for w in nlp.vocab] (shortened):

['nuthin',
 'there',
 'ü.',
 '’nuff',
 'havin',
 "'bout",
 '’Cause',
 'Need',
 'Somethin',
 'gon',
 'N.C.',
 '\\n',
 ' ',
 'Sept.',
 'c.',
 'E.G.',
 'Mont.',
 'b.',
 ':-}',
 'got',
 'it',
 'Jr.',
 '=3',
 '>.>',
 'Calif.',
 ':}',
 'Ill.',
 "O'clock",
 "o'clock",
 'Mich.',
 'is',
 ':-o',
 'n.',
 'w/o',
 'Might',
 '>.<',
 ':))',
 ...]

Lexemes outside of this list can still be accessed via e.g. nlp.vocab['aardvark'], but any workflow that requires operating on nlp.vocab as a whole is broken. Also, all lexemes have a prob of -20.0:

nlp.vocab['aardvark'].prob # -20.0

Info about spaCy

spaCy version: 2.3.0
Platform: Linux-5.6.15-arch1-1-x86_64-with-arch
Python version: 3.7.3

Top GitHub Comments

adrianeboyd commented, Jun 25, 2020

Install spacy-lookups-data:

$ pip install spacy-lookups-data

Then you need one extra line if you restructure it slightly to directly iterate over words with vectors rather than the vocab:

import spacy

nlp = spacy.load("en_core_web_md")

# remove the empty placeholder prob table
nlp.vocab.lookups_extra.remove_table("lexeme_prob")

# when you access .prob for the first word, the whole table is loaded from spacy-lookups-data
to_check = [w for w in nlp.vocab.vectors if nlp.vocab[w].prob >= -10]

assert len(to_check) == 1663
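
Note that iterating over nlp.vocab.vectors yields integer keys rather than strings. As a minimal sketch of the same filter that returns readable words, the helper below (frequent_vector_words is a hypothetical name, not part of spaCy) takes any object that behaves like nlp.vocab and assumes that vocab[key] returns a lexeme with .prob and .text attributes:

```python
def frequent_vector_words(vocab, min_prob=-10.0):
    """Return the text of each word that has a vector and a log prob >= min_prob.

    Hypothetical helper, not part of spaCy. `vocab` is assumed to behave like
    `nlp.vocab`: iterating over `vocab.vectors` yields integer keys, and
    `vocab[key]` returns a lexeme with `.prob` and `.text` attributes.
    """
    words = []
    for key in vocab.vectors:
        lexeme = vocab[key]
        # The first .prob access is what triggers loading of the prob table.
        if lexeme.prob >= min_prob:
            words.append(lexeme.text)
    return words
```

With the model loaded and the placeholder table removed as above, this would be called as frequent_vector_words(nlp.vocab).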

If you save this model with nlp.to_disk(), the probability table is included and the next time you load it, you can skip the step where you drop the empty table and it doesn’t matter whether spacy-lookups-data is installed.
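
A rough sketch of that save step, under some assumptions: save_with_prob_table and its probe_word parameter are hypothetical names, the placeholder table is assumed to be empty (length 0) until the real data is loaded, and spacy-lookups-data must be installed the first time this runs. The emptiness check is there so the helper can also be called on a model that already carries the full table:

```python
def save_with_prob_table(nlp, out_dir, probe_word="the"):
    """Load the full prob table into `nlp` and save the model to `out_dir`.

    Hypothetical helper, not part of spaCy. Assumes the placeholder
    "lexeme_prob" table is empty until the real data has been loaded.
    """
    lookups = nlp.vocab.lookups_extra
    # Drop the placeholder only if it is present and empty, so that a model
    # which already includes the full table is left untouched.
    if lookups.has_table("lexeme_prob") and len(lookups.get_table("lexeme_prob")) == 0:
        lookups.remove_table("lexeme_prob")
    # Accessing .prob once loads the whole table (see the comment above).
    _ = nlp.vocab[probe_word].prob
    nlp.to_disk(out_dir)
    return out_dir
```

The saved directory can then be loaded with spacy.load(out_dir) in the usual way.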

This will be slightly slower than in v2.2. It takes longer to load the prob table (although the initial model loading is now much faster, the overall loading time for model + prob table is slightly higher), and it is slightly slower to access an individual lex.prob value in the lookup table than the v2.2 values stored directly in the lexemes.
