2.3.0 models don't work as expected

How to reproduce the behaviour

After upgrading spaCy and the corresponding models to 2.3.0, the loaded models seem to have a very limited vocabulary:

import spacy
nlp = spacy.load('en_core_web_sm') # same with 'en_core_web_md' and 'en_core_web_lg'
len(nlp.vocab) # outputs 478

The output of [w.orth_ for w in nlp.vocab] (shortened):

['nuthin',
 'there',
 'ü.',
 '’nuff',
 'havin',
 "'bout",
 '’Cause',
 'Need',
 'Somethin',
 'gon',
 'N.C.',
 '\\n',
 ' ',
 'Sept.',
 'c.',
 'E.G.',
 'Mont.',
 'b.',
 ':-}',
 'got',
 'it',
 'Jr.',
 '=3',
 '>.>',
 'Calif.',
 ':}',
 'Ill.',
 "O'clock",
 "o'clock",
 'Mich.',
 'is',
 ':-o',
 'n.',
 'w/o',
 'Might',
 '>.<',
 ':))',
 ...]

Lexemes outside this list can still be accessed via e.g. nlp.vocab['aardvark'], but any workflow that operates on the contents of nlp.vocab is broken. Also, all lexemes have a prob of -20.0:

nlp.vocab['aardvark'].prob # -20.0
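
To illustrate why len(nlp.vocab) can be small while individual lookups still succeed, here is a toy model in plain Python. It is not spaCy code: OnDemandVocab and its dict entries are hypothetical, and the sketch only assumes that an unseen string gets an entry created on first lookup, which matches the behaviour reported above.

```python
# Toy model only: OnDemandVocab is hypothetical and is not part of spaCy.
class OnDemandVocab:
    """Stores only the entries it was preloaded with or has been asked for."""

    def __init__(self, preloaded):
        self._entries = {word: {"orth": word} for word in preloaded}

    def __getitem__(self, word):
        # An unseen string gets an entry created and cached on first lookup.
        return self._entries.setdefault(word, {"orth": word})

    def __len__(self):
        return len(self._entries)

    def __iter__(self):
        # Iteration only sees entries that already exist.
        return iter(self._entries.values())


vocab = OnDemandVocab(["nuthin", "there", "o'clock"])
size_before = len(vocab)          # only the preloaded entries
entry = vocab["aardvark"]         # lookup works and adds an entry
size_after = len(vocab)
words = [e["orth"] for e in vocab]
```

In this model, code that iterates over the vocab never sees a word until something has looked it up, which is the pattern that breaks workflows built on iterating nlp.vocab.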

Info about spaCy

  • spaCy version: 2.3.0
  • Platform: Linux-5.6.15-arch1-1-x86_64-with-arch
  • Python version: 3.7.3

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

1 reaction
adrianeboyd commented, Jun 25, 2020

Install spacy-lookups-data:

$ pip install spacy-lookups-data

Then you need only one extra line, provided you restructure the code slightly to iterate directly over the words with vectors rather than over the vocab:

import spacy

nlp = spacy.load("en_core_web_md")

# remove the empty placeholder prob table
nlp.vocab.lookups_extra.remove_table("lexeme_prob")

# when you access .prob for the first word, the whole table is loaded from spacy-lookups-data
to_check = [w for w in nlp.vocab.vectors if nlp.vocab[w].prob >= -10]

assert len(to_check) == 1663

If you save this model with nlp.to_disk(), the probability table is included. The next time you load it, you can skip the step where you drop the empty table, and it doesn’t matter whether spacy-lookups-data is installed.

This will be slightly slower than in v2.2. It takes longer to load the prob table (the initial model loading is now much faster, but the overall loading time for model plus prob table is slightly higher), and it is slightly slower to access an individual lex.prob value in the lookup table than to read the v2.2 values stored directly in the lexemes.
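
The load-on-first-access behaviour described above can be sketched with a small stand-alone example. This is a toy model under stated assumptions, not spaCy's implementation: LazyProbTable, load_probs and the probability values are all made up for illustration, and the -20.0 default is borrowed from the value reported in the issue.

```python
# Toy model only: LazyProbTable and load_probs are hypothetical, not spaCy APIs.
class LazyProbTable:
    """Probability table that is loaded in full on the first lookup."""

    def __init__(self, loader, default=-20.0):
        self._loader = loader    # callable returning {word: log-prob}
        self._table = None       # nothing loaded yet
        self._default = default  # value returned for words not in the table
        self.load_count = 0

    def get(self, word):
        if self._table is None:
            # The first access pays the cost of loading the whole table.
            self._table = self._loader()
            self.load_count += 1
        # Every access is a table lookup rather than a stored attribute.
        return self._table.get(word, self._default)


def load_probs():
    # Stands in for reading a large table from disk (made-up values).
    return {"the": -3.5, "is": -4.5, "aardvark": -15.0}


table = LazyProbTable(load_probs)
candidates = ["the", "is", "aardvark", "zyzzyva"]
frequent = [w for w in candidates if table.get(w) >= -10]
```

The table is loaded once no matter how many words are checked, so the extra cost is concentrated in the first lookup, with a small per-lookup overhead afterwards.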
