KeyError when removing stop words in canopy_index
When I was training a model which used a canopy index, I ran into this exception:
/Users/brian/workspace/advisor-site/py-services/property-matching/venv/bin/python /Users/brian/workspace/advisor-site/py-services/property-matching/src/train.py
Traceback (most recent call last):
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/src/train.py", line 28, in <module>
    main()
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/src/train.py", line 24, in main
    train_property_deduper(aa_properties_df)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/src/train.py", line 15, in train_property_deduper
    deduper.train_deduper_console(config.get_config(), aa_properties_parsed_data)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/src/deduper.py", line 10, in train_deduper_console
    trainer.train(deduper_data)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/src/deduper.py", line 31, in train
    self.deduper.prepare_training(data, tf, sample_size=self.config.training_sample_size)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/api.py", line 1298, in prepare_training
    self._sample(data, sample_size, blocked_proportion)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/api.py", line 1324, in _sample
    self.active_learner = self.ActiveLearner(self.data_model,
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/labeler.py", line 429, in __init__
    self.blocker = DedupeBlockLearner(data_model,
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/labeler.py", line 266, in __init__
    self._index_predicates(examples_to_index)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/labeler.py", line 276, in _index_predicates
    blocker.index(unique_fields, field)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/blocking.py", line 144, in index
    index.initSearch()
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/tfidf.py", line 29, in initSearch
    self._index.initSearch()
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/canopy_index.py", line 36, in initSearch
    term = self.lexicon._words[wid]
KeyError: 1
Process finished with exit code 1
This started happening after we fixed a previous issue with this commit.
I think this line should really be a self.lexicon._words.pop(wid)
call so the entry isn’t removed and leads to a key error later.
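For anyone trying to reproduce this outside of dedupe, here is a minimal sketch of how a KeyError like the one above can arise when a word id is removed from a mapping and then looked up again. It is not dedupe's code: the plain dicts `words` and `wids` only stand in for `self.lexicon._words` and its reverse mapping, and `remove_stop_words` is a hypothetical helper. It assumes the same wid can reach the removal step more than once, which has not been confirmed as the cause here.

```python
# Plain dicts standing in for the lexicon's two mappings (hypothetical data).
words = {1: "the", 2: "canopy"}   # wid -> word
wids = {"the": 1, "canopy": 2}    # word -> wid


def remove_stop_words(words, wids, stop_wids):
    """Remove each stop word from both mappings, tolerating repeat calls."""
    removed = []
    for wid in stop_wids:
        # pop() with a default never raises, even if wid is already gone.
        word = words.pop(wid, None)
        if word is None:
            continue
        wids.pop(word, None)
        removed.append(word)
    return removed


first = remove_stop_words(words, wids, [1])   # removes wid 1 from both dicts
second = remove_stop_words(words, wids, [1])  # wid 1 is already gone, no error

# A plain subscript lookup on the removed wid fails the same way the
# traceback does.
lookup_error = None
try:
    words[1]
except KeyError as exc:
    lookup_error = exc
```

The point of the sketch is only the difference between `words[wid]`, which raises once the key is gone, and `words.pop(wid, None)`, which does not.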
I am using Python 3.8.10 on macOS with dedupe-2.0.13. One caveat: I patched my installed version with the changes for the prior issue mentioned above, so I don’t have a clean install of dedupe from the latest main branch. There is always a chance that my edits broke the package (I don’t think that is the case though… hopefully).
Issue Analytics
- State:
- Created a year ago
- Comments:8 (3 by maintainers)
Top GitHub Comments
@hlra thanks for your report. This is a slightly different error, but we can track it here!
I pulled in the latest canopy_index.py file.
I agree with you though, I don’t see how this can occur. Since I encountered this, I have tweaked my model and it uses different predicates so this issue isn’t currently affecting me. I don’t want to waste your time tracking down something that might be some mess I made locally. I’m going to close this for now and if I encounter it later with a fresh install or have new information, I’ll reopen.