
KeyError when removing stop words in canopy_index

See original GitHub issue

While training a model that used a canopy index, I ran into this exception:

/Users/brian/workspace/advisor-site/py-services/property-matching/venv/bin/python /Users/brian/workspace/advisor-site/py-services/property-matching/src/train.py
Traceback (most recent call last):
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/src/train.py", line 28, in <module>
    main()
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/src/train.py", line 24, in main
    train_property_deduper(aa_properties_df)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/src/train.py", line 15, in train_property_deduper
    deduper.train_deduper_console(config.get_config(), aa_properties_parsed_data)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/src/deduper.py", line 10, in train_deduper_console
    trainer.train(deduper_data)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/src/deduper.py", line 31, in train
    self.deduper.prepare_training(data, tf, sample_size=self.config.training_sample_size)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/api.py", line 1298, in prepare_training
    self._sample(data, sample_size, blocked_proportion)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/api.py", line 1324, in _sample
    self.active_learner = self.ActiveLearner(self.data_model,
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/labeler.py", line 429, in __init__
    self.blocker = DedupeBlockLearner(data_model,
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/labeler.py", line 266, in __init__
    self._index_predicates(examples_to_index)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/labeler.py", line 276, in _index_predicates
    blocker.index(unique_fields, field)
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/blocking.py", line 144, in index
    index.initSearch()
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/tfidf.py", line 29, in initSearch
    self._index.initSearch()
  File "/Users/brian/workspace/advisor-site/py-services/property-matching/venv/lib/python3.8/site-packages/dedupe/canopy_index.py", line 36, in initSearch
    term = self.lexicon._words[wid]
KeyError: 1

Process finished with exit code 1

This started happening after we fixed a previous issue with this commit.

I think this line really shouldn't be a self.lexicon._words.pop(wid) call, since popping removes the entry and leads to a KeyError later.

I am using Python 3.8.10 on macOS with dedupe 2.0.13. One caveat: I patched my current version with the changes for the prior issue mentioned above, so I don't have a clean install of dedupe from the latest main branch. There is always a chance that my edits left me with a broken package (I don't think that is the case, though... hopefully).

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
fgregg commented, Jul 6, 2022

@hlra thanks for your report. This is a slightly different error, but we can track it here!

1 reaction
oreccb commented, Jul 5, 2022

I pulled in the latest canopy_index.py file.

I agree with you, though: I don't see how this can occur. Since I encountered this, I have tweaked my model and it now uses different predicates, so this issue isn't currently affecting me. I don't want to waste your time tracking down something that might be a mess I made locally. I'm going to close this for now, and if I encounter it later with a fresh install or have new information, I'll reopen.
