AssertionError in KB.load_bulk

I generated entities and aliases files from a Wikipedia dump and loaded them into a KnowledgeBase (KB). I saved the KB using .dump(), but when I load it again using load_bulk() it throws this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "kb.pyx", line 356, in spacy.kb.KnowledgeBase.load_bulk
  File "kb.pyx", line 409, in spacy.kb.KnowledgeBase.load_bulk
AssertionError

I went through the code and saw that the assertion fails because the number of entities loaded is not the same as kb.get_size_entities(). But I don’t understand why, since I am not doing anything beyond kb.dump and kb.load_bulk.

Help would be appreciated. Thanks!

Environment

Operating System: Ubuntu 18.04.3 LTS
Python Version Used: 3.8.3
spaCy Version Used: 2.3.2

Top GitHub Comments

nlp-sudo commented, Oct 22, 2020

Sorry, I didn’t update the issue, but I have already solved it.

FYI: I did a binary search to find out which entity was causing the problem. It was an entity whose “id” was an empty string (“”). After removing it, the code worked.

Thanks for the help. Closing the issue.
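
For anyone hitting the same error, the binary search described above could look roughly like the sketch below. This is not the code that was actually used: find_failing_item and round_trip_fails are hypothetical names, and round_trip_fails is only a stand-in for "build a KB from this batch, dump it, and check whether load_bulk raises". The sketch also assumes there is exactly one offending entity and that any batch containing it fails.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def find_failing_item(items: Sequence[T], fails: Callable[[Sequence[T]], bool]) -> T:
    """Binary-search for the single item that makes `fails` return True.

    Assumes exactly one offending item, and that every batch containing it fails.
    """
    if not fails(items):
        raise ValueError("the full batch does not fail; nothing to search for")
    lo, hi = 0, len(items)  # invariant: the offender is somewhere in items[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(items[lo:mid]):
            hi = mid  # offender is in the left half
        else:
            lo = mid  # otherwise it must be in the right half
    return items[lo]


# Hypothetical stand-in for the real check (build KB, dump, load_bulk, catch AssertionError).
def round_trip_fails(batch):
    return any(e["id"] == "" for e in batch)


entities = [{"id": "Q1"}, {"id": "Q2"}, {"id": ""}, {"id": "Q4"}, {"id": "Q5"}]
bad = find_failing_item(entities, round_trip_fails)
print(bad)
```

With a real KB, each call to the check means a full build, dump and load, so halving the batch keeps the number of round trips to roughly log2 of the number of entities.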

nlp-sudo commented, Sep 16, 2020

Thanks for the reply. Here are more details about my code:

The KB is initialized like this:

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_lg")
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

I have one entities.jsonl file (it has 3968080 entries). The code to add the entities to the KB is:

for x in entities:
    # Only add ids that are not already in the KB
    if not kb.contains_entity(x["id"]):
        kb.add_entity(entity=x["id"], freq=100, entity_vector=nlp.make_doc(x["description"]).vector)

After running this code, len(kb) returns 3554993.
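
The gap between the 3968080 input entries and the 3554993 entities in the KB is at least partly expected, because the loop skips ids that are already present. A quick audit of the ids before they go into the KB can show how many are repeated or empty. The sketch below is only an illustration: audit_entity_ids is a hypothetical helper, and it assumes each record is a dict with an "id" key, as in the loop above.

```python
from collections import Counter


def audit_entity_ids(entities):
    """Count empty and repeated ids in a list of {"id": ...} records."""
    counts = Counter(e["id"] for e in entities)
    return {
        "total": len(entities),    # number of input records
        "unique": len(counts),     # distinct ids (an empty id counts as one)
        "empty": counts.get("", 0),  # records whose id is the empty string
        "duplicates": sum(n - 1 for n in counts.values() if n > 1),  # extra copies
    }


# Small made-up sample; the real input would be the parsed entities.jsonl records.
sample = [{"id": "Q1"}, {"id": "Q2"}, {"id": "Q1"}, {"id": ""}]
report = audit_entity_ids(sample)
print(report)
```

If "unique" matches len(kb), the difference is fully explained by repeated ids, and a non-zero "empty" count points at records worth inspecting before dumping the KB.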

I have one aliases file (it has 2095576 entries). The code to add them to the KB is:

for a in aliases:
    # Keep only the candidate entities that are actually in the KB
    ents = []
    prob = []
    for i in range(len(a["entities"])):
        if kb.contains_entity(a["entities"][i]):
            ents.append(a["entities"][i])
            prob.append(a["probabilities"][i])
    n_ents = len(ents)
    if n_ents > 0:
        # Renormalize the prior probabilities over the remaining entities
        s = sum(prob)
        if s == 0:
            prior_prob = [1.0 / n_ents] * n_ents
        else:
            prior_prob = [x / s for x in prob]
        kb.add_alias(alias=a["alias"], entities=ents, probabilities=prior_prob)

After this, kb.get_size_aliases() returns 305379 (I expect that a lot of the aliases are not related to the entities I am storing).

I then save the KB using: kb.dump("/kb")

And load it again using: kb.load_bulk("/kb")

which throws exactly the same error as mentioned before, i.e.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "kb.pyx", line 356, in spacy.kb.KnowledgeBase.load_bulk
  File "kb.pyx", line 409, in spacy.kb.KnowledgeBase.load_bulk
AssertionError