KeyError [E018] when using nlp.pipe with n_process > 1

How to reproduce the behaviour

Hi, I’m trying to use the new ja_core_news_sm model to stream process a collection of sentences with nlp.pipe(list_of_sentences). I’d like to be able to set n_process > 1 to increase speed, but when I do that I encounter KeyError [E018]. I’m using WSL through VSCode.

I’ll try something like this…


import collections

import spacy

Word = collections.namedtuple("Word", ["surface", "lemma", "upos", "xpos", "dep"])

nlp = spacy.load("ja_core_news_sm", disable=["ner", "entity_linker"])

# sentences: a list of Japanese sentence strings (not shown)
for doc in nlp.pipe(sentences, batch_size=150, n_process=2):
    for token in doc:
        # token.lemma_ is the attribute lookup that raises KeyError [E018]
        word = Word(surface=token.text, lemma=token.lemma_, upos=token.pos_, xpos=token.tag_, dep=token.dep_)

and get this error.

word = Word(surface=token.text, lemma=token.lemma_, upos=token.pos_, xpos=token.tag_, dep=token.dep_)
  File "token.pyx", line 894, in spacy.tokens.token.Token.lemma_.__get__
  File "strings.pyx", line 136, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '17260935250788936050'. This usually refers to an issue with the `Vocab` or `StringStore`."

The failing token wasn’t the first one in the sentence, so I counted how many tokens throw the exception. In one collection of sentences, 397 of 53,597 iterated tokens fail. So far, that number has stayed constant across re-runs with different batch_size and n_process values.
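For anyone who wants to reproduce the count, here is a rough sketch of how the failing tokens could be tallied. The helper name count_lemma_failures is made up for illustration and is not part of spaCy. It only assumes that each doc is iterable and that reading token.lemma_ raises KeyError when the hash is missing from the StringStore, as in the traceback above.

```python
from typing import Iterable, Tuple


def count_lemma_failures(docs: Iterable) -> Tuple[int, int]:
    """Return (failed, total) token counts over all docs.

    Hypothetical helper: a token counts as failed when reading
    its lemma_ attribute raises KeyError (the E018 case).
    """
    failed = 0
    total = 0
    for doc in docs:
        for token in doc:
            total += 1
            try:
                token.lemma_  # the lookup that fails in the traceback
            except KeyError:
                failed += 1
    return failed, total
```

With the setup from the snippet above, it would be called as failed, total = count_lemma_failures(nlp.pipe(sentences, batch_size=150, n_process=2)).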

Just to sanity check, a bare nlp.pipe(sentences) without n_process and [nlp(s) for s in sentences] both work with no issues. Possibly a model-specific issue?
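The sanity check can be sketched as one side-by-side comparison. The function failures_by_mode and its dictionary keys are hypothetical names chosen for this example. It assumes nlp is a loaded pipeline that is callable on a string and has a pipe method accepting batch_size and n_process, and that sentences is a list, since it is iterated three times.

```python
def _failed_lemmas(docs):
    """Count tokens whose lemma_ lookup raises KeyError."""
    failed = 0
    for doc in docs:
        for token in doc:
            try:
                token.lemma_
            except KeyError:
                failed += 1
    return failed


def failures_by_mode(nlp, sentences, n_process=2, batch_size=150):
    """Run the same sentences three ways and report failures per mode."""
    return {
        # one call per sentence, no batching
        "call": _failed_lemmas(nlp(s) for s in sentences),
        # batched, single process
        "pipe_single": _failed_lemmas(
            nlp.pipe(sentences, batch_size=batch_size)
        ),
        # batched, multiple worker processes
        "pipe_multi": _failed_lemmas(
            nlp.pipe(sentences, batch_size=batch_size, n_process=n_process)
        ),
    }
```

If the behaviour described here holds, only the "pipe_multi" entry should be non-zero.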

Your Environment

Operating System: Windows 10 (Build 18363)
Python Version Used: 3.8.1
spaCy Version Used: 2.3.0
Environment Information: WSL (Linux-4.4.0-18362-Microsoft-x86_64-with-glibc2.27)