Cannot analyze ` ̄ ̄` with Japanese models
How to reproduce the behaviour
When I tried the following very small script:

```python
import spacy

nlp = spacy.load('ja_core_news_sm')
nlp(' ̄ ̄')
```

I got the following error:

```
>>> nlp(' ̄ ̄')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 106, in get_dtokens_and_spaces
    word_start = text[text_pos:].index(word)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 441, in __call__
    doc = self.make_doc(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 281, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 145, in __call__
    dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 108, in get_dtokens_and_spaces
    raise ValueError(Errors.E194.format(text=text, words=words))
ValueError: [E194] Unable to aligned mismatched text ' ̄ ̄' and words '[' ', '̄', ' ̄']'.
```
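The chain of errors above suggests the underlying mechanism: the aligner in `get_dtokens_and_spaces` searches for each token's surface form in the raw text with `str.index`, and if the tokenizer returned a form that is not literally present in the input, the lookup fails. Here is a minimal standalone sketch of that failure (not spaCy's actual code; the normalization of U+0304 to U+00AF is a hypothetical example of how a token could stop matching the raw text):

```python
# Sketch of the alignment failure: if the tokenizer hands back a
# normalized token, str.index cannot find it in the raw input.
text = ' \u0304'    # raw input: space + U+0304 COMBINING MACRON
word = '\u00af'     # hypothetical normalized token: U+00AF MACRON
try:
    text.index(word)
except ValueError as e:
    print(e)        # substring not found
```

This is the same "substring not found" that surfaces as the inner exception before spaCy re-raises it as E194.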
The minimal Dockerfile is here:

```dockerfile
FROM python:3.7
RUN pip install spacy
RUN python -m spacy download ja_core_news_sm
```
Your Environment
- Operating System: Linux 04a7a76544e5 4.19.76-linuxkit #1 SMP Thu Oct 17 19:31:58 UTC 2019 x86_64 GNU/Linux
- Python Version Used: 3.7.7
- spaCy Version Used: 2.3.2
- Environment Information: minimal Dockerfile as shown above
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 1
- Comments: 11 (10 by maintainers)
Top GitHub Comments
Looks like it’s a macron character? Wouldn’t be used in normal Japanese, but might be used in romaji.
https://www.fileformat.info/info/unicode/char/0304/index.htm
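To confirm what the problem string actually contains, the codepoints can be inspected with the standard library; per the fileformat.info link above, the non-space character is U+0304 COMBINING MACRON:

```python
import unicodedata

# Print each codepoint and its Unicode name for the problem string.
for ch in ' \u0304 \u0304':
    print(f'U+{ord(ch):04X} {unicodedata.name(ch, "<unnamed>")}')
```

This prints alternating `U+0020 SPACE` and `U+0304 COMBINING MACRON` lines, i.e. a combining mark attached to nothing but a space, which is indeed unusual input for a Japanese tokenizer.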
I suspect this has to do with how SudachiPy normalizes characters; this was a vaguely similar issue:
https://github.com/WorksApplications/SudachiPy/issues/120
@sorami It seems SudachiPy's dictionary_form and reading_form fields are inconsistent when analyzing contexts that include certain symbol characters after whitespace.
@svlandeg @adrianeboyd I think we can release a bug-fix version even if SudachiPy is not fixed.
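Until a fix lands, one caller-side stopgap (my own sketch, not an official spaCy or SudachiPy API) is to strip combining marks such as U+0304 from the input before passing it to the pipeline:

```python
import unicodedata

def strip_combining(text: str) -> str:
    """Drop combining marks (e.g. U+0304) that can break token alignment."""
    return ''.join(ch for ch in text if not unicodedata.combining(ch))

print(repr(strip_combining(' \u0304 \u0304')))  # '  ' (two plain spaces)
```

Note that this alters the text (and therefore character offsets), so it is only suitable when the stray combining marks carry no meaning for the downstream task.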