cannot analyze ` ̄ ̄` with japanese models

How to reproduce the behaviour

When I ran the following very small script:

import spacy
nlp = spacy.load('ja_core_news_sm')
nlp(' ̄ ̄')

I got the following error:

>>> nlp(' ̄ ̄')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 106, in get_dtokens_and_spaces
    word_start = text[text_pos:].index(word)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 441, in __call__
    doc = self.make_doc(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 281, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 145, in __call__
    dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 108, in get_dtokens_and_spaces
    raise ValueError(Errors.E194.format(text=text, words=words))
ValueError: [E194] Unable to aligned mismatched text ' ̄ ̄' and words '[' ', '̄', ' ̄']'.
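
The first traceback fails in a left-to-right search for each token inside the input text. The snippet below is a simplified sketch of that kind of alignment, not spaCy's actual implementation: `align_words` is a hypothetical helper, and the fullwidth-macron input (U+FFE3) is only an assumption about what the text may have looked like before any normalization. It shows how token surfaces that do not occur verbatim in the text lead to `ValueError: substring not found`.

```python
def align_words(text, words):
    """Find each word in `text`, scanning left to right.

    Returns a list of (start, end) offsets. Raises ValueError when a word
    cannot be found at or after the current position, which is the situation
    the traceback reports.
    """
    spans = []
    pos = 0
    for word in words:
        start = text.index(word, pos)  # ValueError if the word is missing
        spans.append((start, start + len(word)))
        pos = start + len(word)
    return spans


# Token surfaces as printed in the E194 message: space, combining macron,
# then space + combining macron.
words = [" ", "\u0304", " \u0304"]

# ASSUMPTION: the raw input may have been two fullwidth macrons (U+FFE3).
assumed_text = "\uffe3\uffe3"

try:
    align_words(assumed_text, words)
except ValueError as err:
    print("alignment failed:", err)

# The same words align without error against text that already contains them.
print(align_words(" \u0304 \u0304", words))
```

If the assumption holds, the mismatch is between the tokenizer's normalized surfaces and the unnormalized input, so the search fails on the very first token.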

The minimal Dockerfile is as follows:

FROM python:3.7

RUN pip install spacy
RUN python -m spacy download ja_core_news_sm

Your Environment

  • Operating System: Linux 04a7a76544e5 4.19.76-linuxkit #1 SMP Thu Oct 17 19:31:58 UTC 2019 x86_64 GNU/Linux
  • Python Version Used: 3.7.7
  • spaCy Version Used: 2.3.2
  • Environment Information: the minimal Dockerfile shown above

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

polm commented, Aug 24, 2020 (4 reactions)

Looks like it’s a macron character? Wouldn’t be used in normal Japanese, but might be used in romaji.

https://www.fileformat.info/info/unicode/char/0304/index.htm

I suspect this has to do with how SudachiPy normalizes characters. The following was a vaguely similar issue:

https://github.com/WorksApplications/SudachiPy/issues/120
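
To check which characters are involved, the standard-library `unicodedata` module can list code points and names. This is a small sketch: `describe` is a hypothetical helper, and the link to the fullwidth macron (U+FFE3) is an assumption about the original input, not something the issue confirms. It also shows that NFKC normalization turns a fullwidth macron into a space followed by a combining macron, which matches the pieces printed in the error message.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for every character in `text`."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# A space followed by the combining macron discussed in this comment.
shown = "\u0020\u0304"
print(describe(shown))

# ASSUMPTION: the input may originally have contained fullwidth macrons.
fullwidth = "\uffe3"
print(describe(fullwidth))

# NFKC maps the fullwidth macron to space + combining macron.
print(unicodedata.normalize("NFKC", fullwidth) == shown)
```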

hiroshi-matsuda-rit commented, Aug 25, 2020 (2 reactions)

@sorami It seems SudachiPy produces inconsistent dictionary_form and reading_form fields when analyzing text that contains certain symbol characters after whitespace.

@svlandeg @adrianeboyd I think we can release a bug-fix version even if SudachiPy is not fixed.
