Cannot analyze ` ̄ ̄` with Japanese models
How to reproduce the behaviour
When I tried the following very small script:

```python
import spacy

nlp = spacy.load('ja_core_news_sm')
nlp(' ̄ ̄')
```

I got the following error:

```
>>> nlp(' ̄ ̄')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 106, in get_dtokens_and_spaces
    word_start = text[text_pos:].index(word)
ValueError: substring not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 441, in __call__
    doc = self.make_doc(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 281, in make_doc
    return self.tokenizer(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 145, in __call__
    dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
  File "/usr/local/lib/python3.7/site-packages/spacy/lang/ja/__init__.py", line 108, in get_dtokens_and_spaces
    raise ValueError(Errors.E194.format(text=text, words=words))
ValueError: [E194] Unable to aligned mismatched text ' ̄ ̄' and words '[' ', '̄', ' ̄']'.
```
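The chain of errors above suggests the underlying mechanism: the aligner in `get_dtokens_and_spaces` searches for each token's surface form in the raw text with `str.index`, and if the tokenizer returned a form that is not literally present in the input, the lookup fails. Here is a minimal standalone sketch of that failure (not spaCy's actual code; the normalization of U+0304 to U+00AF is a hypothetical example of how a token could stop matching the raw text):

```python
# Sketch of the alignment failure: if the tokenizer hands back a
# normalized token, str.index cannot find it in the raw input.
text = ' \u0304'    # raw input: space + U+0304 COMBINING MACRON
word = '\u00af'     # hypothetical normalized token: U+00AF MACRON
try:
    text.index(word)
except ValueError as e:
    print(e)        # substring not found
```

This is the same "substring not found" that surfaces as the inner exception before spaCy re-raises it as E194.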
The minimal Dockerfile is here:

```dockerfile
FROM python:3.7
RUN pip install spacy
RUN python -m spacy download ja_core_news_sm
```
Your Environment
- Operating System: Linux 04a7a76544e5 4.19.76-linuxkit #1 SMP Thu Oct 17 19:31:58 UTC 2019 x86_64 GNU/Linux
- Python Version Used: 3.7.7
- spaCy Version Used: 2.3.2
- Environment Information: minimal Dockerfile as shown above
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 1
- Comments: 11 (10 by maintainers)
Top GitHub Comments
Looks like it’s a macron character? Wouldn’t be used in normal Japanese, but might be used in romaji.
https://www.fileformat.info/info/unicode/char/0304/index.htm
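To confirm what the problem string actually contains, the codepoints can be inspected with the standard library; per the fileformat.info link above, the non-space character is U+0304 COMBINING MACRON:

```python
import unicodedata

# Print each codepoint and its Unicode name for the problem string.
for ch in ' \u0304 \u0304':
    print(f'U+{ord(ch):04X} {unicodedata.name(ch, "<unnamed>")}')
```

This prints alternating `U+0020 SPACE` and `U+0304 COMBINING MACRON` lines, i.e. a combining mark attached to nothing but a space, which is indeed unusual input for a Japanese tokenizer.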
I suspect this has to do with how SudachiPy normalizes characters; this was a vaguely similar issue:
https://github.com/WorksApplications/SudachiPy/issues/120
@sorami It seems SudachiPy's dictionary_form and reading_form fields are inconsistent when analyzing contexts that include certain symbol characters after whitespace.
@svlandeg @adrianeboyd I think we can release a bug-fix version even if SudachiPy is not fixed.
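Until a fix lands, one caller-side stopgap (my own sketch, not an official spaCy or SudachiPy API) is to strip combining marks such as U+0304 from the input before passing it to the pipeline:

```python
import unicodedata

def strip_combining(text: str) -> str:
    """Drop combining marks (e.g. U+0304) that can break token alignment."""
    return ''.join(ch for ch in text if not unicodedata.combining(ch))

print(repr(strip_combining(' \u0304 \u0304')))  # '  ' (two plain spaces)
```

Note that this alters the text (and therefore character offsets), so it is only suitable when the stray combining marks carry no meaning for the downstream task.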