Support languages that need a tokenizer (Chinese, Japanese, etc.)

I think iepy needs a common interface for plugging in a tokenizer, so that it can support languages like Chinese and Japanese.

There is an older IE project with a GUI named GATE. It includes a pre-trained model and a dataset, which may be helpful: https://gate.ac.uk/sale/tao/splitch15.html#sec:misc-creole:language-plugins:chinese
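
To make the "common interface" idea concrete, here is a minimal sketch of what a pluggable tokenizer registry could look like. It is not iepy's actual API: every name in it (`register_tokenizer`, `tokenize`, the language codes) is hypothetical, and the per-character splitter is only a naive stand-in for a real Chinese or Japanese segmenter.

```python
from typing import Callable, Dict, List

# Hypothetical names throughout: none of this is iepy's actual API.
# A tokenizer is any callable that turns a string into a list of tokens.
Tokenizer = Callable[[str], List[str]]

_TOKENIZERS: Dict[str, Tokenizer] = {}


def register_tokenizer(lang: str, tokenizer: Tokenizer) -> None:
    """Associate a language code with a tokenizer callable."""
    _TOKENIZERS[lang] = tokenizer


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on whitespace; adequate for space-delimited languages."""
    return text.split()


def per_character_tokenizer(text: str) -> List[str]:
    """Naive stand-in for a real segmenter: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def tokenize(text: str, lang: str) -> List[str]:
    """Dispatch to the tokenizer registered for `lang`.

    Falls back to whitespace splitting when no tokenizer is registered.
    """
    return _TOKENIZERS.get(lang, whitespace_tokenizer)(text)


# Assumed language codes, chosen only for illustration.
register_tokenizer("en", whitespace_tokenizer)
register_tokenizer("zh", per_character_tokenizer)
register_tokenizer("ja", per_character_tokenizer)
```

With something like this in place, a proper word segmenter (for example one built on jieba for Chinese or MeCab for Japanese) could be wrapped in a function with the same signature and registered under its language code, without touching the rest of the pipeline.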

Issue Analytics

State:
Created 7 years ago
Reactions: 1
Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction

YanWenqiang commented, Sep 25, 2017

@eromoe Right now, I want to customize iepy for Chinese. Could you give me a hand?

0 reactions

hwaking commented, Dec 9, 2017

@eromoe I am doing Chinese EMR information extraction. Can I use iepy to do entity relationship extraction?
