Parsing/Tokenizing a file with Ginza from command line

See original GitHub issue

Hi, GiNZA is primarily used as a dependency parser, but it can also be used for tokenization. I would therefore like to know whether there is a command-line interface, similar to Jieba's, for tokenizing a file.

In Jieba, we can use: python -m jieba -d filename

How could we do that using GiNZA? 😃

Kind Regards,
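As a sketch of what is being asked: Jieba's one-liner is shown above, and GiNZA installs a `ginza` console command that reads text from standard input and prints a CoNLL-U style analysis (one token per row, surface form in the second tab-separated column). The flags and output layout may differ between GiNZA versions, so treat the second command as an illustration rather than a definitive recipe:

```shell
# Jieba: tokenize a file from the command line, joining tokens with " "
python -m jieba -d " " input.txt > tokens.txt

# GiNZA: the `ginza` command reads stdin and emits CoNLL-U style rows;
# keep only token rows (numeric first field) and print the surface form.
# (Sketch: column layout may vary by GiNZA version.)
ginza < input.txt | awk -F'\t' '$1 ~ /^[0-9]+$/ { print $2 }' > tokens.txt
```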

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
hiroshi-matsuda-rit commented, Oct 21, 2019

@MastafaF Please try the sudachipy command, which is installed together with ginza, if you need it. Or do you need a --delimiter option like Jieba's? Actually, I don't want to add that to the ginza command. I'm planning to add a format option which provides MeCab-compatible output. https://taku910.github.io/mecab/
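A hedged sketch of the suggestion above: SudachiPy is installed as a GiNZA dependency, and its sudachipy console command tokenizes plain text, printing one token per line in a MeCab-like tab-separated format with EOS between sentences. Exact flags have changed across SudachiPy versions, so the options below are an illustration:

```shell
# Tokenize a file with the sudachipy CLI (installed alongside ginza).
# -m selects the split mode (A = shortest units, B, C = longest units);
# -o writes the result to a file instead of stdout.
# (Sketch: flag names may differ across SudachiPy versions.)
sudachipy -m C -o tokens.txt input.txt

# Surface forms only: the surface form is the first tab-separated field;
# skip the EOS sentence separators.
sudachipy -m C input.txt | awk -F'\t' '$0 != "EOS" { print $1 }'
```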

1 reaction
hiroshi-matsuda-rit commented, Oct 17, 2019

@MastafaF It is a typical use case of GiNZA. I'd like to add a tokenization mode to command_line.py in the next release. Thank you!


Top Results From Across the Web

Correcting tokenization errors by parser · Issue #3818 - GitHub
Feature description I'd like to implement a functionality which can correct tokenization errors (both boundaries and tags) by parser.
Gd Tokenizer - scripts - Godot Asset Store
The most important file is ./cmd/CommandParser.gd . This is where the tokenization and execution methods are defined. To use the class file you...
tokenize — Tokenizer for Python source — Python 3.11.1 ...
tokenize() determines the source encoding of the file by looking for a UTF-8 BOM or encoding cookie, according to PEP 263. tokenize.generate_tokens(readline)...
Tokenization - CoreNLP - Stanford NLP Group
Tokenizing From The Command Line. This command will take in the text of the file input.txt and produce a human readable output of...
Token · spaCy API Documentation
The leftward immediate children of the word in the syntactic dependency parse. Example. doc = nlp("I like New York in Autumn.") ...
