ValueError: a must be greater than 0 unless no samples are taken while pretraining using cli
Hi, I am using spaCy version 2.3.2 to pretrain the spaCy tok2vec layer. I prepared raw text data in the format spaCy asks for. mydata.jsonl looks like this:
{"text": "ホッケーにはデンジャラスプレーの反則があるので、膝より上にボールを浮かすことは基本的に反則になるが、その例外の一つがこのスクープである。"}
{"text": "また行きたい、そんな気持ちにさせてくれるお店です。"}
My pretraining CLI command is:
python -m spacy pretrain mydata.jsonl ja_core_news_lg outpath
After running this command I got the error shown below. I changed the Japanese model version and still have the same problem. Pretraining with English data works fine; the problem only occurs with Japanese text.
:information_source: Using GPU
:warning: Output directory is not empty
It is better to use an empty directory or refer to a new output path, then the
new directory will be created for you.
:heavy_check_mark: Saved settings to config.json
:heavy_check_mark: Loaded input texts
:heavy_check_mark: Loaded model 'ja_core_news_lg'
============== Pre-training tok2vec layer - starting at epoch 0 ==============
# Words Total Loss Loss w/s
Traceback (most recent call last):
File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/__main__.py", line 33, in <module>
plac.call(commands[command], sys.argv[1:])
File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/cli/pretrain.py", line 237, in pretrain
model, docs, optimizer, objective=loss_func, drop=dropout
File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/cli/pretrain.py", line 264, in make_update
predictions, backprop = model.begin_update(docs, drop=drop)
File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/_ml.py", line 837, in mlm_forward
mask, docs = _apply_mask(docs, random_words, mask_prob=mask_prob)
File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/_ml.py", line 884, in _apply_mask
word = _replace_word(token.text, random_words)
File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/_ml.py", line 904, in _replace_word
return random_words.next()
File "/home/nsl7/anaconda3/envs/spacygpu2.3/lib/python3.6/site-packages/spacy/_ml.py", line 865, in next
numpy.random.choice(len(self.words), 10000, p=self.probs)
File "mtrand.pyx", line 902, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken
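The last frame calls numpy.random.choice(len(self.words), 10000, p=self.probs), and NumPy raises this ValueError when the first argument is 0 while samples are requested. That suggests the list of replacement words built from the model's vocab was empty for the Japanese model; why it was empty is not established in this thread. The sketch below only reproduces the NumPy behaviour. The helper name sample_word_indices and the uniform probabilities are illustrative assumptions, not spaCy code.

```python
import numpy


def sample_word_indices(n_words, n_samples=10000):
    """Hypothetical stand-in for the failing call in the traceback.

    Draws n_samples indices from range(n_words) using uniform
    probabilities (an assumption made only for this illustration).
    """
    if n_words:
        probs = numpy.full(n_words, 1.0 / n_words)
    else:
        probs = numpy.array([])
    return numpy.random.choice(n_words, n_samples, p=probs)


# With an empty word list the population size is 0, so NumPy refuses to sample.
try:
    sample_word_indices(0)
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)

# With a non-empty word list the same call succeeds.
ok = sample_word_indices(3, n_samples=5)
print(ok)
```

If this is indeed the cause, checking how many entries the random-word list contains for the loaded model before pretraining would confirm it.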
Issue Analytics
- State:
- Created 3 years ago
- Comments:8 (4 by maintainers)
Top GitHub Comments
Thanks for reporting back @sagor71. As this feature was experimental in v2, and should work much better in v3, it’s not really a priority for us to spend much more time on this for v2. I’m happy to hear you found a working solution though. I’ll close this in the meantime, but let us know if you still run into issues!
Hi @svlandeg, thanks for the suggestion. I have checked the spaCy v3 nightly and it's really amazing, but at present I am working with spaCy 2. Any possible way to pretrain tok2vec for Japanese in v2 would help me a lot. I am keeping this issue open for new suggestions. Regards