01_how-to-train.ipynb broken
🐛 Bug
To reproduce
Steps to reproduce the behavior:
- Go to https://github.com/huggingface/transformers/tree/master/examples
- Click the Colab link for language-modeling: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
- Run the notebook
Expected behavior
The notebook finishes successfully.
What I get instead is:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-5-52625a7c86e5> in <module>()
1 get_ipython().system('mkdir EsperBERTo')
----> 2 tokenizer.save("EsperBERTo")
/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in save(self, path, pretty)
330 A path to the destination Tokenizer file
331 """
--> 332 return self._tokenizer.save(path, pretty)
333
334 def to_str(self, pretty: bool = False):
Exception: Is a directory (os error 21)
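For context, the failing cell reduces to the following. This is a minimal sketch, assuming tokenizers >= 0.8.0 is installed alongside transformers 2.11.0; corpus.txt is a hypothetical training file, not part of the original notebook:

```python
from tokenizers import ByteLevelBPETokenizer
import os

# Train a small byte-level BPE tokenizer; corpus.txt is a placeholder corpus file.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=1000, min_frequency=2)

# The notebook passes a directory here. On tokenizers >= 0.8.0, save() expects
# a file path for the serialized tokenizer, so a directory raises os error 21.
os.makedirs("EsperBERTo", exist_ok=True)
tokenizer.save("EsperBERTo")  # Exception: Is a directory (os error 21)
```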
Environment info
- transformers version: 2.11.0
- Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.5.0+cu101 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: NA
- Using distributed or parallel set-up in script?: NA
Top GitHub Comments
Yes, tokenizers 0.8.0 introduces full tokenizer serialization, whereas before it saved only the "model" (vocab.json + merges.txt for BPE). So the save method should now be used like this: `.save("tokenizer.json")`, and it saves the entire tokenizer to a single JSON file. We need to update the notebook to use this new serialization method, but in the meantime, the only change needed to make it work exactly like before is to replace `tokenizer.save("EsperBERTo")` with `tokenizer.save_model("EsperBERTo")`.
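As a concrete sketch (assuming a trained ByteLevelBPETokenizer bound to the name tokenizer and an existing EsperBERTo/ directory), the two options look like this:

```python
# Old behavior: write only the model files (vocab.json + merges.txt) to a directory.
tokenizer.save_model("EsperBERTo")

# New in tokenizers 0.8.0: serialize the entire tokenizer to a single JSON file.
tokenizer.save("EsperBERTo/tokenizer.json")
```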
This one is for me (this method was actually not working as intended under the hood for fast tokenizers…)