Path error when running bert_base and spanbert_base
Hi, I tried to run python train.py <experiment> following the instructions in the README. I tried <experiment>=bert_base and <experiment>=spanbert_base, and set export data_dir=/data/coref. I ran into the following errors:
- bert_base
Restoring from: /data/coref/bert_base/model-57000
W1010 16:06:31.544241 140159242708736 deprecation.py:323] From /home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Traceback (most recent call last):
File "train.py", line 41, in <module>
saver.restore(session, ckpt.model_checkpoint_path)
File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1278, in restore
compat.as_text(save_path))
ValueError: The passed save_path is not a valid checkpoint: /data/coref/bert_base/model-57000
The bert_base directory created by ./download_pretrained.sh bert_base contains:
bert_config.json checkpoint events.out.tfevents.1551148806.learnfair2008 events.out.tfevents.1551148825.learnfair0213 events.out.tfevents.1551148826.learnfair0213 model.max.ckpt.data-00000-of-00001 model.max.ckpt.index stdout.log vocab.txt
- spanbert_base
Restoring from: /checkpoint/danqi/coref_eval/final/base_pair_external_sl384_blr2e-05_tlr0.0001/model-57000
W1010 15:52:59.750342 139657016100608 deprecation.py:323] From /home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Traceback (most recent call last):
File "train.py", line 41, in <module>
saver.restore(session, ckpt.model_checkpoint_path)
File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1276, in restore
if not checkpoint_management.checkpoint_exists(compat.as_text(save_path)):
File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_management.py", line 372, in checkpoint_exists
if file_io.get_matching_files(pathname):
File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 363, in get_matching_files
return get_matching_files_v2(filename)
File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 384, in get_matching_files_v2
compat.as_bytes(pattern))
tensorflow.python.framework.errors_impl.NotFoundError: /checkpoint/danqi/coref_eval/final/base_pair_external_sl384_blr2e-05_tlr0.0001; No such file or directory
It seems that the path /checkpoint/danqi/coref_eval/final/base_pair_external_sl384_blr2e-05_tlr0.0001 is hard-coded somewhere. Moreover, the spanbert_base directory created by ./download_pretrained.sh spanbert_base contains:
bert_config.json checkpoint events.out.tfevents.1561596094.learnfair1413 model.max.ckpt.data-00000-of-00001 model.max.ckpt.index stdout.log vocab.txt
Any help would be greatly appreciated!
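Both tracebacks point at a mismatch between the checkpoint prefix recorded in the `checkpoint` state file and the files actually present in the directory (only `model.max.ckpt.*` is there, while the restore asks for `model-57000`). A minimal diagnostic sketch follows, assuming the state file uses TensorFlow's usual text format with a `model_checkpoint_path: "..."` line. The helper names (`read_checkpoint_prefix`, `available_prefixes`, `diagnose`) are hypothetical and not part of the repository.

```python
import os
import re


def read_checkpoint_prefix(model_dir):
    """Return the prefix recorded in the `checkpoint` state file, or None."""
    state_file = os.path.join(model_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return None
    with open(state_file) as f:
        for line in f:
            # Assumed format: model_checkpoint_path: "some/prefix"
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"', line)
            if match:
                path = match.group(1)
                # Relative entries are resolved against the model directory.
                if os.path.isabs(path):
                    return path
                return os.path.join(model_dir, path)
    return None


def available_prefixes(model_dir):
    """List checkpoint prefixes that have an .index file on disk."""
    return sorted(
        os.path.join(model_dir, name[: -len(".index")])
        for name in os.listdir(model_dir)
        if name.endswith(".index")
    )


def diagnose(model_dir):
    """Compare the recorded prefix with the prefixes that really exist."""
    recorded = read_checkpoint_prefix(model_dir)
    found = available_prefixes(model_dir)
    return {"recorded": recorded, "found": found, "ok": recorded in found}
```

Calling `diagnose("/data/coref/bert_base")` on a directory like the ones listed above would be expected to report `ok` as false, since the recorded prefix has no matching `.index` file.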
Top GitHub Comments
Awesome, it seems to be training with those changes. I’ll try the suggested changes in hyperparameters as well.
Thank you so much for the quick response!
Neha, thanks so much for finding this. This happened because train.py tried to pick up the checkpoint already finetuned on coref. I should have clarified this in the instructions. I’m sorry. Please add this to your experiments.conf and run python train.py train_spanbert_base, where ${best.log_root}/spanbert_hf_base/pytorch_model.bin points to the original SpanBERT (not finetuned) weight file, which can be downloaded from here: https://dl.fbaipublicfiles.com/fairseq/models/spanbert_hf_base.tar.gz
Note that you might still have to scale down ffnn_size if you have a 12GB GPU. I think 1000 works well, but you might see a slight drop in performance. I’ll update the instructions tomorrow.
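For fetching the original SpanBERT weights mentioned above, a rough sketch using only the Python standard library could look like this. The layout inside the archive is not stated in the thread, so the code searches for `pytorch_model.bin` after extraction instead of assuming a fixed location. The function names and the destination directory are hypothetical.

```python
import os
import tarfile
import urllib.request

# URL quoted in the comment above.
SPANBERT_URL = "https://dl.fbaipublicfiles.com/fairseq/models/spanbert_hf_base.tar.gz"


def find_weight_file(root, filename="pytorch_model.bin"):
    """Return the first path under `root` whose basename is `filename`, or None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None


def download_and_extract(url, dest_dir):
    """Download a .tar.gz archive into dest_dir, unpack it, and locate the weights."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
    return find_weight_file(dest_dir)
```

The destination should be whatever `${best.log_root}/spanbert_hf_base` resolves to in your configuration, for example `download_and_extract(SPANBERT_URL, "/data/coref/spanbert_hf_base")` if `best.log_root` is `/data/coref` (an assumption here). If the returned path differs from the one expected by the config, move the file or adjust the config entry accordingly.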