Path error when running bert_base and spanbert_base

Hi, I tried to run python train.py <experiment> following the instructions in the README, after setting export data_dir=/data/coref. I tried both <experiment>=bert_base and <experiment>=spanbert_base, and got the following errors:

bert_base

Restoring from: /data/coref/bert_base/model-57000
W1010 16:06:31.544241 140159242708736 deprecation.py:323] From /home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Traceback (most recent call last):
  File "train.py", line 41, in <module>
    saver.restore(session, ckpt.model_checkpoint_path)
  File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1278, in restore
    compat.as_text(save_path))
ValueError: The passed save_path is not a valid checkpoint: /data/coref/bert_base/model-57000

The bert_base directory created by ./download_pretrained.sh bert_base contains:

bert_config.json  checkpoint  events.out.tfevents.1551148806.learnfair2008  events.out.tfevents.1551148825.learnfair0213  events.out.tfevents.1551148826.learnfair0213  model.max.ckpt.data-00000-of-00001  model.max.ckpt.index  stdout.log  vocab.txt
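
The mismatch above (the restore asks for model-57000, while the directory only holds model.max.ckpt.* files) can be checked without TensorFlow. The following is a minimal sketch, assuming the checkpoint file in the model directory is the usual TensorFlow text-format state file with a model_checkpoint_path entry, and that a usable prefix is one that has a matching .index file. The helper names recorded_checkpoint and prefix_exists are hypothetical, and the demo directory is a synthetic stand-in for /data/coref/bert_base.

```python
import os
import re
import tempfile


def recorded_checkpoint(model_dir):
    """Return the checkpoint prefix named in <model_dir>/checkpoint, or None."""
    state_file = os.path.join(model_dir, "checkpoint")
    with open(state_file) as f:
        match = re.search(r'^model_checkpoint_path:\s*"([^"]+)"', f.read(), re.M)
    if match is None:
        return None
    path = match.group(1)
    # Assumption: relative entries are resolved against the model directory.
    return path if os.path.isabs(path) else os.path.join(model_dir, path)


def prefix_exists(prefix):
    """Assumption: a prefix is usable only if its .index file is present."""
    return os.path.exists(prefix + ".index")


# Synthetic directory mimicking the listing above (not the real download).
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "checkpoint"), "w") as f:
        f.write('model_checkpoint_path: "model-57000"\n')
        f.write('all_model_checkpoint_paths: "model-57000"\n')
    for suffix in (".index", ".data-00000-of-00001"):
        open(os.path.join(demo_dir, "model.max.ckpt" + suffix), "w").close()

    prefix = recorded_checkpoint(demo_dir)
    demo_result = (os.path.basename(prefix), prefix_exists(prefix))
    print(demo_result)
```

If the recorded prefix has no .index file next to it, the state file and the files on disk disagree, which matches the ValueError in the traceback.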

spanbert_base

Restoring from: /checkpoint/danqi/coref_eval/final/base_pair_external_sl384_blr2e-05_tlr0.0001/model-57000
W1010 15:52:59.750342 139657016100608 deprecation.py:323] From /home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Traceback (most recent call last):
  File "train.py", line 41, in <module>
    saver.restore(session, ckpt.model_checkpoint_path)
  File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1276, in restore
    if not checkpoint_management.checkpoint_exists(compat.as_text(save_path)):
  File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_management.py", line 372, in checkpoint_exists
    if file_io.get_matching_files(pathname):
  File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 363, in get_matching_files
    return get_matching_files_v2(filename)
  File "/home/jiezhong/anaconda3/envs/coref/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 384, in get_matching_files_v2
    compat.as_bytes(pattern))
tensorflow.python.framework.errors_impl.NotFoundError: /checkpoint/danqi/coref_eval/final/base_pair_external_sl384_blr2e-05_tlr0.0001; No such file or directory

It seems that the path /checkpoint/danqi/coref_eval/final/base_pair_external_sl384_blr2e-05_tlr0.0001 is hard-coded somewhere. Moreover, the spanbert_base directory created by ./download_pretrained.sh spanbert_base contains:

bert_config.json  checkpoint  events.out.tfevents.1561596094.learnfair1413  model.max.ckpt.data-00000-of-00001  model.max.ckpt.index  stdout.log  vocab.txt
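
To see which checkpoint prefixes a downloaded directory really offers, independent of whatever path its checkpoint file records, one option is to list the .index files. This is a rough sketch under the assumption that every restorable checkpoint in the directory has a <prefix>.index file; available_prefixes is a hypothetical helper, and the demo directory only imitates the file names from the listing above.

```python
import glob
import os
import tempfile


def available_prefixes(model_dir):
    """List checkpoint prefixes in model_dir, inferred from *.index files."""
    index_files = glob.glob(os.path.join(model_dir, "*.index"))
    # Strip the ".index" suffix to recover the prefix a restore call expects.
    return sorted(path[: -len(".index")] for path in index_files)


# Synthetic directory with the same file names as the spanbert_base listing.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in (
        "bert_config.json",
        "checkpoint",
        "model.max.ckpt.data-00000-of-00001",
        "model.max.ckpt.index",
        "vocab.txt",
    ):
        open(os.path.join(demo_dir, name), "w").close()

    found = [os.path.basename(p) for p in available_prefixes(demo_dir)]
    print(found)
```

For the listing above this yields a single prefix, model.max.ckpt, which differs from both the model-57000 name and the absolute /checkpoint/danqi/... path seen in the tracebacks.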

Any help would be greatly appreciated!

Issue Analytics

Created 4 years ago
Comments: 8 (4 by maintainers)

Top GitHub Comments

nnkennard commented, Oct 28, 2019

Awesome, it seems to be training with those changes. I’ll try the suggested changes in hyperparameters as well.

Thank you so much for the quick response!

mandarjoshi90 commented, Oct 28, 2019

Neha, thanks so much for finding this. This happened because train.py tried to pick up the checkpoint already finetuned on coref. I should have clarified this in the instructions. I’m sorry. Please add this to your experiments.conf and run python train.py train_spanbert_base:

train_spanbert_base = ${spanbert_base}{
  tf_checkpoint = ${best.log_root}/cased_L-12_H-768_A-12/bert_model.ckpt
  init_checkpoint = ${best.log_root}/spanbert_hf_base/pytorch_model.bin
}

where ${best.log_root}/spanbert_hf_base/pytorch_model.bin points to the original SpanBERT (not finetuned) weight file, which can be downloaded from https://dl.fbaipublicfiles.com/fairseq/models/spanbert_hf_base.tar.gz

Note that you might still have to scale down ffnn_size if you have a 12GB GPU. I think 1000 works well but you might see a slight drop in performance. I’ll update the instructions tomorrow.
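
For readers unfamiliar with the ${spanbert_base}{ ... } form used in the override block above, the idea is that the new experiment inherits every setting from spanbert_base and replaces only the keys listed. The sketch below imitates that with a plain Python dict merge. It is not how the repository loads its config, the inherit helper is hypothetical, and the values in the stand-in spanbert_base dict are placeholders rather than the real settings. The ${best.log_root} references are left as unresolved strings, and the ffnn_size override of 1000 is the value suggested for a 12GB GPU.

```python
def inherit(base, overrides):
    """Return a copy of base with overrides applied on top (shallow merge)."""
    merged = dict(base)
    merged.update(overrides)
    return merged


# Placeholder stand-in for the real spanbert_base block. These values are
# illustrative only and are NOT copied from the repository's experiments.conf.
spanbert_base = {
    "ffnn_size": 3000,
    "init_checkpoint": "<checkpoint already finetuned on coref>",
}

train_spanbert_base = inherit(
    spanbert_base,
    {
        "tf_checkpoint": "${best.log_root}/cased_L-12_H-768_A-12/bert_model.ckpt",
        "init_checkpoint": "${best.log_root}/spanbert_hf_base/pytorch_model.bin",
        "ffnn_size": 1000,  # smaller size suggested for a 12GB GPU
    },
)

print(train_spanbert_base["init_checkpoint"])
print(train_spanbert_base["ffnn_size"])
```

The base dict is left untouched, so the original spanbert_base experiment stays available alongside the new training experiment.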