MLflow log artifacts
Environment info
- transformers version: 4.4.2
- Platform: Darwin-20.3.0-x86_64-i386-64bit
- Python version: 3.7.4
- PyTorch version (GPU?): 1.3.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Information
Model I am using (Bert, XLNet …): Bert
The problem arises when using:
- the official example scripts: the token classification example (run_ner.py)
The task I am working on is:
- an official GLUE/SQuAD task: NER (conll2003)
To reproduce
The bug relates to PR #8016.
Steps to reproduce the behavior:
- MLflow installed and the following env variables exported:
export HF_MLFLOW_LOG_ARTIFACTS=TRUE
export MLFLOW_S3_ENDPOINT_URL=<custom endpont>
export MLFLOW_TRACKING_URI=<custom uri>
export MLFLOW_TRACKING_TOKEN=<custom token>
- Run the token classification example with the following command:
python run_ner.py \
--model_name_or_path bert-base-uncased \
--dataset_name conll2003 \
--output_dir /tmp/test-ner \
--do_train \
--do_eval
Expected behavior
When training finishes, before the evaluation is performed, the integrations.MLflowCallback executes the method on_train_end, where, if the env variable HF_MLFLOW_LOG_ARTIFACTS is set to TRUE, it logs the model artifacts to MLflow.
The problem, however, is that when on_train_end is called and the line self._ml_flow.log_artifacts(args.output_dir) is executed, the model is not yet stored in args.output_dir. The model artifacts are only stored once trainer.save_model() is called, which happens after training ends. There is no callback in trainer.save_model() that a TrainerCallback could hook into to save the model. There is a TrainerCallback.on_save() method, which is called from trainer._maybe_log_save_evaluate(), but even then the model is not available in output_dir.
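For reference, the failing hook looks roughly like this (a simplified sketch of the 4.4.2 callback; the internal attribute names _initialized and _log_artifacts are assumptions based on the callback's setup and may differ slightly):

from transformers.trainer_callback import TrainerCallback

class MLflowCallback(TrainerCallback):
    # setup() reads HF_MLFLOW_LOG_ARTIFACTS and stores the mlflow module
    # on self._ml_flow; details omitted here.

    def on_train_end(self, args, state, control, **kwargs):
        if self._initialized and state.is_world_process_zero and self._log_artifacts:
            # Runs before trainer.save_model(), so args.output_dir does not
            # yet contain the final model files -- this is the bug.
            self._ml_flow.log_artifacts(args.output_dir)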
A possible solution would be to extend TrainerCallback with an on_model_save() callback method and invoke it from trainer.save_model(), as sketched below.
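A minimal sketch of that idea (note that on_model_save() is hypothetical and does not exist in transformers; Trainer.save_model() would have to be changed to fire it after writing the files):

from transformers.trainer_callback import TrainerCallback

class MLflowCallback(TrainerCallback):
    # Hypothetical hook: assumes trainer.save_model() calls it after the
    # model files have been written to args.output_dir.
    def on_model_save(self, args, state, control, **kwargs):
        if self._initialized and state.is_world_process_zero and self._log_artifacts:
            self._ml_flow.log_artifacts(args.output_dir)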
Alternatively, the workaround I have now is to replace on_train_end with on_evaluate in integrations.MLflowCallback, since on_evaluate is called after the model is saved in the example script. However, this is not the right solution, since it depends on the do_eval parameter being set, and it is not semantically correct.
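A sketch of that workaround, assuming the same internal attributes as above:

from transformers.trainer_callback import TrainerCallback

class MLflowCallback(TrainerCallback):
    # Workaround: log artifacts on evaluation instead of at train end. In
    # run_ner.py, trainer.evaluate() runs after trainer.save_model(), so the
    # model files are present in args.output_dir by this point.
    def on_evaluate(self, args, state, control, **kwargs):
        if self._initialized and state.is_world_process_zero and self._log_artifacts:
            self._ml_flow.log_artifacts(args.output_dir)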
Top GitHub Comments
Note that this hook would be called when each checkpoint is saved, not just at the end of training, so you would not only save the last model.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.