MLflow log artifacts
Environment info
- transformers version: 4.4.2
- Platform: Darwin-20.3.0-x86_64-i386-64bit
- Python version: 3.7.4
- PyTorch version (GPU?): 1.3.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Information
Model I am using (Bert, XLNet …): Bert
The problem arises when using:
- the official example scripts: the token classification example (run_ner.py)
The task I am working on is:
- an official GLUE/SQuAD task: NER (conll2003)
To reproduce
The bug relates to PR #8016.
Steps to reproduce the behavior:
- MLflow installed and the following env variables exported:
export HF_MLFLOW_LOG_ARTIFACTS=TRUE
export MLFLOW_S3_ENDPOINT_URL=<custom endpont>
export MLFLOW_TRACKING_URI=<custom uri>
export MLFLOW_TRACKING_TOKEN=<custom token>
- Run the token classification example with the following command:
python run_ner.py \
--model_name_or_path bert-base-uncased \
--dataset_name conll2003 \
--output_dir /tmp/test-ner \
--do_train \
--do_eval
Expected behavior
When training finishes, before the evaluation is performed, the integrations.MLflowCallback executes the method on_train_end, where, if the env variable HF_MLFLOW_LOG_ARTIFACTS is set to TRUE, it logs the model artifacts to MLflow.
The problem, however, is that when on_train_end is called and the line self._ml_flow.log_artifacts(args.output_dir) is executed, the model is not yet stored in args.output_dir. The model artifacts are only stored once trainer.save_model() is called, which happens after training ends. There is no callback in trainer.save_model() that a TrainerCallback could hook into to save the model. There is a TrainerCallback.on_save() method, which is called from trainer._maybe_log_save_evaluate(), but even then the model is not available in output_dir.
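For reference, the failing hook looks roughly like this (a simplified sketch of the 4.4.2 callback; the internal attribute names _initialized and _log_artifacts are assumptions based on the callback's setup and may differ slightly):

from transformers.trainer_callback import TrainerCallback

class MLflowCallback(TrainerCallback):
    # setup() reads HF_MLFLOW_LOG_ARTIFACTS and stores the mlflow module
    # on self._ml_flow; details omitted here.

    def on_train_end(self, args, state, control, **kwargs):
        if self._initialized and state.is_world_process_zero and self._log_artifacts:
            # Runs before trainer.save_model(), so args.output_dir does not
            # yet contain the final model files -- this is the bug.
            self._ml_flow.log_artifacts(args.output_dir)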
A possible solution would be to extend TrainerCallback with an on_model_save() callback method and invoke it from trainer.save_model(), as sketched below.
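A minimal sketch of that idea (note that on_model_save() is hypothetical and does not exist in transformers; Trainer.save_model() would have to be changed to fire it after writing the files):

from transformers.trainer_callback import TrainerCallback

class MLflowCallback(TrainerCallback):
    # Hypothetical hook: assumes trainer.save_model() calls it after the
    # model files have been written to args.output_dir.
    def on_model_save(self, args, state, control, **kwargs):
        if self._initialized and state.is_world_process_zero and self._log_artifacts:
            self._ml_flow.log_artifacts(args.output_dir)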
Alternatively, the workaround I have now is to replace on_train_end with on_evaluate in integrations.MLflowCallback, since on_evaluate is called after the model is saved in the example script. However, this is not the right solution, since it depends on the do_eval parameter being set, and it is not semantically correct.
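A sketch of that workaround, assuming the same internal attributes as above:

from transformers.trainer_callback import TrainerCallback

class MLflowCallback(TrainerCallback):
    # Workaround: log artifacts on evaluation instead of at train end. In
    # run_ner.py, trainer.evaluate() runs after trainer.save_model(), so the
    # model files are present in args.output_dir by this point.
    def on_evaluate(self, args, state, control, **kwargs):
        if self._initialized and state.is_world_process_zero and self._log_artifacts:
            self._ml_flow.log_artifacts(args.output_dir)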
Top GitHub Comments
Note that this hook would be called when each checkpoint is saved, not just at the end of training, so you would not only save the last model.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.