
MlFlow log artefacts


Environment info

  • transformers version: 4.4.2
  • Platform: Darwin-20.3.0-x86_64-i386-64bit
  • Python version: 3.7.4
  • PyTorch version (GPU?): 1.3.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@sgugger

Information

Model I am using (Bert, XLNet …): Bert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: NER
  • my own task or dataset: (give details below)

To reproduce

The bug relates to PR #8016.

Steps to reproduce the behavior:

  1. Install MLflow and export the following environment variables:
export HF_MLFLOW_LOG_ARTIFACTS=TRUE
export MLFLOW_S3_ENDPOINT_URL=<custom endpoint>
export MLFLOW_TRACKING_URI=<custom uri>
export MLFLOW_TRACKING_TOKEN=<custom token>
  2. Run the token classification example with the following command:
python run_ner.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name conll2003 \
  --output_dir /tmp/test-ner \
  --do_train \
  --do_eval

Expected behavior

When training finishes, before evaluation is performed, integrations.MLflowCallback executes its on_train_end method; if the environment variable HF_MLFLOW_LOG_ARTIFACTS is set to TRUE, it is expected to log the model artifacts to MLflow.

The problem, however, is that when on_train_end is called and the line self._ml_flow.log_artifacts(args.output_dir) is executed, the model has not yet been written to args.output_dir. The model artifacts are only stored once trainer.save_model() is called, which happens after training ends. There is no hook in trainer.save_model() that a TrainerCallback could use to log the saved model. There is a TrainerCallback.on_save() method, called from trainer._maybe_log_save_evaluate(), but even then the model is not available in output_dir.
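
For context, a condensed sketch of the behaviour described above, assuming a simplified version of integrations.MLflowCallback (the real callback also sets up the MLflow run and logs parameters; this is not the exact upstream code):

import os
import mlflow
from transformers import TrainerCallback

class MLflowCallbackSketch(TrainerCallback):
    # Condensed illustration of the artifact-logging path, not the upstream code.
    def __init__(self):
        self._ml_flow = mlflow
        self._log_artifacts = os.getenv("HF_MLFLOW_LOG_ARTIFACTS", "FALSE").upper() == "TRUE"

    def on_train_end(self, args, state, control, **kwargs):
        # Fires when training ends, i.e. before trainer.save_model() has run,
        # so args.output_dir does not yet contain the final model files.
        if self._log_artifacts and state.is_world_process_zero:
            self._ml_flow.log_artifacts(args.output_dir)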

A possible solution would be to extend TrainerCallback with an on_model_save() callback method and invoke it from trainer.save_model(). Alternatively, the workaround I use now is to replace on_train_end with on_evaluate in integrations.MLflowCallback, since on_evaluate is called after the model is saved in the example script. However, this is not the right solution, since it depends on the do_eval parameter being set and is not semantically correct.
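
The workaround can also be approximated without patching the library by attaching a small custom callback that logs the contents of output_dir when on_evaluate fires, which in run_ner.py happens after trainer.save_model(). This is a hypothetical sketch (the class name and the env-variable check are assumptions), and it shares the same caveat: it only helps when do_eval is set.

import os
import mlflow
from transformers import TrainerCallback

class LogArtifactsOnEvaluateCallback(TrainerCallback):
    # Hypothetical workaround sketch: log output_dir to MLflow after evaluation,
    # i.e. once trainer.save_model() has written the final model to disk.
    def on_evaluate(self, args, state, control, **kwargs):
        if state.is_world_process_zero and os.getenv("HF_MLFLOW_LOG_ARTIFACTS", "FALSE").upper() == "TRUE":
            mlflow.log_artifacts(args.output_dir)

It can be attached with trainer.add_callback(LogArtifactsOnEvaluateCallback()) before calling trainer.train().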

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Mar 25, 2021

However, not having a callback hook on save_model would be difficult.

Note that this hook would be called when each checkpoint is saved, not just at the end of training. So you would not only save the last model.
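
To illustrate that point, a hypothetical sketch using the existing on_save callback: it would run at every checkpoint save, so every checkpoint directory would be uploaded, not only the final model. The checkpoint path construction below is an assumption for illustration.

import os
import mlflow
from transformers import TrainerCallback

class LogEveryCheckpointCallback(TrainerCallback):
    # Hypothetical illustration: a hook tied to saving fires on every checkpoint,
    # so each checkpoint-<step> directory would be logged, not just the last model.
    def on_save(self, args, state, control, **kwargs):
        # Assumed checkpoint naming convention, guarded by an existence check.
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        if state.is_world_process_zero and os.path.isdir(ckpt_dir):
            mlflow.log_artifacts(ckpt_dir, artifact_path=f"checkpoint-{state.global_step}")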

0 reactions
github-actions[bot] commented, Apr 23, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
