
MLflow fails to log to a tracking server

See original GitHub issue

System Info

Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)

print(transformers.__version__)  # 4.20.1

print(mlflow.__version__)  # 1.27.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Install mlflow
  2. Configure a vanilla training job to use a tracking server (os.environ["MLFLOW_TRACKING_URI"] = "…"); see the sketch after this list
  3. Run the job
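
A minimal sketch of step 2, assuming the tracking server is reachable over plain HTTP; the host and port are placeholders rather than values from the original report:

import os

# Placeholder address; point this at your own MLflow tracking server
# before constructing the Trainer, so the MLflow callback picks it up.
os.environ["MLFLOW_TRACKING_URI"] = "http://<tracking-server-host>:5000"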

You should see an error similar to:

Traceback (most recent call last):
  File "/home/ubuntu/train.py", line 45, in <module>
    trainer.train()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1409, in train
    return inner_training_loop(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1580, in _inner_training_loop
    self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer_callback.py", line 347, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer_callback.py", line 388, in call_event
    result = getattr(callback, event)(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/integrations.py", line 856, in on_train_begin
    self.setup(args, state, model)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/integrations.py", line 847, in setup
    self._ml_flow.log_params(dict(combined_dict_items[i : i + self._MAX_PARAMS_TAGS_PER_BATCH]))
  File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 675, in log_params
    MlflowClient().log_batch(run_id=run_id, metrics=[], params=params_arr, tags=[])
  File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/tracking/client.py", line 918, in log_batch
    self._tracking_client.log_batch(run_id, metrics, params, tags)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 315, in log_batch
    self.store.log_batch(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 309, in log_batch
    self._call_endpoint(LogBatch, req_body)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 56, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 256, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 185, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: Invalid value [{'key': 'logging_nan_inf_filter', 'value': 'True'}, {'key': 'save_strategy', 'value': 'epoch'}, {'key': 'save_steps', 'value': '500'}, {'key': 'save_total_limit', 'value': 'None'}, {'key': 'save_on_each_node', 'value': 'False'}, {'key': 'no_cuda', 'value': 'False'}, {'key': 'seed', 'value': '42'}, {'key': 'data_seed', 'value': 'None'}, {'key': 'jit_mode_eval', 'value': 'False'}, {'key': 'use_ipex', 'value': 'False'}, {'key': 'bf16', 'value': 'False'}, {'key': 'fp16', 'value': 'False'}, {'key': 'fp16_opt_level', 'value': 'O1'}, {'key': 'half_precision_backend', 'value': 'auto'}, {'key': 'bf16_full_eval', 'value': 'False'}, {'key': 'fp16_full_eval', 'value': 'False'}, {'key': 'tf32', 'value': 'None'}, {'key': 'local_rank', 'value': '-1'}, {'key': 'xpu_backend', 'value': 'None'}, {'key': 'tpu_num_cores', 'value': 'None'}, {'key': 'tpu_metrics_debug', 'value': 'False'}, {'key': 'debug', 'value': '[]'}, {'key': 'dataloader_drop_last', 'value': 'False'}, {'key': 'eval_steps', 'value': 'None'}, {'key': 'dataloader_num_workers', 'value': '0'}, {'key': 'past_index', 'value': '-1'}, {'key': 'run_name', 'value': './output'}, {'key': 'disable_tqdm', 'value': 'False'}, {'key': 'remove_unused_columns', 'value': 'True'}, {'key': 'label_names', 'value': 'None'}, {'key': 'load_best_model_at_end', 'value': 'False'}, {'key': 'metric_for_best_model', 'value': 'None'}, {'key': 'greater_is_better', 'value': 'None'}, {'key': 'ignore_data_skip', 'value': 'False'}, {'key': 'sharded_ddp', 'value': '[]'}, {'key': 'fsdp', 'value': '[]'}, {'key': 'fsdp_min_num_params', 'value': '0'}, {'key': 'deepspeed', 'value': 'None'}, {'key': 'label_smoothing_factor', 'value': '0.0'}, {'key': 'optim', 'value': 'adamw_hf'}, {'key': 'adafactor', 'value': 'False'}, {'key': 'group_by_length', 'value': 'False'}, {'key': 'length_column_name', 'value': 'length'}, {'key': 'report_to', 'value': "['mlflow']"}, {'key': 'ddp_find_unused_parameters', 'value': 'None'}, {'key': 'ddp_bucket_cap_mb', 'value': 'None'}, {'key': 'dataloader_pin_memory', 'value': 'True'}, {'key': 'skip_memory_metrics', 'value': 'True'}, {'key': 'use_legacy_prediction_loop', 'value': 'False'}, {'key': 'push_to_hub', 'value': 'False'}, {'key': 'resume_from_checkpoint', 'value': 'None'}, {'key': 'hub_model_id', 'value': 'None'}, {'key': 'hub_strategy', 'value': 'every_save'}, {'key': 'hub_token', 'value': '<HUB_TOKEN>'}, {'key': 'hub_private_repo', 'value': 'False'}, {'key': 'gradient_checkpointing', 'value': 'False'}, {'key': 'include_inputs_for_metrics', 'value': 'False'}, {'key': 'fp16_backend', 'value': 'auto'}, {'key': 'push_to_hub_model_id', 'value': 'None'}, {'key': 'push_to_hub_organization', 'value': 'None'}, {'key': 'push_to_hub_token', 'value': '<PUSH_TO_HUB_TOKEN>'}, {'key': '_n_gpu', 'value': '1'}, {'key': 'mp_parameters', 'value': ''}, {'key': 'auto_find_batch_size', 'value': 'False'}, {'key': 'full_determinism', 'value': 'False'}, {'key': 'torchdynamo', 'value': 'None'}, {'key': 'ray_scope', 'value': 'last'}] for parameter 'params' supplied. Hint: Value was of type 'list'. See the API docs for more information about request parameters.
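
One way to narrow this down (a diagnostic sketch, not part of the original report) is to call MLflow's params API directly against the same tracking server, bypassing the transformers callback. If this minimal snippet raises the same RestException, the problem sits between the MLflow client and the server (for instance a client/server version mismatch) rather than in the Trainer integration. The URI is a placeholder:

import mlflow

mlflow.set_tracking_uri("http://<tracking-server-host>:5000")  # placeholder
mlflow.set_experiment("trainer-mlflow-demo")

with mlflow.start_run():
    # mlflow.log_params issues the same LogBatch REST call that fails in the
    # traceback above (fluent.py -> MlflowClient.log_batch).
    mlflow.log_params({"save_strategy": "epoch", "seed": 42})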

Training script:

import os
import numpy as np
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, Trainer, TrainingArguments, AutoModelForSequenceClassification

train_dataset, test_dataset = load_dataset("imdb", split=['train', 'test'])

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=2)

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

os.environ["HF_MLFLOW_LOG_ARTIFACTS"]="1"
os.environ["MLFLOW_EXPERIMENT_NAME"]="trainer-mlflow-demo"
os.environ["MLFLOW_FLATTEN_PARAMS"]="1"
#os.environ["MLFLOW_TRACKING_URI"]=<MY_SERVER IP>

training_args = TrainingArguments(
    num_train_epochs=1,
    output_dir="./output",
    logging_steps=500,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

trainer.train()
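
Not part of the original report, but a common way to unblock training while the logging problem is investigated is to switch the MLflow integration off explicitly via report_to:

training_args = TrainingArguments(
    num_train_epochs=1,
    output_dir="./output",
    logging_steps=500,
    save_strategy="epoch",
    report_to="none",  # disables all reporting integrations, including MLflow
)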

Expected behavior

I would expect logging to work 😃
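
As a rough way to verify that outcome (an assumption about the desired end state, not something from the original issue), the run and its flattened TrainingArguments should be queryable from the tracking server once logging works, for example:

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://<tracking-server-host>:5000")  # placeholder
experiment = client.get_experiment_by_name("trainer-mlflow-demo")
for run in client.search_runs([experiment.experiment_id]):
    print(run.info.run_id, run.data.params.get("seed"), run.data.params.get("save_strategy"))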

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Jul 27, 2022

@noise-field wrote the integration two years ago; do you have an idea of why it doesn’t seem to work anymore, @noise-field?

0 reactions
github-actions[bot] commented, Aug 30, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Read more comments on GitHub >

Top Results From Across the Web

  • MLflow Tracking — MLflow 2.0.1 documentation: Using the Tracking Server for proxied artifact access. Logging to a Tracking Server ... If the experiment does not exist, creates a new...
  • Mlflow unable to log sklearn model - python - Stack Overflow: models.model: Logging model metadata to the tracking server has failed, possibly due older server version. The model artifacts have been logged ...
  • Configure mlflow inside your project: Context: mlflow tracking under the hood ... Basically, this schema shows that mlflow separates WHERE the artifacts are logged from HOW...
  • Log, load, register, and deploy MLflow models: To log a model to the MLflow tracking server, use mlflow.<model-type>.log_model(model, ...) . To load a previously logged model for inference or ...
  • Setup MLflow in Production - Towards Data Science: The next step is creating a directory for our Tracking Server to log the Machine Learning models and other artifacts. Remember that the...
