MLflow fails to log to a tracking server
See original GitHub issue

System Info
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)
print(transformers.__version__)  # 4.20.1
print(mlflow.__version__)  # 1.27.0
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
- Install mlflow
- Configure a vanilla training job to use a tracking server (os.environ["MLFLOW_TRACKING_URI"] = "…"); a minimal connectivity check is sketched after this list
- Run the job
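For the tracking-server configuration step, a minimal standalone connectivity check can look like the sketch below (the URI is a placeholder, not an actual endpoint):

import os
import mlflow

os.environ["MLFLOW_TRACKING_URI"] = "http://<tracking-server-host>:5000"  # placeholder

# Log a single parameter to confirm the client can reach the server at all.
with mlflow.start_run():
    mlflow.log_param("smoke_test", 1)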
You should see an error similar to:
Traceback (most recent call last):
File "/home/ubuntu/train.py", line 45, in <module>
trainer.train()
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1409, in train
return inner_training_loop(
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1580, in _inner_training_loop
self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer_callback.py", line 347, in on_train_begin
return self.call_event("on_train_begin", args, state, control)
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer_callback.py", line 388, in call_event
result = getattr(callback, event)(
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/integrations.py", line 856, in on_train_begin
self.setup(args, state, model)
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/integrations.py", line 847, in setup
self._ml_flow.log_params(dict(combined_dict_items[i : i + self._MAX_PARAMS_TAGS_PER_BATCH]))
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 675, in log_params
MlflowClient().log_batch(run_id=run_id, metrics=[], params=params_arr, tags=[])
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/tracking/client.py", line 918, in log_batch
self._tracking_client.log_batch(run_id, metrics, params, tags)
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 315, in log_batch
self.store.log_batch(
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 309, in log_batch
self._call_endpoint(LogBatch, req_body)
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 56, in _call_endpoint
return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 256, in call_endpoint
response = verify_rest_response(response, endpoint)
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 185, in verify_rest_response
raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: Invalid value [{'key': 'logging_nan_inf_filter', 'value': 'True'}, {'key': 'save_strategy', 'value': 'epoch'}, {'key': 'save_steps', 'value': '500'}, {'key': 'save_total_limit', 'value': 'None'}, {'key': 'save_on_each_node', 'value': 'False'}, {'key': 'no_cuda', 'value': 'False'}, {'key': 'seed', 'value': '42'}, {'key': 'data_seed', 'value': 'None'}, {'key': 'jit_mode_eval', 'value': 'False'}, {'key': 'use_ipex', 'value': 'False'}, {'key': 'bf16', 'value': 'False'}, {'key': 'fp16', 'value': 'False'}, {'key': 'fp16_opt_level', 'value': 'O1'}, {'key': 'half_precision_backend', 'value': 'auto'}, {'key': 'bf16_full_eval', 'value': 'False'}, {'key': 'fp16_full_eval', 'value': 'False'}, {'key': 'tf32', 'value': 'None'}, {'key': 'local_rank', 'value': '-1'}, {'key': 'xpu_backend', 'value': 'None'}, {'key': 'tpu_num_cores', 'value': 'None'}, {'key': 'tpu_metrics_debug', 'value': 'False'}, {'key': 'debug', 'value': '[]'}, {'key': 'dataloader_drop_last', 'value': 'False'}, {'key': 'eval_steps', 'value': 'None'}, {'key': 'dataloader_num_workers', 'value': '0'}, {'key': 'past_index', 'value': '-1'}, {'key': 'run_name', 'value': './output'}, {'key': 'disable_tqdm', 'value': 'False'}, {'key': 'remove_unused_columns', 'value': 'True'}, {'key': 'label_names', 'value': 'None'}, {'key': 'load_best_model_at_end', 'value': 'False'}, {'key': 'metric_for_best_model', 'value': 'None'}, {'key': 'greater_is_better', 'value': 'None'}, {'key': 'ignore_data_skip', 'value': 'False'}, {'key': 'sharded_ddp', 'value': '[]'}, {'key': 'fsdp', 'value': '[]'}, {'key': 'fsdp_min_num_params', 'value': '0'}, {'key': 'deepspeed', 'value': 'None'}, {'key': 'label_smoothing_factor', 'value': '0.0'}, {'key': 'optim', 'value': 'adamw_hf'}, {'key': 'adafactor', 'value': 'False'}, {'key': 'group_by_length', 'value': 'False'}, {'key': 'length_column_name', 'value': 'length'}, {'key': 'report_to', 'value': "['mlflow']"}, {'key': 'ddp_find_unused_parameters', 'value': 'None'}, {'key': 'ddp_bucket_cap_mb', 'value': 'None'}, {'key': 'dataloader_pin_memory', 'value': 'True'}, {'key': 'skip_memory_metrics', 'value': 'True'}, {'key': 'use_legacy_prediction_loop', 'value': 'False'}, {'key': 'push_to_hub', 'value': 'False'}, {'key': 'resume_from_checkpoint', 'value': 'None'}, {'key': 'hub_model_id', 'value': 'None'}, {'key': 'hub_strategy', 'value': 'every_save'}, {'key': 'hub_token', 'value': '<HUB_TOKEN>'}, {'key': 'hub_private_repo', 'value': 'False'}, {'key': 'gradient_checkpointing', 'value': 'False'}, {'key': 'include_inputs_for_metrics', 'value': 'False'}, {'key': 'fp16_backend', 'value': 'auto'}, {'key': 'push_to_hub_model_id', 'value': 'None'}, {'key': 'push_to_hub_organization', 'value': 'None'}, {'key': 'push_to_hub_token', 'value': '<PUSH_TO_HUB_TOKEN>'}, {'key': '_n_gpu', 'value': '1'}, {'key': 'mp_parameters', 'value': ''}, {'key': 'auto_find_batch_size', 'value': 'False'}, {'key': 'full_determinism', 'value': 'False'}, {'key': 'torchdynamo', 'value': 'None'}, {'key': 'ray_scope', 'value': 'last'}] for parameter 'params' supplied. Hint: Value was of type 'list'. See the API docs for more information about request parameters.
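The failing frame in transformers/integrations.py chunks the (optionally flattened) TrainingArguments into batches and calls mlflow.log_params on each chunk. A standalone sketch of that call path (toy parameters; the batch size of 100 is an assumption mirroring MLflow's MAX_PARAMS_TAGS_PER_BATCH) can help check whether plain mlflow.log_params fails against the same server independently of the Trainer:

import mlflow

# Toy subset of the flattened TrainingArguments the callback tries to log.
params = {"num_train_epochs": "1", "save_strategy": "epoch", "seed": "42"}

MAX_PARAMS_TAGS_PER_BATCH = 100  # assumed value, mirroring mlflow.utils.validation

with mlflow.start_run():
    items = list(params.items())
    # Same batching pattern as MLflowCallback.setup() in transformers/integrations.py
    for i in range(0, len(items), MAX_PARAMS_TAGS_PER_BATCH):
        mlflow.log_params(dict(items[i : i + MAX_PARAMS_TAGS_PER_BATCH]))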
Training script:
import os
import numpy as np
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, Trainer, TrainingArguments, AutoModelForSequenceClassification

train_dataset, test_dataset = load_dataset("imdb", split=['train', 'test'])
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=2)
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

os.environ["HF_MLFLOW_LOG_ARTIFACTS"] = "1"
os.environ["MLFLOW_EXPERIMENT_NAME"] = "trainer-mlflow-demo"
os.environ["MLFLOW_FLATTEN_PARAMS"] = "1"
# os.environ["MLFLOW_TRACKING_URI"] = <MY_SERVER IP>

training_args = TrainingArguments(
    num_train_epochs=1,
    output_dir="./output",
    logging_steps=500,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
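As an isolation step (a sketch, not part of the failing run above), the MLflow callback can be disabled entirely via the documented report_to argument to confirm the training loop itself runs:

# Hypothetical isolation step: skip the MLflow integration so log_params is never called.
training_args = TrainingArguments(
    num_train_epochs=1,
    output_dir="./output",
    logging_steps=500,
    save_strategy="epoch",
    report_to=[],  # or "none"; no MLflowCallback is attached
)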
Expected behavior
I would expect logging to work 😃
Issue Analytics
- Created a year ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
@noise-field wrote the integration two years ago; do you have an idea of why it doesn't seem to work anymore, @noise-field?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.