MLflow fails to log to a tracking server
See original GitHub issue

System Info
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)
print(transformers.__version__)  # 4.20.1
print(mlflow.__version__)  # 1.27.0
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
- Install mlflow
- Configure a vanilla training job to use a tracking server (os.environ["MLFLOW_TRACKING_URI"] = "…"); a minimal connectivity check is sketched after this list
- Run the job
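For the tracking-server configuration step, a minimal standalone connectivity check can look like the sketch below (the URI is a placeholder, not an actual endpoint):

import os
import mlflow

os.environ["MLFLOW_TRACKING_URI"] = "http://<tracking-server-host>:5000"  # placeholder

# Log a single parameter to confirm the client can reach the server at all.
with mlflow.start_run():
    mlflow.log_param("smoke_test", 1)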
You should see an error similar to:
Traceback (most recent call last):
File "/home/ubuntu/train.py", line 45, in <module>
trainer.train()
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1409, in train
return inner_training_loop(
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1580, in _inner_training_loop
self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer_callback.py", line 347, in on_train_begin
return self.call_event("on_train_begin", args, state, control)
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/trainer_callback.py", line 388, in call_event
result = getattr(callback, event)(
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/integrations.py", line 856, in on_train_begin
self.setup(args, state, model)
File "/home/ubuntu/.local/lib/python3.9/site-packages/transformers/integrations.py", line 847, in setup
self._ml_flow.log_params(dict(combined_dict_items[i : i + self._MAX_PARAMS_TAGS_PER_BATCH]))
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 675, in log_params
MlflowClient().log_batch(run_id=run_id, metrics=[], params=params_arr, tags=[])
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/tracking/client.py", line 918, in log_batch
self._tracking_client.log_batch(run_id, metrics, params, tags)
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 315, in log_batch
self.store.log_batch(
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 309, in log_batch
self._call_endpoint(LogBatch, req_body)
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py", line 56, in _call_endpoint
return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 256, in call_endpoint
response = verify_rest_response(response, endpoint)
File "/home/ubuntu/.local/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 185, in verify_rest_response
raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: Invalid value [{'key': 'logging_nan_inf_filter', 'value': 'True'}, {'key': 'save_strategy', 'value': 'epoch'}, {'key': 'save_steps', 'value': '500'}, {'key': 'save_total_limit', 'value': 'None'}, {'key': 'save_on_each_node', 'value': 'False'}, {'key': 'no_cuda', 'value': 'False'}, {'key': 'seed', 'value': '42'}, {'key': 'data_seed', 'value': 'None'}, {'key': 'jit_mode_eval', 'value': 'False'}, {'key': 'use_ipex', 'value': 'False'}, {'key': 'bf16', 'value': 'False'}, {'key': 'fp16', 'value': 'False'}, {'key': 'fp16_opt_level', 'value': 'O1'}, {'key': 'half_precision_backend', 'value': 'auto'}, {'key': 'bf16_full_eval', 'value': 'False'}, {'key': 'fp16_full_eval', 'value': 'False'}, {'key': 'tf32', 'value': 'None'}, {'key': 'local_rank', 'value': '-1'}, {'key': 'xpu_backend', 'value': 'None'}, {'key': 'tpu_num_cores', 'value': 'None'}, {'key': 'tpu_metrics_debug', 'value': 'False'}, {'key': 'debug', 'value': '[]'}, {'key': 'dataloader_drop_last', 'value': 'False'}, {'key': 'eval_steps', 'value': 'None'}, {'key': 'dataloader_num_workers', 'value': '0'}, {'key': 'past_index', 'value': '-1'}, {'key': 'run_name', 'value': './output'}, {'key': 'disable_tqdm', 'value': 'False'}, {'key': 'remove_unused_columns', 'value': 'True'}, {'key': 'label_names', 'value': 'None'}, {'key': 'load_best_model_at_end', 'value': 'False'}, {'key': 'metric_for_best_model', 'value': 'None'}, {'key': 'greater_is_better', 'value': 'None'}, {'key': 'ignore_data_skip', 'value': 'False'}, {'key': 'sharded_ddp', 'value': '[]'}, {'key': 'fsdp', 'value': '[]'}, {'key': 'fsdp_min_num_params', 'value': '0'}, {'key': 'deepspeed', 'value': 'None'}, {'key': 'label_smoothing_factor', 'value': '0.0'}, {'key': 'optim', 'value': 'adamw_hf'}, {'key': 'adafactor', 'value': 'False'}, {'key': 'group_by_length', 'value': 'False'}, {'key': 'length_column_name', 'value': 'length'}, {'key': 'report_to', 'value': "['mlflow']"}, {'key': 'ddp_find_unused_parameters', 'value': 'None'}, {'key': 'ddp_bucket_cap_mb', 'value': 'None'}, {'key': 'dataloader_pin_memory', 'value': 'True'}, {'key': 'skip_memory_metrics', 'value': 'True'}, {'key': 'use_legacy_prediction_loop', 'value': 'False'}, {'key': 'push_to_hub', 'value': 'False'}, {'key': 'resume_from_checkpoint', 'value': 'None'}, {'key': 'hub_model_id', 'value': 'None'}, {'key': 'hub_strategy', 'value': 'every_save'}, {'key': 'hub_token', 'value': '<HUB_TOKEN>'}, {'key': 'hub_private_repo', 'value': 'False'}, {'key': 'gradient_checkpointing', 'value': 'False'}, {'key': 'include_inputs_for_metrics', 'value': 'False'}, {'key': 'fp16_backend', 'value': 'auto'}, {'key': 'push_to_hub_model_id', 'value': 'None'}, {'key': 'push_to_hub_organization', 'value': 'None'}, {'key': 'push_to_hub_token', 'value': '<PUSH_TO_HUB_TOKEN>'}, {'key': '_n_gpu', 'value': '1'}, {'key': 'mp_parameters', 'value': ''}, {'key': 'auto_find_batch_size', 'value': 'False'}, {'key': 'full_determinism', 'value': 'False'}, {'key': 'torchdynamo', 'value': 'None'}, {'key': 'ray_scope', 'value': 'last'}] for parameter 'params' supplied. Hint: Value was of type 'list'. See the API docs for more information about request parameters.
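The failing frame in transformers/integrations.py chunks the (optionally flattened) TrainingArguments into batches and calls mlflow.log_params on each chunk. A standalone sketch of that call path (toy parameters; the batch size of 100 is an assumption mirroring MLflow's MAX_PARAMS_TAGS_PER_BATCH) can help check whether plain mlflow.log_params fails against the same server independently of the Trainer:

import mlflow

# Toy subset of the flattened TrainingArguments the callback tries to log.
params = {"num_train_epochs": "1", "save_strategy": "epoch", "seed": "42"}

MAX_PARAMS_TAGS_PER_BATCH = 100  # assumed value, mirroring mlflow.utils.validation

with mlflow.start_run():
    items = list(params.items())
    # Same batching pattern as MLflowCallback.setup() in transformers/integrations.py
    for i in range(0, len(items), MAX_PARAMS_TAGS_PER_BATCH):
        mlflow.log_params(dict(items[i : i + MAX_PARAMS_TAGS_PER_BATCH]))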
Training script:
import os
import numpy as np
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, Trainer, TrainingArguments, AutoModelForSequenceClassification

train_dataset, test_dataset = load_dataset("imdb", split=['train', 'test'])
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=2)
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

os.environ["HF_MLFLOW_LOG_ARTIFACTS"] = "1"
os.environ["MLFLOW_EXPERIMENT_NAME"] = "trainer-mlflow-demo"
os.environ["MLFLOW_FLATTEN_PARAMS"] = "1"
# os.environ["MLFLOW_TRACKING_URI"] = <MY_SERVER IP>

training_args = TrainingArguments(
    num_train_epochs=1,
    output_dir="./output",
    logging_steps=500,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
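As an isolation step (a sketch, not part of the failing run above), the MLflow callback can be disabled entirely via the documented report_to argument to confirm the training loop itself runs:

# Hypothetical isolation step: skip the MLflow integration so log_params is never called.
training_args = TrainingArguments(
    num_train_epochs=1,
    output_dir="./output",
    logging_steps=500,
    save_strategy="epoch",
    report_to=[],  # or "none"; no MLflowCallback is attached
)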
Expected behavior
I would expect logging to work 😃
Issue Analytics
- Created a year ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
@noise-field wrote the integration two years ago; do you have an idea of why it doesn't seem to work anymore, @noise-field?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.