Cannot change attributes of finished trial

Hi,

I really like Optuna (2.10). Thanks for this great tool 😃

However, I get many failed trials with the following error message:

Traceback (most recent call last):
  File "/u/twagner/conda-envs/tomotwin_opt/lib/python3.9/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "/u/twagner/conda-envs/tomotwin_opt/lib/python3.9/site-packages/tomotwin/train_optuna.py", line 157, in objective
    trial.report(val_loss, epoch)
  File "/u/twagner/conda-envs/tomotwin_opt/lib/python3.9/site-packages/optuna/trial/_trial.py", line 597, in report
    self.storage.set_trial_intermediate_value(self._trial_id, step, value)
  File "/u/twagner/conda-envs/tomotwin_opt/lib/python3.9/site-packages/optuna/storages/_cached_storage.py", line 318, in set_trial_intermediate_value
    self._flush_trial(trial_id)
  File "/u/twagner/conda-envs/tomotwin_opt/lib/python3.9/site-packages/optuna/storages/_cached_storage.py", line 428, in _flush_trial
    return self._backend._update_trial(
  File "/u/twagner/conda-envs/tomotwin_opt/lib/python3.9/site-packages/optuna/storages/_rdb/storage.py", line 671, in _update_trial
    raise RuntimeError("Cannot change attributes of finished trial.")
RuntimeError: Cannot change attributes of finished trial.

Here are some statistics about the study:

Number of finished trials:  67
Pruned: 22
Completed: 12
Failed: 22
Waiting: 0
Running: 11

For another study that has been running much longer, it is even worse:

Number of finished trials:  309
Pruned: 45
Completed: 7
Failed: 251
Waiting: 0
Running: 6

BTW: Why are running trials listed as finished trials?

Here is how I set up the study: https://gist.github.com/thorstenwagner/bde99f26295809882ab3315ad8be0b5b

And this is my objective: https://gist.github.com/thorstenwagner/5e39db92b0198021ee194fcf42730ae3

Can someone tell me what is wrong with my setup?

The whole experiment runs on an HPC cluster with 11 processes in parallel. The file system is GPFS (not NFS), so locking should be supported without flaws.
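
On the side question of why running trials appear under "finished trials", the numbers in the post suggest an answer. In both studies the "finished" figure equals the sum over all listed states, running included. It therefore looks like a count of all trials (for example `len(study.trials)`, which is an assumption about the poster's script and is not confirmed in the issue) rather than a count of trials that have actually ended. A small check using only the figures quoted above:

```python
# Counts copied from the two statistics blocks in the issue.
# The dictionary keys and helper names are made up for this sketch.
studies = {
    "first": {"reported": 67, "pruned": 22, "completed": 12,
              "failed": 22, "waiting": 0, "running": 11},
    "second": {"reported": 309, "pruned": 45, "completed": 7,
               "failed": 251, "waiting": 0, "running": 6},
}

FINISHED_STATES = ("pruned", "completed", "failed")
ALL_STATES = FINISHED_STATES + ("waiting", "running")


def total_trials(counts):
    """Sum over every state, including waiting and running trials."""
    return sum(counts[state] for state in ALL_STATES)


def finished_trials(counts):
    """Sum over the states in which a trial has actually ended."""
    return sum(counts[state] for state in FINISHED_STATES)


for name, counts in studies.items():
    print(name, counts["reported"], total_trials(counts), finished_trials(counts))
```

For both studies the reported number matches `total_trials`, not `finished_trials`, which is consistent with the label being misleading rather than the trials being in a wrong state.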

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (6 by maintainers)

Top GitHub Comments

1 reaction
thorstenwagner commented, Dec 21, 2021

Just got the error again locally. With heartbeat 60 😦

1 reaction
himkt commented, Dec 20, 2021

@thorstenwagner

Thank you for explaining, I think I understand your situation.

"The problem also occurs locally on my computer (running 2 processes on two GPUs)"

Hmm, I see. I couldn't reproduce the problem with a simple example… If you find a way to reproduce it, please tell me. It would help us a lot.

"To be honest, I’m not sure what it actually does. Maybe you can explain? 😃"

OK, I'll try to explain. In some situations, processes running trials are killed suddenly. Typical cases are spot instances on AWS and preemptible instances on GCP. When a process is killed this way, its trials remain in the study in the running state. The heartbeat mechanism checks whether each process is still alive and, if a process no longer responds, finishes its trial by changing the trial's state from running to fail.
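
To make the mechanism concrete, here is a minimal toy model in plain Python. It does not use Optuna: `ToyStorage`, its methods, and the integer clock are hypothetical names invented for illustration. It sketches one way the error from this issue could arise, assuming a worker that is still alive has its trial marked as failed because its heartbeat looked stale. Whether that is what happened here was not confirmed in the thread.

```python
import enum


class State(enum.Enum):
    RUNNING = "running"
    FAIL = "fail"


class ToyStorage:
    """Hypothetical in-memory stand-in for a shared trial database."""

    def __init__(self, grace_period):
        self.grace_period = grace_period
        self.state = {}          # trial_id -> State
        self.last_beat = {}      # trial_id -> time of the last heartbeat
        self.intermediate = {}   # trial_id -> {step: value}

    def start_trial(self, trial_id, now):
        self.state[trial_id] = State.RUNNING
        self.last_beat[trial_id] = now
        self.intermediate[trial_id] = {}

    def record_heartbeat(self, trial_id, now):
        self.last_beat[trial_id] = now

    def fail_stale_trials(self, now):
        """Mark RUNNING trials with an old heartbeat as FAIL; return their ids."""
        stale = [
            trial_id
            for trial_id, state in self.state.items()
            if state is State.RUNNING
            and now - self.last_beat[trial_id] > self.grace_period
        ]
        for trial_id in stale:
            self.state[trial_id] = State.FAIL
        return stale

    def report(self, trial_id, step, value):
        # A finished trial is read-only, as in the traceback above.
        if self.state[trial_id] is not State.RUNNING:
            raise RuntimeError("Cannot change attributes of finished trial.")
        self.intermediate[trial_id][step] = value


storage = ToyStorage(grace_period=2)
storage.start_trial(0, now=0)
storage.report(0, step=0, value=0.9)        # accepted: the trial is RUNNING

# The worker is alive but records no heartbeat for a while.
failed = storage.fail_stale_trials(now=5)   # check done by another worker

try:
    storage.report(0, step=1, value=0.8)    # the live worker reports again
    message = None
except RuntimeError as err:
    message = str(err)

print(failed, message)
```

In this model the second `report` fails only because the heartbeat gap (5 time units) exceeds the grace period (2). A longer grace period or no heartbeat at all would let the trial continue, which is the idea behind disabling the heartbeat when process interruption is not a concern.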

🔗 Heartbeat: added in https://github.com/optuna/optuna/releases/tag/v2.5.0
🔗 optuna.storages.RetryFailedTrialCallback: added in https://github.com/optuna/optuna/releases/tag/v2.8.0

If you don’t need to handle process interruption, you can simply use cached storage with the following code:

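import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler

# Passing a URL string lets Optuna build the storage itself; heartbeat is off by default.
# PARAMS, N_WARMUP_STEPS and STUDY_NAME come from the poster's setup script.
study = optuna.create_study(
    direction=PARAMS["general"]["minmax"],
    sampler=TPESampler(),
    pruner=MedianPruner(n_warmup_steps=N_WARMUP_STEPS),
    study_name=STUDY_NAME,
    storage=f"sqlite:///{STUDY_NAME}.db",
    load_if_exists=True,
)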

instead of

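import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.storages import RetryFailedTrialCallback

# Explicit RDBStorage with heartbeat enabled (the poster's original setup).
# MAX_RETRY, like the other constants, comes from the poster's setup script.
storage = optuna.storages.RDBStorage(
    url=f"sqlite:///{STUDY_NAME}.db",
    engine_kwargs={
        "connect_args": {
            "timeout": 60000  # forwarded to sqlite3.connect(), which takes seconds
        }
    },
    heartbeat_interval=1,  # heartbeat recorded every second
    failed_trial_callback=RetryFailedTrialCallback(max_retry=MAX_RETRY),
)

study = optuna.create_study(
    direction=PARAMS["general"]["minmax"],
    sampler=TPESampler(),
    pruner=MedianPruner(n_warmup_steps=N_WARMUP_STEPS),
    study_name=STUDY_NAME,
    storage=storage,
    load_if_exists=True,
)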