Duplicate trial numbers in distributed optimization
When I run multiple Optuna processes on cloud servers managed by a job scheduler, Optuna occasionally assigns duplicate trial numbers (trial.number) to trials that start at almost the same time.
Expected behavior
Each trial obtains a distinct number; as a result, the trial numbers within a study are sequential.
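In other words, I would expect a check like the following to pass on a finished study; this is just my own way of stating the expectation, not anything Optuna provides:

# My own sanity check (not an Optuna API): trial numbers should be 0, 1, 2, ...
numbers = sorted(trial.number for trial in study.trials)
assert numbers == list(range(len(numbers)))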
Environment
- Optuna version: 1.5.0
- Python version: 3.8.3
- OS: A CentOS 7.7 container running on Singularity on a SUSE Linux Enterprise Server 12 SP4 host
- Other software versions: PostgreSQL 9.5.21, psycopg 2.8.5
Error messages, stack traces, or logs
A simplified copy of the output of the test program shown below looks like this:
[I 2020-07-07 00:07:24,134] Using an existing study with name 'test_optuna_collision_01' instead of creating a new one.
......
# x y NODENAME STARTED
0: -7.260 105.276 r6i5n2 2020-07-06 23:42:14.603784
1: 8.803 33.677 r6i5n2 2020-07-06 23:42:14.776819
2: -1.653 21.654 r6i5n2 2020-07-06 23:42:14.887652
3: 0.024 8.856 r6i5n2 2020-07-06 23:42:14.993142
4: 6.834 14.701 r6i5n2 2020-07-06 23:42:15.106251
5: -4.047 49.663 r6i5n2 2020-07-06 23:42:15.215284
6: 7.026 16.205 r6i5n2 2020-07-06 23:42:15.325813
7: -1.137 17.117 r6i5n2 2020-07-06 23:42:15.432296
8: 0.887 4.466 r6i5n2 2020-07-06 23:42:15.539958
9: -9.392 153.554 r6i5n2 2020-07-06 23:42:15.692091
10: 2.665 0.112 r6i5n2 2020-07-06 23:42:15.836753
11: 3.273 0.074 r6i5n2 2020-07-06 23:42:15.986390
12: 3.699 0.489 r6i5n2 2020-07-06 23:42:16.100333
13: 3.481 0.232 r6i5n2 2020-07-06 23:42:16.216604
14: 3.461 0.213 r6i5n2 2020-07-06 23:42:16.331074
15: 5.481 6.154 r6i5n2 2020-07-06 23:42:16.445370
16: 1.499 2.253 r6i5n2 2020-07-06 23:42:16.560353
17: -3.628 43.933 r6i5n2 2020-07-06 23:42:16.674461
18: 8.431 29.496 r6i5n2 2020-07-06 23:42:16.788654
19: 5.046 4.185 r6i5n2 2020-07-06 23:42:16.948887
20: 2.154 0.716 r3i5n6 2020-07-06 23:42:50.112204
20: 2.096 0.817 r3i5n6 2020-07-06 23:42:50.112892
20: 2.785 0.046 r2i2n5 2020-07-06 23:42:50.112536
20: 2.213 0.620 r2i2n5 2020-07-06 23:42:50.113374
20: 2.237 0.582 r3i5n6 2020-07-06 23:42:50.116946
25: 2.402 0.358 r3i5n6 2020-07-06 23:42:50.150646
25: 1.886 1.240 r3i5n6 2020-07-06 23:42:50.151986
26: 2.022 0.957 r2i2n5 2020-07-06 23:42:50.157673
28: 4.172 1.374 r2i2n5 2020-07-06 23:42:50.362914
28: 4.986 3.944 r2i2n5 2020-07-06 23:42:50.364238
28: 3.871 0.759 r3i5n6 2020-07-06 23:42:50.364441
28: 4.373 1.885 r3i5n6 2020-07-06 23:42:50.365054
28: 4.017 1.035 r3i5n6 2020-07-06 23:42:50.365227
28: 4.861 3.462 r3i5n6 2020-07-06 23:42:50.366699
28: 4.673 2.799 r3i5n6 2020-07-06 23:42:50.367405
35: 4.054 1.110 r2i2n5 2020-07-06 23:42:50.373785
36: 6.775 14.250 r2i2n5 2020-07-06 23:42:50.561707
36: -2.051 25.517 r2i2n5 2020-07-06 23:42:50.567540
37: 6.959 15.676 r3i5n6 2020-07-06 23:42:50.569821
38: 7.065 16.527 r3i5n6 2020-07-06 23:42:50.574569
38: -1.866 23.675 r3i5n6 2020-07-06 23:42:50.575655
39: 7.005 16.040 r3i5n6 2020-07-06 23:42:50.576151
39: 6.789 14.358 r3i5n6 2020-07-06 23:42:50.580927
40: 7.048 16.385 r2i2n5 2020-07-06 23:42:50.581646
44: 0.104 8.389 r2i2n5 2020-07-06 23:42:50.773439
45: 2.705 0.087 r2i2n5 2020-07-06 23:42:50.782284
46: 0.418 6.667 r3i5n6 2020-07-06 23:42:50.798938
47: 2.871 0.017 r2i2n5 2020-07-06 23:42:50.823891
47: 9.736 45.379 r3i5n6 2020-07-06 23:42:50.825717
47: 3.280 0.079 r3i5n6 2020-07-06 23:42:50.826783
47: 3.098 0.010 r3i5n6 2020-07-06 23:42:50.828179
51: 3.010 0.000 r3i5n6 2020-07-06 23:42:50.836315
52: 5.367 5.601 r2i2n5 2020-07-06 23:42:50.949296
53: 5.591 6.714 r2i2n5 2020-07-06 23:42:50.958469
54: 5.738 7.497 r3i5n6 2020-07-06 23:42:50.969918
55: 9.213 38.595 r2i2n5 2020-07-06 23:42:51.028821
56: -1.009 16.075 r3i5n6 2020-07-06 23:42:51.046327
56: -0.770 14.213 r3i5n6 2020-07-06 23:42:51.047563
56: -0.897 15.188 r3i5n6 2020-07-06 23:42:51.050477
56: 2.905 0.009 r3i5n6 2020-07-06 23:42:51.050697
60: -0.883 15.076 r2i2n5 2020-07-06 23:42:51.138237
61: -0.566 12.719 r2i2n5 2020-07-06 23:42:51.153265
62: 1.074 3.708 r3i5n6 2020-07-06 23:42:51.162551
63: -0.529 12.457 r2i2n5 2020-07-06 23:42:51.204287
......
In this result, five trials got the number 20.
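For reference, the duplicates can also be listed programmatically instead of reading them off the table. The helper below is my own (find_duplicate_numbers is not an Optuna function); it runs on a study loaded from the same storage:

from collections import Counter

def find_duplicate_numbers(study):
    # Count how often each trial.number occurs and keep the numbers seen more than once.
    counts = Counter(trial.number for trial in study.trials)
    return {number: count for number, count in counts.items() if count > 1}

# On the output above this includes, for example, number 20 with a count of 5.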
Steps to reproduce
- Run parallel exploration on multiple nodes using a remote PostgreSQL storage.
- When multiple trials are generated within a very short time window (e.g. < 10 ms), they may be assigned the same number.
I ran the following script on 32 nodes as a tiny test.
import os

import optuna


def f(x):
    return (x - 3) ** 2


# Objective function
def optuna_objective(trial):
    x = trial.suggest_uniform("x", -10, 10)
    y = f(x)
    nodename = os.uname().nodename
    trial.set_user_attr("nodename", nodename)
    print("x=%8.3f, y=%8.3f" % (x, y))
    return y


# Entry point
def main():
    # Create a study
    study = optuna.create_study(
        study_name="test_optuna_collision_01",
        storage=os.environ["OPTUNA_SQL"],  # Set to a remote PostgreSQL server
        load_if_exists=True,
        direction="minimize"
    )

    # Run optimization
    study.optimize(optuna_objective, n_trials=20)

    # Visualize the result
    print(" # %8s %8s %8s %28s" % ("x", "y", "NODENAME", "STARTED"))
    for trial in study.trials:
        try:
            x = "%8.3f" % trial.params["x"]
            y = "%8.3f" % trial.value
            nodename = trial.user_attrs["nodename"]
            started = trial.datetime_start
        except (KeyError, AttributeError, TypeError):
            # TypeError: running/failed trials still have value == None
            x = ""
            y = ""
            nodename = ""
            started = ""
        print("%3d: %s %s %8s %28s" % (trial.number, x, y, nodename, started))


if __name__ == '__main__':
    main()
Additional context (optional)
I guess this is caused by a non-atomic operation in the RDB storage: Optuna assigns a number to a newly created trial by counting the existing trials in the storage, so two processes that create trials at nearly the same time can read the same count and end up with the same number.
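To make the suspected race concrete, here is a minimal sketch of a count-then-assign scheme. It illustrates the pattern I suspect, not Optuna's actual implementation; the numbers list, the lock, and the assign_number_* helpers are all hypothetical stand-ins:

import threading

numbers = []             # stands in for the rows of the trials table
lock = threading.Lock()  # stands in for a DB-side lock / serializable transaction

def assign_number_racy():
    # Non-atomic: read the count, then insert. Two workers interleaving here
    # both observe the same count and record the same trial number.
    n = len(numbers)
    numbers.append(n)
    return n

def assign_number_safe():
    # Atomic: the count and the insert happen under one lock, so each worker
    # sees the effect of the previous assignment.
    with lock:
        n = len(numbers)
        numbers.append(n)
        return n

In the RDB case the safe variant would correspond to computing the number inside the same transaction that inserts the trial row (or deriving it from a unique, monotonically increasing column) rather than issuing a separate count query.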
Top GitHub Comments
I tried #1490 (d7bdd63c711b5cff68ba663801cd842dbc365bcb) and #1498 (f51e87419c23726dabab2569a74e0739b7f39585) on a local server since my environment on the cloud servers is busy now. It worked perfectly!
Great. Again, thank you for reporting and quickly following up on these important bugs. Much appreciated. On a side note, we should try to catch these errors systematically. No clear date yet but we’re looking into it. 🙇