Duplicate trial numbers on distributed optimization

When I run multiple Optuna processes on cloud servers managed by a job scheduler, Optuna occasionally assigns duplicate trial numbers (trial.number) to trials that start at almost the same time.

Expected behavior

Each trial obtains a different number; as a result, the trial numbers within a study are sequential.

Environment

  • Optuna version: 1.5.0
  • Python version: 3.8.3
  • OS: A CentOS 7.7 container running on Singularity on a SUSE Linux Enterprise Server 12 SP4 host
  • Other software versions: PostgreSQL 9.5.21, psycopg 2.8.5

Error messages, stack traces, or logs

A simplified copy of the output from the test program listed below looks like this:

[I 2020-07-07 00:07:24,134] Using an existing study with name 'test_optuna_collision_01' instead of creating a new one.

......


  #         x         y  NODENAME                       STARTED
  0:   -7.260   105.276    r6i5n2    2020-07-06 23:42:14.603784
  1:    8.803    33.677    r6i5n2    2020-07-06 23:42:14.776819
  2:   -1.653    21.654    r6i5n2    2020-07-06 23:42:14.887652
  3:    0.024     8.856    r6i5n2    2020-07-06 23:42:14.993142
  4:    6.834    14.701    r6i5n2    2020-07-06 23:42:15.106251
  5:   -4.047    49.663    r6i5n2    2020-07-06 23:42:15.215284
  6:    7.026    16.205    r6i5n2    2020-07-06 23:42:15.325813
  7:   -1.137    17.117    r6i5n2    2020-07-06 23:42:15.432296
  8:    0.887     4.466    r6i5n2    2020-07-06 23:42:15.539958
  9:   -9.392   153.554    r6i5n2    2020-07-06 23:42:15.692091
 10:    2.665     0.112    r6i5n2    2020-07-06 23:42:15.836753
 11:    3.273     0.074    r6i5n2    2020-07-06 23:42:15.986390
 12:    3.699     0.489    r6i5n2    2020-07-06 23:42:16.100333
 13:    3.481     0.232    r6i5n2    2020-07-06 23:42:16.216604
 14:    3.461     0.213    r6i5n2    2020-07-06 23:42:16.331074
 15:    5.481     6.154    r6i5n2    2020-07-06 23:42:16.445370
 16:    1.499     2.253    r6i5n2    2020-07-06 23:42:16.560353
 17:   -3.628    43.933    r6i5n2    2020-07-06 23:42:16.674461
 18:    8.431    29.496    r6i5n2    2020-07-06 23:42:16.788654
 19:    5.046     4.185    r6i5n2    2020-07-06 23:42:16.948887
 20:    2.154     0.716    r3i5n6    2020-07-06 23:42:50.112204
 20:    2.096     0.817    r3i5n6    2020-07-06 23:42:50.112892
 20:    2.785     0.046    r2i2n5    2020-07-06 23:42:50.112536
 20:    2.213     0.620    r2i2n5    2020-07-06 23:42:50.113374
 20:    2.237     0.582    r3i5n6    2020-07-06 23:42:50.116946
 25:    2.402     0.358    r3i5n6    2020-07-06 23:42:50.150646
 25:    1.886     1.240    r3i5n6    2020-07-06 23:42:50.151986
 26:    2.022     0.957    r2i2n5    2020-07-06 23:42:50.157673
 28:    4.172     1.374    r2i2n5    2020-07-06 23:42:50.362914
 28:    4.986     3.944    r2i2n5    2020-07-06 23:42:50.364238
 28:    3.871     0.759    r3i5n6    2020-07-06 23:42:50.364441
 28:    4.373     1.885    r3i5n6    2020-07-06 23:42:50.365054
 28:    4.017     1.035    r3i5n6    2020-07-06 23:42:50.365227
 28:    4.861     3.462    r3i5n6    2020-07-06 23:42:50.366699
 28:    4.673     2.799    r3i5n6    2020-07-06 23:42:50.367405
 35:    4.054     1.110    r2i2n5    2020-07-06 23:42:50.373785
 36:    6.775    14.250    r2i2n5    2020-07-06 23:42:50.561707
 36:   -2.051    25.517    r2i2n5    2020-07-06 23:42:50.567540
 37:    6.959    15.676    r3i5n6    2020-07-06 23:42:50.569821
 38:    7.065    16.527    r3i5n6    2020-07-06 23:42:50.574569
 38:   -1.866    23.675    r3i5n6    2020-07-06 23:42:50.575655
 39:    7.005    16.040    r3i5n6    2020-07-06 23:42:50.576151
 39:    6.789    14.358    r3i5n6    2020-07-06 23:42:50.580927
 40:    7.048    16.385    r2i2n5    2020-07-06 23:42:50.581646
 44:    0.104     8.389    r2i2n5    2020-07-06 23:42:50.773439
 45:    2.705     0.087    r2i2n5    2020-07-06 23:42:50.782284
 46:    0.418     6.667    r3i5n6    2020-07-06 23:42:50.798938
 47:    2.871     0.017    r2i2n5    2020-07-06 23:42:50.823891
 47:    9.736    45.379    r3i5n6    2020-07-06 23:42:50.825717
 47:    3.280     0.079    r3i5n6    2020-07-06 23:42:50.826783
 47:    3.098     0.010    r3i5n6    2020-07-06 23:42:50.828179
 51:    3.010     0.000    r3i5n6    2020-07-06 23:42:50.836315
 52:    5.367     5.601    r2i2n5    2020-07-06 23:42:50.949296
 53:    5.591     6.714    r2i2n5    2020-07-06 23:42:50.958469
 54:    5.738     7.497    r3i5n6    2020-07-06 23:42:50.969918
 55:    9.213    38.595    r2i2n5    2020-07-06 23:42:51.028821
 56:   -1.009    16.075    r3i5n6    2020-07-06 23:42:51.046327
 56:   -0.770    14.213    r3i5n6    2020-07-06 23:42:51.047563
 56:   -0.897    15.188    r3i5n6    2020-07-06 23:42:51.050477
 56:    2.905     0.009    r3i5n6    2020-07-06 23:42:51.050697
 60:   -0.883    15.076    r2i2n5    2020-07-06 23:42:51.138237
 61:   -0.566    12.719    r2i2n5    2020-07-06 23:42:51.153265
 62:    1.074     3.708    r3i5n6    2020-07-06 23:42:51.162551
 63:   -0.529    12.457    r2i2n5    2020-07-06 23:42:51.204287
......

In this result, five trials got the number 20.

Steps to reproduce

  1. Run parallel exploration on multiple nodes using a remote PostgreSQL storage.
  2. When multiple trials are created within a very short time window (e.g. < 10 ms), they may receive the same number.

I ran the following script on 32 nodes as a tiny test.

import os
import optuna


def f(x):
    return (x - 3) ** 2


# Objective function
def optuna_objective(trial):
    x = trial.suggest_uniform("x", -10, 10)
    y = f(x)

    nodename = os.uname().nodename
    trial.set_user_attr("nodename", nodename)

    print("x=%8.3f, y=%8.3f" % (x, y))

    return y


# Entry point
def main():
    # Create a study
    study = optuna.create_study(
        study_name="test_optuna_collision_01",
        storage=os.environ["OPTUNA_SQL"],  # Set a remote PostgreSQL server
        load_if_exists=True,
        direction="minimize"
    )

    # Run optimization
    study.optimize(optuna_objective, n_trials=20)

    # Visualize the result
    print("  #  %8s  %8s  %8s  %28s" % ("x", "y", "NODENAME", "STARTED"))
    for trial in study.trials:
        try:
            x = "%8.3f" % trial.params["x"]
            y = "%8.3f" % trial.value
            nodename = trial.user_attrs["nodename"]
            started = trial.datetime_start
        except (KeyError, AttributeError, TypeError):  # incomplete trials may lack params or a value
            x = ""
            y = ""
            nodename = ""
            started = ""
        print("%3d: %s  %s  %8s  %28s" % (trial.number, x, y, nodename, started))


if __name__ == '__main__':
    main()
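
A roughly equivalent single-machine reproduction, sketched here under the same assumptions as the script above (a shared PostgreSQL server reachable through the OPTUNA_SQL environment variable; this sketch is not part of the original report), is to spawn several worker processes that all optimize the same study:

import multiprocessing
import os

import optuna


def objective(trial):
    x = trial.suggest_uniform("x", -10, 10)
    return (x - 3) ** 2


def worker():
    # Every worker attaches to the same study through the shared RDB storage.
    study = optuna.create_study(
        study_name="test_optuna_collision_01",
        storage=os.environ["OPTUNA_SQL"],  # assumed: same PostgreSQL URL as above
        load_if_exists=True,
        direction="minimize",
    )
    study.optimize(objective, n_trials=20)


if __name__ == "__main__":
    # Eight concurrent workers stand in for the 32 cluster nodes.
    workers = [multiprocessing.Process(target=worker) for _ in range(8)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()

Running enough workers concurrently should make the count-based number assignment collide in the same way as on the cluster.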

Additional context (optional)

I guess this is caused by a non-atomic operation in the RDB storage: Optuna assigns a number to a newly created trial by counting the existing trials in the storage.

https://github.com/optuna/optuna/blob/69ee3ae5477dc6526b5c62320e4ad0393674cfd5/optuna/storages/rdb/storage.py#L517
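
To make the suspected race concrete, here is a minimal sketch (a deliberately simplified stand-in, not Optuna's actual storage code) of why a count-then-insert assignment can hand out the same number to several trials:

# Simplified illustration of the suspected race (hypothetical, not Optuna's code).
# Each worker counts the existing trials and then inserts a new one with that
# count as its number. Without a lock or a serializable transaction, several
# workers can observe the same count and hand out the same number.
import threading
import time

trials = []  # stands in for the trials table in the RDB storage


def create_trial():
    number = len(trials)   # step 1: count existing trials
    time.sleep(0.01)       # other workers can run in this gap
    trials.append(number)  # step 2: insert the trial with that number


workers = [threading.Thread(target=create_trial) for _ in range(5)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(trials)  # typically [0, 0, 0, 0, 0]: duplicated numbers, like trial 20 above

In the sketch, every worker observes len(trials) == 0 before any append runs, so all five "trials" receive number 0; serializing the count-and-insert step (for instance with a lock, or by retrying on a uniqueness violation in the database) makes the numbers unique again.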


Top GitHub Comments

nandnor93 commented on Jul 9, 2020

I tried #1490 (d7bdd63c711b5cff68ba663801cd842dbc365bcb) and #1498 (f51e87419c23726dabab2569a74e0739b7f39585) on a local server, since my environment on the cloud servers is busy at the moment. They worked perfectly!

  #         x         y       val  NODENAME       PID                       STARTED
  0:   -1.403     1.749    33.438    server     10692    2020-07-09 21:34:31.171247
  1:    9.931     7.162   131.980    server     10687    2020-07-09 21:34:31.182138
  2:    2.225    -0.530     2.761    server     10690    2020-07-09 21:34:31.193335
  3:   -6.836     3.229   124.093    server     10688    2020-07-09 21:34:31.203505
  4:   -7.277    -5.566   118.334    server     10686    2020-07-09 21:34:31.212257
  5:    8.161     0.694    33.901    server     10689    2020-07-09 21:34:31.220494
  6:   -8.262     9.046   248.857    server     10685    2020-07-09 21:34:31.229522
  7:    7.340    -4.254    23.916    server     10691    2020-07-09 21:34:31.239973
  8:   -7.226     3.014   129.712    server     10692    2020-07-09 21:34:31.251409
  9:    0.521    -3.046     7.237    server     10687    2020-07-09 21:34:31.276398
 10:   -9.743    -5.919   177.745    server     10690    2020-07-09 21:34:31.285571
 11:    5.927    -0.840     9.915    server     10688    2020-07-09 21:34:31.296796
 12:    7.070     5.095    66.909    server     10686    2020-07-09 21:34:31.303912
 13:   -8.684    -8.801   182.764    server     10692    2020-07-09 21:34:31.311351
 14:   -8.534    -5.131   142.839    server     10689    2020-07-09 21:34:31.317931
 15:    2.250     7.157    84.415    server     10685    2020-07-09 21:34:31.324959
 16:    2.845    -9.716    59.559    server     10691    2020-07-09 21:34:31.333465
 17:    2.033    -1.733     1.006    server     10687    2020-07-09 21:34:31.353674
 18:    2.486    -2.678     0.724    server     10690    2020-07-09 21:34:31.361454
 19:    2.113    -1.741     0.853    server     10688    2020-07-09 21:34:31.368681
 20:   -0.904    -2.000    15.243    server     10692    2020-07-09 21:34:31.375975

......

 88:    2.377    -1.939     0.391   server     10865    2020-07-09 21:34:57.294263
 89:    2.708     1.576    12.874    server     10865    2020-07-09 21:34:57.347147

hvy commented on Jul 10, 2020

Great. Again, thank you for reporting and quickly following up on these important bugs. Much appreciated. On a side note, we should try to catch these errors systematically. No clear date yet, but we’re looking into it. 🙇
