
[Tune] Why is only one trial RUNNING while the other trials stay PENDING in each iteration? (question)

See original GitHub issue

System information

  • OS Platform and Distribution: Linux (Ubuntu 16.04)
  • Ray installed from: source
  • Ray version: 0.5.3
  • Python version: 3.6.6

Describe the problem

As stated in the title: during each iteration only one trial is RUNNING while all the other trials remain PENDING.

Source code / logs

Here is the related code:

import pandas as pd

def data_generator(train_X, train_y, batch_size):
    # one-hot encode the three column groups used as separate model inputs
    train_next, train_last, train_this = pd.get_dummies(train_X.iloc[:, 68:]), pd.get_dummies(
        train_X.iloc[:, 4:68]), pd.get_dummies(train_X.iloc[:, :4])
    batches = (train_X.shape[0] + batch_size - 1) // batch_size
    while True:
        for i in range(batches):
            X_train_1 = train_next.iloc[i * batch_size: (i + 1) * batch_size, :]
            X_train_2 = train_last.iloc[i * batch_size: (i + 1) * batch_size, :]
            X_train_3 = train_this.iloc[i * batch_size: (i + 1) * batch_size, :]
            Y_train = train_y.iloc[i * batch_size: (i + 1) * batch_size, :]
            yield [X_train_1, X_train_2, X_train_3], Y_train

from sklearn.model_selection import train_test_split
from keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint

def train_model_tune(config, reporter):
    # read data
    data = pd.read_csv(r'/home/zhouhao/Downloads/temp/X.csv', index_col=0)
    y = pd.read_csv(r'/home/zhouhao/Downloads/temp/y.csv', index_col=0)
    columns = data.columns.values

    # convert the selected columns to category dtype
    cat1 = columns[3]
    cat2 = columns[12:20]
    cat3 = columns[76:84]
    data.loc[:, cat1] = data.loc[:, cat1].astype('category')
    data.loc[:, cat2] = data.loc[:, cat2].astype('category')
    data.loc[:, cat3] = data.loc[:, cat3].astype('category')

    batch_size = config['batch_size']

    # split data
    train_X, test_X, train_y, test_y = train_test_split(
        data, y, test_size=0.05, random_state=2018)

    test_next, test_last, test_this = pd.get_dummies(test_X.iloc[:, 68:]), pd.get_dummies(
        test_X.iloc[:, 4:68]), pd.get_dummies(test_X.iloc[:, :4])

    # build the model (prediction_model is defined elsewhere in the project)
    model = prediction_model(config)

    # set callbacks
    tensorBoard = TensorBoard(log_dir=r'/home/zhouhao/Downloads/traffic_prediction/model',
                              histogram_freq=1, write_graph=True, write_images=True, write_grads=True)

    earlyStopping = EarlyStopping(
        monitor='val_loss', min_delta=0, patience=15, verbose=2, mode='auto')

    filepath = r"/home/zhouhao/Downloads/temp/model/weights-improvement-{epoch:02d}-{val_loss:.2f}.hdf5"
    check_point = ModelCheckpoint(
        filepath, monitor='val_loss', mode='auto', verbose=1, save_best_only=True)

    # train the model one batch at a time; the generator loops forever,
    # so the trainable keeps reporting results until the scheduler stops it
    for i, (x_batch, y_batch) in enumerate(data_generator(train_X, train_y, batch_size)):
        model.fit(x_batch, y_batch,
                  batch_size=batch_size,
                  verbose=1,
                  callbacks=[earlyStopping, tensorBoard, check_point],
                  validation_data=([test_next, test_last, test_this], test_y))
        if i % 5 == 0:
            last_checkpoint = "/home/zhouhao/Downloads/temp/model/traffic_predict_model_weights_{}.h5".format(i)
            model.save_weights(last_checkpoint)

        loss = model.evaluate(x_batch, y_batch)
        reporter(mean_loss=loss,  # Change me
                 timesteps_total=i,
                 checkpoint=last_checkpoint)
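The experiment spec that actually launches this trainable is not included in the issue, so the following is only a hedged sketch of how such a function-based trainable was typically launched on Ray 0.5.x (run_experiments with a trial_resources field, which later releases renamed); the experiment name and all values are placeholders. It shows where the per-trial resource request lives, because that request is what determines how many of the ten generated trials can be RUNNING at once:

    import ray
    from ray import tune
    from ray.tune import register_trainable

    ray.init()

    # Hypothetical experiment spec -- the one actually used was not posted.
    # On Ray 0.5.x the resource request lived under "trial_resources";
    # it decides how many trials fit on the 10-CPU / 2-GPU machine at once.
    register_trainable("train_model_tune", train_model_tune)

    tune.run_experiments({
        "traffic prediction": {
            "run": "train_model_tune",
            "trial_resources": {"cpu": 1, "gpu": 0.5},  # placeholder values
            "config": {"batch_size": 12800},            # placeholder search space
        }
    })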

Here is the training status:

/home/zhouhao/.conda/envs/py36/bin/python /home/zhouhao/PycharmProjects/ML_nn/traffic_prediction_dense_tune.py
Using TensorFlow backend.
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:21826 to respond…
Waiting for redis server at 127.0.0.1:58818 to respond…
Starting the Plasma object store with 26.00 GB memory.
Starting local scheduler with the following resources: {'CPU': 10, 'GPU': 2}.
Failed to start the UI, you may need to run 'pip install jupyter'.
== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 150.000: None | Iter 50.000: None
Bracket: Iter 150.000: None
Bracket:

tpe_transform took 0.018963 seconds
TPE using 0 trials
/home/zhouhao/.local/lib/python3.6/site-packages/ray/tune/logger.py:183: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  if np.issubdtype(value, float):
/home/zhouhao/.local/lib/python3.6/site-packages/ray/tune/logger.py:185: FutureWarning: Conversion of the second argument of issubdtype from int to np.signedinteger is deprecated. In future, it will be treated as np.int64 == np.dtype(int).type.
  if np.issubdtype(value, int):
tpe_transform took 0.016496 seconds
TPE using 1/1 trials with best loss inf
tpe_transform took 0.013365 seconds
TPE using 2/2 trials with best loss inf
tpe_transform took 0.012837 seconds
TPE using 3/3 trials with best loss inf
tpe_transform took 0.015328 seconds
TPE using 4/4 trials with best loss inf
tpe_transform took 0.014770 seconds
TPE using 5/5 trials with best loss inf
tpe_transform took 0.014560 seconds
TPE using 6/6 trials with best loss inf
tpe_transform took 0.014756 seconds
TPE using 7/7 trials with best loss inf
tpe_transform took 0.014358 seconds
TPE using 8/8 trials with best loss inf
tpe_transform took 0.014783 seconds
TPE using 9/9 trials with best loss inf
Created LogSyncer for /home/zhouhao/ray_results/traffic prediction/train_model_tune_1_e=12800,t=0.8,1=320,2=224,t=160,r=1e-06,t=0.4,1=288,2=256,t=160,t=0.7,1=288,2=160_2018-11-23_20-21-50zwji2gc5 ->

== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 150.000: None | Iter 50.000: None
Bracket: Iter 150.000: None
Bracket:
Resources requested: 10/10 CPUs, 2/2 GPUs
Result logdir: /home/zhouhao/ray_results/traffic prediction
PENDING trials:

  • train_model_tune_2_e=204800,t=0.4,1=224,2=128,t=160,r=0.0005,t=0.7,1=288,2=256,t=256,t=0.4,1=256,2=224: PENDING
  • train_model_tune_3_e=102400,t=0.6,1=416,2=224,t=256,r=0.005,t=0.7,1=480,2=128,t=128,t=0.6,1=352,2=128: PENDING
  • train_model_tune_4_e=409600,t=0.7,1=192,2=192,t=160,r=1e-05,t=0.6,1=416,2=160,t=192,t=0.6,1=480,2=192: PENDING
  • train_model_tune_5_e=204800,t=0.3,1=224,2=224,t=128,r=0.001,t=0.7,1=448,2=160,t=160,t=0.4,1=352,2=192: PENDING
  • train_model_tune_6_e=409600,t=0.8,1=384,2=192,t=224,r=0.005,t=0.4,1=288,2=160,t=256,t=0.8,1=480,2=224: PENDING
  • train_model_tune_7_e=102400,t=0.7,1=384,2=224,t=256,r=1e-05,t=0.4,1=352,2=128,t=224,t=0.6,1=416,2=192: PENDING
  • train_model_tune_8_e=25600,t=0.4,1=128,2=128,t=128,r=0.001,t=0.4,1=448,2=128,t=256,t=0.7,1=352,2=256: PENDING
  • train_model_tune_9_e=12800,t=0.4,1=512,2=192,t=160,r=0.005,t=0.8,1=352,2=224,t=160,t=0.6,1=256,2=128: PENDING
  • train_model_tune_10_e=12800,t=0.8,1=320,2=192,t=192,r=1e-05,t=0.5,1=512,2=256,t=192,t=0.3,1=512,2=160: PENDING

RUNNING trials:
  • train_model_tune_1_e=12800,t=0.8,1=320,2=224,t=160,r=1e-06,t=0.4,1=288,2=256,t=160,t=0.7,1=288,2=160: RUNNING

Using TensorFlow backend.
/home/zhouhao/.conda/envs/py36/lib/python3.6/site-packages/numpy/lib/arraysetops.py:472: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask |= (ar1 == a)

train_X:(4666873, 132)

Train on 12800 samples, validate on 245625 samples
2018-11-23 20:23:23.566946: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-23 20:23:23.760398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721 pciBusID: 0000:06:00.0 totalMemory: 10.91GiB freeMemory: 10.75GiB
2018-11-23 20:23:23.890120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721 pciBusID: 0000:05:00.0 totalMemory: 10.91GiB freeMemory: 10.62GiB
2018-11-23 20:23:23.892008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2018-11-23 20:23:24.233047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-23 20:23:24.233078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2018-11-23 20:23:24.233083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
2018-11-23 20:23:24.233086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
2018-11-23 20:23:24.233481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10398 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
2018-11-23 20:23:24.233725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10269 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
Epoch 1/1

12800/12800 [==============================] - 1s 89us/step - loss: 20758.0859 - val_loss: 6029.8121

Epoch 00001: val_loss improved from inf to 6029.81209, saving model to /home/zhouhao/Downloads/temp/model/weights-improvement-01-6029.81.hdf5

12800/12800 [==============================] - 1s 42us/step

Train on 12800 samples, validate on 245625 samples
Result for train_model_tune_1_e=12800,t=0.8,1=320,2=224,t=160,r=1e-06,t=0.4,1=288,2=256,t=160,t=0.7,1=288,2=160:
  checkpoint: /home/zhouhao/Downloads/temp/model/traffic_predict_model_weights_0.h5
  date: 2018-11-23_20-25-30
  done: false
  experiment_id: 18ce2f2a746f46a99e70b60fb430e0b0
  hostname: zhouhaoPC
  iterations_since_restore: 1
  mean_loss: 5680.058518676758
  neg_mean_loss: -5680.058518676758
  node_ip: 192.168.3.104
  pid: 721
  time_since_restore: 217.86471009254456
  time_this_iter_s: 217.86471009254456
  time_total_s: 217.86471009254456
  timestamp: 1542975930
  timesteps_since_restore: 0
  timesteps_this_iter: 0
  timesteps_total: 0
  training_iteration: 1

== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 150.000: None | Iter 50.000: None
Bracket: Iter 150.000: None
Bracket:
Resources requested: 10/10 CPUs, 2/2 GPUs
Result logdir: /home/zhouhao/ray_results/traffic prediction
PENDING trials:

  • train_model_tune_2_e=204800,t=0.4,1=224,2=128,t=160,r=0.0005,t=0.7,1=288,2=256,t=256,t=0.4,1=256,2=224: PENDING
  • train_model_tune_3_e=102400,t=0.6,1=416,2=224,t=256,r=0.005,t=0.7,1=480,2=128,t=128,t=0.6,1=352,2=128: PENDING
  • train_model_tune_4_e=409600,t=0.7,1=192,2=192,t=160,r=1e-05,t=0.6,1=416,2=160,t=192,t=0.6,1=480,2=192: PENDING
  • train_model_tune_5_e=204800,t=0.3,1=224,2=224,t=128,r=0.001,t=0.7,1=448,2=160,t=160,t=0.4,1=352,2=192: PENDING
  • train_model_tune_6_e=409600,t=0.8,1=384,2=192,t=224,r=0.005,t=0.4,1=288,2=160,t=256,t=0.8,1=480,2=224: PENDING
  • train_model_tune_7_e=102400,t=0.7,1=384,2=224,t=256,r=1e-05,t=0.4,1=352,2=128,t=224,t=0.6,1=416,2=192: PENDING
  • train_model_tune_8_e=25600,t=0.4,1=128,2=128,t=128,r=0.001,t=0.4,1=448,2=128,t=256,t=0.7,1=352,2=256: PENDING
  • train_model_tune_9_e=12800,t=0.4,1=512,2=192,t=160,r=0.005,t=0.8,1=352,2=224,t=160,t=0.6,1=256,2=128: PENDING
  • train_model_tune_10_e=12800,t=0.8,1=320,2=192,t=192,r=1e-05,t=0.5,1=512,2=256,t=192,t=0.3,1=512,2=160: PENDING

RUNNING trials:
  • train_model_tune_1_e=12800,t=0.8,1=320,2=224,t=160,r=1e-06,t=0.4,1=288,2=256,t=160,t=0.7,1=288,2=160: RUNNING [pid=721], 217 s, 1 iter, 0 ts, 5.68e+03 loss

Epoch 1/1

12800/12800 [==============================] - 1s 41us/step - loss: 23765.1250 - val_loss: 6017.1088

Epoch 00001: val_loss improved from 6029.81209 to 6017.10883, saving model to /home/zhouhao/Downloads/temp/model/weights-improvement-01-6017.11.hdf5

12800/12800 [==============================] - 1s 41us/step

Train on 12800 samples, validate on 245625 samples
Result for train_model_tune_1_e=12800,t=0.8,1=320,2=224,t=160,r=1e-06,t=0.4,1=288,2=256,t=160,t=0.7,1=288,2=160:
  checkpoint: /home/zhouhao/Downloads/temp/model/traffic_predict_model_weights_0.h5
  date: 2018-11-23_20-27-33
  done: false
  experiment_id: 18ce2f2a746f46a99e70b60fb430e0b0
  hostname: zhouhaoPC
  iterations_since_restore: 2
  mean_loss: 5819.795842285156
  neg_mean_loss: -5819.795842285156
  node_ip: 192.168.3.104
  pid: 721
  time_since_restore: 340.99321818351746
  time_this_iter_s: 123.1285080909729
  time_total_s: 340.99321818351746
  timestamp: 1542976053
  timesteps_since_restore: 1
  timesteps_this_iter: 1
  timesteps_total: 1
  training_iteration: 2

== Status ==
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 150.000: None | Iter 50.000: None
Bracket: Iter 150.000: None
Bracket:
Resources requested: 10/10 CPUs, 2/2 GPUs
Result logdir: /home/zhouhao/ray_results/traffic prediction
PENDING trials:

  • train_model_tune_2_e=204800,t=0.4,1=224,2=128,t=160,r=0.0005,t=0.7,1=288,2=256,t=256,t=0.4,1=256,2=224: PENDING
  • train_model_tune_3_e=102400,t=0.6,1=416,2=224,t=256,r=0.005,t=0.7,1=480,2=128,t=128,t=0.6,1=352,2=128: PENDING
  • train_model_tune_4_e=409600,t=0.7,1=192,2=192,t=160,r=1e-05,t=0.6,1=416,2=160,t=192,t=0.6,1=480,2=192: PENDING
  • train_model_tune_5_e=204800,t=0.3,1=224,2=224,t=128,r=0.001,t=0.7,1=448,2=160,t=160,t=0.4,1=352,2=192: PENDING
  • train_model_tune_6_e=409600,t=0.8,1=384,2=192,t=224,r=0.005,t=0.4,1=288,2=160,t=256,t=0.8,1=480,2=224: PENDING
  • train_model_tune_7_e=102400,t=0.7,1=384,2=224,t=256,r=1e-05,t=0.4,1=352,2=128,t=224,t=0.6,1=416,2=192: PENDING
  • train_model_tune_8_e=25600,t=0.4,1=128,2=128,t=128,r=0.001,t=0.4,1=448,2=128,t=256,t=0.7,1=352,2=256: PENDING
  • train_model_tune_9_e=12800,t=0.4,1=512,2=192,t=160,r=0.005,t=0.8,1=352,2=224,t=160,t=0.6,1=256,2=128: PENDING
  • train_model_tune_10_e=12800,t=0.8,1=320,2=192,t=192,r=1e-05,t=0.5,1=512,2=256,t=192,t=0.3,1=512,2=160: PENDING

RUNNING trials:
  • train_model_tune_1_e=12800,t=0.8,1=320,2=224,t=160,r=1e-06,t=0.4,1=288,2=256,t=160,t=0.7,1=288,2=160: RUNNING [pid=721], 340 s, 2 iter, 1 ts, 5.82e+03 loss

Epoch 1/1

12800/12800 [==============================] - 1s 42us/step - loss: 22130.4453 - val_loss: 6006.1089

Epoch 00001: val_loss improved from 6017.10883 to 6006.10893, saving model to /home/zhouhao/Downloads/temp/model/weights-improvement-01-6006.11.hdf5

12800/12800 [==============================] - 1s 42us/step

Train on 12800 samples, validate on 245625 samples
Result for train_model_tune_1_e=12800,t=0.8,1=320,2=224,t=160,r=1e-06,t=0.4,1=288,2=256,t=160,t=0.7,1=288,2=160:
  checkpoint: /home/zhouhao/Downloads/temp/model/traffic_predict_model_weights_0.h5
  date: 2018-11-23_20-29-36
  done: false
  experiment_id: 18ce2f2a746f46a99e70b60fb430e0b0
  hostname: zhouhaoPC
  iterations_since_restore: 3
  mean_loss: 6045.423405151367
  neg_mean_loss: -6045.423405151367
  node_ip: 192.168.3.104
  pid: 721
  time_since_restore: 464.17469453811646
  time_this_iter_s: 123.181476354599
  time_total_s: 464.17469453811646
  timestamp: 1542976176
  timesteps_since_restore: 2
  timesteps_this_iter: 1
  timesteps_total: 2
  training_iteration: 3


Any advice? Thank you.

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

7 reactions
guoxuxu commented, Jul 27, 2020

How do you set the number of trials running in parallel? https://docs.ray.io/en/master/tune/api_docs/execution.html Setting gpu=0.5 can start 2 trials and setting gpu=0.1 can start 10 trials, but what if I want to start 20 trials? Setting gpu=0.05 does not work. Besides, setting gpu to a fractional number is unclear to me… Is there an explicit instruction on how to set the number of parallel trials, specifically more than 10?
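A note not from the thread itself: the degree of parallelism is bounded by every resource a trial requests, not only the GPU fraction. On a machine like the one in this issue (10 CPUs, 2 GPUs), each trial also requests 1 CPU by default, so even gpu=0.05 still caps out at 10 concurrent trials; to go past that, the CPU request has to shrink as well. A minimal sketch using the resources_per_trial argument described on the linked docs page (train_model_tune and the numbers are placeholders):

    from ray import tune

    # Concurrency = min over resource types of (available / requested per trial).
    # With 10 CPUs and 2 GPUs:
    #   cpu=1.0, gpu=0.1 -> min(10 / 1.0, 2 / 0.1) = 10 trials in parallel
    #   cpu=0.5, gpu=0.1 -> min(10 / 0.5, 2 / 0.1) = 20 trials in parallel
    tune.run(
        train_model_tune,
        num_samples=20,
        resources_per_trial={"cpu": 0.5, "gpu": 0.1},
    )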

1 reaction
richardliaw commented, Nov 24, 2018

extra_cpus is actually assigned to the same trial, rather than indicating the total number of resources allocated to the experiment.

If you turn off extra_cpus and set gpu: 0.5 or something, you should see multiple runs in parallel.

Let me know if that works.
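To make the suggestion concrete, here is a hedged before/after sketch of the trial resource request. The original spec is not shown in the issue, and the key names follow the Ray 0.5.x config format, so treat the exact fields and numbers as assumptions rather than the poster's actual settings:

    # Before (hypothetical): one trial requests essentially the whole machine,
    # so the remaining nine trials have nothing to run on and stay PENDING.
    trial_resources = {"cpu": 1, "extra_cpu": 9, "gpu": 2}

    # After: drop extra_cpu and request a fraction of a GPU per trial, so
    # several trials fit on the 10-CPU / 2-GPU machine at the same time.
    trial_resources = {"cpu": 1, "gpu": 0.5}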

Top Results From Across the Web

How does Tune work? — Ray 2.2.0
RUNNING: A running trial is assigned a Ray Actor. There can be multiple running trials in parallel. See the trainable execution section...
ray - What is the way to make Tune run parallel trials across ...
If you set it to 4, each trial will require 4 GPUs, i.e. only 1 trial can run at the same time. This...
