
How to use multiple GPUs?

See original GitHub issue

Feature Description

I want to use a single machine with multiple GPUs for training, but it seems to have no actual effect.

Code Example

with strategy.scope():
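
For context, the snippet above refers to the standard TensorFlow pattern of building the model inside a MirroredStrategy scope. A minimal sketch of that pattern (the layers and names here are purely illustrative, not from the original issue):

import tensorflow as tf

# MirroredStrategy replicates the model on every GPU visible on this machine.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Any model created inside the scope has its variables mirrored across GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])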

Reason

Speed up the calculation of toxins

Solution

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

2 reactions
AnSmithD commented, Aug 3, 2020

Hello, I've got the same issue. I am specifying 4 GPUs (out of 8) to train the current model in a distributed fashion, using tf.distribute.MirroredStrategy(), since tf.keras.utils.multi_gpu_model() was deprecated and removed in April 2020.

When doing:

def make_model(ckpt_path, max_try=1):
    # Keep the best checkpoint by validation accuracy and stop early when val_loss stalls.
    callbacks = [
        keras.callbacks.ModelCheckpoint(
            filepath=ckpt_path + '/bestMod.hdf5', verbose=1,
            monitor="val_accuracy", mode='max', save_best_only=True),
        keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, mode='auto')]

    AutoClassifier = ak.ImageClassifier(directory="fruits/",
                                        project_name="fruits_AutoKeras",
                                        max_trials=max_try,
                                        metrics=['accuracy', 'top_k_categorical_accuracy'],
                                        seed=1245,
                                        overwrite=True)

    # data_train and vali_data are the training/validation datasets defined elsewhere.
    model = AutoClassifier.fit(
        data_train,
        callbacks=callbacks,
        validation_data=vali_data,
        verbose=2)

    return model


def run_search(ckpt_path, max_try=1):
    # Build and fit the classifier inside the MirroredStrategy scope.
    strat = tf.distribute.MirroredStrategy()
    with strat.scope():
        model = make_model(ckpt_path, max_try)

    return model


# checkpoint is the checkpoint directory path defined elsewhere.
run_search(checkpoint, max_try=3)

only a single GPU is doing all the computations; the other three remain idle. When following @FontTian and inserting distribution_strategy=strat into the initialisation of the image classifier, the same error (RuntimeError: Too many failed attempts to build model.) occurs. The same happens when adding tuner='random' to ak.ImageClassifier.
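
For reference, "inserting distribution_strategy=strat into the initialisation" means something like the following sketch (based on the make_model above; arguments other than distribution_strategy are illustrative, and exact behaviour may differ between AutoKeras versions):

import autokeras as ak
import tensorflow as tf

strat = tf.distribute.MirroredStrategy()

# Same classifier as in make_model above, but the strategy is handed to AutoKeras
# directly instead of (or in addition to) wrapping the call in strat.scope().
AutoClassifier = ak.ImageClassifier(directory="fruits/",
                                    project_name="fruits_AutoKeras",
                                    max_trials=3,
                                    overwrite=True,
                                    distribution_strategy=strat)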

As suggested by @haifeng-jin, I ran a basic KerasTuner example on 4 GPUs, which worked just fine. Furthermore, in https://github.com/keras-team/autokeras/issues/440#issuecomment-592160313 I read that calling clear_session() before every run might wipe out the GPU configuration. Removing this line from the code did not change anything with respect to the errors/problems stated above.
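
A minimal KerasTuner check along the lines of the one mentioned above might look like this (a sketch, assuming the keras_tuner package and its distribution_strategy argument; the model and data are placeholders, not from the original issue):

import numpy as np
import tensorflow as tf
import keras_tuner as kt

def build_model(hp):
    # Tune a single hyperparameter just to exercise the tuner.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int("units", 32, 128, step=32), activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=2,
    # The tuner builds and trains every trial inside this strategy's scope.
    distribution_strategy=tf.distribute.MirroredStrategy(),
    directory="kt_multi_gpu_check",
    overwrite=True,
)

x = np.random.rand(256, 8)
y = (np.random.rand(256) > 0.5).astype("float32")
tuner.search(x, y, epochs=2, validation_split=0.2)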

Thanks in advance

1 reaction
Gorbov commented, Jul 27, 2021

@haifeng-jin, may I ask you to help with multi-GPU? Asking here so as not to create a new topic…

I get an error:

Epoch 1/1000
2021-07-27 16:15:38.726497: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-07-27 16:15:39.515468: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
44/44 [==============================] - ETA: 0s - loss: 0.9194 - accuracy: 0.5980
2021-07-27 16:15:44.049919: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:695] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
op: "TensorSliceDataset"
input: "Placeholder/_0"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_DOUBLE
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 6558
        }
      }
    }
  }
}

To fix this, I need to set the auto-shard policy (https://www.tensorflow.org/api_docs/python/tf/data/experimental/AutoShardPolicy):

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA

And how do I pass this option to StructuredDataClassifier?

model = ak.StructuredDataClassifier(max_trials=params['max_trials'],
                                    project_name=model_name + ext_type,
                                    directory='data/models_saved_data/',
                                    distribution_strategy=tf.distribute.MirroredStrategy())
model.fit(x_train, y_train,
          epochs=params['epochs'],
          batch_size=32)
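
One possible way to attach the option (a sketch, assuming x_train and y_train are NumPy arrays and that this AutoKeras version accepts a tf.data.Dataset in fit(); params and model are the objects from the snippet above):

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA

# Wrap the arrays in a dataset so the sharding option can be attached to it,
# then pass the dataset to fit() instead of the raw arrays.
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)
train_ds = train_ds.with_options(options)

model.fit(train_ds, epochs=params['epochs'])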

Top Results From Across the Web

To run NVIDIA Multi-GPU
From the NVIDIA Control Panel navigation tree pane, under 3D Settings, select Set Multi-GPU configuration to open the associated page. · Under Select...

PyTorch Multi GPU: 3 Techniques Explained
Learn how to accelerate deep learning tensor computations with 3 multi-GPU techniques: data parallelism, distributed data parallelism and model parallelism.

Why and How to Use Multiple GPUs for Distributed Training
Buying multiple GPUs can be an expensive investment but is much faster than other options. CPUs are also expensive and cannot scale like...

Efficient Training on Multiple GPUs
When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory we use a...

Multi-GPU Examples
Data Parallelism is implemented using torch.nn.DataParallel. One can wrap a Module in DataParallel and it will be parallelized over multiple GPUs in...
