
How to use multiple GPUs?

See original GitHub issue

Feature Description

I want to use a single machine with multiple GPUs for training, but it seems to have no actual effect.

Code Example

with strategy.scope():
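
For context, the snippet above refers to the standard TensorFlow pattern of building the model inside a MirroredStrategy scope. A minimal sketch of that pattern (the layers and names here are purely illustrative, not from the original issue):

import tensorflow as tf

# MirroredStrategy replicates the model on every GPU visible on this machine.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Any model created inside the scope has its variables mirrored across GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])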

Reason

Speed up the calculation of toxins

Solution

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

2 reactions
AnSmithD commented, Aug 3, 2020

Hello, I've got the same issue. I am specifying 4 GPUs (out of 8) to train the current model in a distributed fashion, using tf.distribute.MirroredStrategy(), since tf.keras.utils.multi_gpu_model() was deprecated and removed in April 2020.

When doing:

def make_model(ckpt_path, max_try=1):
    # Keep the best checkpoint by validation accuracy and stop early when val_loss stalls.
    callbacks = [
        keras.callbacks.ModelCheckpoint(
            filepath=ckpt_path + '/bestMod.hdf5', verbose=1,
            monitor="val_accuracy", mode='max', save_best_only=True),
        keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, mode='auto')]

    AutoClassifier = ak.ImageClassifier(directory="fruits/",
                                        project_name="fruits_AutoKeras",
                                        max_trials=max_try,
                                        metrics=['accuracy', 'top_k_categorical_accuracy'],
                                        seed=1245,
                                        overwrite=True)

    # data_train and vali_data are the training/validation datasets defined elsewhere.
    model = AutoClassifier.fit(
        data_train,
        callbacks=callbacks,
        validation_data=vali_data,
        verbose=2)

    return model


def run_search(ckpt_path, max_try=1):
    # Build and fit the classifier inside the MirroredStrategy scope.
    strat = tf.distribute.MirroredStrategy()
    with strat.scope():
        model = make_model(ckpt_path, max_try)

    return model


# checkpoint is the checkpoint directory path defined elsewhere.
run_search(checkpoint, max_try=3)

only a single GPU is doing all the computations; the other three remain idle. When following @FontTian and inserting distribution_strategy=strat into the initialisation of the image classifier, the same error (RuntimeError: Too many failed attempts to build model.) occurs. The same happens when adding tuner='random' to ak.ImageClassifier.
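
For reference, "inserting distribution_strategy=strat into the initialisation" means something like the following sketch (based on the make_model above; arguments other than distribution_strategy are illustrative, and exact behaviour may differ between AutoKeras versions):

import autokeras as ak
import tensorflow as tf

strat = tf.distribute.MirroredStrategy()

# Same classifier as in make_model above, but the strategy is handed to AutoKeras
# directly instead of (or in addition to) wrapping the call in strat.scope().
AutoClassifier = ak.ImageClassifier(directory="fruits/",
                                    project_name="fruits_AutoKeras",
                                    max_trials=3,
                                    overwrite=True,
                                    distribution_strategy=strat)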

As suggested by @haifeng-jin, I ran a basic KerasTuner example on 4 GPUs, which worked just fine. Furthermore, in https://github.com/keras-team/autokeras/issues/440#issuecomment-592160313 I read that calling clear_session() before every run might wipe out the GPU configuration. Removing this line from the code did not change anything with respect to the errors/problems stated above.
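
A minimal KerasTuner check along the lines of the one mentioned above might look like this (a sketch, assuming the keras_tuner package and its distribution_strategy argument; the model and data are placeholders, not from the original issue):

import numpy as np
import tensorflow as tf
import keras_tuner as kt

def build_model(hp):
    # Tune a single hyperparameter just to exercise the tuner.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int("units", 32, 128, step=32), activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=2,
    # The tuner builds and trains every trial inside this strategy's scope.
    distribution_strategy=tf.distribute.MirroredStrategy(),
    directory="kt_multi_gpu_check",
    overwrite=True,
)

x = np.random.rand(256, 8)
y = (np.random.rand(256) > 0.5).astype("float32")
tuner.search(x, y, epochs=2, validation_split=0.2)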

Thanks in advance

1 reaction
Gorbov commented, Jul 27, 2021

@haifeng-jin, may I ask you to help with multi-GPU? Asking here so as not to create a new topic…

I get an error:

Epoch 1/1000
2021-07-27 16:15:38.726497: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-07-27 16:15:39.515468: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
44/44 [==============================] - ETA: 0s - loss: 0.9194 - accuracy: 0.5980
2021-07-27 16:15:44.049919: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:695] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_1"
op: "TensorSliceDataset"
input: "Placeholder/_0"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_DOUBLE
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 6558
        }
      }
    }
  }
}

To fix this, I need to set the auto-shard policy (https://www.tensorflow.org/api_docs/python/tf/data/experimental/AutoShardPolicy):

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA

And how do I pass this option to StructuredDataClassifier?

model = ak.StructuredDataClassifier(max_trials=params['max_trials'],
                                    project_name=model_name + ext_type,
                                    directory='data/models_saved_data/',
                                    distribution_strategy=tf.distribute.MirroredStrategy())
model.fit(x_train, y_train,
          epochs=params['epochs'],
          batch_size=32)
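
One possible way to attach the option (a sketch, assuming x_train and y_train are NumPy arrays and that this AutoKeras version accepts a tf.data.Dataset in fit(); params and model are the objects from the snippet above):

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA

# Wrap the arrays in a dataset so the sharding option can be attached to it,
# then pass the dataset to fit() instead of the raw arrays.
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(32)
train_ds = train_ds.with_options(options)

model.fit(train_ds, epochs=params['epochs'])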

Top Results From Across the Web

To run NVIDIA Multi-GPU
From the NVIDIA Control Panel navigation tree pane, under 3D Settings, select Set Multi-GPU configuration to open the associated page. · Under Select...

PyTorch Multi GPU: 3 Techniques Explained
Learn how to accelerate deep learning tensor computations with 3 multi-GPU techniques: data parallelism, distributed data parallelism and model parallelism.

Why and How to Use Multiple GPUs for Distributed Training
Buying multiple GPUs can be an expensive investment but is much faster than other options. CPUs are also expensive and cannot scale like...

Efficient Training on Multiple GPUs
When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory we use a...

Multi-GPU Examples
Data Parallelism is implemented using torch.nn.DataParallel. One can wrap a Module in DataParallel and it will be parallelized over multiple GPUs in...
