
[BUG] qml.qnn.TorchLayer breaks multi GPU usage


Expected behavior

Setup: 4 Nvidia GPUs

Expected return:

Let's use 4 GPUs!
Average loss over epoch 1: 0.4803
Average loss over epoch 2: 0.3553
Accuracy: 78.0%

Actual behavior

Actual return:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0! (when checking argument for argument mat2 in method wrapper__bmm)

Additional information

The error occurs every time. The pair of GPUs named in the message may vary between runs, for example:

RuntimeError: Expected all tensors …cuda:0 and cuda:1… or RuntimeError: Expected all tensors …cuda:3 and cuda:2…

Source code

# Copied and pasted from: https://pennylane.ai/qml/demos/tutorial_qnn_module_torch.html

import torch
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

X, y = make_moons(n_samples=200, noise=0.1)
y_ = torch.unsqueeze(torch.tensor(y), 1)  # used for one-hot encoded labels
y_hot = torch.scatter(torch.zeros((200, 2)), 1, y_, 1)

c = ["#1f77b4" if y_ == 0 else "#ff7f0e" for y_ in y]  # colours for each class
# Removing the plot since we don't need it for now
#plt.axis("off")
#plt.scatter(X[:, 0], X[:, 1], c=c)
#plt.show()

import pennylane as qml

n_qubits = 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def qnode(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(wires=i)) for i in range(n_qubits)]

n_layers = 6
weight_shapes = {"weights": (n_layers, n_qubits)}

qlayer = qml.qnn.TorchLayer(qnode, weight_shapes)

clayer_1 = torch.nn.Linear(2, 2)
clayer_2 = torch.nn.Linear(2, 2)
softmax = torch.nn.Softmax(dim=1)
layers = [clayer_1, qlayer, clayer_2, softmax]
#layers = [clayer_1, clayer_2, softmax]
model = torch.nn.Sequential(*layers)

if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  #dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
  model = torch.nn.DataParallel(model)
  
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

opt = torch.optim.SGD(model.parameters(), lr=0.2)
loss = torch.nn.L1Loss()

X = torch.tensor(X, requires_grad=True).float()
y_hot = y_hot.float()

batch_size = 5
batches = 200 // batch_size

data_loader = torch.utils.data.DataLoader(
    list(zip(X, y_hot)), batch_size=5, shuffle=True, drop_last=True
)

epochs = 2

for epoch in range(epochs):

    running_loss = 0

    for xs, ys in data_loader:
        opt.zero_grad()

        # Moving data to device as needed
        xs = xs.to(device)
        ys = ys.to(device)

        loss_evaluated = loss(model(xs), ys)
        loss_evaluated.backward()

        opt.step()

        running_loss += loss_evaluated

    avg_loss = running_loss / batches
    print("Average loss over epoch {}: {:.4f}".format(epoch + 1, avg_loss))

# Moving data to device as needed
X = X.to(device)

y_pred = model(X)
#predictions = torch.argmax(y_pred, axis=1).detach().numpy()
predictions = torch.argmax(y_pred, axis=1).detach().cpu().numpy()

correct = [1 if p == p_true else 0 for p, p_true in zip(predictions, y)]
accuracy = sum(correct) / len(correct)
print(f"Accuracy: {accuracy * 100}%")

Tracebacks

Let's use 4 GPUs!
Traceback (most recent call last):
  File "/home/pennylane_error_gpu_no_plot.py", line 111, in <module>
    loss_evaluated = loss(model(xs), ys)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/qnn/torch.py", line 277, in forward
    reconstructor.append(self.forward(x))
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/qnn/torch.py", line 281, in forward
    return self._evaluate_qnode(inputs)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/qnn/torch.py", line 296, in _evaluate_qnode
    return self.qnode(**kwargs).type(x.dtype)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/qnode.py", line 560, in __call__
    res = qml.execute(
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/interfaces/batch/__init__.py", line 342, in execute
    cache_execute(batch_execute, cache, return_tuple=False, expand_fn=expand_fn)(tapes)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/interfaces/batch/__init__.py", line 173, in wrapper
    res = fn(execution_tapes.values(), **kwargs)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/interfaces/batch/__init__.py", line 125, in fn
    return original_fn(tapes, **kwargs)
  File "/home/miniconda3/envs/py9/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/_qubit_device.py", line 289, in batch_execute
    res = self.execute(circuit)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/devices/default_qubit_torch.py", line 233, in execute
    return super().execute(circuit, **kwargs)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/_qubit_device.py", line 201, in execute
    self.apply(circuit.operations, rotations=circuit.diagonalizing_gates, **kwargs)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/devices/default_qubit.py", line 216, in apply
    self._state = self._apply_operation(self._state, operation)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/devices/default_qubit.py", line 247, in _apply_operation
    return self._apply_unitary_einsum(state, matrix, wires)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/pennylane/devices/default_qubit.py", line 752, in _apply_unitary_einsum
    return self._einsum(einsum_indices, mat, state)
  File "/home/miniconda3/envs/py9/lib/python3.9/site-packages/torch/functional.py", line 327, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper__bmm)

System information

>>> import pennylane as qml; qml.about()
Name: PennyLane
Version: 0.21.0
Summary: PennyLane is a Python quantum machine learning library by Xanadu Inc.
Home-page: https://github.com/XanaduAI/pennylane
Author: 
Author-email: 
License: Apache License 2.0
Location: /home/miniconda3/envs/py9/lib/python3.9/site-packages
Requires: autoray, retworkx, cachetools, semantic-version, scipy, pennylane-lightning, networkx, numpy, toml, appdirs, autograd
Required-by: PennyLane-Lightning
Platform info:           Linux-4.18.0-348.7.1.el8_5.x86_64-x86_64-with-glibc2.28
Python version:          3.9.7
Numpy version:           1.22.2
Scipy version:           1.8.0
Installed devices:
- default.gaussian (PennyLane-0.21.0)
- default.mixed (PennyLane-0.21.0)
- default.qubit (PennyLane-0.21.0)
- default.qubit.autograd (PennyLane-0.21.0)
- default.qubit.jax (PennyLane-0.21.0)
- default.qubit.tf (PennyLane-0.21.0)
- default.qubit.torch (PennyLane-0.21.0)
- lightning.qubit (PennyLane-Lightning-0.21.0)

Existing GitHub issues

  • I have searched existing GitHub issues to make sure the issue does not already exist.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
antalszava commented, Feb 18, 2022

Hi @dominicpasquali, we took a closer look at this, and it looks like a fix could take time, as it would require explicit support for torch.nn.DataParallel.

The issue is related to the fact that torch.nn.DataParallel will attempt to access the state in the PennyLane device from multiple GPUs in parallel. When using default.qubit with diff_method="backprop", we are using the native Torch device default.qubit.torch internally. This device assumes that device executions happen sequentially.

The execute method of the device handles the transition between Torch devices, if necessary. It infers which Torch device to use by checking which Torch device the gate parameters are on.

The steps in execute can be summarized as:

  1. We gather the operations and observables in the circuit;
  2. We check: did the user specify the Torch device explicitly? a) If not, we check which Torch device the state vector is on and, if need be, move it to the Torch device where the gate parameters are; b) If yes, we warn the user in case they are mixing Torch devices;
  3. We execute the circuit.

The error that we see comes from the fact that step 2.a) may be executed more than once, by different replicas, before step 3 is executed once.
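
To make the interleaving concrete, here is a minimal, PennyLane-free sketch of the same class of race (the FakeSimulator class and every name in it are invented for illustration): four plain Python threads, mirroring the one-thread-per-replica model that DataParallel's parallel_apply uses, share a single mutable "state device".

import threading
import time

class FakeSimulator:
    """Toy stand-in for the shared simulator: one mutable state device for all replicas."""

    def __init__(self):
        self.state_device = "cpu"  # shared across every replica thread

    def execute(self, param_device):
        # step 2.a): move the "state" to the Torch device the gate parameters are on
        if self.state_device != param_device:
            self.state_device = param_device
        time.sleep(0.001)  # widen the gap between steps 2.a) and 3) so the race is easy to hit
        # step 3): apply the gates; if another replica moved the state in the
        # meantime, the two devices no longer match
        if self.state_device != param_device:
            raise RuntimeError(
                "Expected all tensors to be on the same device, but found "
                f"{self.state_device} and {param_device}"
            )

sim = FakeSimulator()

def replica(param_device):
    try:
        for _ in range(50):
            sim.execute(param_device)
    except RuntimeError as exc:
        print(f"{param_device}: {exc}")

threads = [threading.Thread(target=replica, args=(f"cuda:{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()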

Basic logging was added by modifying step 2.a) as follows:

            if self._state.device != self._torch_device:
                print("Changing the state device from:", self._state.device)
                print("Changing the state device to:", self._torch_device)
                self._state = self._state.to(self._torch_device)

And adding a try-except block around the execution:

        try:
            sup = super().execute(circuit, **kwargs)
        except RuntimeError:
            print("Error at op: ", ops_and_obs)
            raise  # re-raise so the original failure still propagates after logging
        return sup

The raw log is:

Let's use 4 GPUs!
Changing the state device from:  cpu
Changing the state device to:  cuda:0
Changing the state device from:  cuda:0
Changing the state device to:  cuda:1
Changing the state device from:  cuda:1
Changing the state device to:  cuda:2
Error at op:  [RX(tensor(0.6368, device='cuda:0', grad_fn=<SelectBackward0>), wires=[0]), RX(tensor(0.1965, device='cuda:0', grad_fn=<SelectBackward0>), wires=[1]), RX(tensor(5.5435, device='cuda:0', grad_fn=<SelectBackward0>), wires=[0]), RX(tensor(5.7491, device='cuda:0', grad_fn=<SelectBackward0>), wires=[1]), CNOT(wires=[0, 1]), RX(tensor(2.4056, device='cuda:0', grad_fn=<SelectBackward0>), wires=[0]), RX(tensor(6.0275, device='cuda:0', grad_fn=<SelectBackward0>), wires=[1]), CNOT(wires=[0, 1]), RX(tensor(2.4533, device='cuda:0', grad_fn=<SelectBackward0>), wires=[0]), RX(tensor(3.7755, device='cuda:0', grad_fn=<SelectBackward0>), wires=[1]), CNOT(wires=[0, 1]), RX(tensor(1.6121, device='cuda:0', grad_fn=<SelectBackward0>), wires=[0]), RX(tensor(4.9866, device='cuda:0', grad_fn=<SelectBackward0>), wires=[1]), CNOT(wires=[0, 1]), RX(tensor(5.9110, device='cuda:0', grad_fn=<SelectBackward0>), wires=[0]), RX(tensor(0.8368, device='cuda:0', grad_fn=<SelectBackward0>), wires=[1]), CNOT(wires=[0, 1]), RX(tensor(5.8723, device='cuda:0', grad_fn=<SelectBackward0>), wires=[0]), RX(tensor(3.7296, device='cuda:0', grad_fn=<SelectBackward0>), wires=[1]), CNOT(wires=[0, 1]), expval(PauliZ(wires=[0])), expval(PauliZ(wires=[1]))]
Changing the state device from:  cuda:2
Changing the state device to:  cuda:1

What this seems to tell us is that the state has already been moved to cuda:2 by the time super().execute(circuit, **kwargs) is called with parameters on cuda:0.

Potential solutions could include:

  • Copying the underlying state, or even the device itself, so that each replica has its own state vector (at the expense of keeping multiple copies in memory);
  • Creating a lock so that the logic in execute can only be run by one GPU at a time (a rough sketch follows below), although this does seem to defeat the purpose of parallelization.
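
A rough, untested sketch of the lock idea (hypothetical; it serializes the whole QNode evaluation by overriding the private _evaluate_qnode hook that appears in the traceback above, so it may break between PennyLane versions and gives up the parallel speed-up):

import threading

import pennylane as qml

_execute_lock = threading.Lock()

class LockedTorchLayer(qml.qnn.TorchLayer):
    """TorchLayer whose QNode evaluations are serialized across DataParallel replicas."""

    def _evaluate_qnode(self, inputs):
        # Only one replica thread may drive the shared simulator state at a time,
        # so the state cannot be moved to another GPU mid-execution.
        with _execute_lock:
            return super()._evaluate_qnode(inputs)

# Drop-in replacement in the reproduction script above:
# qlayer = LockedTorchLayer(qnode, weight_shapes)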

Could torch.nn.parallel.DistributedDataParallel potentially be a solution? PyTorch seems to recommend it over DataParallel.

0 reactions
dominicpasquali commented, May 3, 2022

I can confirm that DistributedDataParallel works!
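
For reference, a rough, untested sketch of how the reproduction script above might be rewritten around DistributedDataParallel (one process per GPU via torch.multiprocessing.spawn; the nccl backend, the MASTER_ADDR/MASTER_PORT values, and the loss reporting are assumptions, not taken from the issue):

import os

import numpy as np
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from sklearn.datasets import make_moons
from torch.nn.parallel import DistributedDataParallel as DDP

import pennylane as qml

n_qubits = 2
n_layers = 6

def build_model():
    dev = qml.device("default.qubit", wires=n_qubits)

    @qml.qnode(dev)
    def qnode(inputs, weights):
        qml.AngleEmbedding(inputs, wires=range(n_qubits))
        qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
        return [qml.expval(qml.PauliZ(wires=i)) for i in range(n_qubits)]

    qlayer = qml.qnn.TorchLayer(qnode, {"weights": (n_layers, n_qubits)})
    return torch.nn.Sequential(
        torch.nn.Linear(2, 2), qlayer, torch.nn.Linear(2, 2), torch.nn.Softmax(dim=1)
    )

def train(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Each process rebuilds the same data and its own model replica (and thus
    # its own simulator state), which is what avoids the shared-state race.
    torch.manual_seed(42)
    np.random.seed(42)
    X, y = make_moons(n_samples=200, noise=0.1)
    y_hot = torch.scatter(torch.zeros((200, 2)), 1, torch.unsqueeze(torch.tensor(y), 1), 1)
    dataset = list(zip(torch.tensor(X).float(), y_hot.float()))

    model = DDP(build_model().to(rank), device_ids=[rank])
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=5, sampler=sampler, drop_last=True)

    opt = torch.optim.SGD(model.parameters(), lr=0.2)
    loss_fn = torch.nn.L1Loss()

    for epoch in range(2):
        sampler.set_epoch(epoch)
        running_loss = 0.0
        for xs, ys in loader:
            xs, ys = xs.to(rank), ys.to(rank)
            opt.zero_grad()
            loss_val = loss_fn(model(xs), ys)
            loss_val.backward()  # DDP all-reduces the gradients across ranks here
            opt.step()
            running_loss += loss_val.item()
        if rank == 0:
            print(f"Average loss over epoch {epoch + 1}: {running_loss / len(loader):.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)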
