RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
Hi experts, I get this RuntimeError while training a GCN. I tried decreasing the learning rate, as some older threads suggested, but it did not help. Does anyone know what the root cause might be? I would also like to know whether it is possible to set a batch_size when loading my data through the Data class.
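(On the batch_size question: as far as I can tell, PyG's Data constructor itself takes no batch_size; batching is normally done by wrapping a list of Data objects in torch_geometric.data.DataLoader, which existed at that path in PyG 1.3.x. A minimal sketch under that assumption, with two made-up toy graphs; for a single large graph like the one in this issue, neighbor sampling rather than DataLoader is the usual route, but that is a separate topic:)

import torch
from torch_geometric.data import Data, DataLoader

### Two toy graphs; the loader merges each mini-batch into one big
### disconnected graph, with a `batch` vector mapping nodes to graphs.
graphs = [
    Data(x=torch.randn(4, 17),
         edge_index=torch.tensor([[0, 1], [1, 2]]),
         y=torch.zeros(4, dtype=torch.long)),
    Data(x=torch.randn(3, 17),
         edge_index=torch.tensor([[0, 2], [1, 0]]),
         y=torch.ones(3, dtype=torch.long)),
]
loader = DataLoader(graphs, batch_size=2, shuffle=True)
for batch in loader:
    print(batch.num_graphs, batch.x.size())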
Environment
- OS: CentOS Linux 7 (Core)
- Python version: Python 3.6.8 :: Anaconda custom (64-bit)
- PyTorch version: torch==1.3.1 torch-cluster==1.4.5 torch-geometric==1.3.2 torch-scatter==1.4.0 torch-sparse==0.4.3 torchvision==0.4.2
- CUDA/cuDNN version: 10.1
Graph and model information
- lr = 1e-7, epochs = 200, weight_decay = 5e-4
- num_edges: 100902
- num_nodes: 11094
- directed: True
- data.num_node_features: 17
- data.num_classes: 2
- data: Data(edge_index=[2, 100902], num_classes=[1], x=[11094, 17], y=[11125])
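Note that in the Data summary above, y=[11125] while x=[11094, 17]: there are 31 more labels than nodes. A size mismatch like this, or a label outside [0, num_classes), is exactly the kind of out-of-bounds index that trips a device-side assert inside nll_loss. A quick sanity check, a sketch assuming the data object above and num_classes = 2:

### Run these on CPU before training; a failed assert names a concrete
### data problem instead of an opaque cudaErrorAssert.
assert data.y.size(0) == data.num_nodes, "expected one label per node"
assert data.edge_index.max().item() < data.num_nodes, "edge_index points at a missing node"
assert 0 <= data.y.min().item() and data.y.max().item() < 2, "labels must lie in [0, num_classes)"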
Code
#########################################################
### Learning on Graphs
#########################################################
import time

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

### Hyperparameters (from the post above).
lr = 1e-7
epochs = 200
weight_decay = 5e-4

start_time = time.time()


class Net(torch.nn.Module):
    def __init__(self):
        ### super() inherits from torch.nn.Module.
        super(Net, self).__init__()
        ### GCNConv(in_channels, out_channels, improved=False,
        ###         cached=False, bias=True, **kwargs)
        self.conv1 = GCNConv(data.num_node_features, 16)
        ### num_classes is stored on the Data object as a 1-element tensor
        ### (num_classes=[1] above), so cast it to a plain int.
        self.conv2 = GCNConv(16, int(data.num_classes))

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        print("x:", x.size(), "edge_index:", edge_index.size())
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)


def graph_learning(data):
    ### Set up the model.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = Net().to(device)
    data = data.to(device)
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=lr,
                                 weight_decay=weight_decay)
    for epoch in range(epochs):
        ### model.train() only toggles training mode; it returns the module,
        ### not a loss, so there is nothing to assign here.
        model.train()
        optimizer.zero_grad()
        out = model(data)
        loss = F.nll_loss(out[data.train_mask],
                          data.y[data.train_mask])
        loss.backward()
        optimizer.step()

        ### Train loss / accuracy.
        _, pred_tr = model(data).max(dim=1)
        correct_tr = float(
            pred_tr[data.train_mask].eq(data.y[data.train_mask]).sum().item())
        acc_tr = correct_tr / data.train_mask.sum().item()
        print("Epoch:", epoch,
              "Train loss:", loss.item(),
              "Train accuracy: {:.9f}".format(acc_tr))

        ### Test loss / accuracy.
        model.eval()
        _, pred = model(data).max(dim=1)
        print("\n\nPRED:", pred[data.test_mask],
              "ACTUAL:", data.y[data.test_mask], "\n\n")
        correct = float(
            pred[data.test_mask].eq(data.y[data.test_mask]).sum().item())
        acc = correct / data.test_mask.sum().item()
        print("Epoch:", epoch,
              "Loss:", loss.item(),
              "Test accuracy: {:.4f}".format(acc))

    torch.save(model.state_dict(), run_name + ".pt")


#########################################################
### Main function
#########################################################
if __name__ == '__main__':
    (edge_v1,
     edge_v2,
     node_number) = read_netlist_graph()
    data = build_netlist_graph(edge_v1, edge_v2, node_number)
    graph_learning(data)
    print("\n\n--- %s seconds ---" % (time.time() - start_time))
Issue Analytics
- Created: 4 years ago
- Comments: 6
Top GitHub Comments
I found out what the root cause is.
I recently got the same error message, here’s what fixed it for me:
Hope this is helpful 😃