
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered

See original GitHub issue

Hi experts, I’ve got this RuntimeError while training a GCN. I’ve tried decreasing the learning rate, as some older threads suggested, but it didn’t help. Does anyone know what the root cause might be? Also, I’d like to know whether it’s possible to set a batch_size when loading my data with the “Data” class.
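(Editorial note on the batch_size part of the question: as far as I know, Data itself has no batch_size argument; batching is done by wrapping a list of Data objects in PyG’s DataLoader, and for a single large graph with train/test masks, neighbour sampling is the usual alternative. A hedged sketch with made-up graphs and shapes:)

# Sketch only; the graphs and shapes here are invented for illustration.
import torch
from torch_geometric.data import Data, DataLoader  # torch_geometric.loader in newer releases

g1 = Data(x=torch.randn(4, 17),
          edge_index=torch.tensor([[0, 1, 2], [1, 2, 3]]),
          y=torch.tensor([0, 1, 0, 1]))
g2 = Data(x=torch.randn(3, 17),
          edge_index=torch.tensor([[0, 1], [1, 2]]),
          y=torch.tensor([1, 0, 1]))

# batch_size is a DataLoader argument, not a Data argument.
loader = DataLoader([g1, g2], batch_size=2, shuffle=True)
for batch in loader:
    # The loader concatenates the graphs into one disconnected graph and
    # records which node belongs to which graph in batch.batch.
    print(batch.num_graphs, batch.x.size(), batch.batch)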

Environment

  • OS: CentOS Linux 7 (Core)
  • Python version: Python 3.6.8 :: Anaconda custom (64-bit)
  • PyTorch version: torch==1.3.1, torch-cluster==1.4.5, torch-geometric==1.3.2, torch-scatter==1.4.0, torch-sparse==0.4.3, torchvision==0.4.2
  • CUDA/cuDNN version: 10.1

Graph and model information

  • lr = 1e-7
  • epochs = 200
  • weight_decay = 5e-4
  • num_edges: 100902
  • num_nodes: 11094
  • directed: True
  • data.num_node_features: 17
  • data.num_classes: 2
  • data: Data(edge_index=[2, 100902], num_classes=[1], x=[11094, 17], y=[11125])
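(Editorial note: one thing stands out in this printout: x has 11094 rows but y has 11125 entries, i.e. more labels than nodes. That may or may not be the root cause here, but this class of mismatch, along with out-of-range edge indices or labels, is exactly what triggers device-side asserts during GCN training. A few sanity checks, assuming data is the object above:)

# Sanity checks suggested by the numbers above; `data` is assumed to be
# the Data object from the post. With x=[11094, 17] and y=[11125], the
# first assertion would already fail, which is worth investigating.
assert data.x.size(0) == data.y.size(0), (
    f"{data.x.size(0)} node feature rows vs {data.y.size(0)} labels")
assert int(data.edge_index.max()) < data.x.size(0), (
    "edge_index refers to a node id beyond x.size(0)")
assert int(data.y.min()) >= 0 and int(data.y.max()) < 2, (
    "labels must lie in [0, num_classes - 1] for F.nll_loss")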

Code


import time

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

#########################################################
### Learning on Graphs
#########################################################

class Net(torch.nn.Module):

    ### Constructor
    def __init__(self):

        ### super() gives access to torch.nn.Module's initializer.
        super(Net, self).__init__()

        ### GCNConv(in_channels, out_channels,
        ###         improved=False, cached=False, bias=True, **kwargs)
        ### `data` here is the module-level Data object built in __main__.
        self.conv1 = GCNConv(data.num_node_features, 16)
        self.conv2 = GCNConv(16, data.num_classes)


    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        print("x:", x.size(), "edge_index:", edge_index.size())

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training = self.training)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)

def graph_learning(data):

    ### Set up the model.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = Net().to(device)

    data = data.to(device)

    optimizer = torch.optim.Adam(model.parameters(),
                                 lr = lr,
                                 weight_decay = weight_decay)

    for epoch in range(epochs):

        ### model.train() only switches the mode; it does not return a loss.
        model.train()
        optimizer.zero_grad()
        out = model(data)

        loss = F.nll_loss(out[data.train_mask],
                          data.y[data.train_mask])
        loss.backward()
        optimizer.step()

        ### Train loss / accuracy (dropout is still active here,
        ### since the model is in train mode).
        _, pred_tr = model(data).max(dim = 1)
        correct_tr = float(
            pred_tr[data.train_mask].eq(data.y[data.train_mask]).sum().item())
        acc_tr = correct_tr / data.train_mask.sum().item()
        print("Epoch:", epoch,
              "Train loss:", loss.item(),
              "Train Accuracy: {:.9f}".format(acc_tr))

        ### Test loss / accuracy
        model.eval()
        _, pred = model(data).max(dim = 1)
        print("\n\nPRED:", pred[data.test_mask], "ACTUAL:", data.y[data.test_mask], "\n\n")

        correct = float(
            pred[data.test_mask].eq(data.y[data.test_mask]).sum().item())

        acc = correct / data.test_mask.sum().item()
        print("Epoch:", epoch,
              "Loss:", loss.item(),
              'Test Accuracy: {:.4f}'.format(acc))

        ### `run_name` is defined elsewhere in the original script.
        torch.save(model.state_dict(), run_name + ".pt")

#########################################################
### Main function
#########################################################

if __name__ == '__main__':

    start_time = time.time()

    ### read_netlist_graph() and build_netlist_graph() are helpers
    ### defined elsewhere in the original script.
    (edge_v1,
     edge_v2,
     node_number) = read_netlist_graph()

    data = build_netlist_graph(edge_v1, edge_v2, node_number)

    graph_learning(data)

    print("\n\n--- %s seconds ---" % (time.time() - start_time))

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6

Top GitHub Comments

69 reactions
HYDesmondLiu commented, Dec 18, 2019

I found out what the root cause is.

3 reactions
simonre commented, May 12, 2020

I recently got the same error message; here’s what fixed it for me:

  1. Ran the code on CPU, which gives a more meaningful error message.
  2. Got an error message saying that the “scatter” operation failed during a GCNConv “forward” call.
  3. In my case, the root cause was a wrong input shape for a GCNConv layer: I had transposed the input at some point, which produced some out-of-bounds indices in the scatter operation (see the sketch after this comment).

Hope this is helpful 😃
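(Editorial note: a minimal way to reproduce the failure mode simonre describes; all shapes below are made up, not from the issue. An edge_index entry that is >= x.size(0) makes the scatter inside GCNConv read out of bounds, which is an explicit index error on CPU but only an opaque device-side assert on CUDA:)

# Sketch of the out-of-bounds scatter; all shapes are invented.
import torch
from torch_geometric.nn import GCNConv

conv = GCNConv(17, 16)
x = torch.randn(100, 17)                        # 100 nodes, 17 features

good = torch.tensor([[0, 50, 99], [1, 2, 3]])   # all node ids < 100: fine
bad  = torch.tensor([[0, 50, 150], [1, 2, 3]])  # 150 >= 100 nodes

conv(x, good)   # works
conv(x, bad)    # CPU: explicit out-of-range index error pointing at the
                # scatter; CUDA: device-side assert that surfaces later as
                # "copy_if failed to synchronize"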


