RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
Hi experts, I get this RuntimeError while training a GCN. I tried decreasing the learning rate, as some older threads suggested, but it did not help. Does anyone know what the root cause might be? I would also like to know whether it is possible to set a batch_size when loading my data through the Data class.
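(On the batch_size question: as far as I can tell, PyG's Data constructor itself takes no batch_size; batching is normally done by wrapping a list of Data objects in torch_geometric.data.DataLoader, which existed at that path in PyG 1.3.x. A minimal sketch under that assumption, with two made-up toy graphs; for a single large graph like the one in this issue, neighbor sampling rather than DataLoader is the usual route, but that is a separate topic:)

import torch
from torch_geometric.data import Data, DataLoader

### Two toy graphs; the loader merges each mini-batch into one big
### disconnected graph, with a `batch` vector mapping nodes to graphs.
graphs = [
    Data(x=torch.randn(4, 17),
         edge_index=torch.tensor([[0, 1], [1, 2]]),
         y=torch.zeros(4, dtype=torch.long)),
    Data(x=torch.randn(3, 17),
         edge_index=torch.tensor([[0, 2], [1, 0]]),
         y=torch.ones(3, dtype=torch.long)),
]
loader = DataLoader(graphs, batch_size=2, shuffle=True)
for batch in loader:
    print(batch.num_graphs, batch.x.size())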
Environment
- OS: CentOS Linux 7 (Core)
- Python version: Python 3.6.8 :: Anaconda custom (64-bit)
- PyTorch version: torch==1.3.1 torch-cluster==1.4.5 torch-geometric==1.3.2 torch-scatter==1.4.0 torch-sparse==0.4.3 torchvision==0.4.2
- CUDA/cuDNN version: 10.1
Graph and model information
- lr = 1e-7, epochs = 200, weight_decay = 5e-4
- num_edges: 100902
- num_nodes: 11094
- directed: True
- data.num_node_features: 17
- data.num_classes: 2
- data: Data(edge_index=[2, 100902], num_classes=[1], x=[11094, 17], y=[11125])
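Note that in the Data summary above, y=[11125] while x=[11094, 17]: there are 31 more labels than nodes. A size mismatch like this, or a label outside [0, num_classes), is exactly the kind of out-of-bounds index that trips a device-side assert inside nll_loss. A quick sanity check, a sketch assuming the data object above and num_classes = 2:

### Run these on CPU before training; a failed assert names a concrete
### data problem instead of an opaque cudaErrorAssert.
assert data.y.size(0) == data.num_nodes, "expected one label per node"
assert data.edge_index.max().item() < data.num_nodes, "edge_index points at a missing node"
assert 0 <= data.y.min().item() and data.y.max().item() < 2, "labels must lie in [0, num_classes)"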
Code
#########################################################
### Learning on Graphs
#########################################################
import time

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

### Hyperparameters (from the post above).
lr = 1e-7
epochs = 200
weight_decay = 5e-4

start_time = time.time()


class Net(torch.nn.Module):
    def __init__(self):
        ### super() inherits from torch.nn.Module.
        super(Net, self).__init__()
        ### GCNConv(in_channels, out_channels, improved=False,
        ###         cached=False, bias=True, **kwargs)
        self.conv1 = GCNConv(data.num_node_features, 16)
        ### num_classes is stored on the Data object as a 1-element tensor
        ### (num_classes=[1] above), so cast it to a plain int.
        self.conv2 = GCNConv(16, int(data.num_classes))

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        print("x:", x.size(), "edge_index:", edge_index.size())
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)


def graph_learning(data):
    ### Set up the model.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = Net().to(device)
    data = data.to(device)
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=lr,
                                 weight_decay=weight_decay)
    for epoch in range(epochs):
        ### model.train() only toggles training mode; it returns the module,
        ### not a loss, so there is nothing to assign here.
        model.train()
        optimizer.zero_grad()
        out = model(data)
        loss = F.nll_loss(out[data.train_mask],
                          data.y[data.train_mask])
        loss.backward()
        optimizer.step()

        ### Train loss / accuracy.
        _, pred_tr = model(data).max(dim=1)
        correct_tr = float(
            pred_tr[data.train_mask].eq(data.y[data.train_mask]).sum().item())
        acc_tr = correct_tr / data.train_mask.sum().item()
        print("Epoch:", epoch,
              "Train loss:", loss.item(),
              "Train accuracy: {:.9f}".format(acc_tr))

        ### Test loss / accuracy.
        model.eval()
        _, pred = model(data).max(dim=1)
        print("\n\nPRED:", pred[data.test_mask],
              "ACTUAL:", data.y[data.test_mask], "\n\n")
        correct = float(
            pred[data.test_mask].eq(data.y[data.test_mask]).sum().item())
        acc = correct / data.test_mask.sum().item()
        print("Epoch:", epoch,
              "Loss:", loss.item(),
              "Test accuracy: {:.4f}".format(acc))

    torch.save(model.state_dict(), run_name + ".pt")


#########################################################
### Main function
#########################################################
if __name__ == '__main__':
    (edge_v1,
     edge_v2,
     node_number) = read_netlist_graph()
    data = build_netlist_graph(edge_v1, edge_v2, node_number)
    graph_learning(data)
    print("\n\n--- %s seconds ---" % (time.time() - start_time))
Issue Analytics
- Created: 4 years ago
- Comments: 6
Top GitHub Comments
I found out what the root cause is.
I recently got the same error message, here’s what fixed it for me:
Hope this is helpful 😃