
ZeroDivisionError: float division by zero

See original GitHub issue

Hi, I am having a problem implementing GPT2. After some iterations I get a float division by zero error, and I don't know why.

optimizer = OpenAIAdam(optimizer_grouped_parameters, 
                       lr=lr,
                       warmup=0.05,
                       t_total=num_train_optimization_steps)

model, optimizer = amp.initialize(model, optimizer, opt_level="O1",verbosity=0)

for epoch in range(EPOCHS):
    optimizer.zero_grad()
    for batch,(X_train,y_train,weights) in tqdm(enumerate(train_loader),total=len(train_loader),leave=False):
        X_train = X_train.cuda()
        y_train = y_train.cuda()
        weights = weights.cuda()
        y_pred = model.forward(X_train)
        loss = loss_fn(y_train,y_pred,weights)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        if (batch+1) % accumulation_steps == 0:             # Wait for several backward steps
            optimizer.step()                            # Now we can do an optimizer step
            optimizer.zero_grad()

Screenshots of the error are attached in the original issue.

As you can see, after 1091 iterations I get this error.
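One small guard that can help localize errors like this (a debugging sketch, not part of the original code; it assumes the loss itself goes non-finite before the crash, which the comments below confirm) is to check the loss before handing it to the loss scaler, so the run stops at the first bad batch:

loss = loss_fn(y_train, y_pred, weights)
# sketch: reuses loss_fn, batch, amp and optimizer from the training loop above
if not torch.isfinite(loss):
    raise RuntimeError(f"non-finite loss {loss.item()} at batch {batch}")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()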

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8

Top GitHub Comments

2 reactions
suchithtuple commented, Jun 13, 2019

So I found where the problem is: in my loss function.

class BinaryFocalLoss(nn.Module):
    def __init__(self, gamma=0, eps=1e-8):
        super().__init__()
        self.gamma = gamma
        self.eps = eps
        self.sigmoid = nn.Sigmoid()

    def forward(self, y_true, y_pred, weights=None):
        # y_pred are raw logits; convert to probabilities
        y_pred = self.sigmoid(y_pred)

        if weights is None:
            weights = torch.ones((y_true.shape[0],)).type('torch.FloatTensor').cuda()

        # clamp away from 0 and 1 to keep the log terms finite
        y_pred = torch.clamp(y_pred, self.eps, 1 - self.eps)
        m = y_pred.shape[0]

        if len(y_true.shape) == 1:
            # add a class dimension so y_true and y_pred are both (N, 1)
            y_true = torch.unsqueeze(y_true, 1)
            y_pred = torch.unsqueeze(y_pred, 1)

        # focal-weighted binary cross-entropy, summed over the class dimension
        loss = torch.sum(y_true * torch.pow(1 - y_pred, self.gamma) * torch.log(y_pred) +
                         (1 - y_true) * torch.pow(y_pred, self.gamma) * torch.log(1 - y_pred), dim=1)

        # weighted mean over the batch
        loss = -torch.sum(weights * loss) / m

        return loss

I am clamping the predictions with eps=1e-8. These are my predictions from the first iteration:

(Pdb) x
tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
        1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
        5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
        9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
        6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
        9.4971e-01, 7.8247e-02], device='cuda:0', dtype=torch.float16,
       grad_fn=<SelectBackward>)

Notice the 1.0000e+00 entries. Even after clamping I get this:

(Pdb) torch.clamp(x,1e-8,1-1e-8)
tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
        1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
        5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
        9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
        6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
        9.4971e-01, 7.8247e-02], device='cuda:0', dtype=torch.float16,
       grad_fn=<ClampBackward>)

Notice that 1 is still present, and that's the reason I was getting nan. I think 1e-8 is simply too small for "O1": in half precision, 1 - 1e-8 rounds back to exactly 1.0, so the clamp has no effect and torch.log(1 - y_pred) produces -inf. I tried various powers and 1e-3 works well.

(Pdb) torch.clamp(x,1e-3,1-1e-3)
tensor([0.9971, 0.9727, 0.9678, 0.9990, 0.0873, 0.7739, 0.1796, 0.9658, 0.0010,
        0.9990, 0.4358, 0.0459, 0.5908, 0.9819, 0.6123, 0.9438, 0.9961, 0.9990,
        0.9980, 0.0128, 0.9990, 0.9883, 0.7993, 0.9697, 0.6138, 0.9634, 0.0010,
        0.8174, 0.8174, 0.9990, 0.9497, 0.0782], device='cuda:0',
       dtype=torch.float16, grad_fn=<ClampBackward>)
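For reference, a small standalone check (not from the original comment) shows the same thing: float16 has a machine epsilon of about 9.8e-4, so an upper bound of 1 - 1e-8 rounds back to exactly 1.0, while 1 - 1e-3 maps to 0.9990:

import torch

x = torch.tensor([1.0], dtype=torch.float16)
print(torch.finfo(torch.float16).eps)    # 0.0009765625
print(torch.clamp(x, 1e-8, 1 - 1e-8))    # tensor([1.], dtype=torch.float16)
print(torch.clamp(x, 1e-3, 1 - 1e-3))    # tensor([0.9990], dtype=torch.float16)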

Thanks a lot @ptrblck for the prompt reply. I am replacing it with 1e-3.

1 reaction
ptrblck commented, Jun 14, 2019

Thanks for the debugging! Note that even in FP32 your loss function might create nan values. I’ve created a small dummy example using your input values:

import torch

device = 'cuda'

y_pred = torch.tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
        1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
        5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
        9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
        6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
        9.4971e-01, 7.8247e-02], device=device)

y_true = torch.randint(0, 2, y_pred.size(), device=device).float()
# force a target of 1 where y_pred == 1, so that term becomes 0 * log(0) = 0 * (-inf) = nan
y_true[3] = 1.

# even in FP32, 1 - 1e-8 rounds to 1.0, so the clamp leaves the 1.0 entries untouched
eps = 1e-8
y_pred = torch.clamp(y_pred, eps, 1 - eps)

gamma = 0.
y_true = y_true.unsqueeze(1)
y_pred = y_pred.unsqueeze(1)
loss = torch.sum(y_true * torch.pow(1 - y_pred, gamma) * torch.log(y_pred) +
                 (1 - y_true) * torch.pow(y_pred, gamma) * torch.log(1 - y_pred), dim=1)
loss = -1.0 * loss.sum() / y_pred.size(0)
print(loss)
> tensor(nan, device='cuda:0')

To fix this, you might want to add eps to the argument in torch.log:

loss = torch.sum(y_true*torch.pow(1-y_pred,gamma)*torch.log(y_pred+eps) + \
               (1-y_true)*torch.pow(y_pred,gamma)*torch.log(1-y_pred+eps),dim=1)

Does this make sense or did I misunderstand your criterion?
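An alternative that avoids the clamp entirely (a sketch, not from this thread; it assumes the model returns raw logits, which matches the original loss where sigmoid is applied inside forward) is to compute the log terms from the logits with F.logsigmoid, which stays finite in both FP16 and FP32:

import torch
import torch.nn.functional as F

def binary_focal_loss_from_logits(logits, y_true, gamma=0.0, weights=None):
    # log(p) = logsigmoid(x) and log(1 - p) = logsigmoid(-x) are finite for any finite logit
    log_p = F.logsigmoid(logits)
    log_1_p = F.logsigmoid(-logits)
    p = torch.sigmoid(logits)
    loss = y_true * (1 - p).pow(gamma) * log_p + (1 - y_true) * p.pow(gamma) * log_1_p
    if weights is not None:
        loss = loss * weights
    return -loss.mean()

With gamma=0 and no weights this reduces to binary cross-entropy with logits, so it can be sanity-checked against nn.BCEWithLogitsLoss.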

