
ZeroDivisionError: float division by zero

See original GitHub issue

Hi, I am having a problem implementing GPT2. After some iterations I get a float division by zero error, and I don't know why.

optimizer = OpenAIAdam(optimizer_grouped_parameters, 
                       lr=lr,
                       warmup=0.05,
                       t_total=num_train_optimization_steps)

model, optimizer = amp.initialize(model, optimizer, opt_level="O1",verbosity=0)

for epoch in range(EPOCHS):
    optimizer.zero_grad()
    for batch,(X_train,y_train,weights) in tqdm(enumerate(train_loader),total=len(train_loader),leave=False):
        X_train = X_train.cuda()
        y_train = y_train.cuda()
        weights = weights.cuda()
        y_pred = model.forward(X_train)
        loss = loss_fn(y_train,y_pred,weights)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        if (batch+1) % accumulation_steps == 0:             # Wait for several backward steps
            optimizer.step()                            # Now we can do an optimizer step
            optimizer.zero_grad()

Screenshots of the error are attached in the original issue.

As you can see, after 1091 iterations I get this error.
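One small guard that can help localize errors like this (a debugging sketch, not part of the original code; it assumes the loss itself goes non-finite before the crash, which the comments below confirm) is to check the loss before handing it to the loss scaler, so the run stops at the first bad batch:

loss = loss_fn(y_train, y_pred, weights)
# sketch: reuses loss_fn, batch, amp and optimizer from the training loop above
if not torch.isfinite(loss):
    raise RuntimeError(f"non-finite loss {loss.item()} at batch {batch}")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()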

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8

Top GitHub Comments

2 reactions
suchithtuple commented, Jun 13, 2019

So I found where the problem is: in my loss function.

class BinaryFocalLoss(nn.Module):
    def __init__(self, gamma=0, eps=1e-8):
        super().__init__()
        self.gamma = gamma
        self.eps = eps
        self.sigmoid = nn.Sigmoid()

    def forward(self, y_true, y_pred, weights=None):
        # y_pred are raw logits; convert to probabilities
        y_pred = self.sigmoid(y_pred)

        if weights is None:
            weights = torch.ones((y_true.shape[0],)).type('torch.FloatTensor').cuda()

        # clamp away from 0 and 1 to keep the log terms finite
        y_pred = torch.clamp(y_pred, self.eps, 1 - self.eps)
        m = y_pred.shape[0]

        if len(y_true.shape) == 1:
            # add a class dimension so y_true and y_pred are both (N, 1)
            y_true = torch.unsqueeze(y_true, 1)
            y_pred = torch.unsqueeze(y_pred, 1)

        # focal-weighted binary cross-entropy, summed over the class dimension
        loss = torch.sum(y_true * torch.pow(1 - y_pred, self.gamma) * torch.log(y_pred) +
                         (1 - y_true) * torch.pow(y_pred, self.gamma) * torch.log(1 - y_pred), dim=1)

        # weighted mean over the batch
        loss = -torch.sum(weights * loss) / m

        return loss

I am clamping the predictions with eps=1e-8. These are my predictions from the first iteration:

(Pdb) x
tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
        1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
        5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
        9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
        6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
        9.4971e-01, 7.8247e-02], device='cuda:0', dtype=torch.float16,
       grad_fn=<SelectBackward>)

Notice the 1.0000e+00 entries. Even after clamping I get this:

(Pdb) torch.clamp(x,1e-8,1-1e-8)
tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
        1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
        5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
        9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
        6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
        9.4971e-01, 7.8247e-02], device='cuda:0', dtype=torch.float16,
       grad_fn=<ClampBackward>)

Notice that 1 is still present, and that's the reason I was getting nan. I think 1e-8 is simply too small for "O1": in half precision, 1 - 1e-8 rounds back to exactly 1.0, so the clamp has no effect and torch.log(1 - y_pred) produces -inf. I tried various powers and 1e-3 works well.

(Pdb) torch.clamp(x,1e-3,1-1e-3)
tensor([0.9971, 0.9727, 0.9678, 0.9990, 0.0873, 0.7739, 0.1796, 0.9658, 0.0010,
        0.9990, 0.4358, 0.0459, 0.5908, 0.9819, 0.6123, 0.9438, 0.9961, 0.9990,
        0.9980, 0.0128, 0.9990, 0.9883, 0.7993, 0.9697, 0.6138, 0.9634, 0.0010,
        0.8174, 0.8174, 0.9990, 0.9497, 0.0782], device='cuda:0',
       dtype=torch.float16, grad_fn=<ClampBackward>)
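For reference, a small standalone check (not from the original comment) shows the same thing: float16 has a machine epsilon of about 9.8e-4, so an upper bound of 1 - 1e-8 rounds back to exactly 1.0, while 1 - 1e-3 maps to 0.9990:

import torch

x = torch.tensor([1.0], dtype=torch.float16)
print(torch.finfo(torch.float16).eps)    # 0.0009765625
print(torch.clamp(x, 1e-8, 1 - 1e-8))    # tensor([1.], dtype=torch.float16)
print(torch.clamp(x, 1e-3, 1 - 1e-3))    # tensor([0.9990], dtype=torch.float16)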

Thanks a lot @ptrblck for the prompt reply. I am replacing it with 1e-3.

1 reaction
ptrblck commented, Jun 14, 2019

Thanks for the debugging! Note that even in FP32 your loss function might create nan values. I’ve created a small dummy example using your input values:

import torch

device = 'cuda'

y_pred = torch.tensor([9.9707e-01, 9.7266e-01, 9.6777e-01, 1.0000e+00, 8.7280e-02, 7.7393e-01,
        1.7957e-01, 9.6582e-01, 0.0000e+00, 1.0000e+00, 4.3579e-01, 4.5929e-02,
        5.9082e-01, 9.8193e-01, 6.1230e-01, 9.4385e-01, 9.9609e-01, 1.0000e+00,
        9.9805e-01, 1.2817e-02, 1.0000e+00, 9.8828e-01, 7.9932e-01, 9.6973e-01,
        6.1377e-01, 9.6338e-01, 1.4770e-04, 8.1738e-01, 8.1738e-01, 1.0000e+00,
        9.4971e-01, 7.8247e-02], device=device)

y_true = torch.randint(0, 2, y_pred.size(), device=device).float()
# force a target of 1 where y_pred == 1, so that term becomes 0 * log(0) = 0 * (-inf) = nan
y_true[3] = 1.

# even in FP32, 1 - 1e-8 rounds to 1.0, so the clamp leaves the 1.0 entries untouched
eps = 1e-8
y_pred = torch.clamp(y_pred, eps, 1 - eps)

gamma = 0.
y_true = y_true.unsqueeze(1)
y_pred = y_pred.unsqueeze(1)
loss = torch.sum(y_true * torch.pow(1 - y_pred, gamma) * torch.log(y_pred) +
                 (1 - y_true) * torch.pow(y_pred, gamma) * torch.log(1 - y_pred), dim=1)
loss = -1.0 * loss.sum() / y_pred.size(0)
print(loss)
> tensor(nan, device='cuda:0')

To fix this, you might want to add eps to the argument in torch.log:

loss = torch.sum(y_true*torch.pow(1-y_pred,gamma)*torch.log(y_pred+eps) + \
               (1-y_true)*torch.pow(y_pred,gamma)*torch.log(1-y_pred+eps),dim=1)

Does this make sense or did I misunderstand your criterion?
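An alternative that avoids the clamp entirely (a sketch, not from this thread; it assumes the model returns raw logits, which matches the original loss where sigmoid is applied inside forward) is to compute the log terms from the logits with F.logsigmoid, which stays finite in both FP16 and FP32:

import torch
import torch.nn.functional as F

def binary_focal_loss_from_logits(logits, y_true, gamma=0.0, weights=None):
    # log(p) = logsigmoid(x) and log(1 - p) = logsigmoid(-x) are finite for any finite logit
    log_p = F.logsigmoid(logits)
    log_1_p = F.logsigmoid(-logits)
    p = torch.sigmoid(logits)
    loss = y_true * (1 - p).pow(gamma) * log_p + (1 - y_true) * p.pow(gamma) * log_1_p
    if weights is not None:
        loss = loss * weights
    return -loss.mean()

With gamma=0 and no weights this reduces to binary cross-entropy with logits, so it can be sanity-checked against nn.BCEWithLogitsLoss.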

