ZeroDivisionError: float division by zero
Hi, I am having a problem while implementing GPT2. After some iterations I get a float division by zero error, and I don't know why.
from tqdm import tqdm
from apex import amp                             # NVIDIA apex mixed precision
from pytorch_pretrained_bert import OpenAIAdam   # optimizer shipped with pytorch-pretrained-bert

optimizer = OpenAIAdam(optimizer_grouped_parameters,
                       lr=lr,
                       warmup=0.05,
                       t_total=num_train_optimization_steps)

# Wrap model and optimizer for mixed-precision training at opt level "O1"
model, optimizer = amp.initialize(model, optimizer, opt_level="O1", verbosity=0)

for epoch in range(EPOCHS):
    optimizer.zero_grad()
    for batch, (X_train, y_train, weights) in tqdm(enumerate(train_loader),
                                                   total=len(train_loader),
                                                   leave=False):
        X_train = X_train.cuda()
        y_train = y_train.cuda()
        weights = weights.cuda()

        y_pred = model(X_train)   # equivalent to model.forward(X_train), but idiomatic
        loss = loss_fn(y_train, y_pred, weights)

        # Scale the loss so FP16 gradients do not underflow
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()

        if (batch + 1) % accumulation_steps == 0:  # wait for several backward steps
            optimizer.step()                       # now we can do an optimizer step
            optimizer.zero_grad()
I am attaching a screenshot of the error; as you can see, I get this error after 1091 iterations.
So I found where the problem is. In my loss function I clamp the predictions at 1e-8. In my predictions for the first iteration (screenshot omitted) there is a value of exactly 1, and even after clamping I still get nan. I think 1e-8 is simply not a suitable value for "O1"; I tried various powers of ten and 1e-3 works well. (Presumably the repeated nans make amp's dynamic loss scale shrink until it reaches zero, which is where the float division by zero comes from.) Thanks a lot @ptrblck for your prompt and fast reply. I am replacing it with 1e-3.
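A quick check shows why 1e-8 stops working under "O1" (a minimal sketch; the point is that 1e-8 lies below the smallest FP16 subnormal, about 6e-8, so it flushes to zero once a tensor is cast to half precision, while 1e-3 remains representable):

import torch

x = torch.tensor(1e-8)
print(x.half())                # tensor(0., dtype=torch.float16): 1e-8 underflows to 0
print(x.half().float().log())  # tensor(-inf): the clamp no longer guards torch.log

y = torch.tensor(1e-3)
print(y.half())                # tensor(0.0010, dtype=torch.float16): representable, clamp holds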
Thanks for the debugging! Note that even in FP32 your loss function might create nan values. I've created a small dummy example using your input values (sketched below). To fix this, you might want to add eps to the argument in torch.log. Does this make sense or did I misunderstand your criterion?
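A minimal sketch of such a dummy example, assuming a weighted binary-cross-entropy-style criterion (the actual loss_fn was never posted, so its exact form is an assumption):

import torch

y_pred = torch.tensor([0.2, 0.9, 1.0])   # note the exact 1.0, as in the first iteration
y_true = torch.tensor([0.0, 1.0, 1.0])
weights = torch.ones(3)

# Without eps: log(1 - 1.0) = -inf, and 0 * -inf = nan, even in FP32
bad = -(y_true * torch.log(y_pred)
        + (1 - y_true) * torch.log(1 - y_pred)) * weights
print(bad.mean())   # tensor(nan)

# With eps inside torch.log the loss stays finite
eps = 1e-6
good = -(y_true * torch.log(y_pred + eps)
         + (1 - y_true) * torch.log(1 - y_pred + eps)) * weights
print(good.mean())  # finite value

Clamping the predictions to [eps, 1 - eps] in FP32 (or to 1e-3 under "O1", as found above) achieves the same effect.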