
NaN during training

See original GitHub issue

Hi,

I fiddled around with the JAX code a bit and noticed that, for small systems where either spin channel contains only one electron, the network throws nan after some time.

ferminet --config ferminet/configs/atom.py --config.system.atom H --config.batch_size 4096 --config.pretrain.iterations 0
I0215 05:54:52.148167 139716596184896 train.py:461] Step 00538: -0.4999 E_h, pmove=0.97
I0215 05:54:52.173480 139716596184896 train.py:461] Step 00539: -0.4999 E_h, pmove=0.97
I0215 05:54:52.199377 139716596184896 train.py:461] Step 00540: nan E_h, pmove=0.97
I0215 05:54:52.224862 139716596184896 train.py:461] Step 00541: nan E_h, pmove=0.00
I0215 05:54:52.250287 139716596184896 train.py:461] Step 00542: nan E_h, pmove=0.00

I traced the issue down to the log-abs-determinant of the Slater determinant (in this case a 1x1 matrix). There is a small probability that a sample is chosen such that the 1x1 matrix is exactly 0; after that, the code just produces nan.
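
A minimal JAX sketch of this failure mode (the 1x1 matrix and the function name log_abs_psi are illustrative, not FermiNet's actual code): slogdet of a singular matrix returns -inf for the log-determinant, and its gradient involves the matrix inverse, so everything downstream turns non-finite.

import jax
import jax.numpy as jnp

def log_abs_psi(mat):
    # slogdet returns (sign, log|det|); for a singular matrix log|det| is -inf.
    _, logdet = jnp.linalg.slogdet(mat)
    return logdet

mat = jnp.array([[0.0]])           # a walker where the single orbital evaluates to exactly 0
print(log_abs_psi(mat))            # -inf
print(jax.grad(log_abs_psi)(mat))  # non-finite: the gradient of slogdet needs inv(mat)

Once one walker hits this, the non-finite value contaminates the batched energy and the optimizer update, which matches the logs above: every step after the first nan stays at nan, with pmove dropping to 0.00 because proposed moves can no longer be accepted.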

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
dpfau commented, Mar 19, 2021

full_det=True does not mean that spin is ignored, and it does not make things fully antisymmetric with respect to permutations of electrons of different spin. It just means that instead of there being N_alpha non-zero orbitals for alpha electrons and N_beta non-zero orbitals for beta electrons, there are now N = N_alpha + N_beta non-zero orbitals for both alpha and beta electrons (but the orbitals can be different!). This generalizes the full_det=False case. It seems to help on some systems, though the difference is not enormous.
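
To make that distinction concrete, here is a small JAX sketch (shapes and variable names are illustrative, not the FermiNet implementation): with full_det=False the wavefunction is a product of two per-spin determinants, and the full_det=True form reduces to exactly that when the cross-spin blocks are zero, which is why it is a strict generalization.

import jax
import jax.numpy as jnp

n_alpha, n_beta = 2, 1
n = n_alpha + n_beta
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))

# full_det=False: one Slater matrix per spin channel; log|psi| is the sum
# of the two log-determinants.
phi_alpha = jax.random.normal(key_a, (n_alpha, n_alpha))
phi_beta = jax.random.normal(key_b, (n_beta, n_beta))
_, log_a = jnp.linalg.slogdet(phi_alpha)
_, log_b = jnp.linalg.slogdet(phi_beta)
log_psi_blocked = log_a + log_b

# full_det=True: a single N x N determinant with N = n_alpha + n_beta orbitals
# evaluated for every electron. Zeroing the cross-spin blocks recovers the
# blocked case, so the blocked form is a special case of the full one.
phi_full = jnp.zeros((n, n))
phi_full = phi_full.at[:n_alpha, :n_alpha].set(phi_alpha)
phi_full = phi_full.at[n_alpha:, n_alpha:].set(phi_beta)
_, log_psi_full = jnp.linalg.slogdet(phi_full)   # equals log_psi_blocked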

On Fri, Mar 19, 2021 at 10:33 AM Nicholas Gao @.***> wrote:

This error also occurs for Lithium far from the optimum.

I0306 02:47:37.867998 139629635032896 train.py:461] Step 00093: -6.6119 E_h, pmove=0.76
I0306 02:47:38.068941 139629635032896 train.py:461] Step 00094: -6.6105 E_h, pmove=0.76
I0306 02:47:38.270185 139629635032896 train.py:461] Step 00095: -6.6636 E_h, pmove=0.76
I0306 02:47:38.471880 139629635032896 train.py:461] Step 00096: nan E_h, pmove=0.76
I0306 02:47:38.671543 139629635032896 train.py:461] Step 00097: nan E_h, pmove=0.00
I0306 02:47:38.870149 139629635032896 train.py:461] Step 00098: nan E_h, pmove=0.00

This happens only if one sets full_det to False, though. As far as I can tell, #23 (https://github.com/deepmind/ferminet/pull/23) fixes it.

On a side note: Is there a particular reason why full_det defaults to True? Isn’t a wavefunction only antisymmetric with respect to permutation of electrons of the same spin? Also, it does not align with the definition of FermiNet in the papers.


0 reactions
jsspencer commented, Aug 27, 2021

Fixed in #23.


