
NaN during training

See original GitHub issue

Hi,

I fiddled around with the JAX code a bit and noticed that, for small systems where either spin channel contains only one electron, the network throws nan after some time.

ferminet --config ferminet/configs/atom.py --config.system.atom H --config.batch_size 4096 --config.pretrain.iterations 0
I0215 05:54:52.148167 139716596184896 train.py:461] Step 00538: -0.4999 E_h, pmove=0.97
I0215 05:54:52.173480 139716596184896 train.py:461] Step 00539: -0.4999 E_h, pmove=0.97
I0215 05:54:52.199377 139716596184896 train.py:461] Step 00540: nan E_h, pmove=0.97
I0215 05:54:52.224862 139716596184896 train.py:461] Step 00541: nan E_h, pmove=0.00
I0215 05:54:52.250287 139716596184896 train.py:461] Step 00542: nan E_h, pmove=0.00

I traced the issue down to the log-abs-determinant of the Slater determinant (in this case a 1x1 matrix). There is a small probability that a sample is chosen such that the 1x1 matrix is exactly 0; after that, the code just produces nan.
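
A minimal JAX sketch of this failure mode (the 1x1 matrix and the function name log_abs_psi are illustrative, not FermiNet's actual code): slogdet of a singular matrix returns -inf for the log-determinant, and its gradient involves the matrix inverse, so everything downstream turns non-finite.

import jax
import jax.numpy as jnp

def log_abs_psi(mat):
    # slogdet returns (sign, log|det|); for a singular matrix log|det| is -inf.
    _, logdet = jnp.linalg.slogdet(mat)
    return logdet

mat = jnp.array([[0.0]])           # a walker where the single orbital evaluates to exactly 0
print(log_abs_psi(mat))            # -inf
print(jax.grad(log_abs_psi)(mat))  # non-finite: the gradient of slogdet needs inv(mat)

Once one walker hits this, the non-finite value contaminates the batched energy and the optimizer update, which matches the logs above: every step after the first nan stays at nan, with pmove dropping to 0.00 because proposed moves can no longer be accepted.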

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

1 reaction
dpfau commented, Mar 19, 2021

full_det=True does not mean that spin is ignored, and it does not make things fully antisymmetric with respect to permutations of electrons of different spin. It just means that instead of there being N_alpha non-zero orbitals for alpha electrons and N_beta non-zero orbitals for beta electrons, there are now N = N_alpha + N_beta non-zero orbitals for both alpha and beta electrons (but the orbitals can be different!). This generalizes the full_det=False case. It seems to help on some systems, though the difference is not enormous.
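
To make that distinction concrete, here is a small JAX sketch (shapes and variable names are illustrative, not the FermiNet implementation): with full_det=False the wavefunction is a product of two per-spin determinants, and the full_det=True form reduces to exactly that when the cross-spin blocks are zero, which is why it is a strict generalization.

import jax
import jax.numpy as jnp

n_alpha, n_beta = 2, 1
n = n_alpha + n_beta
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))

# full_det=False: one Slater matrix per spin channel; log|psi| is the sum
# of the two log-determinants.
phi_alpha = jax.random.normal(key_a, (n_alpha, n_alpha))
phi_beta = jax.random.normal(key_b, (n_beta, n_beta))
_, log_a = jnp.linalg.slogdet(phi_alpha)
_, log_b = jnp.linalg.slogdet(phi_beta)
log_psi_blocked = log_a + log_b

# full_det=True: a single N x N determinant with N = n_alpha + n_beta orbitals
# evaluated for every electron. Zeroing the cross-spin blocks recovers the
# blocked case, so the blocked form is a special case of the full one.
phi_full = jnp.zeros((n, n))
phi_full = phi_full.at[:n_alpha, :n_alpha].set(phi_alpha)
phi_full = phi_full.at[n_alpha:, n_alpha:].set(phi_beta)
_, log_psi_full = jnp.linalg.slogdet(phi_full)   # equals log_psi_blocked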

On Fri, Mar 19, 2021 at 10:33 AM Nicholas Gao @.***> wrote:

This error also occurs for Lithium far from the optimum.

I0306 02:47:37.867998 139629635032896 train.py:461] Step 00093: -6.6119 E_h, pmove=0.76
I0306 02:47:38.068941 139629635032896 train.py:461] Step 00094: -6.6105 E_h, pmove=0.76
I0306 02:47:38.270185 139629635032896 train.py:461] Step 00095: -6.6636 E_h, pmove=0.76
I0306 02:47:38.471880 139629635032896 train.py:461] Step 00096: nan E_h, pmove=0.76
I0306 02:47:38.671543 139629635032896 train.py:461] Step 00097: nan E_h, pmove=0.00
I0306 02:47:38.870149 139629635032896 train.py:461] Step 00098: nan E_h, pmove=0.00

This happens only if one sets full_det to False, though. As far as I can tell, #23 (https://github.com/deepmind/ferminet/pull/23) fixes it.

On a side note: Is there a particular reason why full_det defaults to True? Isn’t a wavefunction only antisymmetric with respect to permutation of electrons of the same spin? Also, it does not align with the definition of FermiNet in the papers.


0 reactions
jsspencer commented, Aug 27, 2021

Fixed in #23.


