Exception: process 0 terminated with signal SIGKILL
❓ Questions & Help
I was using this notebook: https://www.kaggle.com/theoviel/bert-pytorch-huggingface-with-tpu-multiprocessing
to fine-tune Hugging Face's XLM-RoBERTa base model on Jigsaw Multilingual (an ongoing Kaggle competition).
This is my first time with torch_xla and TPU multiprocessing.
The code I am running is exactly this: https://pastebin.com/fS94MKYc (run on a Kaggle kernel, which provides a TPU v3-8).
But even with batch_size = 8, my Jupyter notebook crashes with this error message: "Your notebook tried to allocate more memory than is available. It has restarted."
Meanwhile, I can see other people using the same model with batch_size = 64.
The full error message looks like this:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<timed exec> in <module>
/opt/conda/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
180 join=join,
181 daemon=daemon,
--> 182 start_method=start_method)
/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
156
157 # Loop on join until it returns True or raises an exception.
--> 158 while not context.join():
159 pass
160
/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
106 raise Exception(
107 "process %d terminated with signal %s" %
--> 108 (error_index, name)
109 )
110 else:
Exception: process 0 terminated with signal SIGKILL
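For context on the last frame of the traceback: a negative exit code from a child process means the child was killed by a signal, and the spawn helper reports that as "terminated with signal ...". SIGKILL cannot be caught by the process itself, and on Linux it is the signal the kernel's out-of-memory killer sends, which would be consistent with the notebook's memory message. The following is a minimal standard-library sketch of that exit-code convention, not torch code; `describe_exit` and `run_child_that_gets_killed` are hypothetical names used only for illustration, and a POSIX system is assumed.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a child's return code into a human-readable status."""
    if returncode < 0:
        # POSIX convention: -N means the child was killed by signal N.
        return "terminated with signal %s" % signal.Signals(-returncode).name
    if returncode == 0:
        return "exited normally"
    return "exited with code %d" % returncode


def run_child_that_gets_killed() -> int:
    """Start a child Python process that sends SIGKILL to itself."""
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    return subprocess.run([sys.executable, "-c", child_code]).returncode


if __name__ == "__main__":
    print(describe_exit(run_child_that_gets_killed()))
```

Because the child never gets a chance to raise a Python exception, the parent only sees the signal name, which is why the traceback above contains no frame from the training code.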
The same problem also occurs when I try Hugging Face's BERT base multilingual model, so I don't understand where in my code I need to make a change to get it working. It seems the problem is not the batch size but something else that I am unable to catch. Please help, thanks in advance.
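One way to check whether host RAM, rather than batch size, is the limit is to log each process's peak memory at a few points, for example after loading the dataset and after building the model. When several processes are spawned, anything each process loads on its own is held once per process. A minimal sketch using only the standard library follows; `peak_rss_mb` and `log_memory` are hypothetical helpers, not part of torch_xla, and the unit handling assumes Linux or macOS.

```python
import os
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(tag: str) -> str:
    """Build and print a one-line memory report for the current process."""
    line = "[pid %d] %s: peak RSS %.1f MiB" % (os.getpid(), tag, peak_rss_mb())
    print(line)
    return line


if __name__ == "__main__":
    log_memory("start")
    data = [0] * 5_000_000  # stand-in for loading a dataset
    log_memory("after allocating a large list")
```

Calling `log_memory` at the start of the function passed to `spawn`, and again after each loading step, shows which step grows the per-process footprint before the kill happens.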
Issue Analytics
- Created 3 years ago
- Reactions:9
- Comments:8
Top GitHub Comments
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I got this problem. Did you solve it?