Exception: process 0 terminated with signal SIGKILL
❓ Questions & Help
I was using this notebook: https://www.kaggle.com/theoviel/bert-pytorch-huggingface-with-tpu-multiprocessing
to fine-tune Hugging Face's XLM-RoBERTa base model on Jigsaw Multilingual (an ongoing Kaggle competition).
This is my first time with torch_xla and TPU multiprocessing!
The code I am running is exactly this: https://pastebin.com/fS94MKYc, on a Kaggle kernel that provides a TPU v3-8.
Even with batch_size = 8, my Jupyter notebook crashes after showing this message: Your notebook tried to allocate more memory than is available. It has restarted.
Meanwhile, I can see other people using the same model with batch_size = 64.
The full error message looks like this:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<timed exec> in <module>
/opt/conda/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
180 join=join,
181 daemon=daemon,
--> 182 start_method=start_method)
/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
156
157 # Loop on join until it returns True or raises an exception.
--> 158 while not context.join():
159 pass
160
/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
106 raise Exception(
107 "process %d terminated with signal %s" %
--> 108 (error_index, name)
109 )
110 else:
Exception: process 0 terminated with signal SIGKILL
The same problem also occurs when I try Hugging Face's BERT base multilingual model, so I don't understand where in my code I need to make a change to get it working. It seems the problem is not the batch size but something else that I am unable to catch. Please help, and thanks in advance.
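For readers unsure what the final line of the traceback means: the message is produced when a child process exits because of a signal instead of raising a Python exception. SIGKILL cannot be caught by the process itself, so no Python traceback from the worker is available. Together with the notebook's "tried to allocate more memory than is available" message, this is consistent with the host running out of RAM, though that is an inference and not something the traceback proves. The sketch below uses only the standard library (no torch_xla) to reproduce how a signal-terminated child shows up as a negative exit code. The names `worker`, `describe_exit`, and `run_demo` are hypothetical, and the "fork" start method assumes a Linux host such as a Kaggle kernel.

```python
import multiprocessing as mp
import os
import signal


def worker():
    # Stand-in for a training process: send ourselves an uncatchable SIGKILL,
    # which is also what an out-of-memory killer would deliver.
    os.kill(os.getpid(), signal.SIGKILL)


def describe_exit(exitcode):
    """Translate a multiprocessing exit code into a readable message."""
    if exitcode is not None and exitcode < 0:
        # A negative exit code -N means the child was terminated by signal N.
        name = signal.Signals(-exitcode).name
        return "terminated with signal %s" % name
    return "exited with code %s" % exitcode


def run_demo():
    ctx = mp.get_context("fork")  # assumption: Linux host
    proc = ctx.Process(target=worker)
    proc.start()
    proc.join()
    return proc.exitcode, describe_exit(proc.exitcode)


if __name__ == "__main__":
    code, message = run_demo()
    print(code, message)
```

Because the child is killed from outside its own Python code, the parent only ever sees the exit code. Watching host memory while the workers start up is one way to check whether RAM exhaustion is really the cause here.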
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I ran into the same problem. Did you solve it?