[FLAX] Core dump using example code
Environment info
transformers version: 4.8.1
flax version: 0.3.4
python version: 3.8.5
Who can help
Models:
FLAX - RoBERTa MLM
Information
I followed the official guide for creating VMs and TPUs: https://cloud.google.com/tpu/docs/jax-quickstart-tpu-vm
I then followed this guide for training RoBERTa on the Norwegian OSCAR training set: https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling
I am unable to run run_mlm_flax.py without getting a core dump. The same happens with the run_clm_flax.py script.
Error message
tcmalloc: large alloc 435677134848 bytes == (nil) @ 0x7f61ae7be680 0x7f61ae7deff4 0x7f61ae2d5309 0x7f61ae2d6fb9 0x7f61ae2d7056 0x7f5e637fd659 0x7f5e59233a09 0x7f61ae9b2b8a 0x7f61ae9b2c91 0x7f61ae711915 0x7f61ae9b70bf 0x7f61ae7118b8 0x7f61ae9b65fa 0x7f61ae58634c 0x7f61ae7118b8 0x7f61ae711983 0x7f61ae586b59 0x7f61ae5863da 0x67299f 0x682dcb 0x684321 0x5c3cb0 0x5f257d 0x56fcb6 0x56822a 0x5f6033 0x56ef97 0x5f5e56 0x56a136 0x5f5e56 0x569f5e
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
https://symbolize.stripped_domain/r/?trace=7f61ae5f418b,7f61ae5f420f&map=
*** SIGABRT received by PID 8576 (TID 8576) on cpu 95 from PID 8576; stack trace: ***
PC: @ 0x7f61ae5f418b (unknown) raise
@ 0x7f5f7fb581e0 976 (unknown)
@ 0x7f61ae5f4210 (unknown) (unknown)
https://symbolize.stripped_domain/r/?trace=7f61ae5f418b,7f5f7fb581df,7f61ae5f420f&map=ca1b7ab241ee28147b3d590cadb5dc1b:7f5f72e59000-7f5f7fe8bb20
E0628 20:40:48.745220 8576 coredump_hook.cc:292] RAW: Remote crash data gathering hook invoked.
E0628 20:40:48.745291 8576 coredump_hook.cc:384] RAW: Skipping coredump since rlimit was 0 at process start.
E0628 20:40:48.745305 8576 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0628 20:40:48.745322 8576 coredump_hook.cc:447] RAW: Sending fingerprint to remote end.
E0628 20:40:48.745346 8576 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0628 20:40:48.745362 8576 coredump_hook.cc:451] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0628 20:40:48.745366 8576 coredump_hook.cc:525] RAW: Discarding core.
E0628 20:40:48.749975 8576 process_state.cc:771] RAW: Raising signal 6 with default behavior
Aborted (core dumped)
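For scale, the first line of the log reports a single allocation request of 435677134848 bytes, which is roughly 406 GiB, and the `std::bad_alloc` that follows is that request failing. The sketch below only does the unit conversion; the helper names are made up for illustration, and the number by itself does not say what triggered the request.

```python
import re

# Matches the size field of a tcmalloc "large alloc" log line.
TCMALLOC_RE = re.compile(r"tcmalloc: large alloc (\d+) bytes")


def large_alloc_bytes(log_line):
    """Return the requested allocation size in bytes, or None if absent."""
    match = TCMALLOC_RE.search(log_line)
    return int(match.group(1)) if match else None


def to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# First part of the log line quoted in the error message above.
line = "tcmalloc: large alloc 435677134848 bytes == (nil) @ 0x7f61ae7be680"
size = large_alloc_bytes(line)
print(f"{size} bytes is about {to_gib(size):.1f} GiB")
```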
To reproduce
Follow the guide.
Issue Analytics
- Created 2 years ago
- Reactions: 1
- Comments: 23 (16 by maintainers)
Top GitHub Comments
I also got this core dump on TPUv3-8 and TPUv2-8 VMs. I’ll try some of the proposed fixes tomorrow and post an update. @patil-suraj
Exactly. That turned out to be the issue. I was a bit confused because installing flax with pip install gives me flax version 0.3.4. Installing from git still reports version 0.3.4, but now it works. Thanks a lot.
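Because the release from the package index and the git checkout both report 0.3.4, the version string alone cannot tell them apart. A minimal sketch of one way to check, assuming the package was installed with a pip recent enough to record `direct_url.json` (PEP 610) for installs from a URL or VCS; the function name `describe_install` is hypothetical, and the list of packages is only a guess at what is relevant here.

```python
import json
from importlib import metadata


def describe_install(package):
    """Report the version and, when recorded, the origin of an installed package."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return {"package": package, "installed": False}

    info = {
        "package": package,
        "installed": True,
        "version": dist.version,
        # Default when pip recorded no direct URL (typical for index installs).
        "source": "package index or unknown",
    }

    # pip writes direct_url.json only for installs from a URL, path, or VCS.
    raw = dist.read_text("direct_url.json")
    if raw:
        data = json.loads(raw)
        info["source"] = data.get("url", "unknown")
        vcs = data.get("vcs_info")
        if vcs:
            info["commit"] = vcs.get("commit_id")
    return info


# Packages chosen for illustration; adjust to the environment being checked.
for name in ("transformers", "flax", "jax", "jaxlib"):
    print(describe_install(name))
```

If flax was installed from git, the `source` field should show the repository URL and `commit` the exact revision, which is more useful in a bug report than the bare version number.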