[FLAX] Core dump using example code


Environment info

  • transformers version: 4.8.1
  • flax version: 0.3.4
  • python version: 3.8.5

Who can help

@patrickvonplaten

Models:

FLAX - RoBERTa MLM

Information

I followed the official guide for creating VMs and TPUs: https://cloud.google.com/tpu/docs/jax-quickstart-tpu-vm

I then followed this guide for training RoBERTa on the Norwegian OSCAR training set: https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling

I am unable to run run_mlm_flax.py without getting a core dump. The same happens with the run_clm_flax.py script.

Error message

tcmalloc: large alloc 435677134848 bytes == (nil) @  0x7f61ae7be680 0x7f61ae7deff4 0x7f61ae2d5309 0x7f61ae2d6fb9 0x7f61ae2d7056 0x7f5e637fd659 0x7f5e59233a09 0x7f61ae9b2b8a 0x7f61ae9b2c91 0x7f61ae711915 0x7f61ae9b70bf 0x7f61ae7118b8 0x7f61ae9b65fa 0x7f61ae58634c 0x7f61ae7118b8 0x7f61ae711983 0x7f61ae586b59 0x7f61ae5863da 0x67299f 0x682dcb 0x684321 0x5c3cb0 0x5f257d 0x56fcb6 0x56822a 0x5f6033 0x56ef97 0x5f5e56 0x56a136 0x5f5e56 0x569f5e
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
https://symbolize.stripped_domain/r/?trace=7f61ae5f418b,7f61ae5f420f&map= 
*** SIGABRT received by PID 8576 (TID 8576) on cpu 95 from PID 8576; stack trace: ***
PC: @     0x7f61ae5f418b  (unknown)  raise
    @     0x7f5f7fb581e0        976  (unknown)
    @     0x7f61ae5f4210  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f61ae5f418b,7f5f7fb581df,7f61ae5f420f&map=ca1b7ab241ee28147b3d590cadb5dc1b:7f5f72e59000-7f5f7fe8bb20 
E0628 20:40:48.745220    8576 coredump_hook.cc:292] RAW: Remote crash data gathering hook invoked.
E0628 20:40:48.745291    8576 coredump_hook.cc:384] RAW: Skipping coredump since rlimit was 0 at process start.
E0628 20:40:48.745305    8576 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0628 20:40:48.745322    8576 coredump_hook.cc:447] RAW: Sending fingerprint to remote end.
E0628 20:40:48.745346    8576 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0628 20:40:48.745362    8576 coredump_hook.cc:451] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0628 20:40:48.745366    8576 coredump_hook.cc:525] RAW: Discarding core.
E0628 20:40:48.749975    8576 process_state.cc:771] RAW: Raising signal 6 with default behavior
Aborted (core dumped)
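
For a sense of scale, the first line of the trace can be converted into a readable size. The snippet below is only an illustrative sketch: `parse_large_alloc` is a hypothetical helper written for this page, not part of tcmalloc, JAX, or Flax, and the regular expression assumes the message format shown in the log above.

```python
import re
from typing import Optional


def parse_large_alloc(line: str) -> Optional[int]:
    """Return the byte count from a tcmalloc 'large alloc' line, or None.

    Hypothetical helper; the pattern is taken from the log line in this report.
    """
    match = re.search(r"tcmalloc: large alloc (\d+) bytes", line)
    return int(match.group(1)) if match else None


# First line of the error message above, truncated after the first address.
line = "tcmalloc: large alloc 435677134848 bytes == (nil) @  0x7f61ae7be680"
size = parse_large_alloc(line)

# Convert bytes to GiB (2**30 bytes) for readability.
print(f"requested {size} bytes, about {size / 2**30:.1f} GiB")
```

The request works out to roughly 406 GiB in a single allocation, which is consistent with the `std::bad_alloc` that follows it in the log.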

To reproduce

Follow the guide.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 23 (16 by maintainers)

Top GitHub Comments

2 reactions
ghost commented, Jul 3, 2021

I also got this core dump on TPUv3-8 and TPUv2-8 VMs. I’ll try some of the proposed fixes tomorrow and post an update. @patil-suraj

2 reactions
peregilk commented, Jun 29, 2021

Exactly. That turned out to be the issue. I was a bit confused because installing flax with pip install gives me flax version 0.3.4. Installing from git still reports version 0.3.4, but now it works. Thanks a lot.
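
Since the version string is the same for both installs, it cannot tell you which one is active. One possible way to check, sketched below, is to read the `direct_url.json` metadata that recent pip versions record (per PEP 610) when a package is installed from a URL or VCS checkout rather than from the package index. `describe_install` is a hypothetical helper name, and the sketch assumes the package was installed by a pip version that writes this file; older pip versions or other installers may not.

```python
import json
from importlib import metadata


def describe_install(package: str) -> dict:
    """Report the installed version and, when recorded, the install origin.

    Hypothetical helper. Relies on PEP 610 'direct_url.json', which is only
    present for direct URL / VCS installs made by a pip that supports it.
    """
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return {"installed": False, "version": None, "source": None}

    # read_text returns None when the metadata file does not exist,
    # which is the normal case for installs from the package index.
    raw = dist.read_text("direct_url.json")
    source = "index or unknown"
    if raw:
        info = json.loads(raw)
        source = info.get("url", "unknown")
        commit = info.get("vcs_info", {}).get("commit_id")
        if commit:
            source = f"{source}@{commit}"
    return {"installed": True, "version": dist.version, "source": source}


if __name__ == "__main__":
    print(describe_install("flax"))
```

If the reported source is a git URL with a commit hash, the environment is using the git install even though the version still reads 0.3.4.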


