
GPTNeo Flax crashes: Check failed: n < sizes_size


Environment info

  • transformers version: 4.9.0.dev0
  • Platform: Linux-5.4.0-1043-gcp-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.3.4 (cpu)
  • Jax version: 0.2.17
  • JaxLib version: 0.1.68

Who can help

@patrickvonplaten

Information

Trying to run the experimental GPTNeo Flax script fails with the following error:

07/17/2021 16:08:11 - INFO - __main__ - ***** Running training *****
07/17/2021 16:08:11 - INFO - __main__ -   Num examples = 2852257
07/17/2021 16:08:11 - INFO - __main__ -   Num Epochs = 10
07/17/2021 16:08:11 - INFO - __main__ -   Instantaneous batch size per device = 3
07/17/2021 16:08:11 - INFO - __main__ -   Total train batch size (w. parallel & distributed) = 24
07/17/2021 16:08:11 - INFO - __main__ -   Total optimization steps = 1188440
Epoch ... (1/10):   0%|          | 0/10 [00:00<?, ?it/s]
  0%|          | 0/118844 [00:00<?, ?it/s]
F0717 16:08:46.411695   76098 array.h:414] Check failed: n < sizes_size
*** Check failure stack trace: ***
    @     0x7f4a22c7f347  (unknown)
    @     0x7f4a22c7ded4  (unknown)
    @     0x7f4a22c7d9c3  (unknown)
    @     0x7f4a22c7fcc9  (unknown)
    @     0x7f4a1e8e7eee  (unknown)
    @     0x7f4a1e87ab2f  (unknown)
    @     0x7f4a1e878cc2  (unknown)
    @     0x7f4a223fddb4  (unknown)
    @     0x7f4a223ff212  (unknown)
    @     0x7f4a223fce23  (unknown)
    @     0x7f4a1885856f  (unknown)
    @     0x7f4a1e8a3248  (unknown)
    @     0x7f4a1e8a4d2b  (unknown)
    @     0x7f4a1e3f202b  (unknown)
    @     0x7f4a1e8e3001  (unknown)
    @     0x7f4a1e8e0d6a  (unknown)
    @     0x7f4a1e8e08bd  (unknown)
    @     0x7f4a1e8e3001  (unknown)
    @     0x7f4a1e8e0d6a  (unknown)
    @     0x7f4a1e8e08bd  (unknown)
    @     0x7f4a1df5f13f  (unknown)
    @     0x7f4a1df5a52e  (unknown)
    @     0x7f4a1df64292  (unknown)
    @     0x7f4a1df71ffd  (unknown)
    @     0x7f4a1db5c6b6  (unknown)
    @     0x7f4a1db5c014  TpuCompiler_Compile
    @     0x7f4a28dcf956  xla::(anonymous namespace)::TpuCompiler::Compile()
    @     0x7f4a2657f0d4  xla::Service::BuildExecutables()
    @     0x7f4a265751a0  xla::LocalService::CompileExecutables()
    @     0x7f4a264b9e07  xla::LocalClient::Compile()
    @     0x7f4a264942a0  xla::PjRtStreamExecutorClient::Compile()
    @     0x7f4a2408f152  xla::PyClient::Compile()
    @     0x7f4a23e095e2  pybind11::detail::argument_loader<>::call_impl<>()
    @     0x7f4a23e09a51  pybind11::cpp_function::initialize<>()::{lambda()#3}::operator()()
    @     0x7f4a23df0460  pybind11::cpp_function::dispatcher()
    @           0x5f2cc9  PyCFunction_Call
https://symbolize.stripped_domain/r/?trace=7f4a22c7f347,7f4a22c7ded3,7f4a22c7d9c2,7f4a22c7fcc8,7f4a1e8e7eed,7f4a1e87ab2e,7f4a1e878cc1,7f4a223fddb3,7f4a223ff211,7f4a223fce22,7f4a1885856e,7f4a1e8a3247,7f4a1e8a4d2a,7f4a1e3f202a,7f4a1e8e3000,7f4a1e8e0d69,7f4a1e8e08bc,7f4a1e8e3000,7f4a1e8e0d69,7f4a1e8e08bc,7f4a1df5f13e,7f4a1df5a52d,7f4a1df64291,7f4a1df71ffc,7f4a1db5c6b5,7f4a1db5c013,7f4a28dcf955,7f4a2657f0d3,7f4a2657519f,7f4a264b9e06,7f4a2649429f,7f4a2408f151,7f4a23e095e1,7f4a23e09a50,7f4a23df045f,5f2cc8&map=20957999b35a518f734e5552ed1ebec946aa0e35:7f4a2378b000-7f4a2a67dfc0,2a762cd764e70bc90ae4c7f9747c08d7:7f4a15d2d000-7f4a22fae280 
https://symbolize.stripped_domain/r/?trace=7f4acedc218b,7f4acedc220f,7f4a22c7f487,7f4a22c7ded3,7f4a22c7d9c2,7f4a22c7fcc8,7f4a1e8e7eed,7f4a1e87ab2e,7f4a1e878cc1,7f4a223fddb3,7f4a223ff211,7f4a223fce22,7f4a1885856e,7f4a1e8a3247,7f4a1e8a4d2a,7f4a1e3f202a,7f4a1e8e3000,7f4a1e8e0d69,7f4a1e8e08bc,7f4a1e8e3000,7f4a1e8e0d69,7f4a1e8e08bc,7f4a1df5f13e,7f4a1df5a52d,7f4a1df64291,7f4a1df71ffc,7f4a1db5c6b5,7f4a1db5c013,7f4a28dcf955,7f4a2657f0d3,7f4a2657519f,7f4a264b9e06,7f4a2649429f&map=20957999b35a518f734e5552ed1ebec946aa0e35:7f4a2378b000-7f4a2a67dfc0,2a762cd764e70bc90ae4c7f9747c08d7:7f4a15d2d000-7f4a22fae280 
*** SIGABRT received by PID 76098 (TID 76098) on cpu 46 from PID 76098; ***
E0717 16:08:46.484046   76098 coredump_hook.cc:292] RAW: Remote crash data gathering hook invoked.
E0717 16:08:46.484074   76098 coredump_hook.cc:384] RAW: Skipping coredump since rlimit was 0 at process start.
E0717 16:08:46.484099   76098 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0717 16:08:46.484107   76098 coredump_hook.cc:447] RAW: Sending fingerprint to remote end.
E0717 16:08:46.484121   76098 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0717 16:08:46.484133   76098 coredump_hook.cc:451] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0717 16:08:46.484139   76098 coredump_hook.cc:525] RAW: Discarding core.
F0717 16:08:46.411695   76098 array.h:414] Check failed: n < sizes_size 
E0717 16:08:46.761921   76098 process_state.cc:771] RAW: Raising signal 6 with default behavior

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (12 by maintainers)

Top GitHub Comments

skye commented, Sep 20, 2021 (1 reaction)

Can someone try running this with the latest jax[tpu] install? At least one crash has been resolved since this was posted, and I wonder if this one was as well.
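
Before retrying, it may help to confirm which versions are actually installed and whether they are newer than the ones in the environment report. The following is a minimal sketch using only the Python standard library. The helper names (`installed_versions`, `parse_version`, `is_newer`) are hypothetical and not part of jax or transformers, and the version parsing is deliberately simplistic: it compares leading numeric components only.

```python
from importlib import metadata

# Versions copied from the environment report in this issue.
REPORTED = {
    "transformers": "4.9.0.dev0",
    "flax": "0.3.4",
    "jax": "0.2.17",
    "jaxlib": "0.1.68",
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def parse_version(text):
    """Turn '0.2.17' into (0, 2, 17); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_newer(installed, reported):
    """True if `installed` is present and strictly newer than `reported`."""
    if installed is None:
        return False
    return parse_version(installed) > parse_version(reported)


if __name__ == "__main__":
    current = installed_versions(REPORTED)
    for name, reported in REPORTED.items():
        found = current[name]
        status = "newer" if is_newer(found, reported) else "not newer"
        print(f"{name}: reported {reported}, installed {found} ({status})")
```

If `jax` or `jaxlib` shows up as "not newer", the crash is being reproduced on the same versions as the original report rather than on an upgraded install.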

github-actions[bot] commented, Oct 18, 2021 (0 reactions)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
