GPTNeo Flax - crashes - n> sizes_size
See original GitHub issueEnvironment info
transformers
version: 4.9.0.dev0- Platform: Linux-5.4.0-1043-gcp-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): not installed (NA)
- Tensorflow version (GPU?): 2.5.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.3.4 (cpu)
- Jax version: 0.2.17
- JaxLib version: 0.1.68
Who can help
Information
Trying to run the experimental GPTNeo Flax script. Are getting the following error:
07/17/2021 16:08:11 - INFO - __main__ - ***** Running training *****
07/17/2021 16:08:11 - INFO - __main__ - Num examples = 2852257
07/17/2021 16:08:11 - INFO - __main__ - Num Epochs = 10
07/17/2021 16:08:11 - INFO - __main__ - Instantaneous batch size per device = 3
07/17/2021 16:08:11 - INFO - __main__ - Total train batch size (w. parallel & distributed) = 24
07/17/2021 16:08:11 - INFO - __main__ - Total optimization steps = 1188440
Epoch ... (1/10): 0%| | 0/10 [00:00<?, ?it/sF0717 16:08:46.411695 76098 array.h:414] Check failed: n < sizes_size | 0/118844 [00:00<?, ?it/s]
*** Check failure stack trace: ***
@ 0x7f4a22c7f347 (unknown)
@ 0x7f4a22c7ded4 (unknown)
@ 0x7f4a22c7d9c3 (unknown)
@ 0x7f4a22c7fcc9 (unknown)
@ 0x7f4a1e8e7eee (unknown)
@ 0x7f4a1e87ab2f (unknown)
@ 0x7f4a1e878cc2 (unknown)
@ 0x7f4a223fddb4 (unknown)
@ 0x7f4a223ff212 (unknown)
@ 0x7f4a223fce23 (unknown)
@ 0x7f4a1885856f (unknown)
@ 0x7f4a1e8a3248 (unknown)
@ 0x7f4a1e8a4d2b (unknown)
@ 0x7f4a1e3f202b (unknown)
@ 0x7f4a1e8e3001 (unknown)
@ 0x7f4a1e8e0d6a (unknown)
@ 0x7f4a1e8e08bd (unknown)
@ 0x7f4a1e8e3001 (unknown)
@ 0x7f4a1e8e0d6a (unknown)
@ 0x7f4a1e8e08bd (unknown)
@ 0x7f4a1df5f13f (unknown)
@ 0x7f4a1df5a52e (unknown)
@ 0x7f4a1df64292 (unknown)
@ 0x7f4a1df71ffd (unknown)
@ 0x7f4a1db5c6b6 (unknown)
@ 0x7f4a1db5c014 TpuCompiler_Compile
@ 0x7f4a28dcf956 xla::(anonymous namespace)::TpuCompiler::Compile()
@ 0x7f4a2657f0d4 xla::Service::BuildExecutables()
@ 0x7f4a265751a0 xla::LocalService::CompileExecutables()
@ 0x7f4a264b9e07 xla::LocalClient::Compile()
@ 0x7f4a264942a0 xla::PjRtStreamExecutorClient::Compile()
@ 0x7f4a2408f152 xla::PyClient::Compile()
@ 0x7f4a23e095e2 pybind11::detail::argument_loader<>::call_impl<>()
@ 0x7f4a23e09a51 pybind11::cpp_function::initialize<>()::{lambda()#3}::operator()()
@ 0x7f4a23df0460 pybind11::cpp_function::dispatcher()
@ 0x5f2cc9 PyCFunction_Call
https://symbolize.stripped_domain/r/?trace=7f4a22c7f347,7f4a22c7ded3,7f4a22c7d9c2,7f4a22c7fcc8,7f4a1e8e7eed,7f4a1e87ab2e,7f4a1e878cc1,7f4a223fddb3,7f4a223ff211,7f4a223fce22,7f4a1885856e,7f4a1e8a3247,7f4a1e8a4d2a,7f4a1e3f202a,7f4a1e8e3000,7f4a1e8e0d69,7f4a1e8e08bc,7f4a1e8e3000,7f4a1e8e0d69,7f4a1e8e08bc,7f4a1df5f13e,7f4a1df5a52d,7f4a1df64291,7f4a1df71ffc,7f4a1db5c6b5,7f4a1db5c013,7f4a28dcf955,7f4a2657f0d3,7f4a2657519f,7f4a264b9e06,7f4a2649429f,7f4a2408f151,7f4a23e095e1,7f4a23e09a50,7f4a23df045f,5f2cc8&map=20957999b35a518f734e5552ed1ebec946aa0e35:7f4a2378b000-7f4a2a67dfc0,2a762cd764e70bc90ae4c7f9747c08d7:7f4a15d2d000-7f4a22fae280
https://symbolize.stripped_domain/r/?trace=7f4acedc218b,7f4acedc220f,7f4a22c7f487,7f4a22c7ded3,7f4a22c7d9c2,7f4a22c7fcc8,7f4a1e8e7eed,7f4a1e87ab2e,7f4a1e878cc1,7f4a223fddb3,7f4a223ff211,7f4a223fce22,7f4a1885856e,7f4a1e8a3247,7f4a1e8a4d2a,7f4a1e3f202a,7f4a1e8e3000,7f4a1e8e0d69,7f4a1e8e08bc,7f4a1e8e3000,7f4a1e8e0d69,7f4a1e8e08bc,7f4a1df5f13e,7f4a1df5a52d,7f4a1df64291,7f4a1df71ffc,7f4a1db5c6b5,7f4a1db5c013,7f4a28dcf955,7f4a2657f0d3,7f4a2657519f,7f4a264b9e06,7f4a2649429f&map=20957999b35a518f734e5552ed1ebec946aa0e35:7f4a2378b000-7f4a2a67dfc0,2a762cd764e70bc90ae4c7f9747c08d7:7f4a15d2d000-7f4a22fae280
*** SIGABRT received by PID 76098 (TID 76098) on cpu 46 from PID 76098; ***
E0717 16:08:46.484046 76098 coredump_hook.cc:292] RAW: Remote crash data gathering hook invoked.
E0717 16:08:46.484074 76098 coredump_hook.cc:384] RAW: Skipping coredump since rlimit was 0 at process start.
E0717 16:08:46.484099 76098 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0717 16:08:46.484107 76098 coredump_hook.cc:447] RAW: Sending fingerprint to remote end.
E0717 16:08:46.484121 76098 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0717 16:08:46.484133 76098 coredump_hook.cc:451] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0717 16:08:46.484139 76098 coredump_hook.cc:525] RAW: Discarding core.
F0717 16:08:46.411695 76098 array.h:414] Check failed: n < sizes_size
E0717 16:08:46.761921 76098 process_state.cc:771] RAW: Raising signal 6 with default behavior
Issue Analytics
- State:
- Created 2 years ago
- Comments:14 (12 by maintainers)
Top Results From Across the Web
Trying to run option 2, "GPT Neo 2.7B", causes a memory leak ...
The memory looks normal in taskmanager and your specs look like you will be able to run these models. Are you running GPT-Neo...
Read more >EleutherAI GPT-Neo VS Flexbox Froggy - compare differences ...
Compare EleutherAI GPT-Neo VS Flexbox Froggy and see what are their differences. Flexbox Froggy logo ... Omnichannel CRM for Businesses of all sizes....
Read more >How To Fix: External Disk Drive Suddenly Became RAW
Select Intel and hit enter (there is a slight chance that the partition is EFI GPT if the drive is 2TB or greater...
Read more >Untitled
Niessing ring sizes, Vox aga70 acoustic guitar amp review, Marilyn manson quotes twitter ... Roccat pyra wireless nvidia edition, Proton neo price malaysia, ......
Read more >Connor Leahy on Dignity and Conjecture - The Inside View
In the last episode, we talked a lot about EleutherAI, a grassroot collective of researchers he co-founded, who open-sourced GPT-3 size ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Can someone try running this with the latest jax[tpu] install? At least one crash has been resolved since this was posted, and I wonder if this one was as well.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.