Mesos sometimes crashes since Ubuntu 18 update
Not sure if anyone else has run into problems with the recent Ubuntu 18 + Mesos update (it’s not officially released yet), but I patched it in on our end and Mesos sometimes crashes at the very end of the pipeline (~30% of the time for a medium-sized pipeline; stack trace below). I’m pretty sure the crash is related to the Mesos+U18 update, since our pipelines have run reliably ever since I removed the U18 change. Also, the new Mesos dashboard refreshes more slowly and often loses its connection, which is a bit annoying to work with.
Is anyone else able to reproduce this? Note that the crash only seems to happen on a medium-sized pipeline (i.e. when autoscaling kicks in; my theory is that it has to do with calling shutdown while some workers are being deactivated as the cluster is scaling down). I’ve been running it on AWS using the CWL runner.
In the meantime, should we revert to the tried-and-tested version until we have done more testing on the newer one? @DailyDreaming @jeffrey856 WDYT?
Stack trace from journalctl (also includes some relevant info logs), which is suspiciously similar to https://issues.apache.org/jira/browse/MESOS-9609:
I0715 02:56:40.071446 13 master.cpp:1295] Agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) disconnected
I0715 02:56:40.071503 13 master.cpp:3333] Disconnecting agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
I0715 02:56:40.071527 13 master.cpp:3352] Deactivating agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
I0715 02:56:40.071563 13 master.cpp:1319] Removing framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from disconnected agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) because the framework is not checkpointing
I0715 02:56:40.071579 13 master.cpp:11006] Removing framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
I0715 02:56:40.071583 12 hierarchical.cpp:829] Agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 deactivated
I0715 02:56:40.071619 13 master.cpp:11766] Removing executor 'toil-41' with resources {} of framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 on agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
I0715 02:58:08.642220 12 master.cpp:9130] Marking agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out
I0715 02:58:08.642675 11 registrar.cpp:487] Applied 1 operations in 305592ns; attempting to update the registry
I0715 02:58:08.642922 13 registrar.cpp:544] Successfully updated the registry in 187904ns
I0715 02:58:08.643081 17 master.cpp:9173] Marked agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out
F0715 02:58:08.643210 17 master.cpp:11402] Check failed: 'framework' Must be non NULL
*** Check failure stack trace: ***
I0715 02:58:08.643254 12 hierarchical.cpp:680] Removed agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0
@ 0x7ffbcffd090d google::LogMessage::Fail()
@ 0x7ffbcffd2748 google::LogMessage::SendToLog()
@ 0x7ffbcffd04f3 google::LogMessage::Flush()
@ 0x7ffbcffd31d9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7ffbcec65024 google::CheckNotNull<>()
@ 0x7ffbcec32658 mesos::internal::master::Master::__removeSlave()
@ 0x7ffbcec33b13 mesos::internal::master::Master::_markUnreachable()
@ 0x7ffbcec33e55 _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEUlbE_JbEEEEclEv
@ 0x7ffbce93d5d8 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
@ 0x7ffbcff18371 process::ProcessBase::consume()
@ 0x7ffbcff3a97a process::ProcessManager::resume()
@ 0x7ffbcff3e6a6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
@ 0x7ffbcc2cd9e0 (unknown)
@ 0x7ffbcbde06db start_thread
@ 0x7ffbcbb0988f (unknown)
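For anyone trying to check whether their own master logs show the same failure, here is a rough Python sketch that scans log lines for glog fatal (`F`) entries and returns the few entries logged just before each one. The names `parse_line` and `fatal_with_context` are made up for illustration, and the regex only assumes the glog prefix visible in the log above (severity letter, MMDD, time, thread id, `file:line]`); it is not part of Toil or Mesos.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# glog prefix as seen in the log above, e.g.
#   F0715 02:58:08.643210 17 master.cpp:11402] Check failed: ...
# The leading (?:^|\s) lets this also match when journalctl has put its own
# timestamp/host prefix in front of the message.
GLOG_LINE = re.compile(
    r"(?:^|\s)(?P<sev>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<tid>\d+) "
    r"(?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)


class Entry(NamedTuple):
    severity: str  # one of I, W, E, F
    time: str      # hh:mm:ss.uuuuuu
    source: str    # file:line, e.g. master.cpp:11402
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one glog-formatted line; return None for anything else
    (stack frames, the '*** Check failure' banner, blank lines)."""
    m = GLOG_LINE.search(line.strip())
    if m is None:
        return None
    return Entry(m["sev"], m["time"], m["src"], m["msg"])


def fatal_with_context(lines, before: int = 3) -> List[Tuple[Entry, List[Entry]]]:
    """Return (fatal_entry, preceding_entries) pairs, where preceding_entries
    holds up to `before` parsed entries logged right before the fatal one."""
    history: List[Entry] = []
    hits: List[Tuple[Entry, List[Entry]]] = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry.severity == "F":
            start = max(0, len(history) - before)
            hits.append((entry, history[start:]))
        history.append(entry)
    return hits
```

To use it, pass any iterable of lines, for example `fatal_with_context(sys.stdin)` with the journalctl output piped in. If the hit is `Check failed: 'framework' Must be non NULL` preceded by a `Marked agent ... unreachable` entry, it is probably the same crash as the one in this issue.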
┆Issue is synchronized with this Jira Task ┆Issue Number: TOIL-397
Issue Analytics
- Created 4 years ago
- Comments:7 (7 by maintainers)
Top GitHub Comments
Thanks so much! Now we can look deeper into why the crash happens on the new Mesos version. I’ll update if I find anything!
FYI, this bug has been fixed in v1.11.0 🎉 So we could give this another try once that’s released.