Mesos sometimes crashes since Ubuntu 18 update

Not sure if anyone else has hit issues with the recent Ubuntu 18 + Mesos update (it’s not officially released yet), but I patched it in on our end and Mesos sometimes crashes at the very end of the pipeline (about 30% of the time for a medium-sized pipeline; stack trace below). I’m pretty sure the crash is related to the Mesos + U18 update, since our pipelines have run reliably since I removed the U18 change. Also, the new Mesos dashboard refreshes more slowly and often loses its connection, which is a bit annoying to work with.

Is anyone else able to reproduce this? Note that the crash only seems to happen on a medium-sized pipeline, i.e. when autoscaling kicks in. My theory is that it has to do with calling shutdown while some workers are being deactivated as the cluster scales down. I’ve been running it on AWS using the CWL runner.

In the meantime, should we revert to the tried-and-tested version until we have done more testing on the newer one? @DailyDreaming @jeffrey856 WDYT?

Stack trace from journalctl (also includes some relevant info logs), which is suspiciously similar to https://issues.apache.org/jira/browse/MESOS-9609 :

 I0715 02:56:40.071446    13 master.cpp:1295] Agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) disconnected
 I0715 02:56:40.071503    13 master.cpp:3333] Disconnecting agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
 I0715 02:56:40.071527    13 master.cpp:3352] Deactivating agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
 I0715 02:56:40.071563    13 master.cpp:1319] Removing framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from disconnected agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) because the framework is not checkpointing
 I0715 02:56:40.071579    13 master.cpp:11006] Removing framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
 I0715 02:56:40.071583    12 hierarchical.cpp:829] Agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 deactivated
 I0715 02:56:40.071619    13 master.cpp:11766] Removing executor 'toil-41' with resources {} of framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 on agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
 I0715 02:58:08.642220    12 master.cpp:9130] Marking agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out
 I0715 02:58:08.642675    11 registrar.cpp:487] Applied 1 operations in 305592ns; attempting to update the registry
 I0715 02:58:08.642922    13 registrar.cpp:544] Successfully updated the registry in 187904ns
 I0715 02:58:08.643081    17 master.cpp:9173] Marked agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out
 F0715 02:58:08.643210    17 master.cpp:11402] Check failed: 'framework' Must be non NULL
 *** Check failure stack trace: ***
 I0715 02:58:08.643254    12 hierarchical.cpp:680] Removed agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0
     @     0x7ffbcffd090d  google::LogMessage::Fail()
     @     0x7ffbcffd2748  google::LogMessage::SendToLog()
     @     0x7ffbcffd04f3  google::LogMessage::Flush()
     @     0x7ffbcffd31d9  google::LogMessageFatal::~LogMessageFatal()
     @     0x7ffbcec65024  google::CheckNotNull<>()
     @     0x7ffbcec32658  mesos::internal::master::Master::__removeSlave()
     @     0x7ffbcec33b13  mesos::internal::master::Master::_markUnreachable()
     @     0x7ffbcec33e55  _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEUlbE_JbEEEEclEv
     @     0x7ffbce93d5d8  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
     @     0x7ffbcff18371  process::ProcessBase::consume()
     @     0x7ffbcff3a97a  process::ProcessManager::resume()
     @     0x7ffbcff3e6a6  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
     @     0x7ffbcc2cd9e0  (unknown)
     @     0x7ffbcbde06db  start_thread
     @     0x7ffbcbb0988f  (unknown)
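
For anyone trying to reproduce this, here is a rough sketch of a helper that scans the master's journalctl output for the first fatal (`F`) glog line and returns the few log entries just before it, which makes it easier to check whether the crash always follows a "Marked agent ... unreachable" message. It only assumes the standard glog line prefix seen in the trace above; the names `parse_glog` and `find_first_fatal` are made up for illustration and are not part of Toil or Mesos.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple

# glog prefix: severity letter, MMDD, HH:MM:SS.uuuuuu, thread id, file:line]
GLOG_LINE = re.compile(
    r"^\s*(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[\w.]+:\d+)\] (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    severity: str
    time: str
    thread: int
    source: str
    message: str


def parse_glog(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield one LogEntry per glog-formatted line; skip stack frames and other text."""
    for line in lines:
        m = GLOG_LINE.match(line)
        if m:
            yield LogEntry(
                m["severity"], m["time"], int(m["thread"]), m["source"], m["message"]
            )


def find_first_fatal(
    lines: Iterable[str], context: int = 3
) -> Optional[Tuple[LogEntry, List[LogEntry]]]:
    """Return the first fatal entry plus up to `context` entries logged before it."""
    entries = list(parse_glog(lines))
    for i, entry in enumerate(entries):
        if entry.severity == "F":
            return entry, entries[max(0, i - context):i]
    return None


# A few lines copied from the trace above.
SAMPLE = [
    " I0715 02:58:08.643081    17 master.cpp:9173] Marked agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out",
    " F0715 02:58:08.643210    17 master.cpp:11402] Check failed: 'framework' Must be non NULL",
    " *** Check failure stack trace: ***",
    " I0715 02:58:08.643254    12 hierarchical.cpp:680] Removed agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0",
    "     @     0x7ffbcffd090d  google::LogMessage::Fail()",
]

result = find_first_fatal(SAMPLE)
if result is not None:
    fatal, before = result
    print(f"fatal at {fatal.source} on thread {fatal.thread}: {fatal.message}")
    for entry in before:
        print(f"  preceded by {entry.source}: {entry.message}")
```

To run it against a real log, pass an open file (or the lines piped from journalctl) instead of `SAMPLE`. Comparing the agent ID in the preceding entries across several crashes should show whether the failure is always tied to an agent that was already disconnected during scale-down.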

┆Issue is synchronized with this Jira Task ┆Issue Number: TOIL-397