Mesos sometimes crashes since Ubuntu 18 update


Not sure if anyone else has experienced issues with the recent Ubuntu 18 + Mesos update (since it’s not yet officially released), but I patched it in on our end and Mesos sometimes (~30% of the time for a medium-sized pipeline) crashes at the very end of the pipeline (stack trace below). I’m fairly sure the crash is related to the Mesos + U18 update, as our pipelines have been running reliably since I removed the U18 change. Also, the new Mesos dashboard refreshes more slowly and often loses its connection, which is a bit annoying to work with.

Is anyone else able to reproduce this? Note that the crash only seems to happen on a medium-sized pipeline, i.e. when autoscaling kicks in. My theory is that it has to do with calling shutdown while some workers are being deactivated as the cluster scales down. I’ve been running it on AWS using the CWL runner.

In the meantime, should we revert to the tried-and-tested version until we have done more testing on the newer one? @DailyDreaming @jeffrey856 WDYT?

Stack trace from journalctl (also includes some relevant info logs), which is suspiciously similar to https://issues.apache.org/jira/browse/MESOS-9609 :

 I0715 02:56:40.071446    13 master.cpp:1295] Agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) disconnected
 I0715 02:56:40.071503    13 master.cpp:3333] Disconnecting agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
 I0715 02:56:40.071527    13 master.cpp:3352] Deactivating agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
 I0715 02:56:40.071563    13 master.cpp:1319] Removing framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from disconnected agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) because the framework is not checkpointing
 I0715 02:56:40.071579    13 master.cpp:11006] Removing framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
 I0715 02:56:40.071583    12 hierarchical.cpp:829] Agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 deactivated
 I0715 02:56:40.071619    13 master.cpp:11766] Removing executor 'toil-41' with resources {} of framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 on agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
 I0715 02:58:08.642220    12 master.cpp:9130] Marking agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out
 I0715 02:58:08.642675    11 registrar.cpp:487] Applied 1 operations in 305592ns; attempting to update the registry
 I0715 02:58:08.642922    13 registrar.cpp:544] Successfully updated the registry in 187904ns
 I0715 02:58:08.643081    17 master.cpp:9173] Marked agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out
 F0715 02:58:08.643210    17 master.cpp:11402] Check failed: 'framework' Must be non NULL
 *** Check failure stack trace: ***
 I0715 02:58:08.643254    12 hierarchical.cpp:680] Removed agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0
     @     0x7ffbcffd090d  google::LogMessage::Fail()
     @     0x7ffbcffd2748  google::LogMessage::SendToLog()
     @     0x7ffbcffd04f3  google::LogMessage::Flush()
     @     0x7ffbcffd31d9  google::LogMessageFatal::~LogMessageFatal()
     @     0x7ffbcec65024  google::CheckNotNull<>()
     @     0x7ffbcec32658  mesos::internal::master::Master::__removeSlave()
     @     0x7ffbcec33b13  mesos::internal::master::Master::_markUnreachable()
     @     0x7ffbcec33e55  _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEUlbE_JbEEEEclEv
     @     0x7ffbce93d5d8  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
     @     0x7ffbcff18371  process::ProcessBase::consume()
     @     0x7ffbcff3a97a  process::ProcessManager::resume()
     @     0x7ffbcff3e6a6  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
     @     0x7ffbcc2cd9e0  (unknown)
     @     0x7ffbcbde06db  start_thread
     @     0x7ffbcbb0988f  (unknown)
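
For anyone trying to reproduce this, a small script can pull the fatal check failure and the few log entries just before it out of a longer journalctl dump. This is only a minimal sketch: it assumes the standard glog line prefix seen in the trace above (severity letter, MMDD date, timestamp, thread id, `file:line]`), and the names `parse_line` and `context_before_fatal` are made up for illustration and are not part of Mesos or Toil.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Tuple

# Assumed glog prefix: <I|W|E|F><MMDD> <HH:MM:SS.micros> <thread id> <file:line>] <message>
GLOG_LINE = re.compile(
    r"^\s*(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\s:]+:\d+)\] (?P<msg>.*)$"
)


class Entry(NamedTuple):
    severity: str  # "I", "W", "E" or "F"
    time: str      # e.g. "02:58:08.643210"
    source: str    # e.g. "master.cpp:11402"
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one glog-style line; return None for stack frames and other text."""
    m = GLOG_LINE.match(line)
    if m is None:
        return None
    return Entry(m["sev"], m["time"], m["src"], m["msg"])


def context_before_fatal(
    lines: Iterable[str], window: int = 3
) -> Tuple[Optional[Entry], List[Entry]]:
    """Return the first fatal (F) entry and up to `window` entries logged before it."""
    seen: List[Entry] = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry.severity == "F":
            return entry, seen[-window:]
        seen.append(entry)
    return None, []


# A few lines copied from the trace above, used as sample input.
SAMPLE = """\
 I0715 02:58:08.642922    13 registrar.cpp:544] Successfully updated the registry in 187904ns
 I0715 02:58:08.643081    17 master.cpp:9173] Marked agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out
 F0715 02:58:08.643210    17 master.cpp:11402] Check failed: 'framework' Must be non NULL
 *** Check failure stack trace: ***
"""

if __name__ == "__main__":
    fatal, before = context_before_fatal(SAMPLE.splitlines())
    for entry in before:
        print(f"{entry.time} {entry.source}: {entry.message}")
    if fatal is not None:
        print(f"FATAL at {fatal.source}: {fatal.message}")
```

Feeding it the full journalctl output instead of `SAMPLE` should show whether the agent was always marked unreachable right before the check failure, which would help confirm or rule out the scale-down theory.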

┆Issue is synchronized with this Jira Task ┆Issue Number: TOIL-397

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
arostamianfar commented, Jul 18, 2019

Thanks so much! Now we can look deeper into why the crash happens on the new Mesos version. I’ll update if I find anything!

0 reactions
arostamianfar commented, Sep 10, 2020

FYI, this bug has been fixed in v1.11.0 🎉 So we could give this another try once that’s released.


