Possible deadlock with catkin build

System Info

  • Operating System: Ubuntu 14.04 LTS
  • Python Version: 2.7
  • Version of catkin_tools: 0.4.2
  • ROS Distro: Indigo

Build / Run Issue

I apologize in advance for this very imprecise bug report.

Since the new version 0.4.x, I have noticed that catkin build sometimes hangs. It happens very rarely (I’d say 3 out of 100 builds on our Jenkins) and while processing a random package. If I cancel and reschedule the build, it runs just fine.

I came across the --no-install-lock option and have added it to our build script in the hope that it resolves this issue. Obviously, I won’t be able to tell until enough builds have run.

In case anyone has an idea where to look for this problem, our build script runs the following commands:

catkin config -w ros --init --no-blacklist --install -j4 -p4 \
    --cmake-args \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_C_COMPILER=/usr/lib/ccache/gcc \
    -DCMAKE_C_FLAGS_DEBUG="-fmessage-length=0 -Wall -Wextra -Wno-unused-parameter -Wno-ignored-qualifiers -Wno-error=deprecated-declarations -Wno-error=unused-variable -Wno-error=unused-but-set-variable -O0" \
    -DCMAKE_CXX_COMPILER=/usr/lib/ccache/g++ \
    -DCMAKE_CXX_FLAGS_DEBUG="-fmessage-length=0 -Wall -Wextra -Wno-unused-parameter -Wno-ignored-qualifiers -Wno-invalid-offsetof -Wno-unused-local-typedefs -Wno-error=deprecated-declarations -Wno-error=unused-variable -Wno-error=unused-but-set-variable -O0" \
    -DCMAKE_SHARED_LINKER_FLAGS_DEBUG="-Wl,-z,defs" \
    -DCMAKE_EXE_LINKER_FLAGS_DEBUG="-Wl,-z,defs"
catkin clean -w ros --all --yes
catkin build -w ros --verbose --no-status --no-notify --continue-on-failure

The last command outputs:

--------------------------------------------------------------------------------
Profile:                     default
Extending:             [env] /opt/ros/indigo
Workspace:                   /home/jenkins/ros
--------------------------------------------------------------------------------
Source Space:       [exists] /home/jenkins/ros/src
Log Space:         [missing] /home/jenkins/ros/logs
Build Space:        [exists] /home/jenkins/ros/build
Devel Space:        [exists] /home/jenkins/ros/devel
Install Space:     [missing] /home/jenkins/ros/install
DESTDIR:            [unused] None
--------------------------------------------------------------------------------
Devel Space Layout:          merged
Install Space Layout:        merged
--------------------------------------------------------------------------------
Additional CMake Args:       -DCMAKE_BUILD_TYPE=Debug -DCMAKE_C_COMPILER=/usr/lib/ccache/gcc -DCMAKE_C_FLAGS_DEBUG=-fmessage-length=0 -Wall -Wextra -Wno-unused-parameter -Wno-ignored-qualifiers -Wno-error=deprecated-declarations -Wno-error=unused-variable -Wno-error=unused-but-set-variable -O0 -DCMAKE_CXX_COMPILER=/usr/lib/ccache/g++ -DCMAKE_CXX_FLAGS_DEBUG=-fmessage-length=0 -Wall -Wextra -Wno-unused-parameter -Wno-ignored-qualifiers -Wno-invalid-offsetof -Wno-unused-local-typedefs -Wno-error=deprecated-declarations -Wno-error=unused-variable -Wno-error=unused-but-set-variable -O0 -DCMAKE_SHARED_LINKER_FLAGS_DEBUG=-Wl,-z,defs -DCMAKE_EXE_LINKER_FLAGS_DEBUG=-Wl,-z,defs
Additional Make Args:        -j4
Additional catkin Make Args: None
Internal Make Job Server:    True
Cache Job Environments:      False
--------------------------------------------------------------------------------
Whitelisted Packages:        None
Blacklisted Packages:        None
--------------------------------------------------------------------------------
Workspace configuration appears valid.
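Because the hang shows up in only about 3 of 100 builds, one way to catch it in CI is to rerun the catkin build invocation above under a wall-clock timeout and flag any run that exceeds it. The following is a minimal sketch, assuming a Python 3.5+ driver script is available next to the Python 2.7 used by catkin here; the helper name run_with_timeout, the log path, and the timeout value are hypothetical choices, not part of catkin_tools.

```python
import os
import subprocess
from typing import Sequence

# The build command from the report; assumes catkin is on PATH.
BUILD_CMD = ["catkin", "build", "-w", "ros", "--verbose", "--no-status",
             "--no-notify", "--continue-on-failure"]


def run_with_timeout(cmd: Sequence[str], timeout_s: float,
                     log_path: str = os.devnull) -> str:
    """Run cmd once and classify the outcome as 'ok', 'failed' or 'hung'.

    Output goes straight to a file rather than a pipe, so this driver never
    becomes a reader that stops draining the child's output.
    """
    with open(log_path, "wb") as log:
        try:
            result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT,
                                    timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child before re-raising;
            # grandchildren such as make jobs may need separate cleanup.
            return "hung"
    return "ok" if result.returncode == 0 else "failed"
```

A Jenkins step could, for example, call run_with_timeout(BUILD_CMD, 3600, "catkin_build.log") and archive the log whenever the result is "hung"; the 3600-second limit is only a placeholder that should comfortably exceed a normal build.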

Issue Analytics

  • State: open
  • Created 7 years ago
  • Comments: 21 (4 by maintainers)

Top GitHub Comments

2 reactions
HannesSommer commented, Feb 28, 2018

@roehling, thanks for the update. We’ve found a working theory and a dirty workaround that works robustly for us:

Theory:

The internal catkin make job server runs out of job tokens (we don’t know why or how). File descriptors 6 and 7 (see above) are relevant because --jobserver-fds=6,7 is passed on to the child make processes. Since catkin also uses these tokens internally, it blocks indefinitely while trying to take one when none are left. The child processes then block on output as well, because catkin also stops reading from the pipes to the children, and their pipe buffers eventually fill up.
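To make the theory concrete, here is a minimal sketch of the token protocol it describes: a pipe pre-filled with one byte per job slot, where taking a slot means reading a byte and finishing means writing one back. It assumes GNU make's classic pipe-based jobserver semantics and uses ‘+’ only because the workaround below observes catkin doing so; the helper names are hypothetical and this is not catkin_tools' actual job server code. Reads are made non-blocking so the example reports "would block" instead of hanging, which is the state the theory describes once no tokens remain.

```python
import os
from typing import Tuple


def make_jobserver(n_tokens: int) -> Tuple[int, int]:
    """Create a (read_fd, write_fd) pipe pre-filled with n_tokens job tokens."""
    r, w = os.pipe()
    os.write(w, b"+" * n_tokens)
    return r, w


def try_acquire(r: int) -> bool:
    """Try to take one token without blocking.

    A real jobserver client does a blocking read here, so a False result
    corresponds to a client that would wait forever on an empty pipe.
    """
    os.set_blocking(r, False)
    try:
        return len(os.read(r, 1)) == 1  # any byte counts as a token
    except BlockingIOError:
        return False  # pipe empty: no tokens left
    finally:
        os.set_blocking(r, True)


def release(w: int) -> None:
    """Return one token to the pool."""
    os.write(w, b"+")
```

If a holder exits without writing its token back, the pool shrinks permanently; once it reaches zero every blocking reader waits indefinitely. Writing fresh bytes into the write end, as the workaround does through /proc/$catkin_pid/fd/7, refills the pool.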

Workaround:

A script runs periodically that:

  1. Detects the issue in catkin processes (very crudely: the combined CPU consumption of the process and its children stays at zero for a while).
  2. For each affected catkin process ($catkin_pid), runs sudo -u jenkins bash -c "echo -n +++ > /proc/$catkin_pid/fd/7" to inject three fresh job tokens (catkin seems to use only ‘+’; injecting just one would typically let it get stuck again later).
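The two steps above could be sketched in Python roughly as follows, assuming Linux procfs and that fd 7 is the jobserver write end, as in the comment. Summing utime and stime over the live process tree is one crude stand-in for "CPU consumption of it + its children"; the function names, the default interval, and the tree walk are hypothetical, and like the original script it has to run as the same user as catkin (hence sudo -u jenkins).

```python
import os
import time
from typing import Dict, List


def _stat_fields(pid: int) -> List[str]:
    """Fields of /proc/<pid>/stat after the comm field (index 0 = state)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # comm is parenthesised and may contain spaces, so split after the last ')'.
    return data[data.rindex(")") + 1:].split()


def _children_map() -> Dict[int, List[int]]:
    """Map each ppid to its child pids by scanning /proc once."""
    children: Dict[int, List[int]] = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            ppid = int(_stat_fields(int(entry))[1])
        except (OSError, ValueError, IndexError):
            continue  # process vanished while scanning
        children.setdefault(ppid, []).append(int(entry))
    return children


def tree_cpu_ticks(pid: int) -> int:
    """utime + stime (clock ticks) of pid and all of its live descendants."""
    children = _children_map()
    total, stack = 0, [pid]
    while stack:
        p = stack.pop()
        try:
            fields = _stat_fields(p)
        except OSError:
            continue
        total += int(fields[11]) + int(fields[12])  # stat fields 14 and 15
        stack.extend(children.get(p, []))
    return total


def looks_stuck(pid: int, interval_s: float = 60.0) -> bool:
    """Heuristic: no CPU time consumed by the whole tree during interval_s."""
    before = tree_cpu_ticks(pid)
    time.sleep(interval_s)
    return tree_cpu_ticks(pid) == before


def inject_tokens(pid: int, fd: int = 7, count: int = 3) -> None:
    """Write count '+' tokens into the pipe that pid holds open as fd."""
    out = os.open(f"/proc/{pid}/fd/{fd}", os.O_WRONLY)
    try:
        os.write(out, b"+" * count)
    finally:
        os.close(out)
```

A periodic job would then call inject_tokens(pid) for every catkin pid for which looks_stuck(pid) returns True. As the comment notes, this only papers over lost tokens; it does not explain where they go.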

Given this theory, I doubt that this is actually a catkin bug. Interestingly, it only happens on our Jenkins slaves running in Docker containers (all of them, both Trusty and Xenial), and only for very big jobs (running > 10 min and producing > 10 MB of verbose output).

1 reaction
roehling commented, Nov 18, 2017

Unfortunately, no. The closest thing I found is a warning in the documentation of subprocess.Popen.wait that describes similar symptoms. AFAICT it does not apply here, but maybe I overlooked something.
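For reference, the failure mode that warning describes can be reproduced in a few lines: a parent that calls Popen.wait() on a child whose stdout is a pipe, without reading it, stalls once the OS pipe buffer fills, which matches the "children blocked on output" half of the theory above. This is a generic Python 3 illustration, not catkin_tools code; the child command and the 1 MiB output size are arbitrary choices.

```python
import subprocess
import sys

# Child writes 1 MiB, far more than a typical 64 KiB Linux pipe buffer.
CHILD = [sys.executable, "-c", "import sys; sys.stdout.write('x' * (1 << 20))"]


def wait_without_reading(timeout_s: float) -> str:
    """Call wait() while nobody drains the child's stdout pipe."""
    proc = subprocess.Popen(CHILD, stdout=subprocess.PIPE)
    try:
        proc.wait(timeout=timeout_s)
        return "exited"
    except subprocess.TimeoutExpired:
        return "blocked"  # the child is stuck in write() on a full pipe
    finally:
        proc.kill()         # no-op if the child already exited
        proc.communicate()  # drain, reap and close the pipe


def communicate_and_wait(timeout_s: float) -> int:
    """communicate() reads the pipe while waiting, so the child can finish."""
    proc = subprocess.Popen(CHILD, stdout=subprocess.PIPE)
    out, _ = proc.communicate(timeout=timeout_s)
    return len(out)
```

The first function times out because the child can never finish writing; the second returns the full 1 MiB. The same holds for any parent that stops reading its children's pipes, whatever the reason it stopped.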
