question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Ros2 node list stops working after unrelated node crashes

See original GitHub issue

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • from source(master)
  • Version or commit hash:
    • 44a654c9077252eed50b2a6d9034ba47b634bf98(master)
  • DDS implementation:
    • rmw_fastrtps_cpp
  • Client library (if applicable):
    • N/A

Steps to reproduce issue

Unfortunately I don’t have detailed steps to reproduce, which makes it so hard to debug. The error might not happen for days and then turn up again constantly. I have various nodes which do things and somehow one of them must brick the whole ros system. I’m not sure how this is even possible, but after that happens only a complete reboot of the computer helps. If I try to use commands like ros2 node list or ros2 service list I get following stacktrace

Traceback (most recent call last):
  File "/home/username/ros2_rolling/install/ros2cli/bin/ros2", line 11, in <module>
    load_entry_point('ros2cli', 'console_scripts', 'ros2')()
  File "/home/username/ros2_rolling/build/ros2cli/ros2cli/cli.py", line 89, in main
    rc = extension.main(parser=parser, args=args)
  File "/home/username/ros2_rolling/build/ros2node/ros2node/command/node.py", line 37, in main
    return extension.main(args=args)
  File "/home/username/ros2_rolling/build/ros2node/ros2node/verb/list.py", line 38, in main
    node_names = get_node_names(node=node, include_hidden_nodes=args.all)
  File "/home/username/ros2_rolling/build/ros2node/ros2node/api/__init__.py", line 60, in get_node_names
    node_names_and_namespaces = node.get_node_names_and_namespaces()
  File "/usr/lib/python3.8/xmlrpc/client.py", line 1109, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python3.8/xmlrpc/client.py", line 1450, in __request
    response = self.__transport.request(
  File "/usr/lib/python3.8/xmlrpc/client.py", line 1153, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python3.8/xmlrpc/client.py", line 1169, in single_request
    return self.parse_response(resp)
  File "/usr/lib/python3.8/xmlrpc/client.py", line 1341, in parse_response
    return u.close()
  File "/usr/lib/python3.8/xmlrpc/client.py", line 655, in close
    raise Fault(**self._stack[0])
xmlrpc.client.Fault: <Fault 1: "<class 'rclpy._rclpy_pybind11.InvalidHandle'>:cannot use Destroyable because destruction was requested">

I have a node which seems to be the cause of this issue, but I’m not totally sure. The node uses zeromq to interface with another system, and receive parameters and a state and it configures another node with those parameters and the state and monitors additionally if the node is online using ros2 node list The node has following structure, some details like the zeromq stuff are left out, but I hope I left enough detail in to be useful:

from rclpy.node import Node

class Server(Node):
    def __init__(self):
        self.name = "my_name"
        super().__init__(self.name + "_client")

    def __enter__(self):
        self.server = create_zermoq_server(callback=self.request_received)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.server.terminate()

    def request_received(self, msg):
        params, state = parse_msg(msg)
        # I have another lifecycle node which this server controls which is named self.name
        # thus the name of this node is self.name + "_client"
        load_parameter_dict(node=self, node_name=self.name, parameter_dict=params)
        call_change_states(node=self, transitions=state)


def monitor(pub, msg):
    while coordinator.running:
        nodes = subprocess.check_output("ros2 node list", shell=True).decode().split()
        if "/name" in nodes:
            _logger.info("Name is online")
            msg.field = "online"
            pub.publish(msg)
        else:
            _logger.info("Name is offline")
            msg.field = "offline"
            pub.publish(msg)


def stop(signum, frame):
    rclpy.shutdown()
    _logger.info("Signal received")
    coordinator.running = False


def main():
    signal.signal(signal.SIGTERM, stop)
    rclpy.init()
    msg = create_zeromq_msg()
    pub = create_zeromq_pub()
    pub.start()
    threading.Thread(target=monitor, args=(pub, msg)).start()
    with Server() as server:
        _logger.info("Started server")
        while coordinator.running:
            _logger.debug("Running")
            time.sleep(2)
    _logger.info("Closed server")


if __name__ == '__main__':
    main()

Expected behavior

I would expect even if a node somehow fails, to not take down the whole system. The fact that diagnostics like ros2 node list don’t work anymore makes this also hard to debug.

Actual behavior

I would expect ros2 node list still to work.

Additional information

I tried following MRE, but it doesn’t cause the issue. I guess as intended, because the issues says it’s fixed.

Issue Analytics

  • State:open
  • Created 2 years ago
  • Comments:7 (1 by maintainers)

github_iconTop GitHub Comments

2reactions
Kaju-Bubanjacommented, Nov 17, 2021

Interesting information, I also had the idea of ros2 deamon, nice to see that it’s already implemented. I will try the --no-daemon option next time it happens and try to restart the daemon and see if that helps

0reactions
Kaju-Bubanjacommented, Nov 30, 2021

A new kind of error started appearing:

2021-11-30 09:27:34.850 ERROR __main__:server.py:111 failed to create client: rcl node's context is invalid, at /home/username/ros2_rolling/src/ros2/rcl/rcl/src/rcl/node.c:425
Traceback (most recent call last):
  File "server.py", line 99, in configure
    max_tracking_distance = self.set_location_independent_parameters()
  File "server.py", line 80, in set_location_independent_parameters
    load_parameter_dict(node=self, node_name=self.name, parameter_dict=config_dict)
  File "/home/username/ros2_rolling/build/ros2param/ros2param/api/__init__.py", line 120, in load_parameter_dict
    response = call_set_parameters(
  File "/home/username/ros2_rolling/build/ros2param/ros2param/api/__init__.py", line 206, in call_set_parameters
    client = node.create_client(SetParameters, f'{node_name}/set_parameters')
  File "/home/username/ros2_rolling/install/rclpy/lib/python3.8/site-packages/rclpy/node.py", line 1408, in create_client
    client_impl = _rclpy.Client(
rclpy._rclpy_pybind11.RCLError: failed to create client: rcl node's context is invalid, at /home/username/ros2_rolling/src/ros2/rcl/rcl/src/rcl/node.c:425

with this traceback popping up shortly after in the log:

Exception ignored in: <function tail_f_producer.__del__ at 0x7fcea196f550>
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/supervisor/http.py", line 649, in __del__
    self._close()
  File "/usr/lib/python3/dist-packages/supervisor/http.py", line 675, in _close
    self.file.close()
OSError: [Errno 9] Bad file descriptor

While I was writting this bug report my computer just completely froze up, which seems very odd. No mouse movement and no keyboard shortcuts were possible, I couldn’t even drop to the terminal. Something in ros’s lower levels seems to have caused a complete system halt.

I would really like to get behind what is causing these sporadic random crashes of this node, but I’m not sure where to start.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Restarting nodes causes other nodes to "disappear" #509
Kill the slam_launch.py on Terminal 3 and restart it, repeat this several times. Eventually /clock will disappear. ros2 node list will not list...
Read more >
ros2 node list crash xmlrpc.client.Fault - ROS Answers
I have a fresh installed ROS 2 humble, sourced the setup when i run ros2 node list I get the following error: Traceback...
Read more >
DYNAMIC CONNECTION HANDLING FOR SCALABLE ...
Another concern is the ROS2 communications network, where the possible amount of data to communicate in multi-agent-robot systems is unknown.
Read more >
ROS2 Tutorials: How to Launch a ROS2 Node | The Construct
Welcome to the ROS2 tutorial series. In this first video, we're going to show How to launch a node in ROS2. Commands to...
Read more >
Debugging software crashes - EventHelix
Finally we will look at techniques to simplify crash debugging. The following software problems lead to crashes: Invalid Array Indexing; Un-initialized Pointer ...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found