Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Publishing large data is 30x-100x slower than for rclcpp

See original GitHub issue

TL;DR: Publishing a 10MB Pointcloud takes only 2.8 ms in rclypp but 92 ms in rclpy.

We observed significant latency problems when publishing large data like images or pointclouds via rclpy. Investigating the problem, we found that publishing large data of any kind is 30-100x slower in Python than in C++.

We found the Python Publisher.publish(msg) call takes especially long compared to rclcpp. Publishing a 640x480 Pointcloud of size ~10MB with 15 FPS is not possible with rclpy even on our quite powerful machine (Ryzen 3900X), let alone our robot (Jetson Xavier NX). The time taken for both seems to be about linear with the data size.

Reproduction: https://github.com/karl-schulz/ros2_latency

Bug report

Operating System:
- Ubuntu 20.04 and Ubuntu 18.04
Installation type:
- Both binaries and built from source
Version or commit hash:
- (like in current binaries / rclpy foxy branch)
DDS implementation:
- Both Fast-RTPS and CycloneDDS

Steps to reproduce issue

I created a package for easy reproduction: https://github.com/karl-schulz/ros2_latency

Basically, it consists of:

A node source publishing a random pointcloud of X MB size, X is controllable over a parameter
Two repeaters for C++ and Python, respectively, which
- Copy the pointcloud
- Update the timestamp
- Re-publish it on different topics
A node measure which subscribes to both topics and compares the timestamps with the current ROS time

Expected behavior

rclpy and rclcpp being close in performance, with slightly higher latency for rclpy
Both rclpy and rclcpp being able topublish a Pointcloud of a 640x480 RGB-D image with at least 30 FPS on a fast machine

Actual behavior

Even on a very fast desktop machine, rclpy takes ~100ms for publishing 10MB of pointcloud data
For comparison, rclcpp take only between 1/30 and 1/100 of the time
Using rclpy with high FPS is not possible

Additional information

Tested on:

A desktop PC with a Ryzen 3800X and Ubuntu 20
Another machine with a Rytzen 3700X and Ubuntu 18 with ROS2 Foxy built from source
A NVIDIA Jetson Xavier NX L4T (Ubuntu 18) and ROS2 Foxy built from source

Tried Solutions

We were pretty desperate and already tried:

Switching the RWM from FastRTPS to CycloneDDS
(however the good performance on rclcpp indicates that it is not a middleware problem)
Tuning CycloneDDS with a custom config XML and a higher system package size, like described here:
https://index.ros.org/doc/ros2/Troubleshooting/DDS-tuning/ (but the same rationale applies)
Using PYTHONOPTIMIZE=0
Different QOS settings: “best effort”/“reliable”, depth=1/depth=10
Publishing different message types, e.g. Image messages instead of Pointloud2

This problem is actually really critical for our application. In the worst case: Is this expected behavior? This would make the ROS2 Python API useless for high-framerate or low-latency image pipelines.

Help would be really appreciated!

Issue Analytics

State:
Created 2 years ago
Reactions:4
Comments:12 (8 by maintainers)

Top GitHub Comments

1reaction

sloretzcommented, Apr 9, 2021

Using callgrind after building with RelWithDebInfo seems to say the problem is in sensor_msgs__msg__point_cloud2__convert_from_py where it calls PyLong_AsUnsignedLong a bunch of times, which is almost certainly where it’s converting from the data field. Maybe there’s a faster way to do this if the Python type is already an array?

@hidmic you called the location 😃

Template for generated code https://github.com/ros2/rosidl_python/blob/5b9fe9cad6876e877cbbcf17950c47a9e753c6e6/rosidl_generator_py/resource/_msg_support.c.em#L158

Generated code in file build/sensor_msgs/rosidl_generator_py/sensor_msgs/msg/_point_cloud2_s.c

Generated code for above function

ROSIDL_GENERATOR_C_EXPORT
bool sensor_msgs__msg__point_cloud2__convert_from_py(PyObject * _pymsg, void * _ros_message)
{
  // check that the passed message is of the expected Python class
  {
    char full_classname_dest[42];
    {
      char * class_name = NULL;
      char * module_name = NULL;
      {
        PyObject * class_attr = PyObject_GetAttrString(_pymsg, "__class__");
        if (class_attr) {
          PyObject * name_attr = PyObject_GetAttrString(class_attr, "__name__");
          if (name_attr) {
            class_name = (char *)PyUnicode_1BYTE_DATA(name_attr);
            Py_DECREF(name_attr);
          }
          PyObject * module_attr = PyObject_GetAttrString(class_attr, "__module__");
          if (module_attr) {
            module_name = (char *)PyUnicode_1BYTE_DATA(module_attr);
            Py_DECREF(module_attr);
          }
          Py_DECREF(class_attr);
        }
      }
      if (!class_name || !module_name) {
        return false;
      }
      snprintf(full_classname_dest, sizeof(full_classname_dest), "%s.%s", module_name, class_name);
    }
    assert(strncmp("sensor_msgs.msg._point_cloud2.PointCloud2", full_classname_dest, 41) == 0);
  }
  sensor_msgs__msg__PointCloud2 * ros_message = _ros_message;
  {  // header
    PyObject * field = PyObject_GetAttrString(_pymsg, "header");
    if (!field) {
      return false;
    }
    if (!std_msgs__msg__header__convert_from_py(field, &ros_message->header)) {
      Py_DECREF(field);
      return false;
    }
    Py_DECREF(field);
  }
  {  // height
    PyObject * field = PyObject_GetAttrString(_pymsg, "height");
    if (!field) {
      return false;
    }
    assert(PyLong_Check(field));
    ros_message->height = PyLong_AsUnsignedLong(field);
    Py_DECREF(field);
  }
  {  // width
    PyObject * field = PyObject_GetAttrString(_pymsg, "width");
    if (!field) {
      return false;
    }
    assert(PyLong_Check(field));
    ros_message->width = PyLong_AsUnsignedLong(field);
    Py_DECREF(field);
  }
  {  // fields
    PyObject * field = PyObject_GetAttrString(_pymsg, "fields");
    if (!field) {
      return false;
    }
    PyObject * seq_field = PySequence_Fast(field, "expected a sequence in 'fields'");
    if (!seq_field) {
      Py_DECREF(field);
      return false;
    }
    Py_ssize_t size = PySequence_Size(field);
    if (-1 == size) {
      Py_DECREF(seq_field);
      Py_DECREF(field);
      return false;
    }
    if (!sensor_msgs__msg__PointField__Sequence__init(&(ros_message->fields), size)) {
      PyErr_SetString(PyExc_RuntimeError, "unable to create sensor_msgs__msg__PointField__Sequence ros_message");
      Py_DECREF(seq_field);
      Py_DECREF(field);
      return false;
    }
    sensor_msgs__msg__PointField * dest = ros_message->fields.data;
    for (Py_ssize_t i = 0; i < size; ++i) {
      if (!sensor_msgs__msg__point_field__convert_from_py(PySequence_Fast_GET_ITEM(seq_field, i), &dest[i])) {
        Py_DECREF(seq_field);
        Py_DECREF(field);
        return false;
      }
    }
    Py_DECREF(seq_field);
    Py_DECREF(field);
  }
  {  // is_bigendian
    PyObject * field = PyObject_GetAttrString(_pymsg, "is_bigendian");
    if (!field) {
      return false;
    }
    assert(PyBool_Check(field));
    ros_message->is_bigendian = (Py_True == field);
    Py_DECREF(field);
  }
  {  // point_step
    PyObject * field = PyObject_GetAttrString(_pymsg, "point_step");
    if (!field) {
      return false;
    }
    assert(PyLong_Check(field));
    ros_message->point_step = PyLong_AsUnsignedLong(field);
    Py_DECREF(field);
  }
  {  // row_step
    PyObject * field = PyObject_GetAttrString(_pymsg, "row_step");
    if (!field) {
      return false;
    }
    assert(PyLong_Check(field));
    ros_message->row_step = PyLong_AsUnsignedLong(field);
    Py_DECREF(field);
  }
  {  // data
    PyObject * field = PyObject_GetAttrString(_pymsg, "data");
    if (!field) {
      return false;
    }
    PyObject * seq_field = PySequence_Fast(field, "expected a sequence in 'data'");
    if (!seq_field) {
      Py_DECREF(field);
      return false;
    }
    Py_ssize_t size = PySequence_Size(field);
    if (-1 == size) {
      Py_DECREF(seq_field);
      Py_DECREF(field);
      return false;
    }
    if (!rosidl_runtime_c__uint8__Sequence__init(&(ros_message->data), size)) {
      PyErr_SetString(PyExc_RuntimeError, "unable to create uint8__Sequence ros_message");
      Py_DECREF(seq_field);
      Py_DECREF(field);
      return false;
    }
    uint8_t * dest = ros_message->data.data;
    for (Py_ssize_t i = 0; i < size; ++i) {
      PyObject * item = PySequence_Fast_GET_ITEM(seq_field, i);
      if (!item) {
        Py_DECREF(seq_field);
        Py_DECREF(field);
        return false;
      }
      assert(PyLong_Check(item));
      uint8_t tmp = (uint8_t)PyLong_AsUnsignedLong(item);

      memcpy(&dest[i], &tmp, sizeof(uint8_t));
    }
    Py_DECREF(seq_field);
    Py_DECREF(field);
  }
  {  // is_dense
    PyObject * field = PyObject_GetAttrString(_pymsg, "is_dense");
    if (!field) {
      return false;
    }
    assert(PyBool_Check(field));
    ros_message->is_dense = (Py_True == field);
    Py_DECREF(field);
  }

  return true;
}

0reactions

sloretzcommented, Jun 23, 2021

I think this can be closed now that ros2/rosidl_python#129 has been merged!

Top Results From Across the Web

ROS2 Performance: rclpy is 30x-100x slower than rclcpp

Investigating the problem, we found that publishing large data of any kind is 30-100x slower in Python than in C++.

rclcpp: CPP Client Library Overview - ROS 2 Docs

These files are used to generate C++ code and data structures which are used for publishing and when receiving from a subscription.

Low frame rate when publishing image messages in ros2

I can think of a couple reasons why the reported frequency from ros2 topic hz is reporting a lower frequency than expected.

Create a Publisher and Subscriber in C++ | ROS 2 Foxy Fitzroy

Publisher nodes publish data to topic(s), subscriber nodes receive data from topic(s), and a publishing subscriber node can do both receive data ......

Klepsydra ROS 2 Executor: A ring-buffer to rule them all

As the number of publishers increases and the data rate gets faster, the difference in CPU consumption becomes considerably large. At a certain ......