Publishing large data is 30x-100x slower than for rclcpp
See original GitHub issueTL;DR: Publishing a 10MB Pointcloud takes only 2.8 ms in rclypp but 92 ms in rclpy.
We observed significant latency problems when publishing large data like images or pointclouds via rclpy. Investigating the problem, we found that publishing large data of any kind is 30-100x slower in Python than in C++.
We found the Python Publisher.publish(msg)
call takes especially long compared to rclcpp. Publishing a 640x480 Pointcloud of size ~10MB with 15 FPS is not possible with rclpy even on our quite powerful machine (Ryzen 3900X), let alone our robot (Jetson Xavier NX). The time taken for both seems to be about linear with the data size.
Reproduction: https://github.com/karl-schulz/ros2_latency
Bug report
- Operating System:
- Ubuntu 20.04 and Ubuntu 18.04
- Installation type:
- Both binaries and built from source
- Version or commit hash:
- (like in current binaries / rclpy foxy branch)
- DDS implementation:
- Both Fast-RTPS and CycloneDDS
Steps to reproduce issue
I created a package for easy reproduction: https://github.com/karl-schulz/ros2_latency
Basically, it consists of:
- A node
source
publishing a random pointcloud of X MB size, X is controllable over a parameter - Two
repeater
s for C++ and Python, respectively, which- Copy the pointcloud
- Update the timestamp
- Re-publish it on different topics
- A node
measure
which subscribes to both topics and compares the timestamps with the current ROS time
Expected behavior
- rclpy and rclcpp being close in performance, with slightly higher latency for rclpy
- Both rclpy and rclcpp being able topublish a Pointcloud of a 640x480 RGB-D image with at least 30 FPS on a fast machine
Actual behavior
- Even on a very fast desktop machine, rclpy takes ~100ms for publishing 10MB of pointcloud data
- For comparison, rclcpp take only between 1/30 and 1/100 of the time
- Using rclpy with high FPS is not possible
Additional information
Tested on:
- A desktop PC with a Ryzen 3800X and Ubuntu 20
- Another machine with a Rytzen 3700X and Ubuntu 18 with ROS2 Foxy built from source
- A NVIDIA Jetson Xavier NX L4T (Ubuntu 18) and ROS2 Foxy built from source
Tried Solutions
We were pretty desperate and already tried:
- Switching the RWM from FastRTPS to CycloneDDS
(however the good performance on rclcpp indicates that it is not a middleware problem) - Tuning CycloneDDS with a custom config XML and a higher system package size, like described here:
https://index.ros.org/doc/ros2/Troubleshooting/DDS-tuning/ (but the same rationale applies) - Using PYTHONOPTIMIZE=0
- Different QOS settings: “best effort”/“reliable”, depth=1/depth=10
- Publishing different message types, e.g.
Image
messages instead ofPointloud2
This problem is actually really critical for our application. In the worst case: Is this expected behavior? This would make the ROS2 Python API useless for high-framerate or low-latency image pipelines.
Help would be really appreciated!
Issue Analytics
- State:
- Created 2 years ago
- Reactions:4
- Comments:12 (8 by maintainers)
Using
callgrind
after building withRelWithDebInfo
seems to say the problem is insensor_msgs__msg__point_cloud2__convert_from_py
where it callsPyLong_AsUnsignedLong
a bunch of times, which is almost certainly where it’s converting from thedata
field. Maybe there’s a faster way to do this if the Python type is already an array?@hidmic you called the location 😃
Template for generated code https://github.com/ros2/rosidl_python/blob/5b9fe9cad6876e877cbbcf17950c47a9e753c6e6/rosidl_generator_py/resource/_msg_support.c.em#L158
Generated code in file
build/sensor_msgs/rosidl_generator_py/sensor_msgs/msg/_point_cloud2_s.c
Generated code for above function
I think this can be closed now that ros2/rosidl_python#129 has been merged!