[rllib] significant performance difference between Torch and TF policy
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Ray installed from (source or binary): binary
- Ray version: 0.7.4
- Python version: 3.6.4
- Exact command to reproduce: test provided
Describe the problem
Running the same A3C policy evaluation gives dramatically different performance with the Torch vs. the TF policy. Below is a test script that measures the iteration time to generate a batch of samples on 1 and 4 remote workers (on the same machine) for both Torch and TF. Results are summarized here:
- 1 worker, TF: 0.44 sec/batch
- 4 workers, TF: 0.55 sec/batch
- 1 worker, Torch: 0.62 sec/batch
- 4 workers, Torch: 18.04 sec/batch
I suspect that the issue is due to thread contention in Torch: when I run the 1-worker example with TF, I see only one thread executing, but with Torch it saturates all the cores on my machine. I’ve tried torch.set_num_threads(1), but this doesn’t seem to get applied to remote workers, and I’m not sure how to do that.
The test was performed on an AWS machine with 8 vCPUs (4 cores).
Source code / logs
import time
import ray
import ray.rllib.evaluation
import gym
from ray.rllib.models.catalog import MODEL_DEFAULTS
from ray.rllib.agents.ppo.appo import DEFAULT_CONFIG
from ray.rllib.agents.a3c.a3c_torch_policy import A3CTorchPolicy
from ray.rllib.agents.a3c.a3c_tf_policy import A3CTFPolicy


def build_worker(policy, remote=True):
    model_config = MODEL_DEFAULTS
    model_config["dim"] = 42
    model_config["framestack"] = True
    policy_config = DEFAULT_CONFIG
    if remote:
        constructor = ray.rllib.evaluation.RolloutWorker.as_remote().remote
    else:
        constructor = ray.rllib.evaluation.RolloutWorker
    worker = constructor(
        env_creator=lambda _: gym.make("PongNoFrameskip-v4"),
        policy=policy,
        model_config=model_config,
        policy_config=policy_config,
        batch_steps=128,
        num_envs=1)
    return worker


def sample(workers, remote):
    if remote:
        samples = [worker.sample.remote() for worker in workers]
        samples = [ray.get(sample) for sample in samples]
    else:
        samples = [worker.sample() for worker in workers]
    return samples


def run_exp(num_workers, policy, remote, n):
    workers = [build_worker(policy=policy, remote=remote) for _ in range(num_workers)]
    # Warm up
    for i in range(5):
        sample(workers, remote)
    # Run test
    start = time.time()
    for i in range(n):
        sample(workers, remote)
    iter_time = (time.time() - start) / n
    print(f"Results: {num_workers}, {policy}, {remote}, {n}: {iter_time}")


ray.init(ignore_reinit_error=True)
run_exp(1, A3CTFPolicy, remote=True, n=10)
run_exp(1, A3CTorchPolicy, remote=True, n=10)
run_exp(4, A3CTFPolicy, remote=True, n=10)
run_exp(4, A3CTorchPolicy, remote=True, n=10)
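One workaround worth trying (an assumption, not something verified in this issue) is to export OMP_NUM_THREADS=1 in the driver before ray.init(), so that the locally spawned worker processes inherit it and PyTorch keeps its intra-op pool at a single thread:

import os

# Assumption: worker processes started by a local ray.init() inherit the
# driver's environment, and PyTorch sizes its intra-op thread pool from
# OMP_NUM_THREADS when it is set before torch is initialized in that process.
os.environ["OMP_NUM_THREADS"] = "1"

import ray
ray.init(ignore_reinit_error=True)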
Top GitHub Comments
I could reproduce a related issue with ray[rllib, tune]==1.12.0, running e.g. 2 remote workers plus the local worker with RLlib’s PPO, a custom torch policy, and launching via ray.tune(). Essentially, the parallel RolloutWorkers each tried to grab half of the available cores (i.e. around 9XX% CPU usage per remote worker in the Ray dashboard, with 20 virtual cores available). Increasing the number of workers worsened the issue and led to a roughly inversely proportional decrease of the sample timer; this caused a slowdown to ~sample_throughput: 20, compared to ~sample_throughput: 4000 after the fix. In short, more CPU resources were used than specified in ray.init(num_cpus=3).
Resolving this issue:
- torch.set_num_threads(1) in the main.py script, as recommended in https://github.com/ray-project/ray/blob/886cc4d6747f3ae56b4b0d77534a375494b6457b/doc/source/ray-core/troubleshooting.rst and in various examples.
- torch.set_num_threads(1) in the RolloutWorker, e.g. directly in the model, i.e. within nn.Module.__init__(self), to limit the thread count in both the RolloutWorkers and the local_worker (see the sketch at the end of this comment).
Notably, I could not reproduce the issue on my local device, nor on a cluster managed by LSF, but only on an HTCondor-managed cluster.
Proposal, either:
Maybe I have overlooked some part of the docs, but I am not sure whether there is a configurable fix for this, other than coding it into the worker somehow, like @zplizzi
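A minimal sketch of the second workaround above, calling torch.set_num_threads(1) from nn.Module.__init__ (placeholder names and sizes; a real RLlib custom model would subclass TorchModelV2 and be registered via ModelCatalog):

import torch
import torch.nn as nn

class SingleThreadedModel(nn.Module):
    """Sketch only: pins the intra-op thread count in whichever process
    builds the model (RolloutWorker or local worker)."""

    def __init__(self, obs_size=4, num_outputs=2):
        nn.Module.__init__(self)
        # Runs inside the process that constructs the model, so the limit
        # is applied in every RolloutWorker as well as the local worker.
        torch.set_num_threads(1)
        self.net = nn.Linear(obs_size, num_outputs)

    def forward(self, x):
        return self.net(x)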
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you’d still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray’s public slack channel.
Thanks again for opening the issue!