[rllib] significant performance difference between Torch and TF policy
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Ray installed from (source or binary): binary
- Ray version: 0.7.4
- Python version: 3.6.4
- Exact command to reproduce: test provided
Describe the problem
Running the same A3C policy evaluation gives dramatically different performance with the Torch vs. the TF policy. Below is a test script that measures the iteration time to generate a batch of samples on 1 and 4 remote workers (on the same machine) for both Torch and TF. Results are summarized here:
- 1 worker, TF: 0.44 sec/batch
- 4 workers, TF: 0.55 sec/batch
- 1 worker, Torch: 0.62 sec/batch
- 4 workers, Torch: 18.04 sec/batch
I suspect that the issue is due to thread contention in Torch: when I run the 1-worker example with TF, I see only one thread executing, but with Torch it saturates all the cores on my machine. I’ve tried torch.set_num_threads(1), but this doesn’t seem to get applied to remote workers, and I’m not sure how to do that.
The test was performed on an AWS machine with 8 vCPUs (4 cores).
Source code / logs
import time
import ray
import ray.rllib.evaluation
import gym
from ray.rllib.models.catalog import MODEL_DEFAULTS
from ray.rllib.agents.ppo.appo import DEFAULT_CONFIG
from ray.rllib.agents.a3c.a3c_torch_policy import A3CTorchPolicy
from ray.rllib.agents.a3c.a3c_tf_policy import A3CTFPolicy


def build_worker(policy, remote=True):
    model_config = MODEL_DEFAULTS
    model_config["dim"] = 42
    model_config["framestack"] = True
    policy_config = DEFAULT_CONFIG
    if remote:
        constructor = ray.rllib.evaluation.RolloutWorker.as_remote().remote
    else:
        constructor = ray.rllib.evaluation.RolloutWorker
    worker = constructor(
        env_creator=lambda _: gym.make("PongNoFrameskip-v4"),
        policy=policy,
        model_config=model_config,
        policy_config=policy_config,
        batch_steps=128,
        num_envs=1)
    return worker


def sample(workers, remote):
    if remote:
        samples = [worker.sample.remote() for worker in workers]
        samples = [ray.get(sample) for sample in samples]
    else:
        samples = [worker.sample() for worker in workers]
    return samples


def run_exp(num_workers, policy, remote, n):
    workers = [build_worker(policy=policy, remote=remote) for _ in range(num_workers)]
    # Warm up
    for i in range(5):
        sample(workers, remote)
    # Run test
    start = time.time()
    for i in range(n):
        sample(workers, remote)
    iter_time = (time.time() - start) / n
    print(f"Results: {num_workers}, {policy}, {remote}, {n}: {iter_time}")


ray.init(ignore_reinit_error=True)
run_exp(1, A3CTFPolicy, remote=True, n=10)
run_exp(1, A3CTorchPolicy, remote=True, n=10)
run_exp(4, A3CTFPolicy, remote=True, n=10)
run_exp(4, A3CTorchPolicy, remote=True, n=10)
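One workaround worth trying (an assumption, not something verified in this issue) is to export OMP_NUM_THREADS=1 in the driver before ray.init(), so that the locally spawned worker processes inherit it and PyTorch keeps its intra-op pool at a single thread:

import os

# Assumption: worker processes started by a local ray.init() inherit the
# driver's environment, and PyTorch sizes its intra-op thread pool from
# OMP_NUM_THREADS when it is set before torch is initialized in that process.
os.environ["OMP_NUM_THREADS"] = "1"

import ray
ray.init(ignore_reinit_error=True)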
Top GitHub Comments
I could reproduce a related issue with ray[rllib, tune]==1.12.0, running e.g. 2 remote workers plus the local worker with RLlib’s PPO, a custom torch policy, and launching via ray.tune(). Essentially, the parallel RolloutWorkers each tried to grab half of the available cores (i.e. around 9XX% CPU usage per remote worker in the Ray dashboard, with 20 virtual cores available). Increasing the number of workers worsened the issue and led to a roughly inversely proportional decrease of the sample timer; this caused a slowdown to ~sample_throughput: 20, compared to ~sample_throughput: 4000 after the fix. In short, more CPU resources were used than specified in ray.init(num_cpus=3).
Resolving this issue:
- torch.set_num_threads(1) in the main.py script, as recommended in https://github.com/ray-project/ray/blob/886cc4d6747f3ae56b4b0d77534a375494b6457b/doc/source/ray-core/troubleshooting.rst and in various examples.
- torch.set_num_threads(1) in the RolloutWorker, e.g. directly in the model, i.e. within nn.Module.__init__(self), to limit the thread count in both the RolloutWorkers and the local_worker (see the sketch at the end of this comment).
Notably, I could not reproduce the issue on my local device, nor on a cluster managed by LSF, but only on an HTCondor-managed cluster.
Proposal, either:
Maybe I have overlooked some part of the docs, but I am not sure whether there is a configurable fix for this, other than coding it into the worker somehow, like @zplizzi
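A minimal sketch of the second workaround above, calling torch.set_num_threads(1) from nn.Module.__init__ (placeholder names and sizes; a real RLlib custom model would subclass TorchModelV2 and be registered via ModelCatalog):

import torch
import torch.nn as nn

class SingleThreadedModel(nn.Module):
    """Sketch only: pins the intra-op thread count in whichever process
    builds the model (RolloutWorker or local worker)."""

    def __init__(self, obs_size=4, num_outputs=2):
        nn.Module.__init__(self)
        # Runs inside the process that constructs the model, so the limit
        # is applied in every RolloutWorker as well as the local worker.
        torch.set_num_threads(1)
        self.net = nn.Linear(obs_size, num_outputs)

    def forward(self, x):
        return self.net(x)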
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you’d still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray’s public slack channel.
Thanks again for opening the issue!