
[rllib] significant performance difference between Torch and TF policy


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.7.4
  • Python version: 3.6.4
  • Exact command to reproduce: test provided

Describe the problem

Running the same A3C policy evaluation gives dramatically different performance with the Torch policy vs. the TF policy. Below I provide a test script that measures the iteration time to generate a batch of samples with 1 and 4 remote workers (on the same machine), for both Torch and TF. Results are summarized here:

  • 1 worker, TF: 0.44 sec/batch
  • 4 workers, TF: 0.55 sec/batch
  • 1 worker, Torch: 0.62 sec/batch
  • 4 workers, Torch: 18.04 sec/batch

I suspect that the issue is due to thread contention in Torch: when I run the 1-worker example in TF, I only see one thread executing, but in Torch it saturates all the cores on my machine. I’ve tried torch.set_num_threads(1), but this doesn’t seem to get applied to the remote workers, and I’m not sure how to do that.

The test was performed on an AWS machine with 8 vCPUs (4 cores).
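
One possible workaround (a sketch only, not verified against Ray 0.7.4) is to set OMP_NUM_THREADS before ray.init() so that the locally spawned worker processes inherit the cap, since Torch sizes its intra-op thread pool from it:

import os

# Cap the OpenMP/intra-op thread pool before Ray spawns its local worker
# processes so they inherit the setting. OMP_NUM_THREADS is only one of the
# knobs Torch consults, so this is a best-effort cap, not a guarantee.
os.environ["OMP_NUM_THREADS"] = "1"

import ray
import torch

ray.init(ignore_reinit_error=True)

# The driver process can be capped explicitly as well.
torch.set_num_threads(1)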

Source code / logs

import time
import ray

import ray.rllib.evaluation
import gym
from ray.rllib.models.catalog import MODEL_DEFAULTS
from ray.rllib.agents.ppo.appo import DEFAULT_CONFIG
from ray.rllib.agents.a3c.a3c_torch_policy import A3CTorchPolicy
from ray.rllib.agents.a3c.a3c_tf_policy import A3CTFPolicy

##

def build_worker(policy, remote=True):
    # Copy the shared default dicts so the module-level defaults are not mutated.
    model_config = MODEL_DEFAULTS.copy()
    model_config["dim"] = 42
    model_config["framestack"] = True
    policy_config = DEFAULT_CONFIG.copy()

    if remote:
        constructor = ray.rllib.evaluation.RolloutWorker.as_remote().remote
    else:
        constructor = ray.rllib.evaluation.RolloutWorker

    worker = constructor(
        env_creator=lambda _: gym.make("PongNoFrameskip-v4"),
        policy=policy,
        model_config=model_config,
        policy_config=policy_config,
        batch_steps=128,
        num_envs=1)

    return worker

def sample(workers, remote):
    if remote:
        # Launch all sample() calls, then block until every batch has returned.
        samples = [worker.sample.remote() for worker in workers]
        samples = [ray.get(sample) for sample in samples]
    else:
        samples = [worker.sample() for worker in workers]
    return samples


def run_exp(num_workers, policy, remote, n):
    workers = [build_worker(policy=policy, remote=remote) for _ in range(num_workers)]
    # Warm up
    for _ in range(5):
        sample(workers, remote)

    # Run test
    start = time.time()
    for _ in range(n):
        sample(workers, remote)
    iter_time = (time.time() - start) / n
    print(f"Results: {num_workers}, {policy}, {remote}, {n}: {iter_time}")


ray.init(ignore_reinit_error=True)

run_exp(1, A3CTFPolicy, remote=True, n=10)
run_exp(1, A3CTorchPolicy, remote=True, n=10)

run_exp(4, A3CTFPolicy, remote=True, n=10)
run_exp(4, A3CTorchPolicy, remote=True, n=10)

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
michaelfeil commented, May 11, 2022

I could reproduce a related issue with ray[rllib, tune]==1.12.0.

Running e.g. 2 remote workers plus the local worker with RLlib’s PPO, using a custom torch policy and launching via ray.tune(): essentially, the RolloutWorkers each tried in parallel to grab half of the available cores (i.e. around 9XX% CPU usage per remote worker in the Ray dashboard, with 20 virtual cores available). Increasing the number of workers worsened the issue and led to a roughly inversely proportional drop in the sample timer. This caused a slowdown from ~sample_throughput: 4000 (after the fix) to ~sample_throughput: 20. In short, more CPU resources are used than specified in ray.init(num_cpus=3).
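
For context, a hedged reconstruction of roughly the setup described above; the exact config is not given in this comment, so the environment and values below are illustrative only:

import ray
from ray import tune

ray.init(num_cpus=3)

tune.run(
    "PPO",
    config={
        "env": "CartPole-v1",        # stand-in env; the original setup used a custom torch policy
        "framework": "torch",
        "num_workers": 2,            # 2 remote RolloutWorkers plus the local worker
        "num_cpus_per_worker": 1,    # scheduling hint to Ray, not a thread cap
    },
)

Note that num_cpus_per_worker is only a scheduling hint to Ray; on its own it does not keep Torch’s intra-op thread pool within that budget, which is why the over-subscription described above can still occur.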

Resolving this issue:

Notably, I could not reproduce the issue on my local device, nor on a cluster managed by LSF, but only on an HTCondor-managed cluster.

Proposal, either:

Maybe I have overlooked some part of the docs, but I am not sure whether there is a configurable fix for this, other than coding it into the worker somehow, like @zplizzi.
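
A minimal sketch of the kind of per-worker workaround hinted at here, assuming the cap is coded into something that runs inside each RolloutWorker process (the env name and creator below are illustrative, not from this thread):

import gym
import torch
from ray import tune

def env_creator(env_config):
    # The env creator executes inside each rollout worker process,
    # so the thread cap is applied per worker.
    torch.set_num_threads(1)
    return gym.make("PongNoFrameskip-v4")

# Register under an illustrative name and point the algorithm config's "env" at it.
tune.register_env("pong_single_thread", env_creator)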

0 reactions
stale[bot] commented, Nov 28, 2020

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you’d still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray’s public Slack channel.

Thanks again for opening the issue!

