
Reproduction of 80K/sec throughput

See original GitHub issue

Hi, I tried to reproduce the 80K/sec throughput reported in the paper, but only got around 22K/sec.

I ran the single learner on a GPU machine (the GPU is an NVIDIA P40):

python experiment.py --job_name=learner --task=0 --num_actors=150 \
    --level_name=rooms_keys_doors_puzzle --batch_size=32 \
    --entropy_cost=0.0033391318945337044 \
    --learning_rate=0.00031866995608948655 \
    --total_environment_frames=10000000000 --reward_clipping=soft_asymmetric 

and ran 150 actors, each on its own CPU machine (each is actually a Docker container allocated remotely by a cloud service):

python experiment.py --job_name=actor --task=$i \
      --num_actors=150 --level_name=rooms_keys_doors_puzzle

where $i is the actor index (0-149).

Could you give some hints on how to reproduce the throughput? Did you rely on a proprietary intranet connection?

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 15

Top GitHub Comments

1 reaction
pengsun commented, Jul 31, 2018

Thanks!

How is the 2-3 GB/sec figure derived (e.g., batch_size * width * height * rollout_len * bytes-per-float, etc.)? I'm still reading the tf.FIFOQueue code (with capacity=1) and struggling to understand the sync mechanism. I guess answering this question may help me (and others) understand how the Actor code works 😃
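A back-of-envelope sketch of where a figure in that range could come from. The observation shape (96x72 RGB uint8) and the 80K frames/sec target are assumptions taken from the DMLab setup described in the paper, not numbers confirmed in this thread:

```python
# Rough per-second data volume flowing from actors to the learner,
# assuming DMLab-style 96x72 RGB uint8 observations (an assumption --
# actual sizes depend on the level and preprocessing).
frames_per_sec = 80_000          # target throughput from the paper
height, width, channels = 72, 96, 3
bytes_per_pixel = 1              # uint8 observations

obs_bytes_per_sec = frames_per_sec * height * width * channels * bytes_per_pixel
print(f"observations alone: {obs_bytes_per_sec / 1e9:.2f} GB/sec")
# Rewards, actions, behaviour logits and LSTM state add more on top,
# which is roughly consistent with a 2-3 GB/sec total.
```

The point of the exercise: raw observations alone already account for well over a gigabyte per second at the target frame rate, so the remaining headroom in "2-3 GB/sec" is plausibly the per-step metadata and protocol overhead.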

Also, I just asked around and found I am unable to access a P100; the best GPU at hand is only a P40… So please feel free to close the issue.

0 reactions
lespeholt commented, Sep 18, 2018

Yes, we used 1 CPU per actor. Can you try 150 actors with 1 CPU each?

It's a bit hard to interpret the timelines without interacting with them. Since dequeue_many is taking that much time on the learner, it looks like the learner is bottlenecked by the actors or by the bandwidth to them. I'm not sure why there is a gap between the actor steps. If they are waiting on enqueuing, that suggests a bottleneck in the learner or in the bandwidth; in that case the network would be the culprit.
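The capacity-1 queue behaviour discussed above can be mimicked with a plain Python `queue.Queue` (an analogy, not the actual tf.FIFOQueue implementation): the producer blocks on enqueue until the consumer has drained the previous item, which is why long dequeue times on the learner point at slow producers rather than a slow learner:

```python
import queue
import threading

# Analogy for tf.FIFOQueue(capacity=1): the producer (actor) blocks on
# put() while the queue is full; the consumer (learner) blocks on get()
# while it is empty. Neither side can run ahead of the other.
q = queue.Queue(maxsize=1)
events = []

def actor():
    for step in range(3):
        q.put(f"rollout-{step}")        # blocks until the learner dequeues
        events.append(f"enqueued {step}")

def learner():
    for _ in range(3):
        item = q.get()                   # blocks until an actor enqueues
        events.append(f"dequeued {item}")

t_a = threading.Thread(target=actor)
t_l = threading.Thread(target=learner)
t_a.start(); t_l.start()
t_a.join(); t_l.join()
print(events)
```

With capacity 1, enqueues and dequeues strictly alternate, so whichever side shows long blocking times in a profile is the one waiting on its counterpart.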

Can you try creating new variables for each actor, i.e. no sharing of variables? If that is significantly faster, the bottleneck is network bandwidth.
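To see why this experiment isolates network bandwidth, a rough estimate of the parameter traffic that shared variables generate. Every number below (model size, unroll length, per-actor step rate) is an illustrative assumption, not a value from the thread:

```python
# With shared variables, each actor pulls fresh parameters from the
# learner before every rollout. The traffic this generates can rival
# the observation traffic. All figures below are assumptions.
num_actors = 150
params = 1_600_000              # assumed model size (float32 parameters)
bytes_per_param = 4             # float32
rollout_len = 100               # assumed unroll length
steps_per_sec_per_actor = 150   # assumed per-actor environment FPS

pulls_per_sec = num_actors * steps_per_sec_per_actor / rollout_len
param_bytes_per_sec = pulls_per_sec * params * bytes_per_param
print(f"parameter traffic: {param_bytes_per_sec / 1e9:.2f} GB/sec")
# Per-actor (unshared) variables eliminate this traffic entirely, so a
# large speed-up from unsharing points at network bandwidth.
```

The diagnostic logic: unsharing variables removes only the parameter-broadcast traffic while leaving computation unchanged, so any large throughput difference must come from the network.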
