Reproduction of 80K/sec throughput
Hi, I tried to reproduce the 80K/sec throughput reported in the paper, but only got around 22K/sec.
I ran the single learner on a GPU machine (the GPU is P40):
python experiment.py --job_name=learner --task=0 --num_actors=150 \
--level_name=rooms_keys_doors_puzzle --batch_size=32 \
--entropy_cost=0.0033391318945337044 \
--learning_rate=0.00031866995608948655 \
--total_environment_frames=10000000000 --reward_clipping=soft_asymmetric
and ran 150 actors, each on its own CPU machine (each one is actually a remote Docker machine allocated by a cloud service):
python experiment.py --job_name=actor --task=$i \
--num_actors=150 --level_name=rooms_keys_doors_puzzle
where $i denotes the index of the i-th actor.
Could you give some hints on how to reproduce the throughput? Did you need a proprietary intranet connection?
Issue Analytics
- State:
- Created 5 years ago
- Reactions:1
- Comments:15
Thanks!
How is the 2-3 GB/sec figure derived (e.g., batch_size * width * height * rollout_len * BytesOfFloat, etc.)? I’m still reading the tf.FIFOQueue code (with capacity=1) and struggling to understand the sync mechanism. I guess answering this question may help me (and others) understand how the Actor code works 😃
Also, I just asked around and found that I can't get access to a P100; the best GPU I have on hand is a P40. So please feel free to close the issue.
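For anyone who wants to try the arithmetic in that question, here is a rough sketch of an observation-only estimate. The thread does not confirm that this is how the 2-3 GB/sec figure was obtained. The function names are hypothetical, and the frame size (72x96 RGB, one byte per value), the action repeat of 4, and the unroll length of 100 are assumptions; only batch_size=32 and the 80K frames/sec target come from the post above. Parameter fetches by the actors and any other per-step tensors are left out.

```python
def bytes_per_observation(height, width, channels, bytes_per_value):
    """Size of one observation frame in bytes."""
    return height * width * channels * bytes_per_value


def observation_bandwidth(env_frames_per_sec, obs_bytes, num_action_repeats=1):
    """Bytes/sec of observations arriving at the learner.

    With action repeats, only one frame per agent step is sent,
    so the environment frame rate is divided by the repeat count.
    """
    agent_steps_per_sec = env_frames_per_sec / num_action_repeats
    return agent_steps_per_sec * obs_bytes


def bytes_per_batch(batch_size, unroll_length, obs_bytes):
    """Observation bytes consumed by one learner step."""
    return batch_size * unroll_length * obs_bytes


# Assumed values (not from the thread): 72x96 RGB uint8 frames,
# 4 action repeats, unroll length 100.
obs = bytes_per_observation(height=72, width=96, channels=3, bytes_per_value=1)
rate = observation_bandwidth(env_frames_per_sec=80_000, obs_bytes=obs,
                             num_action_repeats=4)
batch = bytes_per_batch(batch_size=32, unroll_length=100, obs_bytes=obs)

print(f"observation traffic: {rate / 1e9:.2f} GB/sec")
print(f"observations per learner batch: {batch / 1e6:.1f} MB")
```

Swapping bytes_per_value to 4 (float32 instead of uint8) multiplies the estimate by four, which shows how sensitive the figure is to the assumed encoding.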
Yes, we used 1 CPU per actor. Can you try 150 actors with 1 CPU each?
It’s a bit hard to interpret the timelines without interacting with them. Since dequeuemany is taking that much time on the learner, it looks like the learner is bottlenecked by the actors or the bandwidth to them. Not sure why there is a gap between the actor steps. If the actors wait on enqueuing, then it suggests a bottleneck in the learner or the bandwidth. In this case it would then be the network.
Can you try creating new variables for each actor, i.e. no sharing of variables? If that is significantly faster, it’s network bandwidth.
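The enqueue/dequeue reasoning above can be illustrated with a toy steady-state model. This is only a sketch, not the actual tf.FIFOQueue behaviour: the function name and the per-unroll timings are hypothetical, and the network is folded into whichever side it slows down.

```python
def diagnose(num_actors, actor_seconds_per_unroll, learner_seconds_per_unroll):
    """Compare how fast unrolls can be produced and consumed.

    Returns (throughput in unrolls/sec, which queue operation ends up waiting).
    """
    # Unrolls/sec that all actors together can push into the queue.
    supply = num_actors / actor_seconds_per_unroll
    # Unrolls/sec that the learner can pull out of the queue.
    demand = 1.0 / learner_seconds_per_unroll
    throughput = min(supply, demand)
    if supply < demand:
        # Queue is usually empty: the learner stalls in dequeue,
        # so the actors (or the link from them) are the bottleneck.
        return throughput, "learner waits on dequeue"
    if supply > demand:
        # Queue is usually full: actors stall in enqueue,
        # so the learner (or the link to it) is the bottleneck.
        return throughput, "actors wait on enqueue"
    return throughput, "balanced"


# Hypothetical timings, chosen only to show both regimes.
print(diagnose(num_actors=1, actor_seconds_per_unroll=1.0,
               learner_seconds_per_unroll=0.5))
print(diagnose(num_actors=4, actor_seconds_per_unroll=1.0,
               learner_seconds_per_unroll=0.5))
```

In this model, adding actors helps only in the first regime; in the second, throughput stays pinned at the consumer's rate no matter how many actors are added.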