
version_diff keeps increasing

See original GitHub issue

Thanks for building this awesome library. I'm having trouble getting any of the examples to work, and it would be great if you had any suggestions as to what I could try.

Using the example command:

python -m sample_factory.algorithms.appo.train_appo --env=doom_basic --algo=APPO --train_for_env_steps=3000000 --num_workers=20 --num_envs_per_worker=20 --experiment=doom_basic

The policy lag seems to keep increasing linearly, which I assume is not expected? It looks like the model version isn't being updated.
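For context (this explanation is not from the issue), the "Policy #0 lag" printed in the log below is essentially the difference between the learner's current weight version and the version of the weights that collected the rollout. A minimal sketch with hypothetical names, just to show why an ever-growing lag suggests the workers never receive updated weights:

# Illustrative only: hypothetical names, not sample-factory's actual internals.
def policy_lag(learner_policy_version, rollout_policy_version):
    # The learner bumps its version each time it publishes new weights; each
    # rollout remembers the version of the weights that produced it.
    return learner_policy_version - rollout_policy_version

# If the policy workers never receive fresh weights, the rollout version stays
# at its initial value, so the reported lag grows roughly linearly with frames.
print(policy_lag(157, 0))  # 157, similar to the lag seen late in the log below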

[2022-03-02 23:05:46,460][18482] Fps is (10 sec: 20404.3, 60 sec: 20404.3, 300 sec: 20404.3). Total num frames: 241664. Throughput: 0: 3098.6. Samples: 54300. Policy #0 lag: (min: 52.0, avg: 52.0, max: 52.0)
[2022-03-02 23:05:46,460][18482] Avg episode reward: [(0, '-1.416')]
[2022-03-02 23:05:51,461][18482] Fps is (10 sec: 19999.7, 60 sec: 19883.6, 300 sec: 19883.6). Total num frames: 335872. Throughput: 0: 4499.1. Samples: 83850. Policy #0 lag: (min: 77.0, avg: 77.0, max: 77.0)
[2022-03-02 23:05:51,461][18482] Avg episode reward: [(0, '-1.509')]
[2022-03-02 23:05:56,480][18482] Fps is (10 sec: 19622.1, 60 sec: 20013.5, 300 sec: 20013.5). Total num frames: 438272. Throughput: 0: 4965.4. Samples: 113450. Policy #0 lag: (min: 104.0, avg: 104.0, max: 104.0)
[2022-03-02 23:05:56,480][18482] Avg episode reward: [(0, '-1.825')]
[2022-03-02 23:06:01,488][18482] Fps is (10 sec: 19606.6, 60 sec: 19772.8, 300 sec: 19772.8). Total num frames: 532480. Throughput: 0: 4454.0. Samples: 128060. Policy #0 lag: (min: 104.0, avg: 104.0, max: 104.0)
[2022-03-02 23:06:01,489][18482] Avg episode reward: [(0, '-1.599')]
[2022-03-02 23:06:06,514][18482] Fps is (10 sec: 19593.5, 60 sec: 19873.5, 300 sec: 19873.5). Total num frames: 634880. Throughput: 0: 5167.0. Samples: 157920. Policy #0 lag: (min: 131.0, avg: 131.0, max: 131.0)
[2022-03-02 23:06:06,514][18482] Avg episode reward: [(0, '-1.209')]
[2022-03-02 23:06:11,515][18482] Fps is (10 sec: 19609.2, 60 sec: 19726.1, 300 sec: 19726.1). Total num frames: 729088. Throughput: 0: 5162.6. Samples: 187380. Policy #0 lag: (min: 157.0, avg: 157.0, max: 157.0)
[Screenshot attached to the original issue, 2022-03-02]

Environment: Ubuntu 20.04 running in WSL 2 (maybe that's the problem), sample-factory==1.120.0, torch==1.7.1+cu110 (I have also tried 1.10).

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
alex-petrenko commented, Mar 4, 2022

Thanks a lot for looking into this. Having a Linux installation has always been my recommendation when it comes to reinforcement learning - your access to tools and libraries increases exponentially. It'd be nice to have SF working on Windows, but there are some major difficulties.

Maybe you can make it work with --device='cpu' - this will allow you to execute/debug some code, but it's not suitable for any sort of large-scale training. macOS is known to work too; some of my colleagues use it for research and development and then run experiments on clusters. On a Mac you obviously don't have GPU/CUDA support at all.
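For reference, and untested on this particular setup, the command from the issue with the CPU-only flag suggested above would look something like:

python -m sample_factory.algorithms.appo.train_appo --env=doom_basic --algo=APPO --train_for_env_steps=3000000 --num_workers=20 --num_envs_per_worker=20 --experiment=doom_basic --device=cpu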

1 reaction
ngoodger commented, Mar 3, 2022

Thanks for your help! I tried adding a log message and the version is always -1 in the policy_worker.
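Purely for illustration (hypothetical names, not necessarily sample-factory's actual attributes), the kind of debug message described above could be as simple as:

import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("policy_worker")

# Hypothetical stand-ins for the worker index and the version of the weights
# the policy worker most recently received from the learner.
worker_idx = 0
latest_policy_version = -1  # -1 would mean no weights have ever arrived

log.debug("policy worker %d sees policy version %d", worker_idx, latest_policy_version)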

So I started looking at the learner and noticed an error which I guess I missed before:

THCudaCheck FAIL file=/pytorch/torch/csrc/generic/StorageSharing.cpp line=247 error=801 : operation not supported
[2022-03-03 23:00:20,340][12222] Learner 0 initialized
Traceback (most recent call last):
  File "/home/ngoodger/anaconda3/envs/sample_factory/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
[2022-03-03 23:00:20,341][12181] Initializing policy workers...
  File "/home/ngoodger/anaconda3/envs/sample_factory/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/ngoodger/anaconda3/envs/sample_factory/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 240, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (801) : operation not supported at /pytorch/torch/csrc/generic/StorageSharing.cpp:247

It seems like this is just not supported on Windows, and I guess that applies to WSL as well. There might be some workarounds; otherwise I'll probably just need to install Ubuntu natively.
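As a sanity check that this is an environment limitation rather than anything sample-factory specific, here is a minimal standalone repro (my own sketch, not from the issue) of the same CUDA tensor-sharing path: sending a CUDA tensor through a torch.multiprocessing queue goes through the same storage._share_cuda_() call as in the traceback above, and on a setup without CUDA IPC support it fails with the same error 801.

import torch
import torch.multiprocessing as mp

def consumer(queue):
    # Receiving the tensor requires the CUDA storage to be shared across processes.
    tensor = queue.get()
    print("received tensor on", tensor.device)

if __name__ == "__main__":
    mp.set_start_method("spawn")
    queue = mp.Queue()
    proc = mp.Process(target=consumer, args=(queue,))
    proc.start()
    # The queue's feeder thread pickles the tensor with torch's reduce_tensor,
    # which calls storage._share_cuda_(); without CUDA IPC support this raises
    # "RuntimeError: cuda runtime error (801) : operation not supported".
    queue.put(torch.zeros(4, device="cuda"))
    proc.join(timeout=30)  # timeout so the script doesn't hang if sharing fails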

Read more comments on GitHub.
