
version_diff keeps increasing

See original GitHub issue

Thanks for building this awesome library. I'm having trouble getting any of the examples to work, and it would be great if you had any suggestions as to what I could try.

Using the example command:

python -m sample_factory.algorithms.appo.train_appo --env=doom_basic --algo=APPO --train_for_env_steps=3000000 --num_workers=20 --num_envs_per_worker=20 --experiment=doom_basic

The policy lag seems to keep increasing linearly, which I assume is not expected? It looks like the model version isn't being updated.
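For context (this explanation is not from the issue), the "Policy #0 lag" printed in the log below is essentially the difference between the learner's current weight version and the version of the weights that collected the rollout. A minimal sketch with hypothetical names, just to show why an ever-growing lag suggests the workers never receive updated weights:

# Illustrative only: hypothetical names, not sample-factory's actual internals.
def policy_lag(learner_policy_version, rollout_policy_version):
    # The learner bumps its version each time it publishes new weights; each
    # rollout remembers the version of the weights that produced it.
    return learner_policy_version - rollout_policy_version

# If the policy workers never receive fresh weights, the rollout version stays
# at its initial value, so the reported lag grows roughly linearly with frames.
print(policy_lag(157, 0))  # 157, similar to the lag seen late in the log below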

[2022-03-02 23:05:46,460][18482] Fps is (10 sec: 20404.3, 60 sec: 20404.3, 300 sec: 20404.3). Total num frames: 241664. Throughput: 0: 3098.6. Samples: 54300. Policy #0 lag: (min: 52.0, avg: 52.0, max: 52.0)
[2022-03-02 23:05:46,460][18482] Avg episode reward: [(0, '-1.416')]
[2022-03-02 23:05:51,461][18482] Fps is (10 sec: 19999.7, 60 sec: 19883.6, 300 sec: 19883.6). Total num frames: 335872. Throughput: 0: 4499.1. Samples: 83850. Policy #0 lag: (min: 77.0, avg: 77.0, max: 77.0)
[2022-03-02 23:05:51,461][18482] Avg episode reward: [(0, '-1.509')]
[2022-03-02 23:05:56,480][18482] Fps is (10 sec: 19622.1, 60 sec: 20013.5, 300 sec: 20013.5). Total num frames: 438272. Throughput: 0: 4965.4. Samples: 113450. Policy #0 lag: (min: 104.0, avg: 104.0, max: 104.0)
[2022-03-02 23:05:56,480][18482] Avg episode reward: [(0, '-1.825')]
[2022-03-02 23:06:01,488][18482] Fps is (10 sec: 19606.6, 60 sec: 19772.8, 300 sec: 19772.8). Total num frames: 532480. Throughput: 0: 4454.0. Samples: 128060. Policy #0 lag: (min: 104.0, avg: 104.0, max: 104.0)
[2022-03-02 23:06:01,489][18482] Avg episode reward: [(0, '-1.599')]
[2022-03-02 23:06:06,514][18482] Fps is (10 sec: 19593.5, 60 sec: 19873.5, 300 sec: 19873.5). Total num frames: 634880. Throughput: 0: 5167.0. Samples: 157920. Policy #0 lag: (min: 131.0, avg: 131.0, max: 131.0)
[2022-03-02 23:06:06,514][18482] Avg episode reward: [(0, '-1.209')]
[2022-03-02 23:06:11,515][18482] Fps is (10 sec: 19609.2, 60 sec: 19726.1, 300 sec: 19726.1). Total num frames: 729088. Throughput: 0: 5162.6. Samples: 187380. Policy #0 lag: (min: 157.0, avg: 157.0, max: 157.0)
[Screenshot attached to the original issue, 2022-03-02]

Environment: Ubuntu 20.04 running in WSL 2 (maybe that's the problem), sample-factory==1.120.0, torch==1.7.1+cu110 (I have also tried 1.10).

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
alex-petrenko commented, Mar 4, 2022

Thanks a lot for looking into this. Having a Linux installation has always been my recommendation when it comes to reinforcement learning - your access to tools and libraries increases exponentially. It'd be nice to have SF working on Windows, but there are some major difficulties.

Maybe you can make it work with --device='cpu' - this will allow you to execute/debug some code, but it's not suitable for any sort of large-scale training. macOS is known to work too; some of my colleagues use it for research and development and then run experiments on clusters. On a Mac you obviously don't have GPU/CUDA support at all.
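For reference, and untested on this particular setup, the command from the issue with the CPU-only flag suggested above would look something like:

python -m sample_factory.algorithms.appo.train_appo --env=doom_basic --algo=APPO --train_for_env_steps=3000000 --num_workers=20 --num_envs_per_worker=20 --experiment=doom_basic --device=cpu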

1 reaction
ngoodger commented, Mar 3, 2022

Thanks for your help! I tried adding a log message and the version is always -1 in the policy_worker.
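Purely for illustration (hypothetical names, not necessarily sample-factory's actual attributes), the kind of debug message described above could be as simple as:

import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("policy_worker")

# Hypothetical stand-ins for the worker index and the version of the weights
# the policy worker most recently received from the learner.
worker_idx = 0
latest_policy_version = -1  # -1 would mean no weights have ever arrived

log.debug("policy worker %d sees policy version %d", worker_idx, latest_policy_version)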

So I started looking at the learner and noticed an error which I guess I missed before:

THCudaCheck FAIL file=/pytorch/torch/csrc/generic/StorageSharing.cpp line=247 error=801 : operation not supported
[2022-03-03 23:00:20,340][12222] Learner 0 initialized
Traceback (most recent call last):
  File "/home/ngoodger/anaconda3/envs/sample_factory/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
[2022-03-03 23:00:20,341][12181] Initializing policy workers...
  File "/home/ngoodger/anaconda3/envs/sample_factory/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/ngoodger/anaconda3/envs/sample_factory/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 240, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (801) : operation not supported at /pytorch/torch/csrc/generic/StorageSharing.cpp:247

It seems like this is just not supported on Windows, and I guess that applies to WSL as well. There might be some workarounds; otherwise I'll probably just need to install Ubuntu natively.
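As a sanity check that this is an environment limitation rather than anything sample-factory specific, here is a minimal standalone repro (my own sketch, not from the issue) of the same CUDA tensor-sharing path: sending a CUDA tensor through a torch.multiprocessing queue goes through the same storage._share_cuda_() call as in the traceback above, and on a setup without CUDA IPC support it fails with the same error 801.

import torch
import torch.multiprocessing as mp

def consumer(queue):
    # Receiving the tensor requires the CUDA storage to be shared across processes.
    tensor = queue.get()
    print("received tensor on", tensor.device)

if __name__ == "__main__":
    mp.set_start_method("spawn")
    queue = mp.Queue()
    proc = mp.Process(target=consumer, args=(queue,))
    proc.start()
    # The queue's feeder thread pickles the tensor with torch's reduce_tensor,
    # which calls storage._share_cuda_(); without CUDA IPC support this raises
    # "RuntimeError: cuda runtime error (801) : operation not supported".
    queue.put(torch.zeros(4, device="cuda"))
    proc.join(timeout=30)  # timeout so the script doesn't hang if sharing fails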

Read more comments on GitHub.
